How to use just the evaluation pipeline on predictions made by another model?

Hi, I’m a researcher using YOLOv11 for a detection task in my paper.

We are essentially comparing human vs. AI performance. The COCO-style metrics output from Ultralytics checks all the boxes for us. The AI part of the research is done.

However, the COCO metrics for the humans were calculated manually, without any framework. We are considering using Ultralytics for the human annotations as well, both to auto-generate visually similar graphs and to make it easier to combine both sets of data in the same graph.

So, I have two questions:

  1. Is it possible to use JUST the evaluation pipeline to get the graphs if I already have the “predictions” (made by humans)?

  2. How can I get the Ultralytics metric graphs in .csv format so I can plot both the human and machine COCO metrics in the same graph? (Note: I’m not talking about the training “results.csv”; I need the points for the PR_curve, for example.)

  1. You can use this.
  2. You can access the curve values in the results object that’s returned after validation in Python:
>>> results = model.val(data="coco8.yaml")
>>> results.curves
['Precision-Recall(B)', 'F1-Confidence(B)', 'Precision-Confidence(B)', 'Recall-Confidence(B)']
>>> results.curves_results[0][0].shape  # x-axis points
(1000,)
>>> results.curves_results[0][1].shape  # y-axis points
(6, 1000)
>>> results.curves_results[0][2]  # x-axis label
'Recall'
>>> results.curves_results[0][3]  # y-axis label 
'Precision'
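
If you need those points as a CSV, you can dump them yourself; here is a minimal sketch using pandas (the output filename and column naming are just illustrative):

# Sketch: export the Precision-Recall curve points to CSV so the human and
# model curves can be plotted together. Column names are illustrative.
import pandas as pd

x, y, xlabel, ylabel = results.curves_results[0]  # Precision-Recall(B)
df = pd.DataFrame({xlabel: x})                    # recall values, shape (1000,)
for i, row in enumerate(y):                       # one precision curve per class
    df[f"{ylabel}_class_{i}"] = row
df.to_csv("pr_curve.csv", index=False)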

You should also apply the fix here to correct the mAP calculation in Ultralytics.


Thanks a lot for the response. I think I now understand how to use just the evaluation pipeline with pre-generated inferences.

However, your comment about the mAP fix sent me down a rabbit hole, and I don’t really understand why I’m getting significantly different mAP values with different libraries. In the end it doesn’t really matter why; I just want to use the correct one.

So as a sanity check I calculated mAP using pycocotools, like this

# COCO eval sanity check
import os
import pickle
import json
from ultralytics import YOLO
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval


# Path to dataset (ensure test annotations are in COCO format)
coco_gt_path = "dataset/test+valid_COCO_format.json"
image_dir = "dataset/images/test+valid"  # Path to test images

# Load COCO ground truth
coco_gt = COCO(coco_gt_path)

# List to store predictions in COCO format
coco_predictions = []

if not os.path.exists("inferences/results.pkl"):
    # Run inference on the test images with the trained YOLO model
    model = YOLO("runs/detect/train_yolo11m/weights/best.pt").to("cuda")
    results = model.predict(source=image_dir, conf=0.5, save=False)
    with open("inferences/results.pkl", "wb") as f:
        pickle.dump(results, f)
else:
    with open("inferences/results.pkl", "rb") as f:
        results = pickle.load(f)

# Convert YOLO predictions to COCO format
image_id_map = {img['file_name']: img['id'] for img in coco_gt.dataset["images"]}  # Map filenames to COCO image IDs

for result in results:
    image_name = os.path.basename(result.path)  # Extract filename
    image_id = image_id_map.get(image_name, -1)  # Get COCO image ID

    if image_id == -1:
        continue  # Skip if image is not in ground truth


    for box, score, cls in zip(result.boxes.xyxy, result.boxes.conf, result.boxes.cls):
        x_min, y_min, x_max, y_max = box.cpu().numpy()
        width = x_max - x_min
        height = y_max - y_min

        coco_predictions.append({
            "image_id": image_id,
            "category_id": int(cls.item()),  # COCO class ID (should match the dataset's category IDs)
            "bbox": [float(x_min), float(y_min), float(width), float(height)],  # COCO xywh format, plain floats for JSON
            "score": float(score.item())  # Confidence score
        })

# Save predictions to JSON
coco_pred_path = "inferences/coco_predictions.json"


with open(coco_pred_path, "w") as f:
    json.dump(coco_predictions, f)  # dump the list itself, not its string representation

# Load predictions into COCOEval
coco_pred = coco_gt.loadRes(coco_predictions)

# Initialize COCO evaluation
coco_eval = COCOeval(coco_gt, coco_pred, "bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()

The results show a BIG discrepancy between the Ultralytics and COCOeval calculations. Which one is more trustworthy?

Edit: I just found the post suggesting to change the hardcoded “interp” to “continuous”; unfortunately, changing it didn’t make a difference.

You can save the predictions as JSON using the save_json argument in model.val(). You don’t need to write custom code.
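
For example (a minimal sketch; the dataset YAML name is a placeholder for your own config):

# Sketch: let val() export COCO-format predictions directly via save_json.
# "your_dataset.yaml" is a placeholder for your dataset config.
from ultralytics import YOLO

model = YOLO("runs/detect/train_yolo11m/weights/best.pt")
metrics = model.val(data="your_dataset.yaml", save_json=True)
# A predictions.json file is written to the validation run's save directory.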

Also, predictions shouldn’t have a confidence threshold applied for the mAP calculation, otherwise you break it. That’s why the conf threshold in model.val() defaults to 0.001.

You should be using the save_json functionality. Also, the category indices in the COCO JSON are ahead by 1 (COCO category IDs typically start at 1, while YOLO class indices start at 0).
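
If you do keep the custom conversion above, here is a minimal sketch of that adjustment, assuming your ground-truth JSON uses 1-indexed category IDs:

# Assumption: the ground-truth COCO JSON uses 1-indexed category IDs, while
# YOLO class indices are 0-indexed, so shift by one in the conversion loop.
coco_predictions.append({
    "image_id": image_id,
    "category_id": int(cls.item()) + 1,  # align 0-indexed YOLO class with 1-indexed COCO category
    "bbox": [float(x_min), float(y_min), float(width), float(height)],
    "score": float(score.item())
})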
