Hi, I’m a researcher using YOLOv11 for a detection task in my paper.
We are essentially comparing human vs. AI performance. The COCO-style metrics output from Ultralytics checks all the boxes for us. The AI part of the research is done.
However, the COCO metrics for the humans were calculated manually, without any framework. We are considering using Ultralytics to analyze the human annotations as well and auto-generate the graphs, so the plots look visually consistent and it is easier to combine both sets of data in the same graph.
So, I have two questions:
Is it possible to use JUST the evaluation pipeline to get the graphs if I already have the “predictions” (made by humans)?
How can I get the Ultralytics metric graphs in .csv format so I can plot both the human and machine COCO metrics in the same graph? (Note: I’m not talking about the training results.csv; I need the underlying points of the PR_curve, for example.)
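In case it helps to clarify what I’m after, this is roughly what I’d like to avoid hand-rolling. The curves_results / curves attribute names below are my guess from reading the metrics code and may differ between Ultralytics versions, and the data.yaml path is a placeholder:

import csv
import numpy as np
from ultralytics import YOLO

model = YOLO("runs/detect/train_yolo11m/weights/best.pt")
metrics = model.val(data="dataset/data.yaml", split="test")

# Write each metric curve (e.g. Precision-Recall) to its own CSV file
for (x, y, xlabel, ylabel), name in zip(metrics.curves_results, metrics.curves):
    y = np.asarray(y)
    if y.ndim > 1:
        y = y.mean(0)  # average the per-class curves
    with open(f"{name}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([xlabel, ylabel])
        writer.writerows(zip(np.asarray(x).tolist(), y.tolist()))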
Thanks a lot for the response, I think I now understand how to use just the evaluation pipeline with pre-generated inferences.
However, your comment about the mAP fix sent me down a rabbit hole, and I don’t really understand why I’m getting significantly different mAP values from different libraries. In the end it doesn’t really matter why; I just want to use the correct one.
So as a sanity check I calculated mAP using pycocotools, like this:

# COCO eval sanity check
import os
import pickle
import json

from ultralytics import YOLO
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Paths to dataset (ensure test annotations are in COCO format)
coco_gt_path = "dataset/test+valid_COCO_format.json"
image_dir = "dataset/images/test+valid"  # Path to test images

# Load COCO ground truth
coco_gt = COCO(coco_gt_path)

# List to store predictions in COCO format
coco_predictions = []

if not os.path.exists("inferences/results.pkl"):
    # Load trained YOLO model and run inference on test images
    model = YOLO("runs/detect/train_yolo11m/weights/best.pt").to("cuda")
    results = model.predict(source=image_dir, conf=0.5, save=False)
    pickle.dump(results, open("inferences/results.pkl", "wb"))
else:
    results = pickle.load(open("inferences/results.pkl", "rb"))

# Convert YOLO predictions to COCO format
image_id_map = {img["file_name"]: img["id"] for img in coco_gt.dataset["images"]}  # Map filenames to COCO image IDs

for result in results:
    image_name = result.path.split("/")[-1]  # Extract filename
    image_id = image_id_map.get(image_name, -1)  # Get COCO image ID
    if image_id == -1:
        continue  # Skip if image is not in ground truth

    for box, score, cls in zip(result.boxes.xyxy, result.boxes.conf, result.boxes.cls):
        x_min, y_min, x_max, y_max = box.cpu().numpy()
        width = x_max - x_min
        height = y_max - y_min
        coco_predictions.append({
            "image_id": image_id,
            "category_id": int(cls.item()),  # COCO class ID (should match dataset)
            "bbox": [float(x_min), float(y_min), float(width), float(height)],  # Convert to COCO xywh format
            "score": float(score.item()),  # Confidence score
        })

# Save predictions to JSON
coco_pred_path = "inferences/coco_predictions.json"
with open(coco_pred_path, "w") as f:
    json.dump(coco_predictions, f)

# Load predictions into COCOeval
coco_pred = coco_gt.loadRes(coco_predictions)

# Initialize and run COCO evaluation
coco_eval = COCOeval(coco_gt, coco_pred, "bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
The results show a BIG discrepancy between the Ultralytics and COCOeval calculations. Which one is more trustworthy?
Edit: I just found the post that suggested changing the hardcoded “interp” method to “continuous”; unfortunately, changing it didn’t make a difference.
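For anyone else who lands here, this is my understanding of the two integration schemes, written as a minimal sketch of standard AP computation from a precision/recall curve (my own sketch, not the actual Ultralytics code):

import numpy as np

def average_precision(recall, precision, method="interp"):
    # Add sentinel values and make the precision envelope monotonically decreasing
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.flip(np.maximum.accumulate(np.flip(p)))

    if method == "interp":
        # 101-point interpolation of the envelope (COCO-style)
        x = np.linspace(0, 1, 101)
        return np.trapz(np.interp(x, r, p), x)

    # "continuous": sum the exact area wherever recall changes
    i = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[i + 1] - r[i]) * p[i + 1])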
You can save the predictions as JSON using the save_json argument in model.val(). You don’t need to write custom conversion code.
Also, predictions shouldn’t have a confidence threshold applied for the mAP calculation, otherwise you break the mAP calculation. That’s why the conf threshold in model.val() is set to 0.001.
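For example, something along these lines (the data.yaml path is a placeholder for your dataset config):

from ultralytics import YOLO

model = YOLO("runs/detect/train_yolo11m/weights/best.pt")
# save_json=True writes COCO-format predictions (predictions.json) into the run's save directory.
# The default conf for validation is already 0.001, so don't raise it when you want mAP.
metrics = model.val(data="dataset/data.yaml", split="test", save_json=True)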
You should be using the save_json functionality. Also, COCO JSON category indices are ahead by 1: category IDs in a COCO ground-truth JSON are typically 1-based, while YOLO class indices are 0-based, so the int(cls.item()) in your conversion is off by one.
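If you do keep the manual conversion from your script, a safer fix than a blanket +1 is to build the mapping from the ground-truth JSON itself. A minimal sketch, assuming the category order in your GT JSON matches the model’s class order:

from pycocotools.coco import COCO

coco_gt = COCO("dataset/test+valid_COCO_format.json")

# Map 0-based YOLO class indices to the category IDs actually used in the GT JSON
sorted_cat_ids = sorted(coco_gt.getCatIds())
yolo_to_coco = {i: cat_id for i, cat_id in enumerate(sorted_cat_ids)}

# Then, in the conversion loop:
#   "category_id": yolo_to_coco[int(cls.item())],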