The study’s findings, based on the ODverse33 benchmark (introduced by this publication), challenge the assumption that newer YOLO versions are always superior. The authors observe fluctuations in performance across domain-specific applications, with some older versions outperforming newer ones in certain scenarios.
The ODverse33 benchmark contains 33 datasets across 11 distinct domains: Autonomous Driving, Agricultural, Underwater, Medical, Videogame, Industrial, Aerial, Wildlife, Retail, Microscopic, and Security. The aim is to provide a benchmark that goes beyond the widely used COCO dataset and gives practitioners insight for selecting a model based on its performance on data more representative of their use case.
The study aggregated performance across all datasets using the standard mAP_{50} and mAP_{50:95} metrics, along with mAP broken down by small (mAP_{sm}), medium (mAP_{md}), and large (mAP_{lg}) object sizes.
Metric | YOLOv5 | YOLOv6 | YOLOv7 | YOLOv8 | YOLOv9 | YOLOv10 | YOLOv11 |
---|---|---|---|---|---|---|---|
mAP_{50} | 0.7991 | 0.7799 | 0.7969 | 0.7954 | 0.8053 | 0.7866 | 0.8072 |
mAP_{50:95} | 0.5904 | 0.5592 | 0.5766 | 0.5881 | 0.5853 | 0.5828 | 0.5983 |
mAP_{sm} | 0.3684 | 0.3112 | 0.3560 | 0.3689 | 0.3814 | 0.3555 | 0.3794 |
mAP_{md} | 0.5512 | 0.5007 | 0.5461 | 0.5459 | 0.5568 | 0.5427 | 0.5588 |
mAP_{lg} | 0.6708 | 0.6273 | 0.6687 | 0.6735 | 0.6770 | 0.6681 | 0.6769 |
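As a rough illustration of what the metrics in the table measure, the sketch below (plain Python, not the benchmark's actual evaluation code) shows the IoU computation used to match predictions to ground truth, the ten IoU thresholds averaged in mAP_{50:95}, and the COCO-style area bins that conventionally define small, medium, and large objects. Note the size-bin cutoffs follow the COCO convention, which ODverse33 may or may not adopt verbatim.

```python
# Sketch of the quantities behind the table's metrics.
# Assumptions: boxes are (x1, y1, x2, y2) in pixels; size bins follow the
# COCO convention (small < 32^2 px, medium < 96^2 px, large otherwise).

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

# mAP_{50} uses the single IoU threshold 0.50; mAP_{50:95} averages AP
# over these ten thresholds (0.50, 0.55, ..., 0.95).
THRESHOLDS = [0.50 + 0.05 * i for i in range(10)]

def size_bin(box):
    """COCO-style size category from box area."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"
```

For example, two 10x10 boxes offset by half their width overlap on 50 of 150 union pixels, giving an IoU of 1/3, which would count as a match for mAP_{50} only at the loosest thresholds.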
Notably, the authors also drew the following conclusion:
> Overall, the ODverse33 benchmark reveals the fluctuation of model performance across different YOLO versions and professional domains, emphasizing that the newer YOLO versions are not always guaranteed to outperform their predecessors. The fluctuation in performance highlights that, despite advancements in model architecture and training strategies, improvements may not always translate into better results across all domains. This observation challenges the common assumption that the latest versions are universally superior and suggests that careful evaluation across diverse contexts is essential.
The comparison between different YOLO versions also underscores the influence of the development teams behind each model. Notably, models released by the same team often exhibit a consistent trajectory of improvement. For example, YOLOv5, YOLOv8, and YOLOv11, all developed by Ultralytics, showcase a clear and steady advancement in performance, reflecting the team’s strong focus on refining and optimizing their models.