New Release: Ultralytics v8.4.37

Ultralytics v8.4.37 is out! :rocket:

We’re excited to release Ultralytics v8.4.37, a quality- and workflow-focused update that improves tuning, training robustness, evaluation reliability, and documentation clarity. The standout change is NDJSON-based hyperparameter tuning for multi-dataset workflows, alongside new support for handling class imbalance during training. :raising_hands:

If you’re training, tuning, or deploying Ultralytics YOLO, this release should make your workflow smoother and more reliable.

Highlights

:warning: WARNING
The mAP calculation has been revised in this release. Reported mAP may be slightly lower than in previous Ultralytics versions, but now more closely matches pycocotools’ COCOEval.
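
If you compare native validation metrics across runs, it can help to record which side of this change each run was produced on. Here is a minimal sketch, assuming you log the Ultralytics version string alongside each run. The helper names are hypothetical and not part of the Ultralytics API:

```python
# Hypothetical helpers (not part of Ultralytics) for tagging runs by mAP regime.
REVISED_MAP_VERSION = (8, 4, 37)  # first release with the revised native mAP, per this announcement


def parse_version(version: str) -> tuple:
    """Turn a plain 'X.Y.Z' string into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split(".")[:3])


def uses_revised_map(version: str) -> bool:
    """True if the run was produced with the revised native mAP calculation."""
    return parse_version(version) >= REVISED_MAP_VERSION


def comparable(version_a: str, version_b: str) -> bool:
    """Native mAP values are only directly comparable within the same regime."""
    return uses_revised_map(version_a) == uses_revised_map(version_b)


print(comparable("8.4.36", "8.4.37"))  # False: re-baseline before comparing
print(comparable("8.4.35", "8.4.36"))  # True
```

The sketch only handles plain numeric version strings. Anything with a suffix would need a proper version parser.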

:brain: Better hyperparameter tuning for multi-dataset workflows

A major upgrade in PR #24179 by @Laughing-q moves tuning logs from CSV to tune_results.ndjson, making experiment tracking more flexible and robust.

This includes:

  • Per-dataset fitness tracking for multi-dataset tuning
  • Updated output/plot naming such as tune_fitness.png
  • Improved MongoDB sync behavior aligned with local NDJSON logs

For teams running larger tuning experiments across multiple datasets, this is the biggest improvement in v8.4.37. :chart_increasing:
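
NDJSON stores one JSON object per line, so a tuning log in this format can be read with the standard library alone. The sketch below shows the general pattern. The field names `iteration` and `fitness` are assumptions for illustration and may not match the keys written to your own tune_results.ndjson:

```python
import json


def read_ndjson(text: str) -> list:
    """Parse NDJSON: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def best_record(records: list, key: str = "fitness") -> dict:
    """Return the record with the highest value for `key` (assumed field name)."""
    return max(records, key=lambda r: r[key])


# Made-up sample records; the real schema may differ.
sample = '{"iteration": 1, "fitness": 0.41}\n{"iteration": 2, "fitness": 0.47}\n'
records = read_ndjson(sample)
print(best_record(records)["iteration"])  # 2
```

For a real run, you would pass the text of the NDJSON file from your tuning output directory instead of the sample string.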

:balance_scale: New class imbalance support in training

With PR #23565 by @ahmet-f-gumustas, detection training now supports a new hyperparameter: cls_pw.

This allows you to give more weight to underrepresented classes during training:

  • New hyperparameter: cls_pw
  • Default value: 0.0
  • Existing behavior remains unchanged unless you enable it

This is especially useful for long-tail datasets where rare classes need more learning emphasis. :bullseye:
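
To illustrate the general idea of emphasizing rare classes, here is a minimal sketch of inverse-frequency weighting with a power exponent. This is one common approach and an assumption about how a power-style weight could behave. It is not the implementation from PR #23565, and the function name and normalization are hypothetical. In this sketch an exponent of 0.0 gives uniform weights, which matches the idea of a default that leaves behavior unchanged:

```python
def class_weights(counts: list, power: float) -> list:
    """Inverse-frequency weights raised to `power`, normalized to mean 1.0.

    Assumes every count is positive.
    """
    total = sum(counts)
    raw = [(total / c) ** power for c in counts]  # rarer class -> larger raw weight
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


counts = [900, 90, 10]  # made-up instance counts for three classes
print(class_weights(counts, 0.0))  # [1.0, 1.0, 1.0]: no reweighting
print(class_weights(counts, 0.5))  # rare classes receive larger weights
```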

:shield: More reliable training and checkpointing

Training robustness got a nice boost in this release:

These changes help reduce avoidable interruptions and make recovery safer in edge cases. :repeat_button:

Improvements

:white_check_mark: More robust evaluation and CI behavior

A precision edge case in AP computation was fixed in PR #24175 by @Laughing-q, improving compute_ap reliability.
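
For context on what an AP computation involves, here is a minimal sketch of COCO-style 101-point interpolated AP in plain Python. It illustrates the general technique only. It is not the compute_ap code from Ultralytics or the exact pycocotools implementation, and the function name is hypothetical:

```python
from bisect import bisect_left


def ap_101_point(recall: list, precision: list) -> float:
    """Average precision sampled at 101 recall thresholds (0.00, 0.01, ..., 1.00).

    `recall` must be sorted ascending, with `precision` aligned to it.
    """
    # Precision envelope: make precision non-increasing as recall grows.
    envelope = list(precision)
    for i in range(len(envelope) - 2, -1, -1):
        envelope[i] = max(envelope[i], envelope[i + 1])

    total = 0.0
    for step in range(101):
        threshold = step / 100
        idx = bisect_left(recall, threshold)  # first point with recall >= threshold
        total += envelope[idx] if idx < len(recall) else 0.0  # unreachable recall scores 0
    return total / 101


print(ap_101_point([0.5, 1.0], [1.0, 1.0]))  # 1.0
print(ap_101_point([0.5], [1.0]))            # 51/101, about 0.505
```

Small differences in how the curve is sampled or integrated shift the resulting AP, which is why two correct-looking implementations can report slightly different mAP for the same predictions.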

Additional CI and benchmark improvements include:

:broom: Cleaner distributed training logs

PR #24177 by @Laughing-q reduces duplicate model info printing in DDP and multi-process training, making logs easier to read and debug.

Docs and Platform updates

This release also improves the docs and overall UX:

Release tag note

The release-tag PR itself, PR #24192 by @glenn-jocher, is a version bump from 8.4.36 to 8.4.37. The runtime and workflow improvements come from the PRs above. :package:

New contributor shoutout :tada:

A warm welcome to @easyrider11, who made their first contribution in PR #24182! Thank you for helping improve the YOLO ecosystem.

Why this release matters

v8.4.37 is especially helpful if you:

  • Run hyperparameter tuning across multiple datasets
  • Train on imbalanced datasets with rare classes
  • Need safer checkpointing during unstable early epochs
  • Want cleaner DDP logs and fewer flaky CI/benchmark issues
  • Prefer clearer docs and more accurate examples

Try it out

Update with:

pip install -U ultralytics

Then explore the full details in the v8.4.37 release page or browse the full changelog.

If you test the release, we’d love to hear how it works for your training and tuning workflows. Feedback, bug reports, and PRs are always welcome! :speech_balloon:

Hi,

I wanted to flag a significant side effect of PR #24175 (“Fix AP calculation precision in compute_ap”) introduced in v8.4.37.

After upgrading from 8.4.36 to 8.4.37, I observed a consistent drop of about 0.018 in mAP50-95 (e.g. 0.44719 → 0.42922) and about 0.015 in mAP50 across all my validation runs, with absolutely no changes to the model, training code, dataset, or weights.

I was able to confirm this is purely a metrics computation change by running the exact same PASS2 fine-tuning configuration on the same checkpoint under both versions:

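| Ultralytics version | train/box_loss | val/box_loss | Precision | Recall  | mAP50   | mAP50-95 |
|---------------------|----------------|--------------|-----------|---------|---------|----------|
| 8.4.36              | 0.86178        | 1.13583      | 0.74674   | 0.63764 | 0.67372 | 0.44719  |
| 8.4.37              | 0.86178        | 1.13583      | 0.74674   | 0.63764 | 0.65841 | 0.42922  |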

All losses, precision, and recall are byte-for-byte identical — only mAP values differ.

While I understand the intent was to fix an edge case in AP calculation, this change has a few consequences worth noting:

  1. All previously published benchmarks (YOLO11, YOLO12, YOLO26 model cards) were computed under the old metric — they are no longer reproducible under 8.4.37+

  2. The change is silent — users comparing results across versions may mistakenly attribute the drop to a model or training regression

  3. A −0.018 shift on mAP50-95 is substantial — it’s larger than many hyperparameter tuning gains that researchers spend days chasing

I believe a change of this magnitude in a core evaluation metric deserves explicit mention in the release notes as a breaking change, so users are aware and can re-baseline their experiments accordingly.

Thank you for the great work on Ultralytics — just wanted to make sure this doesn’t catch others off guard as it did for me.

@Livlo1970 Thanks for flagging.

For the official models, Ultralytics uses COCOEval to get the mAP scores. COCOEval is the standard and what’s recommended for research for consistency with other works. Ultralytics’ native mAP calculation is faster compared to COCOEval, but it doesn’t produce exactly the same mAP as COCOEval. This PR in particular tries to bring the Ultralytics mAP closer to what’s obtained by COCOEval.

We will update the release notes to make it explicit.

Thank you for the quick and clear response.

That’s very helpful to know that the official benchmarks use COCOEval and are therefore unaffected. It makes sense to align the native calculation with the standard — it’s a good change in the long run.

I appreciate that you’ll update the release notes. That will definitely help users who, like me, rely on the native metrics for iterative comparisons during training.

Thanks again for the transparency and the great work on Ultralytics.

Absolutely, and thanks again for surfacing it early.

We’ve now made the v8.4.37 note explicit on the release page so shifts in native mAP50 and mAP50-95 between 8.4.36 and 8.4.37+ are less likely to be mistaken for a real Ultralytics YOLO regression.

Your takeaway is exactly right: re-baselining is the right move for native metric comparisons, and for strict cross-version comparability COCOEval remains the best reference. Appreciate the careful validation here.