New Release: Ultralytics v8.3.218

Ultralytics v8.3.218 :rocket: - True multi-GPU validation, contiguous sampler, and accurate cross-GPU metrics

Ultralytics v8.3.218 delivers more reliable, faster multi-GPU training. This release enables true multi-GPU validation during training, with correct cross-GPU metric aggregation and a new contiguous distributed sampler for stable evaluation. Upgrade with pip install -U ultralytics and enjoy a smoother DDP experience.

:glowing_star: Summary

  • Multi-GPU validation now runs correctly on all ranks with proper aggregation.
  • New ContiguousDistributedSampler preserves sample order and batch alignment across GPUs.
  • Cleaner trainer flow and synchronized EMA for consistent metrics.

:new_button: New Features

  • Multi-GPU validation during training
    • Validation DataLoader and Validator are created on all ranks for proper DDP execution.
    • Rank-aware device selection ensures each process validates on its own GPU.
  • ContiguousDistributedSampler
    • Contiguous, batch-aligned chunks per GPU preserve dataset order and determinism.
    • Automatically used when shuffle=False (e.g., rect=True or size-grouped evaluation); falls back to PyTorch DistributedSampler when shuffle=True.

Learn more in the implementing PR, Enable multi-GPU validation during training (#22377) by Y-T-G. The sketch below illustrates the contiguous split.
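This is a minimal, illustrative sketch rather than the actual Ultralytics class: each rank receives a contiguous, batch-aligned block of indices instead of the interleaved indices produced by PyTorch's DistributedSampler. Class and parameter names here are placeholders.

    # Illustrative only: contiguous, batch-aligned chunks per rank (placeholder names).
    import math
    from torch.utils.data import Sampler

    class ContiguousChunkSampler(Sampler):
        """Give each rank a contiguous, batch-aligned block of dataset indices."""

        def __init__(self, dataset_len, num_replicas, rank, batch_size):
            self.dataset_len = dataset_len
            self.rank = rank
            # Round the per-rank share up to whole batches so chunk boundaries stay batch-aligned.
            per_rank = math.ceil(dataset_len / num_replicas)
            self.per_rank = math.ceil(per_rank / batch_size) * batch_size

        def __iter__(self):
            start = self.rank * self.per_rank
            return iter(range(start, min(start + self.per_rank, self.dataset_len)))

        def __len__(self):
            start = self.rank * self.per_rank
            return max(0, min(start + self.per_rank, self.dataset_len) - start)

    # 8 samples, 2 GPUs, batch size 2:
    #   contiguous:  rank 0 -> [0, 1, 2, 3], rank 1 -> [4, 5, 6, 7]  (dataset order preserved)
    #   interleaved: rank 0 -> [0, 2, 4, 6], rank 1 -> [1, 3, 5, 7]  (PyTorch DistributedSampler)

Keeping each rank's chunk contiguous means size-grouped batches (e.g., with rect=True) stay together, which is what preserves ordering and determinism during evaluation.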

:gear: Improvements

  • Correct cross-GPU metric aggregation
    • Validation losses are properly reduced across GPUs.
    • Detection and classification validators gather stats from all ranks and compute results on rank 0 only.
    • EMA buffers are synchronized from rank 0 to all GPUs for consistent validation.
  • Trainer flow cleanup
    • Validation is executed outside the inner training step for cleaner DDP behavior.
    • Final evaluation performs only the necessary work on rank 0, with safe synchronization for the other ranks.
  • Documentation
    • Reference docs now include ContiguousDistributedSampler.

These changes address reported issues with multi-GPU validation during training, cross-GPU aggregation correctness, and sampler ordering consistency.
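As a rough sketch of the aggregation pattern, assuming an initialized torch.distributed process group; aggregate_validation, val_loss, stats, and ema_model below are placeholder names, not Ultralytics internals:

    # Sketch only: reduce losses across GPUs, gather stats to rank 0, and sync EMA buffers.
    import torch
    import torch.distributed as dist

    def aggregate_validation(val_loss, stats, ema_model, device):
        world_size = dist.get_world_size()
        rank = dist.get_rank()

        # 1. Average the validation loss over all ranks so every GPU logs the same value.
        loss = torch.as_tensor(val_loss, device=device)
        dist.all_reduce(loss, op=dist.ReduceOp.SUM)
        loss /= world_size

        # 2. Gather each rank's prediction stats; only rank 0 computes the final metrics.
        gathered = [None] * world_size
        dist.all_gather_object(gathered, stats)
        all_stats = [s for rank_stats in gathered for s in rank_stats] if rank == 0 else None

        # 3. Broadcast EMA buffers (e.g., BatchNorm running stats) from rank 0 so every
        #    rank validates with identical weights.
        for buf in ema_model.buffers():
            dist.broadcast(buf, src=0)

        return loss, all_stats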

:bullseye: Why it matters

  • More reliable multi-GPU results
    • Proper aggregation ensures metrics reflect the full distributed dataset instead of per-rank fragments.
  • Faster and more stable validation
    • Contiguous sampling reduces padding/overhead and improves determinism, especially with rect=True.
  • Seamless distributed training
    • No extra setup required; single-GPU behavior is unchanged.

:rocket: Quick start (DDP)

  • CLI (recommended, with YOLO11):
    yolo detect train data=coco128.yaml model=yolo11n.pt device=0,1,2,3
    
  • Python:
    from ultralytics import YOLO
    
    model = YOLO("yolo11n.pt")
    model.train(data="coco128.yaml", device=[0, 1], imgsz=640, epochs=50)
    

:package: What's Changed

You can browse the highlights in the v8.3.218 release notes or dive into the details in the full changelog between v8.3.217 and v8.3.218.

:raising_hands: Try it and share feedback

Please update, run your multi-GPU workflows, and let us know how it goes. Open a discussion or issue with your findings; your feedback helps the YOLO community and the Ultralytics team keep improving. Happy training and validating across GPUs! :tada: