Ultralytics v8.3.218
True multi-GPU validation, contiguous sampler, and accurate cross-GPU metrics
Ultralytics v8.3.218 delivers more reliable, faster multi-GPU training. This release enables true multi-GPU validation during training, with correct cross-GPU metric aggregation and a new contiguous distributed sampler for stable evaluation. Upgrade with `pip install -U ultralytics` and enjoy a smoother DDP experience.
Summary
- Multi-GPU validation now runs correctly on all ranks with proper aggregation.
- New `ContiguousDistributedSampler` preserves sample order and batch alignment across GPUs.
- Cleaner trainer flow and synchronized EMA for consistent metrics.
New Features
- Multi-GPU validation during training
  - Validation `DataLoader` and `Validator` are created on all ranks for proper DDP execution.
  - Rank-aware device selection ensures each process validates on its own GPU.
- `ContiguousDistributedSampler` (see the sketch below)
  - Contiguous, batch-aligned chunks per GPU preserve dataset order and determinism.
  - Automatically used when `shuffle=False` (e.g., `rect=True` or size-grouped evaluation); falls back to PyTorch's `DistributedSampler` when `shuffle=True`.
Learn more by reviewing the implementing PR, Enable multi-GPU validation during training (#22377) by Y-T-G.
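For intuition, here is a minimal sketch of the contiguous, batch-aligned chunking idea, assuming a plain PyTorch `Sampler`; the class name `ContiguousChunkSampler` and its constructor arguments are illustrative placeholders, not the actual Ultralytics API.

```python
import math
from torch.utils.data import Sampler

class ContiguousChunkSampler(Sampler):
    """Illustrative sketch: give each rank a contiguous, batch-aligned slice of the dataset.

    Hypothetical example only; the real ContiguousDistributedSampler in Ultralytics
    may differ in API and in how it handles uneven shards.
    """

    def __init__(self, dataset_len, num_replicas, rank, batch_size):
        # Round the per-rank share up to a whole number of batches so chunk
        # boundaries never split a batch across GPUs.
        per_rank = math.ceil(dataset_len / num_replicas)
        per_rank = math.ceil(per_rank / batch_size) * batch_size
        self.start = min(rank * per_rank, dataset_len)
        self.end = min(self.start + per_rank, dataset_len)

    def __iter__(self):
        # Indices stay in dataset order, keeping size-grouped (rect) batches intact.
        return iter(range(self.start, self.end))

    def __len__(self):
        return self.end - self.start
```

Because indices remain in dataset order and each shard is a whole number of batches, size-grouped evaluation with `rect=True` sees the same batches it would on a single GPU, which is why the shuffled `DistributedSampler` is only used when `shuffle=True`.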
Improvements
- Correct cross-GPU metric aggregation (see the sketch below)
  - Validation losses are properly reduced across GPUs.
  - Detection and classification validators gather stats from all ranks and compute results on rank 0 only.
  - EMA buffers are synchronized from rank 0 to all GPUs for consistent validation.
- Trainer flow cleanup
  - Validation is executed outside the inner training step for cleaner DDP behavior.
  - Final evaluation does only the necessary work on rank 0 with safe synchronization for others.
- Documentation
  - Reference docs now include `ContiguousDistributedSampler`.
These changes address reported issues including multi-GPU validation during training, cross-GPU aggregation correctness, and sampler ordering consistency.
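As a rough picture of the aggregation and EMA synchronization described above, here is a hedged sketch built on standard `torch.distributed` primitives; `compute_metrics`, `local_loss`, `local_stats`, and `ema_model` are placeholder names for illustration, not Ultralytics APIs.

```python
import torch.distributed as dist

def aggregate_validation(local_loss, local_stats, ema_model):
    """Illustrative cross-GPU validation aggregation (not the Ultralytics code).

    local_loss:  per-rank validation loss tensor on that rank's GPU
    local_stats: per-rank list of prediction/label stats
    ema_model:   EMA copy of the model whose buffers should match rank 0
    """
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # 1) Reduce losses so every rank sees the mean over all GPUs.
    dist.all_reduce(local_loss, op=dist.ReduceOp.SUM)
    local_loss /= world_size

    # 2) Gather per-rank stats; only rank 0 merges them and computes metrics.
    gathered = [None] * world_size
    dist.all_gather_object(gathered, local_stats)
    metrics = None
    if rank == 0:
        merged = [s for shard in gathered for s in shard]
        metrics = compute_metrics(merged)  # hypothetical metric computation

    # 3) Broadcast EMA buffers (e.g., BatchNorm running stats) from rank 0
    #    so every GPU validates with identical weights.
    for buf in ema_model.buffers():
        dist.broadcast(buf, src=0)

    return local_loss, metrics
```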
Why it matters
- More reliable multi-GPU results
  - Proper aggregation ensures metrics reflect the full distributed dataset instead of per-rank fragments.
- Faster and more stable validation
  - Contiguous sampling reduces padding/overhead and improves determinism, especially with `rect=True`.
- Seamless distributed training
  - No extra setup required; single-GPU behavior is unchanged.
Quick start (DDP)
- CLI (YOLO11 recommended):

```bash
yolo detect train data=coco128.yaml model=yolo11n.pt device=0,1,2,3
```

- Python:

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")
model.train(data="coco128.yaml", device=[0, 1], imgsz=640, epochs=50)
```
What's Changed
- ultralytics 8.3.218: Multi-GPU validation during training implemented in Enable multi-GPU validation during training (#22377) by Y-T-G.
You can browse the highlights in the v8.3.218 release notes or dive into the details in the full changelog between v8.3.217 and v8.3.218.
Try it and share feedback
Please update, run your multi-GPU workflows, and let us know how it goes. Open a discussion or issue with your findings; your feedback helps the YOLO community and the Ultralytics team keep improving. Happy training and validating across GPUs!