Multi-GPU train - NVIDIA 5090

Windows Multi-GPU / YOLO11 from Python API won’t start multi-GPU: always runs on cuda:0

GPUs: 4× NVIDIA GeForce RTX 5090 (32 GB)
Driver: 577.00 (CUDA driver 12.9)
OS: Windows 11 Pro
Python: 3.11.11 (Conda)
PyTorch: 2.8.0+cu128
Ultralytics: 8.3.127 (CLI suggests updating to 8.3.206)

Problem: Training via the Python API with device=[0,1,2,3] always runs on a single GPU (cuda:0); DDP never initializes.

What I’m looking for:

  1. How to start multi-GPU (DDP) training from the Python API on Windows in this version?
  2. Is this a known issue in 8.3.127, and does updating to 8.3.206 fix API-side DDP on Windows?
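For reference, a call of the kind described above might look like the sketch below. The script layout, the `parse_devices` helper, the `yolo11n.pt` weights, and the command-line arguments are illustrative assumptions, not the original poster's code. The `if __name__ == "__main__":` guard is there because Windows starts child processes by re-importing the main module; whether it changes the behaviour reported here is not established in this thread.

```python
import sys


def parse_devices(spec):
    """Turn 0, "0,1,2,3" or [0, 1, 2, 3] into a list of GPU indices (hypothetical helper)."""
    if isinstance(spec, int):
        return [spec]
    if isinstance(spec, str):
        return [int(part) for part in spec.split(",") if part.strip()]
    return [int(part) for part in spec]


def train(data_yaml, devices):
    # Imported lazily so the helper above works without Ultralytics installed.
    from ultralytics import YOLO

    model = YOLO("yolo11n.pt")  # placeholder weights, assumed for illustration
    # A list of several indices is what requests multi-GPU training.
    return model.train(data=data_yaml, epochs=1, device=devices)


if __name__ == "__main__":
    # Guard the entry point: on Windows, spawned workers re-import this file.
    if len(sys.argv) < 2:
        print("usage: python train_ddp.py <data.yaml> [devices, e.g. 0,1,2,3]")
    else:
        device_spec = sys.argv[2] if len(sys.argv) > 2 else "0,1,2,3"
        train(sys.argv[1], parse_devices(device_spec))
```

Saved as a file (here called train_ddp.py, a hypothetical name), it would be run as `python train_ddp.py path/to/data.yaml 0,1,2,3` rather than pasted into an interactive session or notebook.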

Can you post the training logs?

Can you provide the output of running yolo checks in a terminal?

yolo checks output:

Ultralytics 8.3.127 Python-3.11.11 torch-2.8.0+cu128 CUDA:0 (NVIDIA GeForce RTX 5090, 32607MiB)
Setup complete (32 CPUs, 127.8 GB RAM, 181.8/378.5 GB disk)

OS Windows-10-10.0.26100-SP0
Environment Windows
Python 3.11.11
Install pip
Path
RAM 127.79 GB
Disk 181.8/378.5 GB
CPU AMD EPYC 9124 16-Core Processor
CPU count 32
GPU NVIDIA GeForce RTX 5090, 32607MiB
GPU count 4
CUDA 12.8

numpy 1.26.4>=1.23.0
matplotlib 3.10.6>=3.3.0
opencv-python 4.11.0.86>=4.6.0
pillow 11.3.0>=7.1.2
pyyaml 6.0.3>=5.3.1
requests 2.32.5>=2.23.0
scipy 1.16.2>=1.4.1
torch 2.8.0+cu128>=1.8.0
torch 2.8.0+cu128!=2.4.0,>=1.8.0; sys_platform == "win32"
torchvision 0.23.0+cu128>=0.9.0
tqdm 4.67.1>=4.64.0
psutil 7.1.0
py-cpuinfo 9.0.0
pandas 2.3.3>=1.1.4
seaborn 0.13.2>=0.11.0
ultralytics-thop 2.0.17>=2.0.0

I wish I had a setup like that to test with!

Two quick questions.

  1. If you try training with device=1 or device=2, does it train on that GPU, or does it still choose the first GPU?
  2. What does nvidia-smi show?

  1. Yes, training runs on the selected GPU.
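Alongside nvidia-smi, it may help to check what the Python process itself can see. The sketch below is only a diagnostic suggestion: the `gpu_report` function name and the choice of fields are assumptions made for illustration. It reports `CUDA_VISIBLE_DEVICES`, the GPU count PyTorch detects, and which `torch.distributed` backends are available. NCCL is generally not included in PyTorch's Windows builds, so a False there would not be surprising by itself.

```python
import os


def gpu_report():
    """Collect what this Python process can see; torch fields stay None if torch is missing."""
    report = {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "cuda_available": None,
        "device_count": None,
        "device_names": None,
        "distributed_available": None,
        "gloo_available": None,
        "nccl_available": None,
    }
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        return report

    count = torch.cuda.device_count()
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = count
    report["device_names"] = [torch.cuda.get_device_name(i) for i in range(count)]
    report["distributed_available"] = dist.is_available()
    if dist.is_available():
        # The backend checks are only exposed when distributed support is built in.
        report["gloo_available"] = dist.is_gloo_available()
        report["nccl_available"] = dist.is_nccl_available()
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `device_count` is lower than the 4 GPUs listed by yolo checks, or `CUDA_VISIBLE_DEVICES` is set to a single index, that would be worth mentioning together with the training logs.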

What do the training logs show with multi-GPU training? Can you post them? Do they show DDP?