I have been training a YOLO11 model (yolo11l) to detect 5 object classes in images. I originally trained on this dual-GPU setup:
Ultralytics 8.3.28 🚀 Python-3.8.18 torch-2.1.2+cu121 CUDA:0 (NVIDIA GeForce RTX 3080 Ti, 12288MiB)
CUDA:1 (NVIDIA GeForce RTX 3080 Ti, 12287MiB)
We recently switched to this single-GPU setup:
Ultralytics 8.3.28 🚀 Python-3.8.18 torch-2.1.2+cu121 CUDA:0 (NVIDIA RTX 6000 Ada Generation, 49140MiB)
As far as I know, the only difference is the hardware. Yet when I tried to reproduce the dual-GPU results on the single-GPU machine, model performance changed substantially: mAP50 rose from 0.622 on the dual-GPU setup to 0.778 on the single-GPU setup.
We have double-checked all parameters, and the datasets are identical. The seeds and the package versions in both environments are the same, so the only difference seems to be the hardware.
I would like to know whether this change is due to the hardware alone and, if so, why the hardware makes such a difference.
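For anyone who wants to repeat the parameter check, here is a minimal sketch of how the two `engine\trainer:` lines from the logs can be compared automatically rather than by eye. The helper names (`parse_trainer_args`, `diff_args`) are my own, and the two strings are shortened excerpts for illustration, not the full log lines.

```python
def parse_trainer_args(line):
    """Turn 'engine\\trainer: k1=v1, k2=v2, ...' into a dict of strings."""
    # Drop the 'trainer: ' prefix if it is present; otherwise use the whole line.
    _, _, body = line.partition("trainer: ")
    body = body or line
    args = {}
    # Split on ', ' (comma + space) so values such as 'device=0,1' stay whole.
    for item in body.split(", "):
        key, sep, value = item.partition("=")
        if sep:
            args[key.strip()] = value.strip()
    return args


def diff_args(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose value differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Shortened excerpts of the two log lines, for illustration only.
dual = r"engine\trainer: task=detect, batch=2, imgsz=2048, device=0,1, workers=8, seed=0"
single = r"engine\trainer: task=detect, batch=2, imgsz=2048, device=0, workers=8, seed=0"

changed = diff_args(parse_trainer_args(dual), parse_trainer_args(single))
print(changed)  # {'device': ('0,1', '0')}
```

Comparing the two full lines by eye, the keys I see differing are `device`, `name` and `save_dir`; pasting the full lines into `dual` and `single` would confirm or correct that.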
Thanks for reading. More environment and parameter info is below:
Dual GPU:
####load yolo11#####
New https://pypi.org/project/ultralytics/8.3.115 available 😃 Update with 'pip install -U ultralytics'
Ultralytics 8.3.28 🚀 Python-3.8.18 torch-2.1.2+cu121 CUDA:0 (NVIDIA GeForce RTX 3080 Ti, 12288MiB)
CUDA:1 (NVIDIA GeForce RTX 3080 Ti, 12287MiB)
engine\trainer: task=detect, mode=train, model=./yolo11l.pt, data=./data.yaml, epochs=1000, time=None, patience=400, batch=2, imgsz=2048, save=True, save_period=10, cache=False, device=0,1, workers=8, project=runs, name=cynthiaBoxesKM_2025_02_14_Dell2(04232025)_model11l, exist_ok=False, pretrained=False, optimizer=Adam, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.2, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=True, lr0=0.001, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.0, hsv_s=0.0, hsv_v=0.0, degrees=180.0, translate=0.2, scale=0.5, shear=0.0, perspective=0.0, flipud=0.5, fliplr=0.5, bgr=0.0, mosaic=0.2, mixup=0.2, copy_paste=0.0, copy_paste_mode=flip, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=runs\cynthiaBoxesKM_2025_02_14_Dell2(04232025)_model11l
Overriding model.yaml nc=80 with nc=5
from n params module arguments
0 -1 1 1856 ultralytics.nn.modules.conv.Conv [3, 64, 3, 2]
1 -1 1 73984 ultralytics.nn.modules.conv.Conv [64, 128, 3, 2]
2 -1 2 173824 ultralytics.nn.modules.block.C3k2 [128, 256, 2, True, 0.25]
3 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
4 -1 2 691712 ultralytics.nn.modules.block.C3k2 [256, 512, 2, True, 0.25]
5 -1 1 2360320 ultralytics.nn.modules.conv.Conv [512, 512, 3, 2]
6 -1 2 2234368 ultralytics.nn.modules.block.C3k2 [512, 512, 2, True]
7 -1 1 2360320 ultralytics.nn.modules.conv.Conv [512, 512, 3, 2]
8 -1 2 2234368 ultralytics.nn.modules.block.C3k2 [512, 512, 2, True]
9 -1 1 656896 ultralytics.nn.modules.block.SPPF [512, 512, 5]
10 -1 2 1455616 ultralytics.nn.modules.block.C2PSA [512, 512, 2]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 ultralytics.nn.modules.conv.Concat [1]
13 -1 2 2496512 ultralytics.nn.modules.block.C3k2 [1024, 512, 2, True]
14 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
15 [-1, 4] 1 0 ultralytics.nn.modules.conv.Concat [1]
16 -1 2 756736 ultralytics.nn.modules.block.C3k2 [1024, 256, 2, True]
17 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
18 [-1, 13] 1 0 ultralytics.nn.modules.conv.Concat [1]
19 -1 2 2365440 ultralytics.nn.modules.block.C3k2 [768, 512, 2, True]
20 -1 1 2360320 ultralytics.nn.modules.conv.Conv [512, 512, 3, 2]
21 [-1, 10] 1 0 ultralytics.nn.modules.conv.Concat [1]
22 -1 2 2496512 ultralytics.nn.modules.block.C3k2 [1024, 512, 2, True]
23 [16, 19, 22] 1 1414879 ultralytics.nn.modules.head.Detect [5, [256, 512, 512]]
YOLO11l summary: 631 layers, 25,314,335 parameters, 25,314,319 gradients, 87.3 GFLOPs
DDP: debug command C:\anaconda3_public\anaconda3\envs\yolov9_haoyu\python.exe -m torch.distributed.run --nproc_per_node 2 --master_port 56047 C:\Users\km\AppData\Roaming\Ultralytics\DDP\_temp_kv7m11u22479005911168.py
[2025-04-23 16:57:33,353] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
Ultralytics 8.3.28 🚀 Python-3.8.18 torch-2.1.2+cu121 CUDA:0 (NVIDIA GeForce RTX 3080 Ti, 12288MiB)
CUDA:1 (NVIDIA GeForce RTX 3080 Ti, 12287MiB)
Overriding model.yaml nc=80 with nc=5
Transferred 1009/1015 items from pretrained weights
Freezing layer 'model.23.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks...
AMP: checks passed ✅
train: Scanning C:\Users\km\Documents\YoloV11Data\cynthiaBoxesKM_2025_02_14_OldVal\training\labels.cache... 448 i
val: Scanning C:\Users\km\Documents\YoloV11Data\cynthiaBoxesKM_2025_02_14_OldVal\trainVal\labels.cache... 32 imag
Plotting labels to runs\cynthiaBoxesKM_2025_02_14_Dell2(04232025)_model11l\labels.jpg...
optimizer: Adam(lr=0.001, momentum=0.937) with parameter groups 167 weight(decay=0.0), 174 weight(decay=0.0005), 173 bias(decay=0.0)
Image sizes 2048 train, 2048 val
Using 16 dataloader workers
Logging results to runs\cynthiaBoxesKM_2025_02_14_Dell2(04232025)_model11l
Starting training for 1000 epochs...
EarlyStopping: Training stopped early as no improvement observed in last 400 epochs. Best results observed at epoch 71, best model saved as best.pt.
To update EarlyStopping(patience=400) pass a new patience value, i.e. `patience=300` or use `patience=0` to disable EarlyStopping.
471 epochs completed in 18.977 hours.
Optimizer stripped from runs\cynthiaBoxesKM_2025_02_14_Dell2(04232025)_model11l\weights\last.pt, 51.7MB
Optimizer stripped from runs\cynthiaBoxesKM_2025_02_14_Dell2(04232025)_model11l\weights\best.pt, 51.7MB
Validating runs\cynthiaBoxesKM_2025_02_14_Dell2(04232025)_model11l\weights\best.pt...
Ultralytics 8.3.28 🚀 Python-3.8.18 torch-2.1.2+cu121 CUDA:0 (NVIDIA GeForce RTX 3080 Ti, 12288MiB)
CUDA:1 (NVIDIA GeForce RTX 3080 Ti, 12287MiB)
YOLO11l summary (fused): 464 layers, 25,283,167 parameters, 0 gradients, 86.6 GFLOPs
   Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 16/16 [00:03
     all         32       1588      0.624      0.626      0.622      0.177
       1         31        662      0.463       0.61      0.521      0.135
       2         31        381      0.761      0.543      0.688      0.226
       3         14        386      0.774      0.726      0.769      0.195
       4          7         87      0.698      0.612      0.662      0.205
       5         21         72      0.424      0.639      0.468      0.124
Speed: 7.0ms preprocess, 60.7ms inference, 0.0ms loss, 21.0ms postprocess per image
Single GPU setup:
####load yolo11#####
New https://pypi.org/project/ultralytics/8.3.113 available 😃 Update with 'pip install -U ultralytics'
Ultralytics 8.3.28 🚀 Python-3.8.18 torch-2.1.2+cu121 CUDA:0 (NVIDIA RTX 6000 Ada Generation, 49140MiB)
engine\trainer: task=detect, mode=train, model=./yolo11l.pt, data=./data.yaml, epochs=1000, time=None, patience=400, batch=2, imgsz=2048, save=True, save_period=10, cache=False, device=0, workers=8, project=runs, name=cynthiaBoxesKM_2025_02_14_Dell3(04172025)_model11l, exist_ok=False, pretrained=False, optimizer=Adam, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.2, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=True, lr0=0.001, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.0, hsv_s=0.0, hsv_v=0.0, degrees=180.0, translate=0.2, scale=0.5, shear=0.0, perspective=0.0, flipud=0.5, fliplr=0.5, bgr=0.0, mosaic=0.2, mixup=0.2, copy_paste=0.0, copy_paste_mode=flip, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=runs\cynthiaBoxesKM_2025_02_14_Dell3(04172025)_model11l
Overriding model.yaml nc=80 with nc=5
from n params module arguments
0 -1 1 1856 ultralytics.nn.modules.conv.Conv [3, 64, 3, 2]
1 -1 1 73984 ultralytics.nn.modules.conv.Conv [64, 128, 3, 2]
2 -1 2 173824 ultralytics.nn.modules.block.C3k2 [128, 256, 2, True, 0.25]
3 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
4 -1 2 691712 ultralytics.nn.modules.block.C3k2 [256, 512, 2, True, 0.25]
5 -1 1 2360320 ultralytics.nn.modules.conv.Conv [512, 512, 3, 2]
6 -1 2 2234368 ultralytics.nn.modules.block.C3k2 [512, 512, 2, True]
7 -1 1 2360320 ultralytics.nn.modules.conv.Conv [512, 512, 3, 2]
8 -1 2 2234368 ultralytics.nn.modules.block.C3k2 [512, 512, 2, True]
9 -1 1 656896 ultralytics.nn.modules.block.SPPF [512, 512, 5]
10 -1 2 1455616 ultralytics.nn.modules.block.C2PSA [512, 512, 2]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 ultralytics.nn.modules.conv.Concat [1]
13 -1 2 2496512 ultralytics.nn.modules.block.C3k2 [1024, 512, 2, True]
14 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
15 [-1, 4] 1 0 ultralytics.nn.modules.conv.Concat [1]
16 -1 2 756736 ultralytics.nn.modules.block.C3k2 [1024, 256, 2, True]
17 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
18 [-1, 13] 1 0 ultralytics.nn.modules.conv.Concat [1]
19 -1 2 2365440 ultralytics.nn.modules.block.C3k2 [768, 512, 2, True]
20 -1 1 2360320 ultralytics.nn.modules.conv.Conv [512, 512, 3, 2]
21 [-1, 10] 1 0 ultralytics.nn.modules.conv.Concat [1]
22 -1 2 2496512 ultralytics.nn.modules.block.C3k2 [1024, 512, 2, True]
23 [16, 19, 22] 1 1414879 ultralytics.nn.modules.head.Detect [5, [256, 512, 512]]
YOLO11l summary: 631 layers, 25,314,335 parameters, 25,314,319 gradients, 87.3 GFLOPs
Freezing layer 'model.23.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks...
AMP: checks passed ✅
train: Scanning C:\Users\kmathers\Documents\YoloV11Data\cynthiaBoxesKM_2025_02_14_OldVal\training\labels.cache... 448 i
val: Scanning C:\Users\kmathers\Documents\YoloV11Data\cynthiaBoxesKM_2025_02_14_OldVal\trainVal\labels.cache... 32 imag
Plotting labels to runs\cynthiaBoxesKM_2025_02_14_Dell3(04172025)_model11l\labels.jpg...
optimizer: Adam(lr=0.001, momentum=0.937) with parameter groups 167 weight(decay=0.0), 174 weight(decay=0.0005), 173 bias(decay=0.0)
Image sizes 2048 train, 2048 val
Using 8 dataloader workers
Logging results to runs\cynthiaBoxesKM_2025_02_14_Dell3(04172025)_model11l
Starting training for 1000 epochs...
EarlyStopping: Training stopped early as no improvement observed in last 400 epochs. Best results observed at epoch 158, best model saved as best.pt.
To update EarlyStopping(patience=400) pass a new patience value, i.e. `patience=300` or use `patience=0` to disable EarlyStopping.
558 epochs completed in 9.810 hours.
Optimizer stripped from runs\cynthiaBoxesKM_2025_02_14_Dell3(04172025)_model11l\weights\last.pt, 51.7MB
Optimizer stripped from runs\cynthiaBoxesKM_2025_02_14_Dell3(04172025)_model11l\weights\best.pt, 51.7MB
Validating runs\cynthiaBoxesKM_2025_02_14_Dell3(04172025)_model11l\weights\best.pt...
Ultralytics 8.3.28 🚀 Python-3.8.18 torch-2.1.2+cu121 CUDA:0 (NVIDIA RTX 6000 Ada Generation, 49140MiB)
YOLO11l summary (fused): 464 layers, 25,283,167 parameters, 0 gradients, 86.6 GFLOPs
   Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:02<0
     all         32       1588      0.718      0.749      0.778      0.423
       1         31        662      0.564      0.696      0.631      0.307
       2         31        381      0.804      0.745      0.813       0.43
       3         14        386      0.865      0.925      0.951      0.586
       4          7         87       0.71       0.77      0.782      0.454
       5         21         72      0.646      0.608      0.713      0.337
Speed: 2.0ms preprocess, 23.9ms inference, 0.0ms loss, 22.5ms postprocess per image
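To make the gap easier to read, here is a small sketch that computes the per-class differences (single GPU minus dual GPU) from the two validation tables above. The numbers are copied from the logs; the function name `metric_deltas` is just for illustration.

```python
# Validation rows copied from the two logs above: class -> (P, R, mAP50, mAP50-95).
dual_gpu = {
    "all": (0.624, 0.626, 0.622, 0.177),
    "1": (0.463, 0.61, 0.521, 0.135),
    "2": (0.761, 0.543, 0.688, 0.226),
    "3": (0.774, 0.726, 0.769, 0.195),
    "4": (0.698, 0.612, 0.662, 0.205),
    "5": (0.424, 0.639, 0.468, 0.124),
}
single_gpu = {
    "all": (0.718, 0.749, 0.778, 0.423),
    "1": (0.564, 0.696, 0.631, 0.307),
    "2": (0.804, 0.745, 0.813, 0.43),
    "3": (0.865, 0.925, 0.951, 0.586),
    "4": (0.71, 0.77, 0.782, 0.454),
    "5": (0.646, 0.608, 0.713, 0.337),
}


def metric_deltas(a, b):
    """Per-class difference b - a for each metric, rounded to 3 decimals."""
    return {cls: tuple(round(y - x, 3) for x, y in zip(a[cls], b[cls])) for cls in a}


deltas = metric_deltas(dual_gpu, single_gpu)
for cls, (dp, dr, d50, d5095) in deltas.items():
    print(f"{cls:>3}  dP={dp:+.3f}  dR={dr:+.3f}  dmAP50={d50:+.3f}  dmAP50-95={d5095:+.3f}")
```

Every class improves in mAP50 on the single-GPU run, and the overall mAP50-95 gap (0.177 vs 0.423) is even larger than the mAP50 gap. The only metric that goes down is recall for class 5.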
I am aware that the number of dataloader workers differs because of the number of GPUs used, but I ran a trial with the same number of dataloader workers and the results still differ by 0.1 or more.
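One thing I have not ruled out is how the settings are divided between the two processes in the dual-GPU run. The sketch below is only arithmetic under two assumptions of mine that I have not verified in the Ultralytics source: that `batch=2` is a global batch split evenly across GPUs, and that `workers=8` applies per process. The function `per_process_view` is hypothetical.

```python
def per_process_view(batch, workers, n_gpus):
    """Assumed DDP split: global batch divided across GPUs, workers per process."""
    world_size = max(n_gpus, 1)
    return {
        "images_per_gpu_per_step": batch // world_size,
        "total_dataloader_workers": workers * world_size,
    }


dual = per_process_view(batch=2, workers=8, n_gpus=2)
single = per_process_view(batch=2, workers=8, n_gpus=1)
print(dual)    # {'images_per_gpu_per_step': 1, 'total_dataloader_workers': 16}
print(single)  # {'images_per_gpu_per_step': 2, 'total_dataloader_workers': 8}
```

The worker totals match the "Using 16 dataloader workers" and "Using 8 dataloader workers" lines in the logs, which is at least consistent with the per-process assumption. If the batch assumption also holds, each 3080 Ti would see a single image per step while the RTX 6000 Ada sees two, so the two runs may not be as equivalent as the identical parameter lists suggest.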