I am seeing major improvements in my model and the only change has been the machine it is trained on

I have been training a YOLO11 model for object detection with 5 classes. I was originally training with this dual GPU setup:

Ultralytics 8.3.28 🚀 Python-3.8.18 torch-2.1.2+cu121 CUDA:0 (NVIDIA GeForce RTX 3080 Ti, 12288MiB)
                                                      CUDA:1 (NVIDIA GeForce RTX 3080 Ti, 12287MiB)

But we recently made the change to this single GPU setup:

Ultralytics 8.3.28 🚀 Python-3.8.18 torch-2.1.2+cu121 CUDA:0 (NVIDIA RTX 6000 Ada Generation, 49140MiB)

As far as I know the only difference is the hardware, yet when I tried to reproduce the dual GPU results on the single GPU setup, model performance changed substantially: 0.778 mAP50 on the single GPU versus 0.622 mAP50 on the dual GPU.

We have double checked all parameters, and the datasets are identical. The only difference seems to be the hardware. The seeds and the package versions in both environments are the same.

I would like to know if this change is only due to hardware, and if so, why does the hardware make such an improvement?

Thanks for reading. I will include more environment and parameter info below:

Dual GPU:

####load yolo11#####
New https://pypi.org/project/ultralytics/8.3.115 available 😃 Update with 'pip install -U ultralytics'
Ultralytics 8.3.28 🚀 Python-3.8.18 torch-2.1.2+cu121 CUDA:0 (NVIDIA GeForce RTX 3080 Ti, 12288MiB)
                                                      CUDA:1 (NVIDIA GeForce RTX 3080 Ti, 12287MiB)
engine\trainer: task=detect, mode=train, model=./yolo11l.pt, data=./data.yaml, epochs=1000, time=None, patience=400, batch=2, imgsz=2048, save=True, save_period=10, cache=False, device=0,1, workers=8, project=runs, name=cynthiaBoxesKM_2025_02_14_Dell2(04232025)_model11l, exist_ok=False, pretrained=False, optimizer=Adam, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.2, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=True, lr0=0.001, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.0, hsv_s=0.0, hsv_v=0.0, degrees=180.0, translate=0.2, scale=0.5, shear=0.0, perspective=0.0, flipud=0.5, fliplr=0.5, bgr=0.0, mosaic=0.2, mixup=0.2, copy_paste=0.0, copy_paste_mode=flip, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=runs\cynthiaBoxesKM_2025_02_14_Dell2(04232025)_model11l
Overriding model.yaml nc=80 with nc=5

                   from  n    params  module                                       arguments
  0                  -1  1      1856  ultralytics.nn.modules.conv.Conv             [3, 64, 3, 2]
  1                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 2]
  2                  -1  2    173824  ultralytics.nn.modules.block.C3k2            [128, 256, 2, True, 0.25]
  3                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]
  4                  -1  2    691712  ultralytics.nn.modules.block.C3k2            [256, 512, 2, True, 0.25]
  5                  -1  1   2360320  ultralytics.nn.modules.conv.Conv             [512, 512, 3, 2]
  6                  -1  2   2234368  ultralytics.nn.modules.block.C3k2            [512, 512, 2, True]
  7                  -1  1   2360320  ultralytics.nn.modules.conv.Conv             [512, 512, 3, 2]
  8                  -1  2   2234368  ultralytics.nn.modules.block.C3k2            [512, 512, 2, True]
  9                  -1  1    656896  ultralytics.nn.modules.block.SPPF            [512, 512, 5]
 10                  -1  2   1455616  ultralytics.nn.modules.block.C2PSA           [512, 512, 2]
 11                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 12             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 13                  -1  2   2496512  ultralytics.nn.modules.block.C3k2            [1024, 512, 2, True]
 14                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 15             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 16                  -1  2    756736  ultralytics.nn.modules.block.C3k2            [1024, 256, 2, True]
 17                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]
 18            [-1, 13]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 19                  -1  2   2365440  ultralytics.nn.modules.block.C3k2            [768, 512, 2, True]
 20                  -1  1   2360320  ultralytics.nn.modules.conv.Conv             [512, 512, 3, 2]
 21            [-1, 10]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 22                  -1  2   2496512  ultralytics.nn.modules.block.C3k2            [1024, 512, 2, True]
 23        [16, 19, 22]  1   1414879  ultralytics.nn.modules.head.Detect           [5, [256, 512, 512]]
YOLO11l summary: 631 layers, 25,314,335 parameters, 25,314,319 gradients, 87.3 GFLOPs

DDP: debug command C:\anaconda3_public\anaconda3\envs\yolov9_haoyu\python.exe -m torch.distributed.run --nproc_per_node 2 --master_port 56047 C:\Users\km\AppData\Roaming\Ultralytics\DDP\_temp_kv7m11u22479005911168.py
[2025-04-23 16:57:33,353] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
Ultralytics 8.3.28 🚀 Python-3.8.18 torch-2.1.2+cu121 CUDA:0 (NVIDIA GeForce RTX 3080 Ti, 12288MiB)
                                                      CUDA:1 (NVIDIA GeForce RTX 3080 Ti, 12287MiB)
Overriding model.yaml nc=80 with nc=5
Transferred 1009/1015 items from pretrained weights
Freezing layer 'model.23.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks...
AMP: checks passed ✅
train: Scanning C:\Users\km\Documents\YoloV11Data\cynthiaBoxesKM_2025_02_14_OldVal\training\labels.cache... 448 i
val: Scanning C:\Users\km\Documents\YoloV11Data\cynthiaBoxesKM_2025_02_14_OldVal\trainVal\labels.cache... 32 imag
Plotting labels to runs\cynthiaBoxesKM_2025_02_14_Dell2(04232025)_model11l\labels.jpg...
optimizer: Adam(lr=0.001, momentum=0.937) with parameter groups 167 weight(decay=0.0), 174 weight(decay=0.0005), 173 bias(decay=0.0)
Image sizes 2048 train, 2048 val
Using 16 dataloader workers
Logging results to runs\cynthiaBoxesKM_2025_02_14_Dell2(04232025)_model11l
Starting training for 1000 epochs...



EarlyStopping: Training stopped early as no improvement observed in last 400 epochs. Best results observed at epoch 71, best model saved as best.pt.
To update EarlyStopping(patience=400) pass a new patience value, i.e. `patience=300` or use `patience=0` to disable EarlyStopping.

471 epochs completed in 18.977 hours.
Optimizer stripped from runs\cynthiaBoxesKM_2025_02_14_Dell2(04232025)_model11l\weights\last.pt, 51.7MB
Optimizer stripped from runs\cynthiaBoxesKM_2025_02_14_Dell2(04232025)_model11l\weights\best.pt, 51.7MB

Validating runs\cynthiaBoxesKM_2025_02_14_Dell2(04232025)_model11l\weights\best.pt...
Ultralytics 8.3.28 🚀 Python-3.8.18 torch-2.1.2+cu121 CUDA:0 (NVIDIA GeForce RTX 3080 Ti, 12288MiB)
                                                      CUDA:1 (NVIDIA GeForce RTX 3080 Ti, 12287MiB)
YOLO11l summary (fused): 464 layers, 25,283,167 parameters, 0 gradients, 86.6 GFLOPs
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 16/16 [00:03
                   all         32       1588      0.624      0.626      0.622      0.177
                    1         31        662      0.463       0.61      0.521      0.135
                    2         31        381      0.761      0.543      0.688      0.226
                    3         14        386      0.774      0.726      0.769      0.195
                    4          7         87      0.698      0.612      0.662      0.205
                    5         21         72      0.424      0.639      0.468      0.124
Speed: 7.0ms preprocess, 60.7ms inference, 0.0ms loss, 21.0ms postprocess per image

Single GPU setup:

####load yolo11#####
New https://pypi.org/project/ultralytics/8.3.113 available 😃 Update with 'pip install -U ultralytics'
Ultralytics 8.3.28 🚀 Python-3.8.18 torch-2.1.2+cu121 CUDA:0 (NVIDIA RTX 6000 Ada Generation, 49140MiB)
engine\trainer: task=detect, mode=train, model=./yolo11l.pt, data=./data.yaml, epochs=1000, time=None, patience=400, batch=2, imgsz=2048, save=True, save_period=10, cache=False, device=0, workers=8, project=runs, name=cynthiaBoxesKM_2025_02_14_Dell3(04172025)_model11l, exist_ok=False, pretrained=False, optimizer=Adam, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.2, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=True, lr0=0.001, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.0, hsv_s=0.0, hsv_v=0.0, degrees=180.0, translate=0.2, scale=0.5, shear=0.0, perspective=0.0, flipud=0.5, fliplr=0.5, bgr=0.0, mosaic=0.2, mixup=0.2, copy_paste=0.0, copy_paste_mode=flip, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=runs\cynthiaBoxesKM_2025_02_14_Dell3(04172025)_model11l
Overriding model.yaml nc=80 with nc=5

                   from  n    params  module                                       arguments
  0                  -1  1      1856  ultralytics.nn.modules.conv.Conv             [3, 64, 3, 2]
  1                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 2]
  2                  -1  2    173824  ultralytics.nn.modules.block.C3k2            [128, 256, 2, True, 0.25]
  3                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]
  4                  -1  2    691712  ultralytics.nn.modules.block.C3k2            [256, 512, 2, True, 0.25]
  5                  -1  1   2360320  ultralytics.nn.modules.conv.Conv             [512, 512, 3, 2]
  6                  -1  2   2234368  ultralytics.nn.modules.block.C3k2            [512, 512, 2, True]
  7                  -1  1   2360320  ultralytics.nn.modules.conv.Conv             [512, 512, 3, 2]
  8                  -1  2   2234368  ultralytics.nn.modules.block.C3k2            [512, 512, 2, True]
  9                  -1  1    656896  ultralytics.nn.modules.block.SPPF            [512, 512, 5]
 10                  -1  2   1455616  ultralytics.nn.modules.block.C2PSA           [512, 512, 2]
 11                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 12             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 13                  -1  2   2496512  ultralytics.nn.modules.block.C3k2            [1024, 512, 2, True]
 14                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 15             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 16                  -1  2    756736  ultralytics.nn.modules.block.C3k2            [1024, 256, 2, True]
 17                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]
 18            [-1, 13]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 19                  -1  2   2365440  ultralytics.nn.modules.block.C3k2            [768, 512, 2, True]
 20                  -1  1   2360320  ultralytics.nn.modules.conv.Conv             [512, 512, 3, 2]
 21            [-1, 10]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 22                  -1  2   2496512  ultralytics.nn.modules.block.C3k2            [1024, 512, 2, True]
 23        [16, 19, 22]  1   1414879  ultralytics.nn.modules.head.Detect           [5, [256, 512, 512]]
YOLO11l summary: 631 layers, 25,314,335 parameters, 25,314,319 gradients, 87.3 GFLOPs

Freezing layer 'model.23.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks...
AMP: checks passed ✅
train: Scanning C:\Users\kmathers\Documents\YoloV11Data\cynthiaBoxesKM_2025_02_14_OldVal\training\labels.cache... 448 i
val: Scanning C:\Users\kmathers\Documents\YoloV11Data\cynthiaBoxesKM_2025_02_14_OldVal\trainVal\labels.cache... 32 imag
Plotting labels to runs\cynthiaBoxesKM_2025_02_14_Dell3(04172025)_model11l\labels.jpg...
optimizer: Adam(lr=0.001, momentum=0.937) with parameter groups 167 weight(decay=0.0), 174 weight(decay=0.0005), 173 bias(decay=0.0)
Image sizes 2048 train, 2048 val
Using 8 dataloader workers
Logging results to runs\cynthiaBoxesKM_2025_02_14_Dell3(04172025)_model11l
Starting training for 1000 epochs...

EarlyStopping: Training stopped early as no improvement observed in last 400 epochs. Best results observed at epoch 158, best model saved as best.pt.
To update EarlyStopping(patience=400) pass a new patience value, i.e. `patience=300` or use `patience=0` to disable EarlyStopping.

558 epochs completed in 9.810 hours.
Optimizer stripped from runs\cynthiaBoxesKM_2025_02_14_Dell3(04172025)_model11l\weights\last.pt, 51.7MB
Optimizer stripped from runs\cynthiaBoxesKM_2025_02_14_Dell3(04172025)_model11l\weights\best.pt, 51.7MB

Validating runs\cynthiaBoxesKM_2025_02_14_Dell3(04172025)_model11l\weights\best.pt...
Ultralytics 8.3.28 🚀 Python-3.8.18 torch-2.1.2+cu121 CUDA:0 (NVIDIA RTX 6000 Ada Generation, 49140MiB)
YOLO11l summary (fused): 464 layers, 25,283,167 parameters, 0 gradients, 86.6 GFLOPs
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 8/8 [00:02<0
                   all         32       1588      0.718      0.749      0.778      0.423
                  1         31        662      0.564      0.696      0.631      0.307
                  2         31        381      0.804      0.745      0.813       0.43
                  3         14        386      0.865      0.925      0.951      0.586
                  4          7         87       0.71       0.77      0.782      0.454
                  5         21         72      0.646      0.608      0.713      0.337
Speed: 2.0ms preprocess, 23.9ms inference, 0.0ms loss, 22.5ms postprocess per image

I am aware that the number of dataloader workers differs because of the number of GPUs used, but I ran a trial with the same number of dataloader workers and the results still differ by 0.1 or more.
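One way to double check the parameters is to diff the two `engine\trainer` lines from the logs above. The following is a minimal sketch, assuming the arguments are printed as `key=value` pairs separated by a comma and a space. The helper names `parse_args` and `diff_args` are hypothetical, and the two strings are abbreviated excerpts of the logged lines, not the full argument lists.

```python
import re


def parse_args(line: str) -> dict:
    """Parse 'key=value, key=value' text from a trainer log line into a dict."""
    # Split only on ", " that is followed by "key=", so a value such as
    # device=0,1 (which contains a comma but no space) stays in one piece.
    pairs = re.split(r", (?=\w+=)", line.strip())
    return dict(pair.split("=", 1) for pair in pairs)


def diff_args(a: dict, b: dict) -> dict:
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Abbreviated excerpts of the two logged argument lines (not the full lines).
dual = parse_args("batch=2, imgsz=2048, device=0,1, workers=8, seed=0, deterministic=True")
single = parse_args("batch=2, imgsz=2048, device=0, workers=8, seed=0, deterministic=True")

changed = diff_args(dual, single)
print(changed)  # {'device': ('0,1', '0')}
```

On the full lines the run `name` and `save_dir` would also show up as differences. A diff like this only covers the requested arguments. It does not capture values derived at runtime, such as the reported number of dataloader workers.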

Single GPU training typically gets better results. It's not equivalent to dual GPU training because of BatchNorm, and because the augmentations differ due to different random number states.

Also, you're using the same batch size for both training runs. The batch size is split between the GPUs, so in the single GPU setup each step uses a batch of 2, while in the dual GPU setup each GPU gets a batch of 1, which is not the same. This heavily affects BatchNorm. That batch size is also very small, which is probably detrimental for models that use BatchNorm, like YOLO.
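To make the batch split concrete, here is a minimal pure-Python sketch of the statistics BatchNorm would normalize with in each setup. It assumes the statistics are computed per GPU and not synchronized across devices, and it uses two toy single-channel "images" with made-up pixel values. The names `per_device_batch` and `channel_stats` are hypothetical and not part of Ultralytics or PyTorch.

```python
from statistics import fmean, pvariance


def per_device_batch(total_batch: int, num_devices: int) -> int:
    """Images each GPU sees per step when the total batch is split evenly."""
    if total_batch % num_devices != 0:
        raise ValueError("total batch must be divisible by the number of devices")
    return total_batch // num_devices


def channel_stats(images):
    """Mean and variance of one channel, pooled over all pixels of all
    images in the local batch (the quantities BatchNorm uses in training)."""
    pixels = [p for img in images for p in img]
    return fmean(pixels), pvariance(pixels)


# Two toy flattened single-channel images with made-up values.
img_a = [0.0, 1.0, 2.0, 3.0]
img_b = [10.0, 11.0, 12.0, 13.0]

print(per_device_batch(2, 1))  # 2 images per step on the single GPU
print(per_device_batch(2, 2))  # 1 image per step on each of the two GPUs

# Single GPU: both images share one set of statistics.
single_gpu = channel_stats([img_a, img_b])
# Dual GPU: each device normalizes with the statistics of its own image only.
dual_gpu = [channel_stats([img_a]), channel_stats([img_b])]

print(single_gpu)  # (6.5, 26.25)
print(dual_gpu)    # [(1.5, 1.25), (11.5, 1.25)]
```

In this toy case the pooled batch has a variance of 26.25, while each single-image batch has a variance of only 1.25. With one image per device, each image is normalized against itself, so the same `batch=2` setting produces different normalization in the two runs.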


Thank you for the explanation. We are going to work on higher batch sizes in the future. Our images are quite large and were too much for the GPU RAM before this upgrade. And thank you for mentioning BatchNorm; I was unaware that the YOLO architecture uses it.

You're welcome, kmathers! Glad the explanation was helpful. Batch size, particularly how it interacts with Batch Normalization across single vs. multi-GPU setups, can indeed have a noticeable impact on training dynamics and final performance. Exploring larger batch sizes with your new hardware sounds like a great next step. Good luck with your experiments!