Optimize GPU utilization while training

Hello,

When I train a YOLO11 model on 2x A100 80GB (instance type 2A100.44V: 44 CPUs, 240 GB RAM, 160 GB GPU VRAM) with the following parameters: !yolo task=detect mode=train epochs=100 batch=64 plots=True model='runs/detect/train3/weights/last.pt' resume=True data=data.yaml imgsz=640 patience=50 device=0,1 workers=22 rect=True fraction=0.1 cache=True, I see the GPU utilization below (Screenshot 1).

Screenshot 1 Output:
|Device 0| Mem Free: 74956.75MB / 81920.00MB | gpu-util: 37.0% | gpu-mem: 6.0% |
|Device 1| Mem Free: 80010.75MB / 81920.00MB | gpu-util: 100.0% | gpu-mem: 0.0% |
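
(For context, a roughly equivalent way to watch these numbers directly on the instance, assuming nvidia-smi is available, is:)

# poll both GPUs once per second; the fields mirror the columns in the screenshot
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.free,memory.total --format=csv -l 1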

Is it normal or desirable to have this much "Mem Free"? Can it be optimized? Should the settings be changed to improve utilization? Would weaker (cheaper) or stronger (more expensive) hardware be a better fit?

I would be grateful for your opinions and experience.

Thank you very much

You can start training on a single GPU with batch=0.9. It will calculate the appropriate batch size and print that. Then double that for two GPUs.
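
For example, a sketch reusing the checkpoint and data.yaml paths from your command (values and the doubled batch number are placeholders, adjust to whatever AutoBatch actually prints):

# single-GPU run with a fractional batch: AutoBatch picks a batch size that fills ~90% of GPU memory and prints it
yolo task=detect mode=train model='runs/detect/train3/weights/last.pt' data=data.yaml imgsz=640 epochs=100 device=0 batch=0.9

# then relaunch on both GPUs with roughly double the printed value (640 here is only a placeholder)
yolo task=detect mode=train model='runs/detect/train3/weights/last.pt' data=data.yaml imgsz=640 epochs=100 device=0,1 batch=640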

Hi @Toxite

Thank you for your reply.

I tried it and it returned an auto batch size of 2386 (Screenshot). So I ran yolo task=detect mode=train epochs=100 batch=2386 …, but training stopped without an error.

I played around with batch sizes (always a multiple of 64: 448, 512, 576, up to 960), but the result was always the same: training stopped before the first epoch started. Sometimes it completed the first epoch but stopped afterwards.

My working setup is now a single GPU (same as in the original question) with a batch size of 448.
(Using 2x the same GPU with the same or a slightly higher batch size did not make an epoch noticeably faster, so I stuck with a single GPU.)

If there are any other setups/configurations to utilize the GPU(s) as much as possible, it would be great to know about them.

Thank you all

How many images do you have?

Around 220000 images

I would recommend trying the Docker container if you haven’t yet.
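
In case it helps, a minimal sketch of pulling and running it (assuming the standard ultralytics/ultralytics image on Docker Hub and the NVIDIA Container Toolkit installed on the host; the dataset path is a placeholder):

# pull the latest Ultralytics image
docker pull ultralytics/ultralytics:latest

# run it with access to both GPUs; --ipc=host gives the dataloader workers enough shared memory
docker run -it --ipc=host --gpus all -v /path/to/datasets:/datasets ultralytics/ultralytics:latest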

Additionally, there's a hard limit of 1024 on the upper bound of the batch size.