GPU memory leak

NVIDIA-SMI 550.120 Driver Version: 550.120 CUDA Version: 12.4
NVIDIA GeForce RTX 4060
ultralytics==8.3.107
ultralytics-thop==2.0.14
Ubuntu 24.04
torch==2.6.0

Hello. Please help me.
Yesterday I updated my Ultralytics library, and after several epochs of training the GPU process stopped with an out-of-memory error.
The automatic batch-size calculation returns a reasonable value, and everything was fine last week.
Changing the batch size does not help: epoch after epoch, GPU memory usage increases.
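For illustration only, here is a minimal sketch (not my actual script) of how the per-epoch growth can be confirmed with an Ultralytics callback and torch.cuda.memory_reserved():

import torch
from ultralytics import YOLO

def log_gpu_memory(trainer):
    # Reserved CUDA memory; it grows epoch after epoch when the leak is present.
    reserved_gb = torch.cuda.memory_reserved() / 1024 ** 3
    print(f'epoch {trainer.epoch + 1}: {reserved_gb:.2f} GiB reserved')

model = YOLO('yolov8s.pt')
model.add_callback('on_train_epoch_end', log_gpu_memory)
model.train(data='datasets/data.yaml', epochs=10, batch=4, imgsz=1280)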

Can you post the training command?

import torch
from ultralytics import YOLO


def train_model(
    data: str = 'datasets/data.yaml',
    epochs: int = 300,
    batch: int = 4,
    imgsz: int = 1280,
    patience: int = 100,
    resume: bool = False,
    counter: int = 1,
):
    # Resume from the last checkpoint of the matching run, otherwise start from pretrained weights.
    if resume:
        if counter == 1:
            model = YOLO('runs/detect/train/weights/last.pt')
        else:
            model = YOLO(f'runs/detect/train{counter}/weights/last.pt')
    else:
        model = YOLO('yolov8s.pt')

    try:
        results = model.train(
            data=data,
            epochs=epochs,
            batch=batch,
            imgsz=imgsz,
            patience=patience,
            resume=resume,
        )
    except torch.OutOfMemoryError:
        raise
        # torch.cuda.empty_cache()
        # train_model(
        #     data=data,
        #     epochs=epochs,
        #     batch=batch,
        #     imgsz=imgsz,
        #     patience=patience,
        #     resume=True,
        #     counter=counter,
        # )

    # Append the class-0 results to a text file named after the run settings.
    file_name = f'results_{epochs}_epochs_{imgsz}_imgsz.txt'
    with open(file_name, 'a+') as f:
        f.write(f'{batch}. {str(results.class_result(0))}\n')


train_model(
    data='datasets/earrings_data.yaml',
    epochs=100,
    batch=16,
    imgsz=96 * 7,
    patience=0,
    resume=False,
    counter=1,
)

Simple code, nothing complicated.

What was the version of Ultralytics you used that didn’t have the issue?

Sorry. I don’t know. I would like to know this myself.

I installed it around February 2025.

Training log before the error:
Starting training for 100 epochs…

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  1/100      3.78G      1.008      2.661      1.336         57        672: 100%|██████████| 13/13 [00:03<00:00,  3.82it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  4.06it/s]
               all        206        400       0.94      0.946      0.979      0.911

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  2/100      3.89G     0.5729     0.7992      1.028         63        672: 100%|██████████| 13/13 [00:03<00:00,  4.01it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.06it/s]
               all        206        400      0.806       0.75       0.84      0.625

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  3/100      3.92G     0.6352     0.7481      1.048         39        672: 100%|██████████| 13/13 [00:03<00:00,  4.03it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.20it/s]
               all        206        400      0.216       0.39      0.165     0.0441

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  4/100      4.02G     0.7442     0.7684      1.109         53        672: 100%|██████████| 13/13 [00:03<00:00,  4.04it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.24it/s]
               all        206        400      0.405      0.422      0.387      0.174

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  5/100      4.09G     0.7166     0.7191      1.065         57        672: 100%|██████████| 13/13 [00:03<00:00,  4.05it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.19it/s]
               all        206        400       0.47      0.863      0.662      0.415

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  6/100      4.15G     0.6953     0.6695      1.078         43        672: 100%|██████████| 13/13 [00:03<00:00,  4.04it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.18it/s]
               all        206        400      0.495      0.647      0.498      0.262

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  7/100      4.21G     0.7525     0.6823      1.071         63        672: 100%|██████████| 13/13 [00:03<00:00,  4.06it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  4.98it/s]
               all        206        400      0.565       0.64      0.493      0.231

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  8/100      4.28G     0.6851     0.5854      1.048         58        672: 100%|██████████| 13/13 [00:03<00:00,  4.06it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.22it/s]
               all        206        400     0.0775      0.135     0.0471     0.0121

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  9/100      4.33G     0.6763     0.5711      1.059         54        672: 100%|██████████| 13/13 [00:03<00:00,  4.06it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.27it/s]
               all        206        400     0.0838      0.155     0.0277    0.00727

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
 10/100      4.41G      0.666     0.5589      1.045         67        672: 100%|██████████| 13/13 [00:03<00:00,  4.06it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.21it/s]
               all        206        400      0.203      0.318      0.197     0.0654

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
 11/100      4.46G     0.6612       0.55       1.04         41        672: 100%|██████████| 13/13 [00:03<00:00,  4.06it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.30it/s]
               all        206        400     0.0164      0.198     0.0115    0.00232

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
 12/100      4.52G     0.6715     0.5809      1.041         60        672: 100%|██████████| 13/13 [00:03<00:00,  4.06it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.22it/s]
               all        206        400      0.527       0.48      0.552       0.36

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
 13/100       4.6G     0.6548     0.5535      1.047         64        672: 100%|██████████| 13/13 [00:03<00:00,  4.07it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.23it/s]
               all        206        400      0.293       0.26      0.194      0.036

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
 14/100      4.63G     0.5743     0.5356      1.016         66        672:   8%|▊         | 1/13 [00:00<00:04,  2.91it/s]

Traceback (most recent call last):

I think it all depends on the sample size now; it didn't before. I tried different cache settings, but that didn't help.
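For context, the cache settings in question are presumably the standard cache argument of train(); a sketch of the variants (placeholder data path):

from ultralytics import YOLO

model = YOLO('yolov8s.pt')
# The three image-caching variants; none of them changed the per-epoch GPU memory growth.
model.train(data='datasets/data.yaml', epochs=100, batch=16, imgsz=672, cache=False)    # no caching (default)
# model.train(data='datasets/data.yaml', epochs=100, batch=16, imgsz=672, cache='disk') # cache images on disk
# model.train(data='datasets/data.yaml', epochs=100, batch=16, imgsz=672, cache=True)   # cache images in RAM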

For now I have installed Ultralytics 8.3.40 and everything works fine. Any ideas?

Can you try 8.3.104?

8.3.104 has the memory leak.

Trying to find the version that introduced the problem.

The problem appears with ultralytics-8.3.87.

Maybe this is the problem:

  • Memory Management: Optimized GPU memory clearing to trigger only when usage exceeds 90%.

I installed the latest version, ultralytics==8.3.108, and changed
if self._get_memory(fraction=True) > 0.9 to if self._get_memory(fraction=True) > 0.5
and it works fine.
As an idea, maybe check whether torch.cuda.set_per_process_memory_fraction(0.6) has been set manually, i.e. something like:
if self._get_memory(fraction=True) > (manually_fraction or 0.9): …
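A rough standalone sketch of that idea (should_clear_gpu_memory and manual_fraction are hypothetical names, not existing Ultralytics code; inside the trainer, the check would wrap the same self._get_memory(fraction=True) call quoted above):

import torch

def should_clear_gpu_memory(used_fraction, manual_fraction=None):
    # Clear once usage exceeds a configurable threshold instead of the hard-coded 0.9.
    threshold = manual_fraction if manual_fraction is not None else 0.9
    return used_fraction > threshold

# Example: cap the process at 60% of VRAM and clear at the same 0.6 threshold.
torch.cuda.set_per_process_memory_fraction(0.6)
if should_clear_gpu_memory(used_fraction=0.65, manual_fraction=0.6):
    torch.cuda.empty_cache()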


Found this solution: OOM (RAM not gpu) using DDP trainning · Issue #19671 · ultralytics/ultralytics · GitHub
Create a custom trainer.
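For anyone else landing here, a rough sketch of how a custom trainer gets wired in (the memory-freeing callback body is only a placeholder illustration, not the actual fix from issue #19671):

import gc

import torch
from ultralytics import YOLO
from ultralytics.models.yolo.detect import DetectionTrainer


def free_memory(trainer):
    # Drop Python garbage and cached CUDA blocks at the end of every epoch.
    gc.collect()
    torch.cuda.empty_cache()


class MemoryFriendlyTrainer(DetectionTrainer):
    # Placeholder custom trainer that registers the extra per-epoch cleanup callback.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.add_callback('on_train_epoch_end', free_memory)


model = YOLO('yolov8s.pt')
model.train(data='datasets/data.yaml', epochs=100, batch=16, imgsz=672,
            trainer=MemoryFriendlyTrainer)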

Thanks for the feedback here! We’ve just opened a new PR to improve this: