GPU memory leak

NVIDIA-SMI 550.120 Driver Version: 550.120 CUDA Version: 12.4
NVIDIA GeForce RTX 4060
ultralytics==8.3.107
ultralytics-thop==2.0.14
Ubuntu 24.04
torch==2.6.0

Hello. Please help me.
Yesterday I updated my Ultralytics library, and after several epochs of training the GPU process stopped with an out-of-memory error.
The automatic batch-size calculation returns a reasonable value, and everything was fine last week.
Changing the batch size does not help: epoch after epoch, GPU memory usage increases.
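For illustration only, here is a minimal sketch (not my actual script) of how the per-epoch growth can be confirmed with an Ultralytics callback and torch.cuda.memory_reserved():

import torch
from ultralytics import YOLO

def log_gpu_memory(trainer):
    # Reserved CUDA memory; it grows epoch after epoch when the leak is present.
    reserved_gb = torch.cuda.memory_reserved() / 1024 ** 3
    print(f'epoch {trainer.epoch + 1}: {reserved_gb:.2f} GiB reserved')

model = YOLO('yolov8s.pt')
model.add_callback('on_train_epoch_end', log_gpu_memory)
model.train(data='datasets/data.yaml', epochs=10, batch=4, imgsz=1280)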

Can you post the training command?

import torch
from ultralytics import YOLO


def train_model(
    data: str = 'datasets/data.yaml',
    epochs: int = 300,
    batch: int = 4,
    imgsz: int = 1280,
    patience: int = 100,
    resume: bool = False,
    counter: int = 1,
):
    # Resume from the last checkpoint of the matching run, otherwise start from pretrained weights.
    if resume:
        if counter == 1:
            model = YOLO('runs/detect/train/weights/last.pt')
        else:
            model = YOLO(f'runs/detect/train{counter}/weights/last.pt')
    else:
        model = YOLO('yolov8s.pt')

    try:
        results = model.train(
            data=data,
            epochs=epochs,
            batch=batch,
            imgsz=imgsz,
            patience=patience,
            resume=resume,
        )
    except torch.OutOfMemoryError:
        raise
        # torch.cuda.empty_cache()
        # train_model(
        #     data=data,
        #     epochs=epochs,
        #     batch=batch,
        #     imgsz=imgsz,
        #     patience=patience,
        #     resume=True,
        #     counter=counter,
        # )

    # Append the class-0 results to a text file named after the run settings.
    file_name = f'results_{epochs}_epochs_{imgsz}_imgsz.txt'
    with open(file_name, 'a+') as f:
        f.write(f'{batch}. {str(results.class_result(0))}\n')


train_model(
    data='datasets/earrings_data.yaml',
    epochs=100,
    batch=16,
    imgsz=96 * 7,
    patience=0,
    resume=False,
    counter=1,
)

Simple code, nothing complicated.

What was the version of Ultralytics you used that didn’t have the issue?

Sorry. I don’t know. I would like to know this myself.

I installed it around February 2025.

Training log before the error:
Starting training for 100 epochs…

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  1/100      3.78G      1.008      2.661      1.336         57        672: 100%|██████████| 13/13 [00:03<00:00,  3.82it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  4.06it/s]
               all        206        400       0.94      0.946      0.979      0.911

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  2/100      3.89G     0.5729     0.7992      1.028         63        672: 100%|██████████| 13/13 [00:03<00:00,  4.01it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.06it/s]
               all        206        400      0.806       0.75       0.84      0.625

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  3/100      3.92G     0.6352     0.7481      1.048         39        672: 100%|██████████| 13/13 [00:03<00:00,  4.03it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.20it/s]
               all        206        400      0.216       0.39      0.165     0.0441

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  4/100      4.02G     0.7442     0.7684      1.109         53        672: 100%|██████████| 13/13 [00:03<00:00,  4.04it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.24it/s]
               all        206        400      0.405      0.422      0.387      0.174

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  5/100      4.09G     0.7166     0.7191      1.065         57        672: 100%|██████████| 13/13 [00:03<00:00,  4.05it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.19it/s]
               all        206        400       0.47      0.863      0.662      0.415

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  6/100      4.15G     0.6953     0.6695      1.078         43        672: 100%|██████████| 13/13 [00:03<00:00,  4.04it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.18it/s]
               all        206        400      0.495      0.647      0.498      0.262

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  7/100      4.21G     0.7525     0.6823      1.071         63        672: 100%|██████████| 13/13 [00:03<00:00,  4.06it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  4.98it/s]
               all        206        400      0.565       0.64      0.493      0.231

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  8/100      4.28G     0.6851     0.5854      1.048         58        672: 100%|██████████| 13/13 [00:03<00:00,  4.06it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.22it/s]
               all        206        400     0.0775      0.135     0.0471     0.0121

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  9/100      4.33G     0.6763     0.5711      1.059         54        672: 100%|██████████| 13/13 [00:03<00:00,  4.06it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.27it/s]
               all        206        400     0.0838      0.155     0.0277    0.00727

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
 10/100      4.41G      0.666     0.5589      1.045         67        672: 100%|██████████| 13/13 [00:03<00:00,  4.06it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.21it/s]
               all        206        400      0.203      0.318      0.197     0.0654

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
 11/100      4.46G     0.6612       0.55       1.04         41        672: 100%|██████████| 13/13 [00:03<00:00,  4.06it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.30it/s]
               all        206        400     0.0164      0.198     0.0115    0.00232

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
 12/100      4.52G     0.6715     0.5809      1.041         60        672: 100%|██████████| 13/13 [00:03<00:00,  4.06it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.22it/s]
               all        206        400      0.527       0.48      0.552       0.36

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
 13/100       4.6G     0.6548     0.5535      1.047         64        672: 100%|██████████| 13/13 [00:03<00:00,  4.07it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 7/7 [00:01<00:00,  5.23it/s]
               all        206        400      0.293       0.26      0.194      0.036

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
 14/100      4.63G     0.5743     0.5356      1.016         66        672:   8%|▊         | 1/13 [00:00<00:04,  2.91it/s]

Traceback (most recent call last):

I think it all depends on the sample size now; it didn't before. I tried different cache settings, but that didn't help.
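For context, the cache settings in question are presumably the standard cache argument of train(); a sketch of the variants (placeholder data path):

from ultralytics import YOLO

model = YOLO('yolov8s.pt')
# The three image-caching variants; none of them changed the per-epoch GPU memory growth.
model.train(data='datasets/data.yaml', epochs=100, batch=16, imgsz=672, cache=False)    # no caching (default)
# model.train(data='datasets/data.yaml', epochs=100, batch=16, imgsz=672, cache='disk') # cache images on disk
# model.train(data='datasets/data.yaml', epochs=100, batch=16, imgsz=672, cache=True)   # cache images in RAM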

For now I have installed Ultralytics 8.3.40 and everything works fine. Any ideas?

Can you try 8.3.104?

8.3.104 has the memory leak.

Trying to find the version that introduced the problem.

The problem appears with ultralytics-8.3.87.

Maybe this is the problem:

  • Memory Management: Optimized GPU memory clearing to trigger only when usage exceeds 90%.

I installed the latest version, ultralytics==8.3.108, and changed
if self._get_memory(fraction=True) > 0.9 to if self._get_memory(fraction=True) > 0.5
and it works fine.
As an idea, maybe check whether torch.cuda.set_per_process_memory_fraction(0.6) has been set manually, i.e. something like:
if self._get_memory(fraction=True) > (manually_fraction or 0.9): …
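A rough standalone sketch of that idea (should_clear_gpu_memory and manual_fraction are hypothetical names, not existing Ultralytics code; inside the trainer, the check would wrap the same self._get_memory(fraction=True) call quoted above):

import torch

def should_clear_gpu_memory(used_fraction, manual_fraction=None):
    # Clear once usage exceeds a configurable threshold instead of the hard-coded 0.9.
    threshold = manual_fraction if manual_fraction is not None else 0.9
    return used_fraction > threshold

# Example: cap the process at 60% of VRAM and clear at the same 0.6 threshold.
torch.cuda.set_per_process_memory_fraction(0.6)
if should_clear_gpu_memory(used_fraction=0.65, manual_fraction=0.6):
    torch.cuda.empty_cache()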


Found this solution: OOM (RAM not gpu) using DDP trainning · Issue #19671 · ultralytics/ultralytics · GitHub
Create a custom trainer.
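For anyone else landing here, a rough sketch of how a custom trainer gets wired in (the memory-freeing callback body is only a placeholder illustration, not the actual fix from issue #19671):

import gc

import torch
from ultralytics import YOLO
from ultralytics.models.yolo.detect import DetectionTrainer


def free_memory(trainer):
    # Drop Python garbage and cached CUDA blocks at the end of every epoch.
    gc.collect()
    torch.cuda.empty_cache()


class MemoryFriendlyTrainer(DetectionTrainer):
    # Placeholder custom trainer that registers the extra per-epoch cleanup callback.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.add_callback('on_train_epoch_end', free_memory)


model = YOLO('yolov8s.pt')
model.train(data='datasets/data.yaml', epochs=100, batch=16, imgsz=672,
            trainer=MemoryFriendlyTrainer)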

Thanks for the feedback here! We’ve just opened a new PR to improve this: