Normal, then slow, then crashing training

Hi,

I’m trying to train a YOLO11 model and I’m getting a strange result. It’s not the first one I’ve trained: I’ve successfully trained 5 models with different datasets, all of them working fine. But this last one is behaving strangely.

The first epoch trains well, in the expected time, all good.
The second epoch trains close to 9 times slower, with GPU usage close to 0%, and when the validation step starts, the process is killed (the “Processus arrêté” line in the log below, French for “Process killed”).

The machine has 3 TB of free disk, 32 cores, and 64 GB of RAM. The GPU is an NVIDIA RTX 5090.

Here is the output of the training steps:

Image sizes 640 train, 640 val
Using 32 dataloader workers
Logging results to runs/classify/train7
Starting training for 200 epochs...

      Epoch    GPU_mem       loss  Instances       Size
      1/200      17.3G     0.1113         16        640: 100%|██████████| 41515/41515 [2:16:17<00:00,  5.08it/s]   
               classes   top1_acc   top5_acc: 100%|██████████| 1355/1355 [1:04:52<00:00,  2.87s/it]
                   all      0.999          1

      Epoch    GPU_mem       loss  Instances       Size
      2/200      17.3G    0.02184         16        640: 100%|██████████| 41515/41515 [18:26:36<00:00,  1.60s/it]   
               classes   top1_acc   top5_acc:   5%|▍         | 64/1355 [00:08<02:33,  8.42it/s]Processus arrêté

I tried twice and got the same result twice.

The training code is VERY simple:

import os
import torch
import torch.distributed as dist
from ultralytics import YOLO
import ultralytics

def main():
    CLASSES = 'classification_border'

    # Check if the process group is already initialized
    if torch.cuda.device_count() > 1 and not dist.is_initialized():
        dist.init_process_group(backend='nccl')

    ultralytics.checks()
    model = YOLO('yolo11x-cls.pt').to(device='cuda') 

    results = model.train(
        data=f'/home/user/data/{CLASSES}/',
        device='cuda',
        epochs=200,
        batch=32,
        workers=32,
        patience=12,
        imgsz=640
    )

if __name__ == '__main__':
    main()

There is not much displayed in the logs, so I’m not really sure where to start looking. I’m looking for guidance on how to debug this.

Thanks,

JM

Does your system RAM get filled? IIRC there was a bug (I’m honestly not certain whether it has been patched) where classification training would fill up system memory. Have you tried with workers=1?
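
If it helps, here is a minimal sketch of what I mean, reusing the arguments from your script above (paths and values are just copied from your post, so adjust as needed):

from ultralytics import YOLO

# Same training call as in the original script, but with a single dataloader
# worker to test whether the workers are what is filling system RAM.
model = YOLO('yolo11x-cls.pt')
model.train(
    data='/home/user/data/classification_border/',  # path from the original post
    device='cuda',
    epochs=200,
    batch=32,     # unchanged for now; reduce this too if RAM keeps filling up
    workers=1,    # single worker to test the memory hypothesis
    patience=12,
    imgsz=640,
)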

Memory while processing the first epoch shows 25 GB used out of 61 GB available, so it should be fine. GPU usage is at 98%, that’s good. Let’s wait 2h and see how the 2nd epoch is doing. I reduced both batch and workers by half…
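
For reference, a rough sketch of how the RAM check above could be scripted instead of watching a system monitor; psutil is an assumed extra dependency, not part of the training setup:

import time
import psutil  # assumed extra dependency, only used for monitoring

# Poll system RAM once a minute while training runs in another terminal.
# GPU memory and utilisation can be watched separately with nvidia-smi.
while True:
    ram = psutil.virtual_memory()
    print(f"RAM: {ram.used / 1e9:.1f} GB used of {ram.total / 1e9:.1f} GB ({ram.percent:.0f}%)")
    time.sleep(60)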

Seems that it was indeed a memory issue!

Not sure on which side, GPU or system RAM, but it now works, with a very similar epoch time.

Thanks for putting me in the right direction!

JMS

      Epoch    GPU_mem       loss  Instances       Size
      1/200      10.5G    0.09201         16        640: 100%|██████████| 83029/83029 [2:16:41<00:00, 10.12it/s]  
               classes   top1_acc   top5_acc: 100%|██████████| 2710/2710 [04:02<00:00, 11.18it/s]
                   all      0.999          1

      Epoch    GPU_mem       loss  Instances       Size
      2/200      12.1G    0.03115         16        640: 100%|██████████| 83029/83029 [2:13:19<00:00, 10.38it/s]  
               classes   top1_acc   top5_acc: 100%|██████████| 2710/2710 [04:03<00:00, 11.13it/s]
                   all      0.998          1


Hi Jean-Marc,

Glad to hear you’ve pinpointed the memory issue and got your training running smoothly! That’s excellent troubleshooting. The behavior you described—a significant slowdown after the first epoch—is a classic sign of the system running out of RAM or VRAM and resorting to slower swap memory.

For future training runs, you can let Ultralytics YOLO automatically determine the optimal batch size for your hardware. To do this, simply set batch=-1. This will find the largest batch size that fits into your GPU memory with a small margin, which can save you the time of finding it manually.
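
In practice it looks like this (a minimal sketch reusing the arguments from your script; everything except batch stays the same):

from ultralytics import YOLO

model = YOLO('yolo11x-cls.pt')
model.train(
    data='/home/user/data/classification_border/',  # path from your script
    device='cuda',
    epochs=200,
    imgsz=640,
    batch=-1,  # AutoBatch: find the largest batch size that fits in GPU memory
    # keep your other arguments (workers, patience, ...) as before
)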

You can learn more about this and other training settings in our Model Training documentation.

Happy training

Hi Laura,

Thanks for the recommendation. I already tried this in the past and it didn’t work. I tried it again and it’s still failing; that’s why I have to look for the optimal batch size manually.

Best,

JMS

AutoBatch: Computing optimal batch size for imgsz=640 at 60.0% CUDA memory utilization.
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 5090) 31.36G total, 0.38G reserved, 0.37G allocated, 30.60G free
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
    28363750         111         1.363         12.28         70.41        (1, 3, 640, 640)                  (1, 6)
    28363750         222         2.439         6.363         38.09        (2, 3, 640, 640)                  (2, 6)
    28363750       443.9         3.932         8.207         39.92        (4, 3, 640, 640)                  (4, 6)
    28363750       887.9         6.239         15.18         44.23        (8, 3, 640, 640)                  (8, 6)
    28363750        1776        11.482         29.22         68.22       (16, 3, 640, 640)                 (16, 6)
    28363750        3552        12.507         61.02         141.7       (32, 3, 640, 640)                 (32, 6)
CUDA out of memory. Tried to allocate 1.17 GiB. GPU 0 has a total capacity of 31.36 GiB of which 428.31 MiB is free. Process 1860345 has 2.18 GiB memory in use. Process 1860344 has 2.18 GiB memory in use. Process 1860343 has 2.18 GiB memory in use. Process 1860346 has 2.18 GiB memory in use. Process 2273508 has 2.18 GiB memory in use. Process 2381534 has 2.18 GiB memory in use. Including non-PyTorch memory, this process has 17.82 GiB memory in use. Of the allocated memory 16.86 GiB is allocated by PyTorch, and 295.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
AutoBatch: Using batch-size 42 for CUDA:0 18.60G/31.36G (59%) ✅

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 394.00 MiB. GPU 0 has a total capacity of 31.36 GiB of which 2.31 MiB is free. Process 1860345 has 2.18 GiB memory in use. Process 1860344 has 2.18 GiB memory in use. Process 1860343 has 2.18 GiB memory in use. Process 1860346 has 2.18 GiB memory in use. Process 2273508 has 2.18 GiB memory in use. Process 2381534 has 2.18 GiB memory in use. Including non-PyTorch memory, this process has 18.23 GiB memory in use. Of the allocated memory 17.43 GiB is allocated by PyTorch, and 143.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
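
As a side note, the allocator hint from the traceback can be tried by setting the environment variable before CUDA is initialised; a minimal sketch, assuming it is placed at the very top of the training script:

import os

# Allocator hint suggested by the traceback above. It must be set before
# CUDA is initialised, so keep it above any torch / ultralytics imports.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

from ultralytics import YOLO  # imported only after the environment variable is set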

Glad that helped! The AutoBatch calculation is helpful but generally needs a bit of adjustment. FWIW, I usually do a short training session, 3-5 epochs, with batch=-1 to get a rough idea of how much can fit in GPU memory, and make adjustments from there. Not always a bulletproof plan, as you can still get an out-of-memory error at epoch 97/300, but the nice thing is you can always resume training with a lower batch value!
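
If that happens, resuming is straightforward; a minimal sketch, with the checkpoint path taken from the run directory shown in the logs above (adjust to your own run):

from ultralytics import YOLO

# Resume an interrupted run from its last saved checkpoint.
# 'runs/classify/train7' is the example run directory from the logs above.
model = YOLO('runs/classify/train7/weights/last.pt')
model.train(resume=True)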