When I train a model with YOLO11 on 2x A100 80GB (instance type 2A100.44V: 44 CPUs, 240 GB RAM, 160 GB GPU VRAM in total) with the following parameters: !yolo task=detect mode=train epochs=100 batch=64 plots=True model='runs/detect/train3/weights/last.pt' resume=True data=data.yaml imgsz=640 patience=50 device=0,1 workers=22 rect=True fraction=0.1 cache=True, I see the GPU utilization shown in Screenshot 1.
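For reference, the equivalent call through the Python API should look roughly like this (a sketch with the same arguments as the CLI command above):

```python
from ultralytics import YOLO

# Same settings as the CLI command above, resuming from the last checkpoint
model = YOLO("runs/detect/train3/weights/last.pt")
model.train(
    data="data.yaml",
    epochs=100,
    batch=64,
    imgsz=640,
    patience=50,
    device=[0, 1],   # both A100s
    workers=22,
    rect=True,
    fraction=0.1,    # train on 10% of the dataset
    cache=True,
    plots=True,
    resume=True,
)
```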
Is it normal or even good to have such a high "Mem Free", or can it be optimized? Should the settings be changed to make better use of the memory? Would weaker (cheaper) or stronger (more expensive) hardware be a better fit?
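(To put numbers on "Mem Free" outside the monitoring dashboard, free and total memory per GPU can also be read directly with PyTorch; a minimal sketch:)

```python
import torch

# Report free vs. total memory for each visible GPU (values in GiB)
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1024**3:.1f} GiB free / {total / 1024**3:.1f} GiB total")
```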
I tried it, and it returned an auto batch size of 2386 (Screenshot). So I ran yolo task=detect mode=train epochs=100 batch=2386 …, but training stopped without an error.
I played around with batch sizes (always a multiple of 64, e.g. 448, 512, 576, up to 960), but the result was always the same: training stopped before the first epoch started. Sometimes it completed the first epoch but stopped afterwards.
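One way to narrow down whether these silent stops are memory-related would be to run short rounds through the Python API and record peak GPU memory per batch size; a rough sketch (single GPU so the memory counters stay in one process, same checkpoint and data.yaml as above):

```python
import torch
from ultralytics import YOLO

# Try a few candidate batch sizes for one epoch each and report peak GPU memory
for bs in (448, 512, 576, 640):
    torch.cuda.reset_peak_memory_stats(0)
    try:
        model = YOLO("runs/detect/train3/weights/last.pt")
        model.train(data="data.yaml", imgsz=640, epochs=1, batch=bs,
                    device=0, workers=22, rect=True, fraction=0.1)
    except torch.cuda.OutOfMemoryError:
        print(f"batch={bs}: out of memory")
    peak = torch.cuda.max_memory_allocated(0) / 1024**3
    print(f"batch={bs}: peak GPU memory {peak:.1f} GiB")
```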
My working setup is now a single GPU (the same as in the original question) with batch size 448.
(Two of the same GPU with the same or a slightly higher batch size did not make an epoch noticeably faster, so I stuck with a single GPU.)
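For completeness, the working run is the command from the original question with only the device and batch size changed, i.e. roughly: !yolo task=detect mode=train epochs=100 batch=448 plots=True model='runs/detect/train3/weights/last.pt' resume=True data=data.yaml imgsz=640 patience=50 device=0 workers=22 rect=True fraction=0.1 cache=True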
If there are any other setups/configurations that utilize the GPU(s) as much as possible, it would be great to know about them.