Using batch_size in inference doesn't speed it up?

You can try upgrading to the latest Ultralytics and PyTorch.

I upgraded, and I still get the same 25 seconds.

What’s the GPU utilization in nvidia-smi when running inference?

0.8 GB for batch of 1
6 GB for batch of 16
12 GB for batch of 32
20 GB for batch of 64
CUDA out of memory for batch of 128 :smiley:

I mean GPU utilization. The percentage value shown by nvidia-smi.
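For reference, both values can be polled directly without watching the full table (assuming a reasonably recent NVIDIA driver; the query field names below are standard `nvidia-smi` options):

```shell
# Poll GPU utilization and memory use once per second while inference runs
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1
```

This prints one CSV row per second, so you can watch utilization specifically during the inference run.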

That is exactly what I wrote. I run nvidia-smi and observe my process taking up some portion of GPU memory.
If you want percentages, they are:

0.8/24 ≈ 3.3% for batch 1
6/24 = 25% for batch 16
12/24 = 50% for batch 32
20/24 ≈ 83.3% for batch 64

and CUDA out of memory for batch 128, which means the 24 GB of my A10 is not enough to hold a batch of 128 images of shape [640, 640, 3].

That’s memory usage. I am asking about GPU utilization (the value in the “GPU-Util” column), which is a different thing. I want to know the GPU utilization while inference is running, to see whether your GPU is being fully utilized.

88% for the batch of 1
100% for the batch of 16 and onwards.

Looks like my GPU is fully utilized.

What size of model are you using? And what imgsz?

I’m using your code, the model is yolo11n.pt, and image size is 640.
Everything is from your code.

This is my nvidia-smi output, maybe this will help a bit.

Looks like your GPU fan isn’t working and GPU is overheating.

This is a rented server, I don’t have any physical access to its hardware.

You should probably switch to a different server, because this one is defective.

I think the A10 has a passive cooling system, so 80°C is probably somewhat expected under load.

OK, thanks. I still have no idea why the code is running 2-3 times slower than expected.

It’s very strange. The only thing I can think of is that the GPU is connected over a very slow PCIe link (a low generation or lane count). That would make transfers from CPU to GPU much slower. I’ve never tested what inference looks like in this situation, but that’s my best guess. Given it’s a rented server, it’s likely not possible to fix this yourself, but you might want to contact your provider to help troubleshoot the issue. You could also try running the test script on Google Colab (you’d need to switch to a T4 instance) to do a direct comparison like Toxite showed above.
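One way to check the PCIe guess, assuming your driver exposes the standard PCIe query fields in `nvidia-smi`:

```shell
# Compare the current PCIe link to the maximum the card supports;
# a current generation/width far below the max means slow CPU-to-GPU copies
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv
```

Note that some cards downclock the link when idle, so run this while inference is active.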

Hello everyone.
I’ve been researching my problem. As I mentioned in the first post, I’m using an OBB model. From what I can see, the OBB model doesn’t gain any advantage from batched input, because my processing time is the same across different batch sizes.
Can anyone tell me whether there are any tricks to make the YOLO OBB model benefit from batching?

You can export the model to TensorRT for faster inference.
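A minimal sketch using the Ultralytics CLI, assuming TensorRT is installed on your server (the model name and batch value are illustrative; exact export arguments may vary by Ultralytics version):

```shell
# Export to a TensorRT engine with FP16 and a fixed batch size of 16
yolo export model=yolo11n-obb.pt format=engine half=True batch=16

# Then run inference with the exported engine
yolo predict model=yolo11n-obb.engine source=path/to/images imgsz=640
```

The engine is built for the GPU it is exported on, so run the export on the rented A10 itself rather than copying an engine from another machine.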