We’ve trained a YOLOv11_obb model and we want to perform batched inference to speed things up.
We have a PyTorch DataLoader with a given batch_size and feed our model batches of images, but we found that increasing batch_size doesn’t speed up the overall inference process at all! Can someone explain that? It seems counter-intuitive.
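Roughly, our setup looks like the sketch below (the dataset class, image folder, and weights path are simplified placeholders, not our exact code):

```python
import time
from pathlib import Path

import cv2
from torch.utils.data import DataLoader, Dataset
from ultralytics import YOLO

WEIGHTS = "yolo11n-obb.pt"   # placeholder: path to the trained OBB weights
IMG_DIR = "images/"          # placeholder: folder with the images to run on
BATCH_SIZE = 32


class ImageFolderDataset(Dataset):
    """Loads images from a folder as BGR numpy arrays."""

    def __init__(self, root):
        self.paths = sorted(Path(root).glob("*.jpg"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return cv2.imread(str(self.paths[idx]))


def collate_keep_list(batch):
    # predict() accepts a list of numpy images, so keep the batch as a plain list.
    return batch


model = YOLO(WEIGHTS)
loader = DataLoader(ImageFolderDataset(IMG_DIR), batch_size=BATCH_SIZE,
                    collate_fn=collate_keep_list, num_workers=4)

start = time.perf_counter()
for batch in loader:
    results = model.predict(batch, verbose=False)  # batched inference on the whole list
print(f"total wall time: {time.perf_counter() - start:.2f} s")
```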
Also, the output prints timings in ms, and if we sum them up they simply don’t add up: it takes about 4 seconds to infer 80 images, but adding all the printed numbers gives roughly 0.5 seconds. Where does the rest of the time go?
Either way, it takes the same 5 seconds whether I use a batch_size of 32 or 1; if it were 4 times faster, I would have noticed. Anyway, I ran several experiments, and it looks like I have a bottleneck somewhere in my I/O or server hardware.
If you’re timing the whole for loop, then that’s likely the explanation. But that’s also the wrong way to determine whether the issue is with batch inference specifically, because only the predict() call performs the batched inference, so that’s what you should be measuring. And since you already mentioned that the Ultralytics latency logs report a lower latency, it’s even more likely that the issue isn’t with Ultralytics. This is why I wanted to see exactly how you measured the time (in code).
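Something along these lines separates the two measurements; the weights and the stand-in loader are placeholders, only the timer placement matters:

```python
import time

import numpy as np
from ultralytics import YOLO

model = YOLO("yolo11n-obb.pt")  # placeholder weights; use your trained model here

# Stand-in for your DataLoader: any iterable yielding batches (lists) of images works.
loader = [[np.zeros((640, 640, 3), dtype=np.uint8) for _ in range(16)] for _ in range(8)]

loop_start = time.perf_counter()
predict_seconds = 0.0

for batch in loader:                        # data loading / decoding / collation happens here
    t0 = time.perf_counter()
    model.predict(batch, verbose=False)     # only this call performs the batched inference
    predict_seconds += time.perf_counter() - t0

total_seconds = time.perf_counter() - loop_start
print(f"predict() only : {predict_seconds:.2f} s")
print(f"whole for loop : {total_seconds:.2f} s")   # includes I/O and preprocessing overhead
```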
I measured model.predict() as well, and also model.model() directly; I get the same performance every time. I switched NMS off, and it makes no difference to the timing.
When I measure model.model() on a batch of size 1, it takes X ms; on a batch of 8 it takes X * 8, and so on. The time increases linearly with batch size.
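For reference, this is roughly how I time the raw forward pass (image size and iteration count are arbitrary); torch.cuda.synchronize() is there because CUDA kernels launch asynchronously:

```python
import time

import torch
from ultralytics import YOLO

device = "cuda"
net = YOLO("yolo11n-obb.pt").model.to(device).eval()  # raw torch module inside the YOLO wrapper


def time_forward(batch_size, iters=20, imgsz=640):
    x = torch.rand(batch_size, 3, imgsz, imgsz, device=device)
    with torch.inference_mode():
        net(x)                              # warmup pass
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            net(x)
        torch.cuda.synchronize()            # wait for all queued kernels before stopping the clock
    return (time.perf_counter() - t0) / iters * 1000  # ms per batch


for bs in (1, 8, 16):
    print(f"batch={bs:<2d}: {time_forward(bs):.1f} ms/batch")
```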
Once warmed up, the per-batch inference time shouldn’t change much when using batches, and that’s what your results show: the time required to process a batch of 16 images is approximately the same as for one image. That is an increase in throughput, meaning that in the same amount of time you process multiple images instead of a single one. The wall-clock time of the forward pass might not change much, but in the end you have 16 images with detections instead of only 1 in (roughly) the same timespan. You can see the same behavior in the results @Toxite shared as well.
Crudely speaking, the throughput when batching can be thought of as N / t, where N is the batch size and t is the inference time per batch. For batch=1 that’s 1 image / t, and for batch=16 it’s 16 images / t, so in your case:
1 image / 25 ms for batch=1
16 images / 25 ms for batch=16
That means a throughput of 1 / 25 = 0.04 images/ms for batch=1 and 16 / 25 = 0.64 images/ms for batch=16. If you can run a larger batch size without adding significantly more latency per forward pass, you improve your throughput (effectively reducing the time spent per image). Looking at a clock for a single forward pass it doesn’t seem faster, but over a full dataset it is: to process 128 images with batch=1 you need (128 / 1) * 25 = 3200 ms, whereas with batch=16 you only need (128 / 16) * 26 = 208 ms. From your results, using batch=64 seems perfectly reasonable, and it would be faster than processing one image at a time.
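The same back-of-the-envelope math in code, using the per-batch times quoted above:

```python
# Throughput comparison from the numbers above (25 ms/batch at batch=1, 26 ms/batch at batch=16).
t_batch1_ms = 25
t_batch16_ms = 26
n_images = 128

total_b1 = (n_images / 1) * t_batch1_ms     # 128 batches * 25 ms = 3200 ms
total_b16 = (n_images / 16) * t_batch16_ms  # 8 batches * 26 ms = 208 ms

print(f"batch=1 : {total_b1:.0f} ms for {n_images} images ({1 / t_batch1_ms:.3f} img/ms)")
print(f"batch=16: {total_b16:.0f} ms for {n_images} images ({16 / t_batch16_ms:.3f} img/ms)")
```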
Hi! Thanks for the answer. My output is the time required to process 1024 images, so I don’t see any reason to use batches at all, since it’s the same total time, right?
And the 25, 26, 27 values are seconds to process the entire input of 1024 images, not milliseconds per batch.
Yes, that’s exactly the code I’m running. Well, that’s how it is: model.predict() reports about 25 ms per image, but in practice it takes much longer. Is there something I can do to speed it up?
Then something is definitely wrong with your GPU or server, because that’s far too long for a GPU, especially an NVIDIA A10. I ran the same code on Google Colab with an NVIDIA T4, which is slower than an A10, and it takes less than half the time it’s taking for you. Batched inference is twice as fast there, too.
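If you want to rule out your environment, a self-contained benchmark along these lines (the weights path and image size are assumptions, and random images keep disk I/O out of the picture) should reproduce the comparison on any machine:

```python
import time

import numpy as np
from ultralytics import YOLO

WEIGHTS = "yolo11n-obb.pt"   # assumption: swap in your trained weights
N_IMAGES = 1024
IMGSZ = 640

# Random images so the benchmark is independent of disk and decoding speed.
images = [np.random.randint(0, 255, (IMGSZ, IMGSZ, 3), dtype=np.uint8) for _ in range(N_IMAGES)]

model = YOLO(WEIGHTS)
model.predict(images[:1], verbose=False)   # warmup

for bs in (1, 16, 64):
    t0 = time.perf_counter()
    for i in range(0, N_IMAGES, bs):
        model.predict(images[i:i + bs], verbose=False)
    dt = time.perf_counter() - t0
    print(f"batch={bs:<3d} {dt:.1f} s total, {N_IMAGES / dt:.1f} img/s")
```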