Using batch_size in inference doesn't speed up?

Hello!

We’ve trained a YOLOv11_obb model and we want to perform batched inference to speed things up.

We have a PyTorch DataLoader with a batch_size and feed our model batches of images. We found that increasing batch_size doesn’t speed up the overall inference process at all! Can someone explain that to me? It seems counter-intuitive.

Sincerely

Also, the output prints timings in ms, and if we sum them up they simply don’t add up: it takes about 4 seconds to infer 80 images, but adding up all the reported numbers gives roughly 0.5 seconds. Where does the rest of the time go?

Can you post the code you’re using?

Nothing special

import torch
from torch.utils.data import DataLoader
from ultralytics import YOLO

# ImageDataset is our own dataset; each item is (filename, path, image tensor)
dataset = ImageDataset(images_list, transform=transform)
batch_size = 16
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=4)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = YOLO(model_path)  # .pt checkpoint from the last epoch of training

res_dict = {}

with torch.no_grad():
    for batch in dataloader:
        filenames, paths, images = batch
        images = images.to(device)

        # Perform batch prediction
        results = model.predict(images, conf=0.3, imgsz=RESOLUTION, save_txt=False, save=False)

        # Collect results: normalized OBB corner points per image
        for img_file, img_path, result in zip(filenames, paths, results):
            res_obb_bboxes = result.obb.xyxyxyxyn.cpu().tolist()
            res_dict[img_file] = {
                'file_path': img_path,
                'obb_bboxes': res_obb_bboxes
            }

So are images of shape [N, C, H, W]?

How do you measure the time? Did you exclude first inference?

Is the model PyTorch model, or did you export it to a different format?

  1. Yes, images is a batch of [C, H, W] images, i.e. shape [N, C, H, W];
  2. I didn’t exclude the first inference at first, but I also ran an additional experiment with a warm-up (roughly as in the sketch below);
  3. The model is initialized as YOLO(model_path), where model_path is a .pt file from the last epoch of training.
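
The warm-up experiment was just a few throwaway predictions before timing anything (a reconstructed sketch; the dummy input and the number of passes are assumptions, not the exact code):

import torch

# A few dummy batches so CUDA context creation, kernel launches and cuDNN
# autotuning happen before any measured inference.
dummy = torch.rand(batch_size, 3, RESOLUTION, RESOLUTION, device=device)
for _ in range(3):
    model.predict(dummy, imgsz=RESOLUTION, verbose=False)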

How do you measure the time? I tested batched inference and it’s faster per image compared to single inference. So I can’t reproduce this.

What GPU are you using?

I measure time using datetime.datetime.now(), and I can see that it takes the same 5 seconds. The GPU is an NVIDIA A10 with 24 GB.

BTW there is a GitHub issue and I think it’s very relevant to my case.

That shouldn’t be used for performance measurement. You should be using time.perf_counter().
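
For GPU timing, something along these lines is the minimal version (a sketch, not the code from this thread; the torch.cuda.synchronize() calls are needed because CUDA work is launched asynchronously, and dataset/dataloader/model are assumed to be the objects from the code above):

import time
import torch

# Assumes the model has already done at least one warm-up prediction.
torch.cuda.synchronize()          # finish any queued GPU work before starting the clock
start = time.perf_counter()

for filenames, paths, images in dataloader:
    images = images.to(device)
    model.predict(images, conf=0.3, imgsz=RESOLUTION, verbose=False)

torch.cuda.synchronize()          # wait for the last batch to actually finish on the GPU
elapsed = time.perf_counter() - start
print(f"total: {elapsed:.2f} s, per image: {elapsed / len(dataset):.4f} s")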

Whatever, it takes the same 5 seconds whether I use a batch_size of 32 or 1; if it were 4 times faster, I would have noticed. Anyway, I tried several experiments and it looks like I have some bottleneck in my I/O or server hardware.

If you’re including the whole for loop in the measurement, then that’s likely. But that was also the wrong way to determine whether the issue is with batch inference specifically, because only the predict() call performs the batch inference, so that’s what you should have been measuring. And since you already mentioned that the Ultralytics latency logs show a lower latency, it’s even more likely that the issue isn’t with Ultralytics. This is why I wanted to see how you actually measured the time (in code).

I measured model.predict() as well, and also model.model() directly, and I get the same performance every time. I switched NMS off, and it makes no difference to the timing.
When I measure model.model() on a batch of size 1 it takes X ms, on a batch of 8 it takes roughly 8 * X ms, and so on. The time taken increases linearly with batch size.
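
Roughly, the forward-pass measurement looked like this (a reconstructed sketch of what was described, not the exact code; the batch sizes, random input and synchronize calls are assumptions):

import time
import torch

# Time only the underlying nn.Module, bypassing Ultralytics pre-/post-processing and NMS.
net = model.model.to(device).eval()

with torch.no_grad():
    for bs in (1, 8, 16, 32):
        x = torch.rand(bs, 3, RESOLUTION, RESOLUTION, device=device)
        net(x)                                # warm-up pass for this input shape
        torch.cuda.synchronize()
        start = time.perf_counter()
        net(x)
        torch.cuda.synchronize()
        print(f"batch {bs}: {(time.perf_counter() - start) * 1000:.1f} ms")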

Seems like something specific to your hardware. I tried reproducing it and even with the code in the issue you linked, I can’t reproduce it:

In [4]: import time
   ...: from ultralytics import YOLO
   ...: import numpy as np
   ...: 
   ...: model = YOLO("yolo11n.pt")
   ...: images = [np.random.randint(0, 256, (640, 640, 3), dtype=np.uint8) for _ in range(2 ** 10)]
   ...: device = "mps"
   ...: 
   ...: # time how long it takes to run the model on the images in batches
   ...: for batch_size in [1, 16, 32, 64, 128]:
   ...:     start_time = time.perf_counter()
   ...:     for i in range(0, len(images), batch_size):
   ...:         batch = images[i:i + batch_size]
   ...:         model.predict(batch, verbose=False, device=device, rect=True)
   ...:     end_time = time.perf_counter()
   ...:     print(f"Batch size: {batch_size}, Time: {end_time - start_time}")
   ...: 
Batch size: 1, Time: 7.425320082998951
Batch size: 16, Time: 4.274201041989727
Batch size: 32, Time: 3.95698041698779
Batch size: 64, Time: 4.685385083997971
Batch size: 128, Time: 4.372804916987661

Batched is faster than single inference. At some point, it saturates and doesn’t get any faster, which is normal.

I had to set device = torch.device('cuda'), and my output goes like this:

Batch size: 1, Time: 25.83381841983646
Batch size: 16, Time: 26.786037494428456
Batch size: 32, Time: 27.410270627588034
Batch size: 64, Time: 28.64176694303751

The per-batch inference time shouldn’t change much once the model is warmed up, and that is what your results show: the time required to process one image is approximately the same as for 16 images. That is an increase in throughput, meaning that in the same time you can process multiple images instead of a single one. The inference time per call might not change much, but in the end you have 16 images with detections instead of only 1 in (roughly) the same timespan. You can see the same behavior in the results @Toxite shared as well.
Crudely speaking, the throughput when batching can be thought of as N / t, where N is the batch size and t is the inference time per batch. That’s 1 / t for batch=1 and 16 / t for batch=16, so in your case:

1 image / 25 ms for batch=1
16 images / 25 ms for batch=16

That means there is a throughput of 1 / 25 = 0.04 images/ms for batch=1 and 16 / 25 = 0.64 images/ms for batch=16. If you can run larger batches without adding significantly more latency per call, you improve your throughput, effectively reducing the per-image inference time. Looking at a clock, a single call doesn’t seem any faster, but the total time is: to process 128 images with batch=1 you need 128 / 1 * t = 128 * 25 = 3200 ms, while with batch=16 you only need 128 / 16 * t = 8 * 26 = 208 ms. From your results, using batch=64 seems perfectly reasonable, and it would be faster than processing one image at a time.
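
As a back-of-the-envelope check with the figures quoted in this thread (the per-call latencies below are just the ~25-26 ms from the logs, not a new measurement):

import math

# Per-call latency in ms as quoted above (assumed figures, not re-measured).
latency_ms = {1: 25, 16: 26}

n_images = 128
for batch, t in latency_ms.items():
    calls = math.ceil(n_images / batch)
    total = calls * t
    print(f"batch={batch:>2}: {calls} calls x {t} ms = {total} ms "
          f"-> {n_images / total:.2f} images/ms")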

Hi! Thanks for the answer, but my output is the time required to process all 1024 images, so I don’t see the point of using batches at all, since it takes the same time, right?
And 25, 26, 27 are seconds to process the entire input of 1024 images, not ms per batch.

Is this using the same code I posted? Because 25 seconds for processing 1024 images on a GPU is very long. Even CPUs are faster than that.

Yes, this is exactly your code. Well, that’s how it is: model.predict() reports about 25 ms per image, but in practice it takes much longer. Is there something I can do to speed it up?

Then there’s definitely something wrong with your GPU or server, because that’s far too long for a GPU, especially an NVIDIA A10. I ran the same code on Google Colab with an NVIDIA T4, which is slower than an A10, and it takes less than half the time it’s taking for you. And batched inference is twice as fast there too.

Can you post the output of running yolo checks in a terminal?
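
For reference, the same environment check can also be run from Python; ultralytics.checks() prints the Ultralytics, Python and torch versions along with the CUDA device and other setup info, like the CLI command does:

import ultralytics

# Prints the same environment summary as `yolo checks` in the terminal.
ultralytics.checks()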