Using batch_size in inference doesn't speed up?

Hello!

We’ve trained a YOLOv11_obb model and we want to perform batched inference to speed things up.

We have a PyTorch DataLoader with a batch_size and feed our model batches of images. We found that increasing batch_size doesn’t speed up the overall inference process at all! Can someone explain that to me? It seems counter-intuitive.

Sincerely

Also, the output prints timings in ms, and if we sum them up they simply don’t add up: it takes about 4 seconds to infer 80 images, but adding up all the reported numbers gives roughly 0.5 seconds. Where does the rest of the time go?

Can you post the code you’re using?

Nothing special

import torch
from torch.utils.data import DataLoader
from ultralytics import YOLO

# ImageDataset is our own dataset; each item is (filename, path, image tensor)
dataset = ImageDataset(images_list, transform=transform)
batch_size = 16
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=4)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = YOLO(model_path)  # .pt checkpoint from the last epoch of training

res_dict = {}

with torch.no_grad():
    for batch in dataloader:
        filenames, paths, images = batch
        images = images.to(device)

        # Perform batch prediction
        results = model.predict(images, conf=0.3, imgsz=RESOLUTION, save_txt=False, save=False)

        # Collect results: normalized OBB corner points per image
        for img_file, img_path, result in zip(filenames, paths, results):
            res_obb_bboxes = result.obb.xyxyxyxyn.cpu().tolist()
            res_dict[img_file] = {
                'file_path': img_path,
                'obb_bboxes': res_obb_bboxes
            }

So are images of shape [N, C, H, W]?

How do you measure the time? Did you exclude first inference?

Is the model PyTorch model, or did you export it to a different format?

  1. Yes, images is a batch of [C, H, W] images, i.e. shape [N, C, H, W];
  2. I didn’t exclude the first inference at first, but I also ran an additional experiment with a warm-up (roughly as in the sketch below);
  3. The model is initialized as YOLO(model_path), where model_path is a .pt file from the last epoch of training.
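
The warm-up experiment was just a few throwaway predictions before timing anything (a reconstructed sketch; the dummy input and the number of passes are assumptions, not the exact code):

import torch

# A few dummy batches so CUDA context creation, kernel launches and cuDNN
# autotuning happen before any measured inference.
dummy = torch.rand(batch_size, 3, RESOLUTION, RESOLUTION, device=device)
for _ in range(3):
    model.predict(dummy, imgsz=RESOLUTION, verbose=False)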

How do you measure the time? I tested batched inference and it’s faster per image compared to single inference. So I can’t reproduce this.

What GPU are you using?

I measure time using datetime.datetime.now(), and I can see that it takes the same 5 seconds. The GPU is an NVIDIA A10 with 24 GB.

BTW there is a GitHub issue and I think it’s very relevant to my case.

That shouldn’t be used for performance measurement. You should be using time.perf_counter().
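
For GPU timing, something along these lines is the minimal version (a sketch, not the code from this thread; the torch.cuda.synchronize() calls are needed because CUDA work is launched asynchronously, and dataset/dataloader/model are assumed to be the objects from the code above):

import time
import torch

# Assumes the model has already done at least one warm-up prediction.
torch.cuda.synchronize()          # finish any queued GPU work before starting the clock
start = time.perf_counter()

for filenames, paths, images in dataloader:
    images = images.to(device)
    model.predict(images, conf=0.3, imgsz=RESOLUTION, verbose=False)

torch.cuda.synchronize()          # wait for the last batch to actually finish on the GPU
elapsed = time.perf_counter() - start
print(f"total: {elapsed:.2f} s, per image: {elapsed / len(dataset):.4f} s")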

Whatever, it takes the same 5 seconds whether I use a batch_size of 32 or 1; if it were 4 times faster, I would have noticed. Anyway, I tried several experiments and it looks like I have some bottleneck in my I/O or server hardware.

If you’re including the whole for loop in the measurement, then that’s likely. But that was also the wrong way to determine whether the issue is with batch inference specifically, because only the predict() call performs the batch inference, so that’s what you should have been measuring. And since you already mentioned that the Ultralytics latency logs show a lower latency, it’s even more likely that the issue isn’t with Ultralytics. This is why I wanted to see how you actually measured the time (in code).

I measured model.predict() as well, and also model.model() directly, and I get the same performance every time. I switched NMS off, and it makes no difference to the timing.
When I measure model.model() on a batch of size 1 it takes X ms, on a batch of 8 it takes roughly 8 * X ms, and so on. The time taken increases linearly with batch size.
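
Roughly, the forward-pass measurement looked like this (a reconstructed sketch of what was described, not the exact code; the batch sizes, random input and synchronize calls are assumptions):

import time
import torch

# Time only the underlying nn.Module, bypassing Ultralytics pre-/post-processing and NMS.
net = model.model.to(device).eval()

with torch.no_grad():
    for bs in (1, 8, 16, 32):
        x = torch.rand(bs, 3, RESOLUTION, RESOLUTION, device=device)
        net(x)                                # warm-up pass for this input shape
        torch.cuda.synchronize()
        start = time.perf_counter()
        net(x)
        torch.cuda.synchronize()
        print(f"batch {bs}: {(time.perf_counter() - start) * 1000:.1f} ms")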

Seems like something specific to your hardware. I tried reproducing it and even with the code in the issue you linked, I can’t reproduce it:

In [4]: import time
   ...: from ultralytics import YOLO
   ...: import numpy as np
   ...: 
   ...: model = YOLO("yolo11n.pt")
   ...: images = [np.random.randint(0, 256, (640, 640, 3), dtype=np.uint8) for _ in range(2 ** 10)]
   ...: device = "mps"
   ...: 
   ...: # time how long it takes to run the model on the images in batches
   ...: for batch_size in [1, 16, 32, 64, 128]:
   ...:     start_time = time.perf_counter()
   ...:     for i in range(0, len(images), batch_size):
   ...:         batch = images[i:i + batch_size]
   ...:         model.predict(batch, verbose=False, device=device, rect=True)
   ...:     end_time = time.perf_counter()
   ...:     print(f"Batch size: {batch_size}, Time: {end_time - start_time}")
   ...: 
Batch size: 1, Time: 7.425320082998951
Batch size: 16, Time: 4.274201041989727
Batch size: 32, Time: 3.95698041698779
Batch size: 64, Time: 4.685385083997971
Batch size: 128, Time: 4.372804916987661

Batched is faster than single inference. At some point, it saturates and doesn’t get any faster, which is normal.

I had to set device = torch.device('cuda'), and my output goes like this:

Batch size: 1, Time: 25.83381841983646
Batch size: 16, Time: 26.786037494428456
Batch size: 32, Time: 27.410270627588034
Batch size: 64, Time: 28.64176694303751

The per-batch inference time shouldn’t change much once the model is warmed up, and that is what your results show: the time required to process one image is approximately the same as for 16 images. That is an increase in throughput, meaning that in the same time you can process multiple images instead of a single one. The inference time per call might not change much, but in the end you have 16 images with detections instead of only 1 in (roughly) the same timespan. You can see the same behavior in the results @Toxite shared as well.
Crudely speaking, the throughput when batching can be thought of as N / t, where N is the batch size and t is the inference time per batch. That’s 1 / t for batch=1 and 16 / t for batch=16, so in your case:

1 image / 25 ms for batch=1
16 images / 25 ms for batch=16

That means there is a throughput of 1 / 25 = 0.04 images/ms for batch=1 and 16 / 25 = 0.64 images/ms for batch=16. If you can run larger batches without adding significantly more latency per call, you improve your throughput, effectively reducing the per-image inference time. Looking at a clock, a single call doesn’t seem any faster, but the total time is: to process 128 images with batch=1 you need 128 / 1 * t = 128 * 25 = 3200 ms, while with batch=16 you only need 128 / 16 * t = 8 * 26 = 208 ms. From your results, using batch=64 seems perfectly reasonable, and it would be faster than processing one image at a time.
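
As a back-of-the-envelope check with the figures quoted in this thread (the per-call latencies below are just the ~25-26 ms from the logs, not a new measurement):

import math

# Per-call latency in ms as quoted above (assumed figures, not re-measured).
latency_ms = {1: 25, 16: 26}

n_images = 128
for batch, t in latency_ms.items():
    calls = math.ceil(n_images / batch)
    total = calls * t
    print(f"batch={batch:>2}: {calls} calls x {t} ms = {total} ms "
          f"-> {n_images / total:.2f} images/ms")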

Hi! Thanks for the answer, but my output is the time required to process all 1024 images, so I don’t see the point of using batches at all, since it takes the same time, right?
And 25, 26, 27 are seconds to process the entire input of 1024 images, not ms per batch.

Is this using the same code I posted? Because 25 seconds for processing 1024 images on a GPU is very long. Even CPUs are faster than that.

Yes, this is exactly your code. Well, that’s how it is: model.predict() reports about 25 ms per image, but in practice it takes much longer. Is there something I can do to speed it up?

Then there’s definitely something wrong with your GPU or server, because that’s far too long for a GPU, especially an NVIDIA A10. I ran the same code on Google Colab with an NVIDIA T4, which is slower than an A10, and it takes less than half the time it’s taking for you. And batched inference is twice as fast there too.

Can you post the output of running yolo checks in a terminal?
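
For reference, the same environment check can also be run from Python; ultralytics.checks() prints the Ultralytics, Python and torch versions along with the CUDA device and other setup info, like the CLI command does:

import ultralytics

# Prints the same environment summary as `yolo checks` in the terminal.
ultralytics.checks()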