Using batch_size in inference doesn't speed up?

@phenomen21 I ran some tests on my machine with an RTX card, and the results show that batching with a PyTorch model makes virtually no difference to throughput. Exporting to a TensorRT .engine and running that instead helps significantly.

Export command

yolo export format=engine model=yolo11n-obb.pt batch=8 half=True dynamic=True
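
Once exported, the engine loads through the same API. One catch: for .engine files the task usually has to be passed explicitly, since it can't be read from the file the way it can from a .pt checkpoint. A minimal sketch, assuming the engine file and an image sit next to the script:

```python
from ultralytics import YOLO

# Load the exported TensorRT engine; task must be given for .engine files
model = YOLO("yolo11n-obb.engine", task="obb")

# dynamic=True at export time lets the engine accept batch sizes
# up to the export batch (8 in the command above)
results = model.predict(["image.jpg"] * 8, batch=8, device="cuda", verbose=False)
```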

Run results

YOLO11n.pt
 Average Throughputs & Times:
 ===
 Batch=1: 283.97 im/s
 Batch=1: 0.23 s
 ---
 Batch=4: 285.34 im/s
 Batch=4: 0.22 s
 ---
 Batch=8: 284.81 im/s
 Batch=8: 0.22 s

YOLO11n-obb.pt
 Average Throughputs & Times:
 ===
 Batch=1: 100.02 im/s
 Batch=1: 0.64 s
 ---
 Batch=4: 100.40 im/s
 Batch=4: 0.64 s
 ---
 Batch=8: 100.44 im/s
 Batch=8: 0.64 s

YOLO11n.engine
 Average Throughputs & Times:
 ===
 Batch=1: 390.23 im/s
 Batch=1: 0.16 s
 ---
 Batch=4: 721.18 im/s
 Batch=4: 0.09 s
 ---
 Batch=8: 796.44 im/s
 Batch=8: 0.08 s

YOLO11n-obb.engine
 Average Throughputs & Times:
 ===
 Batch=1: 353.91 im/s
 Batch=1: 0.18 s
 ---
 Batch=4: 631.63 im/s
 Batch=4: 0.10 s
 ---
 Batch=8: 682.96 im/s
 Batch=8: 0.09 s
The code used
 import gc
 import random
 import time
 from pathlib import Path
 
 import numpy as np
 import torch
 from ultralytics import YOLO
 
 N = 64  # total images
 RR = 32  # run repeats
 WU = 3  # warmup runs
 H, W, C = 640, 640, 3  # image height, width, channels
 DEV = "cuda"  # device
 TASK = {
     "": "detect",
     "obb": "obb",
     "cls": "classify",
     "seg": "segment",
     "pose": "pose",
 }
 
 
 bsizes = [1, 4, 8]  # batch sizes
 batches = bsizes * RR
 random.shuffle(batches)
 im = np.random.randint(0, 256, (H, W, C), dtype=np.uint8)
 
 p = Path("ultralytics")  # update as needed
 model_name = p / "yolo11n-obb.engine"  # update as needed
 task = TASK.get(Path(model_name).stem.split("-")[-1]) or "detect"
 
 if __name__ == "__main__":
     thru = {b: [] for b in bsizes}
     times = {b: [] for b in bsizes}
     for b in batches:
         imgs = [[im] * b for _ in range(N // b)]
        model = YOLO(model_name, task=task)  # fresh model each run; task is needed for .engine files
         _ = [
             model.predict([im] * b, batch=b, device=DEV, verbose=False)
             for _ in range(WU)
         ]
 
         t0 = time.perf_counter()
         for ib in imgs:
             results = model.predict(ib, batch=b, device=DEV, verbose=False)
         delta = time.perf_counter() - t0
 
         print(f"Batch={b}\nTime taken={delta:.2f} s\nThroughput={N / delta:.2f} im/s")
         print("===" * 12)
         # Explicit cleanup to mitigate cumulative VRAM growth across iterations
         del model, results, imgs  # drop references
         gc.collect()
         if torch.cuda.is_available() and str(DEV) in {"cuda", "0"}:
             torch.cuda.empty_cache()
 
         thru[b].append(N / delta)
         times[b].append(delta)
 
         time.sleep(random.randint(2, 5))  # dwell time
 
     print("Average Throughputs & Times:")
     for b in sorted(thru.keys()):
         print(f"Batch={b}: {np.mean(thru[b]):.2f} im/s")
         print(f"Batch={b}: {np.mean(times[b]):.2f} s")
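
One detail worth noting in the script: since an .engine file carries no task metadata, the task is inferred from the model filename suffix via the TASK dict. That lookup in isolation behaves like this (a standalone sketch of the same logic):

```python
from pathlib import Path

# Same suffix-to-task mapping as in the benchmark script
TASK = {
    "": "detect",
    "obb": "obb",
    "cls": "classify",
    "seg": "segment",
    "pose": "pose",
}


def infer_task(name: str) -> str:
    # Take the part after the last "-" in the stem, e.g. "yolo11n-obb" -> "obb";
    # plain names like "yolo11n" miss the dict and fall back to "detect"
    return TASK.get(Path(name).stem.split("-")[-1]) or "detect"


print(infer_task("yolo11n-obb.engine"))  # obb
print(infer_task("yolo11n.engine"))  # detect
print(infer_task("yolo11n-seg.pt"))  # segment
```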