@phenomen21 I ran some tests on my machine with an RTX card, and the results show that batch size makes almost no difference for the PyTorch model. Exporting to a TensorRT `.engine` and running that instead helps significantly.
**Export command**

```shell
yolo export format=engine model=yolo11n-obb.pt batch=8 half=True dynamic=True
```
**Run results**
Average throughputs and times (time is per 64-image run):

| Model | Batch | Throughput (im/s) | Avg. time (s) |
| --- | ---: | ---: | ---: |
| YOLO11n.pt | 1 | 283.97 | 0.23 |
| YOLO11n.pt | 4 | 285.34 | 0.22 |
| YOLO11n.pt | 8 | 284.81 | 0.22 |
| YOLO11n-obb.pt | 1 | 100.02 | 0.64 |
| YOLO11n-obb.pt | 4 | 100.40 | 0.64 |
| YOLO11n-obb.pt | 8 | 100.44 | 0.64 |
| YOLO11n.engine | 1 | 390.23 | 0.16 |
| YOLO11n.engine | 4 | 721.18 | 0.09 |
| YOLO11n.engine | 8 | 796.44 | 0.08 |
| YOLO11n-obb.engine | 1 | 353.91 | 0.18 |
| YOLO11n-obb.engine | 4 | 631.63 | 0.10 |
| YOLO11n-obb.engine | 8 | 682.96 | 0.09 |
**The code used**
```python
import gc
import random
import time
from pathlib import Path

import numpy as np
import torch

from ultralytics import YOLO

N = 64  # total images per timed run
RR = 32  # run repeats
WU = 3  # warmup runs
H, W, C = 640, 640, 3  # image height, width, channels
DEV = "cuda"  # device
TASK = {
    "": "detect",
    "obb": "obb",
    "cls": "classify",
    "seg": "segment",
    "pose": "pose",
}
bsizes = [1, 4, 8]  # batch sizes
batches = bsizes * RR
random.shuffle(batches)  # interleave batch sizes across runs
im = np.random.randint(0, 256, (H, W, C), dtype=np.uint8)  # random test image
p = Path("ultralytics")  # update as needed
model_name = p / "yolo11n-obb.engine"  # update as needed
task = TASK.get(Path(model_name).stem.split("-")[-1]) or "detect"

if __name__ == "__main__":
    thru = {b: [] for b in bsizes}
    times = {b: [] for b in bsizes}
    for b in batches:
        imgs = [[im] * b for _ in range(N // b)]
        model = YOLO(model_name)  # fresh model for every run
        _ = [
            model.predict([im] * b, batch=b, device=DEV, verbose=False)
            for _ in range(WU)
        ]  # warmup
        t0 = time.perf_counter()
        for ib in imgs:
            results = model.predict(ib, batch=b, device=DEV, verbose=False)
        delta = time.perf_counter() - t0
        print(f"Batch={b}\nTime taken={delta:.2f} s\nThroughput={N / delta:.2f} im/s")
        print("===" * 12)
        # Explicit cleanup to mitigate cumulative VRAM growth across iterations
        del model, results, imgs  # drop references
        gc.collect()
        if torch.cuda.is_available() and str(DEV) in {"cuda", "0"}:
            torch.cuda.empty_cache()
        thru[b].append(N / delta)
        times[b].append(delta)
        time.sleep(random.randint(2, 5))  # dwell time between runs
    print("Average Throughputs & Times:")
    for b in sorted(thru.keys()):
        print(f"Batch={b}: {np.mean(thru[b]):.2f} im/s")
        print(f"Batch={b}: {np.mean(times[b]):.2f} s")
```