We are trying to run a YOLO model for tracking objects in a camera stream, but we are running into a significant performance issue: whenever we read the results, the measured frame time is around 15 ms slower than the preprocessing, inference, and postprocessing times would indicate.
We have tried swapping out YOLO models (v8, v10, v11) and see the same behavior. We have tried `.engine` vs `.pt` files and have run it on a Jetson as well as our personal computers, so we have reasonably concluded that it is not a hardware issue and have narrowed it down to reading the tracking results. We have also tried `model.predict()` and even just `model()`, with no change in performance.
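For reference, the alternative calls we tried look roughly like this (same weights and image as in the reproduction below; this is a sketch, not our full pipeline):

```python
from ultralytics import YOLO

model = YOLO("./best.pt")

# Variant 1: predict() instead of track()
for r in model.predict("img.png", iou=0.85, stream=True, verbose=False):
    print(r.speed)

# Variant 2: calling the model object directly (delegates to predict())
for r in model("img.png", iou=0.85, stream=True, verbose=False):
    print(r.speed)
```

In both variants the slow step is the same: iterating over the returned results.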
Here is a minimal reproduction of our issue:
```python
import time

from ultralytics import YOLO

frames = ["img.png"]
model = YOLO("./best.pt")

for i in range(10):
    print("==================")
    start_time = time.time()
    # With stream=True this returns a generator, not a list of results
    results = model.track(frames[0],
                          iou=0.85,
                          stream=True,
                          verbose=False,
                          tracker="./configs/botsort.yaml",
                          persist=True)
    print("before next--- %2.5f ms ---" % ((time.time() - start_time) * 1000))
    for r in results:
        print(r.speed)  # per-frame preprocess/inference/postprocess times
    print("after next--- %2.5f ms ---" % ((time.time() - start_time) * 1000))
```
We ran this code on the Ultralytics Docker image located here, with these framework versions:
- ultralytics - 8.3.18
- pytorch - 2.5.1+cu124
- torchvision - 0.20.1+cu124
And these were the results:
```
==================
before next--- 7478.14798 ms ---
{'preprocess': 8.332490921020508, 'inference': 65.53292274475098, 'postprocess': 2.1097660064697266}
after next--- 11215.55352 ms ---
==================
before next--- 0.11039 ms ---
{'preprocess': 7.988691329956055, 'inference': 59.77582931518555, 'postprocess': 2.177715301513672}
after next--- 84.64813 ms ---
==================
before next--- 0.14186 ms ---
{'preprocess': 7.1582794189453125, 'inference': 59.71169471740723, 'postprocess': 2.2802352905273438}
after next--- 84.18584 ms ---
==================
before next--- 0.14973 ms ---
{'preprocess': 3.900766372680664, 'inference': 22.022485733032227, 'postprocess': 1.3422966003417969}
after next--- 43.89453 ms ---
==================
before next--- 0.10395 ms ---
{'preprocess': 3.3295154571533203, 'inference': 15.07568359375, 'postprocess': 1.0590553283691406}
after next--- 32.76610 ms ---
==================
before next--- 0.10371 ms ---
{'preprocess': 3.370523452758789, 'inference': 14.656782150268555, 'postprocess': 1.0292530059814453}
after next--- 32.36604 ms ---
==================
before next--- 0.10896 ms ---
{'preprocess': 3.3600330352783203, 'inference': 14.789104461669922, 'postprocess': 0.9891986846923828}
after next--- 32.51362 ms ---
==================
before next--- 0.10991 ms ---
{'preprocess': 3.4210681915283203, 'inference': 14.419078826904297, 'postprocess': 1.0135173797607422}
after next--- 32.46212 ms ---
==================
before next--- 0.10443 ms ---
{'preprocess': 3.2918453216552734, 'inference': 15.416860580444336, 'postprocess': 0.9286403656005859}
after next--- 32.98044 ms ---
==================
before next--- 0.10943 ms ---
{'preprocess': 3.2393932342529297, 'inference': 13.374805450439453, 'postprocess': 1.0454654693603516}
after next--- 31.03900 ms ---
```
When we take out the `for r in results:` block, we get dramatically faster times, so we believe the model only runs prediction when the generator is iterated. Why is this? Is this expected behavior, and if so, how should we read the results in a way that is usable in a camera-stream environment?
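For context, this is a simplified sketch of the camera loop we are targeting (the OpenCV capture is just a placeholder for our actual camera source):

```python
import time

import cv2
from ultralytics import YOLO

model = YOLO("./best.pt")
cap = cv2.VideoCapture(0)  # placeholder: our real source is a camera stream

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break

    start_time = time.time()
    # Per-frame tracking; stream=True makes this a lazy generator, so
    # (as far as we can tell) nothing runs until it is iterated
    results = model.track(frame,
                          iou=0.85,
                          stream=True,
                          verbose=False,
                          tracker="./configs/botsort.yaml",
                          persist=True)
    for r in results:
        print(r.speed)
    # This is the total frame time that ends up ~15 ms above the sum of
    # the preprocess + inference + postprocess values reported in r.speed
    print("frame--- %2.5f ms ---" % ((time.time() - start_time) * 1000))

cap.release()
```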