We are trying to run a YOLO model for tracking objects in a camera stream, but we are running into a significant performance issue: whenever we read the results, the measured frame time is around 15 ms slower than the preprocessing, inference, and postprocessing times would indicate.
We have tried swapping out YOLO models (v8, v10, v11) and see the same behavior. We have tried `.engine` vs `.pt` files and have run it on a Jetson as well as our personal computers, so we have reasonably concluded that it is not a hardware issue and have narrowed it down to reading the tracking results. We have also tried `model.predict()` and even just `model()`, with no change in performance.
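For reference, the alternative calls we tried look roughly like this (same weights and image as in the reproduction below; this is a sketch, not our full pipeline):

```python
from ultralytics import YOLO

model = YOLO("./best.pt")

# Variant 1: predict() instead of track()
for r in model.predict("img.png", iou=0.85, stream=True, verbose=False):
    print(r.speed)

# Variant 2: calling the model object directly (delegates to predict())
for r in model("img.png", iou=0.85, stream=True, verbose=False):
    print(r.speed)
```

In both variants the slow step is the same: iterating over the returned results.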
Here is a minimal reproduction of our issue:
```python
import time

from ultralytics import YOLO

frames = ["img.png"]
model = YOLO("./best.pt")

for i in range(10):
    print("==================")
    start_time = time.time()
    # With stream=True this returns a generator, not a list of results
    results = model.track(frames[0],
                          iou=0.85,
                          stream=True,
                          verbose=False,
                          tracker="./configs/botsort.yaml",
                          persist=True)
    print("before next--- %2.5f ms ---" % ((time.time() - start_time) * 1000))
    for r in results:
        print(r.speed)  # per-frame preprocess/inference/postprocess times
    print("after next--- %2.5f ms ---" % ((time.time() - start_time) * 1000))
```
We ran this code on the Ultralytics Docker image located here, with these framework versions:
- ultralytics - 8.3.18
- pytorch - 2.5.1+cu124
- torchvision - 0.20.1+cu124
And these were the results:
```
==================
before next--- 7478.14798 ms ---
{'preprocess': 8.332490921020508, 'inference': 65.53292274475098, 'postprocess': 2.1097660064697266}
after next--- 11215.55352 ms ---
==================
before next--- 0.11039 ms ---
{'preprocess': 7.988691329956055, 'inference': 59.77582931518555, 'postprocess': 2.177715301513672}
after next--- 84.64813 ms ---
==================
before next--- 0.14186 ms ---
{'preprocess': 7.1582794189453125, 'inference': 59.71169471740723, 'postprocess': 2.2802352905273438}
after next--- 84.18584 ms ---
==================
before next--- 0.14973 ms ---
{'preprocess': 3.900766372680664, 'inference': 22.022485733032227, 'postprocess': 1.3422966003417969}
after next--- 43.89453 ms ---
==================
before next--- 0.10395 ms ---
{'preprocess': 3.3295154571533203, 'inference': 15.07568359375, 'postprocess': 1.0590553283691406}
after next--- 32.76610 ms ---
==================
before next--- 0.10371 ms ---
{'preprocess': 3.370523452758789, 'inference': 14.656782150268555, 'postprocess': 1.0292530059814453}
after next--- 32.36604 ms ---
==================
before next--- 0.10896 ms ---
{'preprocess': 3.3600330352783203, 'inference': 14.789104461669922, 'postprocess': 0.9891986846923828}
after next--- 32.51362 ms ---
==================
before next--- 0.10991 ms ---
{'preprocess': 3.4210681915283203, 'inference': 14.419078826904297, 'postprocess': 1.0135173797607422}
after next--- 32.46212 ms ---
==================
before next--- 0.10443 ms ---
{'preprocess': 3.2918453216552734, 'inference': 15.416860580444336, 'postprocess': 0.9286403656005859}
after next--- 32.98044 ms ---
==================
before next--- 0.10943 ms ---
{'preprocess': 3.2393932342529297, 'inference': 13.374805450439453, 'postprocess': 1.0454654693603516}
after next--- 31.03900 ms ---
```
When we take out the `for r in results:` block, we get dramatically faster times, so we believe the model only runs prediction when the generator is iterated. Why is this? Is this expected behavior, and if so, how should we read the results in a way that is usable in a camera-stream environment?
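For context, this is a simplified sketch of the camera loop we are targeting (the OpenCV capture is just a placeholder for our actual camera source):

```python
import time

import cv2
from ultralytics import YOLO

model = YOLO("./best.pt")
cap = cv2.VideoCapture(0)  # placeholder: our real source is a camera stream

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break

    start_time = time.time()
    # Per-frame tracking; stream=True makes this a lazy generator, so
    # (as far as we can tell) nothing runs until it is iterated
    results = model.track(frame,
                          iou=0.85,
                          stream=True,
                          verbose=False,
                          tracker="./configs/botsort.yaml",
                          persist=True)
    for r in results:
        print(r.speed)
    # This is the total frame time that ends up ~15 ms above the sum of
    # the preprocess + inference + postprocess values reported in r.speed
    print("frame--- %2.5f ms ---" % ((time.time() - start_time) * 1000))

cap.release()
```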