Jetson Orin Nano - Extreme time overhead

I have a Jetson Orin Nano and I’m using the Ultralytics Docker image provided for JetPack 6. I have encountered a huge time overhead and I don’t know where it comes from.

I have also created an “engine” model using YOLO26 nano:
Export complete (563.0s)
Results saved to /ultralytics
Predict: yolo predict task=detect model=yolo26n.engine imgsz=960 half
Validate: yolo val task=detect model=yolo26n.engine imgsz=960 data=/home/lq/codes/ultralytics/ultralytics/cfg/datasets/coco.yaml half
‘yolo26n.engine’

This is the code:

import time
from ultralytics import YOLO

image = "https://ultralytics.com/images/bus.jpg"
model = YOLO("yolo26n.engine")
for i in range(10):
    print("==================")
    start_time = time.time()
    results = model.track(image,
                          iou=0.85,
                          stream=False,
                          verbose=False,
                          persist=True)
    print("before next--- %2.5f ms ---" % ((time.time() - start_time) * 1000))
    for r in results:
        print(r.speed)
    print("after next--- %2.5f ms ---" % ((time.time() - start_time) * 1000))

And this is the output:

==================

Loading yolo26n.engine for TensorRT inference...

[02/03/2026-11:20:44] [TRT] [I] Loaded engine size: 8 MiB

[02/03/2026-11:20:44] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +21, now: CPU 0, GPU 25 (MiB)

Downloading https://ultralytics.com/images/bus.jpg to 'bus.jpg': 100% ━━━━━━━━━━━━ 134.2KB 17.9MB/s 0.0s

before next--- 1702.60859 ms ---

{'preprocess': 17.07121300023573, 'inference': 24.008819999835396, 'postprocess': 12.153244000273844}

after next--- 1702.73948 ms ---

==================

Found https://ultralytics.com/images/bus.jpg locally at bus.jpg

before next--- 93.06192 ms ---

{'preprocess': 16.566280000006373, 'inference': 23.785841000062646, 'postprocess': 1.6990890003398817}

after next--- 93.19997 ms ---

==================

[….]

As you can see, the inference time is super good, but the total time is around 90 to 100 ms. Where are those extra ~48 ms being lost?

PS: I’ve tried without tracking, but the overhead is exactly the same.

Thank you in advance for your time!
Cristian.

Since you’re specifying a URL as the image source, the code checks on every call whether the image has already been downloaded; if found, it reads it from disk, otherwise it downloads it. This adds overhead on each iteration of the loop. The ultralytics package ships with this image, so you can update your code as follows and try running again:

import time
+ import cv2  # Optional, see below

- from ultralytics import YOLO
+ from ultralytics import ASSETS, YOLO

- image = "https://ultralytics.com/images/bus.jpg"
+ image = ASSETS / "bus.jpg"
+
+ # Optionally, read image into memory before loop
+ # image = cv2.imread(str(ASSETS / "bus.jpg"))
model = YOLO("yolo26n.engine")
for i in range(10):
    print("==================")
    start_time = time.time()
    results = model.track(image,
                          iou=0.85,
                          stream=False,
                          verbose=False,
                          persist=True)
    print("before next--- %2.5f ms ---" % ((time.time() - start_time) * 1000))
    for r in results:
        print(r.speed)
    print("after next--- %2.5f ms ---" % ((time.time() - start_time) * 1000))

Using the local file instead of the image URL should be faster, since the execution path in the code is shorter, although once the image has been downloaded the two should be close to the same. Additionally, loading the image into memory with cv2.imread should reduce the per-iteration processing time further, since the file is read from disk only once.

Testing on a MacBook with ONNX, here are the timings for the different sources (last iteration of the loop):

  • Using local file read on each iteration (`image = ASSETS / "bus.jpg"`)

    before next--- 79.326 ms ---
    {
        'preprocess': 2.737, 
        'inference': 59.811, 
        'postprocess': 0.1899
    } 
    Total time: 62.738
    after next--- 79.375 ms ---
    
    • Total execution time - Model processing time = 16.6 ms

  • Using URL on each iteration (`image = "https://ultralytics.com/images/bus.jpg"`)

    Found https://ultralytics.com/images/bus.jpg locally at bus.jpg
    before next--- 79.782 ms ---
    {
        'preprocess': 2.721, 
        'inference': 59.376, 
        'postprocess': 0.225
    } 
    Total time: 62.322
    after next--- 79.834 ms ---
    
    • Total execution time - Model processing time = 17.5 ms

  • Using in-memory image on each iteration (`image = cv2.imread(str(ASSETS / "bus.jpg"))`)

    before next--- 74.457 ms ---
    {
        'preprocess': 2.586, 
        'inference': 59.023, 
        'postprocess': 0.197
    } 
    Total time: 61.804
    after next--- 74.501 ms ---
    
    • Total execution time - Model processing time = 12.7 ms

I used the last-iteration time since that gives the fairest chance to the URL method. As you can see, the fastest option should be the image already loaded in memory. Of course, this is a MacBook, not a Jetson, so I expect some difference if you run the same test, but the overall result should be similar. With TensorRT there can be additional time in the gap between total execution time and model processing time, since the data is copied from CPU to GPU and back to CPU, which adds overhead as well.
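For reference, the per-iteration overhead in the comparison above is just wall-clock time minus the sum of the `r.speed` stages. A minimal sketch of that calculation, using the numbers from the local-file run above:

```python
def unaccounted_ms(wall_ms: float, speed: dict) -> float:
    """Wall-clock time not covered by preprocess/inference/postprocess."""
    return wall_ms - sum(speed.values())


# Values from the local-file run above (r.speed is in milliseconds)
speed = {"preprocess": 2.737, "inference": 59.811, "postprocess": 0.1899}
print(round(unaccounted_ms(79.375, speed), 1))  # ~16.6 ms of loop overhead
```

Anything left over after that subtraction is spent outside the model pipeline: source resolution, result construction, tracking, and (with TensorRT) host/device copies.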

Hi @BurhanQ. Thank you for your help, it’s very appreciated.

I’ve tried your solution and I still had a bit of overhead, but I realized it was coming from the tracker callback:
self.run_callbacks("on_predict_postprocess_end")

So now all the numbers add up and everything makes much more sense.
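For anyone else chasing this kind of gap: one generic way to confirm a callback is the culprit is to wrap it with a timer before registering it. A sketch of such a wrapper (pure Python; the `model.add_callback` line in the comment shows where it would plug into ultralytics, using the callback name mentioned above, and `fake_callback` is just a stand-in):

```python
import time
from functools import wraps


def timed(fn, log):
    """Wrap any callable and append its duration in ms to `log` on each call."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        log.append((time.perf_counter() - t0) * 1000.0)
        return result
    return wrapper


# With ultralytics you would register the wrapped callback, e.g.:
#   model.add_callback("on_predict_postprocess_end", timed(my_callback, durations))
durations = []


def fake_callback(predictor=None):
    time.sleep(0.005)  # stand-in for ~5 ms of tracker work


wrapped = timed(fake_callback, durations)
wrapped()
print(f"callback took {durations[0]:.1f} ms")
```

After a few iterations, `durations` tells you exactly how much of the unaccounted time each callback is responsible for.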

Again thank you very much!


Of course, you’re very welcome!