How can I get max performance on Jetson Orin NX?

Hi everyone:

I’ve successfully installed all the necessary packages on my Jetson Orin NX 16GB platform. Then I wrote a simple demo using YOLO11n:

from ultralytics import YOLO
import cv2

model = YOLO('yolo11n.engine')
image = cv2.imread(image_path)
results = model(image)

The 'yolo11n.engine' model file was exported with half=True, imgsz=640, nms=True.
And I’ve enabled jetson_clocks.
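(For completeness, these are the usual commands to pin a Jetson at maximum performance; `jetson_clocks` only locks clocks within the current power mode, so the mode should be set first. The mode number is board-specific, so verify it against /etc/nvpmodel.conf before running this.)

```shell
# Select the highest power mode first (mode 0 is MAXN on many Jetson boards;
# check the mode table for your board in /etc/nvpmodel.conf before using it)
sudo nvpmodel -m 0
# Then lock CPU/GPU/EMC clocks at the maximums for that mode
sudo jetson_clocks
```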

Here’s the result:
Loading yolo11n.engine for TensorRT inference…
[07/27/2025-00:39:00] [TRT] [I] Loaded engine size: 8 MiB
[07/27/2025-00:39:00] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +9, now: CPU 0, GPU 14 (MiB)

image 1/1 /home/me/Projects/helloPython/pics/bus.jpg: 640x640 4 persons, 1 bus, 8.0ms
Speed: 17.3ms preprocess, 8.0ms inference, 34.5ms postprocess per image at shape (1, 3, 640, 640)

As you can see above, the inference time is 8.0 ms. But according to the benchmark mentioned here:

the Jetson Orin NX 16GB should achieve an inference time of 4–5 ms at FP16, which is twice as fast as mine.

What can I do to find out the reason?
Thanks a lot!

PS: my env info:
Ultralytics 8.3.169 🚀 Python-3.10.12 torch-2.5.0a0+872d972e41.nv24.08 CUDA:0 (Orin, 15656MiB)
Setup complete ✅ (8 CPUs, 15.3 GB RAM, 38.1/232.2 GB disk)

OS Linux-5.15.148-tegra-aarch64-with-glibc2.35
Environment Linux
Python 3.10.12
Install pip
Path /home/housebrain/Projects/.yenv/lib/python3.10/site-packages/ultralytics
RAM 15.29 GB
Disk 38.1/232.2 GB
CPU ARMv8 Processor rev 1 (v8l)
CPU count 8
GPU Orin, 15656MiB
GPU count 1
CUDA 12.6

numpy ✅ 1.26.4>=1.23.0
matplotlib ✅ 3.10.3>=3.3.0
opencv-python ✅ 4.12.0.88>=4.6.0
pillow ✅ 11.3.0>=7.1.2
pyyaml ✅ 6.0.2>=5.3.1
requests ✅ 2.32.4>=2.23.0
scipy ✅ 1.15.3>=1.4.1
torch ✅ 2.5.0a0+872d972e41.nv24.8>=1.8.0
torch ✅ 2.5.0a0+872d972e41.nv24.8!=2.4.0,>=1.8.0; sys_platform == "win32"
torchvision ✅ 0.20.0a0>=0.9.0
tqdm ✅ 4.67.1>=4.64.0
psutil ✅ 7.0.0
py-cpuinfo ✅ 9.0.0
pandas ✅ 2.3.1>=1.1.4
ultralytics-thop ✅ 2.0.14>=2.0.0

You should benchmark the latency by running it on multiple images, not a single image. Otherwise, it won’t provide you with accurate numbers.
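To illustrate the point about averaging: a minimal, framework-agnostic timing harness that discards warm-up iterations before measuring. The `time_callable` helper and the dummy workload are illustrative only; in practice you would pass something like `lambda: model(image)`.

```python
import time
import statistics

def time_callable(infer, n_warmup=10, n_runs=100):
    """Time a zero-argument callable over many runs, skipping warm-up.

    The first calls pay one-off costs (CUDA context creation, autotuning,
    clocks ramping up) that should not count toward steady-state latency.
    Returns (mean_ms, median_ms) over the measured runs.
    """
    for _ in range(n_warmup):
        infer()
    samples_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        infer()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples_ms), statistics.median(samples_ms)

# Dummy workload standing in for a model forward pass
mean_ms, median_ms = time_callable(lambda: sum(i * i for i in range(20_000)))
print(f"mean {mean_ms:.2f} ms | median {median_ms:.2f} ms")
```

The median is often more robust than the mean here, since occasional scheduler hiccups produce outlier runs that skew the average upward.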

Thanks, I’ll try it out!

Hello! Thanks for reaching out and providing such detailed information. It’s a great question about the difference between your observed inference time and our published benchmarks.

The benchmark values you’re referencing are generated under specific conditions to measure pure model inference speed. As noted in our Quick Start Guide for NVIDIA Jetson, these times are averaged over many images and, importantly, exclude the pre-processing and post-processing steps.

A single prediction can include some initial warm-up overhead, which may not reflect the model’s optimal steady-state performance. To get a more comparable result, I recommend using our benchmark mode, which was designed to replicate our testing environment.

You can run it like this:

from ultralytics import YOLO

# Load the base PyTorch model
model = YOLO('yolo11n.pt')

# Run the benchmark for TensorRT FP16 export on your device
# This will handle the export and measure performance over a dataset
results = model.benchmark(data='coco128.yaml', half=True, imgsz=640)

Running this benchmark should give you an inference time that aligns more closely with our documentation. Let us know how it goes!

Thank you very much for the information!

After I posted the question yesterday, I thought I should run the benchmark the way Ultralytics does, so I ran yolo benchmark with model=yolo11n.pt and data=coco128.yaml.

Due to some package dependency issues I did not run the full benchmark, but I successfully finished the TensorRT FP32 run and got 8.71 ms, which is slightly slower than the published benchmark (7.94 ms).
That’s enough for me right now.

Thank you again!


Hello Shawn_Chan,

Great to hear you were able to run the benchmark yourself! That’s the best way to get a direct comparison.

The small difference you’re seeing between your result (8.71 ms) and the documented one (7.94 ms) is quite normal. These variations can come from minor differences in software versions (like JetPack, CUDA, or PyTorch) or from the system’s background workload at the time of the test. Our published benchmarks on NVIDIA Jetson serve as a reference point, but slight fluctuations are expected across setups.
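As a side note, if you ever want to repeat just the TensorRT run without the full export matrix, recent Ultralytics releases accept a format argument in benchmark mode (treat this as an assumption and check the yolo benchmark docs for your installed version; format=engine selects TensorRT):

```shell
# Benchmark only the TensorRT export at FP16 ("engine" is the TensorRT format
# key in Ultralytics exports; the `format` argument may not exist in older
# releases -- verify against your version's docs first)
yolo benchmark model=yolo11n.pt data=coco128.yaml imgsz=640 half=True format=engine
```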

You’re on the right track by using benchmark mode for performance evaluation. Thanks for sharing your findings with the community!
