How can I get max performance on Jetson Orin NX?

Hi everyone:

I’ve successfully installed all the necessary packages on my Jetson Orin NX 16GB platform. Then I wrote a simple demo using YOLO11n:

from ultralytics import YOLO
import cv2

model = YOLO('yolo11n.engine')
image = cv2.imread(image_path)
results = model(image)

The 'yolo11n.engine' model file was exported with half=True, imgsz=640, nms=True.
And I’ve enabled jetson_clocks.
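(For completeness, these are the usual commands to pin a Jetson at maximum performance; `jetson_clocks` only locks clocks within the current power mode, so the mode should be set first. The mode number is board-specific, so verify it against /etc/nvpmodel.conf before running this.)

```shell
# Select the highest power mode first (mode 0 is MAXN on many Jetson boards;
# check the mode table for your board in /etc/nvpmodel.conf before using it)
sudo nvpmodel -m 0
# Then lock CPU/GPU/EMC clocks at the maximums for that mode
sudo jetson_clocks
```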

Here’s the result:
Loading yolo11n.engine for TensorRT inference…
[07/27/2025-00:39:00] [TRT] [I] Loaded engine size: 8 MiB
[07/27/2025-00:39:00] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +9, now: CPU 0, GPU 14 (MiB)

image 1/1 /home/me/Projects/helloPython/pics/bus.jpg: 640x640 4 persons, 1 bus, 8.0ms
Speed: 17.3ms preprocess, 8.0ms inference, 34.5ms postprocess per image at shape (1, 3, 640, 640)

As you can see above, the inference time is 8.0 ms. But according to the benchmark mentioned here:

the Jetson Orin NX 16GB should achieve an inference time of 4–5 ms at FP16, which is twice as fast as mine.

What can I do to find out the reason?
Thanks a lot!

PS: my env info:
Ultralytics 8.3.169 🚀 Python-3.10.12 torch-2.5.0a0+872d972e41.nv24.08 CUDA:0 (Orin, 15656MiB)
Setup complete ✅ (8 CPUs, 15.3 GB RAM, 38.1/232.2 GB disk)

OS Linux-5.15.148-tegra-aarch64-with-glibc2.35
Environment Linux
Python 3.10.12
Install pip
Path /home/housebrain/Projects/.yenv/lib/python3.10/site-packages/ultralytics
RAM 15.29 GB
Disk 38.1/232.2 GB
CPU ARMv8 Processor rev 1 (v8l)
CPU count 8
GPU Orin, 15656MiB
GPU count 1
CUDA 12.6

numpy ✅ 1.26.4>=1.23.0
matplotlib ✅ 3.10.3>=3.3.0
opencv-python ✅ 4.12.0.88>=4.6.0
pillow ✅ 11.3.0>=7.1.2
pyyaml ✅ 6.0.2>=5.3.1
requests ✅ 2.32.4>=2.23.0
scipy ✅ 1.15.3>=1.4.1
torch ✅ 2.5.0a0+872d972e41.nv24.8>=1.8.0
torch ✅ 2.5.0a0+872d972e41.nv24.8!=2.4.0,>=1.8.0; sys_platform == "win32"
torchvision ✅ 0.20.0a0>=0.9.0
tqdm ✅ 4.67.1>=4.64.0
psutil ✅ 7.0.0
py-cpuinfo ✅ 9.0.0
pandas ✅ 2.3.1>=1.1.4
ultralytics-thop ✅ 2.0.14>=2.0.0

You should benchmark the latency by running it on multiple images, not a single image. Otherwise, it won’t provide you with accurate numbers.
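To illustrate the point about averaging: a minimal, framework-agnostic timing harness that discards warm-up iterations before measuring. The `time_callable` helper and the dummy workload are illustrative only; in practice you would pass something like `lambda: model(image)`.

```python
import time
import statistics

def time_callable(infer, n_warmup=10, n_runs=100):
    """Time a zero-argument callable over many runs, skipping warm-up.

    The first calls pay one-off costs (CUDA context creation, autotuning,
    clocks ramping up) that should not count toward steady-state latency.
    Returns (mean_ms, median_ms) over the measured runs.
    """
    for _ in range(n_warmup):
        infer()
    samples_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        infer()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples_ms), statistics.median(samples_ms)

# Dummy workload standing in for a model forward pass
mean_ms, median_ms = time_callable(lambda: sum(i * i for i in range(20_000)))
print(f"mean {mean_ms:.2f} ms | median {median_ms:.2f} ms")
```

The median is often more robust than the mean here, since occasional scheduler hiccups produce outlier runs that skew the average upward.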

Thanks, I’ll try it out!

Hello! Thanks for reaching out and providing such detailed information. It’s a great question about the difference between your observed inference time and our published benchmarks.

The benchmark values you’re referencing are generated under specific conditions to measure pure model inference speed. As noted in our Quick Start Guide for NVIDIA Jetson, these times are averaged over many images and, importantly, exclude the pre-processing and post-processing steps.

A single prediction can include some initial warm-up overhead, which may not reflect the model’s optimal steady-state performance. To get a more comparable result, I recommend using our benchmark mode, which was designed to replicate our testing environment.

You can run it like this:

from ultralytics import YOLO

# Load the base PyTorch model
model = YOLO('yolo11n.pt')

# Run the benchmark for TensorRT FP16 export on your device
# This will handle the export and measure performance over a dataset
results = model.benchmark(data='coco128.yaml', half=True, imgsz=640)

Running this benchmark should give you an inference time that aligns more closely with our documentation. Let us know how it goes!

Thank you very much for the information!

After I posted the question yesterday, I thought I should run the benchmark the way Ultralytics does, so I ran yolo benchmark with model=yolo11n.pt and data=coco128.yaml.

Due to some package dependency issues I did not run the full benchmark, but I successfully finished the TensorRT FP32 run and got 8.71 ms, which is slightly slower than the published benchmark (7.94 ms).
That’s enough for me right now.

Thank you again!


Hello Shawn_Chan,

Great to hear you were able to run the benchmark yourself! That’s the best way to get a direct comparison.

The small difference you’re seeing between your result (8.71 ms) and the documented one (7.94 ms) is quite normal. These variations can come from minor differences in software versions (like JetPack, CUDA, or PyTorch) or from the system’s background workload at the time of the test. Our published benchmarks on NVIDIA Jetson serve as a reference point, but slight fluctuations are expected across setups.
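As a side note, if you ever want to repeat just the TensorRT run without the full export matrix, recent Ultralytics releases accept a format argument in benchmark mode (treat this as an assumption and check the yolo benchmark docs for your installed version; format=engine selects TensorRT):

```shell
# Benchmark only the TensorRT export at FP16 ("engine" is the TensorRT format
# key in Ultralytics exports; the `format` argument may not exist in older
# releases -- verify against your version's docs first)
yolo benchmark model=yolo11n.pt data=coco128.yaml imgsz=640 half=True format=engine
```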

You’re on the right track by using benchmark mode for performance evaluation. Thanks for sharing your findings with the community!
