How is the up to 5x speedup with TensorRT achieved?

On the MODEL.export documentation page, it says:

Why Choose YOLO11’s Export Mode?

  • Performance: Gain up to 5x GPU speedup with TensorRT and 3x CPU speedup with ONNX or OpenVINO.

How is the up-to-5x GPU speedup achieved? Is it compared to CUDA?

On my NVIDIA GTX 1650 Ti Mobile, I got a 3.37% inference-time speedup over CUDA (97.815 ms vs 94.624 ms).
Note: with nms=True the postprocess time saw a significant speedup (3.15 ms vs 0.43 ms), but postprocessing is not a significant part of the total prediction time.

Benchmark source code:

from ultralytics import YOLO
import cv2


def bench_model(model, num_iterations=1000):
    total_preprocess_times = []
    total_inference_times = []
    total_postprocess_times = []
    for _ in range(num_iterations):
        detections = model.predict(image, verbose=False)
        # Measure and record times for each stage
        total_preprocess_times.append(detections[0].speed["preprocess"])
        total_inference_times.append(detections[0].speed["inference"])
        total_postprocess_times.append(detections[0].speed["postprocess"])

    print("AVG Preprocess times:", sum(total_preprocess_times) / len(total_preprocess_times))
    print("AVG Inference times:", sum(total_inference_times) / len(total_inference_times))
    print("AVG Postprocess times:", sum(total_postprocess_times) / len(total_postprocess_times))


image = cv2.imread("debug/images/face_4096.jpg")

model = YOLO('panopticon_models/yolo11x.pt')
print("-------- Testing vanilla model")
bench_model(model)

model = YOLO('panopticon_models/yolo11x.pt')
exported_model = model.export(format="engine")
model = YOLO(exported_model)
print("-------- testing TensorRT model")
bench_model(model)

model = YOLO('panopticon_models/yolo11x.pt')
exported_model = model.export(format="engine", nms=True)
model = YOLO(exported_model)
print("-------- Testing TensorRT model with exported nms")
bench_model(model)

The benchmarking is using an image file. You should be using a numpy array; otherwise you're getting bottlenecked by I/O speed.

And you need to export with int8=True for the largest speedup, or at least half=True. Otherwise you're running inference at FP32, which offers almost no speedup.
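
For reference, a rough sketch of those two export variants; coco8.yaml here is only a placeholder calibration dataset, use something representative of your data:

from ultralytics import YOLO

model = YOLO("yolo11x.pt")

# FP16 engine: near-identical accuracy, usually the easy win
model.export(format="engine", half=True)

# INT8 engine: largest speedup, but requires a calibration dataset
model.export(format="engine", int8=True, data="coco8.yaml")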

I am using a BGR numpy array.

So if I understand correctly, the 4x speedup is due to FP32-to-INT8 quantization? If so, where is the remaining 25% coming from?

Ah okay.

The speedup isn’t guaranteed to be 5x, and it’s not just due to quantization; it also comes from TensorRT’s optimizations. It depends on your GPU and what other bottlenecks you have, but int8 gets you the largest boost.

In your code, the PyTorch inference uses minimum rectangle padding, while TensorRT uses the full imgsz, which means TensorRT runs a larger input size than PyTorch, so it’s not a one-to-one comparison. You can export with dynamic=True to have the TensorRT model also use minimum rectangle padding.
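
To make the padding difference concrete, here is a small illustration using the LetterBox preprocessing helper (the 1080x1920 frame is just an example, and I'm assuming the helper lives at ultralytics.data.augment as in current releases):

import numpy as np
from ultralytics.data.augment import LetterBox

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # example non-square input

# auto=True -> minimum rectangle padding (what PyTorch predict uses)
print(LetterBox(new_shape=(640, 640), auto=True)(image=frame).shape)   # (384, 640, 3)

# auto=False -> padded to the full imgsz (what a static TensorRT engine receives)
print(LetterBox(new_shape=(640, 640), auto=False)(image=frame).shape)  # (640, 640, 3)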

You’re also using a laptop GPU, which would get throttled quickly.

  1. I understand that the 5x speedup is not guaranteed, but 3% is not what I was expecting. Is there a reason those TensorRT optimizations do not apply in my case? Is there something I can do to obtain them?
  2. I explicitly set imgsz=640 and observed the same behavior.
  3. From what I understand, dynamic=True adds the ability for the exported model to accept different input image sizes, which negates all TensorRT optimizations in my case.
  4. No throttling is happening during those tests; I have measured it.

Updated tests:

from ultralytics import YOLO
import cv2


def bench_model(model, num_iterations=1000):
    total_preprocess_times = []
    total_inference_times = []
    total_postprocess_times = []
    for _ in range(num_iterations):
        detections = model.predict(image, imgsz=640, verbose=False)
        # Measure and record times for each stage
        total_preprocess_times.append(detections[0].speed["preprocess"])
        total_inference_times.append(detections[0].speed["inference"])
        total_postprocess_times.append(detections[0].speed["postprocess"])

    print("AVG Preprocess times:", sum(total_preprocess_times) / len(total_preprocess_times))
    print("AVG Inference times:", sum(total_inference_times) / len(total_inference_times))
    print("AVG Postprocess times:", sum(total_postprocess_times) / len(total_postprocess_times))


image = cv2.imread("debug/images/face_4096.jpg")

model = YOLO('panopticon_models/yolo11x.pt')
print("-------- Testing vanilla model")
bench_model(model)

model = YOLO('panopticon_models/yolo11x.pt')
exported_model = model.export(format="engine", imgsz=640)
model = YOLO(exported_model)
print("-------- testing TensorRT model")
bench_model(model)

model = YOLO('panopticon_models/yolo11x.pt')
exported_model = model.export(format="engine", imgsz=640, dynamic=True)
model = YOLO(exported_model)
print("-------- testing dynamic TensorRT model")
bench_model(model)

model = YOLO('panopticon_models/yolo11x.pt')
exported_model = model.export(format="engine", imgsz=640, nms=True)
model = YOLO(exported_model)
print("-------- Testing TensorRT model with exported nms")
bench_model(model)
  1. Since you’re inferencing with FP32, there would only be a small speedup. And like I mentioned, TensorRT is using the full input size while PyTorch is using minimum rectangle padding.
  2. That doesn’t disable minimum rectangle padding for PyTorch. To disable it, you need to use rect=False.
  3. That’s not true. Alternatively, you can check the verbose logs of the PyTorch inference to find the exact minimum-padded input size it’s using, and use that as imgsz during export (see the sketch below).
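
Rough sketch of that check (the log format shown in the comment is approximate, and rect=False is the predict-time override mentioned in point 2):

from ultralytics import YOLO
import cv2

model = YOLO("yolo11x.pt")
image = cv2.imread("debug/images/face_4096.jpg")

# verbose=True prints a line like "0: 448x640 1 person, 95.1ms";
# the HxW at the start is the minimum-rectangle-padded size PyTorch actually ran at
model.predict(image, imgsz=640, verbose=True)

# with rect=False, PyTorch pads to the full 640x640, matching a static TensorRT engine
model.predict(image, imgsz=640, rect=False, verbose=True)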

There’s really no reason not to use FP16. There’s almost no difference in accuracy. In fact, Ultralytics validation runs with FP16, and the saved model is also an FP16 checkpoint.
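
If you want to confirm that yourself, a quick sketch for inspecting the saved weights (assuming the usual checkpoint dict layout with a "model" entry; weights_only=False is needed on newer PyTorch to unpickle the full checkpoint, and ultralytics must be installed so the pickled classes resolve):

import torch

ckpt = torch.load("yolo11x.pt", map_location="cpu", weights_only=False)
print(next(ckpt["model"].parameters()).dtype)  # expected: torch.float16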

I am focusing on the TensorRT-specific speedup.

It is expected for TensorRT to provide a higher speedup with FP16 vs CUDA FP16. Is TensorRT just not as good at optimizing FP32 YOLO models?

FP16 is always going to be faster. The default for PyTorch models is FP16, so when exporting to TensorRT without converting to FP16, you end up with a significantly larger model that runs FP32. IIRC, without modifying the source code you won’t be able to test against a PyTorch FP32 model for inference, so whenever you test the PyTorch model, you’re using the FP16 weights.

To clarify:

# This launches an FP16 PyTorch model
model = YOLO('yolo11x.pt')

# This launches an FP32 TensorRT model
model = YOLO('yolo11x.pt')
exported_model = model.export(format="engine") 
model = YOLO(exported_model)

# This launches an FP16 TensorRT model
model = YOLO('yolo11x.pt')
exported_model = model.export(format="engine", half=True) 
model = YOLO(exported_model)

Because if that is the case, this is crazy: getting the same performance from FP32 as from FP16 is basically a 2x speedup.

That is correct.

Based on the data-type correction you provided, here are my updated/corrected measurements:

Framework                  Execution Time (ms)    Speedup relative to CUDA
CUDA                       96.289                 1.00x
TensorRT                   42.694                 2.26x
TensorRT (dynamic=True)    43.069                 2.24x

A 2.25x speedup is pretty good.

From what I have read, TensorRT offers further optimizations for NVIDIA RTX GPUs, which I cannot test at the moment.

Updated benchmark:

from ultralytics import YOLO
import cv2


def benchmark_ultralytics_model(model, num_iterations=1000):
    total_preprocess_times = []
    total_inference_times = []
    total_postprocess_times = []
    for _ in range(num_iterations):
        detections = model.predict(image, imgsz=640, verbose=False)
        # Measure and record times for each stage
        total_preprocess_times.append(detections[0].speed["preprocess"])
        total_inference_times.append(detections[0].speed["inference"])
        total_postprocess_times.append(detections[0].speed["postprocess"])

    print("AVG Preprocess times:", sum(total_preprocess_times) / len(total_preprocess_times))
    print("AVG Inference times:", sum(total_inference_times) / len(total_inference_times))
    print("AVG Postprocess times:", sum(total_postprocess_times) / len(total_postprocess_times))


image = cv2.imread("debug/images/face_4096.jpg")

model = YOLO('yolo11x.pt')
print("-------- Testing vanilla model")
benchmark_ultralytics_model(model)

model = YOLO('yolo11x.pt')
exported_model = model.export(format="engine", imgsz=640, half=True)
model = YOLO(exported_model)
print("-------- testing TensorRT model")
benchmark_ultralytics_model(model)

model = YOLO('yolo11x.pt')
exported_model = model.export(format="engine", imgsz=640, dynamic=True, half=True)
model = YOLO(exported_model)
print("-------- testing dynamic TensorRT model")
benchmark_ultralytics_model(model)

model = YOLO('yolo11x.pt')
exported_model = model.export(format="engine", imgsz=640, nms=True, half=True)
model = YOLO(exported_model)
print("-------- Testing TensorRT model with exported nms")
benchmark_ultralytics_model(model)

Nice work—those FP16 TensorRT results look spot on for a GTX 1650 Ti Mobile. ~2.25x over PyTorch FP16 is typical on that GPU class. The “up to 5x” claims come from INT8 + TensorRT layer fusion on newer RTX/Ampere/Ada or datacenter GPUs (or Jetson) with larger batches; see the performance tables in our TensorRT export guide for context.

If you want to push further on your setup, try an INT8 export with proper calibration and a larger batch if memory allows:

from ultralytics import YOLO
YOLO('yolo11x.pt').export(
    format='engine',
    int8=True,
    data='your_dataset.yaml',  # representative val images; aim for ~500+ if possible
    batch=8,                   # calibrates at 2x this internally
    dynamic=True,              # enabled automatically for INT8
    workspace=4,               # GiB; adjust if needed
    nms=True                   # GPU NMS to cut postprocess time
)

Key tips:

  • INT8 gains depend heavily on calibration quality and using the same target device for export and inference. Details are in the INT8 section of the TensorRT export guide.
  • Batch >1 helps TensorRT optimize throughput; single-image benchmarking can understate gains.
  • On 16‑series Turing (no Tensor Cores), FP16/INT8 speedups are smaller vs RTX/Ampere. Smaller YOLO11 variants (e.g., n/s) can also improve effective throughput on limited GPUs.

For apples-to-apples comparisons across formats, the built-in Benchmark mode can help streamline tests. Relevant docs: the TensorRT export guide and Benchmark mode.
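
For instance, a minimal Benchmark mode run could look like this (coco8.yaml is just a placeholder dataset; point data at something representative and set device for your GPU):

from ultralytics.utils.benchmarks import benchmark

# Exports yolo11x.pt to the supported formats and reports size, mAP and speed for each
benchmark(model="yolo11x.pt", data="coco8.yaml", imgsz=640, half=True, device=0)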
