YOLOE inference very slow on Jetson with TensorRT

I exported YOLOE with a visual prompt to a TensorRT engine file with INT8 quantization, imgsz=640, and nms=True, then ran inference on a Jetson Orin NX 16GB. All steps follow YOLOE: Real-Time Seeing Anything - Ultralytics YOLO Docs.

The input image size is 1280 x 720. With the segmentation head, the total time to process an image (preprocessing, inference, postprocessing) is about 1 s, which is very slow compared with other models such as YOLOv8 with TensorRT on Jetson. What might be the reason the TensorRT model runs this slowly?

Which YOLOE model did you export?

You can convert the YOLOE model to a detection-only model for faster inference if you don't need the masks.
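A minimal sketch of that conversion (the same pattern as the export example further down in this thread; the bus.jpg prompt box is only an example):

import numpy as np

from ultralytics import YOLOE
from ultralytics.models.yolo.yoloe import YOLOEVPDetectPredictor

# Load the -seg weights into the detection-only architecture.
model = YOLOE("yoloe-11s.yaml").load("yoloe-11s-seg.pt")

# Run the visual-prompt step once with the detect predictor, then export.
visual_prompts = dict(
    bboxes=np.array([[221.52, 405.8, 344.98, 857.54]]),  # example box enclosing a person
    cls=np.array([0]),
)
model.predict(
    "ultralytics/assets/bus.jpg",
    visual_prompts=visual_prompts,
    predictor=YOLOEVPDetectPredictor,
)
exported_path = model.export(format="engine", int8=True)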

I exported yoloe-11s-seg and also tried yoloe-11l-seg; the speed is the same.

I also tried detection, but the results are not acceptable for my use case (too many bounding boxes all over the image), and the speed is not faster. I suspect I am doing something wrong.

If I use the TensorRT Python API directly (instead of the Ultralytics API) for inference, will that speed things up?

Can you post the code you used for export and inference?

Export code:

import numpy as np

from ultralytics import YOLOE
from ultralytics.models.yolo.yoloe import YOLOEVPSegPredictor

yolo_model = "yoloe-11s-seg.pt"
model = YOLOE(yolo_model)

# Define visual prompts using bounding boxes and their corresponding class IDs.
# Each box highlights an example of the object you want the model to detect.
visual_prompts = dict(
    bboxes=np.array(
        [
            [221.52, 405.8, 344.98, 857.54],  # Box enclosing person
            [120, 425, 160, 445],  # Box enclosing glasses
        ],
    ),
    cls=np.array(
        [
            0,  # ID to be assigned for person
            1,  # ID to be assigned for glasses
        ]
    ),
)

# Run inference on an image, using the provided visual prompts as guidance
results = model.predict(
    "ultralytics/assets/bus.jpg",
    visual_prompts=visual_prompts,
    predictor=YOLOEVPSegPredictor,
)

exported_path = model.export(
    format="engine",
    int8=True,
)

Inference code:

from ultralytics import YOLOE

model = YOLOE("yoloe-11s-seg.engine")

# Run inference on an image, using the provided visual prompts as guidance
results = model(
    "ultralytics/assets/bus.jpg",
    verbose=False, 
)

In the flamegraph, the warmup and forward calls take a long time, about 500 ms.

I also wrote a TensorRT API wrapper, but its output is all zeros:

import os

import cv2
import numpy as np
import pycuda.autoinit  # noqa: F401 - initializes the CUDA context
import pycuda.driver as cuda
import tensorrt as trt


class TRTWrapper:
    """
    TensorRT 10+ wrapper for YOLOE engine.
    Handles preprocessing, execution, and postprocessing.
    """
    def __init__(self, engine_path: str):
        if not os.path.exists(engine_path):
            raise FileNotFoundError(f"Engine file not found: {engine_path}")

        self.logger = trt.Logger(trt.Logger.WARNING)

        # Load engine
        with open(engine_path, "rb") as f, trt.Runtime(self.logger) as runtime:
            self.engine = runtime.deserialize_cuda_engine(f.read())

        self.context = self.engine.create_execution_context()

        # Allocate device memory
        self.inputs, self.outputs, self.bindings = [], [], []
        self.input_tensor_name = None
        self.output_tensor_name = None
        self.input_shape = None

        # Iterate over bindings
        for binding_name in self.engine:  # engine is iterable over tensor names
            shape = self.engine.get_tensor_shape(binding_name)
            dtype = trt.nptype(self.engine.get_tensor_dtype(binding_name))
            size = trt.volume(shape)
            device_mem = cuda.mem_alloc(size * dtype().nbytes)
            self.bindings.append(int(device_mem))

            if self.engine.get_tensor_mode(binding_name) == trt.TensorIOMode.INPUT:
                self.inputs.append(device_mem)
                self.input_tensor_name = binding_name
                self.input_shape = shape
            else:
                self.outputs.append(device_mem)
                self.output_tensor_name = binding_name

        if self.input_shape is None or self.input_tensor_name is None:
            raise RuntimeError("No input binding found in engine.")
        if self.output_tensor_name is None:
            raise RuntimeError("No output binding found in engine.")

        # CUDA stream
        self.stream = cuda.Stream()

    def preprocess(self, img: np.ndarray) -> np.ndarray:
        """Resize, BGR->RGB, normalize, CHW."""
        _, h, w = self.input_shape[-3:]
        # img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (w, h))
        # img = img.astype(np.float32) / 255.0
        img = np.transpose(img, (2, 0, 1))  # HWC -> CHW
        return np.ascontiguousarray(img)

    def postprocess(self, output: np.ndarray, conf_thresh: float, original_shape):
        """
        Converts network output to masks, bounding boxes, and class IDs.
        
        Args:
            output: np.ndarray, shape (batch, channels, H, W)
            conf_thresh: float, threshold for confidence
            original_shape: tuple, (H_orig, W_orig, C)
            
        Returns:
            List of dicts per mask with keys: 'bbox', 'conf', 'class_id', 'mask'
        """
        h_orig, w_orig = original_shape[:2]
        batch, num_channels, H, W = output.shape

        results = []

        for b in range(batch):
            pred = output[b]  # shape: (num_channels, H, W)
            
            # Assume:
            # channels 0..N-2 = feature/mask channels
            # last channel (or some channel) = confidence map
            # You can adjust depending on your model
            conf_map = pred[4]  # shape (H, W), example confidence channel

            if conf_map.max() < conf_thresh:
                continue  # skip low-confidence maps

            # Example: generate mask(s)
            mask_channels = pred[5:]  # remaining channels are masks
            for i, mask_map in enumerate(mask_channels):
                mask = (mask_map >= 0.5).astype(np.uint8)  # threshold mask
                if mask.sum() == 0:
                    continue  # skip empty masks

                # Compute bounding box from mask
                ys, xs = np.where(mask)
                y1, y2 = ys.min(), ys.max()
                x1, x2 = xs.min(), xs.max()

                # Rescale bbox to original image
                scale_y = h_orig / H
                scale_x = w_orig / W
                bbox = [
                    int(x1 * scale_x),
                    int(y1 * scale_y),
                    int(x2 * scale_x),
                    int(y2 * scale_y),
                ]

                # Aggregate confidence for the mask
                conf_value = conf_map[y1:y2+1, x1:x2+1].mean()

                results.append({
                    "bbox": bbox,
                    "conf": float(conf_value),
                    "class_id": i,  # or assign proper class
                    "mask": cv2.resize(mask, (w_orig, h_orig), interpolation=cv2.INTER_NEAREST)
                })

        return results


    def __call__(self, image: np.ndarray, conf: float = 0.1, verbose: bool = False):
        # Preprocess
        img = self.preprocess(image)
        if len(self.input_shape) == 4:
            img = np.expand_dims(img, axis=0)

        # Copy input to device
        cuda.memcpy_htod_async(self.inputs[0], img, self.stream)

        # Allocate output buffers
        output_buffers = {}
        for name in [self.output_tensor_name]:
            shape = tuple(self.engine.get_tensor_shape(name))  # <-- convert Dims to tuple
            dtype = trt.nptype(self.engine.get_tensor_dtype(name))
            host_arr = cuda.pagelocked_empty(shape, dtype=dtype)
            device_arr = cuda.mem_alloc(host_arr.nbytes)
            self.context.set_tensor_address(name, int(device_arr))
            output_buffers[name] = (host_arr, device_arr)

        # Run inference
        self.context.execute_async_v3(stream_handle=self.stream.handle)

        # Copy outputs to host
        for name, (host_arr, device_arr) in output_buffers.items():
            cuda.memcpy_dtoh_async(host_arr, device_arr, self.stream)

        # Synchronize
        self.stream.synchronize()

        # Postprocess
        output_name = self.output_tensor_name
        output_arr = output_buffers[output_name][0].reshape(tuple(self.engine.get_tensor_shape(output_name)))
        return self.postprocess(output_arr, conf_thresh=conf, original_shape=image.shape)

I am using TensorRT 10.13.2 and CUDA 11.4.

Is there a way to disable the warmup step to make inference faster?

There are also the find and load steps in the flamegraph that I am not sure can be eliminated.

Is it possible that, when running the YOLO TensorRT engine on Jetson, it somehow falls back to the CPU instead of using CUDA? I have checked that my PyTorch has CUDA available, but I don't know how to check whether TensorRT is using the CPU or the GPU.

The first inference is always slow. Did you try running multiple inferences? Also, you're passing a JPG file path, which adds decoding overhead. You should be testing with an already-loaded image.
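For example, a minimal timing sketch along those lines (test.jpg stands in for one of your own frames; the engine name matches the one exported above):

import time

import cv2
from ultralytics import YOLOE

model = YOLOE("yoloe-11s-seg.engine")
img = cv2.imread("test.jpg")  # already-decoded BGR array, so no JPEG decode inside the loop

model(img, verbose=False)  # first call includes warmup and engine setup

times = []
for _ in range(20):
    t0 = time.perf_counter()
    model(img, verbose=False)
    times.append(time.perf_counter() - t0)

print(f"mean warm latency: {1000 * sum(times) / len(times):.1f} ms")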

I tried multiple inferences; subsequent runs are faster than the first, but they still take about 600 ms (the first takes 1 s).

In my test I am using a loaded NumPy image instead of a JPEG; the JPEG path was only used for illustration.

If I write my own TensorRT Python wrapper to handle preprocessing and postprocessing, will it be faster than letting Ultralytics handle the TensorRT model?

Maybe slightly

Did you try exporting like this:

import numpy as np

from ultralytics import YOLOE
from ultralytics.models.yolo.yoloe import YOLOEVPDetectPredictor

yolo_model = "yoloe-11s-seg.pt"
model = YOLOE("yoloe-11s.yaml").load(yolo_model)

# Define visual prompts using bounding boxes and their corresponding class IDs.
# Each box highlights an example of the object you want the model to detect.
visual_prompts = dict(
    bboxes=np.array(
        [
            [221.52, 405.8, 344.98, 857.54],  # Box enclosing person
            [120, 425, 160, 445],  # Box enclosing glasses
        ],
    ),
    cls=np.array(
        [
            0,  # ID to be assigned for person
            1,  # ID to be assigned for glasses
        ]
    ),
)

# Run inference on an image, using the provided visual prompts as guidance
results = model.predict(
    "ultralytics/assets/bus.jpg",
    visual_prompts=visual_prompts,
    predictor=YOLOEVPDetectPredictor,
)

exported_path = model.export(
    format="engine",
    int8=True,
)

I never tried using yoloe-11s.yaml. Will it make inference faster?

I tried YOLOEVPDetectPredictor, but the results are too poor (bounding boxes in the wrong places even with FP32), so I have to use the segmentation head.

Yes. YOLOEVPDetectPredictor is only used when you convert the segmentation model to a detection model as shown.

Even on an old 2-core CPU, I get less than 50 ms inference time.

image 1/2 /teamspace/studios/this_studio/ultralytics/ultralytics/assets/bus.jpg: 640x640 3 object0s, 45.1ms
image 2/2 /teamspace/studios/this_studio/ultralytics/ultralytics/assets/zidane.jpg: 640x640 2 object0s, 37.6ms
Speed: 2.1ms preprocess, 41.3ms inference, 1.2ms postprocess per image at shape (1, 3, 640, 640)

So I don’t understand how you are getting 600ms.

Can you post Ultralytics verbose logs from your inference? Maybe your postprocessing is taking a long time.
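For example, a one-line change to the inference call (bus.jpg stands in for your actual frame):

from ultralytics import YOLOE

model = YOLOE("yoloe-11s-seg.engine")

# verbose=True prints the per-image line and the Speed: preprocess/inference/postprocess summary.
results = model("ultralytics/assets/bus.jpg", verbose=True)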

In my case, the log shows an inference time of 14 ms, but in the flamegraph the forward call in engine/model.py takes ~200 ms. I am not able to share the entire flamegraph due to privacy concerns.

[09/02/2025-09:51:49] [TRT] [I] Loaded engine size: 11 MiB
[09/02/2025-09:51:50] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +534, GPU +509, now: CPU 1100, GPU 9107 (MiB)
[09/02/2025-09:51:50] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +83, GPU +75, now: CPU 1183, GPU 9182 (MiB)
[09/02/2025-09:51:50] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +10, now: CPU 0, GPU 10 (MiB)
[09/02/2025-09:51:50] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1172, GPU 9173 (MiB)
[09/02/2025-09:51:50] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1172, GPU 9173 (MiB)
[09/02/2025-09:51:50] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +14, now: CPU 0, GPU 24 (MiB)

0: 640x640 2 object0s, 14.0ms
Speed: 8.6ms preprocess, 14.0ms inference, 20.1ms postprocess per image at shape (1, 3, 640, 640)

Can you try exporting with nms=True?
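For reference, a minimal export sketch with NMS baked into the engine (int8 and imgsz=640 taken from the original post; nms=True is the only addition):

exported_path = model.export(
    format="engine",
    int8=True,
    nms=True,
    imgsz=640,
)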

I tried both nms=True and nms=False and did not notice any significant speed difference.

You can try running inference like this:
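For instance, a minimal sketch along those lines, passing an already-decoded BGR array with an explicit size and confidence threshold (frame.jpg, imgsz=640, and conf=0.25 are assumptions):

import cv2
from ultralytics import YOLOE

model = YOLOE("yoloe-11s-seg.engine")
frame = cv2.imread("frame.jpg")  # preloaded numpy array instead of a file path

# A higher conf threshold also means fewer boxes/masks to handle in postprocessing.
results = model(frame, imgsz=640, conf=0.25, verbose=False)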

But your preprocessing and postprocessing are taking a long time. Maybe your Jetson is throttling.
