YOLOE inference very slow on Jetson with TensorRT

I exported YOLOE with a visual prompt to a TensorRT engine file with INT8 quantization, imgsz=640, and nms=True, then ran inference on a Jetson Orin NX 16GB. All steps follow YOLOE: Real-Time Seeing Anything - Ultralytics YOLO Docs.

The input image size is 1280 x 720. With the segmentation head, the total time to process an image (preprocessing, inference, postprocessing) is about 1 s, which is very slow compared with other models such as YOLOv8 with TensorRT on Jetson. What might be the reason the TensorRT model runs this slowly?

Which YOLOE model did you export?

You can convert the YOLOE model to a detection-only model for faster inference if you don't need the masks.
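A minimal sketch of that conversion (the same pattern as the export example further down in this thread; the bus.jpg prompt box is only an example):

import numpy as np

from ultralytics import YOLOE
from ultralytics.models.yolo.yoloe import YOLOEVPDetectPredictor

# Load the -seg weights into the detection-only architecture.
model = YOLOE("yoloe-11s.yaml").load("yoloe-11s-seg.pt")

# Run the visual-prompt step once with the detect predictor, then export.
visual_prompts = dict(
    bboxes=np.array([[221.52, 405.8, 344.98, 857.54]]),  # example box enclosing a person
    cls=np.array([0]),
)
model.predict(
    "ultralytics/assets/bus.jpg",
    visual_prompts=visual_prompts,
    predictor=YOLOEVPDetectPredictor,
)
exported_path = model.export(format="engine", int8=True)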

I exported yoloe-11s-seg and also tried yoloe-11l-seg; the speed is the same.

I also tried detection, but the results are not acceptable for my use case (too many bounding boxes all over the image), and the speed is not faster. I suspect I am doing something wrong.

If I use the TensorRT Python API directly (instead of the Ultralytics API) for inference, will that speed things up?

Can you post the code you used for export and inference?

Export code:

import numpy as np

from ultralytics import YOLOE
from ultralytics.models.yolo.yoloe import YOLOEVPSegPredictor

yolo_model = "yoloe-11s-seg.pt"
model = YOLOE(yolo_model)

# Define visual prompts using bounding boxes and their corresponding class IDs.
# Each box highlights an example of the object you want the model to detect.
visual_prompts = dict(
    bboxes=np.array(
        [
            [221.52, 405.8, 344.98, 857.54],  # Box enclosing person
            [120, 425, 160, 445],  # Box enclosing glasses
        ],
    ),
    cls=np.array(
        [
            0,  # ID to be assigned for person
            1,  # ID to be assigned for glasses
        ]
    ),
)

# Run inference on an image, using the provided visual prompts as guidance
results = model.predict(
    "ultralytics/assets/bus.jpg",
    visual_prompts=visual_prompts,
    predictor=YOLOEVPSegPredictor,
)

exported_path = model.export(
    format="engine",
    int8=True,
)

Inference code:

from ultralytics import YOLOE

model = YOLOE("yoloe-11s-seg.engine")

# Run inference on an image, using the provided visual prompts as guidance
results = model(
    "ultralytics/assets/bus.jpg",
    verbose=False, 
)

In the flamegraph, the warmup and forward calls take a long time, about 500 ms.

I also wrote a TensorRT API wrapper, but its output is all zeros:

import os

import cv2
import numpy as np
import pycuda.autoinit  # noqa: F401 - initializes the CUDA context
import pycuda.driver as cuda
import tensorrt as trt


class TRTWrapper:
    """
    TensorRT 10+ wrapper for YOLOE engine.
    Handles preprocessing, execution, and postprocessing.
    """
    def __init__(self, engine_path: str):
        if not os.path.exists(engine_path):
            raise FileNotFoundError(f"Engine file not found: {engine_path}")

        self.logger = trt.Logger(trt.Logger.WARNING)

        # Load engine
        with open(engine_path, "rb") as f, trt.Runtime(self.logger) as runtime:
            self.engine = runtime.deserialize_cuda_engine(f.read())

        self.context = self.engine.create_execution_context()

        # Allocate device memory
        self.inputs, self.outputs, self.bindings = [], [], []
        self.input_tensor_name = None
        self.output_tensor_name = None
        self.input_shape = None

        # Iterate over bindings
        for binding_name in self.engine:  # engine is iterable over tensor names
            shape = self.engine.get_tensor_shape(binding_name)
            dtype = trt.nptype(self.engine.get_tensor_dtype(binding_name))
            size = trt.volume(shape)
            device_mem = cuda.mem_alloc(size * dtype().nbytes)
            self.bindings.append(int(device_mem))

            if self.engine.get_tensor_mode(binding_name) == trt.TensorIOMode.INPUT:
                self.inputs.append(device_mem)
                self.input_tensor_name = binding_name
                self.input_shape = shape
            else:
                self.outputs.append(device_mem)
                self.output_tensor_name = binding_name

        if self.input_shape is None or self.input_tensor_name is None:
            raise RuntimeError("No input binding found in engine.")
        if self.output_tensor_name is None:
            raise RuntimeError("No output binding found in engine.")

        # CUDA stream
        self.stream = cuda.Stream()

    def preprocess(self, img: np.ndarray) -> np.ndarray:
        """Resize, BGR->RGB, normalize, CHW."""
        _, h, w = self.input_shape[-3:]
        # img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (w, h))
        # img = img.astype(np.float32) / 255.0
        img = np.transpose(img, (2, 0, 1))  # HWC -> CHW
        return np.ascontiguousarray(img)

    def postprocess(self, output: np.ndarray, conf_thresh: float, original_shape):
        """
        Converts network output to masks, bounding boxes, and class IDs.
        
        Args:
            output: np.ndarray, shape (batch, channels, H, W)
            conf_thresh: float, threshold for confidence
            original_shape: tuple, (H_orig, W_orig, C)
            
        Returns:
            List of dicts per mask with keys: 'bbox', 'conf', 'class_id', 'mask'
        """
        h_orig, w_orig = original_shape[:2]
        batch, num_channels, H, W = output.shape

        results = []

        for b in range(batch):
            pred = output[b]  # shape: (num_channels, H, W)
            
            # Assume:
            # channels 0..N-2 = feature/mask channels
            # last channel (or some channel) = confidence map
            # You can adjust depending on your model
            conf_map = pred[4]  # shape (H, W), example confidence channel

            if conf_map.max() < conf_thresh:
                continue  # skip low-confidence maps

            # Example: generate mask(s)
            mask_channels = pred[5:]  # remaining channels are masks
            for i, mask_map in enumerate(mask_channels):
                mask = (mask_map >= 0.5).astype(np.uint8)  # threshold mask
                if mask.sum() == 0:
                    continue  # skip empty masks

                # Compute bounding box from mask
                ys, xs = np.where(mask)
                y1, y2 = ys.min(), ys.max()
                x1, x2 = xs.min(), xs.max()

                # Rescale bbox to original image
                scale_y = h_orig / H
                scale_x = w_orig / W
                bbox = [
                    int(x1 * scale_x),
                    int(y1 * scale_y),
                    int(x2 * scale_x),
                    int(y2 * scale_y),
                ]

                # Aggregate confidence for the mask
                conf_value = conf_map[y1:y2+1, x1:x2+1].mean()

                results.append({
                    "bbox": bbox,
                    "conf": float(conf_value),
                    "class_id": i,  # or assign proper class
                    "mask": cv2.resize(mask, (w_orig, h_orig), interpolation=cv2.INTER_NEAREST)
                })

        return results


    def __call__(self, image: np.ndarray, conf: float = 0.1, verbose: bool = False):
        # Preprocess
        img = self.preprocess(image)
        if len(self.input_shape) == 4:
            img = np.expand_dims(img, axis=0)

        # Copy input to device
        cuda.memcpy_htod_async(self.inputs[0], img, self.stream)

        # Allocate output buffers
        output_buffers = {}
        for name in [self.output_tensor_name]:
            shape = tuple(self.engine.get_tensor_shape(name))  # <-- convert Dims to tuple
            dtype = trt.nptype(self.engine.get_tensor_dtype(name))
            host_arr = cuda.pagelocked_empty(shape, dtype=dtype)
            device_arr = cuda.mem_alloc(host_arr.nbytes)
            self.context.set_tensor_address(name, int(device_arr))
            output_buffers[name] = (host_arr, device_arr)

        # Run inference
        self.context.execute_async_v3(stream_handle=self.stream.handle)

        # Copy outputs to host
        for name, (host_arr, device_arr) in output_buffers.items():
            cuda.memcpy_dtoh_async(host_arr, device_arr, self.stream)

        # Synchronize
        self.stream.synchronize()

        # Postprocess
        output_name = self.output_tensor_name
        output_arr = output_buffers[output_name][0].reshape(tuple(self.engine.get_tensor_shape(output_name)))
        return self.postprocess(output_arr, conf_thresh=conf, original_shape=image.shape)

I am using TensorRT 10.13.2 and CUDA 11.4.

Is there a way to disable the warmup step to make inference faster?

There are also the find and load steps in the flamegraph that I am not sure can be eliminated.

Is it possible that, when running the YOLO TensorRT engine on Jetson, it somehow falls back to the CPU instead of using CUDA? I have checked that my PyTorch has CUDA available, but I don't know how to check whether TensorRT is using the CPU or the GPU.

The first inference is always slow. Did you try running multiple inferences? Also, you're passing a JPG file path, which adds decoding overhead. You should be testing with an already-loaded image.
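For example, a minimal timing sketch along those lines (test.jpg stands in for one of your own frames; the engine name matches the one exported above):

import time

import cv2
from ultralytics import YOLOE

model = YOLOE("yoloe-11s-seg.engine")
img = cv2.imread("test.jpg")  # already-decoded BGR array, so no JPEG decode inside the loop

model(img, verbose=False)  # first call includes warmup and engine setup

times = []
for _ in range(20):
    t0 = time.perf_counter()
    model(img, verbose=False)
    times.append(time.perf_counter() - t0)

print(f"mean warm latency: {1000 * sum(times) / len(times):.1f} ms")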

I tried multiple inferences; subsequent runs are faster than the first, but they still take about 600 ms (the first takes 1 s).

In my test I am using a loaded NumPy image instead of a JPEG; the JPEG path was only used for illustration.

If I write my own TensorRT Python wrapper to handle preprocessing and postprocessing, will it be faster than letting Ultralytics handle the TensorRT model?

Maybe slightly

Did you try exporting like this:

import numpy as np

from ultralytics import YOLOE
from ultralytics.models.yolo.yoloe import YOLOEVPDetectPredictor

yolo_model = "yoloe-11s-seg.pt"
model = YOLOE("yoloe-11s.yaml").load(yolo_model)

# Define visual prompts using bounding boxes and their corresponding class IDs.
# Each box highlights an example of the object you want the model to detect.
visual_prompts = dict(
    bboxes=np.array(
        [
            [221.52, 405.8, 344.98, 857.54],  # Box enclosing person
            [120, 425, 160, 445],  # Box enclosing glasses
        ],
    ),
    cls=np.array(
        [
            0,  # ID to be assigned for person
            1,  # ID to be assigned for glasses
        ]
    ),
)

# Run inference on an image, using the provided visual prompts as guidance
results = model.predict(
    "ultralytics/assets/bus.jpg",
    visual_prompts=visual_prompts,
    predictor=YOLOEVPDetectPredictor,
)

exported_path = model.export(
    format="engine",
    int8=True,
)

I never tried using yoloe-11s.yaml. Will it make inference faster?

I tried YOLOEVPDetectPredictor, but the results are too poor (bounding boxes in the wrong places even with FP32), so I have to use the segmentation head.

Yes. YOLOEVPDetectPredictor is only used when you convert the segmentation model to a detection model as shown.

Even on an old 2-core CPU, I get less than 50 ms inference time.

image 1/2 /teamspace/studios/this_studio/ultralytics/ultralytics/assets/bus.jpg: 640x640 3 object0s, 45.1ms
image 2/2 /teamspace/studios/this_studio/ultralytics/ultralytics/assets/zidane.jpg: 640x640 2 object0s, 37.6ms
Speed: 2.1ms preprocess, 41.3ms inference, 1.2ms postprocess per image at shape (1, 3, 640, 640)

So I don’t understand how you are getting 600ms.

Can you post Ultralytics verbose logs from your inference? Maybe your postprocessing is taking a long time.
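For example, a one-line change to the inference call (bus.jpg stands in for your actual frame):

from ultralytics import YOLOE

model = YOLOE("yoloe-11s-seg.engine")

# verbose=True prints the per-image line and the Speed: preprocess/inference/postprocess summary.
results = model("ultralytics/assets/bus.jpg", verbose=True)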

In my case, the log shows an inference time of 14 ms, but in the flamegraph the forward call in engine/model.py takes ~200 ms. I am not able to share the entire flamegraph due to privacy concerns.

[09/02/2025-09:51:49] [TRT] [I] Loaded engine size: 11 MiB
[09/02/2025-09:51:50] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +534, GPU +509, now: CPU 1100, GPU 9107 (MiB)
[09/02/2025-09:51:50] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +83, GPU +75, now: CPU 1183, GPU 9182 (MiB)
[09/02/2025-09:51:50] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +10, now: CPU 0, GPU 10 (MiB)
[09/02/2025-09:51:50] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1172, GPU 9173 (MiB)
[09/02/2025-09:51:50] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1172, GPU 9173 (MiB)
[09/02/2025-09:51:50] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +14, now: CPU 0, GPU 24 (MiB)

0: 640x640 2 object0s, 14.0ms
Speed: 8.6ms preprocess, 14.0ms inference, 20.1ms postprocess per image at shape (1, 3, 640, 640)

Can you try exporting with nms=True?
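For reference, a minimal export sketch with NMS baked into the engine (int8 and imgsz=640 taken from the original post; nms=True is the only addition):

exported_path = model.export(
    format="engine",
    int8=True,
    nms=True,
    imgsz=640,
)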

I tried both nms=True and nms=False and did not notice any significant speed difference.

You can try running inference like this:
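For instance, a minimal sketch along those lines, passing an already-decoded BGR array with an explicit size and confidence threshold (frame.jpg, imgsz=640, and conf=0.25 are assumptions):

import cv2
from ultralytics import YOLOE

model = YOLOE("yoloe-11s-seg.engine")
frame = cv2.imread("frame.jpg")  # preloaded numpy array instead of a file path

# A higher conf threshold also means fewer boxes/masks to handle in postprocessing.
results = model(frame, imgsz=640, conf=0.25, verbose=False)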

But your preprocessing and postprocessing are taking a long time. Maybe your Jetson is throttling.
