YOLOE inference very slow on Jetson with TensorRT

Thanks for the link, that looks very helpful

Regarding Jetson throttling: I already set the power mode to MAXN, GPU usage ranges from 20% to 100%, and CPU usage is about 20% per core.

I tried

    model = YOLOE(yolo_model)
    model.predictor.preprocess([image])

but it says predictor is None.

I also tried

    from ultralytics.models.yolo.segment.predict import SegmentationPredictor

    predictor = SegmentationPredictor()
    predictor.setup_model(model.model)
    preprocessed = predictor.preprocess([image])

but it still says preprocess is None.

What’s the correct way to use the predictor for YOLOE, or does the predictor only work with YOLO models?

The predictor instance is created automatically the first time inference runs. If you look at the code that Toxite linked to, you should see:

    # Run it at least once to create the predictor.
    model(save=False, show=False, conf=0.01)

The predictor class isn’t really intended to be initialized manually.
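For example, a minimal sketch (the weights file and image path here are placeholders; this assumes a YOLOE segmentation checkpoint and any local image):

    import cv2
    from ultralytics import YOLOE

    model = YOLOE("yoloe-11s-seg.pt")  # placeholder weights; use your own checkpoint or .engine

    # The first inference call builds model.predictor internally.
    model("bus.jpg", save=False, show=False, conf=0.01)

    # Now the predictor exists, so its preprocess() can be called directly.
    img = cv2.imread("bus.jpg")
    batch = model.predictor.preprocess([img])  # letterboxed, normalized BCHW tensor
    print(batch.shape)

After that first call, model.predictor stays populated for subsequent inference.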

It’s going to be difficult to help troubleshoot what you’re seeing, since the problem is that you can’t share the flame graph. If inference is reported as 14 ms but you say the flame graph shows 200 ms, then without seeing the flame graph we can only guess at the cause.

That said, if the 14 ms meets your requirements and you’re not experiencing any issues, then why bother with the flame graph? There are lots of reasons why things could appear slower in the flame graph, so unless you’re still seeing slow inference, it’s probably best to ignore it.

Thanks for the reply. The Ultralytics log says the inference time is 14 ms, and I’m satisfied with that.

My problem at hand is that, based on the flame graph, the warmup and forward functions in engine/model.py take a long time. For my use case I want to reduce the segmentation time as much as possible; ideally the flame graph should show inference (warmup + forward) taking ~14 ms, not 200 ms.

I understand that not being able to share the entire flame graph is an issue; maybe I can put together a minimal example to get around the privacy concern.

But my priority for now is to use the TensorRT C++ API to run the engine model for segmentation. I would assume the C++ API will speed things up compared with Python.


I don’t think you’ll get inference + warmup to be that fast. The warmup cycle is only supposed to run once:

that’s because it adds a few quick inference calls on the model

Since this should only occur once, not on every run, it should only impact the inference time once. Even the official TensorRT dev docs recommend using warmup, and they even provide a time/iteration limit for the warmup cycle. Another area you might want to investigate is the use of Dynamic Axes/Dimensions for your model. If you don’t need them, then setting fixed dimensions can help, as dynamic values will likely add overhead to each call.
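As a rough sanity check, you could try a timing sketch along these lines (file names are placeholders; the full call also includes pre- and post-processing, so steady-state numbers will sit a bit above the bare inference time):

    import time
    from ultralytics import YOLOE

    model = YOLOE("yoloe-11s-seg.engine")  # placeholder: your exported TensorRT engine

    # The first call pays the one-time warmup cost on top of inference.
    t0 = time.perf_counter()
    model("bus.jpg", save=False, show=False)
    print(f"first call (incl. warmup): {(time.perf_counter() - t0) * 1e3:.1f} ms")

    # Subsequent calls should settle near the logged inference time.
    for _ in range(5):
        t0 = time.perf_counter()
        model("bus.jpg", save=False, show=False)
        print(f"steady-state call: {(time.perf_counter() - t0) * 1e3:.1f} ms")

If only the first call is slow, the flame graph is most likely just capturing that one-time warmup.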


Thanks for the helpful tips. I’d like to clarify the dynamic axes point: I’m following Model Export with Ultralytics YOLO - Ultralytics YOLO Docs when exporting the models and set dynamic to False. Here’s the code:

    exported_path = model.export(
        format="engine",
        int8=True,
        dynamic=False,
        nms=False,
        imgsz=640, 
        simplify=True,
        device="cuda",
    )

That makes sense. I brought up the dynamic axes because, when I first integrated the INT8 export process, I had set dynamic=True by default since there were issues when it wasn’t set. Since then it’s been updated to allow int8=True with dynamic=False, so it’s no longer forced by default and shouldn’t be a concern since you’re disabling dynamic axes.
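If you want to double-check that the exported engine really ended up with fixed input dimensions, here’s a rough sketch using the TensorRT Python API (the engine path is a placeholder, and the metadata-skipping step assumes the Ultralytics export layout, which may differ between versions):

    import tensorrt as trt

    engine_path = "yoloe-11s-seg.engine"  # placeholder path to your exported engine

    logger = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, "rb") as f:
        # Ultralytics-exported engines prepend a small JSON metadata block
        # (4-byte length + JSON) before the serialized engine; skip it here.
        meta_len = int.from_bytes(f.read(4), byteorder="little")
        f.read(meta_len)  # metadata (imgsz, names, etc.), not needed for this check
        engine_bytes = f.read()

    runtime = trt.Runtime(logger)
    engine = runtime.deserialize_cuda_engine(engine_bytes)

    # With dynamic=False every dimension should be concrete, e.g. (1, 3, 640, 640);
    # a -1 in any input dimension means a dynamic axis survived the export.
    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        print(name, engine.get_tensor_mode(name), engine.get_tensor_shape(name))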