YOLOE inference very slow on Jetson with TensorRT

Thanks for the link, that looks very helpful

Regarding Jetson throttling: I already set the power mode to MAXN, GPU usage ranges from 20% to 100%, and CPU usage is about 20% per core.

I tried

    model = YOLOE(yolo_model)
    model.predictor.preprocess([image])

but it says predictor is None.

I also tried

    from ultralytics.models.yolo.segment.predict import SegmentationPredictor

    predictor = SegmentationPredictor()
    predictor.setup_model(model.model)
    preprocessed = predictor.preprocess([image])

but it still says preprocess is None.

What’s the correct way to use the predictor for YOLOE, or does the predictor only work with YOLO models?

The predictor instance is created automatically the first time inference runs. If you look at the code that Toxite linked to, you should see:

    # Run it at least once to create the predictor.
    model(save=False, show=False, conf=0.01)

The predictor class isn’t really intended to be initialized manually.
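For example, a minimal sketch (the weights file and image path here are placeholders; this assumes a YOLOE segmentation checkpoint and any local image):

    import cv2
    from ultralytics import YOLOE

    model = YOLOE("yoloe-11s-seg.pt")  # placeholder weights; use your own checkpoint or .engine

    # The first inference call builds model.predictor internally.
    model("bus.jpg", save=False, show=False, conf=0.01)

    # Now the predictor exists, so its preprocess() can be called directly.
    img = cv2.imread("bus.jpg")
    batch = model.predictor.preprocess([img])  # letterboxed, normalized BCHW tensor
    print(batch.shape)

After that first call, model.predictor stays populated for subsequent inference.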

It’s going to be difficult to help troubleshoot what you’re seeing, since the problem is that you can’t share the flame graph. If inference is reported as 14 ms but you say the flame graph shows 200 ms, then without seeing the flame graph we can only guess at the cause.

That said, if the 14 ms meets your requirements and you’re not experiencing any issues, then why bother with the flame graph? There are lots of reasons why things could appear slower in the flame graph, so unless you’re still seeing slow inference, it’s probably best to ignore it.

Thanks for the reply. The Ultralytics log says the inference time is 14 ms, and I’m satisfied with that.

My problem at hand is that, based on the flame graph, the warmup and forward functions in engine/model.py take a long time. For my use case I want to reduce the segmentation time as much as possible; ideally the flame graph should show inference (warmup + forward) taking ~14 ms, not 200 ms.

I understand that not being able to share the entire flame graph is an issue; maybe I can put together a minimal example to get around the privacy concern.

But my priority for now is to use the TensorRT C++ API to run the engine model for segmentation. I would assume the C++ API will speed things up compared with Python.


I don’t think you’ll get inference + warmup to be that fast. The warmup cycle is only supposed to run once:

that’s because it adds a few quick inference calls on the model

Since this should only occur once, not on every run, it should only impact the inference time once. Even the official TensorRT dev docs recommend using warmup, and they even provide a time/iteration limit for the warmup cycle. Another area you might want to investigate is the use of Dynamic Axes/Dimensions for your model. If you don’t need them, then setting fixed dimensions can help, as dynamic values will likely add overhead to each call.
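As a rough sanity check, you could try a timing sketch along these lines (file names are placeholders; the full call also includes pre- and post-processing, so steady-state numbers will sit a bit above the bare inference time):

    import time
    from ultralytics import YOLOE

    model = YOLOE("yoloe-11s-seg.engine")  # placeholder: your exported TensorRT engine

    # The first call pays the one-time warmup cost on top of inference.
    t0 = time.perf_counter()
    model("bus.jpg", save=False, show=False)
    print(f"first call (incl. warmup): {(time.perf_counter() - t0) * 1e3:.1f} ms")

    # Subsequent calls should settle near the logged inference time.
    for _ in range(5):
        t0 = time.perf_counter()
        model("bus.jpg", save=False, show=False)
        print(f"steady-state call: {(time.perf_counter() - t0) * 1e3:.1f} ms")

If only the first call is slow, the flame graph is most likely just capturing that one-time warmup.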


Thanks for the helpful tips. I’d like to clarify the dynamic axes point: I’m following Model Export with Ultralytics YOLO - Ultralytics YOLO Docs when exporting the models and set dynamic to False. Here’s the code:

    exported_path = model.export(
        format="engine",
        int8=True,
        dynamic=False,
        nms=False,
        imgsz=640, 
        simplify=True,
        device="cuda",
    )

That makes sense. I brought up the dynamic axes because, when I first integrated the INT8 export process, I had set dynamic=True by default since there were issues when it wasn’t set. Since then it’s been updated to allow int8=True with dynamic=False, so it’s no longer forced by default and shouldn’t be a concern since you’re disabling dynamic axes.
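If you want to double-check that the exported engine really ended up with fixed input dimensions, here’s a rough sketch using the TensorRT Python API (the engine path is a placeholder, and the metadata-skipping step assumes the Ultralytics export layout, which may differ between versions):

    import tensorrt as trt

    engine_path = "yoloe-11s-seg.engine"  # placeholder path to your exported engine

    logger = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, "rb") as f:
        # Ultralytics-exported engines prepend a small JSON metadata block
        # (4-byte length + JSON) before the serialized engine; skip it here.
        meta_len = int.from_bytes(f.read(4), byteorder="little")
        f.read(meta_len)  # metadata (imgsz, names, etc.), not needed for this check
        engine_bytes = f.read()

    runtime = trt.Runtime(logger)
    engine = runtime.deserialize_cuda_engine(engine_bytes)

    # With dynamic=False every dimension should be concrete, e.g. (1, 3, 640, 640);
    # a -1 in any input dimension means a dynamic axis survived the export.
    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        print(name, engine.get_tensor_mode(name), engine.get_tensor_shape(name))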