Inference speed will depend a lot on the CPU. ONNX is one format, but there are numerous export formats available. You'll have to test to see what works best on your specific hardware. Every situation is different, so there's no way for anyone to tell you exactly which format is correct for you; it's something you'll have to experiment with. All the arguments and details for exporting are in the documentation, and you may also want to view the Integrations pages for more details.
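As a minimal sketch, assuming you're using the Ultralytics Python package, an export is a one-liner; the format string and the arguments worth tuning (imgsz, half, etc.) are covered in the Export docs:

```python
from ultralytics import YOLO

# Load the pretrained model and export it to ONNX.
# Other formats (e.g. openvino, engine) follow the same pattern;
# check the Export docs for which arguments each format supports.
model = YOLO("yolo11n.pt")
onnx_path = model.export(format="onnx")  # returns the path to the exported file
print(onnx_path)
```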
Other than testing the various export formats, one of the biggest factors to help reduce inference time is imgsz. The lower the value you can use for imgsz, the faster your inference will be. As an example, I just tested the default imgsz=640 against imgsz=320 using yolo11n.pt (no export).
yolo val model=yolo11n.pt device=cpu data=coco128.yaml imgsz=640
>>> Speed: 0.8ms preprocess, 42.2ms inference, 0.0ms loss, 1.5ms postprocess per image
yolo val model=yolo11n.pt device=cpu data=coco128.yaml imgsz=320
>>> Speed: 0.2ms preprocess, 12.3ms inference, 0.0ms loss, 0.9ms postprocess per image
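If you prefer the Python API, the same comparison can be run there. This is a rough sketch under the assumption that the returned metrics object exposes a speed attribute with the per-image millisecond timings shown above:

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")

# Validate at two image sizes and compare the reported per-image timings.
# metrics.speed is a dict of millisecond timings (preprocess, inference, etc.).
for size in (640, 320):
    metrics = model.val(data="coco128.yaml", device="cpu", imgsz=size)
    print(size, metrics.speed)
```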
Reducing imgsz by half results in roughly a 3x speedup in inference time. It might not be exactly the same on your system due to differences in hardware and environment, and you might not be able to reduce it by half, but finding the smallest acceptable imgsz will help improve inference speeds. Additionally, when exporting a model, including nms=True can help reduce postprocessing time somewhat, so you might want to test exports with that feature enabled.
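A sketch of what that export could look like, again assuming the Ultralytics Python package; nms=True folds non-maximum suppression into the exported model so less work is left for postprocessing. How much it helps varies by format and hardware, so benchmark before and after:

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")

# Export with NMS included in the exported model; test this against a plain
# export on your own hardware to see whether postprocess time actually drops.
model.export(format="onnx", imgsz=320, nms=True)
```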
To get more specific help, you'll need to be clear about your target speeds and hardware. As mentioned previously, it still won't be definitive, but it might help others better understand the situation. As an example, if your target is to achieve <1 ms inference, that might not be possible on a CPU, or at least not with the specific CPU in the system you're using.
Remember, it's not a good idea to ask how to do X when your goal is to accomplish Y. You should share your true intention so others have a better understanding of how to help you. The discussion is very different if your goal is “monitor an assembly line that is moving at 3 parts per second” versus “lower the inference time as much as possible.” The former provides enough context and detail for others to give specific advice, whereas the latter is open-ended and ambiguous.