Converting YOLO to lighter versions for CPU deployment

My current object detection model (best.pt, based on the YOLO11s architecture) lags too much on CPU, but I have to deploy it on CPU to process real-time footage. Should I try ONNX conversion, or is there a better approach to make the model lighter while keeping accuracy as high as possible?
If ONNX conversion is the better route, is there a guide on ONNX conversion with quantization? I tried:

```python
from ultralytics import YOLO

model = YOLO("best.pt")
model.export(
    format="onnx",    # ONNX target
    imgsz=640,        # freeze input to 640x640 (matches your letterbox)
    half=False,       # you can switch to True if you want FP16 speedups
    dynamic=False,    # keep a static 1x3x640x640 input so session.get_inputs()[0].shape yields ints
    simplify=True,    # fold constants & strip redundant nodes
    opset=17,         # modern opset for widest runtime support
    nms=False,        # you're doing your own NMS in Python
    batch=1,          # max batch baked in (ignored if dynamic=True)
    device=0,         # export on GPU 0 (use device="cpu" to export on CPU)
    project="path",   # output directory
    name="best",      # explicit filename (default is the model name)
)
```

This gave me a bigger ONNX file without quantization. Where can I find the syntax for producing the smaller, lighter versions?

Inference speed will depend a lot on the CPU. ONNX is one format, but there are numerous export formats available.

You’ll have to test to see what works best on your specific hardware. Every situation is different, so there’s no way for anyone to tell you exactly which format is correct; it’s something you’ll have to experiment with. All the arguments and details for exporting are in the documentation, and you may also want to view the Integrations pages for more details.
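If it helps, Ultralytics also ships a benchmark helper that exports the same weights to each supported format and reports size, accuracy, and speed per image, so you can compare formats on your own CPU in a single run. A rough sketch, assuming a recent Ultralytics version and using placeholder model/dataset names:

```python
# Sketch: compare export formats on CPU with the Ultralytics benchmark helper.
# "best.pt" and "coco128.yaml" are placeholders for your own model and data YAML.
from ultralytics.utils.benchmarks import benchmark

# Exports the model to each supported format, then reports file size, mAP,
# and per-image inference time for every format on the chosen device.
benchmark(model="best.pt", data="coco128.yaml", imgsz=640, device="cpu")
```

The table it prints should make it clear which formats are worth pursuing on your hardware.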

Other than testing the various export formats, one of the biggest factors in reducing inference time is imgsz. The lower the value you can use for imgsz, the faster your inference will be. As an example, I just tested the default imgsz=640 against imgsz=320 using yolo11n.pt (no export).

```
yolo val model=yolo11n.pt device=cpu data=coco128.yaml imgsz=640
>>> Speed: 0.8ms preprocess, 42.2ms inference, 0.0ms loss, 1.5ms postprocess per image
```

```
yolo val model=yolo11n.pt device=cpu data=coco128.yaml imgsz=320
>>> Speed: 0.2ms preprocess, 12.3ms inference, 0.0ms loss, 0.9ms postprocess per image
```

Reducing imgsz by half results in roughly a 3x inference speed-up here. It might not be exactly the same on your system due to differences in hardware and environment, and you might not be able to reduce it by half, but finding the smallest acceptable imgsz will help improve inference speed. Additionally, when exporting a model, including nms=True can help reduce postprocessing time somewhat, so you might want to test exports with that option enabled.
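On the quantization part of the question: the export call itself writes FP32 weights, which is why the .onnx comes out larger than best.pt. The usual way to get a smaller, lighter file is post-training quantization applied to the exported model, for example with ONNX Runtime's dynamic quantizer. A rough sketch combining a reduced imgsz, nms=True at export time, and dynamic quantization afterwards (the paths and imgsz=320 are only examples; re-check accuracy after each step):

```python
# Sketch only: smaller input size + NMS baked into the export, followed by
# post-training dynamic quantization of the exported ONNX file.
from ultralytics import YOLO
from onnxruntime.quantization import quantize_dynamic  # pip install onnxruntime

model = YOLO("best.pt")

# Export at a reduced input size with NMS embedded in the graph.
onnx_path = model.export(
    format="onnx",
    imgsz=320,      # smaller input = faster CPU inference; validate accuracy first
    nms=True,       # embed NMS so Python postprocessing is lighter
    simplify=True,
    opset=17,
)

# Dynamic (weight-only) INT8 quantization; typically shrinks the file roughly
# 4x and can speed up CPU inference, at some accuracy cost worth measuring.
quantize_dynamic(onnx_path, "best_int8.onnx")
```

Whether INT8 actually helps speed depends on the ONNX Runtime version and the CPU's instruction set, so treat it as one more thing to benchmark rather than a guaranteed win.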

To get more explicit help, you’ll need to be clear about your target speeds and hardware. As mentioned previously, it still won’t be definitive, but it will help others better understand the situation. As an example, if your target is to achieve < 1 ms inference, it might not be possible on a CPU, or at least not with the specific CPU in the system you’re using.

Remember, it’s not a good idea to ask how to do X when your goal is to accomplish Y. You should share your true intention so others have a better understanding of how to help you. The discussion is very different if your goal is “monitor an assembly line that is moving at 3 parts per second” rather than “lower the inference time as much as possible.” The former provides enough context and detail for others to give specific advice, whereas the latter is open-ended and ambiguous.