YOLO11 pruning and quantization

Hi, I want to deploy a YOLO11 model to DeepStream, and I got it working, but I have a question.
The FPS was low, so I'd like to know whether there is anything for pruning or quantization that could improve inference time in DeepStream, since the default YOLO model in DeepStream can be quantized to INT8.
Can you give me some guidance about this?

Hello @mohammad_haydari,

Thanks for reaching out to the YOLO community!

To improve inference time when deploying your YOLO11 model to DeepStream, you can indeed optimize your model through quantization. Quantization can convert your model’s weights and activations to lower precision, like 8-bit integers, which reduces the model size and speeds up inference. For detailed guidance, the Ultralytics documentation on model optimization techniques provides insights into how quantization can enhance model performance, particularly on edge devices.
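To make the idea concrete, here is a toy, purely illustrative sketch of symmetric per-tensor INT8 quantization in PyTorch. It is not the exact calibration scheme TensorRT or DeepStream uses; it just shows how mapping FP32 values onto 8-bit integers trades a small amount of precision for a much more compact representation.

```python
import torch

# Toy illustration of symmetric INT8 quantization: map float weights
# onto 8-bit integers with a single per-tensor scale factor.
weights = torch.randn(4, 4)                    # stand-in for a layer's FP32 weights
scale = weights.abs().max() / 127.0            # largest magnitude maps to +/-127
q_weights = torch.clamp((weights / scale).round(), -128, 127).to(torch.int8)

# Dequantize to see the (small) rounding error introduced by quantization
deq = q_weights.float() * scale
print("max abs error:", (weights - deq).abs().max().item())
```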

Additionally, you can explore using TensorRT for model optimization as it includes techniques like layer fusion and precision calibration, which are particularly effective on NVIDIA GPUs. The TensorRT integration page provides detailed steps on exporting YOLO11 models to TensorRT format.
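For reference, a minimal sketch of that export using the Ultralytics Python API might look like the following. The weights file, calibration dataset YAML, and test image URL are placeholders, and the argument names reflect the export options described on the TensorRT integration page, so double-check them against your installed version.

```python
from ultralytics import YOLO

# Load a trained YOLO11 model (placeholder weights; substitute your own)
model = YOLO("yolo11n.pt")

# Export to a TensorRT engine. int8=True enables INT8 precision and uses a
# small representative dataset (passed via `data`) for calibration.
model.export(format="engine", int8=True, data="coco8.yaml")

# Load the exported engine back for a quick sanity check
trt_model = YOLO("yolo11n.engine")
results = trt_model("https://ultralytics.com/images/bus.jpg")
```

Keep in mind that TensorRT engines are tied to the GPU and TensorRT version they were built with, so export on the same device you plan to deploy to.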

I hope this helps you achieve better performance with your DeepStream deployment! Let the Ultralytics team know if you need anything else.

Hello,
Can you also guide me on pruning YOLO11?
I want to reduce the size of my custom model by pruning.

Have you tried quantizing first? Model pruning is not a simple task to accomplish. Model quantization during export will likely yield quite good results in improving inference speeds and reducing model size/complexity.

It’s always good to state your true goal when seeking help. You mentioned that you want to reduce the size of your model, but didn’t specify why. In most cases, when this type of question is asked, the real goal is to improve inference times. As mentioned above, exporting to an inference framework (TensorRT, ONNX, OpenVINO, RKNN, etc.) that’s appropriate for the hardware, plus quantization, will produce the best result for very little effort. Sometimes the goal really is to make the model smaller on disk or in memory, and again, exporting to the proper inference framework and quantizing down to INT8 will be the quickest and simplest solution.
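If you still want to experiment with pruning, here is a minimal sketch using PyTorch’s built-in `torch.nn.utils.prune` utilities. This is not an Ultralytics feature, and the 30% ratio is arbitrary; unstructured pruning only zeroes weights in place, so you would still need to fine-tune afterwards, and the model won’t get smaller or faster unless the inference runtime can exploit the sparsity, which is part of why pruning is harder than it looks.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune
from ultralytics import YOLO

# Load a trained model; the underlying torch module lives at `yolo.model`
yolo = YOLO("yolo11n.pt")  # placeholder weights

# Collect all Conv2d weights and prune the 30% smallest by L1 magnitude
params_to_prune = [
    (m, "weight") for m in yolo.model.modules() if isinstance(m, nn.Conv2d)
]
prune.global_unstructured(
    params_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.3,
)

# Make the pruning permanent (folds the masks into the weight tensors)
for module, name in params_to_prune:
    prune.remove(module, name)

# The zeroed weights still sit in dense tensors, so fine-tune the model
# afterwards and use a sparsity-aware runtime to see any real gains.
```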
