YOLO11 pruning and quantization

Hi, I want to deploy a YOLO11 model to DeepStream, and I got it working, but I have a question.
The FPS was low, so I'd like to know whether there is anything for pruning or quantization that could improve inference time in DeepStream, since the default YOLO model in DeepStream can be quantized to INT8.
Can you give me some guidance about this?

Hello @mohammad_haydari,

Thanks for reaching out to the YOLO community!

To improve inference time when deploying your YOLO11 model to DeepStream, you can indeed optimize your model through quantization. Quantization can convert your model’s weights and activations to lower precision, like 8-bit integers, which reduces the model size and speeds up inference. For detailed guidance, the Ultralytics documentation on model optimization techniques provides insights into how quantization can enhance model performance, particularly on edge devices.
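To make the idea concrete, here is a toy, purely illustrative sketch of symmetric per-tensor INT8 quantization in PyTorch. It is not the exact calibration scheme TensorRT or DeepStream uses; it just shows how mapping FP32 values onto 8-bit integers trades a small amount of precision for a much more compact representation.

```python
import torch

# Toy illustration of symmetric INT8 quantization: map float weights
# onto 8-bit integers with a single per-tensor scale factor.
weights = torch.randn(4, 4)                    # stand-in for a layer's FP32 weights
scale = weights.abs().max() / 127.0            # largest magnitude maps to +/-127
q_weights = torch.clamp((weights / scale).round(), -128, 127).to(torch.int8)

# Dequantize to see the (small) rounding error introduced by quantization
deq = q_weights.float() * scale
print("max abs error:", (weights - deq).abs().max().item())
```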

Additionally, you can explore using TensorRT for model optimization as it includes techniques like layer fusion and precision calibration, which are particularly effective on NVIDIA GPUs. The TensorRT integration page provides detailed steps on exporting YOLO11 models to TensorRT format.
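For reference, a minimal sketch of that export using the Ultralytics Python API might look like the following. The weights file, calibration dataset YAML, and test image URL are placeholders, and the argument names reflect the export options described on the TensorRT integration page, so double-check them against your installed version.

```python
from ultralytics import YOLO

# Load a trained YOLO11 model (placeholder weights; substitute your own)
model = YOLO("yolo11n.pt")

# Export to a TensorRT engine. int8=True enables INT8 precision and uses a
# small representative dataset (passed via `data`) for calibration.
model.export(format="engine", int8=True, data="coco8.yaml")

# Load the exported engine back for a quick sanity check
trt_model = YOLO("yolo11n.engine")
results = trt_model("https://ultralytics.com/images/bus.jpg")
```

Keep in mind that TensorRT engines are tied to the GPU and TensorRT version they were built with, so export on the same device you plan to deploy to.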

I hope this helps you achieve better performance with your DeepStream deployment! Let the Ultralytics team know if you need anything else.

Hello,
Can you also guide me on pruning YOLO11?
I want to reduce the size of my custom model by pruning.

Have you tried quantizing first? Model pruning is not a simple task to accomplish. Model quantization during export will likely yield quite good results in improving inference speeds and reducing model size/complexity.

It’s always good to state your true goal when seeking help. You mentioned that you want to reduce the size of your model, but didn’t specify why. In most cases, when this type of question is asked, the real goal is to improve inference times. As mentioned above, exporting to an inference framework (TensorRT, ONNX, OpenVINO, RKNN, etc.) that’s appropriate for the hardware, plus quantization, will produce the best result for very little effort. Sometimes the goal really is to make the model smaller on disk or in memory, and again, exporting to the proper inference framework and quantizing down to INT8 will be the quickest and simplest solution.
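If you still want to experiment with pruning, here is a minimal sketch using PyTorch’s built-in `torch.nn.utils.prune` utilities. This is not an Ultralytics feature, and the 30% ratio is arbitrary; unstructured pruning only zeroes weights in place, so you would still need to fine-tune afterwards, and the model won’t get smaller or faster unless the inference runtime can exploit the sparsity, which is part of why pruning is harder than it looks.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune
from ultralytics import YOLO

# Load a trained model; the underlying torch module lives at `yolo.model`
yolo = YOLO("yolo11n.pt")  # placeholder weights

# Collect all Conv2d weights and prune the 30% smallest by L1 magnitude
params_to_prune = [
    (m, "weight") for m in yolo.model.modules() if isinstance(m, nn.Conv2d)
]
prune.global_unstructured(
    params_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.3,
)

# Make the pruning permanent (folds the masks into the weight tensors)
for module, name in params_to_prune:
    prune.remove(module, name)

# The zeroed weights still sit in dense tensors, so fine-tune the model
# afterwards and use a sparsity-aware runtime to see any real gains.
```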
