YOLOv11n Pruning

Hello everyone,

I am trying to implement an object detection model on an MCU with an NPU using several YOLO models. Right now I am working with YOLOv11n, but even after applying INT8 quantization I cannot reach the model size I need. My MCU only accepts INT8, so falling back to floating point is not an option either.

So I wanted to explore how to do pruning: what the important steps are and what to keep in mind while doing it. In particular, I am unsure which layers to prune.

Thank you for your time and help.

On MCUs/NPUs, pruning only helps if it’s structured pruning (removing whole channels/filters so the network is physically smaller). Unstructured pruning (just zeroing individual weights) usually won’t shrink your exported INT8 model or speed it up unless your toolchain has sparse-kernel support. The Ultralytics glossary page on pruning (structured vs unstructured) summarizes this well.

If you’re starting a new deployment, I’d try Ultralytics YOLO26n first (it’s smaller and faster than YOLO11n; see the YOLO26 docs), and only prune if that’s still too big.

For “what to prune”: in practice you typically don’t prune the very first stem layer or the final detection head, and instead prune repeated conv blocks in the backbone + neck (those usually have the most redundancy). The usual workflow is: train a baseline → apply channel pruning gradually (e.g., small % each round) → fine-tune a bit to recover accuracy → repeat until you hit your size target → export INT8 again.
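To make the "structured = physically smaller" point concrete, here is a minimal NumPy sketch of what channel pruning does to one pair of conv layers: rank the filters by L1 magnitude, keep the strongest ones, and shrink the next layer's input channels to match. This is an illustration of the idea only, not an Ultralytics API; the layer shapes and the `prune_conv_pair` helper are made up for the example.

```python
import numpy as np

def prune_conv_pair(w1, b1, w2, ratio):
    """Structurally prune `ratio` of conv1's output channels.

    w1: (out_c, in_c, k, k) weights of the conv being pruned
    b1: (out_c,) bias of that conv
    w2: (next_out, out_c, k, k) weights of the following conv,
        whose input channels must shrink to match.
    """
    out_c = w1.shape[0]
    n_keep = max(1, int(round(out_c * (1 - ratio))))
    # Magnitude importance: L1 norm of each filter's weights.
    importance = np.abs(w1).reshape(out_c, -1).sum(axis=1)
    # Keep the n_keep most important filters (sorted to preserve order).
    keep = np.sort(np.argsort(importance)[-n_keep:])
    # Physically remove channels by slicing -- not just zeroing them.
    return w1[keep], b1[keep], w2[:, keep]

rng = np.random.default_rng(0)
w1 = rng.standard_normal((32, 16, 3, 3))
b1 = rng.standard_normal(32)
w2 = rng.standard_normal((64, 32, 3, 3))

w1p, b1p, w2p = prune_conv_pair(w1, b1, w2, ratio=0.25)
print(w1p.shape, b1p.shape, w2p.shape)
# (24, 16, 3, 3) (24,) (64, 24, 3, 3)
```

In a real network you would fine-tune for a few epochs after each pruning round and repeat until the INT8 export fits; libraries such as torch-pruning automate the cross-layer dependency tracking that this toy example does by hand.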

If you share (1) your target max model size (flash/RAM), (2) your current exported model format/size (e.g., .tflite), and (3) the NPU toolchain (TFLite Micro, vendor SDK, etc.), I can suggest whether pruning will actually move the needle for that stack and what pruning ratio is realistic.

You can read this

Thank you for your reply.

I am aiming for a model size below 2 MB. Right now my exported YOLOv11 model is 2.6 MB for both the full_quant and the int8 exports. We are using the eIQ Neutron SDK to deploy the model on our board.

I have also tried YOLO26n, but I did not get different results with that model either; it is also around 2.6 MB.

If you want to train a smaller model, you can change the model's scale entry to [0.5, 0.25, 512] and train a new model using the YAML:

It will make it smaller than 2 MB after quantization.
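For reference, the change above targets the `scales` section of the model YAML. The excerpt below is a sketch assuming the stock yolo11.yaml layout, where each scale is a [depth, width, max_channels] triple:

```yaml
# yolo11.yaml (excerpt) -- assumed layout; the 'n' row is what YOLO11n uses.
scales:
  # [depth, width, max_channels]
  n: [0.50, 0.25, 512]  # was [0.50, 0.25, 1024]; lowering max_channels shrinks the deepest layers
```

You would then train from the edited YAML, e.g. `YOLO("yolo11-custom.yaml")` in the Ultralytics Python API (the filename here is hypothetical).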

This decreased the size incredibly well. But I have a question: what are the possible values for max_channels? I have a rather small dataset for my specific use case, and this change causes the mAP50-95 value to decrease substantially.
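On how max_channels interacts with the other scale values: to my understanding of the Ultralytics model parser, a layer's nominal channel count is first capped at max_channels, then scaled by the width multiple, then rounded up to a multiple of 8, so the cap only affects layers whose nominal width exceeds it (any positive value works; powers of two like 256/512/1024 are the usual choices). A small sketch of that arithmetic, assuming this capping behavior:

```python
import math

def make_divisible(x, divisor=8):
    """Round up to the nearest multiple of `divisor`."""
    return math.ceil(x / divisor) * divisor

def effective_channels(nominal, width, max_channels):
    """Channel count actually built for a layer whose YAML entry
    says `nominal` channels, under the given scale values
    (assumes channels are capped before the width multiple)."""
    return make_divisible(min(nominal, max_channels) * width, 8)

# The deepest YOLO11 layers are nominally 1024 channels wide;
# the 'n' width multiple is 0.25.
for mc in (1024, 512, 256):
    print(mc, effective_channels(1024, width=0.25, max_channels=mc))
# Dropping max_channels from 1024 to 512 halves the deepest layers
# (256 -> 128 channels), which is where most of the size saving comes
# from; layers with nominal width <= max_channels are unchanged.
```

Because only the deepest layers lose capacity, a large mAP50-95 drop on a small dataset may be easier to recover by training longer or with stronger augmentation than by raising max_channels back up.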