Adding a new head to the YOLO11n model to detect very small objects

Hi Everyone,

I’m trying to add a fourth head to the YOLO11n model that processes a high-resolution feature map (P2) to detect very small objects within the existing model architecture. For this, I extended the neck with a new feature map and added a new head to process it. I tried this in two ways: implementing the changes directly in Python, and adding them to the yolo11.yaml file.

Please find the implementation steps below.

  1. Extended the neck by adding an extra upsample module.
  2. Added a new head module to process the P2 feature map; it consists
    of Conv layers, a C3k module, and a Detect module for predictions.
  3. Modified the forward pass to include the new head.
  4. Initialized the custom model with pretrained weights, then loaded
    and trained it.

Finally, when I try to load the model, I get the following error in the code:
AttributeError: 'CustomYOLO11n' object has no attribute 'extra_upsample'

I tried everything I could think of, but no luck. It seems that the DetectionModel class in YOLO11 dynamically builds the model from the YAML configuration, and I don’t understand how to register the extra_upsample and p2_head modules in the model architecture.

Then I took a different approach: instead of subclassing DetectionModel, I modified the YAML file to add a fourth head and loaded the model using the modified YAML, but still no luck. Please find the YAML below. I'm getting "RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 16 but got size 64 for tensor number 1 in the list."

I’m doing this experiment for my project, an Advanced Driver Monitoring System. I need to add four new heads and modify the neck for multi-task learning.


# Ultralytics YOLO11 object detection model with P3/8 - P5/32

# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolo11n.yaml' will call yolo11.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] # summary: 181 layers, 2624080 parameters, 2624064 gradients, 6.6 GFLOPs

# YOLO11n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
  - [-1, 2, C3k2, [256, False, 0.25]]
  - [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
  - [-1, 2, C3k2, [512, False, 0.25]]
  - [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
  - [-1, 2, C3k2, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
  - [-1, 2, C3k2, [1024, True]]
  - [-1, 1, SPPF, [1024, 5]] # 9
  - [-1, 2, C2PSA, [1024]] # 10

# YOLO11n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 2, C3k2, [512, False]] # 13

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, C3k2, [256, False]] # 16 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 13], 1, Concat, [1]] # cat head P4
  - [-1, 2, C3k2, [512, False]] # 19 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 10], 1, Concat, [1]] # cat head P5
  - [-1, 2, C3k2, [1024, True]] # 22 (P5/32-large)

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]] # 23 added
  - [[-1, 2], 1, Concat, [1]] # cat backbone P2
  - [-1, 3, C3k2, [128, False]] # 25 (P3/8-very small)
  - [-1, 1, Conv, [128, 3, 2]] # new Conv for P2

  - [[16, 19, 22], 1, Detect, [nc]] # Detect(P3, P4, P5)
  - [[23, 26, 27], 1, Detect, [nc]] # Detect(P2, P3, P4, P5)
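
For reference, I'm loading the modified YAML roughly like this (the filename is my own):

from ultralytics import YOLO

model = YOLO("yolo11n-p2-custom.yaml")  # the modified YAML above
model.load("yolo11n.pt")  # transfer pretrained weights where layer shapes match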

You can check this PR that lets you define your custom module directly in the YAML file.
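
On the AttributeError itself: DetectionModel builds all of its layers from the YAML inside parse_model, so an attribute like extra_upsample only exists if your subclass assigns it in __init__ (after calling super().__init__) and before your custom forward references it. Note also that YOLO("some.yaml") instantiates the stock DetectionModel, not your subclass, so the custom class has to be created directly. A minimal sketch using the names from your post (channel sizes are illustrative, not a drop-in implementation):

import torch.nn as nn
from ultralytics.nn.tasks import DetectionModel

class CustomYOLO11n(DetectionModel):
    def __init__(self, cfg="yolo11n.yaml", ch=3, nc=None, verbose=True):
        super().__init__(cfg, ch=ch, nc=nc, verbose=verbose)  # builds self.model from the YAML
        # Register the extra modules as attributes so PyTorch tracks their
        # parameters; anything used in forward() must be assigned here first.
        self.extra_upsample = nn.Upsample(scale_factor=2, mode="nearest")
        self.p2_head = nn.Sequential(
            nn.Conv2d(128, 128, 3, padding=1, bias=False),
            nn.BatchNorm2d(128),
            nn.SiLU(),
        )

model = CustomYOLO11n(cfg="yolo11n.yaml")  # instantiate the subclass directly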

Thank you for the quick inputs.

I will try this. If possible, please provide one complete example with backbone, neck, and head changes.

Best Regards,
Venkat

I would also suggest reviewing the yolov8-p2.yaml as a reference. Note that in your YAML, the once-upsampled P5-level output is concatenated directly with the stride-4 P2 backbone feature, which is likely what triggers the spatial size mismatch in your RuntimeError; the P2 config below instead reaches P2 through successive upsample stages.

# Ultralytics 🚀 AGPL-3.0 License - https://ultralytics.com/license

# Ultralytics YOLOv8 object detection model with P2/4 - P5/32 outputs
# Model docs: https://docs.ultralytics.com/models/yolov8
# Task docs: https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n.yaml' will call yolov8.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.33, 0.25, 1024]
  s: [0.33, 0.50, 1024]
  m: [0.67, 0.75, 768]
  l: [1.00, 1.00, 512]
  x: [1.00, 1.25, 512]

# YOLOv8.0 backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
  - [-1, 3, C2f, [128, True]]
  - [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
  - [-1, 6, C2f, [256, True]]
  - [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
  - [-1, 6, C2f, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
  - [-1, 3, C2f, [1024, True]]
  - [-1, 1, SPPF, [1024, 5]] # 9

# YOLOv8.0-p2 head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 3, C2f, [512]] # 12

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 3, C2f, [256]] # 15 (P3/8-small)

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 2], 1, Concat, [1]] # cat backbone P2
  - [-1, 3, C2f, [128]] # 18 (P2/4-xsmall)

  - [-1, 1, Conv, [128, 3, 2]]
  - [[-1, 15], 1, Concat, [1]] # cat head P3
  - [-1, 3, C2f, [256]] # 21 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 12], 1, Concat, [1]] # cat head P4
  - [-1, 3, C2f, [512]] # 24 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 9], 1, Concat, [1]] # cat head P5
  - [-1, 3, C2f, [1024]] # 27 (P5/32-large)

  - [[18, 21, 24, 27], 1, Detect, [nc]] # Detect(P2, P3, P4, P5)
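
If you want to try it directly, loading the bundled config with the scale letter in the filename should work, for example with the small coco8 demo dataset:

from ultralytics import YOLO

model = YOLO("yolov8n-p2.yaml")  # builds the P2 architecture at the 'n' scale
model.train(data="coco8.yaml", epochs=3, imgsz=640)  # quick smoke test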

Thank you for your inputs. Yeah, I tried yolov8-p2.yaml and yolov8-p6.yaml today and had no issues. Now, I need to add three new heads for Facial Landmark Detection, Face Recognition, and DMS states along with the detection head, and I need to modify the neck for multi-task learning.

I’m a little apprehensive about customizing the YOLO11 architecture for Advanced DMS tasks. Please share your thoughts, guide me on how to achieve this goal, and help set my direction.

Best Regards,
Venkat

You can probably use a YOLO pose model for the facial landmark detection; you'll just need a dataset to train on. Adding multiple heads like that is not what I would call a simple task, and most people making such changes usually know what they're doing or are prepared to figure it out themselves.

It might be worth considering that using more than one model to accomplish your task could be a viable solution. Heavy modifications to a YOLO model like you've described would likely cause significant slowdowns in inference speed.

Also, just searching around, you might be able to find some things that could help you out; a quick Google search turned up several relevant results.

Thank you for your inputs.

I was experimenting with the YOLO11 object detection and YOLO11 pose models on DMS datasets and wanted to use one of the two for multi-task learning. Your statement confirms that I can go ahead with the YOLO11 pose model for object and facial landmark detection.

As part of my literature collection, I also reviewed the topics you shared, and the final outcome was a decision to go ahead with YOLO11 alongside other models such as MobileNetV2, SqueezeNet, and AlexNet.

The following methods, identified so far in the research review process, are the more advanced and feasible DMS deployment solutions.

  1. Using a multi-task learning (MTL) CNN architecture for face detection, face recognition, and facial analysis (eye gaze estimation, head pose estimation, face occlusions).

  2. Using a two-stage CNN: the first CNN locates and tracks the face and eyes, while the second CNN estimates head pose, eye gaze, and occlusions in a multi-task learning framework. The first stage utilizes a modified YOLO11n.

  3. Using a customized YOLO11n for object and face detection, incorporating the Mediapipe Face Landmarker and Dlib's face recognition models in the same solution for detecting various DMS states.

  4. Using YOLO11n or another single-task CNN model for object and face detection, deploying it on the target device, and then using the Mediapipe Face Landmarker and Dlib face recognition libraries as part of the DMS application to detect various DMS states (drowsiness, distraction, emotions, and impairment). A rough pipeline along these lines is sketched below.
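
For method 4, the two-stage flow I have in mind looks roughly like this (a sketch only; the stage-2 function is a placeholder for the Mediapipe/Dlib analysis):

from ultralytics import YOLO

detector = YOLO("yolo11n.pt")  # stage 1: object/face detection

def analyze_face(face_crop):
    """Stage 2 placeholder: Mediapipe Face Landmarker / Dlib face
    recognition would run here to derive the DMS states."""
    ...

results = detector("driver_frame.jpg")  # hypothetical input frame
for box in results[0].boxes:
    x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
    analyze_face(results[0].orig_img[y1:y2, x1:x2])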

Please share your thoughts on this.

Best Regards,
Venkat

This is the proposed Advanced DMS solution.

The standard yolo11n-pose.yaml outputs 17 keypoints for human pose. For 64 facial landmarks, the output layer needs to be modified to handle these points, so I changed kpt_shape: [17, 3] to kpt_shape: [64, 2]. Is that enough, or do we need to add a multi-task loss function for this?

Please share your thoughts on this.

Best Regards,
Venkat

Hi Venkat,

Yes, modifying the kpt_shape parameter in the yolo11n-pose.yaml file is the correct approach to change the number of keypoints the model detects. Changing kpt_shape: [17, 3] to kpt_shape: [64, 2] tells the model to predict 64 keypoints, each with 2 dimensions (x, y coordinates, without the visibility flag).

You generally do not need to add a separate multi-task loss function just for changing the number of keypoints within the pose estimation task. The existing pose loss function should handle the regression for the specified number of keypoints defined by kpt_shape. You will, however, need to train the model on a dataset annotated with your 64 facial landmarks.
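
A training setup could look like this (a sketch; both file names are hypothetical, and the copied model YAML and the dataset YAML must declare the same kpt_shape):

from ultralytics import YOLO

# Copy of yolo11n-pose.yaml with kpt_shape changed to [64, 2]
model = YOLO("yolo11n-pose-face64.yaml")
model.load("yolo11n-pose.pt")  # reuse pretrained weights where shapes match
model.train(data="face-landmarks.yaml", epochs=100, imgsz=640)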

Good luck with your Advanced DMS project!

Thank you for your inputs.

Best Regards,
Venkat

I found a research paper on YOLO architecture design during my literature review. Please find the paper details below, and consider sharing it with new joiners.

Paper Title: YOLOv8 to YOLO11: A Comprehensive Architecture In-depth Comparative Review
Link: https://arxiv.org/abs/2501.13400

Best Regards,
Venkat

Can we still use GhostConv in YOLO11n models for model optimization?

Please share your thoughts.

Best Regards,
Venkat

The GhostConv module is still included. It might be good to reference the yolov8-ghost config and experiment with applying similar changes to YOLO11.

Thank you for confirmation.

Best Regards,
Venkat

Hi Venkat,

Yes, GhostConv is a supported module within the Ultralytics framework and can be integrated into YOLO11 models, including YOLO11n, by modifying the model’s YAML configuration file. This can potentially help optimize the model by reducing parameters and computational cost.
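
As a quick sanity check after editing the YAML, you could compare parameter counts between the baseline and a GhostConv variant (the ghost config filename here is hypothetical):

from ultralytics import YOLO

base = YOLO("yolo11n.yaml")
ghost = YOLO("yolo11n-ghost.yaml")  # your custom GhostConv config

count = lambda y: sum(p.numel() for p in y.model.parameters())
print(f"baseline: {count(base):,}  ghost: {count(ghost):,}")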

Let us know if you have further questions!

Hi,

I modified the YAML file to use the GhostConv module as below.

# YOLO11n-ghost backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, GhostConv, [128, 3, 2]] # 1-P2/4
  - [-1, 2, C3Ghost, [256, False, 0.25]]
  - [-1, 1, GhostConv, [256, 3, 2]] # 3-P3/8
  - [-1, 2, C3Ghost, [512, False, 0.25]]
  - [-1, 1, GhostConv, [512, 3, 2]] # 5-P4/16
  - [-1, 2, C3Ghost, [512, True]]
  - [-1, 1, GhostConv, [1024, 3, 2]] # 7-P5/32
  - [-1, 2, C3Ghost, [1024, True]]
  - [-1, 1, SPPF, [1024, 5]] # 9
  - [-1, 2, C2PSA, [1024]] # 10

# YOLO11n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 2, C3Ghost, [512, False]] # 13

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, C3Ghost, [256, False]] # 16 (P3/8-small)

  # Added a new head for extra small
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 2], 1, Concat, [1]] # cat backbone P2
  - [-1, 2, C3Ghost, [128, False]] # 19 (P2/4-xsmall)

  - [-1, 1, GhostConv, [128, 3, 2]]
  - [[-1, 16], 1, Concat, [1]] # cat head P3
  - [-1, 2, C3Ghost, [256, False]] # 22 (P3/8-small)

  - [-1, 1, GhostConv, [256, 3, 2]]
  - [[-1, 13], 1, Concat, [1]] # cat head P4
  - [-1, 2, C3Ghost, [512, False]] # 25 (P4/16-medium)

  - [-1, 1, GhostConv, [512, 3, 2]]
  - [[-1, 10], 1, Concat, [1]] # cat head P5
  - [-1, 2, C3Ghost, [1024, False]] # 28 (P5/32-large)

  - [[19, 22, 25, 28], 1, Detect, [nc]] # Detect(P2, P3, P4, P5)

However, I'm getting the following error:
TypeError: empty(): argument 'size' failed to unpack the object at pos 2 with error "type must be tuple of ints,but got float"

Please share your inputs.

Best Regards,
Venkat

Generally it is helpful to post the entire stack trace of the error:

                   from  n    params  module                                       arguments
  0                  -1  1      1856  ultralytics.nn.modules.conv.Conv             [3, 64, 3, 2]
  1                  -1  1     38720  ultralytics.nn.modules.conv.GhostConv        [64, 128, 3, 2]

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  
  File "ultralytics/nn/tasks.py", line 1223, in parse_model
    m_ = torch.nn.Sequential(*(m(*args) for _ in range(n))) if n > 1 else m(*args)  # module
  
  File "ultralytics/nn/modules/block.py", line 419, in __init__
    super().__init__(c1, c2, n, shortcut, g, e)
  
  File "ultralytics/nn/modules/block.py", line 332, in __init__
    self.m = nn.Sequential(*(Bottleneck(c_, c_, shortcut, g, k=((1, 1), (3, 3)), e=1.0) for _ in range(n)))
  
  File "ultralytics/nn/modules/block.py", line 332, in <genexpr>
    self.m = nn.Sequential(*(Bottleneck(c_, c_, shortcut, g, k=((1, 1), (3, 3)), e=1.0) for _ in range(n)))
  
  File "ultralytics/nn/modules/block.py", line 471, in __init__
    self.cv2 = Conv(c_, c2, k[1], 1, g=g)
  
  File "ultralytics/nn/modules/conv.py", line 65, in __init__
    self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
  
  File ".venv/lib/site-packages/torch/nn/modules/conv.py", line 447, in __init__
    super().__init__(
  
  File ".venv/lib/site-packages/torch/nn/modules/conv.py", line 134, in __init__
    self.weight = Parameter(torch.empty(

TypeError: empty(): argument 'size' failed to unpack the object at pos 2 with error "type must be tuple of ints,but got float"   

This points to [-1, 2, C3Ghost, [256, False, 0.25]] being the issue, specifically the 0.25 you've put as the third argument. In the C3k2 lines that value is the expansion factor e, but C3Ghost follows the C3 signature (c1, c2, n, shortcut, g, e), so after parse_model inserts the repeat count, the 0.25 lands in the groups argument g. Groups must be an int, and in_channels // 0.25 yields a float that reaches torch.empty as part of the weight size, producing the error above. Dropping the 0.25, i.e. [-1, 2, C3Ghost, [256, False]], as in the yolov8-ghost reference, should resolve it.

Thank you for the inputs.

I take your point, but I didn't change anything from the YOLOv8 GhostConv reference; I just updated it for YOLO11n. The differences are the C2f vs. C3k2 blocks and the float-to-int conversion.

Please share if you have any working prototype of YOLO11n with GhostConv.

Best Regards,
Venkata Rao

Hi Venkat,

While GhostConv modules can theoretically be integrated into YOLO11 architectures by modifying the YAML file, we don’t have an official, pre-validated yolo11n-ghost.yaml prototype readily available to share.

Creating custom architectures like this involves careful tuning of the YAML definition, ensuring all module arguments, channel dimensions, and layer connections are compatible. The errors you encountered suggest potential mismatches in how the GhostConv/C3Ghost modules are defined or connected within the YOLO11 structure compared to their usage in YOLOv8.

Debugging the YAML often involves comparing your structure against the base yolo11n.yaml and referencing how modules are parsed in the codebase, for example within the parse_model function in ultralytics/nn/tasks.py. This can help identify issues like the type error you mentioned.
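
A simple way to do this is to build the model directly from your YAML: construction runs parse_model and prints the layer table, so the last row printed before the traceback points at the failing layer definition (filename hypothetical):

from ultralytics import YOLO

# Building from a YAML runs parse_model; the printed table stops
# at the layer whose arguments fail to parse.
model = YOLO("yolo11n-ghost.yaml")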

Good luck with your customization!