Replacing the YOLO11 backbone with ResNet-50

Hi, I’m new to the world of computer vision and object detection. I’m currently trying to replace the YOLO11 backbone with ResNet-50. This is the .yaml file that I’ve created, but I still get an error message: RuntimeError: Given groups=1, weight of size [128, 1280, 1, 1], expected input[1, 576, 64, 64] to have 1280 channels, but got 576 channels instead. Can anyone help?

.yaml code:

nc: 2

scales:
  n: [0.33, 0.25, 1024]
  s: [0.33, 0.50, 1024]
  m: [0.67, 0.75, 768]
  l: [1.00, 1.00, 512]
  x: [1.00, 1.25, 512]

backbone:
  # [from, repeats, module, args]
  - [-1, 1, TorchVision, [3, resnet50, DEFAULT, True, 2, True]] # ResNet50 backbone
  - [0, 1, Index, [256, 4]]  # P3/8
  - [0, 1, Index, [512, 5]]  # P4/16
  - [0, 1, Index, [1024, 6]] # P5/32

head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 1], 1, Concat, [1]] # cat backbone P4
  - [-1, 3, C3k2, [512, False]]

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 0], 1, Concat, [1]] # cat backbone P3
  - [-1, 3, C3k2, [256, False]]

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 2], 1, Concat, [1]] # cat head P4
  - [-1, 3, C3k2, [512, False]]

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 3], 1, Concat, [1]] # cat head P5
  - [-1, 3, C3k2, [1024, True]]

  - [[5, 8, 11], 1, Detect, [nc]]

The indices are all wrong. If you’re new to computer vision, you shouldn’t be making advanced modifications like this without understanding how they work.

nc: 2

scales:
  n: [0.33, 0.25, 1024]
  s: [0.33, 0.50, 1024]
  m: [0.67, 0.75, 768]
  l: [1.00, 1.00, 512]
  x: [1.00, 1.25, 512]

backbone:
  # [from, repeats, module, args]
  - [-1, 1, TorchVision, [3, resnet50, DEFAULT, True, 2, True]]
  - [0, 1, Index, [256, 5]]   # P3/8
  - [0, 1, Index, [512, 6]]   # P4/16
  - [0, 1, Index, [1024, 7]]  # P5/32

head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 2], 1, Concat, [1]] # cat backbone P4
  - [-1, 2, C3k2, [512, False]]

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 1], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, C3k2, [256, False]] # (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 6], 1, Concat, [1]] # cat head P4
  - [-1, 2, C3k2, [512, False]] # (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 3], 1, Concat, [1]] # cat head P5
  - [-1, 2, C3k2, [1024, True]] # (P5/32-large)

  - [[-7, -4, -1], 1, Detect, [nc]] # Detect(P3, P4, P5)
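
To confirm the graph actually builds before training, you can load the config directly with the Ultralytics API. A minimal sketch (the filename is just a placeholder for wherever you save this):

from ultralytics import YOLO

# Build the detection model from the custom config
# ("yolo11-resnet50.yaml" is a placeholder filename).
model = YOLO("yolo11-resnet50.yaml")
model.info()  # prints a model summary (layers, parameters, GFLOPs) if it wires up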

Just as Toxite mentioned, if you’re new to computer vision, jumping straight into modifying the model structure might not be the best idea. It might be a better use of your time to first learn the fundamentals of computer vision and the structure of convolutional neural networks.

I like to use an analogy to explain why I, and others, make this type of recommendation. As a child, once you start walking, you don’t immediately try to run a marathon. In fact, even as a teenager or adult, you don’t just run a marathon; you have to work up to it with lots of training. It’s tempting to skip the fundamentals, but there’s a reason they’re called ‘fundamentals’, and that’s why they shouldn’t be skipped.

Finally, I also recommend you give this post a read. You’ve shared what you’re trying to do, but not the context of why. Without explaining what you aim to accomplish, it’s impossible to help you with your true goal. Most people have a reason to modify a model, but we don’t know yours, and it may be that you don’t actually need to modify anything to achieve it. You could save yourself a lot of time by sharing what you’re attempting to achieve overall, instead of just the problem you’re trying to tackle right now.


Hi, thanks for the advice. I’m doing this for academic purposes. My goal is to outperform the original YOLO11 on mAP for fire object detection tasks. Do you have any suggestions? Where should I start learning, and is it even possible to surpass the original YOLO11’s performance? Based on my experiments, YOLO11 already performs very well, even reaching over 97% on my dataset.

Dear Toxite, thank you for your help. May I know where the numbers 5, 6, and 7 in the following code come from? Is there any documentation I can read to better understand it?

  - [0, 1, Index, [256, 5]]  # P3/8
  - [0, 1, Index, [512, 6]]  # P4/16
  - [0, 1, Index, [1024, 7]] # P5/32

And if I want to change the backbone to a different resnet model, do I only need to change the backbone? Or do I need to change the neck as well?

It’s based on the feature map shapes. With split=True, the TorchVision module returns a list containing the network input followed by the output of each child module of the unwrapped ResNet, and Index picks one entry out of that list. For ResNet-50, entries 5, 6, and 7 are the stages whose outputs have 256, 512, and 1024 channels. The TorchVision and Index sections of the Ultralytics model YAML documentation describe both modules.
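
If you want to see the mapping yourself, you can enumerate the ResNet children with plain torchvision. A rough sketch, assuming (as the split=True path does) that the list starts with the input followed by each child’s output:

import torch
import torchvision

# ResNet-50 with the last two children (avgpool, fc) removed,
# mirroring the truncate=2 argument in the YAML.
resnet = torchvision.models.resnet50()
children = list(resnet.children())[:-2]

x = torch.zeros(1, 3, 640, 640)
outputs = [x]  # with split=True, index 0 is the input itself
with torch.no_grad():
    for child in children:
        outputs.append(child(outputs[-1]))

for i, y in enumerate(outputs):
    print(i, tuple(y.shape))
# index 5 -> (1, 256, 160, 160), index 6 -> (1, 512, 80, 80),
# index 7 -> (1, 1024, 40, 40): the channel counts used by Index above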


Ah, I see. Then what’s the correct way to integrate this with the YOLO11 neck and head? Do I need to adjust the channels of the Conv and C3k2 blocks as well? Like this? Is it correct?

backbone:
  # [from, repeats, module, args]
  - [-1, 1, TorchVision, [3, resnet34, DEFAULT, True, 2, True]]
  - [0, 1, Index, [128, 6]] # P3
  - [0, 1, Index, [256, 7]] # P4
  - [0, 1, Index, [512, 8]] # P5

head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 2], 1, Concat, [1]] # cat backbone P4
  - [-1, 2, C3k2, [256, False]]

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 1], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, C3k2, [128, False]] # (P3/8-small)

  - [-1, 1, Conv, [128, 3, 2]]
  - [[-1, 6], 1, Concat, [1]] # cat head P4
  - [-1, 2, C3k2, [256, False]] # (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 3], 1, Concat, [1]] # cat head P5
  - [-1, 2, C3k2, [512, True]] # (P5/32-large)

  - [[-7, -4, -1], 1, Detect, [nc]] # Detect(P3, P4, P5)

You’re very close. Two quick fixes and a sanity check:

1. The first TorchVision argument is the backbone’s output channels, not its input channels. For ResNet-34, use 512 instead of 3:

backbone:
  - [-1, 1, TorchVision, [512, resnet34, DEFAULT, True, 2, True]]
  - [0, 1, Index, [128, 6]]  # P3 (80x80)
  - [0, 1, Index, [256, 7]]  # P4 (40x40)
  - [0, 1, Index, [512, 8]]  # P5 (20x20)

2. Use straight quotes for the Upsample mode ("nearest"), not curly ones.

As a sanity check, your neck/head wiring looks good: the C3k2 targets of 128/256/512 align with P3/P4/P5, and [[-7, -4, -1], 1, Detect, [nc]] is correct.

To double-check the ResNet indices and channels on your setup, quickly probe with an Identity head and print the shapes; the process is shown in the TorchVision integration and Index module sections of the Model YAML Configuration Guide.
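
For example, a minimal sketch of a probe through the assembled model (the filename is a placeholder for wherever you save the config above):

import torch
from ultralytics import YOLO

# Build the detection model from the custom config
# ("yolo11-resnet34.yaml" is a placeholder filename).
model = YOLO("yolo11-resnet34.yaml")

# A dummy forward pass through the underlying nn.Module will raise the
# channel-mismatch RuntimeError from the original post if any Index
# channels are declared wrong; no error means the wiring is consistent.
model.model.eval()
with torch.no_grad():
    model.model(torch.zeros(1, 3, 640, 640))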

If you switch to another ResNet, update the Index channel sizes and the first TorchVision arg to match that backbone (e.g., ResNet-50 gives 256/512/1024 at the indices used above for P3/P4/P5, and its last stage outputs 2048 channels, which is what the TorchVision c2 should be).
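
If you’re unsure of a variant’s channel counts, a quick check with plain torchvision (a sketch; stage_channels is just a throwaway helper, and the printed indices match the Index numbering because entry 0 of the split output is the input):

import torch
import torchvision

# Print the output channels at each child index of a truncated ResNet,
# to fill in the Index args and the TorchVision c2 slot.
def stage_channels(name: str) -> None:
    model = torchvision.models.get_model(name)
    x = torch.zeros(1, 3, 224, 224)
    with torch.no_grad():
        for i, child in enumerate(list(model.children())[:-2], start=1):
            x = child(x)
            print(f"{name} index {i}: {x.shape[1]} channels")

stage_channels("resnet34")  # layer1-layer4 give 64/128/256/512 (indices 5-8)
stage_channels("resnet50")  # layer1-layer4 give 256/512/1024/2048 (indices 5-8)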