For reference, here's a breakdown of the `convnext_tiny` example config:
```yaml
nc: 1
backbone:
- [-1, 1, TorchVision, [768, "convnext_tiny", "DEFAULT", True, 2, True]] # - 0
head:
- [0, 1, Index, [192, 4]] # selects output at index 4 (1, 192, 80, 80) - 1
- [0, 1, Index, [384, 6]] # selects output at index 6 (1, 384, 40, 40) - 2
- [0, 1, Index, [768, 8]] # selects output at index 8 (1, 768, 20, 20) - 3
- [[1, 2, 3], 1, Detect, [nc]] # passes all three feature maps to detection head
```
Each layer follows this format: `[from, repeats, module, args]`.
- `from`: Indicates the layer providing input. A value of `-1` means the previous layer’s output is used, while `0` or higher refers to a specific earlier layer. A list such as `[1, 2, 3]` gathers the outputs of several layers.
- `repeats`: Specifies how many times the layer is repeated.
- `module`: The type of layer. It must be defined in `nn/modules` and imported in `tasks.py`.
- `args`: The arguments passed to the layer’s constructor. Their meaning depends on how they’re parsed in `parse_model`.
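As a rough illustration of how these rows are wired together (a simplified sketch; the real logic lives in `parse_model` in `tasks.py` and handles much more):
```python
# Simplified sketch of row wiring, not the actual parse_model implementation.
rows = [
    [-1, 1, "TorchVision", [768, "convnext_tiny", "DEFAULT", True, 2, True]],  # 0
    [0, 1, "Index", [192, 4]],         # 1
    [0, 1, "Index", [384, 6]],         # 2
    [0, 1, "Index", [768, 8]],         # 3
    [[1, 2, 3], 1, "Detect", ["nc"]],  # 4
]
for i, (f, n, module, args) in enumerate(rows):
    # -1 means "previous layer" (the model input for layer 0);
    # an int >= 0 is a layer index; a list gathers several layers.
    sources = f if isinstance(f, list) else [i - 1 if f == -1 else f]
    print(f"layer {i}: {module} x{n}, input from {sources}, args={args}")
```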
The first layer in the config is:
`[-1, 1, TorchVision, [768, "convnext_tiny", "DEFAULT", True, 2, False]]`.
`from`, `repeats`, and `module` are straightforward. Let’s focus on `args`, which correspond to this constructor in the `TorchVision` module:
```python
def __init__(self, c1, c2, model, weights="DEFAULT", unwrap=True, truncate=2, split=False)
```
The role of each argument is described in the docstring:
https://github.com/ultralytics/ultralytics/blob/a7f72d3f691810c5ba0fbd951047ac2d1bbc618b/ultralytics/nn/modules/block.py#L1113-L1129
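`c1` (input channels) is inferred automatically by `parse_model` and prepended to the arguments, so we never specify it in the YAML. The effective call for this row is therefore roughly the following (an illustration of how the arguments line up, not the exact internals; `c1=3` for an RGB input):
```python
from ultralytics.nn.modules import TorchVision

# Illustrative only: c1=3 is the inferred channel count of the RGB input.
layer = TorchVision(c1=3, c2=768, model="convnext_tiny", weights="DEFAULT",
                    unwrap=True, truncate=2, split=False)
```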
The provided `args` list thus starts with `c2` (output channels), which is `768` in this example. This value varies depending on the specific model architecture. To verify it, you can create a simple YAML config using `nn.Identity` as the head:
```yaml
nc: 1
backbone:
- [-1, 1, TorchVision, [768, "convnext_tiny", "DEFAULT", True, 2, False]]
head:
- [-1, 1, nn.Identity, []]
```
Load this config into Ultralytics and check the output shape:
```python
import torch
from ultralytics import YOLO

model = YOLO("convnext.yaml", task="detect")
img = torch.randn(1, 3, 640, 640)
print(model.model(img).shape)
```
This outputs:
`torch.Size([1, 768, 20, 20])`
confirming that the number of output channels (`c2`) is `768`.
In the `args` list, `"convnext_tiny"` specifies the model name (it must match the lowercase name in the [torchvision docs](https://pytorch.org/vision/main/models/convnext.html)), and `"DEFAULT"` loads the default pretrained weights. The next argument, `unwrap=True`, unwraps the model into an `nn.Sequential` block, making it easier to remove the final layers (usually classification-specific ones). `truncate=2` then removes the last two layers, typically the pooling layer and the linear classifier. You can view the remaining layers by running:
```python
print(model.model.model[0])
```
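To get a feel for what `unwrap` and `truncate` do, here's an approximate equivalent in plain torchvision (a sketch of the behavior described in the docstring, not the exact Ultralytics implementation):
```python
import torch.nn as nn
from torchvision.models import convnext_tiny

m = convnext_tiny(weights="DEFAULT")
layers = list(m.children())[:-2]     # truncate=2: drop the avgpool and classifier
layers = list(layers[0].children())  # unwrap: flatten the inner `features` Sequential
backbone = nn.Sequential(*layers)
print(len(backbone))  # 8 blocks: the stem plus alternating stages and downsamplers
```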
The final argument, `split`, when set to `True`, makes the model return intermediate outputs as a list instead of just the final output. Let’s enable `split` and inspect the outputs:
```yaml
nc: 1
backbone:
- [-1, 1, TorchVision, [768, "convnext_tiny", "DEFAULT", True, 2, True]]
head:
- [-1, 1, nn.Identity, []]
```
```python
import torch
from ultralytics import YOLO

model = YOLO("convnext.yaml", task="detect")
img = torch.randn(1, 3, 640, 640)
output = model.model(img) # returns a list of tensors
print([o.shape for o in output])
```
Output:
```python
[torch.Size([1, 3, 640, 640]),
torch.Size([1, 96, 160, 160]),
torch.Size([1, 96, 160, 160]),
torch.Size([1, 192, 80, 80]),
torch.Size([1, 192, 80, 80]),
torch.Size([1, 384, 40, 40]),
torch.Size([1, 384, 40, 40]),
torch.Size([1, 768, 20, 20]),
torch.Size([1, 768, 20, 20])]
```
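The first entry is the raw `640x640` input, followed by the output of each of the eight unwrapped blocks; consecutive duplicates occur because each stage preserves the shape produced by the layer before it. A plausible sketch of how `split=True` collects these (inferred from the shapes above, not the exact Ultralytics code):
```python
import torch
import torch.nn as nn

def forward_split(backbone: nn.Sequential, x: torch.Tensor) -> list[torch.Tensor]:
    # Assumed behavior: keep the input, then append every block's output in order.
    outputs = [x]
    for block in backbone:
        outputs.append(block(outputs[-1]))
    return outputs  # 1 input + 8 block outputs = 9 tensors for convnext_tiny
```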
We can now use the `Index` layer to extract any of these outputs. In the final config:
```yaml
nc: 1
backbone:
- [-1, 1, TorchVision, [768, "convnext_tiny", "DEFAULT", True, 2, True]] # - 0
head:
- [0, 1, Index, [192, 4]] # extracts output at index 4 (1, 192, 80, 80) - 1
- [0, 1, Index, [384, 6]] # extracts output at index 6 (1, 384, 40, 40) - 2
- [0, 1, Index, [768, 8]] # extracts output at index 8 (1, 768, 20, 20) - 3
- [[1, 2, 3], 1, Detect, [nc]] # passes all three feature maps to detection head
```
Each `Index` layer picks one output from the list above: its first argument is that output's channel count and its second is the list index, both read directly off the printed shapes. The three selected feature maps, from finer to coarser (`80x80`, `40x40`, `20x20`), are then passed to the `Detect` layer. Using feature maps at multiple scales allows the model to detect objects of various sizes.
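For reference, `Index` itself is conceptually tiny: it just returns one element of the list produced by the split backbone. A minimal stand-in (the real class lives in `nn/modules`; this sketch keeps only the index, while the channel value in the config lets `parse_model` track output channels):
```python
import torch
import torch.nn as nn

class IndexSketch(nn.Module):
    """Minimal stand-in for the Index layer: pick one tensor from a list."""

    def __init__(self, index: int = 0):
        super().__init__()
        self.index = index

    def forward(self, x: list[torch.Tensor]) -> torch.Tensor:
        # x is the list of intermediate outputs returned by the split backbone.
        return x[self.index]
```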