For reference, here's a breakdown of the `convnext_tiny` example config:
```yaml
nc: 1
backbone:
- [-1, 1, TorchVision, [768, "convnext_tiny", "DEFAULT", True, 2, True]] # - 0
head:
- [0, 1, Index, [192, 4]] # selects output at index 4 (1, 192, 80, 80) - 1
- [0, 1, Index, [384, 6]] # selects output at index 6 (1, 384, 40, 40) - 2
- [0, 1, Index, [768, 8]] # selects output at index 8 (1, 768, 20, 20) - 3
- [[1, 2, 3], 1, Detect, [nc]] # passes all three feature maps to detection head
```
Each layer follows this format: `[from, repeats, module, args]`.
- `from`: Indicates the layer providing input. A value of `-1` means the previous layer’s output is used, while `0` or higher refers to a specific earlier layer. A list such as `[1, 2, 3]` gathers the outputs of several layers.
- `repeats`: Specifies how many times the layer is repeated.
- `module`: The type of layer. It must be defined in `nn/modules` and imported in `tasks.py`.
- `args`: The arguments passed to the layer’s constructor. Their meaning depends on how they’re parsed in `parse_model`.
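As a rough illustration of how these rows are wired together (a simplified sketch; the real logic lives in `parse_model` in `tasks.py` and handles much more):
```python
# Simplified sketch of row wiring, not the actual parse_model implementation.
rows = [
    [-1, 1, "TorchVision", [768, "convnext_tiny", "DEFAULT", True, 2, True]],  # 0
    [0, 1, "Index", [192, 4]],         # 1
    [0, 1, "Index", [384, 6]],         # 2
    [0, 1, "Index", [768, 8]],         # 3
    [[1, 2, 3], 1, "Detect", ["nc"]],  # 4
]
for i, (f, n, module, args) in enumerate(rows):
    # -1 means "previous layer" (the model input for layer 0);
    # an int >= 0 is a layer index; a list gathers several layers.
    sources = f if isinstance(f, list) else [i - 1 if f == -1 else f]
    print(f"layer {i}: {module} x{n}, input from {sources}, args={args}")
```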
The first layer in the config is:
`[-1, 1, TorchVision, [768, "convnext_tiny", "DEFAULT", True, 2, False]]`.
`from`, `repeats`, and `module` are straightforward. Let’s focus on `args`, which correspond to this constructor in the `TorchVision` module:
```python
def __init__(self, c1, c2, model, weights="DEFAULT", unwrap=True, truncate=2, split=False)
```
The role of each argument is described in the docstring:
https://github.com/ultralytics/ultralytics/blob/a7f72d3f691810c5ba0fbd951047ac2d1bbc618b/ultralytics/nn/modules/block.py#L1113-L1129
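`c1` (input channels) is inferred automatically by `parse_model` and prepended to the arguments, so we never specify it in the YAML. The effective call for this row is therefore roughly the following (an illustration of how the arguments line up, not the exact internals; `c1=3` for an RGB input):
```python
from ultralytics.nn.modules import TorchVision

# Illustrative only: c1=3 is the inferred channel count of the RGB input.
layer = TorchVision(c1=3, c2=768, model="convnext_tiny", weights="DEFAULT",
                    unwrap=True, truncate=2, split=False)
```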
The provided `args` list thus starts with `c2` (output channels), which is `768` in this example. This value varies depending on the specific model architecture. To verify it, you can create a simple YAML config using `nn.Identity` as the head:
```yaml
nc: 1
backbone:
- [-1, 1, TorchVision, [768, "convnext_tiny", "DEFAULT", True, 2, False]]
head:
- [-1, 1, nn.Identity, []]
```
Load this config into Ultralytics and check the output shape:
```python
import torch
from ultralytics import YOLO

model = YOLO("convnext.yaml", task="detect")
img = torch.randn(1, 3, 640, 640)
print(model.model(img).shape)
```
This outputs:
`torch.Size([1, 768, 20, 20])`
confirming that the number of output channels (`c2`) is `768`.
In the `args` list, `"convnext_tiny"` specifies the model name (it must match the lowercase name in the [torchvision docs](https://pytorch.org/vision/main/models/convnext.html)), and `"DEFAULT"` loads the default pretrained weights. The next argument, `unwrap=True`, unwraps the model into an `nn.Sequential` block, making it easier to remove the final layers (usually classification-specific ones). `truncate=2` then removes the last two layers, typically the pooling layer and the linear classifier. You can view the remaining layers by running:
```python
print(model.model.model[0])
```
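To get a feel for what `unwrap` and `truncate` do, here's an approximate equivalent in plain torchvision (a sketch of the behavior described in the docstring, not the exact Ultralytics implementation):
```python
import torch.nn as nn
from torchvision.models import convnext_tiny

m = convnext_tiny(weights="DEFAULT")
layers = list(m.children())[:-2]     # truncate=2: drop the avgpool and classifier
layers = list(layers[0].children())  # unwrap: flatten the inner `features` Sequential
backbone = nn.Sequential(*layers)
print(len(backbone))  # 8 blocks: the stem plus alternating stages and downsamplers
```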
The final argument, `split`, when set to `True`, makes the model return intermediate outputs as a list instead of just the final output. Let’s enable `split` and inspect the outputs:
```yaml
nc: 1
backbone:
- [-1, 1, TorchVision, [768, "convnext_tiny", "DEFAULT", True, 2, True]]
head:
- [-1, 1, nn.Identity, []]
```
```python
import torch
from ultralytics import YOLO

model = YOLO("convnext.yaml", task="detect")
img = torch.randn(1, 3, 640, 640)
output = model.model(img) # returns a list of tensors
print([o.shape for o in output])
```
Output:
```python
[torch.Size([1, 3, 640, 640]),
torch.Size([1, 96, 160, 160]),
torch.Size([1, 96, 160, 160]),
torch.Size([1, 192, 80, 80]),
torch.Size([1, 192, 80, 80]),
torch.Size([1, 384, 40, 40]),
torch.Size([1, 384, 40, 40]),
torch.Size([1, 768, 20, 20]),
torch.Size([1, 768, 20, 20])]
```
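The first entry is the raw `640x640` input, followed by the output of each of the eight unwrapped blocks; consecutive duplicates occur because each stage preserves the shape produced by the layer before it. A plausible sketch of how `split=True` collects these (inferred from the shapes above, not the exact Ultralytics code):
```python
import torch
import torch.nn as nn

def forward_split(backbone: nn.Sequential, x: torch.Tensor) -> list[torch.Tensor]:
    # Assumed behavior: keep the input, then append every block's output in order.
    outputs = [x]
    for block in backbone:
        outputs.append(block(outputs[-1]))
    return outputs  # 1 input + 8 block outputs = 9 tensors for convnext_tiny
```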
We can now use the `Index` layer to extract any of these outputs. In the final config:
```yaml
nc: 1
backbone:
- [-1, 1, TorchVision, [768, "convnext_tiny", "DEFAULT", True, 2, True]] # - 0
head:
- [0, 1, Index, [192, 4]] # extracts output at index 4 (1, 192, 80, 80) - 1
- [0, 1, Index, [384, 6]] # extracts output at index 6 (1, 384, 40, 40) - 2
- [0, 1, Index, [768, 8]] # extracts output at index 8 (1, 768, 20, 20) - 3
- [[1, 2, 3], 1, Detect, [nc]] # passes all three feature maps to detection head
```
Each `Index` layer picks one output from the list above: its first argument is that output's channel count and its second is the list index, both read directly off the printed shapes. The three selected feature maps, from finer to coarser (`80x80`, `40x40`, `20x20`), are then passed to the `Detect` layer. Using feature maps at multiple scales allows the model to detect objects of various sizes.
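For reference, `Index` itself is conceptually tiny: it just returns one element of the list produced by the split backbone. A minimal stand-in (the real class lives in `nn/modules`; this sketch keeps only the index, while the channel value in the config lets `parse_model` track output channels):
```python
import torch
import torch.nn as nn

class IndexSketch(nn.Module):
    """Minimal stand-in for the Index layer: pick one tensor from a list."""

    def __init__(self, index: int = 0):
        super().__init__()
        self.index = index

    def forward(self, x: list[torch.Tensor]) -> torch.Tensor:
        # x is the list of intermediate outputs returned by the split backbone.
        return x[self.index]
```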