The indices are all wrong. If you’re new to computer vision, you shouldn’t attempt advanced customizations like this without understanding how they work.
nc: 2
scales:
  n: [0.33, 0.25, 1024]
  s: [0.33, 0.50, 1024]
  m: [0.67, 0.75, 768]
  l: [1.00, 1.00, 1024]
  x: [1.00, 1.25, 512]
backbone:
  # [from, repeats, module, args]
  - [-1, 1, TorchVision, [3, resnet50, DEFAULT, True, 2, True]]
  - [0, 1, Index, [256, 5]] # P3/8
  - [0, 1, Index, [512, 6]] # P4/16
  - [0, 1, Index, [1024, 7]] # P5/32
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 2], 1, Concat, [1]] # cat backbone P4
  - [-1, 2, C3k2, [512, False]]
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 1], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, C3k2, [256, False]] # (P3/8-small)
  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 6], 1, Concat, [1]] # cat head P4
  - [-1, 2, C3k2, [512, False]] # (P4/16-medium)
  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 3], 1, Concat, [1]] # cat head P5
  - [-1, 2, C3k2, [1024, True]] # (P5/32-large)
  - [[-7, -4, -1], 1, Detect, [nc]] # Detect(P3, P4, P5)
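
If you want to check the wiring yourself, here is a rough sketch that resolves every `from` field in the config above to an absolute layer index. It assumes the usual convention that layers are numbered from 0 across backbone and head, that a negative `from` counts back from the current layer, and that a non-negative `from` is an absolute index. `FROM_FIELDS` and `resolve_sources` are names made up for this illustration, not part of any library, and the list is copied by hand rather than parsed from the YAML.

```python
# `from` field of each layer, in config order (backbone first, then head).
# The trailing comments give the layer index and module for readability.
FROM_FIELDS = [
    -1,            # 0  TorchVision (a -1 here means the input image)
    0,             # 1  Index, labelled P3 in the config
    0,             # 2  Index, labelled P4
    0,             # 3  Index, labelled P5
    -1,            # 4  nn.Upsample
    [-1, 2],       # 5  Concat (cat backbone P4)
    -1,            # 6  C3k2
    -1,            # 7  nn.Upsample
    [-1, 1],       # 8  Concat (cat backbone P3)
    -1,            # 9  C3k2 (P3 output)
    -1,            # 10 Conv
    [-1, 6],       # 11 Concat (cat head P4)
    -1,            # 12 C3k2 (P4 output)
    -1,            # 13 Conv
    [-1, 3],       # 14 Concat (cat head P5)
    -1,            # 15 C3k2 (P5 output)
    [-7, -4, -1],  # 16 Detect
]


def resolve_sources(from_fields):
    """Return, for each layer, the absolute indices of the layers feeding it."""
    resolved = []
    for i, field in enumerate(from_fields):
        sources = field if isinstance(field, list) else [field]
        # Assumed convention: negative values are relative to the current
        # layer index, non-negative values are already absolute.
        resolved.append([i + s if s < 0 else s for s in sources])
    return resolved


RESOLVED = resolve_sources(FROM_FIELDS)
for i, sources in enumerate(RESOLVED):
    print(f"layer {i:2d} <- {sources}")
```

Under those assumptions, `Detect` at layer 16 resolves to layers 9, 12 and 15, which are the three `C3k2` outputs, and each `Concat` pairs the previous layer with the skip connection named in its comment. Running the same check on a broken config is a quick way to spot a `from` entry that points at the wrong layer.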