YOLO26-seg with P2: available? Possible?

Hello!

I’ve run into a challenge: I’d like to segment some objects in a small image, around 200×200 px, and use this as a higher-accuracy segmentation pass. That is: I segment a large image with default settings, and if the confidence score on a segmented area is low, I make a cutout of that segment and run a “specialist” routine to confirm whether it is the object of interest or not.

I read up on modifying the YAML file to accomplish this, but everything I try ends in a ”cannot multiply mat1 and mat2” error with sizes like 5x1554 and 6124x192 listed (numbers from memory, but you get the idea). I’ve also tried fumbling through it with some coding agents to see whether I’m missing something obvious, without any success other than a change in those mat1/mat2 dimensions.

I want to remove P5 and add P2 to the detection head so that I’m using only P2, P3, and P4, or if necessary P2, P3, P4, and P5 (as the detection model already has a YAML option for).

I’m beginning to think this isn’t possible without modifying Segment26 itself - and I don’t want to go down that path. Any suggestions here? Fingers crossed I’m missing something obvious.

P2 adds a stride that makes the masks twice as large as the hardcoded assumption in Ultralytics, which is why it doesn’t work. You need to modify the Ultralytics code to change that hardcoded value.

Like in this PR:

The PR still doesn’t have a fix because there’s no easy way around it. The default hardcoded downsample value is 4, but it needs to be 2 for a P2 model. If you want a quick way to make P2 work, you can change the lines in the diff to 2, but of course that would break non-P2 models.
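To make the mismatch concrete, here is a toy calculation (my own illustration; the function name and the matmul framing are simplifications of the real mask-assembly code, which multiplies per-instance coefficients against the flattened prototype map):

```python
# Toy arithmetic (illustrative only, not actual Ultralytics internals):
# YOLO-seg assembles instance masks roughly as coeff @ proto.view(C, H*W),
# where the proto map is assumed to sit at a hardcoded downsample of 4
# relative to the input image.

def proto_flat_size(imgsz: int, downsample: int) -> int:
    """Flattened spatial size (H*W) of the mask prototype map."""
    side = imgsz // downsample
    return side * side

imgsz = 640

expected = proto_flat_size(imgsz, 4)   # what the mask pipeline assumes: 160*160
p2_actual = proto_flat_size(imgsz, 2)  # what a P2-first head produces: 320*320

print(expected, p2_actual)  # 25600 vs 102400: the inner matmul dims disagree,
                            # hence "cannot multiply mat1 and mat2"
```

That factor-of-4 difference in the flattened spatial size is exactly why the reported mat1/mat2 shapes look unrelated at first glance.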

What you describe sounds like it could be done with a simple conditional check during inference. If the first pass detects objects but one or more are below a specified threshold, or if no objects are detected at all, it would trigger the conditional path. On that path you’d call a function with the image and the center points of any low-confidence detections (an empty list/array otherwise) to do a second inference pass. The function could then either reuse the same model or load a new one, slice the image into the necessary tiles, and perform inference on each tile.

This wouldn’t require any modification to the model and should be relatively straightforward to implement. It gives you a “fast” path when all detections are above a given threshold (likely the common route) and a “slow” path that runs a second inference on the tiles. The slow path could even use SAHI for tiled inference with YOLO, which would add latency but increase the overall accuracy of the second pass.
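A minimal sketch of that fast/slow flow, with stubbed callables standing in for the actual Ultralytics predict calls (the names, the dict-based detection format, and the threshold are all hypothetical placeholders):

```python
# Sketch of the two-pass flow. `run_model` and `run_specialist` are
# hypothetical callables wrapping your actual inference; detections are
# plain dicts here for illustration.

CONF_OK = 0.6  # hypothetical threshold below which a detection is rechecked

def tile_around(img_w, img_h, center, size=256):
    """Clamp a size x size crop window around a detection center
    (assumes the image is at least `size` px on each side)."""
    cx, cy = center
    x0 = max(0, min(img_w - size, cx - size // 2))
    y0 = max(0, min(img_h - size, cy - size // 2))
    return (x0, y0, x0 + size, y0 + size)

def two_pass(image, img_w, img_h, run_model, run_specialist):
    dets = run_model(image)                           # first full-image pass
    low = [d for d in dets if d["conf"] < CONF_OK]
    if dets and not low:
        return dets                                   # fast path: all confident
    kept = [d for d in dets if d["conf"] >= CONF_OK]
    for d in low:                                     # slow path: per-tile recheck
        box = tile_around(img_w, img_h, d["center"])
        kept.extend(run_specialist(image, box))
    return kept
```

The slow branch is also where a SAHI-based tiled pass could slot in instead of the per-tile loop.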

Indeed, I happened to hear of SAHI just yesterday, and I’ll see how well it works.
The issue I’m having is that I haven’t been able to train a “specialized” model that’s better than the model running on the full image. For example, I’ve tried training a separate model using cutouts of the objects as training data, with various parameters, without any luck. But with small objects (~220 px), P4 and P5 are kind of irrelevant because the image is divided up too much. That’s why I was hoping to get a P2 head in there.

Would you have suggestions or ideas on how to pursue this “specialized” model (other than SAHI, obviously)?

It looks like the PR is addressing auto-detecting the stride length rather than replacing it with 2. Am I missing something? Wouldn’t this change work?

It’s incomplete.

However, I found a different workaround. You can just swap the layers:

Is this transferable to YOLO26? I have to be honest, I don’t follow. Could I ask you to elaborate on the issue and how it can be resolved?

Oh… looks like this is functioning:

That’s kind of neat :slight_smile:

# Ultralytics 🚀 AGPL-3.0 License - https://ultralytics.com

# Ultralytics YOLO26-seg instance segmentation model with P2/4 - P5/32 outputs
# Model docs: https://docs.ultralytics.com
# Task docs: https://docs.ultralytics.com

# Parameters
nc: 80 # number of classes
end2end: True # whether to use end-to-end mode
reg_max: 1 # DFL bins
scales: 
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] 
  s: [0.50, 0.50, 1024] 
  m: [0.50, 1.00, 512] 
  l: [1.00, 1.00, 512] 
  x: [1.00, 1.50, 512] 

# YOLO26n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
  - [-1, 2, C3k2, [256, False, 0.25]] # 2
  - [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
  - [-1, 2, C3k2, [512, False, 0.25]] # 4
  - [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
  - [-1, 2, C3k2, [512, True]] # 6
  - [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
  - [-1, 2, C3k2, [1024, True]] # 8
  - [-1, 1, SPPF, [1024, 5, 3, True]] # 9
  - [-1, 2, C2PSA, [1024]] # 10

# YOLO26n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]] # 11
  - [[-1, 6], 1, Concat, [1]] # 12 cat backbone P4
  - [-1, 2, C3k2, [512, True]] # 13

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]] # 14
  - [[-1, 4], 1, Concat, [1]] # 15 cat backbone P3
  - [-1, 2, C3k2, [256, True]] # 16 (P3/8-small)

  # P2 addition
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]] # 17
  - [[-1, 2], 1, Concat, [1]] # 18 cat backbone P2
  - [-1, 2, C3k2, [128, True]] # 19 (P2/4-xsmall)

  - [-1, 1, Conv, [128, 3, 2]] # 20
  - [[-1, 16], 1, Concat, [1]] # 21 cat head P3
  - [-1, 2, C3k2, [256, True]] # 22 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]] # 23
  - [[-1, 13], 1, Concat, [1]] # 24 cat head P4
  - [-1, 2, C3k2, [512, True]] # 25 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]] # 26
  - [[-1, 10], 1, Concat, [1]] # 27 cat head P5
  - [-1, 1, C3k2, [1024, True, 0.5, True]] # 28 (P5/32-large)

  # Segment26 layer with required index ordering: P3, P2, P4, P5
  - [[22, 19, 25, 28], 1, Segment26, [nc, 32, 256]]


Yep, that works, and the key detail is exactly what you noted: the input order to Segment26 matters.

Seg models assume the mask prototypes come from a feature map at a specific downsample ratio (historically the “P3-first” layout), and parts of the mask pipeline and validation implicitly expect that ratio. If you pass P2 (stride 4) as the first input, the proto/mask scaling changes and you hit the classic “mat1 and mat2 shapes cannot be multiplied” error during mask IoU. Keeping P3 as the first input (as in your [[22, 19, 25, 28], 1, Segment26, ...], i.e. P3, P2, P4, P5) preserves the expected proto scaling while still letting you benefit from P2 features. The Segment26 head wiring is in the head module reference if you want to trace where the proto is produced.
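As a rough rule of thumb (my simplification; trace the head module source for the exact wiring), the proto branch consumes the first head input and upsamples it once, so the mask stride ends up at roughly half the first input’s stride:

```python
# Simplified model of proto scaling (illustrative only): the proto branch
# takes the FIRST Segment input and upsamples it 2x before producing masks.

def mask_stride(first_input_stride: int, proto_upsample: int = 2) -> int:
    return first_input_stride // proto_upsample

print(mask_stride(8))  # P3 first -> masks at stride 4 (the assumed default)
print(mask_stride(4))  # P2 first -> masks at stride 2, i.e. 2x larger masks
```

Which is why simply reordering the head inputs so P3 comes first sidesteps the hardcoded assumption without touching the library code.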

For your ~200×200 images, you’ll generally get more stable results if you train with an imgsz that’s a clean multiple of the stride (e.g. imgsz=256) rather than 200. If you share your train command and whether you’re fine-tuning from yolo26n-seg.pt, I can suggest the cleanest way to transfer weights into this modified P2 head.
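If it helps as a starting point, here is a hedged command fragment using the yolo CLI from the ultralytics package (yolo26-seg-p2.yaml and your-data.yaml are placeholder filenames for the modified config and your dataset config; adjust to your setup):

```shell
# placeholders: yolo26-seg-p2.yaml = the modified P2 config,
# your-data.yaml = your dataset config
yolo segment train model=yolo26-seg-p2.yaml pretrained=yolo26n-seg.pt \
     data=your-data.yaml imgsz=256 epochs=100
```

Passing `pretrained=` a checkpoint path transfers the weights of layers whose shapes still match, which covers most of the backbone here since only the head wiring changed.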