Trying to understand C3k2 block

Sankyuu · December 5, 2024, 3:11pm

Hi Ultralytics Community!
First, I want to apologize for any typos.
This is not a serious problem, since I’ve trained the model without any issues, but I want to understand the latest version of YOLO, YOLO11, as well as its architecture.
So far, I’ve searched the files on Ultralytics’s GitHub (such as yolov11.yaml, block.py, conv.py) and looked at the summary of the model (yolo11n.pt) through print().
In the C3k2 block (through print()), I found the following:

I wonder why cv2 gets 48 as its input channels while the previous output is 32?
I searched inside block.py and found that the c3k argument is set to False, so the C3k2 block uses Bottleneck blocks.
In the Bottleneck block (block.py), I found that cv2 takes the output channels of cv1 as its input channels.

I’m very sorry if my question is foolish.

Thank you in advance!

pderrenger · December 6, 2024, 12:36am

Hi there! No need to apologize—your question is insightful, and it’s great to see your interest in understanding YOLO11’s architecture, especially the C3k2 block. Let’s dive into it!

Understanding the `C3k2` Block:

As you’ve noticed, the C3k2 block is part of the YOLO11 architecture, defined in block.py. When c3k is set to False, C3k2 essentially uses standard Bottleneck blocks rather than the custom C3k implementation.

Why does `cv2` take 48 as `c1` when the previous output is 32?

This discrepancy often relates to how channel dimensions are managed in the architecture. Specifically:

Hidden Channels:
C3k2 inherits its layout from C2f, where the hidden channel count is self.c = int(c2 * e). cv1 is a 1x1 convolution that maps the c1 input channels (32 in this case) to 2 * self.c channels, and its output is then split into two chunks of self.c channels each.
Concatenation:
The two chunks, plus the output of each of the n blocks in self.m, are concatenated with torch.cat() before they reach cv2. cv2 therefore sees (2 + n) * self.c input channels rather than the output of the previous layer. With n = 1, an input of 48 channels means self.c = 16, i.e. three 16-channel tensors concatenated.

You can refer to the forward() method that C3k2 inherits from C2f in block.py to confirm how cv1 and cv2 interact in the network pipeline. Here’s an overview of the relevant constructor lines:

self.c = int(c2 * e)  # Hidden channels
self.cv1 = Conv(c1, 2 * self.c, 1, 1)  # First convolution, output is split into two chunks
self.cv2 = Conv((2 + n) * self.c, c2, 1)  # Last convolution, input is the concatenated chunks

Here, self.c is calculated internally, and cv2’s input dimension (48 in your case) is (2 + n) * self.c, not the output channel count of the previous layer.
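To make the channel bookkeeping concrete, here is a minimal pure-Python sketch. It assumes the C2f-style layout in which the hidden width is int(c2 * e), cv1 outputs twice that width, and cv2 consumes 2 + n chunks of it. The helper name c3k2_channels is hypothetical, and e=0.25 is assumed because it is the value that yields 16 hidden channels for c2=64.

```python
def c3k2_channels(c1, c2, n=1, e=0.5):
    """Return (in, out) channel pairs for the convs of a C2f-style block.

    Hypothetical helper for illustration only; it mirrors the assumed
    constructor arithmetic and does not import Ultralytics.
    """
    c = int(c2 * e)  # hidden channels per chunk
    return {
        "cv1": (c1, 2 * c),        # 1x1 conv whose output is later split in two
        "m": [(c, c)] * n,         # each inner block keeps c channels
        "cv2": ((2 + n) * c, c2),  # runs on the concatenation of 2 + n chunks
    }


# Assumed values matching the question: 32 channels in, 64 out, one inner block.
layout = c3k2_channels(c1=32, c2=64, n=1, e=0.25)
for name, shape in layout.items():
    print(name, shape)
```

With these assumed arguments, cv2 comes out as (48, 64), which matches the 48 input channels reported in the question.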

Checking the Code:

You can walk through the layer operations programmatically:

from ultralytics.nn.modules.block import C3k2

# Example initialization (e=0.25 gives 16 hidden channels, so cv2 takes 48)
c3k2_block = C3k2(c1=32, c2=64, n=1, c3k=False, e=0.25)
print(c3k2_block)

Additional Suggestion:

To further debug or track how channels are changing, you might consider placing print statements in the forward pass of the relevant modules:

def forward(self, x):
    print("Input shape:", x.shape)  # Inspect input dimensions
    y = list(self.cv1(x).chunk(2, 1))  # cv1 output split into two chunks
    y.extend(m(y[-1]) for m in self.m)  # Append each inner block's output
    print("Shapes concatenated for cv2:", [t.shape for t in y])
    ...

Documentation Reference:

You might find the source code reference for C3k2 and its parent classes helpful: C3k2 in block.py. It explains how each component is structured.

Keep exploring! Your efforts to understand YOLO11 details are invaluable, and they’ll undoubtedly deepen your grasp of modern AI architectures. Let us know if you have any follow-up questions—we’re happy to help.

Toxite · December 6, 2024, 3:58am

You can see the forward function.

In [5]: model.model.model[2].forward??
Signature: model.model.model[2].forward(x)
Source:
    def forward(self, x):
        """Forward pass through C2f layer."""
        y = list(self.cv1(x).chunk(2, 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))
File:      /ultralytics/ultralytics/nn/modules/block.py
Type:      method

cv2 runs on a tensor concatenated from 3 other outputs of 16 channels each. So 3 x 16 = 48.
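The same count can be followed step by step without loading a model. The sketch below tracks only channel counts through the forward() shown above; the function name trace_forward_channels is hypothetical, and it assumes that cv1 outputs 32 channels and that each module in self.m preserves the channel count of its input.

```python
def trace_forward_channels(cv1_out, n):
    """Follow channel counts through the forward() above (spatial dims ignored)."""
    half = cv1_out // 2
    y = [half, half]      # self.cv1(x).chunk(2, 1): two equal chunks
    for _ in range(n):    # y.extend(m(y[-1]) for m in self.m)
        y.append(y[-1])   # assumption: each m keeps its input channel count
    return y, sum(y)      # torch.cat(y, 1) adds the channel counts together


chunks, cv2_in = trace_forward_channels(cv1_out=32, n=1)
print(chunks, cv2_in)  # [16, 16, 16] 48
```

Under the same assumptions, a block with n=2 would feed four 16-channel tensors, i.e. 64 channels, into cv2.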

Sankyuu · December 6, 2024, 5:46am

Dear Toxite!
Thanks for your help. Your answer may be the key to understanding this further. I’m not yet sure what the true answer is, but your advice will help me explore the architecture of YOLO11.
Best regards

BurhanQ · December 7, 2024, 9:41pm

I can tell you that I rely on answers from Toxite, so I would encourage you to believe him. If you still don’t, then you can always run the forward method yourself on a dummy tensor to verify what the output looks like.

marctornero · July 17, 2025, 6:31pm

Hi Sankyuu,

I just came across this post. The C3k2 block can definitely be tricky, especially since it has two different versions depending on whether c3k=True or False.

I’ve included the diagrams I created for my YOLO11 video series, which helped me better understand and visualize these blocks (along with tensor shapes).

If you’d prefer a video explanation, I cover them in detail in videos 9 and 10, focusing on the C3k2 block with c3k=False and c3k=True, respectively.

While this reply might be a bit late, I hope it’s still useful to you and others!

Marc

Video series link:

pderrenger · July 18, 2025, 1:29pm

Hi @Sankyuu, and a big thank you to @marctornero for sharing those excellent diagrams!

That’s a great question about the C3k2 block’s architecture. It looks like there might be a small misunderstanding about which tensor cv2 receives.

cv1 is the only convolution in the C3k2 block that takes the 32-channel input from the previous layer. cv2 is the last convolution in the block, so it never sees that input directly.

The data flow for this block (with c3k=False) is as follows. The output of cv1 is split into two chunks of 16 channels each. The second chunk is then processed by the Bottleneck layers (m), which is why they operate on 16-channel feature maps.

Finally, the two chunks and the Bottleneck output are concatenated (totaling 48 channels) and passed through the final cv2 convolution to produce the block’s 64-channel output.

I hope this helps clarify the data flow and channel dimensions within the block!
