Hi,

I just read the YOLO paper from 2015, which states that the predictions are encoded in a tensor of shape `SxSx(5*B + C)`.

So that means that for every cell in the grid there are 5 params per box + probabilities for each class.

This is the output shape for training, right?

I would assume that in inference mode (for task=detection) the output for every grid cell should be `B * (5 + C)`, so that the class probabilities are tied to each box in B.

Or how are the c classes in C connected to b boxes in B?

Thanks


@robin_rob96 the YOLOv8 model (I'm going to refer to the COCO pretrained model; custom models will be *slightly* different) has an output shape of `[N, 84, 8400]` prior to non-max suppression (NMS), where `N` is the batch size for inference. The 84 represents the four bounding box attributes + 80 class confidence scores, and the 8400 represents all predictions made by the model.

Each prediction in a tensor representing a single image (i.e. `batch=1`), indexed `[0, 8399]` along the last dimension, will be `[box] + [class-confidence]` with a shape of `[1, 84]` or `[84,]`.
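To make that layout concrete, here is a minimal sketch that uses a random NumPy array as a stand-in for the raw `[1, 84, 8400]` output. No real model is run, and the variable names are only illustrative. It assumes the four box attributes come first in each prediction, followed by the 80 class scores, as described above. Since every prediction carries exactly one box, its class scores belong to that box.

```python
import numpy as np

# Stand-in for the raw model output before NMS: [batch, 4 + 80, 8400].
# Random values only; a real model would produce this tensor.
rng = np.random.default_rng(0)
output = rng.random((1, 84, 8400), dtype=np.float32)

# Take the single image and move predictions to the first axis: [8400, 84].
preds = output[0].T

# Each prediction is [box] + [class-confidence].
boxes = preds[:, :4]         # [8400, 4] box attributes
class_scores = preds[:, 4:]  # [8400, 80] one score per class

# One box per prediction, so its class is simply the best-scoring entry.
class_ids = class_scores.argmax(axis=1)   # [8400]
confidences = class_scores.max(axis=1)    # [8400]

print(preds[0].shape)                   # shape of a single prediction
print(boxes.shape, class_scores.shape)  # box part and class part
```

Filtering by `confidences` and running NMS would be the next steps; they are left out here to keep the focus on the tensor layout.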

The paper you’re referring to is for the original YOLO model structure, which is different from today’s structure used with Ultralytics YOLO. For a custom model trained on 7 classes with `batch=1`, the model output before NMS would be `[1, 11, 8400]`, composed of the four bounding box attributes plus the seven class confidence scores.
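The two examples follow the same pattern, which can be written as a small helper. This is only a sketch: `raw_output_shape` is a hypothetical function, not part of the Ultralytics API, and the default of 8400 predictions is taken from the shapes quoted in this thread.

```python
def raw_output_shape(num_classes, batch=1, num_predictions=8400):
    """Pre-NMS output shape: [batch, 4 box attributes + classes, predictions].

    Hypothetical helper for illustration only; 8400 is the prediction
    count quoted in this thread, not something computed here.
    """
    return (batch, 4 + num_classes, num_predictions)


print(raw_output_shape(80))  # COCO pretrained model: (1, 84, 8400)
print(raw_output_shape(7))   # custom 7-class model:  (1, 11, 8400)
```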

During training, the output for a YOLOv8 model will be the same as it is for inference. The major difference is that during training the predictions are used to calculate the loss and update the model, which does not occur during inference.
