Confusion matrix contains many classes not in dataset

When I run model.val() before training the model on my dataset, the confusion matrix contains several classes that are not specified in the dataset’s YAML file. Issue #16695 explains why the background class is present, but why are all these other classes (person, bicycle, car, etc.) present?

Here’s my code (which I’m running in Google Colab):

!pip install ultralytics
!pip install roboflow

from roboflow import Roboflow
rf = Roboflow(api_key=[api_key_goes_here])
project = rf.workspace("conveyor-550m0").project("conveyor-hhrzw")
version = project.version(3)
dataset = version.download("yolov11")

from ultralytics import YOLO
model = YOLO("yolo11n.pt")

results = model.val(data="/content/conveyor-3/data.yaml")
print(results.confusion_matrix.to_df()) # First confusion matrix
train_results = model.train(data="/content/conveyor-3/data.yaml", epochs=1)
results = model.val()
print(results.confusion_matrix.to_df()) # Second confusion matrix

The YAML file (/content/conveyor-3/data.yaml) contains three classes:

nc: 3
names: ['cardboard box', 'conveyor', 'kartonbox']
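
(For completeness: the rest of the file is the usual Roboflow export layout, roughly the path entries below. These are typical values rather than lines copied from my file; only nc and names above are verbatim.)

train: ../train/images
val: ../valid/images
test: ../test/images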

The first confusion matrix (before training) looks like this:

┌────────────┬────────┬─────────┬─────┬───┬────────────┬────────────┬────────────┬────────────┐
│ Predicted  ┆ person ┆ bicycle ┆ car ┆ … ┆ teddy_bear ┆ hair_drier ┆ toothbrush ┆ background │
│ ---        ┆ ---    ┆ ---     ┆ --- ┆   ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
│ str        ┆ f64    ┆ f64     ┆ f64 ┆   ┆ f64        ┆ f64        ┆ f64        ┆ f64        │
╞════════════╪════════╪═════════╪═════╪═══╪════════════╪════════════╪════════════╪════════════╡
│ person     ┆ 0.0    ┆ 0.0     ┆ 0.0 ┆ … ┆ 0.0        ┆ 0.0        ┆ 0.0        ┆ 5.0        │
│ bicycle    ┆ 0.0    ┆ 0.0     ┆ 0.0 ┆ … ┆ 0.0        ┆ 0.0        ┆ 0.0        ┆ 0.0        │
│ car        ┆ 0.0    ┆ 0.0     ┆ 0.0 ┆ … ┆ 0.0        ┆ 0.0        ┆ 0.0        ┆ 0.0        │
│ motorcycle ┆ 0.0    ┆ 0.0     ┆ 0.0 ┆ … ┆ 0.0        ┆ 0.0        ┆ 0.0        ┆ 0.0        │
│ airplane   ┆ 0.0    ┆ 0.0     ┆ 0.0 ┆ … ┆ 0.0        ┆ 0.0        ┆ 0.0        ┆ 0.0        │
│ …          ┆ …      ┆ …       ┆ …   ┆ … ┆ …          ┆ …          ┆ …          ┆ …          │
│ scissors   ┆ 0.0    ┆ 0.0     ┆ 0.0 ┆ … ┆ 0.0        ┆ 0.0        ┆ 0.0        ┆ 0.0        │
│ teddy_bear ┆ 0.0    ┆ 0.0     ┆ 0.0 ┆ … ┆ 0.0        ┆ 0.0        ┆ 0.0        ┆ 0.0        │
│ hair_drier ┆ 0.0    ┆ 0.0     ┆ 0.0 ┆ … ┆ 0.0        ┆ 0.0        ┆ 0.0        ┆ 0.0        │
│ toothbrush ┆ 0.0    ┆ 0.0     ┆ 0.0 ┆ … ┆ 0.0        ┆ 0.0        ┆ 0.0        ┆ 0.0        │
│ background ┆ 152.0  ┆ 34.0    ┆ 5.0 ┆ … ┆ 0.0        ┆ 0.0        ┆ 0.0        ┆ 0.0        │
└────────────┴────────┴─────────┴─────┴───┴────────────┴────────────┴────────────┴────────────┘

The second confusion matrix (after training) looks like this:

┌───────────────┬───────────────┬──────────┬───────────┬────────────┐
│ Predicted     ┆ cardboard_box ┆ conveyor ┆ kartonbox ┆ background │
│ ---           ┆ ---           ┆ ---      ┆ ---       ┆ ---        │
│ str           ┆ f64           ┆ f64      ┆ f64       ┆ f64        │
╞═══════════════╪═══════════════╪══════════╪═══════════╪════════════╡
│ cardboard_box ┆ 0.0           ┆ 0.0      ┆ 0.0       ┆ 0.0        │
│ conveyor      ┆ 0.0           ┆ 0.0      ┆ 0.0       ┆ 0.0        │
│ kartonbox     ┆ 0.0           ┆ 0.0      ┆ 0.0       ┆ 0.0        │
│ background    ┆ 167.0         ┆ 46.0     ┆ 5.0       ┆ 0.0        │
└───────────────┴───────────────┴──────────┴───────────┴────────────┘

As expected, the second confusion matrix contains the three classes from the dataset’s YAML file plus the background class. I expected the first confusion matrix to contain only those four classes as well. Even before training, I expected model validation to use the classes listed in the dataset’s YAML file.

Where does model.val() get all those other classes (person, bicycle, car, etc.) from?

You can’t use a model trained on a different set of classes to validate on your own dataset. The model will always predict the classes it was trained on.
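
You can see this directly on the checkpoint itself: the class map is stored with the model weights and is independent of whatever data.yaml you pass to val(). A minimal sketch (the printed values are from memory of the COCO class list):

from ultralytics import YOLO

# The pretrained checkpoint carries its own class map (COCO's 80 classes),
# regardless of which data.yaml you validate against.
model = YOLO("yolo11n.pt")
print(len(model.names))                        # 80
print({i: model.names[i] for i in range(3)})   # {0: 'person', 1: 'bicycle', 2: 'car'}

That is presumably also why the unmatched ground-truth counts in your first matrix land under the person, bicycle and car columns: your dataset’s label indices 0, 1 and 2 are being read against the model’s own name list.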

It also doesn’t make sense to run model.val() after model.train(), because model.train() already runs a final validation pass and returns the same metrics that model.val() would, unless you’re training on multiple GPUs.
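
For example (single-GPU; this assumes train() hands back the same DetMetrics object that val() produces, which is my reading of the current API):

from ultralytics import YOLO

model = YOLO("yolo11n.pt")

# On a single GPU, train() runs a final validation pass and returns its metrics,
# so a separate model.val() call afterwards just recomputes the same numbers.
train_results = model.train(data="/content/conveyor-3/data.yaml", epochs=1)
print(train_results.box.map50)                  # mAP@0.5
print(train_results.box.map)                    # mAP@0.5:0.95
print(train_results.confusion_matrix.to_df())   # same as your second matrix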

Does the ultralytics Python package provide a way to evaluate the out-of-the-box performance of a pretrained YOLO model on a given dataset? I want to compare performance before and after training to see how much improvement we get from fine-tuning the pretrained model on our dataset.

This is not possible for a closed-set YOLO model trained on a different set of classes. A model can only predict the classes it was trained on; the accuracy on any other class will be 0. It only learns to predict your classes after training.

The pretrained model was trained on the COCO dataset, so it can only predict the classes in that dataset.
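
If you just want a baseline number for the pretrained checkpoint, the only meaningful evaluation is on data with COCO classes, e.g. the small coco8.yaml config that ships with ultralytics. That gives you a sanity check of the pretrained weights, not a before/after comparison on your own classes:

from ultralytics import YOLO

model = YOLO("yolo11n.pt")

# Meaningful only because coco8 uses the same 80 classes the model was trained on;
# on the 3-class conveyor dataset the pretrained model's mAP is simply 0.
baseline = model.val(data="coco8.yaml")
print(baseline.box.map50)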