Very bad validation metrics on a custom dataset with a fine-tuned model

Hi everyone, I'm new to YOLO and I've been fascinated by this world, but I've run into some trouble :pensive_face:.
I fine-tuned yolov10m on 4 classes with 2,000 instances each. After 500 epochs I got mAP50: 90% and mAP50-95: 75%, but when I tested the fine-tuned model on another validation set the results were horrible:

                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)
                   all        123        674      0.568      0.084      0.126     0.0651
                   car        115        522      0.806    0.00192      0.129     0.0887
         traffic light         17         25          1          0    0.00257   0.000257
               bicycle         20         21       0.46      0.333       0.33       0.16
             crosswalk         70        106    0.00751   0.000921     0.0423      0.012

I've been thinking it might be my data distribution (maybe too many car instances) or the size of my new dataset. What could be the problem? Which experiments should I run to figure it out?
validation from my own dataset:
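One quick experiment for the class-imbalance hypothesis is to count the instances per class in the training labels and in the new validation labels and compare the two. A minimal sketch, assuming YOLO-format .txt label files (one "class x y w h" line per object) and hypothetical directory paths:

from collections import Counter
from pathlib import Path

def class_counts(label_dir):
    # Count instances per class id in a directory of YOLO-format .txt label files
    counts = Counter()
    for label_file in Path(label_dir).glob("*.txt"):
        for line in label_file.read_text().splitlines():
            if line.strip():
                counts[int(line.split()[0])] += 1
    return counts

# Hypothetical paths: adjust to your actual dataset layout
print("train:  ", class_counts("dataset/labels/train"))
print("new val:", class_counts("dataset_new/labels/val"))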

It sounds like overfitting. Does your dataset consist of unique images or did you augment it to reach that number?

What’s the training code you used? Did you start with a pretrained model?


For my fine-tuning dataset I used images taken from these datasets: BDD100k, KITTI.
My training code is:

from ultralytics import YOLO

model = YOLO('yoloModels/yolov10s.pt')  # start from a pretrained checkpoint
print(model.args)
results = model.train(data="dataset/data.yaml", epochs=500, imgsz=640, batch=8, save_period=50,
                      project="entrenamientos", name="yolov10s_500", device=0)
results = model.val()  # validates on the val split defined in data.yaml

Hi Ernesto,

Thanks for sharing the details of your issue. It’s quite common to see a performance drop when validating on a dataset different from the one used during training, especially if the data distributions or characteristics (like image conditions, object sizes, annotation styles) vary significantly between the two sets.

The large difference between your initial validation results (mAP50: 90%) and the results on the new set (mAP50: 12.6%) strongly suggests a domain gap between your training/original validation data (derived from BDD100k, KITTI) and this new validation set.

To investigate further, you could:

  1. Compare the visual characteristics and annotation quality of your new validation set against your training and original validation sets. Are there noticeable differences in lighting, camera angles, object scales, or how objects are labeled?
  2. Analyze the per-class metrics on the new validation set (as you’ve shown). The very low scores for ‘traffic light’ and ‘crosswalk’ might indicate these classes are particularly different or underrepresented in the new set compared to your training data.
  3. Examine the prediction images (val_batch*_pred.jpg) generated during validation on the new set. These visuals can offer direct insight into how the model is failing (e.g., missing objects, incorrect classifications, poor localization). You can find these images in the validation run directory, usually runs/detect/val/ (a sketch covering points 2 and 3 follows this list).
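A minimal sketch for points 2 and 3, assuming the best checkpoint from the training run above and a hypothetical data_new.yaml that points at the new validation set; plots=True writes the val_batch*_pred.jpg images mentioned in point 3:

from ultralytics import YOLO

# Hypothetical paths: adjust to your run directory and the new dataset's yaml
model = YOLO("entrenamientos/yolov10s_500/weights/best.pt")
metrics = model.val(data="data_new.yaml", imgsz=640, plots=True)

print(f"mAP50:    {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")

# Per-class mAP50-95, to see which classes drive the drop
for class_idx, class_map in enumerate(metrics.box.maps):
    print(f"{model.names[class_idx]}: {class_map:.3f}")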

Understanding the differences between the datasets is key to addressing the performance gap.
