Issue: Model Performs Well During Training but Poorly on Evaluation
Problem Description
During training, the model shows good performance metrics in the logs, but in the final evaluation it scores near zero on every metric, as if it had never been trained.
Training Logs (Last 3 Epochs):
| epoch | time | train/box_loss | train/cls_loss | train/dfl_loss | metrics/precision(B) | metrics/recall(B) | metrics/mAP50(B) | metrics/mAP50-95(B) | val/box_loss | val/cls_loss | val/dfl_loss | lr/pg0 | lr/pg1 | lr/pg2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 198 | 49616.6 | 0.69604 | 0.47192 | 0.88007 | 0.83006 | 0.72811 | 0.81305 | 0.6184 | 0.70753 | 0.48988 | 0.89207 | 0.000105495 | 0.000105495 | 0.000105495 |
| 199 | 49860.6 | 0.69621 | 0.47175 | 0.88079 | 0.83219 | 0.72686 | 0.81298 | 0.618 | 0.7076 | 0.48996 | 0.89208 | 0.000102443 | 0.000102443 | 0.000102443 |
| 200 | 50104.7 | 0.69451 | 0.47151 | 0.87952 | 0.83122 | 0.72626 | 0.81188 | 0.61713 | 0.70762 | 0.48988 | 0.89205 | 0.000100611 | 0.000100611 | 0.000100611 |
Training Code:
```python
from datetime import datetime

from ultralytics import YOLO

model_yaml_path = "Custom_Model_cfg/yolo11_Modify.yaml"
data = "Custom_dataset_cfg/vehicle_orientation.yaml"

if __name__ == '__main__':
    model = YOLO(model_yaml_path)
    results = model.train(
        data=data,
        epochs=200,
        batch=32,
        imgsz=640,
        cos_lr=True,
        close_mosaic=50,
        save=True,
        device="0",
        name="yolo11_Modify" + datetime.now().strftime("%Y%m%d_%H_%M"),
    )
```
I suspected the model might be overfitting, i.e. doing well on the training set but poorly on the test set, so I built a new dataset by sampling 1000 images from the training set and evaluated the best trained weights on it; a sketch of one way to build such a subset follows the results table. The results were just as poor:
| Class | Images | Instances | Box (P) | Box (R) | mAP50 | mAP50-95 |
|---|---|---|---|---|---|---|
| all | 600 | 2956 | 0.000716 | 0.00257 | 0.000304 | 7.43e-05 |
| car | 581 | 2431 | 0.00287 | 0.0103 | 0.000798 | 0.000211 |
| motorcycle | 23 | 27 | 0 | 0 | 0.000298 | 7.45e-05 |
| bus | 34 | 40 | 0 | 0 | 0 | 0 |
| truck | 275 | 458 | 0 | 0 | 0.00012 | 1.2e-05 |
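As promised above, here is a minimal sketch of one way to carve such a subset out of the training split. The directory layout, paths, and YAML name below are placeholders for illustration, not the original project structure:

```python
import random
import shutil
from pathlib import Path

# Placeholder paths: adjust to your own dataset layout.
src_images = Path("datasets/vehicle_orientation/images/train")
src_labels = Path("datasets/vehicle_orientation/labels/train")
dst = Path("datasets/vehicle_orientation_subset")

(dst / "images" / "val").mkdir(parents=True, exist_ok=True)
(dst / "labels" / "val").mkdir(parents=True, exist_ok=True)

# Randomly copy up to 1000 training images together with their label files.
images = sorted(src_images.glob("*.jpg"))
for img in random.sample(images, k=min(1000, len(images))):
    shutil.copy(img, dst / "images" / "val" / img.name)
    label = src_labels / f"{img.stem}.txt"
    if label.exists():
        shutil.copy(label, dst / "labels" / "val" / label.name)

# Point a new data YAML (e.g. Custom_dataset_cfg/train_subset.yaml) at dst
# and pass it to model.val(data=...).
```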
Validation Code:
```python
from ultralytics import YOLO

# Load a model
model_path = r"C:\Users\Hunger\Desktop\ultralytics\runs\detect\yolo11_Modify\weights\last.pt"
data = r"Custom_dataset_cfg/test.yaml"

if __name__ == '__main__':
    model = YOLO(model_path)

    # Validate the model
    metrics = model.val(data=data)  # no extra arguments needed; dataset and settings are remembered
    metrics.box.map    # mAP50-95
    metrics.box.map50  # mAP50
    metrics.box.map75  # mAP75
    metrics.box.maps   # list of per-class mAP50-95
```
Great catch, and thanks for closing the loop! A broken fuse() will tank eval because YOLO11 fuses layers before validation. When you fix your custom block, mirror the pattern used in core modules: fold BN into Conv, delete the BN, and switch the forward to the fused path. You can see how we do it in the model-level fuse flow in the Model.fuse reference and the BaseModel.fuse reference, plus a concrete example in the RepVGGDW.fuse example.
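Here is a minimal sketch of that pattern for a custom block. The class name, layer layout, and the decision to reassign forward inside fuse() are assumptions for illustration; only fuse_conv_and_bn and the fold/delete/switch steps mirror what the core modules do, and exactly how your block gets picked up during model fusing depends on how you wired it into BaseModel.fuse.

```python
import torch.nn as nn

from ultralytics.utils.torch_utils import fuse_conv_and_bn


class CustomBlock(nn.Module):
    """Illustrative custom block: Conv -> BN -> SiLU, with a fuse() for inference."""

    def __init__(self, c1, c2, k=3):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        # Fused path: BN statistics are already folded into the conv weights.
        return self.act(self.conv(x))

    def fuse(self):
        # 1) fold BN into Conv, 2) delete the BN, 3) switch to the fused forward.
        self.conv = fuse_conv_and_bn(self.conv, self.bn)
        delattr(self, "bn")
        self.forward = self.forward_fuse
```

However the block is hooked in, the invariant is the same: after fuse(), forward_fuse must produce the same outputs the unfused forward did.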
Quick sanity check you can keep in your tests: raw model outputs should be numerically close before vs. after fuse(). Compare at the module level, since predict() already fuses the model on its first call:

```python
import torch

from ultralytics import YOLO

m = YOLO("path/to/your.pt")
net = m.model.eval()            # underlying detection model
x = torch.rand(1, 3, 640, 640)  # dummy input

with torch.no_grad():
    y0 = net(x)[0]              # unfused forward
    net.fuse(verbose=False)     # fold BN into Conv via each block's fuse()
    y1 = net(x)[0]              # fused forward

d = (y0 - y1).abs().max().item()
print(f"Max abs output diff (unfused vs fused): {d:.6f}")  # should be tiny, ~1e-5 or less
```
Until your fuse is fixed, your workaround model.model.is_fused = lambda: True before val() is fine. If anything else pops up, update to the latest Ultralytics package and share a minimal repro—happy to take a look.
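For completeness, a sketch of where that stopgap slots into the validation script above (same paths as in your snippet); drop it once the custom fuse() works:

```python
from ultralytics import YOLO

model = YOLO(r"C:\Users\Hunger\Desktop\ultralytics\runs\detect\yolo11_Modify\weights\last.pt")
model.model.is_fused = lambda: True  # fuse() now reports "already fused" and skips BN folding
metrics = model.val(data=r"Custom_dataset_cfg/test.yaml")
print(metrics.box.map50)
```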