YOLOv8m converges too fast, overfitting occurs before epoch 10, adjusting LR doesn't really help, any suggestions?

Hi,

My task is object detection on 1 class, using grayscale images (created from annotated videos). I have ~400k images in my dataset.

My YOLOv8m model converges after a few epochs. I tried adjusting lr0 and lrf to really low values (lr0=0.00125, lrf=0.00000625), as well as cos_lr and the Adam optimizer, but it does not seem to help.

From what I read it usually takes ~100-300 epochs to converge.

Here are my training parameters

Ultralytics YOLOv8.0.55 🚀 Python-3.8.5 torch-2.0.0+cu117 CUDA:0 (NVIDIA RTX A5000, 24114MiB)

yolo/engine/trainer: task=detect, mode=train, model=yolov8m.pt, data=/workspace/ultralytics/compile_yolov8m/my_ds.yaml, epochs=100, patience=50, batch=64, imgsz=(640, 480), save=True, save_period=1, cache=False, device=0, workers=8, project=yolov8m_drone, name=retrain_yolov8m_30-lr0=0.00125-lrf=0.00000625-warmup_epochs=7-epochs=100-batch=64-cos_lr=Default, exist_ok=False, pretrained=True, optimizer=SGD, verbose=True, seed=0, deterministic=True, single_cls=False, image_weights=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop=False, hide_labels=False, hide_conf=False, vid_stride=1, line_thickness=3, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, boxes=True, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.00125, lrf=6.25e-06, momentum=0.937, weight_decay=0.0005, warmup_epochs=7, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, fl_gamma=0.0, label_smoothing=0.0, nbs=64, hsv_h=0, hsv_s=0, hsv_v=0, degrees=10, translate=0.1, scale=0.3, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=0.0, mixup=0.0, copy_paste=0.0, cfg=None, v5loader=False, tracker=botsort.yaml, save_dir=yolov8m/retrain_yolov8m_30-lr0=0.00125-lrf=0.00000625-warmup_epochs=7-epochs=100-batch=64-cos_lr=Default

Overriding model.yaml nc=80 with nc=1

  • What else can I do to improve my model?
  • Can you suggest a strategy to make it as accurate as possible?
  • Is this behavior normal, or might it suggest a problem with the data, etc.? If so, can you provide a strategy to fix it?
  • Also note that I load the grayscale images as RGB (due to the YOLO architecture). Would you recommend changing that? If so, how?

For reference, here is my CLI output (note that it overfits within the very first epochs):

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      1/100      23.7G       1.91      1.213     0.8506         26        640: 100%|██████████| 5524/5524 [58:01<00:00,  1.59it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [02:26<00:00,  1.93it/s]
                   all      36298      25870      0.778       0.62      0.636      0.224

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      2/100      23.1G      1.648     0.7622      0.821         25        640: 100%|██████████| 5524/5524 [57:48<00:00,  1.59it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [02:03<00:00,  2.30it/s]
                   all      36298      25870      0.843      0.624      0.677      0.243

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      3/100      23.1G      1.575     0.7043     0.8166         30        640: 100%|██████████| 5524/5524 [57:55<00:00,  1.59it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [02:37<00:00,  1.80it/s]
                   all      36298      25870      0.846        0.6      0.661      0.226

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      4/100      23.1G      1.522     0.6724     0.8135         19        640: 100%|██████████| 5524/5524 [59:55<00:00,  1.54it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [02:02<00:00,  2.33it/s]
                   all      36298      25870       0.85        0.6       0.66      0.229

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      5/100      23.1G      1.496     0.6596     0.8127         29        640: 100%|██████████| 5524/5524 [57:05<00:00,  1.61it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [02:00<00:00,  2.35it/s]
                   all      36298      25870       0.86      0.598      0.672      0.247

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      6/100      23.1G      1.489     0.6603     0.8128         27        640: 100%|██████████| 5524/5524 [57:08<00:00,  1.61it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [01:58<00:00,  2.40it/s]
                   all      36298      25870      0.893      0.562      0.679      0.272

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      7/100      23.1G      1.496     0.6662     0.8148         26        640: 100%|██████████| 5524/5524 [57:09<00:00,  1.61it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [01:56<00:00,  2.44it/s]
                   all      36298      25870      0.885      0.504      0.656      0.293

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      8/100      23.2G      1.478      0.659     0.8132         23        640: 100%|██████████| 5524/5524 [57:06<00:00,  1.61it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [01:54<00:00,  2.47it/s]
                   all      36298      25870      0.878      0.461      0.632      0.294

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      9/100      23.2G      1.437     0.6347     0.8096         24        640: 100%|██████████| 5524/5524 [57:08<00:00,  1.61it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [01:54<00:00,  2.48it/s]
                   all      36298      25870      0.901      0.456      0.623      0.276

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     10/100      23.2G      1.404     0.6174     0.8066         29        640: 100%|██████████| 5524/5524 [57:08<00:00,  1.61it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [01:54<00:00,  2.49it/s]
                   all      36298      25870      0.938      0.463      0.629      0.256

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     11/100      23.2G      1.377     0.6024     0.8047         29        640: 100%|██████████| 5524/5524 [57:08<00:00,  1.61it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [01:53<00:00,  2.50it/s]
                   all      36298      25870      0.931      0.466       0.63      0.236

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     12/100      23.2G      1.357     0.5903     0.8028         31        640: 100%|██████████| 5524/5524 [57:07<00:00,  1.61it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [01:54<00:00,  2.49it/s]
                   all      36298      25870      0.919      0.486      0.636      0.229

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     13/100      23.2G      1.338     0.5813     0.8007         23        640: 100%|██████████| 5524/5524 [57:09<00:00,  1.61it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [01:54<00:00,  2.49it/s]
                   all      36298      25870      0.909      0.494       0.63      0.219

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     14/100      23.2G      1.319     0.5721     0.8001         28        640: 100%|██████████| 5524/5524 [57:08<00:00,  1.61it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [01:53<00:00,  2.49it/s]
                   all      36298      25870      0.898      0.489      0.621      0.213

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     15/100      23.2G      1.306     0.5656      0.799         26        640: 100%|██████████| 5524/5524 [57:08<00:00,  1.61it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [01:53<00:00,  2.50it/s]
                   all      36298      25870      0.887      0.494      0.614      0.204

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     16/100      23.2G      1.292      0.559     0.7979         23        640: 100%|██████████| 5524/5524 [56:59<00:00,  1.62it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [01:53<00:00,  2.51it/s]
                   all      36298      25870      0.863      0.485      0.595      0.189

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     17/100      23.2G      1.279     0.5525     0.7967         26        640: 100%|██████████| 5524/5524 [57:00<00:00,  1.61it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [01:54<00:00,  2.49it/s]
                   all      36298      25870      0.847      0.476      0.578      0.184

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     18/100      23.2G      1.267     0.5457     0.7958         28        640: 100%|██████████| 5524/5524 [57:06<00:00,  1.61it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [01:54<00:00,  2.49it/s]
                   all      36298      25870       0.84      0.478      0.567      0.181

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     19/100      23.2G      1.257     0.5412      0.795         24        640: 100%|██████████| 5524/5524 [57:08<00:00,  1.61it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [01:54<00:00,  2.49it/s]
                   all      36298      25870      0.848      0.475       0.57      0.183

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     20/100      23.2G      1.248     0.5377     0.7947         31        640: 100%|██████████| 5524/5524 [57:07<00:00,  1.61it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [01:53<00:00,  2.50it/s]
                   all      36298      25870      0.845      0.464      0.555      0.171

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     21/100      23.2G       1.24     0.5331     0.7939         33        640: 100%|██████████| 5524/5524 [57:08<00:00,  1.61it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [01:54<00:00,  2.49it/s]
                   all      36298      25870      0.828      0.464      0.551      0.171

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     22/100      23.2G      1.231      0.528     0.7939         25        640: 100%|██████████| 5524/5524 [57:57<00:00,  1.59it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [01:54<00:00,  2.48it/s]
                   all      36298      25870      0.814      0.456      0.546      0.171

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     23/100      23.2G      1.222     0.5248     0.7928         30        640: 100%|██████████| 5524/5524 [57:27<00:00,  1.60it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 284/284 [01:54<00:00,  2.49it/s]
                   all      36298      25870      0.794      0.454      0.543      0.172

> created from annotated videos

It's probably because you sampled very similar frames from the videos for training. You should only include distinct images in your training set. Otherwise the model will overfit, as you're seeing.

Also use a more recent version of Ultralytics. Your current one is over 2 years old.

lrf is not the exact value of the final learning rate. It's a learning rate factor.

  • Yeah, it makes sense that the samples are similar, since they are taken from video sessions.
  • I know, but I can't use the current version due to HAILO chip requirements.
  • It does affect it, though: the final LR is lrf * lr0.

I will have to get more diverse data eventually, but until then I'm trying my best to get good results with what I have.
At the moment I've added some augmentation, increased L2 regularization aggressively, and modified the loss weights. Any other suggestions?

# exp 31: decrease iou & box, increase cls - trying to get more accurate classes, less accurate bboxes
iou=0.35
box=3.75 # 7.5/2
cls=1 # 0.5*2
# exp 32: add new aug, increase L2 reg *10
weight_decay=0.025 # increase to make L2 more aggressive
mosaic=0.8 # cutmix=0.3 # not supported in the docker
hsv_v=0.1
close_mosaic=20 # $((epochs/10))

It's better to have less data than to have redundant images that are almost the same. You can't really "cheat" your way to more data by adding slightly different versions of the same images. Redundant data not being informative is why techniques like active learning exist.
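
One way to act on this is to drop frames that barely differ from the last frame you kept before building the training set. Below is a minimal sketch, assuming grayscale frames held as 2-D numpy arrays. The helper names (`thumbnail`, `select_distinct`) and the `min_diff` threshold are hypothetical and would need tuning on real footage; this is not something Ultralytics provides.

```python
import numpy as np


def thumbnail(frame: np.ndarray, size: int = 16) -> np.ndarray:
    """Downsample a 2-D grayscale frame to size x size by block averaging."""
    h, w = frame.shape
    # Crop so that height and width divide evenly into size blocks.
    h_crop, w_crop = h - h % size, w - w % size
    blocks = frame[:h_crop, :w_crop].astype(np.float32)
    blocks = blocks.reshape(size, h_crop // size, size, w_crop // size)
    return blocks.mean(axis=(1, 3))


def select_distinct(frames, min_diff: float = 8.0):
    """Return indices of frames that differ enough from the last kept frame.

    min_diff is a hypothetical threshold on the mean absolute difference
    between thumbnails (0-255 intensity scale).
    """
    kept = []
    last = None
    for i, frame in enumerate(frames):
        thumb = thumbnail(frame)
        if last is None or np.abs(thumb - last).mean() >= min_diff:
            kept.append(i)
            last = thumb
    return kept


# Synthetic demo: two "scenes", each followed by near-identical frames.
scene_a = np.full((480, 640), 40, dtype=np.uint8)
scene_b = np.full((480, 640), 200, dtype=np.uint8)
frames = [scene_a, scene_a + 1, scene_a + 2, scene_b, scene_b + 1]

print(select_distinct(frames))
```

On these synthetic frames only indices 0 and 3 are kept, i.e. one frame per scene. A fixed temporal stride (every N-th frame) is a cruder alternative if computing differences is too slow for ~400k images.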

The reason I mentioned that lrf isn't the final value is that the value you specified (lrf=0.00000625) looks like you're treating it as the final learning rate, not a factor. If it is applied as a factor, the final learning rate in your case becomes 0.00125 * 0.00000625 = 0.0000000078125 (about 7.8e-9), which is unusually low, if that's what you actually intended.
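
The arithmetic can be checked with a few lines of Python. This is a sketch that assumes the final learning rate is `lr0 * lrf`, as stated in this thread, and that the non-cosine schedule decays linearly between the two. The `linear_lr` helper is illustrative, not the library's own function, and `intended_final_lr` is only a guess at what was meant.

```python
lr0 = 0.00125      # initial learning rate from the training log
lrf = 0.00000625   # value passed as lrf in the training log
epochs = 100


def linear_lr(epoch: int, epochs: int, lr0: float, lrf: float) -> float:
    """Assumed linear decay from lr0 (epoch 0) down to lr0 * lrf (last epoch)."""
    return lr0 * ((1 - epoch / epochs) * (1.0 - lrf) + lrf)


# lrf acts as a factor, so the final learning rate is the product.
final_lr = lr0 * lrf
print(f"final lr with lrf as a factor: {final_lr:.4e}")

# If 6.25e-6 was meant to be the final learning rate itself (an assumption),
# the factor that produces it is final / initial.
intended_final_lr = 0.00000625
needed_lrf = intended_final_lr / lr0
print(f"lrf needed for a final lr of {intended_final_lr:.2e}: {needed_lrf:.4f}")

# The assumed schedule ends at the same product.
print(f"schedule at last epoch: {linear_lr(epochs, epochs, lr0, lrf):.4e}")
```

Under these assumptions, `lrf=0.005` would be the setting that ends training at 6.25e-6.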