Need advice

Hi there!

I need some advice.

I have a task to count the passengers in the bus. I have a camera looking vertically down at the door area. (Pictures attached)

I have trained my model with a single class. So, in my opinion, I can neglect metrics such as box loss, cls loss and DFL loss. (Am I right?)

My training arguments are:

task: detect
mode: train
model: /content/drive/MyDrive/YOLO_detection/trainCfg-1000-16_DayPlusNight_mozaic_Y11s_runs/train/weights/last.pt
data: /content/data.yaml
epochs: 1000
time: null
patience: 50
batch: 30
imgsz: 640
save: true
save_period: 50
cache: disk
device: null
workers: 32
project: /content/drive/MyDrive/YOLO_detection/trainCfg-1000-16_DayPlusNight_mozaic_Y11s_runs
name: train
exist_ok: false
pretrained: yolo11s.pt
optimizer: auto
verbose: true
seed: 0
deterministic: true
single_cls: true
rect: false
cos_lr: true
close_mosaic: 10
resume: /content/drive/MyDrive/YOLO_detection/trainCfg-1000-16_DayPlusNight_mozaic_Y11s_runs/train/weights/last.pt
amp: true
fraction: 1.0
profile: false
freeze: null
multi_scale: false
compile: false
overlap_mask: true
mask_ratio: 4
dropout: 0.0
val: true
split: val
save_json: false
conf: null
iou: 0.7
max_det: 300
half: false
dnn: false
plots: true
source: null
vid_stride: 1
stream_buffer: false
visualize: false
augment: true
agnostic_nms: false
classes: null
retina_masks: false
embed: null
show: false
save_frames: false
save_txt: false
save_conf: false
save_crop: false
show_labels: true
show_conf: true
show_boxes: true
line_width: null
format: torchscript
keras: false
optimize: false
int8: false
dynamic: false
simplify: true
opset: null
workspace: null
nms: false
lr0: 0.01
lrf: 0.01
momentum: 0.937
weight_decay: 0.0005
warmup_epochs: 3.0
warmup_momentum: 0.8
warmup_bias_lr: 0.0
box: 0.1
cls: 0.1
dfl: 0.1
pose: 12.0
kobj: 1.0
nbs: 64
hsv_h: 0.015
hsv_s: 0.7
hsv_v: 0.4
degrees: 15.0
translate: 0.5
scale: 0.5
shear: 0.0
perspective: 0.0
flipud: 0.0
fliplr: 0.5
bgr: 0.0
mosaic: 1.0
mixup: 0.15
cutmix: 0.0
copy_paste: 0.3
copy_paste_mode: flip
auto_augment: randaugment
erasing: 0.0
cfg: null
tracker: botsort.yaml

After 500 epochs I have these metrics:

Precision - 0.934

Recall - 0.941

mAP50 - 0.973

mAP50-95 - 0.77

The accuracy on a real video is about 92%.

I want to know:

Are these metrics already at their limit, or can they be improved?

If they can be improved, which arguments should I change, and how?

Thank you in advance!

It’s not useful to dump all the training arguments, because it’s hard to tell which were changed and which were kept at their defaults. You should post only the arguments that you updated.
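
For example, something like this is much easier to review (the values here are just copied from your dump to illustrate the format, not recommendations):

from ultralytics import YOLO

model = YOLO("yolo11s.pt")
# pass only the arguments you actually changed from the defaults
model.train(
    data="data.yaml",
    epochs=1000,
    batch=30,
    single_cls=True,
    cos_lr=True,
    degrees=15.0,
    translate=0.5,
)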

Do you just have a single camera? You should undistort the images and train the model on those; it will probably work better. During inference, run the model on frames after undistorting them as well.

Thank you!

And YES, I have a single camera.

This performance seems quite good, but in all likelihood you are the best person to judge if it’s “good enough” (maybe if you’re delivering it to someone, they would be the judge). Could it be better? Probably. Does it need to be better? Again, you’re the best person to decide that.

Your camera is fisheye. You can remove the fisheye distortion if you calibrate the camera.
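
If it helps, here is a rough sketch of the one-off calibration with a printed checkerboard (the folder name, board size, and file pattern are placeholders to adapt to your setup):

import cv2, numpy as np, glob

CB = (9, 6)  # inner corners of the printed checkerboard; adjust to your target
objp = np.zeros((1, CB[0] * CB[1], 3), np.float64)
objp[0, :, :2] = np.mgrid[0:CB[0], 0:CB[1]].T.reshape(-1, 2)

objpoints, imgpoints = [], []
for path in glob.glob("calib/*.jpg"):  # frames of the checkerboard taken with the bus camera
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, CB)
    if found:
        objpoints.append(objp)
        imgpoints.append(corners)

K, D = np.zeros((3, 3)), np.zeros((4, 1))
flags = cv2.fisheye.CALIB_RECOMPUTE_EXTRINSIC | cv2.fisheye.CALIB_FIX_SKEW
rms, K, D, _, _ = cv2.fisheye.calibrate(objpoints, imgpoints, gray.shape[::-1], K, D, flags=flags)
print("RMS reprojection error:", rms)  # roughly below 1 px is usually a usable calibration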

Thank you so much!

But tell me, would it be better to calibrate the camera if the training dataset was made without calibration…

You would have to train on undistorted images too, and run inference on undistorted images.
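
A rough sketch of preparing the training set, assuming K and D come from the calibration step and (w, h) is your frame size. One thing to watch: the existing labels were drawn on the distorted frames, so they won’t line up exactly after remapping and will need re-checking or re-annotation:

import cv2, numpy as np
from pathlib import Path

# maps computed once from the calibration result; here K is reused as the new projection matrix
map1, map2 = cv2.fisheye.initUndistortRectifyMap(K, D, np.eye(3), K, (w, h), cv2.CV_16SC2)
src, dst = Path("dataset/images"), Path("dataset_undistorted/images")
dst.mkdir(parents=True, exist_ok=True)
for p in src.glob("*.jpg"):
    und = cv2.remap(cv2.imread(str(p)), map1, map2, cv2.INTER_LINEAR)
    cv2.imwrite(str(dst / p.name), und)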

Thank you!

Glad it helped! Two quick closing tips you may find useful:

  • Single-class training: you can largely ignore the classification loss, but don't ignore the box and DFL losses; they drive localization quality and your mAP50-95.
  • Fisheye → undistort: calibrate once, undistort every frame, and retrain on undistorted images; then run inference on undistorted frames. This usually boosts high-IoU metrics and counting stability. A minimal OpenCV fisheye snippet:
import cv2, numpy as np
# K (3x3 intrinsics) and D (4x1 distortion) come from cv2.fisheye.calibrate
# balance=0.0 crops to valid pixels; balance=1.0 keeps the full fisheye field of view
K_new = cv2.fisheye.estimateNewCameraMatrixForUndistortRectify(K, D, (w, h), np.eye(3), balance=0.0)
map1, map2 = cv2.fisheye.initUndistortRectifyMap(K, D, np.eye(3), K_new, (w, h), cv2.CV_16SC2)
undistorted = cv2.remap(frame, map1, map2, cv2.INTER_LINEAR)

For counting, use tracking with a line/polygon ROI so each person is only counted once:

from ultralytics import YOLO
model = YOLO("best.pt")
model.track(source="bus.mp4", tracker="botsort.yaml", persist=True)
# apply your ROI/line-crossing logic on tracked IDs
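
Concretely, the crossing logic could look something like this (LINE_Y and the "downward means entering" direction are assumptions about your door view):

from ultralytics import YOLO

LINE_Y = 360                     # hypothetical counting line across the door area, in pixels
model = YOLO("best.pt")
last_y, counted = {}, set()

for result in model.track(source="bus.mp4", tracker="botsort.yaml", persist=True, stream=True):
    if result.boxes.id is None:  # no tracked boxes in this frame
        continue
    for box, tid in zip(result.boxes.xywh, result.boxes.id.int().tolist()):
        cy = float(box[1])       # vertical centre of the person box
        prev = last_y.get(tid)
        if prev is not None and prev < LINE_Y <= cy:
            counted.add(tid)     # this ID crossed the line downward; count it once
        last_y[tid] = cy

print("passengers counted:", len(counted))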

If you want a last bit of accuracy: slightly increase imgsz (e.g., 800–960) and tone down the overly strong geometric augmentations for this fixed top-down view (see the sketch after the links below). A short overview of why calibration helps is in our guide on camera calibration, and practical steps for region-based counting are outlined in our region-based counting article:

  • See the camera fundamentals in the 2025 Vision AI camera calibration guide on the Ultralytics blog.
  • Explore practical region-based counting workflows in the Region-Based Object Counting with YOLO11 article.
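
For the augmentation tweak, a sketch of what the training call could look like (values are illustrative starting points, not tuned recommendations):

from ultralytics import YOLO

model = YOLO("yolo11s.pt")
model.train(
    data="data.yaml",
    imgsz=960,        # larger input helps the small, top-down person boxes
    single_cls=True,
    degrees=5.0,      # the fixed overhead view doesn't need strong rotation...
    translate=0.1,    # ...or heavy translation/scale jitter
    scale=0.3,
    mixup=0.0,
)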

If you share one short undistorted clip plus your current best.pt, we can sanity-check results.

Thank you so much. I will try!