I trained YOLOv8n, YOLO11n, YOLO11s and YOLO26n on a custom hand gesture dataset for robot control.
Classes:
sit
down
up
bang
The validation results are very strong (Precision/Recall ≈ 0.99, mAP50 ≈ 0.99), and the confusion matrices show that the gesture classes themselves are recognized very well.
However, during real-world testing, background scenes were sometimes classified as gestures, resulting in more false positives than expected.
I did not use a separate background class and only included a limited number of background-only images.
My questions:
Can this happen even with very high mAP values?
Could missing background-only images explain the false positives?
Would you recommend adding more negative/background images without labels?
For static hand gestures, would MediaPipe or another keypoint-based approach generally be more suitable than YOLO?
Yes — this can absolutely happen with Ultralytics YOLO. High mAP only means the model does well on your labeled validation split; it does not guarantee good background rejection in deployment. The model testing guide covers this well: real-world testing often exposes overfitting, leakage, or missing negative cases that validation metrics miss.
In your case, the limited number of background-only images is a very likely cause. If the model mostly saw hands/gestures during training, it may learn “something hand-like = one of 4 classes” and produce false positives on cluttered scenes.
Yes, I’d recommend adding more hard negatives: background-only images, arms without gestures, partial hands, tools, shadows, weird lighting, robot workspace frames, etc. In detection, empty images with no labels are valid and useful negatives.
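If it helps, here is a minimal sketch of adding background frames to a YOLO-format dataset. It assumes the usual images/ and labels/ folder layout; the function name, the folder arguments, and the "bg_" prefix are made up for illustration. As far as I know an image with no label file is also treated as background, so writing the empty .txt is optional and just makes the intent explicit.

```python
from pathlib import Path
import shutil

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def add_background_images(src_dir, images_dir, labels_dir, prefix="bg_"):
    """Copy background-only images into a YOLO dataset as unlabeled negatives.

    The folder layout and the prefix are assumptions for this sketch.
    """
    src_dir, images_dir, labels_dir = Path(src_dir), Path(images_dir), Path(labels_dir)
    images_dir.mkdir(parents=True, exist_ok=True)
    labels_dir.mkdir(parents=True, exist_ok=True)

    added = []
    for img in sorted(src_dir.iterdir()):
        if img.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip anything that is not an image
        target = images_dir / f"{prefix}{img.name}"
        shutil.copy2(img, target)
        # An empty label file means "no objects in this image".
        (labels_dir / f"{target.stem}.txt").write_text("")
        added.append(target)
    return added
```

Usage would look something like add_background_images("raw_backgrounds", "dataset/images/train", "dataset/labels/train"), where the paths are placeholders for your own folders.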
I’d also do two quick things:
Run a true held-out test split with lots of real background frames.
Try a higher inference threshold, e.g. conf=0.5 or 0.6, since that often cuts false positives fast.
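To pick the threshold with some evidence rather than by feel, one option is to sweep it over a set of background-only frames and see how many still trigger a detection. This is a rough sketch: the scores below are invented, and the helper names are hypothetical. In practice you would fill the list with the highest detection confidence per frame (for example from the boxes.conf values of a low-threshold prediction run), using 0.0 for frames with no detection.

```python
def false_positive_rate(max_confidences, threshold):
    """Fraction of background frames with a detection at or above threshold."""
    if not max_confidences:
        raise ValueError("need at least one background frame")
    flagged = sum(1 for c in max_confidences if c >= threshold)
    return flagged / len(max_confidences)


def sweep_thresholds(max_confidences, thresholds=(0.25, 0.5, 0.6, 0.75)):
    """Map each candidate threshold to its background false positive rate."""
    return {t: false_positive_rate(max_confidences, t) for t in thresholds}


# Hypothetical data: highest detection confidence per background-only frame,
# 0.0 where the model predicted nothing.
background_scores = [0.0, 0.31, 0.0, 0.72, 0.55, 0.0, 0.28, 0.0]

for t, rate in sweep_thresholds(background_scores).items():
    print(f"conf={t:.2f}  background frames with a detection: {rate:.0%}")
```

It is worth running the same sweep on real gesture frames too, since a higher threshold that removes false positives can also cost recall.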
For static gestures, a keypoint approach like MediaPipe can be better if hands are close/visible and your main problem is pose classification. YOLO is usually better when you also need robust hand detection/localization in messy scenes. A hybrid pipeline is often
Thank you very much for your quick and helpful reply.
That makes sense and confirms my suspicion that the main issue is not necessarily the gesture classes themselves, but the limited amount of real negative/background examples in the dataset.
I’ll follow your suggestions and test more hard negatives, a real held-out background test split, and a higher confidence threshold.
A small validation set can produce a high mAP, but that doesn’t mean real-world performance is good. You need a large, representative validation set for mAP to be a reliable estimate.
You could add the images that produce false positives during real-world testing to your training dataset and then retrain.
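A rough sketch of that collection step, assuming you have a folder of frames you know contain no gesture. The count_detections argument is a hypothetical callable that returns how many boxes the model predicts for a frame; with Ultralytics it could be something like lambda p: len(model.predict(p, conf=0.5)[0].boxes), but treat that wiring as an assumption to check against your setup.

```python
from pathlib import Path
import shutil


def collect_false_positives(frame_paths, count_detections, out_dir):
    """Copy every known-background frame on which the detector fires.

    frame_paths must contain only frames without gestures, so any
    detection on them is a false positive by construction.
    count_detections is a hypothetical callable: path -> number of boxes.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    collected = []
    for frame in map(Path, frame_paths):
        if count_detections(frame) > 0:
            shutil.copy2(frame, out_dir / frame.name)
            collected.append(frame.name)
    return collected
```

The collected frames would then go into the training set without labels before retraining. Keep some of them out of training as well, so the held-out test still measures background rejection on frames the model has not seen.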