How can YOLOv8 reduce false recognition? How can it reduce false recognition of similar objects?

Hello everyone, I'd like to ask a question and hope to receive your help. Thank you all in advance. For recognizing the scissors gesture, I started with 50,000 samples and expanded them to a dataset of 150,000 samples by rotating ±90°. I first trained on this dataset of 150,000 images (with no background images and completely correct annotations) for 100 epochs. After training completed, the model misrecognized other gestures, for example the OK gesture, palm gesture, fist gesture, and gestures with only two or three fingers extended.
My first solution is:
Starting from the last.pt model trained for those 100 epochs, I added 10k misidentified images to the dataset as background images, and the misrecognition was significantly reduced. However, the model still misrecognizes some images that look like scissors gestures. Moreover, at long distances it recognizes any gesture as scissors, regardless of whether it actually is one. At first I intended to add these misrecognized long-distance targets as background images as well, but I was concerned that as the amount of background data grew, positive recognition would deteriorate (I had previously found that when more than 20,000 background images were added, positive recognition became very poor).
Later I tried a different approach, which led to the second solution.
My second solution is:
I labeled all the misidentified gestures as "other". The scissors gestures themselves were also split more finely: the scissors gesture shown from the back of the hand is labeled "Reverse_scissor", and the one shown from the front of the palm is labeled "Forward_scissor". The sample counts are: other: 90,000, Forward_scissor: 60,000, Reverse_scissor: 50,000. However, after 100 epochs of training and running inference with the model, I found that it overfit slightly to "other": sometimes when I raised the scissors gesture, it was also recognized as the "other" class. I suspect this is due to the sample sizes, so I have now reduced them to around 50,000 each and started training for 200 epochs. I'm worried this training will also turn out poorly, so I sincerely hope to receive your suggestions.
I also checked the official Ultralytics documentation and the Ultralytics issues. The links are: box, cls, and dfl loss gain · Issue #10375 · ultralytics/ultralytics · GitHub, which discusses the cls and box parameters in the documentation; can I raise cls so the model pays more attention to classification? Issue "How to set per-class loss weights for multi-class detection to handle an imbalanced dataset" · Issue #15615 · ultralytics/ultralytics · GitHub mentions that parameters can be set in the YAML, but I don't know how to set them up. I'm extremely eager to know how to reduce the misrecognition of other gestures. I've been troubled by this for several days. If any friends see this post, please kindly help me. Thank you very much!
The following are my training configuration and my dataset configuration:
Training configuration
from ultralytics import YOLO
import time

if __name__ == '__main__':
    model = YOLO("yolov8n.pt")
    model.train(
        data=r"E:\Project_Gesture\model_script\scissors.yaml",
        imgsz=640,
        device=0,
        lr0=0.01,
        epochs=200,
        batch=64,
        close_mosaic=10,
        name="3clss_sciss_yolo8n",
        fliplr=0.5,
        flipud=0.5,
        degrees=15,
        mosaic=0.5,
        mixup=0.3,
        plots=True,
        scale=0.5,
    )

Configuration of the yaml file of the dataset:
train: [E:\All_Data\Gesture_Data\train.txt]
val: [E:\All_Data\Gesture_Data\val.txt]

class_weights: [0.67, 1.0, 1.2]

nc: 3

names: ['Forward_scissor', 'Reverse_scissor', 'other']

Hello! Thanks for the detailed post. It’s great to see the systematic approach you’re taking to solve this problem.

Your second method of creating an explicit other class for negative samples is a very effective strategy for reducing false positives. The issue of overfitting to the other class can often be traced back to class imbalance, so your decision to balance the number of samples across all classes is an excellent step.

To further improve your results, you might consider adjusting your data augmentation. The mosaic augmentation, which is enabled by default (mosaic=1.0), is highly effective for teaching the model about different object scales and contexts. This could be particularly helpful for the issues you’re seeing with long-distance detections. You can find more details on this and other augmentations in our Data Augmentation guide.

You also asked about emphasizing classification. You’re on the right track. You can adjust the weight of the classification loss by modifying the cls hyperparameter in your training command. Increasing its value (the default is 0.5) encourages the model to prioritize correct classification over bounding box precision.

Here’s a quick example of how you might adjust your training call:

from ultralytics import YOLO

if __name__ == '__main__':
    model = YOLO("yolov8n.pt")
    model.train(
        data=r"E:\Project_Gesture\model_script\scissors.yaml",
        epochs=200,
        batch=64,
        imgsz=640,
        close_mosaic=10,  # Good practice to disable mosaic late in training
        mosaic=1.0,       # Ensure mosaic is enabled for most of the training
        cls=0.75,         # Increase classification loss weight (default is 0.5)
        # ... other args
    )

You can find a full list of these training arguments and their descriptions in our Model Training documentation.

Keep up the great work, and let us know how it goes! The community’s strength lies in members like you who share their challenges and solutions.

First, modifying the model/training parameters is worth experimenting with; however, there are other "cheaper" (easier to test) methods you should explore first.

  1. Try training without augmenting your dataset. The YOLO training cycle includes augmentations during training, so it could be redundant and cause overfitting.
    • Additionally, a ±90° degree rotation is technically a different hand gesture :victory_hand:
  2. Unless you have a specific need to use YOLOv8n, try using a larger model, like YOLOv8s. Oftentimes a slightly larger model can help considerably with classification performance (see the sketch after this list).
    • Note: if using YOLOv8s causes an out-of-memory error, you can lower the batch size, or you can freeze layers with model.train(..., freeze=10) (freezes the first 10 layers of model weights), which will reduce memory usage.
  3. Label the other gestures as what they truly are, ‘okay’, ‘palm’, ‘fist’, etc. instead of just “other”. During model training, it optimizes for the best filters that will separate the objects into their respective classes. Since you’re working with hand gestures, the model will likely develop filters that are very good at recognizing hands versus any other object, but not necessarily a specific hand gesture. Without giving the model any additional information about the possible other states a hand could be shown (other hand gestures), it is more likely to falsely classify an incorrect hand gesture. When you gave the “other” category, it overfit to that category, because there are many more variations of the “not scissors” hand gesture. You should try labeling the other hand gestures you expect the model to see, even when you don’t care about them, as it will help the model separate those instances from the one you want.
  4. It would take a bit more work, but you could also try to use a segmentation model instead. A segmentation model needs contour data, which might help the model better distinguish the gesture you want versus all others. That's because the shape of the object is embedded in the contour information, and for a hand gesture like "scissors", that might be enough. I assume you have bounding box data at the moment; you can use SAM2 to help convert the bounding boxes into contour annotations.
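For point 2, here is a minimal sketch of what the larger-model option could look like. The dataset path is taken from the post above; the freeze value and the smaller batch size are only illustrative starting points:

from ultralytics import YOLO

if __name__ == '__main__':
    # A step up from yolov8n; often helps with fine-grained classification
    model = YOLO("yolov8s.pt")
    model.train(
        data=r"E:\Project_Gesture\model_script\scissors.yaml",
        imgsz=640,
        epochs=200,
        batch=32,     # lower the batch size if you run out of GPU memory
        freeze=10,    # freeze the first 10 layers to reduce memory usage
        device=0,
    )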

Thank you for providing valuable suggestions. I will try your method later. I rechecked the dataset today. The labels were all correct, but I found that there were only a few gestures at other angles. So I rotated the images and labels with a script to cover as many gesture angles as possible, and then adopted a blogger's method to solve the misrecognition. His approach is to add negative samples into the Mosaic augmentation, which guarantees that the misdetected images appear in every epoch and in every mosaic. This can fix the false detections in fewer training epochs. If it succeeds, I will share it with everyone. The general principle is as follows:
YOLOv8 supports training directly on "unlabeled negative samples" (referred to as "background" images in v8), but by default they are only mixed in with the positive samples. As a result, newly added false-detection images are not guaranteed to be seen in every epoch. After modifying the code, the false-detection images are included in every epoch of training.
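For comparison, the standard (unmodified) way to include background images in YOLOv8 is simply to list them in the training set without label files; images that have no labels are treated as background. It lacks the per-epoch guarantee of the modified Mosaic code, but requires no code changes. A minimal sketch, assuming train.txt lists image paths and the background images sit in a hypothetical new_neg folder:

from pathlib import Path

train_list = Path(r"E:\All_Data\Gesture_Data\train.txt")
neg_dir = Path(r"E:\All_Data\Gesture_Data\new_neg")  # hypothetical folder of unlabeled background images

# Append every background image to the training list; with no matching label
# file, YOLOv8 treats these images as background (no objects to detect).
with train_list.open("a", encoding="utf-8") as f:
    for img in sorted(neg_dir.glob("*.jpg")):
        f.write(str(img) + "\n")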

Hello! It’s great to see the detailed and thoughtful approach you’ve taken to solve this challenge. Your second method of creating an ‘other’ class for misidentified gestures is a very effective strategy for this kind of problem.

To build on that, you could try increasing the classification loss weight to encourage the model to focus more on distinguishing between classes. You can do this by adjusting the cls hyperparameter in your training script. For example, you might try increasing it from the default of 0.5:

# Increase classification loss gain
model.train(data='scissors.yaml', cls=1.0)  # plus your other training arguments

Also, ensure your ‘other’ class is diverse and contains many ‘hard negatives’—gestures that look very similar to your target ‘scissors’ gestures. This will help the model learn the subtle differences that distinguish the classes.

The issue with misidentifying distant gestures is common. Make sure your training data includes examples of gestures at various scales and distances. The mosaic augmentation, which is enabled by default, helps with this, and your use of close_mosaic is a great practice to stabilize training at the end.

You can find more details on all available training settings in our Configuration Guide and the Model Training Guide.

Keep up the great work, and we’d love to hear about your final results

Thank you all for the suggestions. I plan to use your methods in my follow-up work. I have finished training with that blogger's method. Before that, I collected some samples myself to make up for the insufficient angle coverage, and then trained with the blogger's method. His approach is to modify the Mosaic-related code in YOLOv8 so that a background image is guaranteed to appear every time a training mosaic is built. After training completed, I found that the mAP was very high and no overfitting was observed. Although there were almost no misidentifications, when the model was actually used its positive recognition was very poor. I have no idea what's going on, so I'm posting the screenshots of my training and hope you can help analyze it.
Among the data, 10k background images were added. The neg_num=-1 in the code means that 0 to 1 background images are added in each training iteration.
The training code is as follows:
from ultralytics import YOLO

if __name__ == '__main__':
    model = YOLO("yolov8n.pt")
    model.train(
        data=r"E:\All_Data\Gesture_Data\to_autodl\scissors.yaml",
        neg_dir=r"E:\All_Data\Gesture_Data\to_autodl\new_neg",  # custom argument added by the modified code
        name='gestures',
        neg_num=-1,  # custom argument added by the modified code
        imgsz=640,
        device=0,
        epochs=200,
        batch=64,
        degrees=10,
        plots=True,
        close_mosaic=10
    )

The screenshots of the training are below:







Once the code is modified, it does make it a lot more difficult to diagnose issues. You’ll need to share examples of both the background negative images and positive images with labels.

The two confusion matrix plots you shared are strange: the first shows no values for background detections, yet shows 2 true reverse_scissor samples being predicted as the background class. The second confusion matrix shows 0.47 and 0.53 of the true background samples being predicted as reverse_scissor and forward_scissor.
The mismatch of information is odd, but assuming the second one is correct for this training, it does show the problem you’re facing. The model is incorrectly predicting background class samples as belonging to one of the classes, when it shouldn’t make a prediction (which would place it in the background class). The model is predicting many false positives, which means that it’s still not able to distinguish whatever samples you have as background from your positive classes.
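If it helps with the diagnosis, the validation plots (including the confusion matrix) can be regenerated from the trained weights. A minimal sketch, assuming the run name and dataset path from the post above (adjust the weights path to your actual run directory):

from ultralytics import YOLO

# Load the trained weights and run validation; plots such as the confusion
# matrix and PR curves are saved into the new validation run directory.
model = YOLO(r"runs/detect/gestures/weights/best.pt")  # assumed path based on name='gestures'
metrics = model.val(data=r"E:\All_Data\Gesture_Data\to_autodl\scissors.yaml", plots=True)
print(metrics.box.map50)  # mAP at IoU 0.5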

Thank you very much for your suggestions. I have been trying other methods these past few days, but the results have not been good. In the end I went back to the method you mentioned and labeled the misidentifications as "other". To verify whether this method is effective, I reduced my samples to 20k, of which Forward_scissor and Reverse_scissor were 8.7k and 10.3k respectively. The misidentified pictures were assigned to the "other" category, roughly 9k of them. After 150 epochs of training, I tested with the computer's camera. At different distances and from different angles, some gestures that belong to the "other" category were still recognized as the scissors gesture. Some gestures are particularly prone to misrecognition; for instance, a three-finger gesture, which should be "other", is still recognized as the scissors gesture. I considered continuing to increase the number of "other" samples, but that would create a class imbalance. If I follow your approach, what can I do to improve my situation? I have tried every method I can think of so far, but the results are not good.
Regarding the samples:
My scissors gesture samples, as well as the "other" samples, use the public dataset linked below: GitHub - hukenovs/hagrid: HAnd Gesture Recognition Image Dataset. In addition, images were captured with mobile phone cameras, computer webcams, and other cameras, randomly rotated, and then labeled with a pre-trained .pt model.
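For reference, a minimal sketch of that kind of auto-labeling step with a pre-trained model; the model file and image folder names here are hypothetical, and the predictions are written as YOLO-format .txt files that should still be reviewed and corrected by hand:

from ultralytics import YOLO

# Hypothetical pre-trained gesture model used only to pre-annotate new images
model = YOLO("pretrained_gesture.pt")

# save_txt writes YOLO-format label files alongside the prediction outputs;
# save_conf keeps confidences so low-confidence boxes can be filtered later.
model.predict(
    source=r"E:\All_Data\Gesture_Data\raw_images",  # hypothetical folder of captured images
    save_txt=True,
    save_conf=True,
    conf=0.5,
)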
Training script:
from ultralytics import YOLO

if __name__ == '__main__':
    model = YOLO("yolov8n.pt")
    model.train(
        data=r"E:\All_Data\Gesture_Data\test\scissors.yaml",
        imgsz=640,
        device=0,
        lr0=0.01,
        epochs=150,
        batch=64,
        close_mosaic=10,
        name="test",
        fliplr=0.5,
        flipud=0.5,
        degrees=10,
        # mosaic=0.5,
        # mixup=0.3,
        plots=True,
        scale=0.5,
        cls=1.0
    )

The following is the result after my training. Please take a look:






Some angles lead to misidentification. For example, in the following two pictures (crops of the detected regions), the three-finger gesture should be "other" but is identified as "Reverse_scissor", while the scissors gesture is mistakenly classified as "other".

Given you’re using a pre-annotated dataset, I would recommend keeping the original classes defined in the repository for the dataset. As I mentioned before, there are many variations of hand gestures and lumping them together can cause more confusion for the model. Given the classes are already annotated, there’s very minimal cost to train using the separate classes. This should help the model better distinguish the hand gestures and reduce the false positive rate.

I have tried the method you mentioned over the past few days. Each class has approximately 10,000 samples, and I trained for 50 epochs. For the scissors gesture I applied random rotations, while for the other gestures I used the dataset images directly and labeled them with the gesture model, without any rotation. The model still misidentified between classes. This might be due to an insufficient number of epochs and the fact that the other gestures were not rotated.

With your approach there are already many gesture classes, and many of them need labeling. I don't think it's practical to enumerate every possible gesture. Can yolov8n classify this many classes? I have specific requirements and can only use the nano model, which is why I originally put everything that is not a scissors gesture into the "other" category. Also, if each class has 10,000 images, the dataset could grow to well over a hundred thousand images split across many classes, and the time cost is very high. I'm wondering whether gestures that look similar could be merged into one class. That would reduce the number of classes; although the dataset stays the same size, the model would at least be under less pressure when classifying.
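If similar gestures are merged into one class, a minimal sketch of remapping class IDs in YOLO-format label files could look like the following. The ID mapping and the labels folder are purely illustrative; nc and names in the dataset YAML must be updated afterwards to match the merged classes:

from pathlib import Path

# Illustrative mapping: original class ID -> merged class ID
# e.g. fold several visually similar gestures into class 2
ID_MAP = {0: 0, 1: 1, 2: 2, 3: 2, 4: 2}

labels_dir = Path(r"E:\All_Data\Gesture_Data\labels")  # hypothetical labels folder

for txt in labels_dir.rglob("*.txt"):
    remapped = []
    for line in txt.read_text().splitlines():
        parts = line.split()  # YOLO format: "class x_center y_center width height"
        if parts:
            parts[0] = str(ID_MAP[int(parts[0])])
            remapped.append(" ".join(parts))
    txt.write_text("\n".join(remapped) + ("\n" if remapped else ""))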
The following picture shows the training I ran for 50 epochs a few days ago, following the method you mentioned.

The pretrained COCO models are trained to detect 80 different object types. The OpenImagesv7 pretrained models are trained to detect over 600 different object types. You can train for as many classes as needed, although I would suspect that over 1,000 might be a bit excessive.

My recommendation is to use the default training settings, don't add any extra rotations, and set the epochs to 200-300 at least. With a large number of classes and samples, it's very likely the model will take much longer to train completely. Be patient. The COCO models I linked to above take ~500 epochs to completely finish training with 80 object classes. You also need a good baseline (using the defaults without rotating images) to compare everything else against. The default settings work very well for most situations.
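As a baseline, that could look like the following minimal sketch; the dataset path is taken from your earlier script, and everything not listed stays at its default value:

from ultralytics import YOLO

if __name__ == '__main__':
    model = YOLO("yolov8n.pt")
    # Default augmentations and loss gains; only the essentials are set.
    model.train(
        data=r"E:\All_Data\Gesture_Data\test\scissors.yaml",
        epochs=300,
        imgsz=640,
        batch=64,
        device=0,
    )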

Finally, keep in mind that there are likely always going to be some misclassifications or missed detections. No model is 100% perfect all the time.

Thank you! I’ve decided to give it another try. If I succeed, I’ll reply to you again. Then I’ll tell everyone about the processes of these failures and the reasons I guessed myself. (Misidentification is really a devil! :smiling_face_with_horns:) :sob:


Hello! Thanks for the detailed updates and for sharing your journey with the community. This is indeed a challenging problem, and your persistence is commendable. Misidentification between similar classes can be very tricky to solve.

Your idea of grouping similar gestures is excellent. Instead of one single “other” class, creating a few distinct negative classes for the most common false positives (like ‘three_fingers’, ‘fist’, etc.) is a very effective strategy. This helps the model learn the specific, subtle features that distinguish ‘scissors’ from its closest look-alikes without the complexity of labeling every possible gesture.

Also, you’re right to consider model capacity. For tasks with many classes or very subtle differences, a slightly larger model like yolov8s might provide the extra capacity needed to learn more complex decision boundaries compared to yolov8n.

You are on the right track. Keep up the great work, and we’re all looking forward to hearing about your success

Hello, everyone! I think I'm close to success! :smiley: After several days of unremitting effort, the model performs much better than when I was only adding background images. The method used in this training was to label every gesture; there are 19 gestures in total. The number of samples per gesture class is almost balanced, and the annotations are accurate. I still want to improve the model further, though, because this training did not stop naturally but stopped prematurely. I'm wondering if my loss is already too small. The training ran for approximately 150 epochs in two sessions. The first session was 100 epochs (I had originally set epochs=500, but since training ran on a cloud service, a network issue disconnected it at epoch 100, so I had to resume from last.pt). The second session was set to epochs=400, but the model stopped after only 50 more epochs. I now want to train again from the second run's last.pt. My goal is to improve the model's performance rather than stop early, but I'm also worried about overfitting, so I hope you can help me analyze it roughly. Thank you all! :revolving_hearts:

Screenshots of early stop and loss





Training configuration
from ultralytics import YOLO
import time

if __name__ == '__main__':
    model = YOLO('/root/ultralytics/runs/detect/multi_gesture/weights/last.pt')
    model.train(
        data='/root/autodl-tmp/gestures.yaml',
        imgsz=640,
        device=0,
        lr0=0.007,
        epochs=500,
        batch=64,
        name="multi_gesture",
        fliplr=0.5,
        flipud=0.5,
        degrees=10,
        plots=True,
        scale=0.5,
        cls=0.75
    )
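As a side note on the early stopping described above: in Ultralytics training, the stop is governed by the patience argument (the number of epochs to wait for a validation improvement before stopping). A minimal sketch of raising it, and of resuming an interrupted run, using the paths from the script above; the patience value is only illustrative:

from ultralytics import YOLO

if __name__ == '__main__':
    model = YOLO('/root/ultralytics/runs/detect/multi_gesture/weights/last.pt')
    # Raise patience so early stopping waits longer before deciding the model
    # has stopped improving on the validation set.
    model.train(
        data='/root/autodl-tmp/gestures.yaml',
        epochs=400,
        patience=200,  # illustrative value; the default is lower
        device=0,
    )

    # A run that was cut off (e.g. by a network disconnect) can instead be
    # resumed in place with:
    # model.train(resume=True)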


Congrats! Those results are really good so far.
