Hi guys.
I have trained a model with YOLOv8 using the training set as my validation set, but I get very good results on unseen data.
Now when I split the data into separate training and validation sets, I get an overfitted model that doesn't predict well.
Is this chance or something else?
This is quite an unusual result. That said, it is possible that with a small dataset you don’t have enough samples in the training set when splitting the data. Can you share more information about the dataset size, number of classes, and number of instances?
I am trying to read Persian car plates. My whole dataset consists of 420 images of plates, 60 of which are used as the validation set.
There are also about 20 classes.
420 images should be considered a starting dataset, but not necessarily a full dataset. I recommend reading through this guide from the Docs:
Even though it's a YOLOv5 guide, the principles are the same for any object detection model. Additionally, be sure to check out the article linked at the end, as it further expands on the peculiarities of training neural network models.
The critical point here is that the model needs to "discover" the proper filters to correctly detect and classify the features that sort objects into the right classes. To accomplish this, it needs a very large number of samples; otherwise the model will not be able to generalize well. Data annotation is unfortunately still one of the most burdensome aspects of training models, but you can use model-assisted labeling to help speed up your annotation process: use your ~400 images to train a model, then use that model to pre-annotate more images, which you correct as needed.
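For reference, here's a minimal sketch of what model-assisted labeling could look like with the Ultralytics Python API; the weight and folder paths and the confidence threshold are placeholders you'd adapt to your own setup:

```python
from ultralytics import YOLO

# Placeholder path: weights from the model trained on your ~400 labeled plates.
model = YOLO("runs/detect/train/weights/best.pt")

# Run inference on a folder of new, unlabeled images and save the predictions
# as YOLO-format .txt label files (written under runs/detect/predict/labels/).
# Import these pre-annotations into your labeling tool and correct them by hand,
# which is usually much faster than annotating from scratch.
model.predict(
    source="unlabeled_plates/",  # placeholder folder of new plate images
    save_txt=True,               # write class id + normalized bbox per detection
    save_conf=True,              # include confidence so weak boxes are easy to review
    conf=0.5,                    # only keep reasonably confident predictions
)
```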
Thanks for your help.
I have imbalanced data as well, because each plate contains only one letter but seven numbers.
Some of my classes have about 30 instances (usually the alphabet characters) while others have around 200 (mostly the numbers).
But my question is: how can the model that was trained without a distinct validation set work so well on new data that the network hasn't seen during training?
Unfortunately @Nima_Chelongar it's not feasible to know the exact reason why this occurs. The most likely reason is that you have more training samples when the validation set is a mirror of the training set, since your overall sample count is small. Loss is calculated for both training and validation, and with so few overall samples this will have an impact on the model's overall performance.
My recommendation here is that if you need to understand the "why", you'll have to do a lot of work and analysis to reach that level of understanding; but if you only need a model that performs well, start collecting and annotating more data. I presume most people want a model that performs well and don't necessarily need to understand the "why" in this situation, as that investigation is highly time consuming and doesn't give you any actionable information for improving your model; the best path to better performance is to collect more data.
Thank you very much for the great information.
@Nima_Chelongar strange result. Best practice is to split your data so that the validation metrics correctly predict generalization capability on unseen data.
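For completeness, a rough sketch of a random train/val split for a YOLO-format dataset; the folder names and split ratio are assumptions, so adapt them to your own layout:

```python
import random
import shutil
from pathlib import Path

# Assumed layout: all 420 images in images/all/ with matching YOLO .txt
# labels in labels/all/. Adjust the paths and ratio to your dataset.
random.seed(0)
images = sorted(Path("images/all").glob("*.jpg"))
random.shuffle(images)

val_count = int(0.15 * len(images))  # ~60 of 420 images for validation
splits = {"val": images[:val_count], "train": images[val_count:]}

for split, files in splits.items():
    (Path("images") / split).mkdir(parents=True, exist_ok=True)
    (Path("labels") / split).mkdir(parents=True, exist_ok=True)
    for img in files:
        shutil.copy(img, Path("images") / split / img.name)
        label = Path("labels/all") / (img.stem + ".txt")
        if label.exists():  # background images may have no label file
            shutil.copy(label, Path("labels") / split / label.name)
```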