I’m working on training a YOLO11 model for object detection on a custom dataset of 11 classes. I have 334 annotated images, exported from CVAT in the “YOLOv8 Detection 1.0” label format. However, my dataset is imbalanced.
Here’s the training code I used:
from google.colab import drive
drive.mount('/content/drive')  # mount Google Drive to access the dataset

dataset_path = '/content/drive/MyDrive/Files/dataset'

from ultralytics import YOLO
model = YOLO("yolo11n.pt")  # start from the pretrained YOLO11 nano weights

# Train for 200 epochs at 640 px image size
model.train(data=f"{dataset_path}/data.yaml", epochs=200, imgsz=640)

# Evaluate on the validation and test splits
val_results = model.val(data=f"{dataset_path}/data.yaml")
test_results = model.val(data=f"{dataset_path}/data.yaml", split='test')

metrics = model.val()
metrics.box.map    # mAP50-95
metrics.box.map50  # mAP50
metrics.box.map75  # mAP75
metrics.box.maps   # per-class mAP50-95 values
Unfortunately, the current model performance isn’t meeting my expectations (see attached results). I’m hoping to improve the accuracy and I’d appreciate any guidance from the community.
Specifically, I have the following questions:
How many more annotated images are typically needed for good transfer learning results?
My classes mostly match the YOLO classes with the same IDs. Why is my trained model (best.pt) not detecting objects (cars, buses, etc.) that the pre-trained model (yolo11n.pt) does?
Is there a way to get a precise accuracy measurement in percentage?
There is no absolute number of images that anyone can tell you with certainty is required for a model to train well. Essentially, you have to keep adding annotated images until the model performs at the level you expect or require.
Unless you include all the data from the COCO dataset as part of your training dataset, your model won’t retain the pretrained performance on those classes. The model learns from the data you provide during training, and anything it learned before is lost as the weights are updated. You might be able to retain some of that performance by using freeze=N, where N is the number of layers to freeze during training, but without including the COCO data your model still might not perform as well. You’ll need to add more data or use multiple models.
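As a minimal sketch of how the freeze argument can be passed (the value 10 is only illustrative, not a recommendation):

from ultralytics import YOLO

model = YOLO("yolo11n.pt")
# Freeze the first 10 layers so their pretrained weights are not updated;
# only the remaining layers are trained on the custom dataset.
model.train(data="data.yaml", epochs=200, imgsz=640, freeze=10)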
It depends on what you mean by “accuracy”, as there are multiple performance metrics. They’re reported from the final validation after training, and you can read more about them in the docs.
Add more and diverse data. There’s also a training argument called patience, which defaults to 100; if you’re seeing overfitting during training, you can lower it to something like patience=50 so training stops earlier.
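A rough sketch of what that looks like (50 is only an illustrative value):

from ultralytics import YOLO

model = YOLO("yolo11n.pt")
# Stop training early if validation metrics haven't improved for 50 epochs
model.train(data="data.yaml", epochs=200, imgsz=640, patience=50)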
It’s great to see your proactive approach to improving your model’s performance!
Regarding the number of annotated images, it is difficult to provide an exact number, as this varies greatly depending on the complexity of the objects and the variability within your dataset. However, datasets in the thousands of images per class are common for good results.
If best.pt is not detecting objects that yolo11n.pt does, ensure your dataset labels correctly correspond to the pretrained model’s classes. You can cross-check using the pretrained weights as a starting point. See YOLO Common Issues - Ultralytics YOLO Docs.
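As a quick sanity check, you could compare the class names declared in your data.yaml against the class map baked into the pretrained weights. A small sketch (the data.yaml path is illustrative):

import yaml
from ultralytics import YOLO

# Class names from the pretrained COCO weights, e.g. {0: 'person', 1: 'bicycle', ...}
pretrained_names = YOLO("yolo11n.pt").names

# Class names declared in your dataset's data.yaml
with open("data.yaml") as f:
    dataset_names = yaml.safe_load(f)["names"]

print("pretrained:", pretrained_names)
print("dataset:   ", dataset_names)
# If the IDs and names don't line up, the model is being trained against
# different class indices than you expect.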
For a precise accuracy measurement, the mAP (mean Average Precision) metrics already provide a percentage-based value reflecting the model’s accuracy. These are standard metrics for object detection models. Val - Ultralytics YOLO Docs
To address overfitting, since you mentioned your dataset is unbalanced, regularly assess the class distribution. A significant imbalance could cause the model to favor the majority class.
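A short sketch for inspecting the class distribution of YOLO-format labels (the labels/train path is an assumption; point it at your own label directory):

from collections import Counter
from pathlib import Path

label_dir = Path("dataset/labels/train")  # illustrative path
counts = Counter()

# Each line of a YOLO label file starts with the integer class ID
for label_file in label_dir.glob("*.txt"):
    for line in label_file.read_text().splitlines():
        if line.strip():
            counts[int(line.split()[0])] += 1

for cls_id, n in sorted(counts.items()):
    print(f"class {cls_id}: {n} boxes")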
I hope these brief points help guide you. Let the Ultralytics team and YOLO community know if you have more questions.
It’s great to see your proactive approach to improving your model’s performance. The Ultralytics team and the YOLO community are here to support you.
Regarding your questions:
The number of annotated images needed for good transfer learning results can vary. It is important to regularly assess the distribution of classes within your dataset. If there’s a class imbalance, there’s a risk that the model will develop a bias towards the more prevalent class. See Class Distribution under “Key Considerations for Effective Model Training” for more information.
If your trained model (best.pt) isn’t detecting objects that the pre-trained model (yolo11n.pt) does, even though the classes match, it might be due to several factors, including the need for further training or adjustments, dataset quality, or model convergence. You can find more details on these factors in the YOLO Common Issues guide.
For a precise accuracy measurement, you can use the validation metrics provided by the val mode. Metrics like mAP50-95, mAP50, and mAP75 give a comprehensive evaluation of your model’s accuracy. Val - Ultralytics YOLO Docs
To prevent overfitting, ensure your model reaches a satisfactory level of convergence. This might necessitate a longer training duration, with more epochs, compared to when you’re fine-tuning an existing model. More details can be found at the Model Convergence section of the YOLO Common Issues guide.
I hope this helps! Let the community know if you have further questions.
I have already used val for model evaluation, but I need to calculate the accuracy as a percentage from the confusion matrix results. Is there any short code to do that?
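A minimal sketch, assuming your Ultralytics version attaches the raw confusion matrix to the metrics object returned by model.val() (the weights and data paths are illustrative):

import numpy as np
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # illustrative path
metrics = model.val(data="data.yaml")              # illustrative path

# Assumption: confusion_matrix.matrix is an (nc + 1) x (nc + 1) NumPy array,
# where the extra row/column accounts for the background (missed/extra boxes).
cm = metrics.confusion_matrix.matrix

# Overall "accuracy": correct class assignments (diagonal) over all entries
total = cm.sum()
accuracy_pct = 100.0 * np.trace(cm) / total if total > 0 else 0.0
print(f"Overall accuracy: {accuracy_pct:.2f}%")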
However, mAP is considered a more appropriate metric for measuring the performance of an object detection model. As a reminder, precision (P) and recall (R) are defined as:
P = \frac{\text{correct predictions}_{cls}}{\text{all predictions}_{cls}} = \frac{TP}{TP + FP}
R = \frac{\text{correct predictions}_{cls}}{\text{all true labels}_{cls}} = \frac{TP}{TP + FN}
Precision P is the ratio of correct predictions to all predictions the model makes, i.e. how often the model is right when it does make a prediction. Recall R is the ratio of correct predictions to all “real” labels, i.e. how many of the true objects the model finds.
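As a small worked example with made-up numbers: with TP = 8, FP = 2 and FN = 4, precision is P = \frac{8}{8+2} = 0.80 and recall is R = \frac{8}{8+4} \approx 0.67.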
The average precision (AP) for a class is the area under its precision-recall curve P(R)_{cls} on the interval [0, 1]. This is also known as the “precision-recall area under the curve”, or \text{PR }AUC.
AP = \text{PR }AUC = \intop_{0}^{1} P(r)_{cls} dr
Taking the mean of the average precision AP over all n classes gives the mAP result.
mAP = \frac{1}{n}\sum_{cls=1}^{n}AP_{cls}
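Tying this back to the code in the original post, metrics.box.maps holds the per-class AP values, so the same mean can be reproduced directly (a sketch, with illustrative paths):

import numpy as np
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # illustrative path
metrics = model.val(data="data.yaml")              # illustrative path

per_class_ap = metrics.box.maps           # mAP50-95 for each class
map_50_95 = float(np.mean(per_class_ap))  # mean over classes
print(f"mAP50-95: {map_50_95:.3f} ({map_50_95 * 100:.1f}%)")  # compare with metrics.box.map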
If it’s still not clear, it can help to think of mAP as measuring how often the model predicts correctly without missing objects that should be detected. A model that predicts too eagerly will misclassify objects or fire on things it shouldn’t (or on no object at all). This is why the mAP score is a concise way to summarize an object detection model’s performance across all classes in a single value.
If you want to evaluate how well the model covers the “true labels” and don’t mind that some of its predictions will be incorrect, look at recall R. If it’s more important that the model is correct when it does make a prediction, and less important that it misses some detections, look at precision P.
The default YOLO11n model trained on the COCO dataset has an mAP^{50-95}=0.395 and is quite good at detecting several classes. Ultimately, it’s up to you to determine if the score is sufficient for your use case. These metrics are indicators of performance, but they’re not going to provide an answer to “is this good enough?” because as an engineer or data scientist, it’s your job to answer that question.