Help with improving YOLOv11's performance on a human detection task

Hi everyone, I’m currently working on a project that detects humans in a CCTV input stream. I used the pre-trained YOLOv11 model from the official Ultralytics page to perform the task.

Upon testing, the model occasionally mistook canines for humans with a pretty high confidence score.

Some of the methods I have tried include:

  • Testing other versions of YOLO (v5, v8)
  • Fine-tuning YOLOv11 on person-only datasets, sourced from:
    • Roboflow datasets (~7,000 images of humans in different environment settings)
    • A custom dataset (~4,000 images from 50 different videos): for this dataset, I crawled some CCTV livestreams, etc., cropped the frames, and manually labeled each picture. I only labeled people who appear full-body, are large enough, and are mostly in a standing posture.

→ Neither method showed any improvement, and in some cases made the model worse. With the fine-tuning method in particular, the model produced false detections in cases it previously handled correctly, and failed to detect humans it used to detect.

Looking at the results, I also have some hypotheses; it would be great if anyone could confirm any of these:

  • I suspect that by fine-tuning with person-only datasets, I’m lowering the probabilities of other classes and guiding the model to classify everything as human; thus, the model detects more dogs as humans.
  • Besides, my strict labeling rules may restrict the model’s ability to detect humans in varied forms.

I would really appreciate it if someone could suggest guidance to overcome these problems. If it is data-related, please be as specific as possible (the data’s properties, how I should label the data, etc.), because I’m really new to computer vision.

Once again, thank you.

The pretrained COCO model does quite well at detecting the person class in most cases because its training data contains over 650,000 person instances. When you train a YOLO model (what you call fine-tuning, but I think that term gives the wrong idea, so I just call it “training”), the information in the classification head of the model is updated based on the data provided at training time.

This means that when you train using images with a lower instance count, and possibly lower variety in orientations, objects, environments, etc., the classification performance will actually decrease. If you wish to build on top of the performance of the original pretrained model, you would need to add all images with the person class from the COCO dataset into your training data.
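If it helps, here is a minimal stdlib-only sketch of that merging step, assuming your COCO annotations are in the standard `instances_*.json` layout (in COCO, the person class has `category_id` 1). The function name and any file paths are my own placeholders, not part of any official tooling:

```python
import json


def person_only_subset(coco, person_category_id=1):
    """Given a loaded COCO annotation dict, keep only person annotations
    and the images that contain at least one person instance."""
    anns = [a for a in coco["annotations"]
            if a["category_id"] == person_category_id]
    keep_ids = {a["image_id"] for a in anns}
    return {
        "images": [img for img in coco["images"] if img["id"] in keep_ids],
        "annotations": anns,
        "categories": [c for c in coco["categories"]
                       if c["id"] == person_category_id],
    }


# Typical use (path is an assumption):
#   with open("instances_train2017.json") as f:
#       subset = person_only_subset(json.load(f))
#   with open("instances_person_only.json", "w") as f:
#       json.dump(subset, f)
```

You would then convert the resulting subset to YOLO label format (Ultralytics provides converters for COCO) and merge those images into your training set.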

Additionally, when you source data from online sources, you should analyze the quality of those annotations. It’s very possible that the annotations are low quality and could harm the model’s performance. You should also check to see if there are duplicate images or augmented images in the downloaded dataset, as these could lead to early overfitting.
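As a quick first pass on the duplicate check, something like this stdlib-only sketch catches byte-identical copies (near-duplicates from augmentation need a perceptual comparison instead); the function name is my own:

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(image_dir):
    """Group files by the MD5 of their raw bytes; any group with more
    than one path is a set of byte-identical duplicate images."""
    groups = defaultdict(list)
    for path in sorted(Path(image_dir).iterdir()):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```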


Thanks for the reply; your insight about including the initial COCO dataset is really helpful, and I will definitely try that. I would like to ask a few more questions related to the topic:

  1. Suppose I want to train YOLOv11 to detect a brand-new class; would including the COCO dataset still be needed?
  2. For the additional data, would the person-only dataset I have collected be enough to overcome the problem of falsely detecting canines, or should I include images of dogs, cats, etc., too? And if so, what would be the ideal ratio among the classes? If most images in the dataset contain only one class (e.g., an image of dogs only), will they still contribute to the training?
  3. I intended to train on my custom dataset only (which does not include data from sources like Roboflow). This dataset includes ~4,000 images and ~8,000 instances of the person class, with frames cropped from 50 CCTV videos at a 1 fps rate. However, the videos are not 640x640 px; will this affect the training result? Moreover, despite having lots of images, many of them share the same background setting (surroundings, lighting, etc.) and many contain the same person instances; the videos are from different camera angles. What do you think of this dataset?
  4. I first used the COCO-pretrained YOLOv11 to label all the images and then manually modified the labels. I omitted all person labels that are too close to/far from the camera, have missing body parts, or have “abnormal” postures (sitting, bending, etc.). Is this a good labeling practice?
  5. Can you share any detailed or reliable sources I can read about preparing datasets, labeling, and evaluating the “cleanliness” of a dataset? I am very new to training on a custom dataset and preparing my own data, so I’m kind of at a loss here :joy:

If you find my questions too much/nonsense, please feel free to skip them :joy:, as your previous reply was already more than helpful to me.

Again, thank you so much for your reply. Have a good day!

Yes. This is probably the most common question I’ve heard. Consider any class information from prior training “lost” once you train on new data.

There is no “ideal ratio”, only what works for your application. You will have to collect and label data until the model performs as well as you need it to. Remember that no model is perfect, and there will still be some likelihood of misclassification. Including animals will be helpful, as you provide the model with examples of what a person is not.
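To see what class balance you actually have, a small stdlib sketch that tallies instances per class across YOLO-format label files (one `class_id cx cy w h` line per instance, normalized coordinates) might look like this; the directory layout and function name are assumptions:

```python
from collections import Counter
from pathlib import Path


def class_distribution(labels_dir):
    """Count instances per class id across YOLO-format .txt label files.
    Each non-empty line starts with an integer class id."""
    counts = Counter()
    for txt in Path(labels_dir).glob("*.txt"):
        for line in txt.read_text().splitlines():
            if line.strip():
                counts[int(line.split()[0])] += 1
    return counts
```

Running this before and after adding new data makes it easy to see whether one class is drowning out the others.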

I think having much more variety in your image dataset can help. It’s likely the lack of variety in the data that is leading the model to overfit and stop training earlier than it should. Supplementing your dataset with additional images will be a big help. It would also be good to remove any duplicate or near-duplicate images from your dataset, as they won’t add much value, only training time.

They’re still a person, so they should still be labeled as such. If you can tell it’s a person, that label should be there. Otherwise it will “confuse” the model, as it has detected something that resembles a person, but the information you provided says it is not.

FiftyOne (see the FiftyOne 1.3.0 documentation) is a great tool for evaluating datasets and finding errors. There are other libraries you can find with a simple web search, but I would recommend FiftyOne.
