The issue is that the sheep are packed closely together, and there’s a lot of overlap between them, so trying to define a bounding box for each one has been a nightmare. I’m doing the labeling process in Roboflow.
I was thinking of simplifying things by using the sheep’s heads as the bounding box and then adding two additional key points outside the box to estimate the body position.
Does this approach make sense? Could having key points outside the bounding box cause any issues when training a YOLO Pose Estimation Model?
Do you think this would be a better approach for this case?
I’m a bit worried that, since I’ll later need to do quite accurate tracking without losing any sheep, a segmentation-based approach might make more errors than simply detecting the head and body orientation.
The goal of the project is to study the sheep’s entry times and the distribution of their speeds, so accuracy in detection is a key factor.
Practically speaking, given how crowded the sheep are, you’ll have to test it out to see how well it does or doesn’t work. If the view will always be a top-down view like you’ve shown, then you can probably use pose estimation. If you run into reliability trouble, you could try bounding-box annotations on just the heads and then use any number of keypoints localized to the head (eyes, ears, or just the center), but you’d lose track of a sheep if its head got covered, for example if one sheep jumped onto another (I’m just speculating here; I’m not an expert in sheep behavior).
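For reference, a head-box-plus-body-keypoints annotation would end up as an ordinary YOLO pose label: one line per animal with the class, the normalized box, then (x, y, visibility) triples for each keypoint. The values and keypoint count below are made up for illustration; your dataset config defines the real keypoint shape:

```python
# Hypothetical YOLO pose label: a head bounding box plus two body keypoints
# (say, mid-back and tail base). All coordinates are normalized to [0, 1].
# Format: class x_center y_center width height  kx1 ky1 v1  kx2 ky2 v2
label = "0 0.42 0.18 0.06 0.05 0.44 0.30 2 0.46 0.42 2"

fields = label.split()
cls = int(fields[0])
box = [float(v) for v in fields[1:5]]            # x_center, y_center, w, h
flat = [float(v) for v in fields[5:]]            # flat (x, y, visibility) values
keypoints = [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]

print(cls, box, keypoints)
```

Note that the two body keypoints here sit well below the head box, which is exactly the situation you’re asking about.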
You could quickly try some segmentation using models like SAM2 or FastSAM to help you generate segmentation data to train with. The useful thing about having those segments is that they can be used to get bounding boxes as well.
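If you do generate masks that way, deriving a YOLO-style box from each binary instance mask is straightforward. A minimal NumPy sketch (assuming one boolean mask per sheep; the helper name is mine, not from any library):

```python
import numpy as np

def mask_to_yolo_box(mask: np.ndarray):
    """Convert a boolean instance mask (H, W) into a normalized
    YOLO box (x_center, y_center, width, height), or None if empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None                       # empty mask: nothing to box
    h, w = mask.shape
    x0, x1 = xs.min(), xs.max() + 1       # half-open pixel extents
    y0, y1 = ys.min(), ys.max() + 1
    return (float((x0 + x1) / 2 / w), float((y0 + y1) / 2 / h),
            float((x1 - x0) / w), float((y1 - y0) / h))

# Tiny example: a 4x4 blob inside a 10x10 frame
mask = np.zeros((10, 10), dtype=bool)
mask[2:6, 3:7] = True
print(mask_to_yolo_box(mask))  # (0.5, 0.4, 0.4, 0.4)
```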
Regarding your question about keypoints outside the bounding box, this is supported in YOLOv11 pose estimation. The model predicts keypoints and can handle situations where keypoints fall outside the detected bounding box.
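So an outside-the-box keypoint is legal, not an annotation error. If you still want to flag which keypoints fall outside their box while reviewing labels, the check is trivial geometry (values below are illustrative):

```python
def keypoint_outside_box(kx, ky, box):
    """Return True if a normalized keypoint (kx, ky) lies outside a
    normalized YOLO box given as (x_center, y_center, width, height)."""
    xc, yc, w, h = box
    return not (xc - w / 2 <= kx <= xc + w / 2 and
                yc - h / 2 <= ky <= yc + h / 2)

head_box = (0.42, 0.18, 0.06, 0.05)                # head-only bounding box
print(keypoint_outside_box(0.44, 0.30, head_box))  # body keypoint: True
print(keypoint_outside_box(0.42, 0.18, head_box))  # box center: False
```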
For your project involving tracking sheep, pose estimation could indeed be a viable approach, especially if you focus on detecting heads and adding keypoints for body orientation. The keypoint loss is computed in the calculate_keypoints_loss function, which you can read more about here. You can also examine the Pose forward pass here.
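To give a feel for what that loss does (this is a much-simplified sketch of the OKS-style idea, not the actual Ultralytics implementation): the distance between predicted and target keypoints is normalized by object size and only visible keypoints contribute, which is why keypoints outside the box are handled the same as any others:

```python
import numpy as np

def simple_keypoint_loss(pred, target, visible, area, sigma=0.05):
    """OKS-flavored keypoint loss sketch: area-normalized squared distance
    between predicted and target keypoints, masked by visibility."""
    d2 = ((pred - target) ** 2).sum(axis=-1)        # per-keypoint squared distance
    e = d2 / (2 * (area * sigma) ** 2 + 1e-9)       # normalize by object scale
    loss = (1 - np.exp(-e)) * visible               # invisible keypoints contribute 0
    return loss.sum() / (visible.sum() + 1e-9)

pred   = np.array([[0.44, 0.30], [0.46, 0.42]])     # two predicted keypoints
target = np.array([[0.44, 0.30], [0.50, 0.40]])     # ground-truth keypoints
vis    = np.array([1.0, 1.0])                       # both labeled visible
print(simple_keypoint_loss(pred, target, vis, area=0.1))
```

Nothing in this formulation refers to the bounding box at all, which is the intuition behind why out-of-box keypoints train fine.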
Ultimately, the best approach depends on your specific data and goals. Since accuracy is crucial for your project, I recommend experimenting with both pose estimation and segmentation to see which performs better in your real-world conditions.