The issue is that the sheep are packed closely together and there’s a lot of overlap between them, so trying to define a bounding box for each one has been a nightmare. I’m doing the labeling in Roboflow.
I was thinking of simplifying things by using each sheep’s head as the bounding box and then adding two additional keypoints outside the box to estimate the body position.
Does this approach make sense? Could having keypoints outside the bounding box cause any issues when training a YOLO pose estimation model?
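For reference, here’s roughly what one of my labels would look like under that scheme (just a sketch assuming the standard Ultralytics pose label format, with made-up coordinates):

```python
# One label line per sheep (Ultralytics YOLO pose format, as I understand it):
# class, box center x/y, box width/height, then (x, y, visibility) per keypoint.
# Everything is normalized to the IMAGE, not the box, so the two body
# keypoints below can legitimately sit outside the small head box.
#
#             cls  xc    yc    w     h     kx1   ky1   v1  kx2   ky2   v2
label_line = "0    0.42  0.31  0.06  0.05  0.45  0.40  2   0.48  0.47  2"
```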
Do you think this would be a better approach for this case?
I’m a bit worried that, since I’ll later need quite accurate tracking without losing any sheep, a segmentation-based approach might make more errors than simply detecting the head and body orientation.
The goal of the project is to study the sheep’s entry times and the distribution of their speeds, so accuracy in detection is a key factor.
Given the crowding of the sheep, it’s hard to say for certain; practically speaking, you’ll have to test it out to see how well it does or doesn’t work. If the view will always be a top-down view like you’ve shown, then you can probably use pose estimation. If you run into trouble with reliability, you could try bounding box annotations on the heads and then use any number of keypoints localized to the head (eyes, ears, or just the center), but you’d lose track if a head got covered, like if one sheep jumped onto another (I’m just speculating here, I’m not an expert in sheep behavior).
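If you do try the head-box + keypoints route, getting a first model trained is cheap enough to test. A minimal sketch with the Ultralytics API (the dataset config name, keypoint count, and image filename are placeholders for your setup):

```python
from ultralytics import YOLO

# Fine-tune a pretrained pose checkpoint on head boxes + body keypoints.
# "sheep-pose.yaml" is a hypothetical dataset config; it needs a kpt_shape
# matching your annotations, e.g. [2, 3] for two (x, y, visibility) keypoints.
model = YOLO("yolo11n-pose.pt")
model.train(data="sheep-pose.yaml", epochs=100, imgsz=640)

# Sanity-check on a held-out top-down frame (hypothetical filename)
results = model("pen_frame.jpg")
print(results[0].keypoints.xy)  # per-sheep keypoint coordinates in pixels
```

If the keypoints come back stable across frames, a tracker on top of this should get you toward the entry times and speed distributions you’re after.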
You could quickly try segmentation with models like SAM2 or FastSAM to help you generate segmentation data to train with. The useful thing about having those segments is that they can be used to derive bounding boxes as well.
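Something like this would get you from raw frames to masks and boxes quickly (a rough sketch using the Ultralytics SAM2 wrapper; the checkpoint and frame filenames are assumptions, swap in whatever you actually have):

```python
from ultralytics import SAM

# Auto-segment a frame with SAM2 (checkpoint name is whatever weights you
# pull down), then derive an axis-aligned box from each mask polygon.
model = SAM("sam2_b.pt")
results = model("pen_frame.jpg")  # hypothetical frame from your footage

for poly in results[0].masks.xy:  # one (N, 2) polygon per segmented instance
    xs, ys = poly[:, 0], poly[:, 1]
    print(f"box: ({xs.min():.0f}, {ys.min():.0f}) to ({xs.max():.0f}, {ys.max():.0f})")
```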