Hand skeleton tracking accuracy

I’m trying to train yolo11n-pose to detect hand skeletons.
I’m using 900 manually annotated images (640x480) of hand joints.
I’m getting a > 0.98 pose score on validation.

When I look at the resulting skeleton, it does not seem to track tiny individual finger motions (when the rest of the joints are fixed), so the inferred fingers look like some kind of average. I admit that the training data may not contain a wide enough distribution of tiny finger motions. However, are YOLO models sensitive to such tiny motions, given that the resolution is reduced along the layers?

Yes, what you’re seeing is common, and it’s usually more of a data and resolution issue than a hard “YOLO can’t do it” limit.

If the hand occupies only a small part of a 640x480 frame, a tiny finger motion may span just a few pixels, and yolo11n-pose can regress toward an average joint location. A very high validation score can still happen because the pose metric (OKS-based in Ultralytics) scores keypoint error relative to the size of the object, so a fingertip that is a few pixels off still counts as a close match even when the motion looks visually too smooth.
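To put rough numbers on that, here is a small sketch of a COCO-style OKS score for a single keypoint, plus how large a pixel offset is in units of feature-map cells. The uniform sigma of 1/21 (assuming 21 hand keypoints), the 200x150 px hand box, and the strides 8/16/32 are assumptions for illustration; the exact scale factor in the validator may differ, so treat the output as an order-of-magnitude guide only.

```python
import math


def oks_for_offset(offset_px, box_w_px, box_h_px, num_keypoints=21):
    """COCO-style similarity for one keypoint that is off by offset_px pixels."""
    sigma = 1.0 / num_keypoints   # assumption: same sigma for every keypoint
    k = 2.0 * sigma
    area = box_w_px * box_h_px    # assumption: box area used as the object scale
    return math.exp(-(offset_px ** 2) / (2.0 * area * k ** 2))


def offset_in_cells(offset_px, stride):
    """Express a pixel offset in units of feature-map cells at a given stride."""
    return offset_px / stride


if __name__ == "__main__":
    # Hypothetical hand box of 200x150 px in the network input.
    for offset in (1, 2, 4, 8):
        score = oks_for_offset(offset, 200, 150)
        cells = [offset_in_cells(offset, s) for s in (8, 16, 32)]
        print(f"offset={offset}px  oks={score:.3f}  cells@8/16/32={cells}")
```

Under these assumptions a 4 px fingertip error still scores roughly 0.97, far above a 0.5 match threshold, and that same 4 px is only half a cell at stride 8. So the metric can saturate long before fine finger motion is resolved.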

For this kind of problem, the biggest wins are usually: crop tighter around the hand, train at a larger imgsz like 960 or 1280, use a larger pose model, and add more examples of subtle finger articulation. For new runs I’d recommend switching to Ultralytics YOLO26 pose rather than YOLO11, since it’s the current recommended model and generally better for pose work. The built-in Hand Keypoints dataset is also a useful reference for label layout and scale.

Something like this is worth trying:

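from ultralytics import YOLO

# Start from a pretrained pose checkpoint one size up from the nano model
model = YOLO("yolo26s-pose.pt")
# Train at a higher input resolution so small finger motions span more pixels
model.train(data="your_hand.yaml", imgsz=960, epochs=200, batch=16)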
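For the "crop tighter" suggestion, a minimal sketch of the bookkeeping might look like the following: expand the hand box by a margin, clamp it to the image, and re-express the keypoints as normalized coordinates inside the crop (the form YOLO pose labels use). The function names and the 25% margin are hypothetical; the actual pixel cropping would be done with whatever image library you already use.

```python
def crop_window(box_xyxy, img_w, img_h, margin=0.25):
    """Expand a pixel box (x0, y0, x1, y1) by a margin and clamp it to the image."""
    x0, y0, x1, y1 = box_xyxy
    pad_x = (x1 - x0) * margin
    pad_y = (y1 - y0) * margin
    return (
        max(0.0, x0 - pad_x),
        max(0.0, y0 - pad_y),
        min(float(img_w), x1 + pad_x),
        min(float(img_h), y1 + pad_y),
    )


def remap_keypoints(keypoints_xy, window):
    """Convert pixel keypoints in the full image to normalized coords in the crop."""
    wx0, wy0, wx1, wy1 = window
    crop_w, crop_h = wx1 - wx0, wy1 - wy0
    return [((x - wx0) / crop_w, (y - wy0) / crop_h) for x, y in keypoints_xy]


if __name__ == "__main__":
    # Hypothetical 120x120 px hand box in a 640x480 frame.
    window = crop_window((260, 180, 380, 300), img_w=640, img_h=480)
    print(window)
    print(remap_keypoints([(320, 240), (270, 190)], window))
```

With this example the crop is 180x180 px, so resizing it to a 640 px input magnifies the hand by roughly 3.5x compared with feeding the full frame. The trade-off is that inference then needs a first stage (a hand detector or a tracker) to produce the crop before the pose model runs.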

If you want, I can help you sanity-check whether the bottleneck is hand size in-frame, label noise, or model capacity.
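A quick first check on hand size in-frame could be done straight from the label files. The sketch below assumes the usual YOLO pose label layout (class, then normalized box center x, center y, width, height, then keypoints) and approximates the resize by scaling the long image side to imgsz; the function name and the example label line are made up for illustration.

```python
def hand_size_stats(label_line, img_w=640, img_h=480, imgsz=640):
    """Estimate how much of the frame a labeled hand covers and its size in pixels."""
    fields = label_line.split()
    # Fields 3 and 4 are the normalized box width and height.
    w_norm, h_norm = float(fields[3]), float(fields[4])
    # Approximation: aspect-preserving resize of the long side to imgsz.
    scale = imgsz / max(img_w, img_h)
    return {
        "area_fraction": w_norm * h_norm,
        "box_w_px": w_norm * img_w * scale,
        "box_h_px": h_norm * img_h * scale,
    }


if __name__ == "__main__":
    # Hypothetical label line with only the first keypoint shown.
    line = "0 0.5 0.5 0.25 0.3 0.48 0.41 2"
    for size in (640, 960, 1280):
        print(size, hand_size_stats(line, imgsz=size))
```

If most hands come out at well under 200 px on their longer side at the training resolution, hand size is a likely bottleneck and cropping or a larger imgsz should help before anything else.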

What you are saying makes sense. My hand probably does occupy less than 20% of the image.
I’ll try your suggestions.

Thank you very much.