Understanding Keypoint Decode

In keypoint decode, we apply:
kpts_out * 2 + anchor - 0.5

I assume -0.5 is to center to the middle of a pixel. However, I do not understand why in all decoding functions of keypoints, a factor *2 is applied? Seems like an unnecessary operation?

Hello!

Great question! The factor of *2 in the keypoint decoding process is indeed related to scaling the keypoint coordinates. This operation helps adjust the coordinates from a normalized range to the actual pixel space of the image.

The -0.5 adjustment centers the keypoint to the middle of a pixel, as you correctly assumed. This is a common practice in computer vision to ensure more precise localization.

For a deeper dive into how keypoints are handled, you might find the Ultralytics Pose Estimation documentation helpful. It provides insights into the model architecture and processing steps.

If you have more questions or need further clarification, feel free to ask! :blush:

Hi

Thanks for the quick reply!
If you’re open to it, I’d like to dive deeper into the following:

Since the input values are outputs from an nn.Conv2d layer without further activation or processing, why couldn’t the *2 factor be learned by the network, avoiding an additional operation? I don’t fully understand why the keypoints require a *2 factor for denormalization before decoding.

Any insights are welcome, and feel free to get technical!

@Daan_Seuntjens I don’t know exactly why the points are scaled by 2, but suspect it’s similar to what was done in YOLOv5 for the bounding box centerpoint prediction, which was to help address a few issues.

The point offset range is adjusted from (0, 1) to (-0.5, 1.5). Therefore, the offset can easily reach 0 or 1.

So instead of being bound to the range [0, 1], the values are shifted and scaled to [-0.5, 1.5], which was supposed to help reduce grid sensitivity for predictions. It’s not directly answering your question, but you can see some of the discussion around it in the GitHub issue "Want to figure out critical algorithm of Detect layer" (Issue #471, ultralytics/yolov5), which might give you a bit more context about the formula and how it was decided on.
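To make the range argument concrete, here is a minimal sketch of the sigmoid-based offset described above. The function names are hypothetical and not taken from the YOLOv5 code; the sketch only assumes that the raw output passes through a sigmoid and is then mapped with *2 - 0.5.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def plain_offset(logit: float) -> float:
    # Offset bounded to the open interval (0, 1): 0 and 1 are only reached
    # in the limit of infinitely large logits.
    return sigmoid(logit)

def scaled_offset(logit: float) -> float:
    # Offset stretched to (-0.5, 1.5): 0 and 1 are now reached at finite logits.
    return 2.0 * sigmoid(logit) - 0.5

# The scaled offset hits exactly 0 at logit = -ln(3) and exactly 1 at logit = ln(3),
# because sigmoid(-ln 3) = 0.25 and sigmoid(ln 3) = 0.75.
for logit in (-6.0, -math.log(3.0), 0.0, math.log(3.0), 6.0):
    print(f"logit={logit:+.4f}  plain={plain_offset(logit):.4f}  scaled={scaled_offset(logit):.4f}")
```

With the plain sigmoid, a point sitting right on a grid-cell border would require a very large logit; with the stretched range, a moderate logit is enough.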

This appears correct. YOLOv5 normalizes the keypoint outputs to the range 0–1 using a sigmoid, so applying *2 - 0.5 makes sense. However, in YOLOv8, the keypoint model output is not normalized. This seems like a remnant of YOLOv5’s legacy code. Therefore, omitting the *2 - 0.5 step in YOLOv8 could save some computational resources, albeit insignificantly.
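The redundancy argument can be sketched as follows, assuming the raw keypoint value is a purely linear function of the input features (a convolution with no activation afterwards). The helper names and numbers are made up for illustration; this is not the Ultralytics implementation, and it only uses the formula quoted at the top of the thread.

```python
# Hypothetical stand-in for one output channel of a 1x1 convolution at a single
# grid location: a weighted sum of input features plus a bias, no activation.
def conv1x1(weights, bias, features):
    return sum(w * f for w, f in zip(weights, features)) + bias

def decode_with_factor(raw, anchor):
    # Formula from the question: kpts_out * 2 + anchor - 0.5
    return raw * 2.0 + anchor - 0.5

def decode_without_factor(raw, anchor):
    # Same decode with the constant factor dropped.
    return raw + anchor - 0.5

# Made-up values, for illustration only.
weights = [0.25, -0.5, 0.125]
bias = 0.1
features = [1.0, 2.0, 4.0]
anchor = 3.5

original = decode_with_factor(conv1x1(weights, bias, features), anchor)

# Fold the factor into the layer: doubling the weights and the bias doubles the output.
folded_weights = [2.0 * w for w in weights]
folded_bias = 2.0 * bias
folded = decode_without_factor(conv1x1(folded_weights, folded_bias, features), anchor)

print(original, folded)
```

Both decodes give the same value, which is why a network trained without the factor could in principle learn the same mapping. Two caveats: weights from a model trained with the factor would need this rescaling before the factor is dropped, and the factor changes the gradient scale during training, so training behaviour is not guaranteed to be identical.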

@Daan_Seuntjens you may be correct that it could be removed, and if you have the time/interest, it would be good to test if removing it has any impact on model performance. If it doesn’t, you could open up a PR and as long as all the tests pass, I suspect the Team would be happy to merge it since removing computational or redundant steps is greatly valued.

From a quick look at the code and the formula, I’m guessing that the strides and anchors in this section of the decode_keypoints() method could be part of why the keypoints are multiplied by the constant 2.0. The easiest way to know for certain would be to change it to 1.0, or remove it completely, and see whether prediction, validation, and/or training performance changes.

I will do some experiments and open a PR if relevant. It may be a week or two before this is finished. Many thanks for the insights!
