The need for specific bbox formatting

I have a general question about object detection in computer vision. As you all know, detection requires images and associated text files for labels. If you train a model from scratch (no transfer learning involved) that you construct yourself, does the format of the labels matter? In other words, does it matter whether you use x1/x2/y1/y2, x1/y1/x2/y2, center/width/height, or any other format? My understanding is that as long as the format is consistent and the coordinates reflect the object's location, the format used to label objects in an image will not matter. Please clarify. Thank you so much,

Ralf

There is a specific YOLO format; you can read more about it in the docs.

The annotations in the text file must use bounding box coordinates normalized to the image dimensions:

class x_center y_center width height
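As a rough illustration of that line format, here is a minimal sketch that turns a pixel-space corner box into a label line. The function name `to_yolo_line` and the example numbers are made up for this post and are not part of any YOLO library; the six-decimal precision is also just an assumption.

```python
def to_yolo_line(class_id, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel-space corner box (x1, y1, x2, y2) to a YOLO-style label line.

    Hypothetical helper for illustration only, not a library function.
    """
    # Box center, divided by the image size so the value lies in [0, 1]
    x_center = (x1 + x2) / 2 / img_w
    y_center = (y1 + y2) / 2 / img_h
    # Box size, also as a fraction of the image size
    width = (x2 - x1) / img_w
    height = (y2 - y1) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# A 200x200 pixel box in a 640x480 image, class 0
print(to_yolo_line(0, 100, 200, 300, 400, 640, 480))
# prints: 0 0.312500 0.625000 0.312500 0.416667
```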

Right, for YOLO. What I meant was for any other model. It seems to me that if you create a model that accepts some consistent coordinate representation of an object’s location, the model will learn what the object is after proper training. Am I correct? Thx,

Ralf

I understand. The bounding box representation varies by model, and which one is needed or works best can depend on the model's structure. In all likelihood, the coordinates will need to be normalized before they go into the model for training.
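To make the point about representations concrete, here is a small sketch showing that corner and center/size formats carry the same information, so one can be converted to the other and back without loss. The function names and the sample box are assumptions chosen for this example; whether a given model trains better on one format is a separate question this does not answer.

```python
def xyxy_to_cxcywh(box):
    """Corners (x1, y1, x2, y2) -> center and size (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def cxcywh_to_xyxy(box):
    """Center and size (cx, cy, w, h) -> corners (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def normalize_xyxy(box, img_w, img_h):
    """Scale pixel corner coordinates to the [0, 1] range."""
    x1, y1, x2, y2 = box
    return (x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h)


box = (100.0, 200.0, 300.0, 400.0)  # pixel corners, hypothetical example
center_box = xyxy_to_cxcywh(box)    # (200.0, 300.0, 200.0, 200.0)
round_trip = cxcywh_to_xyxy(center_box)
assert round_trip == box            # same box, different encoding

normalized = normalize_xyxy(box, img_w=640, img_h=480)
```

Whatever format you pick, the loss and any decoding step in the model have to assume the same one, which is why consistency matters more than the specific choice.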