Ultralytics has the convert_coco function, which converts a COCO dataset split into an Ultralytics dataset split.
But a crucial part of an Ultralytics dataset is the data.yaml file, which defines the paths to the splits and the object-categories dictionary, and convert_coco does not generate it. Is there a different function in the Ultralytics package that auto-creates data.yaml based on the ["categories"] field of the .coco.json files?
Note: by "dataset split" I mean one of the "train", "test", "validate" splits.
Afaik, there presently isn't a function that generates a data.yaml, but given a COCO JSON file, that seems plausible to include. I recommend opening a feature request! Even better if you wanted to help write such a function.
What should the function be called? I was thinking coco_json_to_data_yaml.
Should this be included in convert_coco? My thinking is that it should not, because convert_coco needs to be called once per split anyway.
In what file should this function be included? (I am not familiar with the Ultralytics library file structure, but ultralytics/data/converter seems reasonable.) Also, where should the test file be located?
Is the definition of data.yaml documented anywhere (what fields does it have, and what values can they take)?
Is this function definition reasonable? (If yes, I will move on to implementation, testing, and a PR.)
import json
from pathlib import Path

import yaml


def coco_json_to_data_yaml(coco_json_file_path, output_data_yaml_file_path):
    """Create an Ultralytics YOLO `data.yaml` file from a COCO annotations JSON.

    This helper reads the `categories` section of a COCO annotations file (the
    `_annotations.coco.json` produced by many labeling tools) and writes a
    `data.yaml` compatible with Ultralytics YOLO training / validation routines.

    The resulting YAML contains:

    - `path`: Root path to the converted dataset directory
    - `train`, `val`, `test`: Subdirectory paths (relative to `path`) pointing to the image folders
    - `names`: Mapping from category id (int) to human-readable class name

    Notes
    -----
    * The function does not validate that the train/val/test folders actually
      exist; `convert_coco` (from `ultralytics.data.converter`) is typically run
      beforehand to create the YOLO directory structure.
    * The YAML is overwritten if it already exists.

    Parameters
    ----------
    coco_json_file_path : str | PathLike
        Path to a COCO annotations JSON containing a top-level `categories` list.
    output_data_yaml_file_path : str | PathLike
        Destination path for the generated `data.yaml` file.

    Raises
    ------
    FileNotFoundError
        If `coco_json_file_path` does not exist.
    KeyError
        If the JSON does not contain a `categories` key.
    json.JSONDecodeError
        If the file is not valid JSON.

    Returns
    -------
    None

    Example
    -------
    >>> from ultralytics.data.converter import coco_json_to_data_yaml, convert_coco
    >>> # Assume a COCO dataset under ./my_coco_dataset with train/valid/test splits
    >>> COCO_DIR = './my_coco_dataset'
    >>> YOLO_DIR = './my_ultralytics_dataset'
    >>> for split in ['train', 'valid', 'test']:
    ...     convert_coco(f"{COCO_DIR}/{split}", f"{YOLO_DIR}/{split}", use_segments=False)
    ...
    >>> # After conversion, create (or overwrite) the unified data.yaml
    >>> coco_json_to_data_yaml(f"{COCO_DIR}/train/_annotations.coco.json", f"{YOLO_DIR}/data.yaml")
    >>> # You can now train: `yolo detect train data={YOLO_DIR}/data.yaml model=yolo12n.pt`
    """
    with open(coco_json_file_path) as f:
        coco_categories = json.load(f)["categories"]
    data = {
        # Dataset root: assume data.yaml sits at the top of the converted dataset
        # (replaces the undefined ULTRALYTICS_DATASET_DIRECTORY placeholder)
        "path": str(Path(output_data_yaml_file_path).resolve().parent),
        "train": "train/images",
        "val": "valid/images",
        "test": "test/images",
        "names": {category["id"]: category["name"] for category in coco_categories},
    }
    with open(output_data_yaml_file_path, "w") as f:
        yaml.dump(data, f, sort_keys=False)
Naming functions can be tough and very subjective, but I like it (though I won't be the one who decides anything if it were to be added).
No, the functions should be separate. However, it might warrant creating a class with both methods, in case there's relevant information to track from one to the other.
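To make that concrete, here is a rough sketch of such a class; the name CocoToYoloDatasetBuilder and its methods are hypothetical illustrations, not existing Ultralytics API:

```python
import json
from pathlib import Path


class CocoToYoloDatasetBuilder:
    """Hypothetical helper tying split conversion and data.yaml generation together."""

    def __init__(self, yolo_dir):
        self.yolo_dir = Path(yolo_dir)
        self.names = {}  # category mapping shared between the two steps

    def load_categories(self, coco_json_file_path):
        # Read the COCO `categories` list and remember it for data.yaml generation.
        with open(coco_json_file_path) as f:
            categories = json.load(f)["categories"]
        self.names = {i: c["name"] for i, c in enumerate(categories)}
        return self.names

    def build_data_dict(self):
        # Assemble the dict that would be dumped to data.yaml (e.g. via yaml.dump).
        return {
            "path": str(self.yolo_dir),
            "train": "train/images",
            "val": "valid/images",
            "test": "test/images",
            "names": self.names,
        }
```

The state worth sharing here is the category mapping: read it once, reuse it for the YAML.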
The converter file makes sense to me; whoever reviews the PR should let you know if it should go elsewhere.
The docs have examples for each, and the source code contains YAML files for several datasets. They'll mostly be the same, with keypoint datasets being a notable difference (a couple of additional keys). It's probably fine to start with bounding-box and segmentation datasets, or even just one of them. Better to get early feedback before putting in too much work.
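For orientation, a minimal detection data.yaml generally looks like the following (paths and class names here are purely illustrative):

```yaml
# Illustrative detection dataset config; keypoint (pose) datasets add keys
# such as kpt_shape and flip_idx on top of these.
path: ./my_ultralytics_dataset  # dataset root
train: train/images             # relative to path
val: valid/images
test: test/images               # optional
names:
  0: person
  1: car
```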
Seems very reasonable! The only note is that it might make sense to use enumerate while iterating over the categories, so you can use the indices as the keys and the category names as the values in the mapping.
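To illustrate the difference with toy categories (not from a real dataset): COCO ids may be 1-based or non-contiguous, while enumerate always yields contiguous 0-based indices:

```python
categories = [
    {"id": 1, "name": "person"},
    {"id": 3, "name": "car"},  # COCO ids can skip values and start at 1
]

# Keyed by raw COCO id: {1: 'person', 3: 'car'}
by_id = {c["id"]: c["name"] for c in categories}

# Keyed by position via enumerate: {0: 'person', 1: 'car'}
by_index = {i: c["name"] for i, c in enumerate(categories)}

print(by_id)
print(by_index)
```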
Hope that helps! Lemme know if you have any additional questions