Ultralytics has the convert_coco function, which converts a COCO dataset split into an Ultralytics dataset split.
But a crucial part of an Ultralytics dataset is the data.yaml file, which defines the paths to the splits and the object-categories dictionary, and convert_coco does not generate it. Is there a different function in the Ultralytics package that auto-creates data.yaml based on the ["categories"] field of the .coco.json files?
Note: by "dataset split" I mean one of the "train", "test", "validate" splits.
Afaik, there presently isn't a function that generates a data.yaml, but given a COCO JSON file, that seems plausible to include. I recommend opening a feature request! Even better if you wanted to help write such a function.
What should the function be called? I was thinking coco_json_to_data_yaml.
Should this be included in convert_coco? My thinking is that it should not, because convert_coco needs to be called once per split anyway.
In what file should this function be included? (I am not familiar with the Ultralytics library file structure, but ultralytics/data/converter seems reasonable.) Also, where should the test file be located?
Is the definition of data.yaml documented anywhere (what fields does it have, and what values can they take)?
Is this function definition reasonable? (If yes, I will move on to implementation, testing, and a PR.)
import json
from pathlib import Path

import yaml


def coco_json_to_data_yaml(coco_json_file_path, output_data_yaml_file_path):
    """Create an Ultralytics YOLO `data.yaml` file from a COCO annotations JSON.

    This helper reads the `categories` section of a COCO annotations file (the
    `_annotations.coco.json` produced by many labeling tools) and writes a
    `data.yaml` compatible with Ultralytics YOLO training / validation routines.

    The resulting YAML contains:

    - `path`: Root path to the converted dataset directory
    - `train`, `val`, `test`: Subdirectory paths (relative to `path`) pointing to the image folders
    - `names`: Mapping from category id (int) to human-readable class name

    Notes
    -----
    * The function does not validate that the train/val/test folders actually
      exist; `convert_coco` (from `ultralytics.data.converter`) is typically run
      beforehand to create the YOLO directory structure.
    * The YAML is overwritten if it already exists.

    Parameters
    ----------
    coco_json_file_path : str | PathLike
        Path to a COCO annotations JSON containing a top-level `categories` list.
    output_data_yaml_file_path : str | PathLike
        Destination path for the generated `data.yaml` file.

    Raises
    ------
    FileNotFoundError
        If `coco_json_file_path` does not exist.
    KeyError
        If the JSON does not contain a `categories` key.
    json.JSONDecodeError
        If the file is not valid JSON.

    Returns
    -------
    None

    Example
    -------
    >>> from ultralytics.data.converter import coco_json_to_data_yaml, convert_coco
    >>> # Assume a COCO dataset under ./my_coco_dataset with train/valid/test splits
    >>> COCO_DIR = './my_coco_dataset'
    >>> YOLO_DIR = './my_ultralytics_dataset'
    >>> for split in ['train', 'valid', 'test']:
    ...     convert_coco(f"{COCO_DIR}/{split}", f"{YOLO_DIR}/{split}", use_segments=False)
    ...
    >>> # After conversion, create (or overwrite) the unified data.yaml
    >>> coco_json_to_data_yaml(f"{COCO_DIR}/train/_annotations.coco.json", f"{YOLO_DIR}/data.yaml")
    >>> # You can now train: `yolo detect train data={YOLO_DIR}/data.yaml model=yolo12n.pt`
    """
    with open(coco_json_file_path) as f:
        coco_categories = json.load(f)["categories"]
    data = {
        # Dataset root: assume data.yaml sits at the top of the converted dataset
        # (replaces the undefined ULTRALYTICS_DATASET_DIRECTORY placeholder)
        "path": str(Path(output_data_yaml_file_path).resolve().parent),
        "train": "train/images",
        "val": "valid/images",
        "test": "test/images",
        "names": {category["id"]: category["name"] for category in coco_categories},
    }
    with open(output_data_yaml_file_path, "w") as f:
        yaml.dump(data, f, sort_keys=False)
Naming functions can be tough and very subjective, but I like it (though I won't be the one who decides anything if it were to be added).
No, the functions should be separate. However, it might warrant creating a class with both methods, in case there's relevant information to track from one to the other.
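To make that concrete, here is a rough sketch of such a class; the name CocoToYoloDatasetBuilder and its methods are hypothetical illustrations, not existing Ultralytics API:

```python
import json
from pathlib import Path


class CocoToYoloDatasetBuilder:
    """Hypothetical helper tying split conversion and data.yaml generation together."""

    def __init__(self, yolo_dir):
        self.yolo_dir = Path(yolo_dir)
        self.names = {}  # category mapping shared between the two steps

    def load_categories(self, coco_json_file_path):
        # Read the COCO `categories` list and remember it for data.yaml generation.
        with open(coco_json_file_path) as f:
            categories = json.load(f)["categories"]
        self.names = {i: c["name"] for i, c in enumerate(categories)}
        return self.names

    def build_data_dict(self):
        # Assemble the dict that would be dumped to data.yaml (e.g. via yaml.dump).
        return {
            "path": str(self.yolo_dir),
            "train": "train/images",
            "val": "valid/images",
            "test": "test/images",
            "names": self.names,
        }
```

The state worth sharing here is the category mapping: read it once, reuse it for the YAML.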
The converter file makes sense to me; whoever reviews the PR should let you know if it should go elsewhere.
The docs have examples for each, and the source code contains YAML files for several datasets. They'll mostly be the same, with keypoint datasets being a notable difference (a couple of additional keys). It's probably fine to start with bounding-box and segmentation datasets, or even just one of them. Better to get early feedback before putting in too much work.
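For orientation, a minimal detection data.yaml generally looks like the following (paths and class names here are purely illustrative):

```yaml
# Illustrative detection dataset config; keypoint (pose) datasets add keys
# such as kpt_shape and flip_idx on top of these.
path: ./my_ultralytics_dataset  # dataset root
train: train/images             # relative to path
val: valid/images
test: test/images               # optional
names:
  0: person
  1: car
```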
Seems very reasonable! The only note is that it might make sense to use enumerate while iterating over the categories, so you can use the indices as the keys and the category names as the values in the mapping.
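To illustrate the difference with toy categories (not from a real dataset): COCO ids may be 1-based or non-contiguous, while enumerate always yields contiguous 0-based indices:

```python
categories = [
    {"id": 1, "name": "person"},
    {"id": 3, "name": "car"},  # COCO ids can skip values and start at 1
]

# Keyed by raw COCO id: {1: 'person', 3: 'car'}
by_id = {c["id"]: c["name"] for c in categories}

# Keyed by position via enumerate: {0: 'person', 1: 'car'}
by_index = {i: c["name"] for i, c in enumerate(categories)}

print(by_id)
print(by_index)
```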
Hope that helps! Lemme know if you have any additional questions