Croissant is a high-level format for machine learning datasets that brings together four rich layers. https://mlcommons.org/croissant
Muqing Zhou 70fd59e26a
Fix custom JSON-LD type-checking compatibility for Python 3.14 upgrade (#1031)
### Motivation
This PR prepares for the Python 3.14 upgrade by moving custom JSON-LD
type-checking validation from runtime (inside the jsonld_fields
generator) to class definition/import time (inside the
@mlc_dataclasses.dataclass decorator).

### Background Context
Python 3.14 introduces deferred/lazy evaluation of type annotations
([PEP 649](https://peps.python.org/pep-0649/)).

In mlcroissant, type-checking validation was previously executed
dynamically at runtime when traversing fields. If a downstream user or
test suite dynamically mocks a type globally (for instance, patching
datetime.datetime to return a custom subclass like MockDatetime during
unit tests), then resolving type annotations dynamically at runtime will
point to the mocked class MockDatetime, while the static class fields
established during module import time still point to the original
datetime.datetime. This mismatch results in a strict type-checking
TypeError at runtime, e.g. `TypeError: Field "Metadata.date_created"
should have type MockDatetime | None. Got datetime.datetime | None`.

Moving this check to class definition time ensures all type annotations
are evaluated immediately when the module is imported/loaded, making the
library robust against runtime type mocking/patching (e.g., for testing
components like croissant_pipelines_test internally) and enforcing a
clean, fail-fast architecture for static class definitions.
2026-05-27 12:50:32 -07:00

Croissant 🥐

CI Python 3.10+

Summary

Croissant 🥐 is a high-level format for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file; it works with existing datasets to make them easier to find, use, and support with tools. Croissant builds on schema.org and its Dataset vocabulary, a widely used format for representing datasets on the Web and making them searchable. You can find a gentle introduction in the companion paper Croissant: A Metadata Format for ML-Ready Datasets.

Trying It Out

Croissant is currently under development by the community. You can try the Croissant implementation, mlcroissant:

Installation (requires Python 3.10+):

pip install mlcroissant

Loading an example dataset:

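import mlcroissant as mlc

# Load the Croissant JSON-LD description of the dataset.
ds = mlc.Dataset("https://raw.githubusercontent.com/mlcommons/croissant/main/datasets/1.0/gpt-3/metadata.json")
# Print the dataset-level metadata.
metadata = ds.metadata.to_json()
print(f"{metadata['name']}: {metadata['description']}")
# Iterate over the records of the "default" record set.
for x in ds.records(record_set="default"):
    print(x)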

Use it in your ML workflow:

# 1. Point to a local or remote Croissant file
import mlcroissant as mlc
url = "https://huggingface.co/api/datasets/zalando-datasets/fashion_mnist/croissant"
# 2. Inspect metadata
print(mlc.Dataset(url).metadata.to_json())
# 3. Use Croissant dataset in your ML workload
import tensorflow_datasets as tfds
builder = tfds.core.dataset_builders.CroissantBuilder(
    jsonld=url,
    record_set_ids=["fashion_mnist"],
    file_format='array_record',
)
builder.download_and_prepare()
# 4. Split for training/testing
train, test = builder.as_data_source(split=['default[:80%]', 'default[80%:]'])

Please see the notebook recipes for more examples.

Why a standard format for ML datasets?

Datasets are the source code of machine learning (ML), but working with them is needlessly hard. Each dataset has its own file organization and its own method for translating file contents into data structures, so each one requires a new approach to using the data. We need a standard dataset format to make ML datasets easier to find and use, and especially to make it easier to develop tools for creating, understanding, and improving them.

The Croissant Format

Croissant 🥐 is a high-level format for machine learning datasets. Croissant brings together four rich layers (in a tasty manner, we hope 😉):

  • Metadata: description of the dataset, including responsible ML aspects
  • Resources: one or more files or other sources containing the raw data
  • Structure: how the raw data is combined and arranged into data structures for use
  • ML semantics: how the data is most often used in an ML context

Simple Format Example

Here is an extremely simple example of the Croissant format, with comments showing the four layers. The @context preamble is not included for simplicity. Complete Croissant definitions for a wide range of datasets are included in the datasets folder of this repository.

{
  "@type": "sc:Dataset",
  "name": "minimal_example_with_recommended_fields",
  "description": "This is a minimal example, including the required and the recommended fields.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "url": "https://example.com/dataset/recipes/minimal-recommended",
  "conformsTo": "http://mlcommons.org/croissant/1.0",
  "distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "minimal.csv",
      "name": "minimal.csv",
      "contentUrl": "data/minimal.csv",
      "encodingFormat": "text/csv",
      "sha256": "48a7c257f3c90b2a3e529ddd2cca8f4f1bd8e49ed244ef53927649504ac55354"
    }
  ],
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "name": "examples",
      "description": "Records extracted from the example table, with their schema.",
      "field": [
        {
          "@type": "cr:Field",
          "name": "name",
          "description": "The first column contains the name.",
          "dataType": "sc:Text",
          "source": {
            "fileObject": { "@id": "minimal.csv" },
            "extract": {
              "column": "name"
            }
          }
        },
        {
          "@type": "cr:Field",
          "name": "age",
          "description": "The second column contains the age.",
          "dataType": "sc:Integer",
          "source": {
            "fileObject": { "@id": "minimal.csv" },
            "extract": {
              "column": "age"
            }
          }
        }
      ]
    }
  ]
}
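
To make the link between the resources and structure layers concrete, here is a minimal sketch that reads a trimmed copy of the example above (descriptions and checksum omitted) with only the standard `json` module and traces each field back to its file and column. It does not use mlcroissant; `CROISSANT_JSON` and `summarize` are hypothetical names chosen for illustration.

```python
import json

# Trimmed copy of the example above (assumption: descriptions and sha256 dropped).
CROISSANT_JSON = """
{
  "@type": "sc:Dataset",
  "name": "minimal_example_with_recommended_fields",
  "distribution": [
    {"@type": "cr:FileObject", "@id": "minimal.csv",
     "contentUrl": "data/minimal.csv", "encodingFormat": "text/csv"}
  ],
  "recordSet": [
    {"@type": "cr:RecordSet", "name": "examples",
     "field": [
       {"@type": "cr:Field", "name": "name", "dataType": "sc:Text",
        "source": {"fileObject": {"@id": "minimal.csv"},
                   "extract": {"column": "name"}}},
       {"@type": "cr:Field", "name": "age", "dataType": "sc:Integer",
        "source": {"fileObject": {"@id": "minimal.csv"},
                   "extract": {"column": "age"}}}
     ]}
  ]
}
"""


def summarize(doc):
    """Return (record set, field, data type, file URL, column) for every field."""
    # Resources layer: map each file's @id to where its content lives.
    resources = {res["@id"]: res["contentUrl"] for res in doc["distribution"]}
    rows = []
    # Structure layer: each field points back to a resource and a column.
    for record_set in doc["recordSet"]:
        for field in record_set["field"]:
            source = field["source"]
            file_id = source["fileObject"]["@id"]
            rows.append((record_set["name"], field["name"], field["dataType"],
                         resources[file_id], source["extract"]["column"]))
    return rows


doc = json.loads(CROISSANT_JSON)
for row in summarize(doc):
    print(row)
```

Each printed tuple shows that a field is defined by reference: its `source` names a `cr:FileObject` by `@id` and a column to extract, rather than embedding any data.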

Resources

Getting involved

Integrations

Governance

Croissant is being developed by the community as a Task Force of the MLCommons Association Datasets Working Group. The Task Force is open to anyone (as is the parent Datasets Working Group). The Task Force is co-chaired by Omar Benjelloun and Elena Simperl.

Contributors

Albert Villanova (Hugging Face), Andrew Zaldivar (Google), Baishan Guo (Meta), Carole-Jean Wu (Meta), Ce Zhang (ETH Zurich), Costanza Conforti (Google), D. Sculley (Kaggle), Dan Brickley (Schema.Org), Eduardo Arino de la Rubia (Meta), Edward Lockhart (DeepMind), Elena Simperl (King's College London), Goeff Thomas (Kaggle), Joan Giner-Miguelez (UOC), Joaquin Vanschoren (TU/Eindhoven, OpenML), Jos van der Velde (TU/Eindhoven, OpenML), Julien Chaumond (Hugging Face), Kurt Bollacker (MLCommons), Lora Aroyo (Google), Luis Oala (Dotphoton), Meg Risdal (Kaggle), Natasha Noy (Google), Newsha Ardalani (Meta), Omar Benjelloun (Google), Peter Mattson (MLCommons), Pierre Marcenac (Google), Pierre Ruyssen (Google), Pieter Gijsbers (TU/Eindhoven, OpenML), Prabhant Singh (TU/Eindhoven, OpenML), Quentin Lhoest (Hugging Face), Steffen Vogler (Bayer), Taniya Das (TU/Eindhoven, OpenML), Michael Kuchnik (Meta)

Thank you for supporting Croissant! 🙂

Citation

@inproceedings{NEURIPS2024_9547b09b,
 author = {Akhtar, Mubashara and Benjelloun, Omar and Conforti, Costanza and Foschini, Luca and Gijsbers, Pieter and Giner-Miguelez, Joan and Goswami, Sujata and Jain, Nitisha and Karamousadakis, Michalis and Krishna, Satyapriya and Kuchnik, Michael and Lesage, Sylvain and Lhoest, Quentin and Marcenac, Pierre and Maskey, Manil and Mattson, Peter and Oala, Luis and Oderinwale, Hamidah and Ruyssen, Pierre and Santos, Tim and Shinde, Rajat and Simperl, Elena and Suresh, Arjun and Thomas, Goeffry and Tykhonov, Slava and Vanschoren, Joaquin and Varma, Susheel and van der Velde, Jos and Vogler, Steffen and Wu, Carole-Jean and Zhang, Luyao},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
 pages = {82133--82148},
 publisher = {Curran Associates, Inc.},
 title = {Croissant: A Metadata Format for ML-Ready Datasets},
 url = {https://proceedings.neurips.cc/paper_files/paper/2024/file/9547b09b722f2948ff3ddb5d86002bc0-Paper-Datasets_and_Benchmarks_Track.pdf},
 volume = {37},
 year = {2024}
}

Licensing

  • Croissant implementation and examples are licensed under Apache 2.0.
  • Croissant specification © 2024-2026 by MLCommons Association and contributors is licensed under CC BY-ND 4.0.

    Note: The CC BY-ND license was selected to facilitate widespread adoption and use of the Croissant specification while maintaining a canonical reference version. However, this license can raise questions about which downstream uses are permissible. MLCommons wants to assure all prospective users that they are free to remix and adapt the Croissant specification for their internal use. Users who want to distribute something they have created that is based on or adds to the specification may do so, as long as the Croissant specification is referenced through a link (i.e., not incorporated directly) and the specification itself isn't changed. Just remember to include the attribution. Don't hesitate to reach out if you have any questions.