
🥐 ML Croissant

A Python library for the MLCommons Croissant dataset format.

Python requirements

Python version >= 3.10.

If you do not have a Python environment:

python3 -m venv ~/py3
source ~/py3/bin/activate

Install

From the python/mlcroissant directory of the repository:

python -m pip install ".[dev]"

Verify/load a Croissant dataset

python scripts/validate.py --file ../../datasets/titanic/metadata.json

The command:

  • Exits with code 0, prints Done, and displays any encountered warnings when no errors were found in the file.
  • Exits with code 1 and displays all encountered errors and warnings otherwise.
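
Because of these exit codes, the validator is easy to script around (e.g., in CI). A minimal sketch using only the Python standard library:

import subprocess

# Run the validator and branch on its exit code (0 = valid, 1 = errors).
result = subprocess.run([
    "python",
    "scripts/validate.py",
    "--file",
    "../../datasets/titanic/metadata.json",
])
if result.returncode == 0:
    print("No errors found in the Croissant file.")
else:
    print("Validation failed; see the errors/warnings printed above.")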

Similarly, you can load records from a dataset by launching:

python scripts/load.py \
    --file ../../datasets/titanic/metadata.json \
    --record_set passengers \
    --num_records 10
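
If you prefer to stay in Python instead of shelling out to the script, newer mlcroissant releases expose a Dataset class for the same purpose. The sketch below assumes that API and may not match this early version exactly:

import itertools

import mlcroissant as mlc

# Assumption: mlc.Dataset and .records() as in newer mlcroissant releases;
# the exact signature may differ or be absent in this early version.
ds = mlc.Dataset(jsonld="../../datasets/titanic/metadata.json")
# Mirrors --record_set passengers --num_records 10 from the CLI call above.
for record in itertools.islice(ds.records(record_set="passengers"), 10):
    print(record)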

Programmatically build JSON-LD files

You can programmatically build Croissant JSON-LD files using the Python API.

import mlcroissant as mlc

metadata = mlc.nodes.Metadata(
    name="...",
)
metadata.to_json()  # This returns the JSON-LD content.
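
For instance, to persist the generated JSON-LD to disk (a minimal sketch; the dataset name and output path are made up for the example):

import json

import mlcroissant as mlc

metadata = mlc.nodes.Metadata(
    name="my_dataset",  # Hypothetical name, for illustration only.
)
jsonld = metadata.to_json()  # The JSON-LD content as a Python object.
with open("metadata.json", "w") as f:
    json.dump(jsonld, f, indent=2)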

For a full working example, refer to the script that converts Hugging Face datasets to Croissant files; it uses this Python API to build JSON-LD files programmatically.

Run tests

All tests can be run from the Makefile:

make tests

Design

The most important modules in the library are:

  • mlcroissant/_src/structure_graph is responsible for the static analysis of Croissant files. We convert Croissant files to a Python representation called a "structure graph" (using NetworkX). In the process, we catch any static analysis issues (e.g., a missing mandatory field or a logic problem in the file); see the toy sketch after this list.
  • mlcroissant/_src/operation_graph is responsible for the dynamic analysis of Croissant files (i.e., actually loading the dataset by yielding examples). We convert the structure graph into an "operation graph". Operations are the unit transformations that build the dataset (such as Download, Extract, etc.).
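
As a toy illustration of the structure-graph idea (the node types and the check below are invented for this example and are not the library's actual internals):

import networkx as nx

# Build a tiny "structure graph": typed Croissant elements as nodes.
graph = nx.DiGraph()
graph.add_node("metadata", type="Metadata", name="titanic")
graph.add_node("passengers", type="RecordSet")
graph.add_node("passengers/name", type="Field", source=None)
graph.add_edge("metadata", "passengers")
graph.add_edge("passengers", "passengers/name")

# A static-analysis pass over the graph: e.g., every Field needs a source.
for node, attrs in graph.nodes(data=True):
    if attrs["type"] == "Field" and attrs.get("source") is None:
        print(f"Error: field {node!r} is missing a mandatory source.")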

For the full design, refer to the design doc for an overview of the implementation.

Contribute

All contributions are welcome! We even have good first issues to get you started in the project. Refer to the GitHub project for more detailed user stories.

The development workflow goes as follows:

  • Read the Design section above to see how the repo is organized.
  • Fork the repository: https://github.com/mlcommons/croissant.
  • Clone the newly forked repository:
    git clone git@github.com:<YOUR_GITHUB_LDAP>/croissant.git
    
  • Create a new branch:
    cd croissant/
    git checkout -b feature/my-awesome-new-feature
    
  • Install the repository and dev tools:
    cd python/mlcroissant
    pip install -e .[dev]
    
  • Code the feature. We support VS Code with pre-set settings.
  • Commit and push to GitHub:
    git add .
    git commit -m "Add my awesome new feature"
    git push --set-upstream origin feature/my-awesome-new-feature
    
  • Update your code until all tests are green:
    • pytest runs unit tests.
    • pytype -j auto runs pytype.
  • Open a pull request (PR) against the main branch of https://github.com/mlcommons/croissant, and ask for feedback!

Debug

You can debug the validation of the file using the --debug flag:

python scripts/validate.py --file ../../datasets/titanic/metadata.json --debug

This will:

  1. print extra information, like the generated nodes;
  2. save the generated structure graph to a folder indicated in the logs.
