MLCommons datasets format.

Project description

mlcroissant 🥐

Discover mlcroissant 🥐 with this introductory tutorial in Google Colab.

Python requirements

Python version >= 3.10.

If you do not have a Python environment:

python3 -m venv ~/py3
source ~/py3/bin/activate

Install

python -m pip install ".[dev]"

The command can fail because of missing dependencies, for example:

Failed to build pygraphviz
ERROR: Could not build wheels for pygraphviz, which is required to install pyproject.toml-based projects

This can be fixed by running:

sudo apt-get install python3-dev graphviz libgraphviz-dev pkg-config

Conda installation

Conda can help create a consistent environment. It can also be useful to install packages without root access. To use Conda, run:

conda create --name croissant python=3.10 -y
conda activate croissant
conda install graphviz
python3 -m pip install ".[dev]"

Verify/load a Croissant dataset

mlcroissant validate --jsonld ../../datasets/titanic/metadata.json

The command:

  • Exits with 0, prints Done, and displays any warnings when no error is found in the file.
  • Exits with 1 and displays all encountered errors and warnings otherwise.
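
In a script or CI job, the exit code is usually what matters. The following is a minimal sketch of checking it from Python, assuming the mlcroissant executable is on the PATH. The helper name run_validate and its base_cmd parameter are hypothetical and not part of the library.

```python
import subprocess
from typing import Sequence


def run_validate(
    jsonld_path: str, base_cmd: Sequence[str] = ("mlcroissant",)
) -> tuple[bool, str]:
    """Run `<base_cmd> validate --jsonld <jsonld_path>`.

    Returns (ok, output): ok is True when the exit code is 0, and output is
    the combined stdout/stderr, which carries the warnings and errors.
    """
    result = subprocess.run(
        [*base_cmd, "validate", "--jsonld", jsonld_path],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit code is an expected outcome here
    )
    return result.returncode == 0, result.stdout + result.stderr
```

Calling run_validate("../../datasets/titanic/metadata.json") would then return a boolean that a build script can act on, together with the text the command printed.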

Similarly, you can generate a dataset by launching:

mlcroissant load \
    --jsonld ../../datasets/titanic/metadata.json \
    --record_set passengers \
    --num_records 10
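
The same load can be done from Python. The sketch below assumes the mlc.Dataset(jsonld=...) constructor and its records(record_set=...) method. The helper names first_records and load_titanic_sample are hypothetical, introduced only for illustration.

```python
import itertools
from typing import Any


def first_records(dataset: Any, record_set: str, num_records: int) -> list[Any]:
    """Return at most `num_records` records from `record_set`."""
    records = dataset.records(record_set=record_set)
    # islice stops early, so the remaining records are never generated.
    return list(itertools.islice(records, num_records))


def load_titanic_sample() -> list[Any]:
    """Python counterpart of the CLI call above (requires mlcroissant)."""
    # Imported lazily so that first_records stays dependency-free.
    import mlcroissant as mlc

    dataset = mlc.Dataset(jsonld="../../datasets/titanic/metadata.json")
    return first_records(dataset, record_set="passengers", num_records=10)
```

Using itertools.islice keeps the example lazy: only the first ten records are produced, which mirrors what --num_records 10 asks for on the command line.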

Loading a distribution via git+https

If the encodingFormat of a distribution is git+https, please provide the username and password by setting the CROISSANT_GIT_USERNAME and CROISSANT_GIT_PASSWORD environment variables. These will be used to construct the authentication necessary to load the distribution.

Note that, for datasets hosted on Hugging Face, CROISSANT_GIT_USERNAME and CROISSANT_GIT_PASSWORD should be your Hugging Face username and User Access Token, respectively. User Access Tokens can be generated following this guide.

Loading a distribution via HTTP with Basic Auth

If the contentUrl of a distribution requires authentication via Basic Auth, please provide the username and password by setting the CROISSANT_BASIC_AUTH_USERNAME and CROISSANT_BASIC_AUTH_PASSWORD environment variables. These will be used to construct the authentication necessary to load the distribution.
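
For readers unfamiliar with Basic Auth, the sketch below shows how a username and password are conventionally turned into an HTTP Authorization header. It illustrates the standard scheme only and is not a copy of mlcroissant's internal code. The function name basic_auth_header is hypothetical.

```python
import base64
import os
from typing import Mapping, Optional


def basic_auth_header(env: Mapping[str, str] = os.environ) -> Optional[dict[str, str]]:
    """Build a Basic Auth header from the CROISSANT_BASIC_AUTH_* variables.

    Returns None when either variable is missing.
    """
    username = env.get("CROISSANT_BASIC_AUTH_USERNAME")
    password = env.get("CROISSANT_BASIC_AUTH_PASSWORD")
    if username is None or password is None:
        return None
    # Basic Auth is "username:password", base64-encoded.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

Because base64 is an encoding and not encryption, credentials sent this way are only protected when the contentUrl uses HTTPS.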

Programmatically build JSON-LD files

You can programmatically build Croissant JSON-LD files using the Python API.

import mlcroissant as mlc

metadata = mlc.nodes.Metadata(
    name="...",
)
metadata.to_json()  # Returns the JSON-LD content as a Python dict.
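
To persist the result, the returned content can be serialized with the standard json module. This is a minimal sketch, assuming to_json() returns a JSON-serializable dict. The helper name write_jsonld and the output file name are hypothetical.

```python
import json
from pathlib import Path
from typing import Any


def write_jsonld(metadata: Any, path: str) -> Path:
    """Serialize `metadata.to_json()` to `path` as indented JSON."""
    content = metadata.to_json()  # assumed to be JSON-serializable
    target = Path(path)
    target.write_text(json.dumps(content, indent=2), encoding="utf-8")
    return target
```

For example, write_jsonld(metadata, "croissant.json") would produce a file that can then be checked with mlcroissant validate --jsonld croissant.json.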

Run tests

All tests can be run from the Makefile:

make tests

Note that git lfs should be installed to successfully pass all tests:

git lfs install

Design

The most important modules in the library are:

  • mlcroissant/_src/structure_graph is responsible for the static analysis of the Croissant files. We convert Croissant files to a Python representation called "structure graph" (using NetworkX). In the process, we catch any static analysis issues (e.g., a missing mandatory field or a logic problem in the file).
  • mlcroissant/_src/operation_graph is responsible for the dynamic analysis of the Croissant files (i.e., actually loading the dataset by yielding examples). We convert the structure graph into an "operation graph". Operations are the unit transformations used to build the dataset (such as Download, Extract, etc.).
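
The idea of an operation graph can be illustrated with a toy example: operations are nodes, dependencies are edges, and execution follows a topological order. The sketch below uses the standard library's graphlib in place of NetworkX and does not reflect the library's actual classes. The Read operation and the execution_order helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Toy operation graph: each key depends on the operations in its set.
# "Download" and "Extract" are named in the text; "Read" is hypothetical.
operation_graph: dict[str, set[str]] = {
    "Download": set(),
    "Extract": {"Download"},
    "Read": {"Extract"},
}


def execution_order(graph: dict[str, set[str]]) -> list[str]:
    """Return the operations in an order that respects all dependencies."""
    return list(TopologicalSorter(graph).static_order())
```

Here execution_order(operation_graph) yields Download, then Extract, then Read, since each operation can only run once the operation it depends on has finished.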

Other important modules are:

For an overview of the implementation, refer to the design doc.

Caching. By default, all downloaded/extracted files are cached in ~/.cache/croissant, but you can override this by setting the environment variable $CROISSANT_CACHE.
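
The lookup described above can be sketched as follows. This is an illustration of the documented behavior under the stated default, not the library's own code. The function name cache_dir is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping


def cache_dir(env: Mapping[str, str] = os.environ) -> Path:
    """Return $CROISSANT_CACHE if set and non-empty, else ~/.cache/croissant."""
    override = env.get("CROISSANT_CACHE")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "croissant"
```

Pointing CROISSANT_CACHE at a scratch directory can be handy on machines where the home directory has a small quota.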

Contribute

All contributions are welcome! We even have good first issues to help you get started. Refer to the GitHub project for more detailed user stories, and read the Design section above to see how the repo is organized.

An easy way to contribute to mlcroissant is using Croissant's configured codespaces. To start a codespace:

  • On Croissant's main repo page, click on the <Code> button and select the Codespaces tab. You can start a new codespace by clicking on the + sign on the left side of the tab. By default, the codespace will start on Croissant's main branch, unless you select otherwise from the branches drop-down menu on the left side.
  • While building the environment, the codespace installs all of mlcroissant's required dependencies, so you can start coding right away! Of course, you can further personalize your codespace.
  • To start contributing to Croissant:
    • Create a new branch from the Terminal tab in the bottom panel of your codespace with git checkout -b feature/my-awesome-new-feature
    • You can create new commits, and run most git commands from the Source Control tab in the left panel of your codespace. Alternatively, use the Terminal in the bottom panel of your codespace.
    • Iterate on your code until all tests are green (you can run tests with make pytest or from the Tests tab in the left panel of your codespace).
    • Open a pull request (PR) with the main branch of https://github.com/mlcommons/croissant, and ask for feedback!

Alternatively, you can contribute to mlcroissant using the "classic" GitHub workflow:

Debug

You can debug the validation of the file using the --debug flag:

mlcroissant validate --jsonld ../../datasets/titanic/metadata.json --debug

This will:

  1. print extra information, like the generated nodes;
  2. save the generated structure graph to a folder indicated in the logs.

Publishing packages

To publish a package:

  1. Bump the version in croissant/python/mlcroissant/pyproject.toml.
  2. Publish a release in GitHub. The workflow script python-publish.yml will trigger and publish the package to PyPI.
