Set of Machine Learning versioning helpers

Machine Learning Versioning Tools - MLV-tools

Public repository for versioning machine learning data.

Installing

MLV-tools can be installed from PyPI:

pip install ml-versioning-tools

It is also possible to install it directly from sources:

git clone https://github.com/peopledoc/ml-versioning-tools.git
cd ml-versioning-tools

    make develop
OR
    make package
    pip install ./package/*.whl

Tutorial

A tutorial is available to showcase how to use the tools. See MLV-tools tutorial.

Keywords

Step metadata: in this document it refers to the first code cell when it is used to declare metadata such as parameters, dvc inputs/outputs, etc.

Working directory: the project's working directory. Files specified in the user configuration are relative to this directory. The --working-directory (or -w) flag is used to specify the working directory.

Tools

ipynb_to_python: this command converts a given Jupyter Notebook to a parameterized and executable Python3 script (see specific syntax in section below)

ipynb_to_python -n [notebook_path] -o [python_script_path]
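
To give an idea of what the conversion starts from, here is a minimal sketch that pulls the code cells out of a notebook. It relies only on the public nbformat v4 JSON layout (a "cells" list whose entries have a "cell_type" and a "source"). The notebook content and the helper name code_cells are hypothetical, and this is not how ipynb_to_python is implemented.

```python
import json

# A tiny in-memory notebook in the nbformat v4 JSON layout (illustrative content).
NOTEBOOK_JSON = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": ["x = 1\n", "y = x + 1"]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": ["print(y)"]},
    ],
})


def code_cells(notebook_text):
    """Return the source of each code cell, in notebook order (hypothetical helper)."""
    notebook = json.loads(notebook_text)
    sources = []
    for cell in notebook["cells"]:
        if cell["cell_type"] != "code":
            continue
        source = cell["source"]
        # nbformat allows "source" to be a single string or a list of lines.
        sources.append(source if isinstance(source, str) else "".join(source))
    return sources


cells = code_cells(NOTEBOOK_JSON)
# Joining the cells gives the raw body that a generated script would be built from.
script_body = "\n\n".join(cells)
```

The real command additionally parameterizes the script from the Step metadata cell, which this sketch does not attempt.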

gen_dvc: this command creates a DVC command that calls the script generated by ipynb_to_python.

gen_dvc -i [python_script] --out-py-cmd [python_command] \
              --out-bash-cmd [dvc_command]

export_pipeline: this command exports the pipeline corresponding to the given DVC meta file into a bash script. Pipeline steps are called sequentially, in dependency order. It only works for local steps.

export_pipeline --dvc [DVC target meta file] -o [pipeline script]
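
"Dependency order" means that a step only runs after every step it depends on. A minimal sketch of that ordering, using the standard library graphlib module (Python 3.9+), is shown here. The step names and the dependencies mapping are hypothetical, and the way export_pipeline actually reads dependencies from DVC meta files is not described in this document.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each step maps to the set of steps it depends on.
dependencies = {
    "train_model.dvc": {"split_data.dvc"},
    "split_data.dvc": {"clean_data.dvc"},
    "clean_data.dvc": set(),
    "evaluate.dvc": {"train_model.dvc", "split_data.dvc"},
}


def execution_order(deps):
    """Return the steps so that every step comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
for step in order:
    print(step)
```

In this example the order is fully determined because each step depends on the previous one; with independent steps, several valid orders can exist.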

ipynb_to_dvc: this command converts a given Jupyter Notebook to a parameterized and executable Python3 script and a DVC command. It is the combination of ipynb_to_python and gen_dvc. It only works with a configuration file.

ipynb_to_dvc -n [notebook_path]

check_script_consistency and check_all_scripts_consistency: these commands ensure consistency between a Jupyter notebook and its generated Python script. They can be used as a git hook or in the project's continuous integration. The consistency check ignores blank lines and comments.

check_script_consistency -n [notebook_path] -s [script_path]

check_all_scripts_consistency -n [notebook_directory]
# Works only with a configuration file (provided or auto-detected)
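
The idea of a comparison that ignores blank lines and comments can be sketched as follows. This is an illustration only: the helper names are hypothetical, only whole-line comments are dropped, and how the real commands treat inline comments or docstrings is not stated here.

```python
def significant_lines(source):
    """Keep only lines that are neither blank nor whole-line comments."""
    lines = []
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        lines.append(line.rstrip())
    return lines


def is_consistent(first_source, second_source):
    """Two sources are consistent when their significant lines are identical."""
    return significant_lines(first_source) == significant_lines(second_source)


from_notebook = "# load data\nx = 1\n\ny = x + 1\n"
from_script = "x = 1\ny = x + 1\n# trailing note\n"
modified_script = "x = 2\ny = x + 1\n"

print(is_consistent(from_notebook, from_script))      # same code, different comments
print(is_consistent(from_notebook, modified_script))  # code differs
```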

Configuration

A configuration file can be provided, but it is not mandatory. Its default location is [working_dir]/.mlvtools. Use the --conf-path (or -c) flag on the command line to point to a different configuration file.

The configuration file format is JSON:

{
  "path":
  {
    "python_script_root_dir": "[path_to_the_script_directory]",
    "dvc_cmd_root_dir": "[path_to_the_dvc_cmd_directory]",
    "dvc_metadata_root_dir": "[path_to_the_dvc_metadata_directory]" [optional]
  },
  "ignore_keys": ["keywords", "to", "ignore"],
  "dvc_var_python_cmd_path": "MLV_PY_CMD_PATH_CUSTOM",
  "dvc_var_python_cmd_name": "MLV_PY_CMD_NAME_CUSTOM",
  "docstring_conf": "./docstring_conf.yml"
}

All given paths must be relative to the working directory.
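
As a quick sanity check before running the tools, the "relative to the working directory" rule can be verified on a configuration. The sketch below is an assumption-laden illustration: the directory values and the helper name absolute_paths are hypothetical, and the validation that MLV-tools itself performs is not described in this document.

```python
import json
from pathlib import PurePosixPath

# Hypothetical configuration content; the directory values are made up.
CONF_TEXT = """
{
  "path": {
    "python_script_root_dir": "./scripts",
    "dvc_cmd_root_dir": "./dvc_cmd"
  },
  "ignore_keys": ["# No effect"]
}
"""


def absolute_paths(conf):
    """Return the keys of the "path" section whose value is an absolute path."""
    return [
        key
        for key, value in conf.get("path", {}).items()
        if PurePosixPath(value).is_absolute()
    ]


conf = json.loads(CONF_TEXT)
offending = absolute_paths(conf)
if offending:
    print("These paths must be relative to the working directory:", offending)
```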

  • path_to_the_script_directory: the directory where the Python 3 script is generated by the ipynb_to_python command. The Python 3 script name is based on the notebook name.

      ipynb_to_python -n ./data/My\ Notebook.ipynb
    
      Generated script: `[path_to_the_script_directory]/my_notebook.py`
    
  • path_to_the_dvc_cmd_directory: the directory where DVC commands are generated by the gen_dvc command. Generated command names are based on Python 3 script names.

      gen_dvc -i ./scripts/my_notebook.py
    
      Generated commands: `[path_to_the_dvc_cmd_directory]/my_notebook_dvc`
    
  • path_to_the_dvc_metadata_directory: the directory where DVC metadata files are generated when the commands produced by gen_dvc are executed. This value is optional; by default, DVC metadata files are saved in the working directory. Generated DVC metadata file names are based on Python 3 script names.

      ./[path_to_the_dvc_cmd_directory]/my_notebook_dvc
    
      Generated files: `[path_to_the_dvc_metadata_directory]/my_notebook.dvc`
    
  • ignore_keys: list of keywords used to discard a cell. Default value is ['# No effect']. (See Discard cell section)

  • dvc_var_python_cmd_path, dvc_var_python_cmd_name, dvc_var_meta_filename: these customize the names of the variables that can be used in the dvc-cmd Docstring parameter. They correspond, respectively, to the variable holding the Python command file path, the variable holding the Python command file name, and the variable holding the default DVC meta file name. Default values are 'MLV_PY_CMD_PATH', 'MLV_PY_CMD_NAME' and 'MLV_DVC_META_FILENAME'. (See DVC Command/Complex cases section for usage)

  • docstring_conf: the path to the docstring configuration used for Jinja templating (see DVC templating section). This parameter is not mandatory.
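
The naming chain in the examples above (notebook, then script, then DVC command, then DVC meta file) can be reproduced with a few lines. This is only a sketch that matches the single example given: the exact normalization rules are not documented here, so lowercasing and replacing spaces with underscores is an assumption, and the function names are hypothetical.

```python
from pathlib import PurePosixPath


def script_name(notebook_path):
    """Derive a script name from a notebook name (assumed rule: lowercase, spaces to underscores)."""
    stem = PurePosixPath(notebook_path).stem
    return stem.lower().replace(" ", "_") + ".py"


def dvc_cmd_name(script_path):
    """DVC command name based on the script name, as in the example above."""
    return PurePosixPath(script_path).stem + "_dvc"


def dvc_meta_name(script_path):
    """DVC metadata file name based on the script name, as in the example above."""
    return PurePosixPath(script_path).stem + ".dvc"


print(script_name("./data/My Notebook.ipynb"))
print(dvc_cmd_name("./scripts/my_notebook.py"))
print(dvc_meta_name("./scripts/my_notebook.py"))
```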

Jupyter Notebook syntax

The Step metadata cell is used to declare script parameters and DVC outputs and dependencies. This is done using basic Docstring syntax. The Docstring must be the first statement in this cell; only comments can be written above it.

Good practices

Avoid using relative paths in your Jupyter Notebook: they are resolved relative to the notebook location, which is not the same location once the notebook has been converted to a script.

Python Script Parameters

Parameters can be declared in the Jupyter Notebook using basic Docstring syntax. This parameter description is used to generate configurable and executable Python scripts.

Parameters declaration in Jupyter Notebook:

Jupyter Notebook: process_files.ipynb

#:param [type]? [param_name]: [description]?
"""
:param str input_file: the input file
:param output_file: the output_file
:param rate: the learning rate
:param int retry:
"""

Generated Python3 script:

[...]
def process_file(input_file: str, output_file, rate, retry:int):
    """
     ...
    """
[...]

Script command line parameters:

my_script.py -h

usage: my_cmd [-h] --input-file INPUT_FILE --output-file OUTPUT_FILE --rate
             RATE --retry RETRY

Command for script [script_name]

optional arguments:
  -h, --help            show this help message and exit
  --input-file INPUT_FILE
                        the input file
  --output-file OUTPUT_FILE
                        the output_file
  --rate RATE           the learning rate
  --retry RETRY

All declared arguments are required.
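
The mapping from the Docstring to a command line can be illustrated with a small argparse sketch. It follows the `:param [type]? [param_name]: [description]?` syntax shown above and marks every argument as required. The regular expression, the TYPES table and the function names are assumptions made for this example, not the code MLV-tools generates.

```python
import argparse
import re

DOCSTRING = """
:param str input_file: the input file
:param output_file: the output_file
:param rate: the learning rate
:param int retry:
"""

# Matches ":param [type]? [param_name]: [description]?"
PARAM_RE = re.compile(r"^:param\s+(?:(?P<type>\w+)\s+)?(?P<name>\w+):\s*(?P<desc>.*)$")

# Only a few builtin types are mapped in this sketch; anything else stays a string.
TYPES = {"str": str, "int": int, "float": float}


def parse_params(docstring):
    """Return (name, type_name, description) for each :param line."""
    params = []
    for line in docstring.splitlines():
        match = PARAM_RE.match(line.strip())
        if match:
            params.append((match["name"], match["type"], match["desc"]))
    return params


def build_parser(docstring):
    """Build a parser with one required --option per declared parameter."""
    parser = argparse.ArgumentParser(prog="my_cmd")
    for name, type_name, desc in parse_params(docstring):
        parser.add_argument(
            "--" + name.replace("_", "-"),
            type=TYPES.get(type_name, str),
            required=True,
            help=desc or None,
        )
    return parser


args = build_parser(DOCSTRING).parse_args(
    ["--input-file", "in.csv", "--output-file", "out.csv", "--rate", "0.1", "--retry", "3"]
)
print(args.input_file, args.retry)
```

Note that an untyped parameter such as rate is received as a string in this sketch, so declaring the type in the Docstring avoids a manual conversion.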

DVC command

A DVC command is a wrapper around the dvc run command, called on a Python 3 script generated with the ipynb_to_python command. It is one step of a pipeline.

It is based on the data declared in the notebook metadata. Two modes are available:

  • describe only inputs/outputs, for simple cases (recommended)

  • describe the full command, for complex cases

Simple cases

Syntax

:param str input_csv_file: Path to input file
:param str output_csv_file: Path to output file
[...]

[:dvc-[in|out][\s{related_param}]?:[\s{file_path}]?]*
[:dvc-extra: {python_other_param}]?

:dvc-in: ./data/filter.csv
:dvc-in input_csv_file: ./data/info.csv
:dvc-out: ./data/train_set.csv
:dvc-out output_csv_file: ./data/test_set.csv
:dvc-extra: --mode train --rate 12

The provided {file_path} can be absolute or relative to the working directory.

The {related_param} is a parameter of the corresponding Python 3 script; it is filled in when the Python script is called.

The dvc-extra field declares parameters that are neither DVC outputs nor dependencies. These parameters are passed to the call of the Python 3 command.

pushd /working-directory

INPUT_CSV_FILE="./data/info.csv"
OUTPUT_CSV_FILE="./data/test_set.csv"

dvc run \
-d ./data/filter.csv \
-d $INPUT_CSV_FILE \
-o ./data/train_set.csv \
-o $OUTPUT_CSV_FILE \
gen_src/python_script.py --mode train --rate 12 \
        --input-csv-file $INPUT_CSV_FILE \
        --output-csv-file $OUTPUT_CSV_FILE
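
The translation from the simple syntax to a dvc run call can be sketched in Python. The sketch below is an approximation under stated assumptions: the regular expressions and the build_dvc_run helper are hypothetical, every :dvc-in/:dvc-out line is assumed to carry a file path, and paths are inlined instead of going through shell variables as the generated command does.

```python
import re
import shlex

DOCSTRING = """
:dvc-in: ./data/filter.csv
:dvc-in input_csv_file: ./data/info.csv
:dvc-out: ./data/train_set.csv
:dvc-out output_csv_file: ./data/test_set.csv
:dvc-extra: --mode train --rate 12
"""

DVC_IO_RE = re.compile(r"^:dvc-(?P<kind>in|out)(?:\s+(?P<param>\w+))?:\s*(?P<path>.*)$")
DVC_EXTRA_RE = re.compile(r"^:dvc-extra:\s*(?P<extra>.*)$")


def build_dvc_run(docstring, script_path):
    """Return the dvc run call as a list of arguments (hypothetical helper)."""
    io_flags, script_args, extra = [], [], []
    for line in docstring.splitlines():
        line = line.strip()
        io_match = DVC_IO_RE.match(line)
        if io_match:
            # Inputs become dependencies (-d), outputs become outputs (-o).
            flag = "-d" if io_match["kind"] == "in" else "-o"
            io_flags += [flag, io_match["path"]]
            if io_match["param"]:
                # A related parameter is also passed to the Python script.
                option = "--" + io_match["param"].replace("_", "-")
                script_args += [option, io_match["path"]]
            continue
        extra_match = DVC_EXTRA_RE.match(line)
        if extra_match:
            extra += shlex.split(extra_match["extra"])
    return ["dvc", "run", *io_flags, script_path, *extra, *script_args]


command = build_dvc_run(DOCSTRING, "gen_src/python_script.py")
print(" ".join(shlex.quote(part) for part in command))
```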

Complex cases

Syntax

:dvc-cmd: {dvc_command}

:dvc-cmd: dvc run -o ./out_train.csv -o ./out_test.csv
    "$MLV_PY_CMD_PATH -m train --out ./out_train.csv &&
     $MLV_PY_CMD_PATH -m test --out ./out_test.csv"

This syntax lets you provide the full dvc command to generate. All paths can be absolute or relative to the working directory. The variables $MLV_PY_CMD_PATH and $MLV_PY_CMD_NAME are available. They contain, respectively, the path and the name of the corresponding Python command. The variable $MLV_DVC_META_FILENAME contains the default name of the DVC meta file.

pushd /working-directory
MLV_PY_CMD_PATH="gen_src/python_script.py"
MLV_PY_CMD_NAME="python_script.py"

dvc run -f $MLV_DVC_META_FILENAME -o ./out_train.csv \
    -o ./out_test.csv \
    "$MLV_PY_CMD_PATH -m train --out ./out_train.csv && \
    $MLV_PY_CMD_PATH -m test --out ./out_test.csv"
popd
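
In the generated command, the shell expands these variables at run time. To preview what a :dvc-cmd: value will look like once expanded, a minimal sketch with the standard library string.Template can help. The variable values are taken from the example above; MLV_DVC_META_FILENAME is left out because its value is not shown here, and safe_substitute simply leaves any unknown variable untouched.

```python
from string import Template

# The :dvc-cmd: value from the example, written on one logical line.
DVC_CMD = (
    "dvc run -o ./out_train.csv -o ./out_test.csv "
    '"$MLV_PY_CMD_PATH -m train --out ./out_train.csv && '
    '$MLV_PY_CMD_PATH -m test --out ./out_test.csv"'
)

# Values as they appear in the generated command above.
variables = {
    "MLV_PY_CMD_PATH": "gen_src/python_script.py",
    "MLV_PY_CMD_NAME": "python_script.py",
}

# Preview only: this mimics shell expansion, it does not run anything.
expanded = Template(DVC_CMD).safe_substitute(variables)
print(expanded)
```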

DVC templating

Jinja2 templates can be used in the DVC part of the Docstring. For example, this is useful to declare the dependencies, outputs and extra parameters of all steps in a single place.

Example:

# Docstring in Jupyter notebook
"""
[...]
:dvc-in: {{ conf.train_data_file_path }}
:dvc-out: {{ conf.model_file_path }}
:dvc-extra: --rate {{ conf.rate }}
"""

# Docstring configuration file (YAML format): ./dc_conf.yml

train_data_file_path: ./data/trainset.csv
model_file_path: ./data/model.pkl
rate: 45

# DVC command generation
gen_dvc -i ./python_script.py --docstring-conf ./dc_conf.yml

The Docstring configuration file can be provided through the main configuration or using the --docstring-conf argument. This feature is only available for the gen_dvc command.
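
To show what the rendering step produces for the example above, here is a dependency-free sketch. It deliberately does not use Jinja2: a small regular expression stands in for the template engine and only handles `{{ conf.key }}` placeholders, and the configuration values are written as a dict instead of being read from the YAML file. The render helper is hypothetical.

```python
import re

DOCSTRING = """
:dvc-in: {{ conf.train_data_file_path }}
:dvc-out: {{ conf.model_file_path }}
:dvc-extra: --rate {{ conf.rate }}
"""

# Values from ./dc_conf.yml, written as a dict to avoid a YAML dependency.
conf = {
    "train_data_file_path": "./data/trainset.csv",
    "model_file_path": "./data/model.pkl",
    "rate": 45,
}

# Stand-in for Jinja2: only "{{ conf.<key> }}" placeholders are recognized.
PLACEHOLDER_RE = re.compile(r"\{\{\s*conf\.(\w+)\s*\}\}")


def render(template, values):
    """Replace each placeholder; an unknown key raises KeyError so typos fail loudly."""
    return PLACEHOLDER_RE.sub(lambda match: str(values[match.group(1)]), template)


rendered = render(DOCSTRING, conf)
print(rendered)
```

Real Jinja2 templates support much more (filters, loops, conditionals); this sketch only shows the substitution that the example relies on.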

Discard cell

Some cells in a Jupyter Notebook are executed only to inspect intermediate results. In a Python 3 script, those are statements with no effect. The comment # No effect discards the whole content of a cell, so that no time is wasted running those statements. The list of discard keywords can be customized; see the Configuration section.
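
The filtering can be pictured as follows. This is a sketch under assumptions: the cell contents and the keep_cell helper are hypothetical, and the keyword is matched anywhere in the cell, whereas the exact matching rule used by MLV-tools (position in the cell, exact spelling) is not stated here.

```python
# Default keyword given in the Configuration section.
IGNORE_KEYS = ["# No effect"]

# Hypothetical notebook cells, one string per code cell.
cells = [
    "import pandas as pd\ndf = pd.read_csv(input_file)",
    "# No effect\ndf.head()",
    "df.to_csv(output_file)",
]


def keep_cell(source, ignore_keys=IGNORE_KEYS):
    """A cell is kept unless it contains one of the discard keywords."""
    return not any(key in source for key in ignore_keys)


kept = [cell for cell in cells if keep_cell(cell)]
print(len(kept), "of", len(cells), "cells kept")
```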

Contributing

We happily welcome contributions to MLV-tools. Please see our contribution guide for details.
