Skip to main content

Machine Learning libraries for Information Retrieval

Project description

ml4ir: Machine Learning Library for Information Retrieval

Setup

Requirements

  • python3.6+
  • pip3
  • docker (version 18.09+ tested)

Using PIP

ml4ir can be installed as a pip package by using the following command

pip install ml4ir

This will install ml4ir-0.0.1 (the current version) from PyPI.

Docker (Recommended)

We have set up a docker-compose.yml file for building and using docker containers to train models.

First, change working directory to the python package

cd python/

To run unit tests

docker-compose up

To invoke ml4ir with custom arguments with docker, run

/bin/bash tools/run_docker.sh ml4ir \
	python3 ml4ir/base/pipeline.py
    <args>

For ranking applications, specifically, use

/bin/bash tools/run_docker.sh ml4ir \
	python3 ml4ir/applications/ranking/pipeline.py
    <args>

Refer to usage section below for details on how to run ml4ir - ranking

Check ml4ir/applications/ranking/scripts/example_run.sh for a predefined example run.

To run example invocation of ranking application with docker,

/bin/bash ml4ir/applications/ranking/scripts/example_run.sh

Virtual Environment

Install virtualenv

pip3 install virtualenv

Create new python3 virtual environment inside your git repo (it's .gitignored, don't worry)

cd $PLACE_YOU_CAlLED_GIT_CLONE/ml4ir
python3 -m venv python/env/.ml4ir_venv3

Activate virtualenv

cd python/
source env/.ml4ir_venv3/bin/activate

Install all dependencies (carefully)

pip3 install --upgrade setuptools
pip install --upgrade pip
pip3 install -r requirements.txt

Note, there are some AWS incompatibilities, gotta fix that, but you can ignore them for now

ERROR: botocore 1.14.9 has requirement docutils<0.16,>=0.10, but you'll have docutils 0.16 which is incompatible.
ERROR: awscli 1.17.9 has requirement docutils<0.16,>=0.10, but you'll have docutils 0.16 which is incompatible.
ERROR: awscli 1.17.9 has requirement rsa<=3.5.0,>=3.1.2, but you'll have rsa 4.0 which is incompatible.
ERROR: tensorflow-probability 0.8.0 has requirement cloudpickle==1.1.1, but you'll have cloudpickle 1.2.2 which is incompatible.
ERROR: apache-beam 2.18.0 has requirement dill<0.3.2,>=0.3.1.1, but you'll have dill 0.3.0 which is incompatible.
ERROR: apache-beam 2.18.0 has requirement httplib2<=0.12.0,>=0.8, but you'll have httplib2 0.17.0 which is incompatible.
ERROR: apache-beam 2.18.0 has requirement pyarrow<0.16.0,>=0.15.1; python_version >= "3.0" or platform_system != "Windows", but you'll have pyarrow 0.14.1 which is incompatible.
ERROR: tfx-bsl 0.15.3 has requirement absl-py<0.9,>=0.7, but you'll have absl-py 0.9.0 which is incompatible.
ERROR: tfx-bsl 0.15.3 has requirement apache-beam[gcp]<2.17,>=2.16, but you'll have apache-beam 2.18.0 which is incompatible.
ERROR: tensorflow-transform 0.15.0 has requirement absl-py<0.9,>=0.7, but you'll have absl-py 0.9.0 which is incompatible.

Note that pre-commit-hooks are required, and installed as a requirement if needed. If an error results that they didn't install, execute pre-commit install to install git hooks in your .git/ directory.

Set the PYTHONPATH environment variable

export PYTHONPATH=$PYTHONPATH:`pwd`/python

Usage

The entrypoint into the training or evaluation functionality of ml4ir is through ml4ir/base/pipeline.py and for application specific overrides, look at `ml4ir/applications/<eg: ranking>/pipeline.py

ml4ir Library

To use ml4ir as a deep learning library to build relevance models, look at the following walkthroughs under notebooks/

  • Learning to Rank : The PointwiseRankingDemo notebook walks you through building, training, saving, and the entire life cycle of a RelevanceModel from the bottom up. You can also find details regarding the architecture of ml4ir in it.
  • Text Classification : The EntityPredictionDemo notebook walks you through training a model to predict entity given a user context and query.

Enter the following command to spin up Jupyter notebook on your browser to run the above notebooks

jupyter-notebook

Applications - Ranking

Examples

Using TFRecord

python ml4ir/applications/ranking/pipeline.py \
--data_dir ml4ir/applications/ranking/tests/data/tfrecord \
--feature_config ml4ir/applications/ranking/tests/data/config/feature_config.yaml \
--run_id test \
--data_format tfrecord \
--execution_mode train_inference_evaluate

Using CSV

python ml4ir/applications/ranking/pipeline.py \
--data_dir ml4ir/applications/ranking/tests/data/csv \
--feature_config ml4ir/applications/ranking/tests/data/config/feature_config.yaml \
--run_id test \
--data_format csv \
--execution_mode train_inference_evaluate

Running in inference mode using the default serving signature

python ml4ir/applications/ranking/pipeline.py \
--data_dir ml4ir/applications/ranking/tests/data/tfrecord \
--feature_config ml4ir/applications/ranking/tests/data/config/feature_config.yaml \
--run_id test \
--data_format tfrecord \
--model_file `pwd`/models/test/final/default \
--execution_mode inference_only

NOTE: Make sure to add the right data and feature config before training models.
TODO: describe how to do this

Applications - Classification

Examples

Using TFRecord

python ml4ir/applications/classification/pipeline.py \
--data_dir ml4ir/applications/classification/tests/data/tfrecord \
--feature_config ml4ir/applications/classification/tests/data/configs/feature_config.yaml \
--model_config ml4ir/applications/classification/tests/data/configs/model_config.yaml \
--batch_size 32 \
--run_id test \
--data_format tfrecord \
--execution_mode train_inference_evaluate

Using CSV

python ml4ir/applications/classification/pipeline.py \
--data_dir ml4ir/applications/classification/tests/data/tfrecord \
--feature_config ml4ir/applications/classification/tests/data/configs/feature_config.yaml \
--model_config ml4ir/applications/classification/tests/data/configs/model_config.yaml \
--batch_size 32 \
--run_id test \
--data_format csv \
--execution_mode train_inference_evaluate

Running in inference mode using the default serving signature

-python ml4ir/applications/classification/pipeline.py \
--data_dir ml4ir/applications/classification/tests/data/tfrecord \
--feature_config ml4ir/applications/classification/tests/data/configs/feature_config.yaml \
--model_config ml4ir/applications/classification/tests/data/configs/model_config.yaml \
--batch_size 32 \
--run_id test \
--data_format tfrecord \
--model_file `pwd`/models/test/final/default \
--execution_mode inference_only

NOTE: Make sure to add the right data and feature config before training models.

Running Tests

To run all the python based tests under ml4ir

python3 -m pytest

To run specific tests,

python3 -m pytest /path/to/test/module

Project Organization

The following structure is a little out of date (TODO(jake) - fix it!)

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── ml4ir                <- Source code for use in this project.
│   ├── __init__.py    <- Makes ml4ir a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org

Project based on the cookiecutter data science project template. #cookiecutterdatascience

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ml4ir-0.0.2.tar.gz (78.0 kB view details)

Uploaded Source

Built Distribution

ml4ir-0.0.2-py3-none-any.whl (111.5 kB view details)

Uploaded Python 3

File details

Details for the file ml4ir-0.0.2.tar.gz.

File metadata

  • Download URL: ml4ir-0.0.2.tar.gz
  • Upload date:
  • Size: 78.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.1.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.7

File hashes

Hashes for ml4ir-0.0.2.tar.gz
Algorithm Hash digest
SHA256 cb2c05da1bd42012d9b2cd6d8182fca9923157fa3741eafb5dfa3df64261ac86
MD5 7e3d22ecd82b993502da9ed4ba13f5b2
BLAKE2b-256 21b211a784b0cdaa89a633660fcafc965fb7ebfb389d8d2b928cd5b8fc489c9c

See more details on using hashes here.

File details

Details for the file ml4ir-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: ml4ir-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 111.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.1.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.7

File hashes

Hashes for ml4ir-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 6d386d50d50920f35d0fd752513c723667636a92aeabb98a6422a2a0523630fd
MD5 cf6ddb0a3aee5664d8ce06611012acf3
BLAKE2b-256 8a0ca061ab696706130dbe19ab790427085376d4ef10836e03a1a759b532f6a3

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page