ml4ir: Machine Learning Library for Information Retrieval
Setup
Requirements
- python3.6+
- pip3
- docker (version 18.09+ tested)
Using PIP
ml4ir can be installed as a pip package using the following command:
pip install 'git+https://git@github.com/salesforce/ml4ir#egg=ml4ir&subdirectory=python'
This will install ml4ir-0.0.1 (the current version). In the future, once this package is available on PyPI, installation will be as simple as pip install ml4ir
Docker (Recommended)
We have set up a docker-compose.yml
file for building and using docker containers to train models.
To run unit tests
docker-compose up
To invoke ml4ir with custom arguments with docker, run
/bin/bash tools/run_docker.sh ml4ir \
python3 ml4ir/base/pipeline.py
<args>
For ranking applications, specifically, use
/bin/bash tools/run_docker.sh ml4ir \
python3 ml4ir/applications/ranking/pipeline.py
<args>
Refer to the Usage section below for details on how to run ml4ir ranking.
Check ml4ir/applications/ranking/scripts/example_run.sh
for a predefined example run.
To run an example invocation of the ranking application with docker,
/bin/bash python/ml4ir/applications/ranking/scripts/example_run.sh
Virtual Environment
Install virtualenv
pip3 install virtualenv
Create a new python3 virtual environment inside your git repo (it's .gitignored, don't worry)
cd $PLACE_YOU_CALLED_GIT_CLONE/ml4ir
python3 -m venv python/env/.ml4ir_venv3
Activate virtualenv
cd python/
source env/.ml4ir_venv3/bin/activate
Install all dependencies (carefully)
pip3 install --upgrade setuptools
pip install --upgrade pip
pip3 install -r requirements.txt
Note: there are some known AWS-related dependency incompatibilities that still need to be fixed, but they can be safely ignored for now:
ERROR: botocore 1.14.9 has requirement docutils<0.16,>=0.10, but you'll have docutils 0.16 which is incompatible.
ERROR: awscli 1.17.9 has requirement docutils<0.16,>=0.10, but you'll have docutils 0.16 which is incompatible.
ERROR: awscli 1.17.9 has requirement rsa<=3.5.0,>=3.1.2, but you'll have rsa 4.0 which is incompatible.
ERROR: tensorflow-probability 0.8.0 has requirement cloudpickle==1.1.1, but you'll have cloudpickle 1.2.2 which is incompatible.
ERROR: apache-beam 2.18.0 has requirement dill<0.3.2,>=0.3.1.1, but you'll have dill 0.3.0 which is incompatible.
ERROR: apache-beam 2.18.0 has requirement httplib2<=0.12.0,>=0.8, but you'll have httplib2 0.17.0 which is incompatible.
ERROR: apache-beam 2.18.0 has requirement pyarrow<0.16.0,>=0.15.1; python_version >= "3.0" or platform_system != "Windows", but you'll have pyarrow 0.14.1 which is incompatible.
ERROR: tfx-bsl 0.15.3 has requirement absl-py<0.9,>=0.7, but you'll have absl-py 0.9.0 which is incompatible.
ERROR: tfx-bsl 0.15.3 has requirement apache-beam[gcp]<2.17,>=2.16, but you'll have apache-beam 2.18.0 which is incompatible.
ERROR: tensorflow-transform 0.15.0 has requirement absl-py<0.9,>=0.7, but you'll have absl-py 0.9.0 which is incompatible.
Note that pre-commit hooks are required and are installed as a dependency if needed.
If you see an error indicating they were not installed, run pre-commit install
to set up the git hooks in your .git/ directory.
Set the PYTHONPATH environment variable
export PYTHONPATH=$PYTHONPATH:`pwd`/python
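As a quick sanity check that the PYTHONPATH entry takes effect, the sketch below (plain Python, no ml4ir dependency; the module name probe_module is made up for illustration) writes a throwaway module to a temporary directory, adds that directory to PYTHONPATH, and imports the module from a subprocess:

```python
import os
import subprocess
import sys
import tempfile

# Create a throwaway module in a temp directory to stand in for the
# ml4ir source tree (the real export points at <repo>/python).
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "probe_module.py"), "w") as f:
        f.write("VALUE = 'importable'\n")

    # Equivalent of: export PYTHONPATH=$PYTHONPATH:<tmp>
    env = dict(os.environ)
    env["PYTHONPATH"] = env.get("PYTHONPATH", "") + os.pathsep + tmp

    # A fresh interpreter picks up PYTHONPATH and can import the module
    result = subprocess.run(
        [sys.executable, "-c", "import probe_module; print(probe_module.VALUE)"],
        env=env,
        capture_output=True,
        text=True,
    )
    print(result.stdout.strip())  # prints "importable" if PYTHONPATH is honored
```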
Usage
The entrypoint into the training or evaluation functionality of ml4ir is ml4ir/base/pipeline.py.
For application-specific overrides, look at ml4ir/applications/<eg: ranking>/pipeline.py
ml4ir Library
To use ml4ir as a deep learning library for building relevance models, see the walkthrough under notebooks/PointwiseRankingDemo.ipynb
or notebooks/PointwiseRankingDemo.html
(which contains architecture diagrams). The notebook walks through the entire life cycle of a RelevanceModel
from the bottom up: building, training, and saving it. The HTML version additionally sheds light on the design of ml4ir and the data format used.
Applications - Ranking
Examples
Using TFRecord
python ml4ir/applications/ranking/pipeline.py \
--data_dir ml4ir/applications/ranking/tests/data/tfrecord \
--feature_config ml4ir/applications/ranking/tests/data/config/feature_config.yaml \
--run_id test \
--data_format tfrecord \
--execution_mode train_inference_evaluate
Using CSV
python ml4ir/applications/ranking/pipeline.py \
--data_dir ml4ir/applications/ranking/tests/data/csv \
--feature_config ml4ir/applications/ranking/tests/data/config/feature_config.yaml \
--run_id test \
--data_format csv \
--execution_mode train_inference_evaluate
Running in inference mode using the default serving signature
python ml4ir/applications/ranking/pipeline.py \
--data_dir ml4ir/applications/ranking/tests/data/tfrecord \
--feature_config ml4ir/applications/ranking/tests/data/config/feature_config.yaml \
--run_id test \
--data_format tfrecord \
--model_file `pwd`/models/test/final/default \
--execution_mode inference_only
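Since these invocations share most flags, a small helper for composing the command line can reduce copy-paste errors. This is a convenience sketch, not part of ml4ir itself (the helper name build_ranking_command is ours); it only uses the flags shown in the examples above:

```python
from typing import Dict, List, Optional


def build_ranking_command(data_dir: str,
                          feature_config: str,
                          run_id: str,
                          data_format: str = "tfrecord",
                          execution_mode: str = "train_inference_evaluate",
                          extra_args: Optional[Dict[str, str]] = None) -> List[str]:
    """Compose an argv list for ml4ir's ranking pipeline entrypoint."""
    cmd = [
        "python", "ml4ir/applications/ranking/pipeline.py",
        "--data_dir", data_dir,
        "--feature_config", feature_config,
        "--run_id", run_id,
        "--data_format", data_format,
        "--execution_mode", execution_mode,
    ]
    # Any additional flags, e.g. --model_file for inference-only runs
    for flag, value in (extra_args or {}).items():
        cmd += [f"--{flag}", value]
    return cmd


# Inference-only variant, mirroring the example above
cmd = build_ranking_command(
    data_dir="ml4ir/applications/ranking/tests/data/tfrecord",
    feature_config="ml4ir/applications/ranking/tests/data/config/feature_config.yaml",
    run_id="test",
    execution_mode="inference_only",
    extra_args={"model_file": "models/test/final/default"},
)
# The resulting list can then be handed to subprocess.run(cmd)
```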
NOTE: Make sure to add the right data and feature config before training models.
TODO: describe how to do this
Running Tests
To run all the Python-based tests under ml4ir,
python3 -m pytest
To run specific tests,
python3 -m pytest /path/to/test/module
Project Organization
The following structure is a little out of date (TODO(jake) - fix it!)
├── LICENSE
├── Makefile <- Makefile with commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
│
├── docs <- A default Sphinx project; see sphinx-doc.org for details
│
├── models <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ the creator's initials, and a short `-` delimited description, e.g.
│ `1.0-jqp-initial-data-exploration`.
│
├── references <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
│ generated with `pip freeze > requirements.txt`
│
├── setup.py <- makes project pip installable (pip install -e .) so src can be imported
├── ml4ir <- Source code for use in this project.
│ ├── __init__.py <- Makes ml4ir a Python module
│ │
│ ├── data <- Scripts to download or generate data
│ │ └── make_dataset.py
│ │
│ ├── features <- Scripts to turn raw data into features for modeling
│ │ └── build_features.py
│ │
│ ├── models <- Scripts to train models and then use trained models to make
│ │ │ predictions
│ │ ├── predict_model.py
│ │ └── train_model.py
│ │
│ └── visualization <- Scripts to create exploratory and results oriented visualizations
│ └── visualize.py
│
└── tox.ini <- tox file with settings for running tox; see tox.testrun.org
Project based on the cookiecutter data science project template.
File details
Details for the file ml4ir-0.0.1.tar.gz.
File metadata
- Download URL: ml4ir-0.0.1.tar.gz
- Upload date:
- Size: 59.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/47.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 82efeacfa4cf3751589fcfe6a9677c59fd89bbfc498f26f39c44dc97f6bd4ad8
MD5 | 206d414c739ed3fd56095ac840858f23
BLAKE2b-256 | 9d4ae7502bf2e33c7848e1c8ad6df5924c014c3a6c6f6b3a79c1fa4b83c66ada
File details
Details for the file ml4ir-0.0.1-py3-none-any.whl.
File metadata
- Download URL: ml4ir-0.0.1-py3-none-any.whl
- Upload date:
- Size: 84.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/47.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6bf685c9b07f84ba06884179e0bf0ef739a34fcc66c97a93463af11549625730
MD5 | 80e07e7da0fd83925d52100ff37800f9
BLAKE2b-256 | e8894f3e5008e39e20120e2f15c18b5c1be86d6ef2a9ddc2f31bc7307a1e4ad0
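To verify a downloaded artifact against the digests above, Python's standard hashlib module can be used (a generic sketch; substitute the actual downloaded filename):

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 8192) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large files don't need to fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# Compare the result against the SHA256 digest listed above, e.g.:
# sha256_of("ml4ir-0.0.1.tar.gz") == "82efeacfa4cf3751589fcfe6a9677c59fd89bbfc498f26f39c44dc97f6bd4ad8"
```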