ml4ir: Machine Learning Library for Information Retrieval
Setup
Requirements
- python3.6+
- pip3
- docker (version 18.09+ tested)
Using PIP
ml4ir can be installed as a pip package using the following command:
pip install 'git+https://git@github.com/salesforce/ml4ir#egg=ml4ir&subdirectory=python'
This will install ml4ir-0.0.1 (the current version). In the future, once this package is available on PyPI, installation will be as simple as pip install ml4ir
Docker (Recommended)
We have set up a docker-compose.yml
file for building and using docker containers to train models.
To run unit tests
docker-compose up
To invoke ml4ir with custom arguments with docker, run
/bin/bash tools/run_docker.sh ml4ir \
python3 ml4ir/base/pipeline.py
<args>
For ranking applications, specifically, use
/bin/bash tools/run_docker.sh ml4ir \
python3 ml4ir/applications/ranking/pipeline.py
<args>
Refer to the Usage section below for details on how to run ml4ir ranking.
Check ml4ir/applications/ranking/scripts/example_run.sh
for a predefined example run.
To run an example invocation of the ranking application with docker,
/bin/bash python/ml4ir/applications/ranking/scripts/example_run.sh
Virtual Environment
Install virtualenv
pip3 install virtualenv
Create a new python3 virtual environment inside your git repo (it's .gitignored, don't worry)
cd $PLACE_YOU_CALLED_GIT_CLONE/ml4ir
python3 -m venv python/env/.ml4ir_venv3
Activate virtualenv
cd python/
source env/.ml4ir_venv3/bin/activate
Install all dependencies (carefully)
pip3 install --upgrade setuptools
pip install --upgrade pip
pip3 install -r requirements.txt
Note: there are some known AWS-related dependency incompatibilities that still need to be fixed, but they can be safely ignored for now:
ERROR: botocore 1.14.9 has requirement docutils<0.16,>=0.10, but you'll have docutils 0.16 which is incompatible.
ERROR: awscli 1.17.9 has requirement docutils<0.16,>=0.10, but you'll have docutils 0.16 which is incompatible.
ERROR: awscli 1.17.9 has requirement rsa<=3.5.0,>=3.1.2, but you'll have rsa 4.0 which is incompatible.
ERROR: tensorflow-probability 0.8.0 has requirement cloudpickle==1.1.1, but you'll have cloudpickle 1.2.2 which is incompatible.
ERROR: apache-beam 2.18.0 has requirement dill<0.3.2,>=0.3.1.1, but you'll have dill 0.3.0 which is incompatible.
ERROR: apache-beam 2.18.0 has requirement httplib2<=0.12.0,>=0.8, but you'll have httplib2 0.17.0 which is incompatible.
ERROR: apache-beam 2.18.0 has requirement pyarrow<0.16.0,>=0.15.1; python_version >= "3.0" or platform_system != "Windows", but you'll have pyarrow 0.14.1 which is incompatible.
ERROR: tfx-bsl 0.15.3 has requirement absl-py<0.9,>=0.7, but you'll have absl-py 0.9.0 which is incompatible.
ERROR: tfx-bsl 0.15.3 has requirement apache-beam[gcp]<2.17,>=2.16, but you'll have apache-beam 2.18.0 which is incompatible.
ERROR: tensorflow-transform 0.15.0 has requirement absl-py<0.9,>=0.7, but you'll have absl-py 0.9.0 which is incompatible.
Note that pre-commit hooks are required and are installed as a dependency if needed.
If you see an error indicating they were not installed, run pre-commit install
to set up the git hooks in your .git/ directory.
Set the PYTHONPATH environment variable
export PYTHONPATH=$PYTHONPATH:`pwd`/python
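As a quick sanity check that the PYTHONPATH entry takes effect, the sketch below (plain Python, no ml4ir dependency; the module name probe_module is made up for illustration) writes a throwaway module to a temporary directory, adds that directory to PYTHONPATH, and imports the module from a subprocess:

```python
import os
import subprocess
import sys
import tempfile

# Create a throwaway module in a temp directory to stand in for the
# ml4ir source tree (the real export points at <repo>/python).
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "probe_module.py"), "w") as f:
        f.write("VALUE = 'importable'\n")

    # Equivalent of: export PYTHONPATH=$PYTHONPATH:<tmp>
    env = dict(os.environ)
    env["PYTHONPATH"] = env.get("PYTHONPATH", "") + os.pathsep + tmp

    # A fresh interpreter picks up PYTHONPATH and can import the module
    result = subprocess.run(
        [sys.executable, "-c", "import probe_module; print(probe_module.VALUE)"],
        env=env,
        capture_output=True,
        text=True,
    )
    print(result.stdout.strip())  # prints "importable" if PYTHONPATH is honored
```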
Usage
The entrypoint into the training or evaluation functionality of ml4ir is ml4ir/base/pipeline.py.
For application-specific overrides, look at ml4ir/applications/<eg: ranking>/pipeline.py
ml4ir Library
To use ml4ir as a deep learning library for building relevance models, see the walkthrough under notebooks/PointwiseRankingDemo.ipynb
or notebooks/PointwiseRankingDemo.html
(which contains architecture diagrams). The notebook walks through the entire life cycle of a RelevanceModel
from the bottom up: building, training, and saving it. The HTML version additionally sheds light on the design of ml4ir and the data format used.
Applications - Ranking
Examples
Using TFRecord
python ml4ir/applications/ranking/pipeline.py \
--data_dir ml4ir/applications/ranking/tests/data/tfrecord \
--feature_config ml4ir/applications/ranking/tests/data/config/feature_config.yaml \
--run_id test \
--data_format tfrecord \
--execution_mode train_inference_evaluate
Using CSV
python ml4ir/applications/ranking/pipeline.py \
--data_dir ml4ir/applications/ranking/tests/data/csv \
--feature_config ml4ir/applications/ranking/tests/data/config/feature_config.yaml \
--run_id test \
--data_format csv \
--execution_mode train_inference_evaluate
Running in inference mode using the default serving signature
python ml4ir/applications/ranking/pipeline.py \
--data_dir ml4ir/applications/ranking/tests/data/tfrecord \
--feature_config ml4ir/applications/ranking/tests/data/config/feature_config.yaml \
--run_id test \
--data_format tfrecord \
--model_file `pwd`/models/test/final/default \
--execution_mode inference_only
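Since these invocations share most flags, a small helper for composing the command line can reduce copy-paste errors. This is a convenience sketch, not part of ml4ir itself (the helper name build_ranking_command is ours); it only uses the flags shown in the examples above:

```python
from typing import Dict, List, Optional


def build_ranking_command(data_dir: str,
                          feature_config: str,
                          run_id: str,
                          data_format: str = "tfrecord",
                          execution_mode: str = "train_inference_evaluate",
                          extra_args: Optional[Dict[str, str]] = None) -> List[str]:
    """Compose an argv list for ml4ir's ranking pipeline entrypoint."""
    cmd = [
        "python", "ml4ir/applications/ranking/pipeline.py",
        "--data_dir", data_dir,
        "--feature_config", feature_config,
        "--run_id", run_id,
        "--data_format", data_format,
        "--execution_mode", execution_mode,
    ]
    # Any additional flags, e.g. --model_file for inference-only runs
    for flag, value in (extra_args or {}).items():
        cmd += [f"--{flag}", value]
    return cmd


# Inference-only variant, mirroring the example above
cmd = build_ranking_command(
    data_dir="ml4ir/applications/ranking/tests/data/tfrecord",
    feature_config="ml4ir/applications/ranking/tests/data/config/feature_config.yaml",
    run_id="test",
    execution_mode="inference_only",
    extra_args={"model_file": "models/test/final/default"},
)
# The resulting list can then be handed to subprocess.run(cmd)
```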
NOTE: Make sure to add the right data and feature config before training models.
TODO: describe how to do this
Running Tests
To run all the Python-based tests under ml4ir,
python3 -m pytest
To run specific tests,
python3 -m pytest /path/to/test/module
Project Organization
The following structure is a little out of date (TODO(jake) - fix it!)
├── LICENSE
├── Makefile <- Makefile with commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
│
├── docs <- A default Sphinx project; see sphinx-doc.org for details
│
├── models <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ the creator's initials, and a short `-` delimited description, e.g.
│ `1.0-jqp-initial-data-exploration`.
│
├── references <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
│ generated with `pip freeze > requirements.txt`
│
├── setup.py <- makes project pip installable (pip install -e .) so src can be imported
├── ml4ir <- Source code for use in this project.
│ ├── __init__.py <- Makes ml4ir a Python module
│ │
│ ├── data <- Scripts to download or generate data
│ │ └── make_dataset.py
│ │
│ ├── features <- Scripts to turn raw data into features for modeling
│ │ └── build_features.py
│ │
│ ├── models <- Scripts to train models and then use trained models to make
│ │ │ predictions
│ │ ├── predict_model.py
│ │ └── train_model.py
│ │
│ └── visualization <- Scripts to create exploratory and results oriented visualizations
│ └── visualize.py
│
└── tox.ini <- tox file with settings for running tox; see tox.testrun.org
Project based on the cookiecutter data science project template.
File details
Details for the file ml4ir-0.0.1.tar.gz.
File metadata
- Download URL: ml4ir-0.0.1.tar.gz
- Upload date:
- Size: 59.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/47.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 82efeacfa4cf3751589fcfe6a9677c59fd89bbfc498f26f39c44dc97f6bd4ad8
MD5 | 206d414c739ed3fd56095ac840858f23
BLAKE2b-256 | 9d4ae7502bf2e33c7848e1c8ad6df5924c014c3a6c6f6b3a79c1fa4b83c66ada
File details
Details for the file ml4ir-0.0.1-py3-none-any.whl.
File metadata
- Download URL: ml4ir-0.0.1-py3-none-any.whl
- Upload date:
- Size: 84.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/47.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6bf685c9b07f84ba06884179e0bf0ef739a34fcc66c97a93463af11549625730
MD5 | 80e07e7da0fd83925d52100ff37800f9
BLAKE2b-256 | e8894f3e5008e39e20120e2f15c18b5c1be86d6ef2a9ddc2f31bc7307a1e4ad0
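To verify a downloaded artifact against the digests above, Python's standard hashlib module can be used (a generic sketch; substitute the actual downloaded filename):

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 8192) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large files don't need to fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# Compare the result against the SHA256 digest listed above, e.g.:
# sha256_of("ml4ir-0.0.1.tar.gz") == "82efeacfa4cf3751589fcfe6a9677c59fd89bbfc498f26f39c44dc97f6bd4ad8"
```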