Sotastream is a command-line tool that augments a batch of text and produces an infinite stream of records.
Sotastream
Introduction
Sotastream is a data augmentation tool for training pipelines. It uses infinibatch internally to generate an infinite stream of shuffled training data and provides a means for on-the-fly data manipulation, augmentation, mixing, and sampling.
Cloning and initialization
To begin, clone the repository:
git clone https://github.com/marian-nmt/sotastream
You can then install it as follows.
cd sotastream
python -m pip install .
python -m pip install --no-deps . # install without dependencies
If you already have your own versions of the requirements installed, add the --no-deps (or --no-dependencies) flag to skip installing dependencies.
Entry points
- As a module:
python -m sotastream
- As a bin in your $PATH:
sotastream
- Via path to script:
python path/to/cli.py
For convenience, cli.py is in the root of the repository.
Development
Install development tools
python -m pip install -e .[dev,test] # editable mode
Editable mode (-e / --editable) is recommended for development: pip links the installed package to your source code, so any edits you make are reflected directly in the installed package. The [dev,test] extras install the dependencies for development and testing, which include black, pytest, etc.
We use black to reformat code to a common code style.
make reformat
Before creating any pull requests, run
make check # runs reformatter and tests
Running tests
make test # run unit tests
make regression # run regression tests
See the Makefile for more details.
Usage examples
A folder like split/parallel contains training data in TSV format (src<tab>tgt), split into *.gz files of around 100,000 lines each for better shuffling. The command below outputs an infinite stream of data generated from the gzipped files in these folders, according to the "wmt" recipe found in sotastream/pipelines/example_pipeline.py.
python -m sotastream example split/parallel split/backtrans
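To make the idea of an infinite shuffled stream over gzipped shards concrete, here is a minimal standalone sketch. It is not sotastream's or infinibatch's implementation; the function name infinite_shuffled_records, the seed parameter, and the two-level shuffling strategy (shuffle shard order, then shuffle lines within each shard) are assumptions made for illustration.

```python
import gzip
import random
from pathlib import Path
from typing import Iterator, List


def infinite_shuffled_records(folder: str, seed: int = 0) -> Iterator[List[str]]:
    """Yield tab-split records from the *.gz shards in `folder`, forever.

    Hypothetical helper: shuffles the shard order on every pass and the
    lines inside each shard, so small shards give better mixing.
    """
    rng = random.Random(seed)
    shards = sorted(Path(folder).glob("*.gz"))
    if not shards:
        raise FileNotFoundError(f"no *.gz shards found in {folder}")
    while True:  # each iteration is one full pass over all shards
        rng.shuffle(shards)
        for shard in shards:
            with gzip.open(shard, "rt", encoding="utf-8") as fh:
                # Only one shard is held in memory at a time.
                lines = [line.rstrip("\n") for line in fh if line.strip()]
            rng.shuffle(lines)
            for line in lines:
                yield line.split("\t")  # e.g. [src, tgt]
```

Because the generator never terminates, a consumer has to bound it explicitly, for example with itertools.islice(infinite_shuffled_records("split/parallel"), 10).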
You can also provide compressed TSV files directly, in which case sotastream will split them into checksummed folders under /tmp/sotastream/{checksum}:
python -m sotastream example parallel.tsv.gz backtrans.tsv.gz
(The garbage file is assumed to have just a single column of data, which is copied).
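The splitting step described above can be sketched roughly as follows. This is an illustration only, not sotastream's code: the helper names file_checksum and split_to_shards, the choice of SHA-256, and the part.NNNNN.gz shard naming are all assumptions; only the /tmp/sotastream/{checksum} layout and the shard size of about 100,000 lines come from the text.

```python
import gzip
import hashlib
from pathlib import Path


def file_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large inputs are not loaded at once.

    SHA-256 is an assumption; the real tool may use a different hash.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def split_to_shards(tsv_gz: str, base_dir: str = "/tmp/sotastream",
                    lines_per_shard: int = 100_000) -> Path:
    """Split a gzipped TSV into gzipped shards under base_dir/<checksum>/."""
    out_dir = Path(base_dir) / file_checksum(tsv_gz)
    if out_dir.is_dir() and any(out_dir.glob("*.gz")):
        return out_dir  # same content was split before; reuse the shards
    out_dir.mkdir(parents=True, exist_ok=True)
    shard = None
    with gzip.open(tsv_gz, "rt", encoding="utf-8") as src:
        for i, line in enumerate(src):
            if i % lines_per_shard == 0:  # time to start a new shard
                if shard is not None:
                    shard.close()
                name = f"part.{i // lines_per_shard:05d}.gz"  # hypothetical naming
                shard = gzip.open(out_dir / name, "wt", encoding="utf-8")
            shard.write(line)
    if shard is not None:
        shard.close()
    return out_dir
```

Keying the output folder on a content checksum means that passing the same file twice reuses the existing shards, while a changed file gets a fresh folder.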
There are currently two main pipelines: "default" and "wmt". These vary according to the data sources they take as well as the other options available to them.
There are global options that control behavioral aspects such as splitting and parallelization, and also pipeline-specific arguments. You can see these by running:
# see global options
python -m sotastream -h
# see default pipeline options
python -m sotastream default -h
# see wmt pipeline options
python -m sotastream wmt -h
Don't cross the streams!
Sotastream workflows build a directed acyclic graph (DAG) consisting of cascades of generators that pass through mutable lines from the graph inputs to the pipeline output. Since each step provides transformations and manipulations of each input line, the only requirement is that modifications along separate branches must not be merged into a single node in the graph, or at least, that great care should be taken when doing so. An example is the Mixer, which does not actually merge modifications from alternate branches, but instead selects across multiple incoming branches using a provided probability distribution.
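The selection behavior attributed to the Mixer can be illustrated with a small generator. This is a sketch under assumptions, not sotastream's Mixer class: the function name mix, its signature, and the seed parameter are hypothetical. The point it shows is that each output record comes from exactly one branch, chosen by the given probabilities, so modifications from different branches are never combined.

```python
import random
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def mix(streams: Sequence[Iterator[T]], probs: Sequence[float],
        seed: int = 0) -> Iterator[T]:
    """Yield items by picking one incoming stream per step.

    Hypothetical stand-in for a mixer node: it selects across branches
    and never merges records coming from different branches.
    """
    if len(streams) != len(probs):
        raise ValueError("need exactly one probability per stream")
    rng = random.Random(seed)
    indices = range(len(streams))
    while True:
        # Draw a branch index according to the probability distribution.
        i = rng.choices(indices, weights=probs, k=1)[0]
        try:
            yield next(streams[i])
        except StopIteration:
            return  # a finite branch ran dry; stop the mixed stream
```

With two infinite branches and probs=[0.8, 0.2], roughly four out of five output records come from the first branch, and the relative order of records within each branch is preserved.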
Custom/private pipelines from own (private) directory
You can create a custom pipeline by adding a file in the current (invocation) directory with a file name matching the pattern "*_pipeline.py". This should follow the interface defined in sotastream/pipelines, namely:
- Call @pipeline("name") to give your pipeline a name. This name must not conflict with existing names.
- Inherit from the Pipeline base class from sotastream.pipeline. For document pipelines, use DocumentPipeline as the base class.
You can find some examples in test/dummy_pipeline.py, as well as the real examples in sotastream/pipelines.
Authors
Sotastream is developed by TextMT Team @ Microsoft Translator.
- Roman Grundkiewicz
- Thamme Gowda
- Rohit Jain
- Huda Khayrallah
- Matt Post
- Marcin Junczys-Dowmunt
We are finishing up a paper that describes sotastream in detail; it will be linked here.