MTData
mtdata is a tool to download datasets for machine translation.
The MTData tool automates dataset collection and preparation for machine translation research. It provides a CLI and Python APIs, so it can be used as a standalone tool or embedded into Python apps for preparing MT experiments.
This tool knows:
- Where to download datasets from: WMT test and dev sets for 2014-2020, Paracrawl, Europarl, News Commentary, WikiTitles, the Tilde Model corpus, ...
- How to extract archives: .tar, .tar.gz, .tgz, .zip, ...
- How to parse XML formats such as .tmx and .sgm, as well as .tsv files, and to check that both sides have the same number of segments.
- Whether parallel data comes as a single .tsv file or as two .sgm files.
- Whether the data is compressed with gz, xz, or not at all.
- Whether the languages are in source-target order or swapped to target-source order.
- How to map language codes to ISO 639-3 codes, which have room for the 7000+ languages of our planet.
- (And more such tiny details, accumulated over time.)
MTData is here to:
- Automate MT training-data creation by taking the human out of the loop, inspired by SacreBLEU, which did the same for the evaluation stage.
- Be a reusable tool instead of dozens of use-once shell scripts spread across multiple repos.
Limitations (as of now):
- Only publicly available datasets that do not require a login are supported. No LDC yet.
- No tokenizers are integrated. (It should be fairly easy to integrate them.)
Installation
# from the source code on github
git clone https://github.com/thammegowda/mtdata
cd mtdata
pip install --editable .
# from pypi
pip install mtdata
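A quick sanity check from Python after installation (a minimal sketch; it assumes the package exposes a __version__ attribute, the same version string that is written into mtdata.signature.txt):
import mtdata
# print the installed version; __version__ is assumed to be defined by the package
print(mtdata.__version__)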
Current Status:
The following is a summary of datasets from various sources (updated: May 10, 2020). The list is incomplete and meant as a starting point. I (TG) have picked some commonly used datasets that I use in my own work; you are welcome to add more.
Source | # of datasets
---|---
Statmt | 355
Paracrawl | 30
Tilde | 519
OPUS [1] | 53,351
JW300 [2] | 44,663
GlobalVoices 2018Q4 | 812
Joshua Indian Corpus | 29
UnitedNations [3] | 30
WikiMatrix | 1,617
Other | 7
Total | 101,383
- [1] OPUS contains duplicate entries from other listed sources, though they are often older releases of the corpora.
- [2] JW300 is also retrieved from OPUS, but is handled differently due to its scale and internal format.
- [3] Only test sets are included.
CLI Usage
- After pip installation, the CLI can be invoked as mtdata or python -m mtdata.
- There are two subcommands: list for listing the datasets, and get for downloading them. Both can also be driven from Python, as sketched below.
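Since the CLI is the primary interface, one simple way to embed mtdata into a Python pipeline is to invoke it as a subprocess; a minimal sketch using only flags documented in the get section below (the dataset names are just examples):
import subprocess

# Fetch a small train/test setup into ./deu-eng, exactly as the shell command would
subprocess.run(
    ['mtdata', 'get', '-l', 'deu-eng',
     '-tr', 'news_commentary_v14',
     '-tt', 'newstest2019_deen',
     '-o', 'deu-eng'],
    check=True)  # raise if the download/preparation fails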
mtdata list
Lists datasets that are known to this tool.
mtdata list -h
usage: mtdata list [-h] [-l L1-L2] [-n [NAME [NAME ...]]]
[-nn [NAME [NAME ...]]] [-f]
optional arguments:
-h, --help show this help message and exit
-l L1-L2, --langs L1-L2
Language pairs; e.g.: deu-eng (default: None)
-n [NAME [NAME ...]], --names [NAME [NAME ...]]
Name of dataset set; eg europarl_v9. (default: None)
-nn [NAME [NAME ...]], --not-names [NAME [NAME ...]]
Exclude these names (default: None)
-f, --full Show Full Citation (default: False)
# List everything
mtdata list
# List a lang pair
mtdata list -l deu-eng
# List a dataset by name(s)
mtdata list -n europarl_v9
mtdata list -n europarl_v9 news_commentary_v14
# list by both language pair and dataset name
mtdata list -l deu-eng -n europarl_v9 news_commentary_v14 newstest201{4,5,6,7,8,9}_deen
# get citation of a dataset (if available in index.py)
mtdata list -l deu-eng -n newstest2019_deen --full
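The index can also be scanned from Python. The accessor below (get_entries) is an assumption about the internal API; this README only documents add_entry, so treat this as an illustrative sketch:
from mtdata.index import INDEX as index

# Iterate over all known entries and keep the deu-eng ones.
# get_entries() is an assumed accessor; entry.langs is expected to hold ISO 639-3 codes.
for entry in index.get_entries():
    if set(entry.langs) == {'deu', 'eng'}:
        print(entry.name, entry.url)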
mtdata get
This command downloads the datasets specified by name for the given language pair into a directory. You will have to make a definite choice for the --train and --test arguments.
mtdata get -h
usage: mtdata get [-h] -l L1-L2 [-tr [NAME [NAME ...]]]
[-tt [NAME [NAME ...]]] -o OUT
optional arguments:
-h, --help show this help message and exit
-l L1-L2, --langs L1-L2
Language pairs; e.g.: deu-eng (default: None)
-tr [NAME [NAME ...]], --train [NAME [NAME ...]]
Names of datasets separated by space, to be used for *training*.
e.g. -tr news_commentary_v14 europarl_v9 .
All these datasets gets concatenated into one big file.
(default: None)
-tt [NAME [NAME ...]], --test [NAME [NAME ...]]
Names of datasets separated by space, to be used for *testing*.
e.g. "-tt newstest2018_deen newstest2019_deen".
You may also use shell expansion if your shell supports it.
e.g. "-tt newstest201{8,9}_deen." (default: None)
-o OUT, --out OUT Output directory name (default: None)
Example
See what datasets are available for deu-eng
$ mtdata list -l deu-eng # see available datasets
europarl_v9 deu-eng http://www.statmt.org/europarl/v9/training/europarl-v9.deu-eng.tsv.gz
news_commentary_v14 deu-eng http://data.statmt.org/news-commentary/v14/training/news-commentary-v14.deu-eng.tsv.gz
wiki_titles_v1 deu-eng http://data.statmt.org/wikititles/v1/wikititles-v1.deu-eng.tsv.gz
wiki_titles_v2 deu-eng http://data.statmt.org/wikititles/v2/wikititles-v2.deu-eng.tsv.gz
newstest2014_deen deu-eng http://data.statmt.org/wmt20/translation-task/dev.tgz dev/newstest2014-deen-src.de.sgm,dev/newstest2014-deen-ref.en.sgm
newstest2015_ende deu-eng http://data.statmt.org/wmt20/translation-task/dev.tgz dev/newstest2015-ende-src.en.sgm,dev/newstest2015-ende-ref.de.sgm
newstest2015_deen deu-eng http://data.statmt.org/wmt20/translation-task/dev.tgz dev/newstest2015-deen-src.de.sgm,dev/newstest2015-deen-ref.en.sgm
...[truncated]
Get these datasets and store them under the directory deu-eng:
$ mtdata get --langs deu-eng --train europarl_v10 wmt13_commoncrawl news_commentary_v14 --test newstest201{4,5,6,7,8,9}_deen --out deu-eng
# ...[truncated]
INFO:root:Train stats:
{
"total": 4565929,
"parts": {
"wmt13_commoncrawl": 2399123,
"news_commentary_v14": 338285,
"europarl_v10": 1828521
}
}
INFO:root:Dataset is ready at deu-eng
To reproduce this dataset in the future, or for others to reproduce it, please refer to <out-dir>/mtdata.signature.txt:
$ cat deu-eng/mtdata.signature.txt
mtdata get -l deu-eng -tr europarl_v10 wmt13_commoncrawl news_commentary_v14 -ts newstest2014_deen newstest2015_deen newstest2016_deen newstest2017_deen newstest2018_deen newstest2019_deen -o <out-dir>
mtdata version 0.1.1
See what the above command has accomplished:
$ find deu-eng -type f | sort | xargs wc -l
3003 deu-eng/tests/newstest2014_deen.deu
3003 deu-eng/tests/newstest2014_deen.eng
2169 deu-eng/tests/newstest2015_deen.deu
2169 deu-eng/tests/newstest2015_deen.eng
2999 deu-eng/tests/newstest2016_deen.deu
2999 deu-eng/tests/newstest2016_deen.eng
3004 deu-eng/tests/newstest2017_deen.deu
3004 deu-eng/tests/newstest2017_deen.eng
2998 deu-eng/tests/newstest2018_deen.deu
2998 deu-eng/tests/newstest2018_deen.eng
2000 deu-eng/tests/newstest2019_deen.deu
2000 deu-eng/tests/newstest2019_deen.eng
1828521 deu-eng/train-parts/europarl_v10.deu
1828521 deu-eng/train-parts/europarl_v10.eng
338285 deu-eng/train-parts/news_commentary_v14.deu
338285 deu-eng/train-parts/news_commentary_v14.eng
2399123 deu-eng/train-parts/wmt13_commoncrawl.deu
2399123 deu-eng/train-parts/wmt13_commoncrawl.eng
4565929 deu-eng/train.deu
4565929 deu-eng/train.eng
ISO 639-3
Internally, all language codes are mapped to ISO 639-3 codes.
The mapping can be inspected with python -m mtdata.iso:
$ python -m mtdata.iso -h
usage: python -m mtdata.iso [-h] [langs [langs ...]]
ISO 639-3 lookup tool
positional arguments:
langs Language code or name that needs to be looked up. When no
language code is given, all languages are listed.
optional arguments:
-h, --help show this help message and exit
# list all 7000+ languages and their 3 letter codes
$ python -m mtdata.iso
...
# lookup codes for some languages
$ python -m mtdata.iso ka kn en de xx english german
Input ISO639_3 Name
ka kat Georgian
kn kan Kannada
en eng English
de deu German
xx -none- -none-
english eng English
german deu German
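The lookup is also usable from Python. The helper name below (iso3_code) is an assumption about the mtdata.iso module shown above; a minimal sketch:
from mtdata.iso import iso3_code  # assumed helper; verify against your installed version

# Map two-letter codes or English names to ISO 639-3 codes, like the CLI above
for lang in ['ka', 'kn', 'en', 'german']:
    print(lang, iso3_code(lang))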
How to extend, modify, or contribute:
Please help grow the dataset collection by adding missing and new datasets to the index module.
Here is an example listing the europarl-v9 corpus.
Note: language codes such as de and en will be mapped to the three-letter ISO codes deu and eng internally.
from mtdata.index import INDEX as index, Entry

EUROPARL_v9 = 'http://www.statmt.org/europarl/v9/training/europarl-v9.%s-%s.tsv.gz'
for pair in ['de en', 'cs en', 'cs pl', 'es pt', 'fi en', 'lt en']:
    l1, l2 = pair.split()
    index.add_entry(Entry(langs=(l1, l2), name='europarl_v9', url=EUROPARL_v9 % (l1, l2)))
If a dataset is inside an archive such as zip or tar:
from mtdata.index import INDEX as index, Entry

wmt_sets = {
    'newstest2014': [('de', 'en'), ('cs', 'en'), ('fr', 'en'), ('ru', 'en'), ('hi', 'en')],
    'newsdev2015': [('fi', 'en'), ('en', 'fi')]
}
for set_name, pairs in wmt_sets.items():
    for l1, l2 in pairs:
        src = f'dev/{set_name}-{l1}{l2}-src.{l1}.sgm'
        ref = f'dev/{set_name}-{l1}{l2}-ref.{l2}.sgm'
        name = f'{set_name}_{l1}{l2}'
        index.add_entry(Entry((l1, l2), name=name, filename='wmt20dev.tgz', in_paths=[src, ref],
                              url='http://data.statmt.org/wmt20/translation-task/dev.tgz'))
# filename='wmt20dev.tgz' is set manually, because the URL ends in dev.tgz, which can be ambiguous
# in_paths=[src, ref] lists the two .sgm files inside the tarball
# in_ext='sgm' will be auto-detected from the paths; set in_ext='txt' to explicitly mark plain text
Refer to paracrawl, tilde, or statmt for examples.
If a citation is available for a dataset, please include it:
cite = r"""<bibtex here>"""
Entry(..., cite=cite)
To add a custom parser or file handler for a new archive/file type that is not already supported, look into parser.read_segs() and cache.
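Conceptually, a parser only has to turn a raw file into aligned segment pairs. The sketch below is a standalone illustration of that idea for a two-column .tsv file, not mtdata's internal API:
from typing import Iterator, Tuple

def read_tsv_segs(path: str) -> Iterator[Tuple[str, str]]:
    # Yield (source, target) segment pairs from a two-column TSV file;
    # mtdata's real reader lives in parser.read_segs()
    with open(path, encoding='utf-8') as lines:
        for line in lines:
            cols = line.rstrip('\n').split('\t')
            if len(cols) == 2:  # skip malformed rows
                yield cols[0], cols[1]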
Developers:
- Thamme Gowda (TG)