mtdata is a tool to download datasets for machine translation
MTData
MTData reduces the burden of preparing datasets for machine translation. It provides command-line and Python APIs, so it can be used as a standalone tool, called from shell scripts, or embedded in a Python application for preparing MT experiments.
With this, you DON'T have to:
- Know where the URLs are for data sets: WMT tests and devs for [2014, 2015, ... 2020], Paracrawl, Europarl, News Commentary, WikiTitles ...
- Know how to extract files : .tar, .tar.gz, .tgz, .zip, .gz, ...
- Know how to parse .tmx, .sgm, .tsv
- Know if parallel data is in one .tsv file or two sgm files
- (And more over time. Create an issue to discuss more such "you don't have to" topics)
because MTData does all of the above under the hood.
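For instance, even the simplest of these formats, a two-column .tsv with one sentence pair per line, needs a hand-rolled reader. A minimal sketch (with made-up example sentences) of the kind of boilerplate MTData hides:

```python
import csv
import io

# A tiny stand-in for a parallel .tsv file: source and target sentence
# separated by a tab, one pair per line (the layout used by e.g. europarl-v9).
raw = "Guten Morgen\tGood morning\nWie geht es dir?\tHow are you?\n"

# Without mtdata, you would write a reader like this for every such corpus:
pairs = [(row[0], row[1]) for row in csv.reader(io.StringIO(raw), delimiter='\t')]

print(pairs[0])  # ('Guten Morgen', 'Good morning')
```

And that is before dealing with downloads, archives, and the .tmx/.sgm variants of the same data.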
Installation
# coming soon to pypi
# pip install mtdata
git clone https://github.com/thammegowda/mtdata
cd mtdata
pip install . # add "--editable" flag for development mode
CLI Usage
- After pip installation, the CLI can be invoked with the mtdata command or python -m mtdata
- There are two sub-commands: list for listing the datasets, and get for downloading them
mtdata list
mtdata list -h
usage: mtdata list [-h] [-l LANGS] [-n [NAMES [NAMES ...]]]
optional arguments:
-h, --help show this help message and exit
-l LANGS, --langs LANGS
Language pairs; e.g.: de-en
-n [NAMES [NAMES ...]], --names [NAMES [NAMES ...]]
Name of dataset set; eg europarl_v9.
# List everything
mtdata list
# List a lang pair
mtdata list -l de-en
# List a dataset by name(s)
mtdata list -n europarl_v9
mtdata list -n europarl_v9 news_commentary_v14
# list by both language pair and dataset name
mtdata list -l de-en -n europarl_v9 news_commentary_v14 newstest201{4,5,6,7,8,9}_deen
mtdata get
mtdata get -h
usage: mtdata get [-h] -l LANGS [-n [NAMES [NAMES ...]]] -o OUT
optional arguments:
-h, --help show this help message and exit
-l LANGS, --langs LANGS
Language pairs; e.g.: de-en
-n [NAMES [NAMES ...]], --names [NAMES [NAMES ...]]
Name of dataset set; eg europarl_v9.
-o OUT, --out OUT Output directory name
Here is an example showing collection and preparation of DE-EN datasets.
mtdata get -l de-en -n europarl_v9 news_commentary_v14 newstest201{4,5,6,7,8,9}_deen -o de-en
How to extend:
Please help grow the collection by adding missing and new datasets to the index.py module.
Here is an example listing europarl-v9 corpus.
from mtdata.index import entries, Entry

EUROPARL_v9 = 'http://www.statmt.org/europarl/v9/training/europarl-v9.%s-%s.tsv.gz'
for pair in ['de en', 'cs en', 'cs pl', 'es pt', 'fi en', 'lt en']:
    l1, l2 = pair.split()
    entries.append(Entry(langs=(l1, l2), name='europarl_v9', url=EUROPARL_v9 % (l1, l2)))
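The `%s-%s` placeholders in the URL template are filled with the language pair via printf-style formatting; a standalone check of the expansion (no mtdata import needed):

```python
EUROPARL_v9 = 'http://www.statmt.org/europarl/v9/training/europarl-v9.%s-%s.tsv.gz'

# Each 'l1 l2' pair fills the two %s slots in order:
url = EUROPARL_v9 % ('de', 'en')
print(url)  # http://www.statmt.org/europarl/v9/training/europarl-v9.de-en.tsv.gz
```

So one template plus a list of pairs yields one Entry per corpus file.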
If a dataset is inside an archive such as .zip or .tar:
from mtdata.index import entries, Entry

wmt_sets = {
    'newstest2014': [('de', 'en'), ('cs', 'en'), ('fr', 'en'), ('ru', 'en'), ('hi', 'en')],
    'newsdev2015': [('fi', 'en'), ('en', 'fi')]
}
for set_name, pairs in wmt_sets.items():
    for l1, l2 in pairs:
        src = f'dev/{set_name}-{l1}{l2}-src.{l1}.sgm'
        ref = f'dev/{set_name}-{l1}{l2}-ref.{l2}.sgm'
        name = f'{set_name}_{l1}{l2}'
        entries.append(Entry((l1, l2), name=name, filename='wmt20dev.tgz', in_paths=[src, ref],
                             url='http://data.statmt.org/wmt20/translation-task/dev.tgz'))
# filename='wmt20dev.tgz' is set manually, because the URL's generic dev.tgz name can be confusing
# in_paths=[src, ref] lists the two .sgm files inside the tarball
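The in-archive paths are plain f-strings over the set name and language codes; a standalone check of the paths they produce (again without importing mtdata):

```python
set_name, l1, l2 = 'newstest2014', 'de', 'en'

# Same construction as in the loop above:
src = f'dev/{set_name}-{l1}{l2}-src.{l1}.sgm'
ref = f'dev/{set_name}-{l1}{l2}-ref.{l2}.sgm'
name = f'{set_name}_{l1}{l2}'

print(src)   # dev/newstest2014-deen-src.de.sgm
print(ref)   # dev/newstest2014-deen-ref.en.sgm
print(name)  # newstest2014_deen
```

These must match the member names inside the tarball exactly, or extraction will fail.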
Developers:
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file mtdata-0.1.tar.gz.
File metadata
- Download URL: mtdata-0.1.tar.gz
- Upload date:
- Size: 11.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.0.0.post20200309 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4530d954305ffe24e8ede4409421da4dfa1f006fe1b2980777d0826a456e2ee2
MD5 | 9f93911d75927b507507a38c68e1d9fd
BLAKE2b-256 | 8d5b6f3bed524faaa9992fd7820eabfcfb0a38df83e4db527308d706d9218786
File details
Details for the file mtdata-0.1-py3-none-any.whl.
File metadata
- Download URL: mtdata-0.1-py3-none-any.whl
- Upload date:
- Size: 17.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.0.0.post20200309 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0e0eef2f2627eedd29e7ca38010670f0a51f9355de46720fab0931498957ce09
MD5 | 87fba2db26994a93890bdc6e62e9e115
BLAKE2b-256 | 69b3fa0338849183beb610e0b9fda98f7c0974a4bb01bd9684b2fd858322d3c3