MTData

mtdata is a tool to download datasets for machine translation

MTData reduces the burden of preparing datasets for machine translation. It provides a command-line interface and a Python API, so it can be used as a standalone tool, called from shell scripts, or embedded in a Python application when preparing MT experiments.

With this, you DON'T have to:

  • Know where the URLs are for datasets: WMT test and dev sets for [2014, 2015, ..., 2020], Paracrawl, Europarl, News Commentary, WikiTitles, ...
  • Know how to extract files: .tar, .tar.gz, .tgz, .zip, .gz, ...
  • Know how to parse .tmx, .sgm, or .tsv files
  • Know whether parallel data comes as one .tsv file or as two .sgm files
  • (And more over time. Create an issue to discuss more such "you don't have to" topics.)

because MTData handles all of the above under the hood.

Installation

# coming soon to pypi
# pip install mtdata 

git clone https://github.com/thammegowda/mtdata 
cd mtdata
pip install .  # add "--editable" flag for development mode 

CLI Usage

  • After pip installation, the CLI can be invoked as the mtdata command or as python -m mtdata
  • There are two subcommands: list for listing the datasets, and get for downloading them

mtdata list

mtdata list -h
usage: mtdata list [-h] [-l LANGS] [-n [NAMES [NAMES ...]]]

optional arguments:
  -h, --help            show this help message and exit
  -l LANGS, --langs LANGS
                        Language pairs; e.g.: de-en
  -n [NAMES [NAMES ...]], --names [NAMES [NAMES ...]]
                        Name of dataset set; eg europarl_v9.
# List everything
mtdata list

# List a lang pair 
mtdata list -l de-en

# List a dataset by name(s)
mtdata list -n europarl_v9
mtdata list -n europarl_v9 news_commentary_v14

# List by both language pair and dataset names
mtdata list -l de-en -n europarl_v9 news_commentary_v14 newstest201{4,5,6,7,8,9}_deen
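
The same index can also be inspected from Python. Below is a minimal sketch of the equivalent of mtdata list -l de-en; it assumes the entries list and the Entry fields (langs, name, url) used in the extension examples later in this README:

from mtdata.index import entries

# Minimal sketch: filter the built-in index by language pair.
# Assumes Entry objects expose the langs, name, and url fields
# that the Entry constructor takes in the "How to extend" examples below.
for entry in entries:
    if set(entry.langs) == {'de', 'en'}:
        print(entry.name, entry.langs, entry.url)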

mtdata get

mtdata get -h
usage: mtdata get [-h] -l LANGS [-n [NAMES [NAMES ...]]] -o OUT

optional arguments:
  -h, --help            show this help message and exit
  -l LANGS, --langs LANGS
                        Language pairs; e.g.: de-en
  -n [NAMES [NAMES ...]], --names [NAMES [NAMES ...]]
                        Name of dataset set; eg europarl_v9.
  -o OUT, --out OUT     Output directory name

Here is an example showing the collection and preparation of DE-EN datasets:

mtdata get  -l de-en -n europarl_v9 news_commentary_v14 newstest201{4,5,6,7,8,9}_deen -o de-en
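
For scripted pipelines, the same command can be driven from Python by shelling out to the CLI. The wrapper below is a hypothetical convenience (the helper name get_datasets is made up for illustration); it uses only the documented get flags:

import subprocess

def get_datasets(langs, names, out_dir):
    # Download and prepare datasets by invoking the mtdata CLI.
    cmd = ['mtdata', 'get', '-l', langs, '-n', *names, '-o', out_dir]
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure

get_datasets('de-en', ['europarl_v9', 'news_commentary_v14'], 'de-en')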

How to extend:

Please help grow the dataset index by adding missing or new datasets to the index.py module. Here is an example that registers the europarl-v9 corpus:

from mtdata.index import entries, Entry
EUROPARL_v9 = 'http://www.statmt.org/europarl/v9/training/europarl-v9.%s-%s.tsv.gz'
for pair in ['de en', 'cs en', 'cs pl', 'es pt', 'fi en', 'lt en']:
    l1, l2 = pair.split()
    entries.append(Entry(langs=(l1, l2), name='europarl_v9', url=EUROPARL_v9 % (l1, l2)))
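
For the de-en pair, for example, the URL template above resolves to http://www.statmt.org/europarl/v9/training/europarl-v9.de-en.tsv.gz; once appended, the entry shows up in mtdata list -n europarl_v9 like any built-in dataset.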

If a dataset is inside an archive such as a zip or tar file:

from mtdata.index import entries, Entry
wmt_sets = {
    'newstest2014': [('de', 'en'), ('cs', 'en'), ('fr', 'en'), ('ru', 'en'), ('hi', 'en')],
    'newsdev2015': [('fi', 'en'), ('en', 'fi')]
}
for set_name, pairs in wmt_sets.items():
    for l1, l2 in pairs:
        src = f'dev/{set_name}-{l1}{l2}-src.{l1}.sgm'
        ref = f'dev/{set_name}-{l1}{l2}-ref.{l2}.sgm'
        name = f'{set_name}_{l1}{l2}'
        entries.append(Entry((l1, l2), name=name, filename='wmt20dev.tgz', in_paths=[src, ref],
                             url='http://data.statmt.org/wmt20/translation-task/dev.tgz'))
# filename='wmt20dev.tgz' is set manually because the URL ends in the
# generic name dev.tgz, which could be confusing on disk
# in_paths=[src, ref] lists the two .sgm files inside the tarball
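
Once added, these sets can be fetched like any other entry; following the naming scheme in the loop above, for example: mtdata get -l fi-en -n newsdev2015_fien -o fi-en.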

Developers:

  • Thamme Gowda
