Scraping grapheme-to-phoneme data from Wiktionary.

WikiPron

WikiPron is a command line toolkit for scraping grapheme-to-phoneme (G2P) data from Wiktionary.

Installation

WikiPron requires Python 3.6+. It is available through pip:

pip install wikipron

Usage

Quick Start

After installation, the terminal command wikipron will be available. As a basic example, the following command scrapes G2P data for French:

wikipron fra

Specifying the Language

The language is indicated by a three-letter ISO 639-2 or ISO 639-3 language code, e.g., fra for French. To see which languages can be scraped, consult the complete list of languages on Wiktionary that have pronunciation entries.

Output

The scraped data is organized with each <word, pronunciation> pair on its own line, where the word and pronunciation are separated by a tab. The pronunciation is given in the International Phonetic Alphabet (IPA) and is segmented by spaces so that combining and modifier diacritics stay attached to their base symbol, which is convenient for modeling. For example, the output is kʰ æ t, with the aspirated k as a single segment, rather than k ʰ æ t.

For illustration, here is a snippet of French data scraped by WikiPron:

accrémentitielle	a k ʁ e m ɑ̃ t i t j ɛ l
accrescent	a k ʁ ɛ s ɑ̃
accrétion	a k ʁ e s j ɔ̃
accrétions	a k ʁ e s j ɔ̃
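The segmentation convention described above can be illustrated with a small sketch. This is not WikiPron's own implementation. It is a simplified approximation which assumes that every Unicode combining mark (category Mn) and modifier letter (category Lm) belongs to the preceding base symbol. The function name segment_ipa is hypothetical, and cases such as tie bars joining two base symbols, or stress marks that precede a syllable, are deliberately not handled.

```python
import unicodedata


def segment_ipa(pron: str) -> str:
    """Space-separates an IPA string, keeping diacritics on their base.

    Assumption (not taken from WikiPron): combining marks (Mn) and
    modifier letters (Lm) always attach to the preceding symbol.
    """
    segments = []
    for ch in pron:
        if ch.isspace():
            continue  # Ignore any whitespace already present.
        category = unicodedata.category(ch)
        if segments and category in ("Mn", "Lm"):
            segments[-1] += ch  # Attach the diacritic to the previous segment.
        else:
            segments.append(ch)  # Start a new segment.
    return " ".join(segments)


# "k" + MODIFIER LETTER SMALL H (U+02B0) + "æ" + "t"
aspirated = segment_ipa("k\u02b0\u00e6t")
# "ɑ" (U+0251) followed by COMBINING TILDE (U+0303)
nasalized = segment_ipa("ak\u0281\u025bs\u0251\u0303")
```

Here aspirated keeps the aspiration mark on the k, and nasalized keeps the combining tilde on the final vowel, matching the kind of segmentation shown in the French snippet.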

By default, the scraped data appears in the terminal. To save the data in a TSV file, please redirect the standard output to a filename of your choice:

wikipron fra > fra.tsv
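Once the data is saved, the TSV layout described above is easy to read back. The following is a minimal sketch, assuming one tab-separated pair per line and allowing for the possibility that a word appears on more than one line. The function name parse_wikipron_lines is hypothetical and not part of WikiPron.

```python
from typing import Dict, Iterable, List


def parse_wikipron_lines(lines: Iterable[str]) -> Dict[str, List[List[str]]]:
    """Maps each word to a list of pronunciations, each a list of segments."""
    lexicon: Dict[str, List[List[str]]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # Skip blank lines.
        # The word and pronunciation are separated by a single tab.
        word, pron = line.split("\t", 1)
        # Segments within the pronunciation are separated by spaces.
        lexicon.setdefault(word, []).append(pron.split())
    return lexicon


# Two lines in the same shape as the French snippet above.
sample = [
    "accrescent\ta k ʁ ɛ s ɑ̃\n",
    "accrétion\ta k ʁ e s j ɔ̃\n",
]
lexicon = parse_wikipron_lines(sample)
```

To read a saved file instead of the in-memory sample, pass an open file handle, e.g., open("fra.tsv", encoding="utf-8"), since iterating over a file yields its lines.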

Advanced Options

The wikipron terminal command has an array of options to configure your scraping run. For a full list of the options, please run wikipron -h.

Python API

The underlying module can also be used from Python. A standard workflow looks like:

import wikipron

config = wikipron.Config(key="fra")  # French, with default options.
for word, pron in wikipron.scrape(config):
    ...
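As one possible body for the loop above, the pairs can be collected into a pronunciation dictionary. This is a sketch, assuming that each scraped item is a (word, pronunciation) pair of strings as the loop suggests. The helper name build_lexicon is hypothetical, and the demo pairs below stand in for real scraped output so that the example runs without network access.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_lexicon(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Groups pronunciations by word, dropping exact duplicates."""
    lexicon = defaultdict(list)
    for word, pron in pairs:
        if pron not in lexicon[word]:
            lexicon[word].append(pron)
    return dict(lexicon)


# Stand-in data; with WikiPron installed, the pairs would instead come
# from wikipron.scrape(config) as in the workflow above.
demo_pairs = [
    ("accrescent", "a k ʁ ɛ s ɑ̃"),
    ("accrétion", "a k ʁ e s j ɔ̃"),
    ("accrescent", "a k ʁ ɛ s ɑ̃"),  # Duplicate, kept only once.
]
demo_lexicon = build_lexicon(demo_pairs)
```

Because build_lexicon accepts any iterable of pairs, the result of wikipron.scrape(config) can be passed to it directly.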

Development

The source code of WikiPron is hosted on GitHub at https://github.com/kylebgorman/wikipron, where development also happens.

To get the latest changes not yet released through pip, or to work on the codebase yourself, you may obtain the latest source code through GitHub and git:

  1. Create a fork of the wikipron repo on your GitHub account.

  2. Locally, make sure you are in a virtual environment (venv, virtualenv, conda, etc.).

  3. Download and install the library in "editable" mode, together with the core and dev dependencies, within the virtual environment:

    git clone https://github.com/<your-github-username>/wikipron.git
    cd wikipron
    pip install --upgrade pip setuptools
    pip install -r requirements.txt
    pip install --no-deps -e .
    

We keep track of notable changes in CHANGELOG.md.

Contribution

For questions, bug reports, and feature requests, please file an issue.

If you would like to contribute to the wikipron codebase, please see CONTRIBUTING.md.

License

Apache 2.0. Please see LICENSE.txt for details.

Please note that Wiktionary data has its own licensing terms, as does the other data in the languages/ subdirectory.
