Gruut

A tokenizer, text cleaner, and IPA phonemizer for several human languages.

from gruut import text_to_phonemes

text = 'He wound it around the wound, saying "I read it was $10 to read."'

for sent_idx, word, word_phonemes in text_to_phonemes(text, lang="en-us"):
    print(word, *word_phonemes)

which outputs:

he h ˈi
wound w ˈaʊ n d
it ˈɪ t
around ɚ ˈaʊ n d
the ð ə
wound w ˈu n d
, |
saying s ˈeɪ ɪ ŋ
i ˈaɪ
read ɹ ˈɛ d
it ˈɪ t
was w ə z
ten t ˈɛ n
dollars d ˈɑ l ɚ z
to t ə
read ɹ ˈi d
. ‖

Note that "wound" and "read" have different pronunciations when used in different contexts.
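Because each yielded tuple carries a sentence index, the results can be grouped per sentence. The following is a minimal sketch, assuming only the (sent_idx, word, word_phonemes) shape shown above; group_by_sentence is a hypothetical helper that is not part of gruut, and the sample tuples are written by hand (including their index values) so the snippet runs without gruut installed.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

# Shape assumed from the loop above: (sentence index, word, phonemes)
WordTuple = Tuple[int, str, Sequence[str]]


def group_by_sentence(items: Iterable[WordTuple]) -> Dict[int, List[str]]:
    """Collect "word: phonemes" strings under each sentence index."""
    sentences: Dict[int, List[str]] = defaultdict(list)
    for sent_idx, word, word_phonemes in items:
        sentences[sent_idx].append(f"{word}: {' '.join(word_phonemes)}")
    return dict(sentences)


# Hand-written sample in the same shape as the tuples printed above.
sample = [
    (0, "he", ["h", "ˈi"]),
    (0, "wound", ["w", "ˈaʊ", "n", "d"]),
    (0, "it", ["ˈɪ", "t"]),
]
grouped = group_by_sentence(sample)
print(grouped[0])
```

With gruut installed, the same helper should accept the iterator returned by text_to_phonemes(text, lang="en-us") in place of the sample list.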

See the documentation for more details.

Intended Audience

gruut is useful for transforming raw text into phonetic pronunciations, similar to phonemizer. Unlike phonemizer, gruut looks up words in a pre-built lexicon (pronunciation dictionary) or guesses word pronunciations with a pre-trained grapheme-to-phoneme model. Phonemes for each language come from a carefully chosen inventory.
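The lookup-then-guess strategy described above can be illustrated with a small sketch. This is not gruut's implementation: pronounce, toy_lexicon, and toy_guess are hypothetical names, the lexicon is a plain dict, and the "model" is a stand-in that simply spells the word out letter by letter.

```python
from typing import Callable, Dict, List

Lexicon = Dict[str, List[str]]


def pronounce(
    word: str, lexicon: Lexicon, guess: Callable[[str], List[str]]
) -> List[str]:
    """Return the lexicon entry for a word, or fall back to the guesser."""
    key = word.lower()
    if key in lexicon:
        return lexicon[key]
    # Word is not in the lexicon: ask the (stand-in) model for a guess.
    return guess(key)


def toy_guess(word: str) -> List[str]:
    """Stand-in for a grapheme-to-phoneme model: one symbol per letter."""
    return list(word)


# Toy lexicon with two entries copied from the example output above.
toy_lexicon: Lexicon = {"it": ["ˈɪ", "t"], "the": ["ð", "ə"]}

print(pronounce("The", toy_lexicon, toy_guess))  # found in the lexicon
print(pronounce("zap", toy_lexicon, toy_guess))  # falls back to the guesser
```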

For each supported language, gruut includes:

  • A word pronunciation lexicon built from open source data
  • A pre-trained grapheme-to-phoneme model for guessing word pronunciations

Some languages also include:

Supported Languages

gruut currently supports:

  • Czech (cs)
  • German (de)
  • English (en)
  • Spanish (es)
  • Farsi/Persian (fa)
  • French (fr)
  • Italian (it)
  • Dutch (nl)
  • Russian (ru)
  • Swedish (sv)

The goal is to support all of voice2json's languages.

Dependencies

  • Python 3.7 or higher
  • Linux
    • Tested on Debian Buster
  • Babel and num2words
    • Currency/number handling
  • gruut-ipa
    • IPA pronunciation manipulation
  • pycrfsuite
    • Part of speech tagging and grapheme to phoneme models

Installation

$ pip install gruut

Additional languages can be added during installation. For example, with French and Italian support:

$ pip install gruut[fr,it]

Command-Line Usage

The gruut module can be executed with python3 -m gruut <LANGUAGE> <COMMAND> <ARGS>.

The commands are line-oriented, consuming/producing either text or JSONL. They can be composed to produce a pipeline for cleaning text.

You will probably want to install jq to manipulate the JSONL output from gruut.

tokenize

Takes raw text and outputs JSONL with cleaned words/tokens.

$ echo 'This, right here, is some RAW text!' \
    | python3 -m gruut en-us tokenize \
    | jq -c .clean_words
["this", ",", "right", "here", ",", "is", "some", "raw", "text", "!"]

See python3 -m gruut <LANGUAGE> tokenize --help for more options.
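If jq is not available, the JSONL output can also be read with Python's standard json module. The sketch below assumes only that each line is a JSON object with a clean_words field, as in the example above; iter_clean_words is a hypothetical helper, and the sample line is written by hand (real output may carry additional fields).

```python
import json
from typing import Iterable, Iterator, List


def iter_clean_words(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the clean_words list from each non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        yield record.get("clean_words", [])


# Hand-written sample; the trailing empty string mimics a blank line.
sample_jsonl = ['{"clean_words": ["this", ",", "right", "here"]}', ""]
for words in iter_clean_words(sample_jsonl):
    print(words)
```

Passing sys.stdin instead of sample_jsonl would let the script sit at the end of a shell pipeline in place of jq.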

phonemize

Takes JSONL output from tokenize and produces JSONL with phonemic pronunciations.

$ echo 'This, right here, is some RAW text!' \
    | python3 -m gruut en-us tokenize \
    | python3 -m gruut en-us phonemize \
    | jq -c .pronunciation_text
ð ɪ s | ɹ  t h  ɹ | ɪ z s ʌ m ɹ ɑː t ɛ k s t 

See python3 -m gruut <LANGUAGE> phonemize --help for more options.
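The same two-stage pipeline can be driven from Python with subprocess. This is a rough sketch rather than an official interface: gruut_command and tokenize_and_phonemize are hypothetical helpers, and the only assumption about gruut is the python3 -m gruut <LANGUAGE> <COMMAND> form shown above, with each command reading standard input and writing JSONL to standard output.

```python
import subprocess
import sys
from typing import List


def gruut_command(language: str, command: str) -> List[str]:
    """Build the argument list for: python -m gruut <LANGUAGE> <COMMAND>."""
    return [sys.executable, "-m", "gruut", language, command]


def tokenize_and_phonemize(text: str, language: str = "en-us") -> str:
    """Pipe text through tokenize, then phonemize, and return the JSONL."""
    tokenized = subprocess.run(
        gruut_command(language, "tokenize"),
        input=text,
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise if gruut exits with an error
    )
    phonemized = subprocess.run(
        gruut_command(language, "phonemize"),
        input=tokenized.stdout,
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return phonemized.stdout
```

Calling tokenize_and_phonemize("This, right here, is some RAW text!") requires gruut to be installed in the same Python environment; the returned string holds one JSON object per line.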
