A tokenizer, text cleaner, and phonemizer for many human languages.

Gruut

A tokenizer, text cleaner, and IPA phonemizer for several human languages.

$ echo 'He wound it around the wound, saying "I read it was $10 to read."' | \
    gruut en-us tokenize | \
    gruut en-us phonemize | \
    jq -c .clean_words,.pronunciation

["he","wound","it","around","the","wound",",","saying","i","read","it","was","ten","dollars","to","read","."]
[["h","ˈi"],["w","ˈaʊ","n","d"],["ˈɪ","t"],["ɚ","ˈaʊ","n","d"],["ð","ˈi"],["w","ˈu","n","d"],["|"],["s","ˈeɪ","ɪ","ŋ"],["ˈaɪ"],["ɹ","ˈɛ","d"],["ˈɪ","t"],["w","ˈɑ","z"],["t","ˈɛ","n"],["d","ˈɑ","l","ɚ","z"],["t","ˈu"],["ɹ","ˈi","d"],["‖"]]

Includes a pre-trained U.S. English model with part-of-speech/tense aware pronunciations (e.g., "read" pronounced like "red" or "reed").

Pre-trained models are also available for the other supported languages.

Useful for transforming raw text into phonetic pronunciations, similar to phonemizer. Unlike phonemizer, gruut looks up words in a pre-built lexicon (pronunciation dictionary) or guesses word pronunciations with a pre-trained grapheme-to-phoneme model. Phonemes for each language come from a carefully chosen inventory.
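The lookup-then-guess behavior described above can be sketched in a few lines. This is a minimal illustration, not gruut's implementation: `TOY_LEXICON`, `spell_out`, and `pronounce` are hypothetical names, and the fallback simply spells the word out instead of running a real grapheme-to-phoneme model.

```python
from typing import Callable, Dict, List

# Toy lexicon for illustration only; a real lexicon is far larger.
# Pronunciations are copied from the example output at the top of this page.
TOY_LEXICON: Dict[str, List[str]] = {
    "he": ["h", "ˈi"],
    "it": ["ˈɪ", "t"],
    "ten": ["t", "ˈɛ", "n"],
}


def spell_out(word: str) -> List[str]:
    """Hypothetical stand-in for a G2P model: one symbol per letter."""
    return list(word)


def pronounce(
    word: str,
    lexicon: Dict[str, List[str]],
    guess: Callable[[str], List[str]] = spell_out,
) -> List[str]:
    """Return the lexicon entry for `word`, or a guess when it is missing."""
    key = word.lower()
    if key in lexicon:
        return lexicon[key]
    return guess(key)


print(pronounce("Ten", TOY_LEXICON))   # found in the lexicon
print(pronounce("zorp", TOY_LEXICON))  # not found, so the fallback guesses
```

Passing the fallback as a parameter keeps the lookup logic independent of whichever guessing model is plugged in.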

For each supported language, gruut includes:

A list of phonemes in the International Phonetic Alphabet
A word pronunciation lexicon built from Wiktionary
- See pron_dict
A pre-trained grapheme-to-phoneme model for guessing word pronunciations

Supported Languages

gruut currently supports:

U.S. English (en-us)
U.K. English (en-gb)
Dutch (nl)
Czech (cs-cz)
German (de-de)
French (fr-fr)
Italian (it-it)
Spanish (es-es)
Russian (ru-ru)
Vietnamese (vi-n)

The goal is to support all of voice2json's languages.

Dependencies

Python 3.7 or higher
Linux
- Tested on Debian Buster
Babel and num2words
- Currency/number handling
gruut-ipa
- IPA pronunciation manipulation
phonetisaurus
- Guessing word pronunciations outside lexicon

Installation

$ pip install gruut

For Raspberry Pi (ARM), you will first need to manually install phonetisaurus.

Language Download

Pre-trained models for gruut can be downloaded with:

$ python3 -m gruut <LANGUAGE> download

A U.S. English model is included in the distribution.

By default, models are stored in $HOME/.config/gruut. This can be overridden by passing a --data-dir argument to all gruut commands.
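The default-with-override rule for the model directory might look like the following in a script that wraps gruut. `resolve_data_dir` is a hypothetical helper, not part of gruut; it only mirrors the behavior described above.

```python
from pathlib import Path
from typing import Optional


def resolve_data_dir(data_dir: Optional[str] = None) -> Path:
    """Hypothetical helper: an explicit directory wins, else $HOME/.config/gruut."""
    if data_dir:
        return Path(data_dir).expanduser()
    return Path.home() / ".config" / "gruut"


print(resolve_data_dir())                     # default location
print(resolve_data_dir("/tmp/gruut-models"))  # explicit override
```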

Usage

The gruut module can be run with python3 -m gruut <LANGUAGE> <COMMAND> <ARGS>.

The commands are line-oriented, consuming/producing either text or JSONL. They can be composed to produce a pipeline for cleaning text.

You will probably want to install jq to manipulate the JSONL output from gruut.
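If jq is not available, the JSONL output can also be read with Python's standard library. The sketch below assumes only that each line is one JSON object; `read_jsonl` is a hypothetical helper, and the sample line is a trimmed stand-in for real tokenize output, which carries more fields.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Sample input shaped like tokenize output; only "clean_words" is included.
sample = ['{"clean_words": ["this", ",", "right", "here"]}', ""]

for record in read_jsonl(sample):
    print(record["clean_words"])
```

To consume a pipeline, pass `sys.stdin` instead of `sample`.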

tokenize

Takes raw text and outputs JSONL with cleaned words/tokens.

$ echo 'This, right here, is some RAW text!' \
    | python3 -m gruut en-us tokenize \
    | jq -c .clean_words
["this",",","right","here",",","is","some","raw","text","!"]

See python3 -m gruut <LANGUAGE> tokenize --help for more options.

phonemize

Takes JSONL output from tokenize and produces JSONL with phonemic pronunciations.

$ echo 'This, right here, is some RAW text!' \
    | python3 -m gruut en-us tokenize \
    | python3 -m gruut en-us phonemize \
    | jq -r .pronunciation_text
ð ɪ s | ɹ aɪ t h iː ɹ | ɪ z s ʌ m ɹ ɑː t ɛ k s t ‖

See python3 -m gruut <LANGUAGE> phonemize --help for more options.
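A phonemize record can be turned into a word-by-word pronunciation listing. The sketch below assumes that "clean_words" and "pronunciation" are parallel lists, as they are in the example at the top of this page; `word_pronunciations` is a hypothetical helper and the record is abbreviated.

```python
import json
from typing import List, Tuple

# One JSONL record, abbreviated from the example at the top of this page.
# Only "clean_words" and "pronunciation" are assumed to be present.
record_line = json.dumps(
    {
        "clean_words": ["he", "wound", "it"],
        "pronunciation": [["h", "ˈi"], ["w", "ˈaʊ", "n", "d"], ["ˈɪ", "t"]],
    },
    ensure_ascii=False,
)


def word_pronunciations(line: str) -> List[Tuple[str, str]]:
    """Pair each clean word with its phonemes joined by spaces."""
    record = json.loads(line)
    words = record["clean_words"]
    prons = record["pronunciation"]
    return [(word, " ".join(phonemes)) for word, phonemes in zip(words, prons)]


for word, phonemes in word_pronunciations(record_line):
    print(f"{word}\t{phonemes}")
```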

phones2phonemes

Takes IPA pronunciations (one per line) and outputs JSONL with phonemes and their descriptions.

$ echo '/ˈt͡ʃuːz/' \
    | python3 -m gruut en-us phones2phonemes --keep-stress \
    | jq .phonemes
[
  {
    "text": "t͡ʃ",
    "letters": "t͡ʃ",
    "example": "[ch]in",
    "stress": "primary",
    "type": "Consonant",
    "place": "post-alveolar",
    "voiced": false,
    "nasalated": false,
    "elongated": false
  },
  {
    "text": "uː",
    "letters": "u",
    "example": "s[oo]n",
    "stress": "none",
    "height": "close",
    "placement": "back",
    "rounded": true,
    "type": "Vowel",
    "nasalated": false,
    "elongated": true
  },
  {
    "text": "z",
    "letters": "z",
    "example": "[z]ing",
    "stress": "none",
    "type": "Consonant",
    "place": "alveolar",
    "voiced": true,
    "nasalated": false,
    "elongated": false
  }
]

See python3 -m gruut <LANGUAGE> phones2phonemes --help for more options.
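The phoneme descriptions are plain JSON, so they are easy to filter by feature. The sketch below uses a trimmed copy of the array shown above and relies only on the "text", "type", and "voiced" fields that appear there.

```python
import json

# Abbreviated copy of the .phonemes array shown above (most fields trimmed).
phonemes_json = """
[
  {"text": "t͡ʃ", "type": "Consonant", "voiced": false},
  {"text": "uː", "type": "Vowel", "elongated": true},
  {"text": "z", "type": "Consonant", "voiced": true}
]
"""

phonemes = json.loads(phonemes_json)

# Select phonemes by type, and by a feature that only consonants carry.
vowels = [p["text"] for p in phonemes if p["type"] == "Vowel"]
voiced = [p["text"] for p in phonemes if p.get("voiced")]

print(vowels)
print(voiced)
```

Using `p.get("voiced")` avoids a KeyError on vowels, which have no "voiced" field in the output above.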

coverage

Takes JSONL from phonemize and outputs a coverage report for all single phonemes and phoneme pairs.

$ echo 'The quick brown fox jumps over the lazy dog.' \
    | python3 -m gruut en-us tokenize \
    | python3 -m gruut en-us phonemize \
    | python3 -m gruut en-us coverage \
    | jq -c .coverage
{"single":0.625,"pair":0.42028985507246375}

With multiple sentences:

$ cat << EOF |
The birch canoe slid on the smooth planks.
Glue the sheet to the dark blue background.
It's easy to tell the depth of a well.
These days a chicken leg is a rare dish.
Rice is often served in round bowls.
The juice of lemons makes fine punch.
The box was thrown beside the parked truck.
The hogs were fed chopped corn and garbage.
Four hours of steady work faced us.
Large size in stockings is hard to sell.
EOF
    python3 -m gruut en-us tokenize \
    | python3 -m gruut en-us phonemize \
    | python3 -m gruut en-us coverage \
    | jq -c .coverage
{"single":0.9,"pair":0.8214285714285714}
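One way to think about these two numbers is as the fraction of known phonemes, and of known adjacent phoneme pairs, that appear in the input. The sketch below illustrates that idea only; how gruut builds its set of possible pairs is not stated here, so `inventory` and `possible_pairs` are toy assumptions and `coverage` is a hypothetical function.

```python
from typing import Dict, Iterable, List, Set, Tuple


def coverage(
    sentences: Iterable[List[str]],
    inventory: Set[str],
    possible_pairs: Set[Tuple[str, str]],
) -> Dict[str, float]:
    """Fraction of single phonemes and of adjacent pairs that were seen."""
    seen_single: Set[str] = set()
    seen_pairs: Set[Tuple[str, str]] = set()
    for phonemes in sentences:
        seen_single.update(phonemes)
        # Adjacent pairs within one sentence: (p0, p1), (p1, p2), ...
        seen_pairs.update(zip(phonemes, phonemes[1:]))
    return {
        "single": len(seen_single & inventory) / len(inventory),
        "pair": len(seen_pairs & possible_pairs) / len(possible_pairs),
    }


# Toy inventory and pair set, chosen only to make the arithmetic easy to check.
inventory = {"a", "b", "c", "d"}
possible_pairs = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")}

print(coverage([["a", "b", "c"]], inventory, possible_pairs))
```

With the toy data, three of four phonemes and two of four pairs are seen, giving 0.75 and 0.5. An empty inventory or pair set would raise ZeroDivisionError, which this sketch does not handle.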

Release history

2.4.0 (Jul 3, 2024)
2.3.4 (Jun 17, 2022)
2.3.3 (Jun 17, 2022)
2.3.2 (May 11, 2022)
2.3.1 (May 11, 2022)
2.3.0 (Mar 30, 2022)
2.2.3 (Mar 17, 2022)
2.2.2 (Mar 11, 2022)
2.2.0 (Dec 6, 2021)
2.1.1 (Dec 3, 2021)
2.1.0 (Nov 10, 2021)
2.0.4 (Nov 5, 2021)
2.0.3 (Nov 1, 2021)
2.0.2 (Oct 19, 2021)
2.0.1 (Oct 15, 2021)
2.0.0 (Oct 14, 2021, yanked)
- Reason this release was yanked: Bug fix for Python 3.6
1.3.1 (Aug 2, 2021)
1.3.0 (Jul 22, 2021)
1.2.3 (Jul 11, 2021)
1.2.2 (Jun 18, 2021)
1.2.1 (Jun 16, 2021)
1.1.0 (Jun 9, 2021)
1.0.0 (Jun 1, 2021)
0.9.5 (Apr 27, 2021)
0.9.4 (Apr 14, 2021)
0.9.3 (Apr 12, 2021)
0.9.2 (Mar 31, 2021)
0.9.1 (Mar 26, 2021)
0.8.0 (Mar 5, 2021, this version)
0.7.0 (Mar 3, 2021)
0.3.0 (Oct 26, 2020)
0.2.1 (Oct 9, 2020)

Download files

Source Distribution

gruut-0.8.0.tar.gz (18.0 MB), uploaded Mar 5, 2021

File metadata

Download URL: gruut-0.8.0.tar.gz
Upload date: Mar 5, 2021
Size: 18.0 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.7.7

File hashes

Hashes for gruut-0.8.0.tar.gz

SHA256: 68183c5cc7083a04f0c331cb5672b1e4beafea2307094798f4638bcc0062ccfa
MD5: be893ebe7ab0be50fad060e22e5e3006
BLAKE2b-256: 04c8a595c2e1efe316e3e31c8d316794ba2295c820c65ad7a17ae176c16cb415