Gruut
A tokenizer, text cleaner, and IPA phonemizer for several human languages.
from gruut import text_to_phonemes
text = 'He wound it around the wound, saying "I read it was $10 to read."'
for sent_idx, word, word_phonemes in text_to_phonemes(text, lang="en-us"):
    print(word, *word_phonemes)
which outputs:
he h ˈi
wound w ˈaʊ n d
it ˈɪ t
around ɚ ˈaʊ n d
the ð ə
wound w ˈu n d
, |
saying s ˈeɪ ɪ ŋ
i ˈaɪ
read ɹ ˈɛ d
it ˈɪ t
was w ə z
ten t ˈɛ n
dollars d ˈɑ l ɚ z
to t ə
read ɹ ˈi d
. ‖
Note that "wound" and "read" have different pronunciations when used in different contexts.
See the documentation for more details.
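Because each result carries a sentence index, the words can be regrouped by sentence. The following is a minimal sketch: `group_by_sentence` is a hypothetical helper, not part of gruut, and the sample tuples are copied from the output above (with an assumed sentence index of 0) rather than produced by a live call.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (sent_idx, word, word_phonemes): the tuple shape yielded by text_to_phonemes above
WordResult = Tuple[int, str, List[str]]


def group_by_sentence(results: Iterable[WordResult]) -> Dict[int, List[str]]:
    """Collect "word: phonemes" strings per sentence index (hypothetical helper)."""
    sentences: Dict[int, List[str]] = defaultdict(list)
    for sent_idx, word, word_phonemes in results:
        sentences[sent_idx].append(f"{word}: {' '.join(word_phonemes)}")
    return dict(sentences)


# Sample data copied from the output above. The sentence index 0 is an
# assumption; in real use, pass text_to_phonemes(text, lang="en-us") instead.
sample = [
    (0, "he", ["h", "ˈi"]),
    (0, "wound", ["w", "ˈaʊ", "n", "d"]),
    (0, "it", ["ˈɪ", "t"]),
]
grouped = group_by_sentence(sample)
```

Keeping the helper independent of gruut means it accepts any iterable of such tuples, which makes it easy to try out without the language files installed.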
Installation
$ pip install gruut
Additional languages can be added during installation. For example, with French and Italian support:
$ pip install gruut[fr,it]
You may also manually download language files and use the --lang-dir option:
$ gruut <lang> <command> --lang-dir /path/to/language-files/
Extracting the files to $HOME/.config/gruut/ will allow gruut to automatically make use of them. gruut will look for language files in the directory $HOME/.config/gruut/<lang>/ if the corresponding Python package is not installed. Note that <lang> here is the full language name, e.g. de-de instead of just de.
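To check whether manually extracted language files are where gruut would look for them, the directory rule above can be sketched as follows. `find_user_lang_dir` is a hypothetical helper written for illustration and is not gruut's own lookup code; the temporary directory only stands in for $HOME/.config/gruut/.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_user_lang_dir(lang: str, config_root: Optional[Path] = None) -> Optional[Path]:
    """Return <config_root>/<lang>/ if that directory exists, else None.

    Hypothetical helper; config_root defaults to $HOME/.config/gruut/.
    """
    if config_root is None:
        config_root = Path.home() / ".config" / "gruut"
    lang_dir = config_root / lang
    return lang_dir if lang_dir.is_dir() else None


# Demonstration with a temporary directory standing in for the config root.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "de-de").mkdir()
    found = find_user_lang_dir("de-de", config_root=root)
    found_name = found.name if found else None
    # The short code does not match, since the directory uses the full name.
    missing = find_user_lang_dir("de", config_root=root)
```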
Supported Languages
gruut currently supports:
- Czech (cs or cs-cz)
- German (de or de-de)
- English (en or en-us)
- Spanish (es or es-es)
- Farsi/Persian (fa)
- French (fr or fr-fr)
- Italian (it or it-it)
- Dutch (nl)
- Russian (ru or ru-ru)
- Swedish (sv or sv-se)
The goal is to support all of voice2json's languages.
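Since some contexts require the full language name rather than the short code, a small lookup table built from the supported-languages list can normalize either form. This is only a sketch: `resolve_lang` and the table names are hypothetical, and how gruut resolves codes internally is not shown here.

```python
# Short code -> full language name, taken from the supported-languages list.
SHORT_TO_FULL = {
    "cs": "cs-cz",
    "de": "de-de",
    "en": "en-us",
    "es": "es-es",
    "fr": "fr-fr",
    "it": "it-it",
    "ru": "ru-ru",
    "sv": "sv-se",
}

# Languages listed with a single code only.
SINGLE_FORM = {"fa", "nl"}

SUPPORTED = set(SHORT_TO_FULL) | set(SHORT_TO_FULL.values()) | SINGLE_FORM


def resolve_lang(lang: str) -> str:
    """Map a short or full code to its full form (hypothetical helper)."""
    if lang not in SUPPORTED:
        raise ValueError(f"Unsupported language: {lang}")
    return SHORT_TO_FULL.get(lang, lang)
```

For example, `resolve_lang("de")` gives `"de-de"`, while `resolve_lang("nl")` is returned unchanged because only one form is listed for Dutch.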
Dependencies
- Python 3.7 or higher
- Linux
  - Tested on Debian Buster
- Babel and num2words
  - Currency/number handling
- gruut-ipa
  - IPA pronunciation manipulation
- pycrfsuite
  - Part of speech tagging and grapheme to phoneme models
Command-Line Usage
The gruut module can be executed with python3 -m gruut <LANGUAGE> <COMMAND> <ARGS>
The commands are line-oriented, consuming/producing either text or JSONL. They can be composed to produce a pipeline for cleaning text.
You will probably want to install jq to manipulate the JSONL output from gruut.
tokenize
Takes raw text and outputs JSONL with cleaned words/tokens.
$ echo 'This, right here, is some RAW text!' \
| python3 -m gruut en-us tokenize \
| jq -c .clean_words
["this", ",", "right", "here", ",", "is", "some", "raw", "text", "!"]
See python3 -m gruut <LANGUAGE> tokenize --help for more options.
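If jq is not available, the same field can be pulled out of the JSONL stream with a few lines of Python. This is a minimal sketch: `iter_clean_words` is a hypothetical helper, and the sample record is reduced to the `clean_words` field shown above (real records may carry other fields).

```python
import json
from typing import Iterable, Iterator, List


def iter_clean_words(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the clean_words list from each non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        yield json.loads(line)["clean_words"]


# Reduced sample record; only the clean_words field is taken from the
# example above, the rest of the record is omitted.
sample_line = (
    '{"clean_words": ["this", ",", "right", "here", ",", '
    '"is", "some", "raw", "text", "!"]}'
)
words = list(iter_clean_words([sample_line]))
```

In a pipeline, `sys.stdin` could be passed in place of the sample list, since it is also an iterable of lines.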
phonemize
Takes JSONL output from tokenize and produces JSONL with phonemic pronunciations.
$ echo 'This, right here, is some RAW text!' \
| python3 -m gruut en-us tokenize \
| python3 -m gruut en-us phonemize \
| jq -c .pronunciation_text
ð ɪ s | ɹ aɪ t h iː ɹ | ɪ z s ʌ m ɹ ɑː t ɛ k s t ‖
See python3 -m gruut <LANGUAGE> phonemize --help for more options.
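The pronunciation text can be split back into phrases for downstream use. The sketch below assumes, based on the examples in this document, that phonemes are separated by spaces and that | and ‖ mark breaks (they appear where the comma and the final punctuation were). `split_phrases` is a hypothetical helper, not part of gruut.

```python
import re
from typing import List


def split_phrases(pronunciation_text: str) -> List[List[str]]:
    """Split on break symbols, then on whitespace, dropping empty phrases."""
    phrases = []
    for chunk in re.split(r"[|‖]", pronunciation_text):
        phonemes = chunk.split()
        if phonemes:
            phrases.append(phonemes)
    return phrases


# Pronunciation text copied from the phonemize example above.
text = "ð ɪ s | ɹ aɪ t h iː ɹ | ɪ z s ʌ m ɹ ɑː t ɛ k s t ‖"
phrases = split_phrases(text)
```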
Intended Audience
gruut is useful for transforming raw text into phonetic pronunciations, similar to phonemizer. Unlike phonemizer, gruut looks up words in a pre-built lexicon (pronunciation dictionary) or guesses word pronunciations with a pre-trained grapheme-to-phoneme model. Phonemes for each language come from a carefully chosen inventory.
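The lookup-then-guess strategy described above can be illustrated with a toy example. Everything here is a stand-in: `pronounce`, `toy_lexicon`, and `toy_guess` are hypothetical names, the lexicon holds a single entry copied from the earlier output, and the "model" simply splits a word into letters, which is not how a trained grapheme-to-phoneme model behaves.

```python
from typing import Callable, Dict, List


def pronounce(
    word: str,
    lexicon: Dict[str, List[str]],
    guess: Callable[[str], List[str]],
) -> List[str]:
    """Return the lexicon entry if present, otherwise fall back to guessing."""
    phonemes = lexicon.get(word)
    if phonemes is None:
        phonemes = guess(word)
    return phonemes


# Toy lexicon with one entry taken from the example output earlier.
toy_lexicon = {"ten": ["t", "ˈɛ", "n"]}


def toy_guess(word: str) -> List[str]:
    """Placeholder for a grapheme-to-phoneme model: one symbol per letter."""
    return list(word)


known = pronounce("ten", toy_lexicon, toy_guess)
unknown = pronounce("zzz", toy_lexicon, toy_guess)
```

Passing the fallback in as a function keeps the lookup logic separate from whichever model is used to guess unknown words.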
For each supported language, gruut includes:
- A word pronunciation lexicon built from open source data
  - See pron_dict
- A pre-trained grapheme-to-phoneme model for guessing word pronunciations
Some languages also include:
- A pre-trained part of speech tagger built from open source data:
Source Distribution
File details
Details for the file gruut-1.1.0.tar.gz.
File metadata
- Download URL: gruut-1.1.0.tar.gz
- Upload date:
- Size: 7.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.7.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | e849640b7470d54f2306b80f893ab323ff593f8199a846efc6c582b462604059
MD5 | 0d7ad8d4cb72fc883bdda71d12f7aee3
BLAKE2b-256 | a34318398616e13539d9efec974a747702bac327a13d3e1b1e435a4f547959a6