Gruut
A tokenizer, text cleaner, and IPA phonemizer for several human languages that supports SSML.
from gruut import sentences
text = 'He wound it around the wound, saying "I read it was $10 to read."'
for sent in sentences(text, lang="en-us"):
    for word in sent:
        if word.phonemes:
            print(word.text, *word.phonemes)
which outputs:
He h ˈi
wound w ˈaʊ n d
it ˈɪ t
around ɚ ˈaʊ n d
the ð ə
wound w ˈu n d
, |
saying s ˈeɪ ɪ ŋ
I ˈaɪ
read ɹ ˈɛ d
it ˈɪ t
was w ə z
ten t ˈɛ n
dollars d ˈɑ l ɚ z
to t ə
read ɹ ˈi d
. ‖
Note that "wound" and "read" have different pronunciations when used in different (grammatical) contexts.
A subset of SSML is also supported:
from gruut import sentences
ssml_text = """<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<s>Today at 4pm, 2/1/2000.</s>
<s xml:lang="it">Un mese fà, 2/1/2000.</s>
</speak>"""
for sent in sentences(ssml_text, ssml=True):
    for word in sent:
        if word.phonemes:
            print(sent.idx, word.lang, word.text, *word.phonemes)
with the output:
0 en-US Today t ə d ˈeɪ
0 en-US at ˈæ t
0 en-US four f ˈɔ ɹ
0 en-US P p ˈi
0 en-US M ˈɛ m
0 en-US , |
0 en-US February f ˈɛ b j u ˌɛ ɹ i
0 en-US first f ˈɚ s t
0 en-US , |
0 en-US two t ˈu
0 en-US thousand θ ˈaʊ z ə n d
0 en-US . ‖
1 it Un u n
1 it mese ˈm e s e
1 it fà f a
1 it , |
1 it due d j u
1 it gennaio d͡ʒ e n n ˈa j o
1 it duemila d u e ˈm i l a
1 it . ‖
See the documentation for more details.
Installation
pip install gruut
Languages besides English can be added during installation. For example, with French and Italian support:
pip install -f 'https://synesthesiam.github.io/prebuilt-apps/' gruut[fr,it]
The extra pip repo is needed for an updated num2words fork that includes support for more languages.
You may also manually download language files and put them in $XDG_CONFIG_HOME/gruut/ ($HOME/.config/gruut by default).
gruut will look for language files in the directory $XDG_CONFIG_HOME/gruut/<lang>/ if the corresponding Python package is not installed. Note that <lang> here is the full language name, e.g. de-de instead of just de.
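The lookup location described above can be sketched in a few lines. This is only an illustration of the path rule, not gruut's own code: the function name `gruut_lang_dir` and its `environ` parameter are hypothetical, and POSIX-style paths are assumed.

```python
import os
from pathlib import PurePosixPath
from typing import Mapping, Optional


def gruut_lang_dir(lang: str, environ: Optional[Mapping[str, str]] = None) -> PurePosixPath:
    """Return the directory where language files for `lang` would be looked up.

    Hypothetical helper: it mirrors the rule in the text, not gruut's internals.
    """
    env = os.environ if environ is None else environ
    config_home = env.get("XDG_CONFIG_HOME")
    if not config_home:
        # Fall back to $HOME/.config when XDG_CONFIG_HOME is unset or empty
        config_home = str(PurePosixPath(env.get("HOME", "")) / ".config")
    # <lang> is the full language name (e.g. "de-de"), not just "de"
    return PurePosixPath(config_home) / "gruut" / lang


print(gruut_lang_dir("de-de", {"HOME": "/home/user"}))
# /home/user/.config/gruut/de-de
```

Passing the environment as a mapping keeps the sketch easy to try without changing real environment variables.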
Supported Languages
gruut currently supports:
- Arabic (ar)
- Czech (cs or cs-cz)
- German (de or de-de)
- English (en or en-us)
- Spanish (es or es-es)
- Farsi/Persian (fa)
- French (fr or fr-fr)
- Italian (it or it-it)
- Dutch (nl)
- Russian (ru or ru-ru)
- Swedish (sv or sv-se)
- Swahili (sw)
The goal is to support all of voice2json's languages.
Dependencies
- Python 3.7 or higher
- Linux
  - Tested on Debian Bullseye
- num2words fork and Babel
  - Currency/number handling
  - num2words fork includes additional language support (Arabic, Farsi, Swedish, Swahili)
- gruut-ipa
  - IPA pronunciation manipulation
- pycrfsuite
  - Part of speech tagging and grapheme to phoneme models
- pydateparser
  - Date parsing for multiple languages
Numbers, Dates, and More
gruut can automatically verbalize numbers, dates, and other expressions. This is done in a locale-aware manner for both parsing and verbalization, so "1/1/2020" may be interpreted as "M/D/Y" or "D/M/Y" depending on the word or sentence's language (e.g., <s lang="...">).
The following types of expressions can be automatically expanded into words by gruut:
- Numbers - "123" to "one hundred and twenty three" (disable with verbalize_numbers=False or --no-numbers)
  - Relies on Babel for parsing and num2words for verbalization
- Dates - "1/1/2020" to "January first, twenty twenty" (disable with verbalize_dates=False or --no-dates)
  - Relies on pydateparser for parsing and both Babel and num2words for verbalization
- Currency - "$10" to "ten dollars" (disable with verbalize_currency=False or --no-currency)
  - Relies on Babel for parsing and both Babel and num2words for verbalization
- Times - "12:01am" to "twelve oh one A M" (disable with verbalize_times=False or --no-times)
  - English only
  - Relies on num2words for verbalization
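The locale-dependent reading of numeric dates can be illustrated with the standard library alone. The sketch below is not how gruut parses dates (it relies on pydateparser and Babel, as listed above); the `MONTH_FIRST_LANGS` table and the `parse_numeric_date` function are hypothetical and cover only the `a/b/yyyy` form.

```python
from datetime import date

# Hypothetical table: languages that read "a/b/yyyy" as month/day/year.
# Every other language is treated as day/month/year in this sketch.
MONTH_FIRST_LANGS = {"en-us"}


def parse_numeric_date(text: str, lang: str) -> date:
    """Interpret an "a/b/yyyy" string according to the language's field order."""
    first, second, year = (int(part) for part in text.split("/"))
    if lang.lower() in MONTH_FIRST_LANGS:
        month, day = first, second
    else:
        day, month = first, second
    return date(year, month, day)


# Same input, different reading depending on the language
print(parse_numeric_date("2/1/2000", "en-US"))  # 2000-02-01
print(parse_numeric_date("2/1/2000", "it"))     # 2000-01-02
```

This matches the SSML example earlier in this document, where "2/1/2000" becomes "February first" in the en-US sentence and "due gennaio" in the Italian one.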
Command-Line Usage
The gruut module can be executed with python3 -m gruut --language <LANGUAGE> <TEXT> or with the gruut command (from setup.py).
The gruut command is line-oriented, consuming text and producing JSONL.
You will probably want to install jq to manipulate the JSONL output from gruut.
Plain Text
Takes raw text and outputs JSONL with cleaned words/tokens.
echo 'This, right here, is some "RAW" text!' \
| gruut --language en-us \
| jq --raw-output '.words[].text'
This
,
right
here
,
is
some
"
RAW
"
text
!
More information is available in the full JSON output:
gruut --language en-us 'More text.' | jq .
Output:
{
"idx": 0,
"text": "More text.",
"text_with_ws": "More text.",
"text_spoken": "More text",
"par_idx": 0,
"lang": "en-us",
"voice": "",
"words": [
{
"idx": 0,
"text": "More",
"text_with_ws": "More ",
"leading_ws": "",
"training_ws": " ",
"sent_idx": 0,
"par_idx": 0,
"lang": "en-us",
"voice": "",
"pos": "JJR",
"phonemes": [
"m",
"ˈɔ",
"ɹ"
],
"is_major_break": false,
"is_minor_break": false,
"is_punctuation": false,
"is_break": false,
"is_spoken": true,
"pause_before_ms": 0,
"pause_after_ms": 0
},
{
"idx": 1,
"text": "text",
"text_with_ws": "text",
"leading_ws": "",
"training_ws": "",
"sent_idx": 0,
"par_idx": 0,
"lang": "en-us",
"voice": "",
"pos": "NN",
"phonemes": [
"t",
"ˈɛ",
"k",
"s",
"t"
],
"is_major_break": false,
"is_minor_break": false,
"is_punctuation": false,
"is_break": false,
"is_spoken": true,
"pause_before_ms": 0,
"pause_after_ms": 0
},
{
"idx": 2,
"text": ".",
"text_with_ws": ".",
"leading_ws": "",
"training_ws": "",
"sent_idx": 0,
"par_idx": 0,
"lang": "en-us",
"voice": "",
"pos": null,
"phonemes": [
"‖"
],
"is_major_break": true,
"is_minor_break": false,
"is_punctuation": false,
"is_break": true,
"is_spoken": false,
"pause_before_ms": 0,
"pause_after_ms": 0
}
],
"pause_before_ms": 0,
"pause_after_ms": 0
}
For the whole input line and each word, the text property contains the processed input text with normalized whitespace while text_with_ws retains the original whitespace. The text_spoken property only contains words that are spoken, so punctuation and breaks are excluded.
Within each word, there is:
- idx - zero-based index of the word in the sentence
- sent_idx - zero-based index of the sentence in the input text
- pos - part of speech tag (if available)
- phonemes - list of IPA phonemes for the word (if available)
- is_minor_break - true if "word" separates phrases (comma, semicolon, etc.)
- is_major_break - true if "word" separates sentences (period, question mark, etc.)
- is_break - true if "word" is a major or minor break
- is_punctuation - true if "word" is a surrounding punctuation mark (quote, bracket, etc.)
- is_spoken - true if not a break or punctuation
See python3 -m gruut <LANGUAGE> --help for more options.
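As an alternative to jq, the JSONL output can be consumed from Python with the standard json module. The following is a minimal sketch that relies only on the field names shown in the output above (words, text, phonemes, is_spoken); the helper name `spoken_words` is hypothetical, and the sample line is an abbreviated copy of the "More text." output rather than a live call to gruut.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def spoken_words(jsonl_lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (text, phonemes) for every spoken word in gruut-style JSONL."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        sentence = json.loads(line)
        for word in sentence["words"]:
            if word.get("is_spoken"):
                yield word["text"], word.get("phonemes") or []


# Abbreviated version of the "More text." record shown above
sample_line = json.dumps(
    {
        "text": "More text.",
        "words": [
            {"text": "More", "phonemes": ["m", "ˈɔ", "ɹ"], "is_spoken": True},
            {"text": "text", "phonemes": ["t", "ˈɛ", "k", "s", "t"], "is_spoken": True},
            {"text": ".", "phonemes": ["‖"], "is_spoken": False},
        ],
    }
)

for text, phonemes in spoken_words([sample_line]):
    print(text, *phonemes)
```

Because the function accepts any iterable of lines, passing sys.stdin instead of the sample list would let it sit at the end of a shell pipeline.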
SSML
A subset of SSML is supported:
- <speak> - wrap around SSML text
  - lang - set language for document
- <p> - paragraph
  - lang - set language for paragraph
- <s> - sentence (disables automatic sentence breaking)
  - lang - set language for sentence
- <w> / <token> - word (disables automatic tokenization)
  - lang - set language for word
  - role - set word role (see word roles)
- <lang lang="..."> - set language of inner text
- <voice name="..."> - set voice of inner text
- <say-as interpret-as=""> - force interpretation of inner text
  - interpret-as - one of "spell-out", "date", "number", "time", or "currency"
  - format - way to format text depending on interpret-as
    - number - one of "cardinal", "ordinal", "digits", "year"
    - date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
- <break time=""> - pause for given amount of time
  - time - seconds ("123s") or milliseconds ("123ms")
- <mark name=""> - user-defined mark (marks_before and marks_after attributes of words/sentences)
  - name - name of mark
- <sub alias=""> - substitute alias for inner text
- <phoneme ph="..."> - supply phonemes for inner text
  - ph - phonemes for each word of inner text, separated by whitespace
  - alphabet - if "ipa", phonemes are intelligently split ("aːˈb" -> "aː", "ˈb")
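Two of the attribute rules in this list lend themselves to a short illustration: converting a break time to milliseconds, and splitting an IPA string into phonemes. Both helpers below are simplified sketches with hypothetical names (`break_time_ms`, `split_ipa`); they are not gruut's or gruut-ipa's implementation, and the splitting rule (stress marks start a phoneme; length marks, combining diacritics, and tie bars extend one) is an assumption that happens to reproduce the example above.

```python
import re
import unicodedata
from typing import List

_BREAK_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(ms|s)\s*$")

STRESS = {"\u02c8", "\u02cc"}  # primary and secondary stress marks
LENGTH = {"\u02d0", "\u02d1"}  # long and half-long marks
TIE = "\u0361"                 # tie bar joining two symbols (as in an affricate)


def break_time_ms(time_attr: str) -> int:
    """Convert a time such as "123s" or "123ms" to milliseconds."""
    match = _BREAK_RE.match(time_attr)
    if match is None:
        raise ValueError(f"unsupported break time: {time_attr!r}")
    value, unit = float(match.group(1)), match.group(2)
    return round(value * 1000) if unit == "s" else round(value)


def split_ipa(text: str) -> List[str]:
    """Split an IPA string into phonemes (simplified, assumption-based rule)."""
    phonemes: List[str] = []
    pending = ""  # stress marks waiting for the phoneme they belong to
    for char in text:
        if char.isspace():
            continue
        if char in STRESS:
            pending += char
        elif phonemes and not pending and (
            char in LENGTH
            or unicodedata.combining(char)
            or phonemes[-1].endswith(TIE)
        ):
            phonemes[-1] += char  # extend the previous phoneme
        else:
            phonemes.append(pending + char)  # start a new phoneme
            pending = ""
    return phonemes


print(break_time_ms("123s"), break_time_ms("123ms"))  # 123000 123
print(split_ipa("a\u02d0\u02c8b"))  # the "aːˈb" example: two phonemes
```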
Word Roles
During phonemization, word roles are used to disambiguate pronunciations. Unless manually specified, a word's role is derived from its part of speech tag as gruut:<TAG>. For initialisms and spell-out, the role gruut:letter is used to indicate that e.g., "a" should be spoken as /eɪ/ instead of /ə/.
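Role-based disambiguation can be pictured as a lexicon keyed first by word and then by role. The sketch below is illustrative only: the `LEXICON` layout, the empty-string default role, and the specific tags `VBD` and `NN` for "wound" are assumptions, while the phonemes are taken from the examples in this document.

```python
from typing import Dict, List, Optional

DEFAULT_ROLE = ""  # hypothetical key for "no specific role"

# word -> role -> phonemes (layout assumed for illustration)
LEXICON: Dict[str, Dict[str, List[str]]] = {
    "wound": {
        "gruut:VBD": ["w", "ˈaʊ", "n", "d"],  # past tense of "wind"
        "gruut:NN": ["w", "ˈu", "n", "d"],    # an injury
    },
    "a": {
        DEFAULT_ROLE: ["ə"],
        "gruut:letter": ["eɪ"],  # spelled out as a letter
    },
}


def role_from_pos(pos: Optional[str]) -> str:
    """Derive a role of the form gruut:<TAG> from a part of speech tag."""
    return f"gruut:{pos}" if pos else DEFAULT_ROLE


def lookup(word: str, role: str) -> Optional[List[str]]:
    """Return the pronunciation for a word in a given role, if known."""
    by_role = LEXICON.get(word.lower())
    if not by_role:
        return None
    if role in by_role:
        return by_role[role]
    # Fall back to the default role, then to any known pronunciation
    if DEFAULT_ROLE in by_role:
        return by_role[DEFAULT_ROLE]
    return next(iter(by_role.values()))


print(lookup("wound", role_from_pos("VBD")))
print(lookup("wound", role_from_pos("NN")))
print(lookup("a", "gruut:letter"))
```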
Intended Audience
gruut is useful for transforming raw text into phonetic pronunciations, similar to phonemizer. Unlike phonemizer, gruut looks up words in a pre-built lexicon (pronunciation dictionary) or guesses word pronunciations with a pre-trained grapheme-to-phoneme model. Phonemes for each language come from a carefully chosen inventory.
For each supported language, gruut includes:
- A word pronunciation lexicon built from open source data
  - See pron_dict
- A pre-trained grapheme-to-phoneme model for guessing word pronunciations
Some languages also include:
- A pre-trained part of speech tagger built from open source data
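The lexicon-first strategy with a guessing fallback, as described in this section, can be sketched as follows. The `phonemize` function and the `spell_guess` stand-in are hypothetical: a real grapheme-to-phoneme model is trained on data, whereas the stand-in simply returns the word's letters so that the example runs without any model files.

```python
from typing import Callable, Dict, List

Lexicon = Dict[str, List[str]]


def phonemize(word: str, lexicon: Lexicon, guess: Callable[[str], List[str]]) -> List[str]:
    """Look the word up in the lexicon; guess its pronunciation if it is missing."""
    known = lexicon.get(word.lower())
    if known is not None:
        return known
    return guess(word)


def spell_guess(word: str) -> List[str]:
    """Stand-in for a trained grapheme-to-phoneme model (not a real one)."""
    return list(word.lower())


# Tiny lexicon using a pronunciation shown earlier in this document
lexicon: Lexicon = {"text": ["t", "ˈɛ", "k", "s", "t"]}

print(phonemize("Text", lexicon, spell_guess))   # found in the lexicon
print(phonemize("gruut", lexicon, spell_guess))  # falls back to the guesser
```

Passing the guesser as a callable keeps the lookup logic independent of whichever model produces the guesses.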