Compound word splitter, dictionary-based

Project description

wikdict-compound

This library splits compound words into their individual parts. It uses a large dictionary that includes inflected forms and keeps the number of language-specific rules to a minimum in order to support a variety of languages. The dictionaries come from Wiktionary via WikDict and are licensed under Creative Commons BY-SA.

Installation

Install this library using pip:

pip install wikdict-compound

Usage

Create Required Databases

To use wikdict-compound, you need a database with the required compound splitting dictionaries. These are created from the WikDict dictionaries at https://download.wikdict.com/dictionaries/sqlite/2/. For each language you want to use:

  • Download the corresponding WikDict SQLite dictionary (e.g. de.sqlite3 for German).
  • Execute make_db(lang, input_path, output_path), where input_path is the directory containing the WikDict dictionary and output_path is the directory where the generated compound splitting database should be placed (see the sketch below).
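
A minimal sketch of this step, assuming make_db is imported from the wikdict_compound package; the directory names wikdict_dbs and compound_dbs are placeholders:

>>> from wikdict_compound import make_db
>>> make_db('de', input_path='wikdict_dbs', output_path='compound_dbs')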

Split Compound Words

>>> from wikdict_compound import split_compound
>>> split_compound(db_path='compound_dbs', lang='de', compound='Bücherkiste')
Solution(parts=[
    Part(written_rep='Buch', score=63.57055093514545, match='bücher'),
    Part(written_rep='Kiste', score=33.89508861315521, match='kiste')
])

The returned solution object has a parts attribute, which contains the separate word parts in the correct order, along with the matched word part and a matching score (mostly interesting when comparing different splitting possibilities for the same word).
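
For example, to get just the base forms of the parts, using the attributes shown in the output above (the variable name solution is arbitrary):

>>> solution = split_compound(db_path='compound_dbs', lang='de', compound='Bücherkiste')
>>> [part.written_rep for part in solution.parts]
['Buch', 'Kiste']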

Supported Languages and Splitting Quality

The results for each language are compared against compound word information from Wikidata. For each language, a success range is given: the higher value counts all compounds for which any splitting could be found, while the lower value counts only those whose splitting matches Wikidata. Since some words have multiple valid splittings and the Wikidata entries are not perfect either, the true success rate should lie somewhere within this range.

  • de: 81.8%-97.7% success, tested over 2984 cases
  • en: 69.6%-98.2% success, tested over 16061 cases
  • es: 27.5%-75.6% success, tested over 1000 cases
  • fi: 78.5%-96.9% success, tested over 65 cases
  • fr: 15.2%-36.3% success, tested over 328 cases
  • it: 18.4%-60.3% success, tested over 136 cases
  • nl: 33.3%-100.0% success, tested over 3 cases
  • pl: 30.9%-90.9% success, tested over 220 cases
  • sv: 75.7%-97.8% success, tested over 5979 cases

Development

To contribute to this library, first check out the code. Then create a new virtual environment:

cd wikdict-compound
python -m venv .venv
source .venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'
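
Assuming pytest is the test runner pulled in by the test extra (an assumption, not stated above), the tests can then be run with:

python -m pytest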

Related Resources

The approach is similar to the one described in Simple Compound Splitting for German (Weller-Di Marco, MWE 2017). I can also recommend the paper as an overview of the problems of and approaches to compound splitting for German.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wikdict-compound-0.2.tar.gz (8.2 kB view details)

Uploaded Source

Built Distribution

wikdict_compound-0.2-py3-none-any.whl (8.8 kB view details)

Uploaded Python 3

File details

Details for the file wikdict-compound-0.2.tar.gz.

File metadata

  • Download URL: wikdict-compound-0.2.tar.gz
  • Upload date:
  • Size: 8.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.1

File hashes

Hashes for wikdict-compound-0.2.tar.gz
Algorithm Hash digest
SHA256 f073da75469a7a9012c873e70585bfcfcf49c1b53ef7d249938fe3240a94c886
MD5 619e229b6856cd0cc8aefecc142e4227
BLAKE2b-256 b85c32da3eed59e4748c3361c12624750788300d5ad6797f0e5962df23d90863

See more details on using hashes here.

File details

Details for the file wikdict_compound-0.2-py3-none-any.whl.

File hashes

Hashes for wikdict_compound-0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 b7962d0b0482b4bc3e0898aa37c24fd92b0d4f9fa0eeadb1a2afd26446b0785b
MD5 444d16c785f2e1f1a2ef93ca0b408f77
BLAKE2b-256 9b3c8f79098d477ab4480983928bde30a8f3f2cd88bdd72d0376da14eb023cc1

See more details on using hashes here.
