Compound word splitter, dictionary-based

wikdict-compound

This library splits compound words into their individual parts. It uses a large dictionary that includes inflected forms and keeps the number of language-specific rules to a minimum in order to support a variety of languages. The dictionaries come from Wiktionary via WikDict and are licensed under Creative Commons BY-SA.

Installation

Install this library using pip:

pip install wikdict-compound

Usage

Create Required Databases

To use wikdict-compound, you need a database with the required compound splitting dictionaries. These are created from the WikDict dictionaries at https://download.wikdict.com/dictionaries/sqlite/2/. For each language you want to use:

  • Download the corresponding WikDict SQLite dictionary (e.g. de.sqlite3 for German)
  • Call make_db(lang, input_path, output_path), where input_path is the directory containing the WikDict dictionary and output_path is the directory where the generated compound splitting database should be placed.
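
The two steps above can be sketched in Python as follows. This is a minimal sketch under several assumptions: that make_db can be imported from the package top level (as split_compound is), that it accepts pathlib.Path arguments, and that each dictionary is published as <lang>.sqlite3 under the URL given above (inferred from the de.sqlite3 example). The helper names dictionary_url and build_compound_db are hypothetical and not part of the library.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL taken from the text above. The "<lang>.sqlite3" file name
# pattern is an assumption inferred from the "de.sqlite3" example.
BASE_URL = "https://download.wikdict.com/dictionaries/sqlite/2/"


def dictionary_url(lang: str) -> str:
    """Return the assumed download URL of the WikDict dictionary for lang."""
    return f"{BASE_URL}{lang}.sqlite3"


def build_compound_db(lang: str, input_path: Path, output_path: Path) -> None:
    """Download the WikDict dictionary if missing, then run make_db."""
    # Assumption: make_db lives at the package top level. It is imported
    # here so that dictionary_url() works without the package installed.
    from wikdict_compound import make_db

    input_path.mkdir(parents=True, exist_ok=True)
    output_path.mkdir(parents=True, exist_ok=True)

    target = input_path / f"{lang}.sqlite3"
    if not target.exists():
        urlretrieve(dictionary_url(lang), target)

    # Signature as documented above: make_db(lang, input_path, output_path)
    make_db(lang, input_path, output_path)
```

Calling build_compound_db("de", Path("wikdict"), Path("compound_dbs")) would then leave the German compound database in compound_dbs, the directory used as db_path in the splitting example.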

Split Compound Words

from wikdict_compound import split_compound

parts = split_compound(db_path='compound_dbs', lang='de', compound='Gartenschere')

This returns the words that form the compound, in order, each paired with an importance rating. In this case the result is [('Garten', 1.4645167634735892), ('Schere', 1.1692122623775094)].
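
Since the result is a list of (word, rating) tuples, it can be processed with ordinary Python. The following sketch works on the example result shown above rather than calling the library, so it runs without a compound database; the variable names are illustrative only.

```python
# Result copied from the example above; in practice this list would come
# from split_compound(db_path=..., lang=..., compound=...).
parts = [("Garten", 1.4645167634735892), ("Schere", 1.1692122623775094)]

# Keep only the words, preserving their order within the compound.
words = [word for word, _rating in parts]

# Pick the part with the highest importance rating.
most_important, top_rating = max(parts, key=lambda part: part[1])

print(words)           # ['Garten', 'Schere']
print(most_important)  # Garten
```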

Supported Languages and Splitting Quality

The results for each language are compared against compound word information from Wikidata. For each language, a success range is given: the higher value counts all compounds for which any split was found, while the lower value counts only those whose split matches the one on Wikidata. Since some words have multiple valid splits and the Wikidata entries are not perfect either, the true success rate should lie somewhere within this range.

  • de: 77.0%-96.2% success, tested over 2984 cases
  • en: 64.2%-99.6% success, tested over 16061 cases
  • es: 13.9%-27.5% success, tested over 1000 cases
  • fi: 76.9%-89.2% success, tested over 65 cases
  • fr: 14.0%-33.8% success, tested over 328 cases
  • it: 14.0%-30.1% success, tested over 136 cases
  • nl: 33.3%-100.0% success, tested over 3 cases
  • pl: 29.5%-69.5% success, tested over 220 cases
  • sv: 75.7%-96.7% success, tested over 5922 cases
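
To make the two bounds concrete, here is a small sketch of how such a range could be computed from a list of test cases. It is not the project's actual evaluation code: the success_range helper, the representation of a failed split as None, and the toy data are all assumptions made for illustration.

```python
from typing import Optional, Sequence

# One test case: (parts expected per Wikidata, parts found or None).
# Representing "no split found" as None is an assumption of this sketch.
Case = tuple[Sequence[str], Optional[Sequence[str]]]


def success_range(cases: Sequence[Case]) -> tuple[float, float]:
    """Return (lower, upper) success rates as fractions of all cases."""
    total = len(cases)
    # Upper bound: any split was found at all.
    found = sum(1 for _expected, actual in cases if actual is not None)
    # Lower bound: the split found is identical to the expected one.
    exact = sum(
        1
        for expected, actual in cases
        if actual is not None and list(actual) == list(expected)
    )
    return exact / total, found / total


# Made-up placeholder data: three cases, all split, one matching exactly.
toy_cases: list[Case] = [
    (["a", "b"], ["a", "b"]),
    (["c", "d"], ["cd", "x"]),
    (["e", "f"], ["e", "g"]),
]

low, high = success_range(toy_cases)
summary = f"{low:.1%}-{high:.1%}"
print(summary)  # 33.3%-100.0%
```

With three cases, one exact match and three found splits give 33.3%-100.0%, which is the same arithmetic as the nl row above. A range this wide over so few cases says little about real splitting quality.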

Development

To contribute to this library, first check out the code. Then create a new virtual environment:

cd wikdict-compound
python -m venv .venv
source .venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'
