Compound word splitter, dictionary-based
wikdict-compound
This library splits compound words into the individual parts. It uses a large dictionary including inflected forms and keeps the amount of language specific rules to a minimum in order to support a variety of languages. The dictionaries come from Wiktionary via WikDict and are licensed under Creative Commons BY-SA.
Installation
Install this library using pip:
pip install wikdict-compound
Usage
Create Required Databases
To use wikdict-compound, you need a database with the required compound splitting dictionaries. These are created from the WikDict dictionaries at https://download.wikdict.com/dictionaries/sqlite/2/. For each language you want to use:
- Download the corresponding WikDict SQLite dictionary (e.g. de.sqlite3 for German).
- Execute make_db(lang, input_path, output_path), where input_path contains the WikDict dictionary and output_path is the directory where the generated compound splitting db should be placed.
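The two steps above could be scripted roughly as follows. This is only a sketch under a few assumptions that the text does not confirm: that make_db can be imported from the top-level wikdict_compound package, that it accepts pathlib.Path arguments, and that the dictionary files on the download server are named <lang>.sqlite3 as in the German example. The helper names dictionary_url and build_compound_db are made up for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL taken from the text above.
WIKDICT_BASE_URL = "https://download.wikdict.com/dictionaries/sqlite/2/"


def dictionary_url(lang: str) -> str:
    """Build the download URL, assuming files are named <lang>.sqlite3."""
    return f"{WIKDICT_BASE_URL}{lang}.sqlite3"


def build_compound_db(lang: str, input_path: Path, output_path: Path) -> None:
    """Download the WikDict dictionary for `lang` and build the splitting db."""
    # Assumption: make_db is exposed at the package top level.
    from wikdict_compound import make_db

    input_path.mkdir(parents=True, exist_ok=True)
    output_path.mkdir(parents=True, exist_ok=True)

    # Step 1: fetch the dictionary unless it is already present.
    target = input_path / f"{lang}.sqlite3"
    if not target.exists():
        urlretrieve(dictionary_url(lang), target)

    # Step 2: generate the compound splitting db (signature from the text).
    make_db(lang, input_path, output_path)


# Example call (needs network access and wikdict-compound installed):
# build_compound_db("de", Path("wikdict"), Path("compound_dbs"))
```

Using compound_dbs as the output directory matches the db_path used in the splitting example.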
Split Compound Words
from wikdict_compound import split_compound
parts = split_compound(db_path='compound_dbs', lang='de', compound='Gartenschere')
This returns the list of words which form the compound in the correct order, along with a rating of the word importance, in this case [('Garten', 1.4645167634735892), ('Schere', 1.1692122623775094)].
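As a small illustration of working with this result, the sketch below takes the sample output quoted above as a literal (so it runs without a database) and pulls out the plain words and the part with the highest rating. It assumes only the list-of-(word, rating) shape shown in the example.

```python
# Sample result copied from the text above; in real use this would be
# the return value of split_compound(...).
parts = [("Garten", 1.4645167634735892), ("Schere", 1.1692122623775094)]

# The words alone, in the order they appear in the compound.
words = [word for word, _rating in parts]

# The part with the highest importance rating.
most_important = max(parts, key=lambda part: part[1])[0]

print(words)           # ['Garten', 'Schere']
print(most_important)  # Garten
```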
Supported Languages and Splitting Quality
The results for each language are compared against compound word information from Wikidata. For each language a success range is given, where the higher value includes all compounds where a splitting could be found while the lower value only counts those where the results are the same as on Wikidata. Since some words have multiple valid splittings and the Wikidata entries are not perfect either, the true success rate should be somewhere within this range.
- de: 77.0%-96.2% success, tested over 2984 cases
- en: 64.2%-99.6% success, tested over 16061 cases
- es: 13.9%-27.5% success, tested over 1000 cases
- fi: 76.9%-89.2% success, tested over 65 cases
- fr: 14.0%-33.8% success, tested over 328 cases
- it: 14.0%-30.1% success, tested over 136 cases
- nl: 33.3%-100.0% success, tested over 3 cases
- pl: 29.5%-69.5% success, tested over 220 cases
- sv: 75.7%-96.7% success, tested over 5922 cases
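The success range described above can be sketched as two ratios over the same test set: the lower bound counts only splits identical to the reference, while the upper bound counts every compound for which any split was found. The function below is a hypothetical illustration, not the project's evaluation code. It assumes that a missing split is represented as None and that parts are compared exactly, and the test cases are invented toy data.

```python
from typing import Optional, Sequence

# (predicted parts or None if no split was found, reference parts)
Case = tuple[Optional[Sequence[str]], Sequence[str]]


def success_range(cases: Sequence[Case]) -> tuple[float, float]:
    """Return (lower, upper) success rates as fractions between 0 and 1."""
    if not cases:
        raise ValueError("need at least one test case")
    total = len(cases)
    # Lower bound: the split matches the reference exactly.
    exact = sum(
        1
        for predicted, expected in cases
        if predicted is not None and list(predicted) == list(expected)
    )
    # Upper bound: any split was found, matching or not.
    found = sum(1 for predicted, _expected in cases if predicted is not None)
    return exact / total, found / total


# Invented toy cases: two exact matches, one differing split, one failure.
toy_cases: list[Case] = [
    (["Garten", "Schere"], ["Garten", "Schere"]),
    (["Haus", "Tür"], ["Haus", "Tür"]),
    (["Staub", "Ecken"], ["Stau", "Becken"]),
    (None, ["Wort", "Teil"]),
]

lower, upper = success_range(toy_cases)
print(f"{lower:.1%}-{upper:.1%} success, tested over {len(toy_cases)} cases")
# prints: 50.0%-75.0% success, tested over 4 cases
```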
Development
To contribute to this library, first check out the code. Then create a new virtual environment:
cd wikdict-compound
python -m venv .venv
source .venv/bin/activate
Now install the dependencies and test dependencies:
pip install -e '.[test]'