Han character library for CJKV languages

Introduction

Cjklib provides language routines for Han characters, the characters based on Chinese script that are used to write Chinese, Japanese, infrequently Korean and formerly Vietnamese (where they are named Hanzi, Kanji, Hanja and chu Han respectively). It covers character pronunciations, radicals, glyph components, stroke decomposition and variant information.

Dependencies

Alternatively for MySQL as backend:

Installing

Windows

Install cjklib using the provided .exe installer. Make sure the dependencies listed above are satisfied.

Three scripts, cjknife.exe, buildcjkdb.exe and installcjkdict.exe, will be added to the Python Scripts sub-directory. Make sure this directory is included in your PATH environment variable so that you can run these programs from the command line.

CJK dictionaries are not included by default. To install one, run the following (an Internet connection is required) from the root directory of the source package:

$ installcjkdict CEDICT

This will download CEDICT, create a SQLite database file and install it under the directory given by the APPDATA environment variable, e.g. C:\windows\profiles\MY_USER\Application Data\cjklib. To install a different dictionary, replace CEDICT with the name of any other supported dictionary (EDICT, CEDICT, HanDeDict, CFDICT or CEDICTGR).

Unix

If you are installing from the source package you need to deploy the library on your system:

$ sudo python setup.py install

Also make sure the dependencies listed above are satisfied. CJK dictionaries are not included by default. To install one, run the following (an Internet connection is required):

$ sudo installcjkdict CEDICT

This will download CEDICT, create a SQLite database file and install it to /usr/local/share/cjklib. To install a different dictionary, replace CEDICT with the name of any other supported dictionary (EDICT, CEDICT, HanDeDict, CFDICT or CEDICTGR).
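
To find the installed dictionary database from your own scripts, the two locations described above can be computed as in the following minimal sketch. The helper name dictionary_dir is hypothetical and not part of cjklib; it only encodes the paths stated in this section.

```python
import ntpath
import posixpath


def dictionary_dir(platform, environ):
    """Return the dictionary directory named in the install notes.

    Hypothetical helper for illustration only; not a cjklib function.
    """
    if platform.startswith("win"):
        # Windows: a cjklib folder below the directory given by APPDATA.
        return ntpath.join(environ["APPDATA"], "cjklib")
    # Unix: the system-wide location used by `sudo installcjkdict`.
    return posixpath.join("/usr/local/share", "cjklib")


print(dictionary_dir("win32", {"APPDATA": r"C:\windows\profiles\MY_USER\Application Data"}))
print(dictionary_dir("linux", {}))
```

In a real script you would pass sys.platform and os.environ instead of the literal values used here.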

Documentation & Usage

Documentation is available online. Also see the project page and its wiki. There is a small command line tool cjknife that offers some of the library’s functions. See cjknife --help for an overview.

Examples

  • Get stroke order of characters:

    >>> from cjklib import characterlookup
    >>> cjk = characterlookup.CharacterLookup('C')
    >>> cjk.getStrokeOrder(u'说')
    [u'㇔', u'㇊', u'㇔', u'㇒', u'㇑', u'㇕', u'㇐', u'㇓', u'㇟']

  • Access a dictionary (here using Jim Breen’s EDICT):

    >>> from cjklib.dictionary import EDICT
    >>> d = EDICT()
    >>> d.getForTranslation('Tokyo')
    [EntryTuple(Headword=u'東京', Reading=u'とうきょう',
    Translation=u'/(n) Tokyo (current capital of Japan)/(P)/')]
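
The stroke list in the first example is made up of single Unicode characters. The following sketch uses only the Python standard library (no cjklib call) to print the code point and Unicode name of each stroke; it assumes the strokes come from the Unicode CJK Strokes block, and the helper name describe is chosen here for illustration.

```python
import unicodedata

# Stroke list copied from the getStrokeOrder() example above.
strokes = ['㇔', '㇊', '㇔', '㇒', '㇑', '㇕', '㇐', '㇓', '㇟']


def describe(stroke):
    """Return (code point label, Unicode name) for one stroke character."""
    return "U+%04X" % ord(stroke), unicodedata.name(stroke, "<unnamed>")


for stroke in strokes:
    code_point, name = describe(stroke)
    print(stroke, code_point, name)

# The length of the list is the stroke count of the character.
print("stroke count:", len(strokes))
```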

Database

Packaged versions of the library ship with a pre-built SQLite database file. You can, however, easily rebuild the database yourself.

First download the newest Unihan file:

$ wget ftp://ftp.unicode.org/Public/UNIDATA/Unihan.zip

Then start the build process:

$ sudo buildcjkdb -r build cjklibData

SQLite

By default, SQLite has no Unicode support for string operations. Optionally, the ICU library can be compiled in to handle alphabetic non-ASCII characters. If ICU support is missing, cjklib can register its own Unicode functions; queries with LIKE will then use the function lower(). This compatibility mode hurts performance, and because it is not needed for dictionaries like EDICT or CEDICT, it is disabled by default. See cjklib.conf to enable it.
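
The idea behind this compatibility mode can be illustrated with Python's standard sqlite3 module. This is a minimal sketch and not cjklib's own implementation: it registers a Unicode-aware lower() and lowercases both sides of a LIKE comparison. The table name words and the helper search are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Replace the built-in lower() with Python's Unicode-aware str.lower.
# This mirrors the idea of the compatibility mode, not cjklib's code.
conn.create_function(
    "lower", 1, lambda s: s.lower() if s is not None else None
)

conn.execute("CREATE TABLE words (headword TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [("Ärger",), ("Zürich",), ("ärgern",)],
)


def search(pattern):
    """LIKE search that lowercases both sides through lower()."""
    rows = conn.execute(
        "SELECT headword FROM words "
        "WHERE lower(headword) LIKE lower(?) ORDER BY headword",
        (pattern,),
    )
    return [row[0] for row in rows]


print(search("ÄRGER%"))
```

Every comparison now calls back into Python, which is one reason a mode like this costs performance.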

MySQL

MySQL 5.5.3 and later support full Unicode through the character set utf8mb4 and the collation utf8mb4_bin. With MySQL 5, the following CREATE command creates a database with utf8 as the character set and utf8_bin as the collation:

CREATE DATABASE cjklib DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;

You might need to set access rights, too (substitute user_name and host_name):

GRANT ALL ON cjklib.* TO 'user_name'@'host_name';

Now update the settings in cjklib.conf.

MySQL < 5.5 does not support full UTF-8. Its utf8 character set uses at most 3 bytes per character, so characters outside the Basic Multilingual Plane (BMP) cannot be encoded. Building the Unihan database might therefore produce warnings, and characters above U+FFFF cannot be built at all. Before building, disable the full character range by setting wideBuild to False in cjklib.conf, or pass --wideBuild=False to buildcjkdb.
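
The 3-byte limit can be checked with plain Python. The sketch below prints the code point, the UTF-8 length and whether the character lies above U+FFFF; the helper names needs_wide_build and utf8_length are hypothetical and not part of cjklib.

```python
def needs_wide_build(char):
    """True if the character lies outside the BMP (above U+FFFF)."""
    return ord(char) > 0xFFFF


def utf8_length(char):
    """Number of bytes the character occupies in UTF-8."""
    return len(char.encode("utf-8"))


# Two BMP characters and U+20000, the first code point of CJK Extension B.
samples = ["说", "東", "\U00020000"]
for ch in samples:
    print("U+%04X" % ord(ch), utf8_length(ch), needs_wide_build(ch))
```

Characters for which needs_wide_build returns True need 4 bytes in UTF-8 and are the ones a 3-byte utf8 column cannot store.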

Contact

For help or discussions on cjklib, join cjklib-devel@googlegroups.com.

Please report bugs to the project’s bug tracker.
