Python module that identifies Chinese text as Simplified or Traditional.
Project description
Hanzi Identifier is a simple Python module that identifies a string of text as having Simplified or Traditional characters.
Free software: MIT license
About
Easy-to-use helper functions for identifying strings:
>>> import hanzidentifier
>>> hanzidentifier.has_chinese('Hello my name is John.')
False
>>> hanzidentifier.is_simplified('John说:你好!')
True
>>> hanzidentifier.is_traditional('John說:你好!')
True
>>> hanzidentifier.has_chinese('Country in Simplified: 国家. Country in Traditional: 國家.')
True
Here it is without the helper functions:
>>> hanzidentifier.identify('Hello my name is Thomas.') is hanzidentifier.UNKNOWN
True
>>> hanzidentifier.identify('Thomas 说:你好!') is hanzidentifier.SIMPLIFIED
True
>>> hanzidentifier.identify('Thomas 說:你好!') is hanzidentifier.TRADITIONAL
True
>>> hanzidentifier.identify('你好!') is hanzidentifier.BOTH
True
>>> hanzidentifier.identify('Country in Simplified: 国家. Country in Traditional: 國家.') is hanzidentifier.MIXED
True
hanzidentifier.identify has five possible return values:
hanzidentifier.UNKNOWN: there are no recognized Chinese characters in the string.
hanzidentifier.BOTH: the string is compatible with both Simplified and Traditional character systems.
hanzidentifier.TRADITIONAL: the string consists of Traditional characters.
hanzidentifier.SIMPLIFIED: the string consists of Simplified characters.
hanzidentifier.MIXED: the string contains characters recognized solely as Traditional as well as characters recognized solely as Simplified.
Characters that aren’t found in CC-CEDICT are ignored when determining a string’s identity. Hanzi Identifier uses the CC-CEDICT data provided by Zhon to identify Chinese characters.
Because the Traditional and Simplified Chinese character systems overlap, a string containing Simplified characters could identify as hanzidentifier.SIMPLIFIED or hanzidentifier.BOTH, depending on whether its characters are also Traditional characters.
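The categories above can be pictured as set comparisons. The following is a minimal sketch of that idea, not Hanzi Identifier's actual implementation: the character sets are tiny hypothetical stand-ins for the CC-CEDICT data, and the function name classify and its string labels are assumptions made for illustration.

```python
# Hypothetical toy data: a handful of characters standing in for the
# CC-CEDICT-derived sets. These are NOT the library's real tables.
TRADITIONAL_CHARS = set("說國家你好")
SIMPLIFIED_CHARS = set("说国家你好")
SHARED_CHARS = TRADITIONAL_CHARS & SIMPLIFIED_CHARS  # characters valid in both systems
KNOWN_CHARS = TRADITIONAL_CHARS | SIMPLIFIED_CHARS


def classify(text):
    """Return a label mirroring the five categories (illustrative only)."""
    # Characters outside the known sets (Latin letters, punctuation) are ignored.
    hanzi = set(text) & KNOWN_CHARS
    if not hanzi:
        return "UNKNOWN"
    if hanzi <= SHARED_CHARS:
        return "BOTH"
    if hanzi <= TRADITIONAL_CHARS:
        return "TRADITIONAL"
    if hanzi <= SIMPLIFIED_CHARS:
        return "SIMPLIFIED"
    # Some characters are Traditional-only and others are Simplified-only.
    return "MIXED"
```

With these toy sets, classify('你好') falls into the "BOTH" case because both characters belong to the shared set, while classify('John说:你好') falls into "SIMPLIFIED" because 说 appears only in the Simplified set.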
Hanzi Identifier’s functions accept and return unicode.
Getting Started
Install Hanzi Identifier: $ pip install hanzidentifier
Report bugs and ask questions via GitHub Issues
Change Log
v1.0.2 (2015-08-06)
New README format
Adds Travis CI support
Uses io.open() in setup.py. Fixes #1.
v1.0.1 (2014-04-14)
Fixes URL typo.
v1.0 (2014-04-12)
Version 1.0 merges some changes from Dragon Mapper. It is not backwards compatible with the previous versions of Hanzi Identifier (e.g. some of the constants are named differently).
Merges code from Dragon Mapper project.
Adds tox support.
v0.1 (2013-04-24)
Initial release.
Download files
Source Distribution
File details
Details for the file hanzidentifier-1.0.2.tar.gz.
File metadata
- Download URL: hanzidentifier-1.0.2.tar.gz
- Upload date:
- Size: 4.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 793a298430aa9a9d6ab344dc0ca0ab4bd1161d88c7da941d6554571093003cba
MD5 | 7dfa853288a429878b225848b3bed8ed
BLAKE2b-256 | 48f83903525d9f63de307dfcf311be64d43e6ab88aa175b6daa4d159d17b3933