Zhon provides constants used in Chinese text processing.
Zhon is a Python module that provides constants commonly used in Chinese text processing:
- Chinese characters
- Chinese punctuation
- Pinyin and Zhuyin characters
- Traditional and simplified characters
- ASCII characters
- Fullwidth alphanumeric variants
- Chinese radicals (as used in dictionaries)
Zhon’s constants are strings containing Unicode code point ranges. Each one can be placed inside a regular expression character class, and several can be combined in the same class, to compile RE pattern objects as needed.
>>> import re
>>> import zhon.unicode
>>> re.findall('[%s]' % zhon.unicode.HAN_IDEOGRAPHS, 'Hello = 你好')
['你', '好']
>>> re.split('[%s]' % zhon.unicode.PUNCTUATION, '有人丢失了一把斧子，怎么找也没有找到。')
['有人丢失了一把斧子', '怎么找也没有找到', '']
>>> not_zh_re = re.compile('[^%s%s]' % (zhon.unicode.HAN_IDEOGRAPHS, zhon.unicode.PUNCTUATION))
>>> not_zh_re.findall('我叫Thomas。你叫什么名字？')
['T', 'h', 'o', 'm', 'a', 's']
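To make the mechanics concrete, here is a minimal sketch of how range strings behave inside a character class. The constants `URO_SKETCH` and `FULLWIDTH_DIGITS_SKETCH` and the helper `char_class` are hypothetical stand-ins defined only for this illustration; they are not Zhon's actual values or API, and they cover far fewer characters than Zhon's constants do.

```python
import re

# Hypothetical stand-ins (assumptions), NOT Zhon's actual constants.
URO_SKETCH = '\u4e00-\u9fff'               # CJK Unified Ideographs block
FULLWIDTH_DIGITS_SKETCH = '\uff10-\uff19'  # fullwidth 0-9


def char_class(*ranges, negate=False):
    """Join range strings into a single regex character class."""
    return '[%s%s]' % ('^' if negate else '', ''.join(ranges))


# One range string on its own.
han_re = re.compile(char_class(URO_SKETCH))

# Two range strings combined in the same class.
han_or_digit_re = re.compile(char_class(URO_SKETCH, FULLWIDTH_DIGITS_SKETCH))

# A negated class matches everything outside the given ranges.
not_han_re = re.compile(char_class(URO_SKETCH, negate=True))

print(han_re.findall('Hello = 你好'))
print(han_or_digit_re.findall('第３课'))
print(not_han_re.findall('你a好b'))
```

Because each constant is only the inside of a character class, the caller decides whether to wrap it in `[...]`, negate it with `[^...]`, or concatenate it with other ranges first.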
Overview
- zhon.unicode.HAN_IDEOGRAPHS
This represents every Chinese character, including historic and rare characters. HAN_IDEOGRAPHS includes CJK Unified Ideographs, CJK Unified Ideographs Extensions A-D, CJK Compatibility Ideographs, CJK Compatibility Ideographs Supplement, and the extension to the URO (Unified Repertoire and Ordering). More information is available in Chapter 12 of the Unicode Standard.
- zhon.unicode.PUNCTUATION
This contains punctuation used in Chinese text.
- zhon.unicode.PINYIN
This contains characters used in Pinyin (both numbered and accented).
- zhon.unicode.ZHUYIN
This contains characters used in Zhuyin (Bopomofo).
- zhon.unicode.ASCII
This contains all ASCII characters.
- zhon.unicode.FULLWIDTH_ALPHANUMERIC
This contains the fullwidth variants for A-Z, a-z, and 0-9.
- zhon.unicode.RADICALS
This contains the Kangxi radicals and the CJK Radicals Supplement. They are used in dictionaries to index characters.
- zhon.cedict.TRADITIONAL
This contains characters considered by CC-CEDICT to be traditional.
- zhon.cedict.SIMPLIFIED
This contains characters considered by CC-CEDICT to be simplified.
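As one illustration of how a constant from this list might be used, the sketch below finds fullwidth alphanumerics and maps them to their ASCII counterparts. `FULLWIDTH_SKETCH` is a hypothetical stand-in written out from the Unicode Halfwidth and Fullwidth Forms block, not the actual value of `zhon.unicode.FULLWIDTH_ALPHANUMERIC`, and `to_ascii` is a helper invented for this example.

```python
import re

# Hypothetical stand-in (assumption): fullwidth A-Z, a-z, and 0-9.
FULLWIDTH_SKETCH = '\uff21-\uff3a\uff41-\uff5a\uff10-\uff19'

fullwidth_re = re.compile('[%s]' % FULLWIDTH_SKETCH)

# Fullwidth alphanumerics sit at a fixed offset from their ASCII forms.
FULLWIDTH_OFFSET = 0xFEE0


def to_ascii(text):
    """Replace each fullwidth alphanumeric with its ASCII counterpart."""
    return fullwidth_re.sub(
        lambda match: chr(ord(match.group()) - FULLWIDTH_OFFSET), text)


print(to_ascii('ＡＢＣ１２３ 你好'))
```

Characters outside the range string, such as the Chinese characters in the sample input, pass through unchanged.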
Narrow Python Builds
If you have a narrow Python 2 build and run the following code, a ValueError is raised:
>>> unichr(0x20000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
Narrow Python 3.1/3.2 builds have problems compiling RE pattern objects that use character ranges above 0xFFFF:
>>> re.compile('[\U00020000-\U00020005]')
Traceback (most recent call last):
...
sre_constants.error: bad character range
Narrow Python builds incorrectly handle the character U+20000 and others like it. Zhon takes this into account when building its constants so that you don’t have to worry about it: code points greater than your Python build’s sys.maxunicode are not included in Zhon’s constants.
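The idea of leaving out code points above sys.maxunicode can be sketched as follows. This is not Zhon's actual implementation; `SKETCH_RANGES` and `build_range_string` are hypothetical names, and the three ranges are a small sample chosen for illustration. The sketch is written for Python 3, where `chr` replaces `unichr`.

```python
import sys

# Hypothetical sample of (start, end) code point pairs (assumption).
SKETCH_RANGES = [
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B, above U+FFFF
]


def build_range_string(ranges, limit=sys.maxunicode):
    """Build a character-class body, omitting code points above limit."""
    parts = []
    for start, end in ranges:
        if start > limit:
            # The whole range is out of reach for this build; skip it.
            continue
        # Clip a range that only partly fits under the limit.
        parts.append('%s-%s' % (chr(start), chr(min(end, limit))))
    return ''.join(parts)


# Simulate a narrow build by passing its maxunicode value explicitly.
narrow = build_range_string(SKETCH_RANGES, limit=0xFFFF)
wide = build_range_string(SKETCH_RANGES, limit=0x10FFFF)

print(len(narrow), len(wide))
```

Passing `limit` explicitly makes it possible to see the narrow-build result on a wide build, where sys.maxunicode is 0x10FFFF.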
Name
Zhon is short for ZHongwen cONstants. It is pronounced like the name ‘John’.
Requirements
Zhon supports Python 2.6, 2.7, 3.1, 3.2, and 3.3.
Install
Just use pip:
$ pip install zhon
Bugs/Feature Requests
Zhon uses its GitHub Issues page to track bugs, feature requests, and support questions.
License
Zhon is released under the OSI-approved MIT License. See the file LICENSE.txt for more information.