Python module that identifies Chinese text as Simplified or Traditional.

Project description

Hanzi Identifier is a simple Python module that identifies a string of text as having Simplified or Traditional characters.

About

Easy-to-use helper functions for identifying strings:

>>> import hanzidentifier
>>> hanzidentifier.has_chinese('Hello my name is John.')
False
>>> hanzidentifier.is_simplified('John说:你好!')
True
>>> hanzidentifier.is_traditional('John說:你好!')
True
>>> hanzidentifier.has_chinese('Country in Simplified: 国家. Country in Traditional: 國家.')
True

Here it is without the helper functions:

>>> hanzidentifier.identify('Hello my name is Thomas.') is hanzidentifier.UNKNOWN
True
>>> hanzidentifier.identify('Thomas 说:你好!') is hanzidentifier.SIMPLIFIED
True
>>> hanzidentifier.identify('Thomas 說:你好!') is hanzidentifier.TRADITIONAL
True
>>> hanzidentifier.identify('你好!') is hanzidentifier.BOTH
True
>>> hanzidentifier.identify('Country in Simplified: 国家. Country in Traditional: 國家.') is hanzidentifier.MIXED
True

hanzidentifier.identify has five possible return values:

  • hanzidentifier.UNKNOWN: there are no recognized Chinese characters in the string.

  • hanzidentifier.BOTH: the string is compatible with both Simplified and Traditional character systems.

  • hanzidentifier.TRADITIONAL: the string consists of Traditional characters.

  • hanzidentifier.SIMPLIFIED: the string consists of Simplified characters.

  • hanzidentifier.MIXED: the string contains both characters recognized solely as Traditional and characters recognized solely as Simplified.

Characters that aren’t found in CC-CEDICT are ignored when determining a string’s identity. Hanzi Identifier uses the CC-CEDICT data provided by Zhon to identify Chinese characters.

Because the Traditional and Simplified Chinese character systems overlap, a string containing Simplified characters could identify as hanzidentifier.SIMPLIFIED or hanzidentifier.BOTH, depending on whether those characters are also Traditional characters.
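
The overlap described above can be illustrated with plain set operations. The following is a minimal sketch, not Hanzi Identifier's actual implementation: the two character sets are tiny hand-picked stand-ins for the CC-CEDICT data, and the names Kind, SIMPLIFIED_CHARS, TRADITIONAL_CHARS, and classify are hypothetical.

```python
from enum import Enum

# Hypothetical stand-ins for the CC-CEDICT character data.
# Characters such as 你, 好 and 家 belong to both systems.
SIMPLIFIED_CHARS = set("说国家你好")
TRADITIONAL_CHARS = set("說國家你好")


class Kind(Enum):
    """Illustrative labels mirroring the five outcomes described above."""
    UNKNOWN = "unknown"
    BOTH = "both"
    TRADITIONAL = "traditional"
    SIMPLIFIED = "simplified"
    MIXED = "mixed"


def classify(text):
    """Classify text using only the characters found in the sample sets."""
    # Characters outside the known sets (Latin letters, punctuation,
    # unlisted hanzi) are ignored.
    known = set(text) & (SIMPLIFIED_CHARS | TRADITIONAL_CHARS)
    if not known:
        return Kind.UNKNOWN

    only_simplified = known - TRADITIONAL_CHARS
    only_traditional = known - SIMPLIFIED_CHARS

    if only_simplified and only_traditional:
        return Kind.MIXED
    if only_simplified:
        return Kind.SIMPLIFIED
    if only_traditional:
        return Kind.TRADITIONAL
    # Every known character belongs to both systems.
    return Kind.BOTH
```

In this sketch a single character exclusive to one system is enough to tip the result away from BOTH, which is why a short greeting made only of shared characters cannot be attributed to either system.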

Hanzi Identifier’s functions accept and return unicode.

Getting Started

Change Log

v1.1.0 (2020-10-15)

  • New function: count_chinese(). Thanks to ramwin.

  • Drop Python 2.

v1.0.2 (2015-08-06)

  • New README format

  • Adds Travis CI support

  • Uses io.open() in setup.py. Fixes #1.

v1.0.1 (2014-04-14)

  • Fixes URL typo.

v1.0 (2014-04-12)

Version 1.0 merges some changes from Dragon Mapper. It is not backwards compatible with the previous versions of Hanzi Identifier (e.g. some of the constants are named differently).

v0.1 (2013-04-24)

  • Initial release.
