Python module that identifies Chinese text as Simplified or Traditional.

Hanzi Identifier is a simple Python module that identifies a string of text as having Simplified or Traditional characters.

About

Easy-to-use helper functions for identifying strings:

>>> import hanzidentifier
>>> hanzidentifier.has_chinese('Hello my name is John.')
False
>>> hanzidentifier.is_simplified('John说:你好!')
True
>>> hanzidentifier.is_traditional('John說:你好!')
True
>>> hanzidentifier.has_chinese('Country in Simplified: 国家. Country in Traditional: 國家.')
True

The same checks without the helper functions, calling hanzidentifier.identify directly:

>>> hanzidentifier.identify('Hello my name is Thomas.') is hanzidentifier.UNKNOWN
True
>>> hanzidentifier.identify('Thomas 说:你好!') is hanzidentifier.SIMPLIFIED
True
>>> hanzidentifier.identify('Thomas 說:你好!') is hanzidentifier.TRADITIONAL
True
>>> hanzidentifier.identify('你好!') is hanzidentifier.BOTH
True
>>> hanzidentifier.identify('Country in Simplified: 国家. Country in Traditional: 國家.') is hanzidentifier.MIXED
True

hanzidentifier.identify has five possible return values:

  • hanzidentifier.UNKNOWN: there are no recognized Chinese characters in the string.

  • hanzidentifier.BOTH: the string is compatible with both Simplified and Traditional character systems.

  • hanzidentifier.TRADITIONAL: the string consists of Traditional characters.

  • hanzidentifier.SIMPLIFIED: the string consists of Simplified characters.

  • hanzidentifier.MIXED: the string contains both characters recognized only as Traditional and characters recognized only as Simplified.

Hanzi Identifier uses the CC-CEDICT data provided by Zhon to identify Chinese characters. Characters that aren’t found in CC-CEDICT are ignored when determining a string’s identity.
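The five identities can be illustrated with plain set operations. This is a minimal sketch and not the library's actual implementation: the tiny character sets are hypothetical stand-ins for the CC-CEDICT data, and the function name classify and its string labels are invented for illustration.

```python
# Hypothetical stand-ins for the CC-CEDICT-derived character sets.
# The real data covers far more characters; these few are only
# enough to make the sketch runnable.
SHARED = set("你好家")           # used in both systems
SIMPLIFIED_ONLY = set("说国")    # Simplified forms
TRADITIONAL_ONLY = set("說國")   # Traditional forms
KNOWN = SHARED | SIMPLIFIED_ONLY | TRADITIONAL_ONLY


def classify(text):
    """Return one of five string labels mirroring the identities above."""
    # Characters outside the known sets (Latin letters, punctuation,
    # spaces) are ignored.
    chars = {ch for ch in text if ch in KNOWN}
    if not chars:
        return "UNKNOWN"
    has_simplified = bool(chars & SIMPLIFIED_ONLY)
    has_traditional = bool(chars & TRADITIONAL_ONLY)
    if has_simplified and has_traditional:
        return "MIXED"
    if has_simplified:
        return "SIMPLIFIED"
    if has_traditional:
        return "TRADITIONAL"
    return "BOTH"  # every recognized character is shared


print(classify("Thomas 说:你好!"))   # SIMPLIFIED
print(classify("你好!"))             # BOTH
```

In this sketch a single Simplified-only character is enough to move a string from BOTH to SIMPLIFIED, which is why short strings of common characters often come back as BOTH.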

Because the Traditional and Simplified Chinese character systems overlap, a string containing Simplified characters could identify as hanzidentifier.SIMPLIFIED or hanzidentifier.BOTH, depending on whether those characters are also Traditional characters.
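One practical consequence of this overlap is that code asking "can this text be treated as Simplified?" probably wants to accept BOTH as well as SIMPLIFIED. The sketch below shows that check under stated assumptions: the string labels are hypothetical stand-ins for the hanzidentifier constants (so the example runs without the package installed), and the function names are invented for illustration.

```python
# Hypothetical string labels standing in for the hanzidentifier constants.
UNKNOWN, BOTH, TRADITIONAL, SIMPLIFIED, MIXED = (
    "UNKNOWN", "BOTH", "TRADITIONAL", "SIMPLIFIED", "MIXED"
)


def reads_as_simplified(identity):
    """True when the identity is compatible with the Simplified system."""
    return identity in (SIMPLIFIED, BOTH)


def reads_as_traditional(identity):
    """True when the identity is compatible with the Traditional system."""
    return identity in (TRADITIONAL, BOTH)


# A BOTH result satisfies either check; MIXED and UNKNOWN satisfy neither.
for identity in (UNKNOWN, BOTH, TRADITIONAL, SIMPLIFIED, MIXED):
    print(identity, reads_as_simplified(identity), reads_as_traditional(identity))
```

With the real package, the value passed in would be the result of hanzidentifier.identify, compared against the module's own constants rather than these strings.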

Getting Started
