PyCantonese: Cantonese Linguistics and NLP in Python

Full Documentation: https://pycantonese.org


PyCantonese is a Python library for Cantonese linguistics and natural language processing (NLP). Currently implemented features (more to come!):

  • Accessing and searching corpus data

  • Parsing and conversion tools for Jyutping romanization

  • Stop words

  • Word segmentation

  • Part-of-speech tagging

Quick Examples

With PyCantonese imported:

>>> import pycantonese
  1. Word segmentation

>>> pycantonese.segment("廣東話好難學?")  # Is Cantonese difficult to learn?
['廣東話', '好', '難', '學', '?']
  2. Conversion from Cantonese characters to Jyutping

>>> pycantonese.characters_to_jyutping('香港人講廣東話')  # Hongkongers speak Cantonese
[('香港人', 'hoeng1gong2jan4'), ('講', 'gong2'), ('廣東話', 'gwong2dung1waa2')]
  3. Finding all verbs in the HKCanCor corpus

    In this example, we search with the regular expression '^V' to find all words whose part-of-speech tag begins with “V” in the original HKCanCor annotations:

>>> corpus = pycantonese.hkcancor() # get HKCanCor
>>> all_verbs = corpus.search(pos='^V')
>>> len(all_verbs)  # number of all verbs
29726
>>> all_verbs[:10]  # print 10 results
[Token(word='去', pos='V', jyutping='heoi3', mor=None, gra=None),
 Token(word='去', pos='V', jyutping='heoi3', mor=None, gra=None),
 Token(word='旅行', pos='VN', jyutping='leoi5hang4', mor=None, gra=None),
 Token(word='有冇', pos='V1', jyutping='jau5mou5', mor=None, gra=None),
 Token(word='要', pos='VU', jyutping='jiu3', mor=None, gra=None),
 Token(word='有得', pos='VU', jyutping='jau5dak1', mor=None, gra=None),
 Token(word='冇得', pos='VU', jyutping='mou5dak1', mor=None, gra=None),
 Token(word='去', pos='V', jyutping='heoi3', mor=None, gra=None),
 Token(word='係', pos='V', jyutping='hai6', mor=None, gra=None),
 Token(word='係', pos='V', jyutping='hai6', mor=None, gra=None)]
  4. Parsing Jyutping for the onset, nucleus, coda, and tone

>>> pycantonese.parse_jyutping('gwong2dung1waa2')  # 廣東話
[Jyutping(onset='gw', nucleus='o', coda='ng', tone='2'),
 Jyutping(onset='d', nucleus='u', coda='ng', tone='1'),
 Jyutping(onset='w', nucleus='aa', coda='', tone='2')]
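
To make the onset/nucleus/coda/tone breakdown concrete, here is a minimal standalone sketch that splits a Jyutping string with a regular expression. It is not PyCantonese's implementation: the `Syllable` tuple, the `parse_jyutping_sketch` function, and the simplified sound inventory are all assumptions made for illustration, and the sketch does not need PyCantonese to run.

```python
import re
from typing import List, NamedTuple


class Syllable(NamedTuple):
    """Hypothetical container for this sketch (not PyCantonese's Jyutping class)."""

    onset: str
    nucleus: str
    coda: str
    tone: str


# Simplified Jyutping inventory (an assumption of this sketch). Longer
# alternatives come first so that "gw" wins over "g" and "aa" over "a".
_SYLLABLE = re.compile(
    r"(?P<onset>gw|kw|ng|[bpmfdtnlgkhwzcsj])?"
    r"(?P<nucleus>aa|oe|eo|yu|[aeiou]|ng|m)"
    r"(?P<coda>ng|[ptkmniu])?"
    r"(?P<tone>[1-6])"
)


def parse_jyutping_sketch(text: str) -> List[Syllable]:
    # Every syllable ends in a tone digit, so split right after each digit.
    chunks = re.findall(r"[a-z]+[1-6]", text)
    if "".join(chunks) != text:
        raise ValueError(f"not a well-formed Jyutping string: {text!r}")
    syllables = []
    for chunk in chunks:
        match = _SYLLABLE.fullmatch(chunk)
        if match is None:
            raise ValueError(f"cannot parse syllable: {chunk!r}")
        # Unmatched optional groups come back as None; normalise them to "".
        parts = (match.group(name) or "" for name in Syllable._fields)
        syllables.append(Syllable(*parts))
    return syllables


print(parse_jyutping_sketch("gwong2dung1waa2"))
```

Splitting on the tone digit first keeps the per-syllable pattern simple, and the optional onset lets syllabic nasals such as `m4` and `ng5` fall through to the nucleus slot. For real work, prefer `pycantonese.parse_jyutping`, which is the documented entry point.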

Download and Install

To download and install the most recent stable version:

$ pip install --upgrade pycantonese

To test your installation in the Python interpreter:

>>> import pycantonese
>>> pycantonese.__version__  # show version number
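
The same check can be scripted without importing the package. The sketch below uses the standard library's `importlib.metadata` (Python 3.8+); the helper name `installed_version` is an assumption for illustration and is not part of PyCantonese.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if PyCantonese is installed, otherwise None.
print(installed_version("pycantonese"))
```

This reads the installed distribution's metadata, so it reports what `pip` installed even if the import itself would fail for another reason.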

How to Cite

PyCantonese is authored and maintained by Jackson L. Lee.

A talk introducing PyCantonese:

Lee, Jackson L. 2015. PyCantonese: Cantonese linguistic research in the age of big data. Talk at the Childhood Bilingualism Research Centre, Chinese University of Hong Kong. September 15, 2015. Notes+slides

License

MIT License. Please see LICENSE.txt in the GitHub source code for details.

The HKCanCor dataset included in PyCantonese is substantially modified from its source in terms of format. The original dataset has a CC BY license. Please see pycantonese/data/hkcancor/README.md in the GitHub source code for details.

The rime-cantonese data (release 2020.09.09) is incorporated into PyCantonese for word segmentation and characters-to-Jyutping conversion. This data has a CC BY 4.0 license. Please see pycantonese/data/rime_cantonese/README.md in the GitHub source code for details.

Acknowledgments

Wonderful resources with a permissive license that have been incorporated into PyCantonese:

  • HKCanCor

  • rime-cantonese

Individuals who have contributed feedback, bug reports, etc. (in alphabetical order of last names):

  • @cathug

  • Litong Chen

  • Jenny Chim

  • @g-traveller

  • Rachel Han

  • Ryan Lai

  • Charles Lam

  • Hill Ma

  • @richielo

  • @rylanchiu

  • Stephan Stiller

  • Tsz-Him Tsui

  • Robin Yuen

Changelog

Please see CHANGELOG.md.

Setting up a Development Environment

The latest code under development is available on GitHub at jacksonllee/pycantonese. You need to have Git LFS installed on your system. To obtain this version for experimental features or for development:

$ git clone https://github.com/jacksonllee/pycantonese.git
$ cd pycantonese
$ git lfs pull
$ pip install -r dev-requirements.txt
$ pip install -e .

To run tests and styling checks:

$ pytest -vv --doctest-modules --cov=pycantonese pycantonese docs
$ flake8 pycantonese
$ black --check pycantonese

To build the documentation website files:

$ python build_docs.py
