Python library for Pyidaungsu Myanmar languages

Project description

Pyidaungsu (Project discontinued)

A Python library for the Myanmar language, useful for natural language processing and text preprocessing.

Installation

pip install pyidaungsu

Usage

Language detection (Myanmar <Zawgyi, Unicode>, Karen, Mon, Shan)

Starting from pyidaungsu 0.0.9, the library detects not only Zawgyi and Unicode for the Myanmar language but also other languages such as Mon, Karen, and Shan.

Language detection for Mon and Shan is temporarily disabled starting from 0.1.3 because the accuracy for those languages was not good enough. Currently supported languages: English, Spanish, French, Chinese, Japanese, Korean, Myanmar (Unicode), Myanmar (Zawgyi), Karen.

import pyidaungsu as pds

# language detection
pds.detect("ထမင်းစားပြီးပြီလား")
>> "mm_uni"
pds.detect("ထမင္းစားၿပီးၿပီလား")
>> "mm_zg"
pds.detect("တၢ်သိၣ်လိတၢ်ဖးလံာ် ကွဲးလံာ်အိၣ်လၢ မ့ရ့ၣ်အစုပူၤလီၤ.")
>> "karen"
# Mon and Shan (detection for these two languages is disabled starting from 0.1.3)
pds.detect("ဇၟာပ်မၞိဟ်ဂှ် ကတဵုဒှ်ကၠုင် ပ္ဍဲကဵုဂကောံမွဲ ဖအိုတ်ရ၊၊")
>> "mon"
pds.detect("ၼႂ်းဢိူင်ႇမိူင်းၽူင်း ၸႄႈဝဵင်းတႃႈၶီႈလဵၵ်း ၾႆးမႆႈႁိူၼ်း ၵူၼ်းဝၢၼ်ႈ လင်ၼိုင်ႈ")
>> "shan"
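
Because the detector returns a plain string label, a mixed corpus can be bucketed by language before further processing. The following is a minimal sketch, not part of pyidaungsu: `group_by_language` is a hypothetical helper, and it takes the detector as an argument so that `pds.detect` (or any other function mapping text to a label) can be passed in.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_language(
    texts: Iterable[str], detect: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Bucket texts by the label that `detect` returns for each one.

    `detect` is assumed to behave like pds.detect: text in, label out.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for text in texts:
        groups[detect(text)].append(text)
    return dict(groups)
```

With the library installed, a call would look like `group_by_language(lines, pds.detect)`; the keys of the result are whatever labels the installed version emits.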

Zawgyi-Unicode conversion

# convert to zawgyi
pds.cvt2zgi("ထမင်းစားပြီးပြီလား")
>> "ထမင္းစားၿပီးၿပီလား"

# convert to unicode
pds.cvt2uni("ထမင္းစားၿပီးၿပီလား")
>> "ထမင်းစားပြီးပြီလား"
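
Detection and conversion can be combined so that only Zawgyi input is converted and everything else passes through untouched. This is a sketch under assumptions: `normalize_to_unicode` is a hypothetical helper, the `"mm_zg"` label is taken from the detection examples above, and the detector and converter are passed in as arguments rather than imported.

```python
from typing import Callable


def normalize_to_unicode(
    text: str,
    detect: Callable[[str], str],
    to_unicode: Callable[[str], str],
    zawgyi_label: str = "mm_zg",  # label shown in the detection examples
) -> str:
    """Convert text to Unicode only when the detector labels it as Zawgyi."""
    if detect(text) == zawgyi_label:
        return to_unicode(text)
    return text
```

With the library installed, this would be called as `normalize_to_unicode(text, pds.detect, pds.cvt2uni)`. Checking the label first avoids running the converter over text that is already Unicode or is in another language.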

Tokenization

# syllable level tokenization for Burmese
pds.tokenize("Alan TuringကိုArtificial Intelligenceနဲ့Computerတွေရဲ့ဖခင်ဆိုပြီးလူသိများပါတယ်")  # lang defaults to "mm"
>> ['Alan', 'Turing', 'ကို', 'Artificial', 'Intelligence', 'နဲ့', 'Computer', 'တွေ', 'ရဲ့', 'ဖ', 'ခင်', 'ဆို', 'ပြီး', 'လူ', 'သိ', 'များ', 'ပါ', 'တယ်']

# syllable level tokenization for Karen
pds.tokenize("ကၠိသရၣ်,သရၣ်မုၣ် ခဲလၢာ်ဟးထီၣ် (၃၅) ဂၤန့ၣ်လီၤ.", lang="karen")
>> ['ကၠိ', 'သ', 'ရၣ်', ',', 'သ', 'ရၣ်', 'မုၣ်', 'ခဲ', 'လၢာ်', 'ဟး', 'ထီၣ်', '(', '၃၅', ')', 'ဂၤ', 'န့ၣ်', 'လီၤ', '.']

# word level tokenization
pds.tokenize("ဖေဖေနဲ့မေမေ၏ကျေးဇူးတရားမှာကြီးမားလှပေသည်", form="word")
>> ['ဖေဖေ', 'နဲ့', 'မေမေ', '၏', 'ကျေးဇူးတရား', 'မှာ', 'ကြီးမား', 'လှ', 'ပေ', 'သည်']

Syllable-level tokenization is supported for four languages (Burmese, Karen, Shan, Mon). Word-level tokenization currently supports only Burmese.
Available values for the lang parameter of the tokenize function: "mm", "karen", "mon", "shan"
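
To illustrate what syllable-level tokenization means for Burmese, here is a simplified regex sketch. It is not pyidaungsu's implementation. It assumes a single rule: a syllable starts at a basic consonant (U+1000 to U+1021) unless that consonant is followed by asat or virama (so it closes the previous syllable) or is preceded by virama (so it is stacked under the previous consonant). English words, digits, punctuation, and independent vowels are not handled. Splitting on zero-width matches requires Python 3.7 or later.

```python
import re
from typing import List

# Simplified assumption: only the basic consonants U+1000..U+1021 start a syllable.
CONSONANTS = "\u1000-\u1021"
ASAT = "\u103a"    # vowel killer: the consonant before it closes a syllable
VIRAMA = "\u1039"  # stacking sign: the consonant after it is stacked, not a new start

# Zero-width match at every position where a new syllable begins.
_SYLLABLE_START = re.compile(
    f"(?<!{VIRAMA})(?=[{CONSONANTS}](?![{ASAT}{VIRAMA}]))"
)


def syllable_break(text: str) -> List[str]:
    """Split Burmese text at syllable starts; empty pieces are dropped."""
    return [piece for piece in _SYLLABLE_START.split(text) if piece]
```

For example, `syllable_break("ထမင်းစား")` splits the text before each consonant that opens a syllable, while the consonant carrying asat stays attached to the syllable it closes.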

Future work

  • Add tokenizer for Burmese (syllable and word-level tokenization)
  • Add more tokenizers (BPE, WordPiece, etc.)
  • Add Part-of-Speech (POS) tagger for Burmese
  • Add Named-Entity Recognition (NER) classifier for Burmese
  • Add thorough documentation
