A tiny sentence/word tokenizer for Japanese text written in Python

🌿 Konoha: Simple wrapper of Japanese Tokenizers

Konoha is a Python library that provides an easy-to-use, integrated interface to various Japanese tokenizers, so you can switch tokenizers easily and streamline your pre-processing.

Supported tokenizers

Besides wrapping external tokenizers, konoha provides rule-based tokenizers (whitespace, character) and a rule-based sentence splitter.
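
To make the rule-based options concrete, here is a minimal sketch of what whitespace and character tokenization amount to. The function names are hypothetical and this is not konoha's implementation; it only illustrates the two rules.

```python
from typing import List


def whitespace_tokenize(text: str) -> List[str]:
    """Split on runs of whitespace (hypothetical helper, not konoha's API)."""
    return text.split()


def character_tokenize(text: str) -> List[str]:
    """Treat every character as its own token (hypothetical helper)."""
    return list(text)


print(whitespace_tokenize("natural language processing"))
# => ['natural', 'language', 'processing']
print(character_tokenize("ペン"))
# => ['ペ', 'ン']
```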

Quick Start with Docker

Simply run the following on your computer:

docker run --rm -p 8000:8000 -t himkt/konoha  # from DockerHub

Alternatively, you can build the image on your machine:

git clone https://github.com/himkt/konoha  # download konoha
cd konoha && docker-compose up --build  # build and launch container

Tokenization is done by posting a JSON object to localhost:8000/api/v1/tokenize. You can also tokenize in batches by passing texts: ["１つ目の入力", "２つ目の入力"] to localhost:8000/api/v1/batch_tokenize.

(API documentation is available at localhost:8000/redoc; you can view it in your web browser.)

Send a request using curl from your terminal. Note that the endpoint path changed in v4.6.4; please check the release notes (https://github.com/himkt/konoha/releases/tag/v4.6.4).

$ curl localhost:8000/api/v1/tokenize -X POST -H "Content-Type: application/json" \
    -d '{"tokenizer": "mecab", "text": "これはペンです"}'

{
  "tokens": [
    [
      {
        "surface": "これ",
        "part_of_speech": "名詞"
      },
      {
        "surface": "は",
        "part_of_speech": "助詞"
      },
      {
        "surface": "ペン",
        "part_of_speech": "名詞"
      },
      {
        "surface": "です",
        "part_of_speech": "助動詞"
      }
    ]
  ]
}
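
The same request can be built from Python. The sketch below uses only the standard library and assumes the server from the Docker quick start is listening on localhost:8000. The helper names build_request, tokenize_remote, and extract_surfaces are hypothetical, and the response shape is taken from the example above.

```python
import json
import urllib.request

# Assumes the Docker container from the quick start is running locally.
API_URL = "http://localhost:8000/api/v1/tokenize"


def build_request(text: str, tokenizer: str = "mecab") -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    payload = json.dumps({"tokenizer": tokenizer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_surfaces(response: dict) -> list:
    """Collect the surface form of every token, one list per input text."""
    return [[token["surface"] for token in tokens] for tokens in response["tokens"]]


def tokenize_remote(text: str, tokenizer: str = "mecab") -> list:
    """Send the request and return surfaces; needs a running server."""
    with urllib.request.urlopen(build_request(text, tokenizer)) as resp:
        return extract_surfaces(json.load(resp))


# Parsing a response of the shape shown above works without a server.
sample = {
    "tokens": [
        [
            {"surface": "これ", "part_of_speech": "名詞"},
            {"surface": "は", "part_of_speech": "助詞"},
        ]
    ]
}
print(extract_surfaces(sample))
# => [['これ', 'は']]
```

With the container running, tokenize_remote("これはペンです") would send the request and return the surfaces from the live response.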

Installation

I recommend installing konoha with pip install 'konoha[all]'.

  • Install konoha with a specific tokenizer: pip install 'konoha[(tokenizer_name)]'
  • Install konoha with a specific tokenizer and remote file support: pip install 'konoha[(tokenizer_name),remote]'

To use a particular tokenizer, install konoha with the corresponding extra (e.g. konoha[mecab], konoha[sudachi], etc.) or install the tokenizer separately.

Example

Word level tokenization

from konoha import WordTokenizer

sentence = '自然言語処理を勉強しています'

tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))
# => [自然, 言語, 処理, を, 勉強, し, て, い, ます]

tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
# => [▁, 自然, 言語, 処理, を, 勉強, し, ています]

For more details, please see the example/ directory.

Remote files

Konoha can load dictionaries and models from cloud storage (currently Amazon S3). This requires installing konoha with the remote option; see Installation.

# download user dictionary from S3
word_tokenizer = WordTokenizer("mecab", user_dictionary_path="s3://abc/xxx.dic")
print(word_tokenizer.tokenize(sentence))

# download system dictionary from S3
word_tokenizer = WordTokenizer("mecab", system_dictionary_path="s3://abc/yyy")
print(word_tokenizer.tokenize(sentence))

# download model file from S3
word_tokenizer = WordTokenizer("sentencepiece", model_path="s3://abc/zzz.model")
print(word_tokenizer.tokenize(sentence))
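
For orientation, a remote path such as s3://abc/xxx.dic names a bucket and an object key. The sketch below shows one way to split such a path using the standard library. split_s3_path is a hypothetical helper and does not reflect how konoha resolves remote files internally.

```python
from urllib.parse import urlparse


def split_s3_path(path: str) -> tuple:
    """Return (bucket, key) for an s3:// path; raise ValueError otherwise."""
    parsed = urlparse(path)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 path: {path}")
    # The leading slash belongs to the URL path, not to the object key.
    return parsed.netloc, parsed.path.lstrip("/")


print(split_s3_path("s3://abc/xxx.dic"))
# => ('abc', 'xxx.dic')
```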

Sentence level tokenization

from konoha import SentenceTokenizer

sentence = "私は猫だ。名前なんてものはない。だが，「かわいい。それで十分だろう」。"

tokenizer = SentenceTokenizer()
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。', '名前なんてものはない。', 'だが，「かわいい。それで十分だろう」。']

You can change the symbol used as the sentence delimiter and the patterns treated as bracket expressions.

  1. sentence splitter
sentence = "私は猫だ。名前なんてものはない．だが，「かわいい。それで十分だろう」。"

tokenizer = SentenceTokenizer(period="．")
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。名前なんてものはない．', 'だが，「かわいい。それで十分だろう」。']

  2. bracket expression
import re  # needed to compile the additional bracket pattern

sentence = "私は猫だ。名前なんてものはない。だが，『かわいい。それで十分だろう』。"

tokenizer = SentenceTokenizer(
    patterns=SentenceTokenizer.PATTERNS + [re.compile(r"『.*?』")],
)
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。', '名前なんてものはない。', 'だが，『かわいい。それで十分だろう』。']
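
To illustrate how the period symbol and the bracket patterns interact, here is a minimal, self-contained sketch of a bracket-aware splitter. It is not konoha's implementation: split_sentences and DEFAULT_PATTERNS are hypothetical names, the default patterns are assumed for this sketch, and the period is assumed to be a single character. The idea is to split after each period unless that period falls inside a span matched by one of the patterns.

```python
import re
from typing import List, Optional

# Assumed default bracket patterns for this sketch (not read from konoha).
DEFAULT_PATTERNS = [re.compile(r"（.*?）"), re.compile(r"「.*?」")]


def split_sentences(
    text: str,
    period: str = "。",
    patterns: Optional[List[re.Pattern]] = None,
) -> List[str]:
    """Split text after each period that lies outside every bracket match."""
    if patterns is None:
        patterns = DEFAULT_PATTERNS
    # Character spans (start, end) covered by a bracket expression.
    protected = [m.span() for p in patterns for m in p.finditer(text)]
    sentences, start = [], 0
    for i, ch in enumerate(text):
        inside = any(begin <= i < end for begin, end in protected)
        if ch == period and not inside:
            sentences.append(text[start:i + 1])
            start = i + 1
    if start < len(text):  # keep any trailing text without a final period
        sentences.append(text[start:])
    return sentences


text = "私は猫だ。名前なんてものはない。だが，『かわいい。それで十分だろう』。"
print(split_sentences(text, patterns=DEFAULT_PATTERNS + [re.compile(r"『.*?』")]))
# => ['私は猫だ。', '名前なんてものはない。', 'だが，『かわいい。それで十分だろう』。']
```

Without the extra 『.*?』 pattern, the period inside the brackets is no longer protected, so this sketch would also split after かわいい。.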

Test

python -m pytest

Acknowledgement

The Sentencepiece model used in the tests is provided by @yoheikikuta. Thanks!
