
A tiny sentence/word tokenizer for Japanese text written in Python


🌿 Konoha: Simple wrapper of Japanese Tokenizers


Konoha is a Python library that provides an easy-to-use, integrated interface to various Japanese tokenizers, which enables you to switch tokenizers and boost your pre-processing.

Supported tokenizers

Konoha also provides rule-based tokenizers (whitespace, character) and a rule-based sentence splitter.
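The two rule-based tokenizers mentioned above are simple enough to illustrate in plain Python. The following is a minimal sketch of the idea only, not konoha's own implementation; the function names whitespace_tokenize and character_tokenize are hypothetical, and konoha's handling of consecutive spaces may differ.

```python
def whitespace_tokenize(text):
    """Split text on runs of whitespace (assumption: empty strings are dropped)."""
    return text.split()


def character_tokenize(text):
    """Treat every character as its own token."""
    return list(text)


print(whitespace_tokenize("natural language processing"))
print(character_tokenize("猫だ"))
```

Rules like these need no dictionary or model file, which makes them convenient as baselines or for text that is already segmented.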

Quick Start with Docker

Simply run the following on your computer:

docker run --rm -p 8000:8000 -t himkt/konoha  # from DockerHub

Or you can build the image on your machine:

git clone https://github.com/himkt/konoha  # download konoha
cd konoha && docker-compose up --build  # build and launch container

Tokenization is done by posting a JSON object to localhost:8000/api/tokenize. You can also tokenize a batch by passing texts: ["1つ目の入力", "2つ目の入力"] to the server.

(API documentation is available at localhost:8000/redoc; you can view it in your web browser.)

Send a request using curl from your terminal:

$ curl localhost:8000/api/tokenize -X POST -H "Content-Type: application/json" \
    -d '{"tokenizer": "mecab", "text": "これはペンです"}'

{
  "tokens": [
    [
      {
        "surface": "これ",
        "part_of_speech": "名詞"
      },
      {
        "surface": "は",
        "part_of_speech": "助詞"
      },
      {
        "surface": "ペン",
        "part_of_speech": "名詞"
      },
      {
        "surface": "です",
        "part_of_speech": "助動詞"
      }
    ]
  ]
}
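The same request can be sent from Python with only the standard library. This is a hedged sketch: the helper names build_payload, post_tokenize, and surfaces are hypothetical, and it assumes a batch request uses the same body as above with texts in place of text, and that each inner list under "tokens" corresponds to one input text.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/api/tokenize"  # endpoint from the text above


def build_payload(tokenizer, text=None, texts=None):
    """Build the JSON body: `text` for one input, `texts` for a batch."""
    if (text is None) == (texts is None):
        raise ValueError("pass exactly one of text or texts")
    payload = {"tokenizer": tokenizer}
    if text is not None:
        payload["text"] = text
    else:
        payload["texts"] = list(texts)
    return payload


def post_tokenize(payload, url=API_URL):
    """POST the payload as JSON and return the decoded response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def surfaces(response):
    """Collect surface strings, one list per entry in response["tokens"]."""
    return [[token["surface"] for token in tokens] for tokens in response["tokens"]]
```

With the container running, surfaces(post_tokenize(build_payload("mecab", text="これはペンです"))) should reduce the response shown above to the surface forms これ, は, ペン, です.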

Installation

I recommend installing konoha with pip install 'konoha[all]' or pip install 'konoha[all_with_integrations]'. (all_with_integrations also installs AllenNLP.)

  • Install konoha with a specific tokenizer: pip install 'konoha[(tokenizer_name)]'.
  • Install konoha with a specific tokenizer and AllenNLP integration: pip install 'konoha[(tokenizer_name),allennlp]'.
  • Install konoha with a specific tokenizer and remote file support: pip install 'konoha[(tokenizer_name),remote]'.

Example

Word level tokenization

from konoha import WordTokenizer

sentence = '自然言語処理を勉強しています'

tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))
# => [自然, 言語, 処理, を, 勉強, し, て, い, ます]

tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
# => [▁, 自然, 言語, 処理, を, 勉強, し, ています]

For more details, please see the example/ directory.

Remote files

Konoha can load dictionaries and models from cloud storage (currently Amazon S3). This requires installing konoha with the remote option; see Installation.

from konoha import WordTokenizer

sentence = '自然言語処理を勉強しています'

# download user dictionary from S3
word_tokenizer = WordTokenizer("mecab", user_dictionary_path="s3://abc/xxx.dic")
print(word_tokenizer.tokenize(sentence))

# download system dictionary from S3
word_tokenizer = WordTokenizer("mecab", system_dictionary_path="s3://abc/yyy")
print(word_tokenizer.tokenize(sentence))

# download model file from S3
word_tokenizer = WordTokenizer("sentencepiece", model_path="s3://abc/zzz.model")
print(word_tokenizer.tokenize(sentence))

Sentence level tokenization

from konoha import SentenceTokenizer

sentence = "私は猫だ。名前なんてものはない。だが,「かわいい。それで十分だろう」。"

tokenizer = SentenceTokenizer()
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。', '名前なんてものはない。', 'だが,「かわいい。それで十分だろう」。']
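The output above shows that a 。 inside 「…」 does not end a sentence. A minimal sketch of a rule that behaves this way on the example is given below; it is an illustration under that assumption, not konoha's actual splitter, and split_sentences is a hypothetical name.

```python
def split_sentences(text, period="。", open_bracket="「", close_bracket="」"):
    """Split after each period that is not enclosed in brackets."""
    sentences = []
    buffer = []
    depth = 0  # current nesting level of 「…」
    for ch in text:
        buffer.append(ch)
        if ch == open_bracket:
            depth += 1
        elif ch == close_bracket:
            depth = max(depth - 1, 0)  # tolerate unbalanced closing brackets
        elif ch == period and depth == 0:
            sentences.append("".join(buffer))
            buffer = []
    if buffer:  # keep trailing text that has no final period
        sentences.append("".join(buffer))
    return sentences


document = "私は猫だ。名前なんてものはない。だが,「かわいい。それで十分だろう」。"
print(split_sentences(document))
# => ['私は猫だ。', '名前なんてものはない。', 'だが,「かわいい。それで十分だろう」。']
```

Tracking bracket depth rather than a single flag keeps nested quotations together as well.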

AllenNLP integration

Konoha provides AllenNLP integration, which lets you specify a konoha tokenizer in a Jsonnet config file. By running allennlp train with --include-package konoha, you can train a model using a konoha tokenizer.

For example, a konoha tokenizer is specified in xxx.jsonnet as follows:

{
  "dataset_reader": {
    "lazy": false,
    "type": "text_classification_json",
    "tokenizer": {
      "type": "konoha",  // <-- konoha here!!!
      "tokenizer_name": "janome",
    },
    "token_indexers": {
      "tokens": {
        "type": "single_id",
        "lowercase_tokens": true,
      },
    },
  },
  ...
  "model": {
  ...
  },
  "trainer": {
  ...
  }
}

Once the other sections (e.g. model config, trainer config) are filled in, allennlp train config/xxx.jsonnet --include-package konoha --serialization-dir yyy works. (Remember to include konoha with --include-package konoha.)

For more details, please refer to my blog article (in Japanese, sorry).

Test

python -m pytest

Article

Acknowledgement

The Sentencepiece model used in the tests is provided by @yoheikikuta. Thanks!
