A tiny sentence/word tokenizer for Japanese text written in Python

🌿 Konoha: Simple wrapper of Japanese Tokenizers

Konoha is a Python library that provides an easy-to-use, integrated interface to various Japanese tokenizers, so you can switch between tokenizers easily and speed up your pre-processing.

Supported tokenizers

Konoha also provides rule-based tokenizers (whitespace, character) and a rule-based sentence splitter.
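
To give a feel for what "rule-based" means here, the following is a minimal plain-Python sketch of whitespace and character tokenization. The function names are hypothetical and this is not konoha's own implementation; it only illustrates the idea.

```python
def whitespace_tokenize(text):
    """Split on runs of whitespace (hypothetical helper, not konoha's API)."""
    return text.split()


def character_tokenize(text):
    """Treat every character as its own token (hypothetical helper)."""
    return list(text)


print(whitespace_tokenize("natural language processing"))
print(character_tokenize("猫だ"))
```

Rules like these need no dictionary or model, which makes them handy as a baseline or for text that is already segmented.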

Quick Start with Docker

Simply run the following on your computer:

docker run --rm -p 8000:8000 -t himkt/konoha  # from DockerHub

Alternatively, you can build the image on your machine:

git clone https://github.com/himkt/konoha  # download konoha
cd konoha && docker-compose up --build  # build and launch container

Tokenization is done by posting a JSON object to localhost:8000/api/v1/tokenize. You can also tokenize in batches by passing texts: ["１つ目の入力", "２つ目の入力"] to localhost:8000/api/v1/batch_tokenize.

(API documentation is available at localhost:8000/redoc; you can view it in your web browser.)

Send a request using curl from your terminal. Note that the endpoint path changed in v4.6.4. Please check the release notes (https://github.com/himkt/konoha/releases/tag/v4.6.4).

$ curl localhost:8000/api/v1/tokenize -X POST -H "Content-Type: application/json" \
    -d '{"tokenizer": "mecab", "text": "これはペンです"}'

{
  "tokens": [
    [
      {
        "surface": "これ",
        "part_of_speech": "名詞"
      },
      {
        "surface": "は",
        "part_of_speech": "助詞"
      },
      {
        "surface": "ペン",
        "part_of_speech": "名詞"
      },
      {
        "surface": "です",
        "part_of_speech": "助動詞"
      }
    ]
  ]
}
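
The same request can also be made from Python. The sketch below uses only the standard library and assumes the container from the Quick Start is listening on localhost:8000. The helper names (build_tokenize_request, surfaces, tokenize) are hypothetical and chosen for illustration; the payload and response shape are taken from the curl example above.

```python
import json
from urllib import request

# Assumption: the Docker container from the Quick Start is running locally.
BASE_URL = "http://localhost:8000"


def build_tokenize_request(text, tokenizer="mecab", base_url=BASE_URL):
    """Build (but do not send) a POST request for /api/v1/tokenize."""
    payload = {"tokenizer": tokenizer, "text": text}
    return request.Request(
        f"{base_url}/api/v1/tokenize",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def surfaces(response_body):
    """Collect surface forms from a response shaped like the example above."""
    return [
        [token["surface"] for token in sentence]
        for sentence in response_body["tokens"]
    ]


def tokenize(text, tokenizer="mecab", base_url=BASE_URL):
    """Send the request and decode the JSON body; needs a reachable server."""
    with request.urlopen(build_tokenize_request(text, tokenizer, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, surfaces(tokenize("これはペンです")) would return the surface forms of the tokens in the response.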

Installation

I recommend installing konoha with pip install 'konoha[all]' or pip install 'konoha[all_with_integrations]'. (all_with_integrations also installs AllenNLP.)

Install konoha with a specific tokenizer: pip install 'konoha[(tokenizer_name)]'.
Install konoha with a specific tokenizer and AllenNLP integration: pip install 'konoha[(tokenizer_name),allennlp]'.
Install konoha with a specific tokenizer and remote file support: pip install 'konoha[(tokenizer_name),remote]'.

To use a particular tokenizer, either install konoha with the matching extra (e.g. konoha[mecab], konoha[sudachi], etc.) or install the tokenizer package separately.
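
Extras can be combined with commas inside the brackets. As a small illustration, the hypothetical helper below (not part of konoha) composes the requirement string that would be passed to pip.

```python
def konoha_requirement(*extras):
    """Compose a pip requirement string such as konoha[mecab,remote]."""
    if not extras:
        return "konoha"
    return "konoha[{}]".format(",".join(extras))


print(konoha_requirement("mecab", "remote"))
```

The resulting string is what goes inside the quotes in pip install '...'; quoting keeps the shell from interpreting the square brackets.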

Example

Word level tokenization

from konoha import WordTokenizer

sentence = '自然言語処理を勉強しています'

tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))
# => [自然, 言語, 処理, を, 勉強, し, て, い, ます]

tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
# => [▁, 自然, 言語, 処理, を, 勉強, し, ています]

For more details, please see the example/ directory.

Remote files

Konoha can load dictionaries and models from cloud storage (currently only Amazon S3 is supported). This requires installing konoha with the remote option; see Installation.

from konoha import WordTokenizer

sentence = '自然言語処理を勉強しています'

# download user dictionary from S3
word_tokenizer = WordTokenizer("mecab", user_dictionary_path="s3://abc/xxx.dic")
print(word_tokenizer.tokenize(sentence))

# download system dictionary from S3
word_tokenizer = WordTokenizer("mecab", system_dictionary_path="s3://abc/yyy")
print(word_tokenizer.tokenize(sentence))

# download model file from S3
word_tokenizer = WordTokenizer("sentencepiece", model_path="s3://abc/zzz.model")
print(word_tokenizer.tokenize(sentence))
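
The paths above are ordinary s3:// URIs, where the part after the scheme is the bucket and the rest is the object key. The sketch below shows one way such a path could be told apart from a local path and split into its parts. split_s3_uri is a hypothetical helper for illustration and does not reflect how konoha handles remote files internally.

```python
from urllib.parse import urlparse


def split_s3_uri(path):
    """Return (bucket, key) for an s3:// URI, or None for any other path."""
    parsed = urlparse(path)
    if parsed.scheme != "s3":
        return None  # treat everything else as a local path
    return parsed.netloc, parsed.path.lstrip("/")


print(split_s3_uri("s3://abc/xxx.dic"))
print(split_s3_uri("data/model.spm"))
```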

Sentence level tokenization

from konoha import SentenceTokenizer

sentence = "私は猫だ。名前なんてものはない。だが，「かわいい。それで十分だろう」。"

tokenizer = SentenceTokenizer()
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。', '名前なんてものはない。', 'だが，「かわいい。それで十分だろう」。']
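
Note that the full stop inside the quotation 「…」 does not end a sentence. A minimal plain-Python sketch of one rule that reproduces this behaviour is shown below: split after 。 only when not inside brackets. The function name and parameters are hypothetical, and this is an illustration of the idea rather than konoha's actual splitter.

```python
def split_sentences(text, period="。", open_bracket="「", close_bracket="」"):
    """Split after each period that is not enclosed in brackets."""
    sentences = []
    buffer = []
    depth = 0  # current bracket nesting level
    for ch in text:
        buffer.append(ch)
        if ch == open_bracket:
            depth += 1
        elif ch == close_bracket:
            depth = max(depth - 1, 0)
        elif ch == period and depth == 0:
            sentences.append("".join(buffer))
            buffer = []
    if buffer:  # keep trailing text that has no final period
        sentences.append("".join(buffer))
    return sentences


text = "私は猫だ。名前なんてものはない。だが，「かわいい。それで十分だろう」。"
print(split_sentences(text))
```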

AllenNLP integration

Konoha provides AllenNLP integration, which lets users specify a konoha tokenizer in a Jsonnet config file. By running allennlp train with --include-package konoha, you can train a model using a konoha tokenizer!

For example, a konoha tokenizer is specified in xxx.jsonnet as follows:

{
  "dataset_reader": {
    "lazy": false,
    "type": "text_classification_json",
    "tokenizer": {
      "type": "konoha",  // <-- konoha here!!!
      "tokenizer_name": "janome",
    },
    "token_indexers": {
      "tokens": {
        "type": "single_id",
        "lowercase_tokens": true,
      },
    },
  },
  ...
  "model": {
  ...
  },
  "trainer": {
  ...
  }
}

After completing the other sections (e.g. model config, trainer config, etc.), allennlp train config/xxx.jsonnet --include-package konoha --serialization-dir yyy works! (Remember to include konoha with --include-package konoha.)

For more details, please refer to my blog article (in Japanese, sorry).

Test

python -m pytest

Acknowledgement

The Sentencepiece model used in the tests was provided by @yoheikikuta. Thanks!
