Wikipedia Tokenizer Utility

Project description

Wiki NLP Tools

Python package to perform language-agnostic tokenization.

Vision

  • Researchers can start with a Wikipedia article (wikitext or HTML), strip syntax to leave just paragraphs of plaintext, and then tokenize that plaintext into sentences and words for input into models.
  • This would be language-agnostic – i.e. the library would work equally well regardless of the Wikipedia language (see https://meta.wikimedia.org/wiki/List_of_Wikipedias).
  • This would be easily accessible – i.e. each component is an open-source, pip-installable Python library that is configurable but provides good default performance out of the box, so that Wikimedia could use it internally via PySpark UDFs on our cluster and external organizations/researchers could incorporate it into their workflows.
  • The connections between stages are transparent – i.e. any text produced by word tokenization can be traced directly back to the original wikitext or HTML it was derived from.

Features

  • Tokenize text into sentences and words across 300+ languages out-of-the-box
  • Abbreviation lists can be used to improve sentence-tokenization performance
  • The word tokenizer takes non-whitespace-delimited languages into account during tokenization
  • Input can be exactly reconstructed from the tokenization output
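
To see why abbreviation handling matters, here is a minimal sketch in plain Python (this is not mwtokenizer's algorithm, just an illustration of the problem an abbreviation list solves): a naive splitter that breaks after sentence-final punctuation fragments the text at "Co." and "St.".

```python
import re

# Naive sentence splitting: break at whitespace that follows '.', '!' or '?'.
# Abbreviations like "Co." and "St." trigger spurious sentence breaks.
text = "Have Moly and Co. made it to the shop near St. Michael's Church?"
naive = re.split(r"(?<=[.!?])\s+", text)
print(naive)
# The single sentence is wrongly split into three pieces:
# ["Have Moly and Co.", "made it to the shop near St.", "Michael's Church?"]
```

An abbreviation-aware tokenizer avoids these false breaks, which is what the `use_abbreviation=True` flag in the usage example below enables.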

Installation

$ pip install mwtokenizer

Basic Usage

from mwtokenizer.tokenizer import Tokenizer
# initialize a tokenizer for "en" (English)
tokenizer = Tokenizer(language_code="en")
sample_text = '''Have Moly and Co. made it to the shop near St. Michael's Church?? \n\t The address is written by Bohr Jr.   here!'''
print(list(tokenizer.sentence_tokenize(sample_text, use_abbreviation=True)))
'''
[output] ["Have Moly and Co. made it to the shop near St. Michael's Church?? \n\t ", 'The address is written by Bohr Jr.   here!']
'''
print(list(tokenizer.word_tokenize(text=sample_text, use_abbreviation=True)))
'''
[output] ['Have', ' ', 'Moly', ' ', 'and', ' ', 'Co.', ' ', 'made', ' ', 'it', ' ', 'to', ' ', 'the', ' ', 'shop', ' ', 'near', ' ', 'St.', ' ', "Michael's", ' ', 'Church', '??', ' \n\t ', 'The', ' ', 'address', ' ', 'is', ' ', 'written', ' ', 'by', ' ', 'Bohr', ' ', 'Jr.', '   ', 'here', '!']
'''
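
Because the word tokenizer emits whitespace runs as tokens (visible in the output above), concatenating the tokens reproduces the input exactly. A quick check using the output shown above (plain Python, no mwtokenizer installation required):

```python
sample_text = '''Have Moly and Co. made it to the shop near St. Michael's Church?? \n\t The address is written by Bohr Jr.   here!'''
# Token list as printed by tokenizer.word_tokenize(...) above
tokens = ['Have', ' ', 'Moly', ' ', 'and', ' ', 'Co.', ' ', 'made', ' ', 'it',
          ' ', 'to', ' ', 'the', ' ', 'shop', ' ', 'near', ' ', 'St.', ' ',
          "Michael's", ' ', 'Church', '??', ' \n\t ', 'The', ' ', 'address',
          ' ', 'is', ' ', 'written', ' ', 'by', ' ', 'Bohr', ' ', 'Jr.',
          '   ', 'here', '!']
# Lossless round-trip: joining the tokens reconstructs the exact input
assert "".join(tokens) == sample_text
```

This round-trip property is what lets any extracted token be mapped back to its exact position in the original text.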

Download files

  • Source distribution: mwtokenizer-0.0.2.tar.gz (7.1 MB)
  • Built distribution: mwtokenizer-0.0.2-py3-none-any.whl (7.1 MB, Python 3)

