Wikipedia Tokenizer Utility

Project description

Wiki NLP Tools

Python package to perform language-agnostic tokenization.

Vision

  • Researchers can start with a Wikipedia article (wikitext or HTML), strip the syntax to leave just paragraphs of plaintext, and then tokenize that plaintext into sentences and words for input into models.
  • This would be language-agnostic – i.e. the library would work equally well regardless of the Wikipedia language edition (https://meta.wikimedia.org/wiki/List_of_Wikipedias).
  • This would be easily accessible – i.e. each component is an open-source, pip-installable Python library that is configurable but provides good default performance out of the box, so that Wikimedia can use it internally via PySpark UDFs on its cluster (see the sketch after this list) and external organizations and researchers can incorporate it into their workflows.
  • The connections between stages are transparent – i.e. any text produced by word tokenization can be traced directly back to the original wikitext or HTML it was derived from.
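As a hedged illustration of the PySpark use case above, the sketch below wraps the tokenizer in a UDF. The Spark session, the sample DataFrame, and the column names (page_text, sentences) are assumptions for illustration; only the Tokenizer API comes from this package.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

from mwtokenizer.tokenizer import Tokenizer

# Module-level instance; Spark serializes it with the UDF. A real
# cluster job may prefer per-executor initialization instead.
tokenizer = Tokenizer(language_code="en")

@udf(returnType=ArrayType(StringType()))
def sentence_split(text):
    # Return each sentence as a separate array element.
    if text is None:
        return []
    return list(tokenizer.sentence_tokenize(text, use_abbreviation=True))

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Dr. Smith arrived. He left.",)], ["page_text"])
df.withColumn("sentences", sentence_split("page_text")).show(truncate=False)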

Features

  • Tokenize text into sentences and words across 300+ languages out of the box
  • Abbreviations can be used to improve sentence-segmentation performance
  • The word tokenizer accounts for non-whitespace-delimited languages (see the sketch after this list)
  • Input can be exactly reconstructed from the tokenization output (demonstrated after Basic Usage below)
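As a hedged sketch of the non-whitespace-delimited point above: the language code "ja" and the Japanese sample sentence are assumptions for illustration. The package advertises out-of-the-box support for 300+ languages, but the exact token boundaries produced here are not taken from its documentation.

from mwtokenizer.tokenizer import Tokenizer

# Japanese text has no spaces between words; the word tokenizer is
# expected to segment it anyway.
ja_tokenizer = Tokenizer(language_code="ja")
ja_text = "今日は良い天気です。"
print(list(ja_tokenizer.word_tokenize(text=ja_text, use_abbreviation=False)))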

Installation

$ pip install mwtokenizer

Basic Usage

from mwtokenizer.tokenizer import Tokenizer
# instantiate a tokenizer for English ("en")
tokenizer = Tokenizer(language_code="en")
sample_text = '''Have Moly and Co. made it to the shop near St. Michael's Church?? \n\t The address is written by Bohr Jr.   here!'''
print(list(tokenizer.sentence_tokenize(sample_text, use_abbreviation=True)))
'''
[output] ["Have Moly and Co. made it to the shop near St. Michael's Church?? \n\t ", 'The address is written by Bohr Jr.   here!']
'''
print(list(tokenizer.word_tokenize(text=sample_text, use_abbreviation=True)))
'''
[output] ['Have', ' ', 'Moly', ' ', 'and', ' ', 'Co.', ' ', 'made', ' ', 'it', ' ', 'to', ' ', 'the', ' ', 'shop', ' ', 'near', ' ', 'St.', ' ', "Michael's", ' ', 'Church', '??', ' \n\t ', 'The', ' ', 'address', ' ', 'is', ' ', 'written', ' ', 'by', ' ', 'Bohr', ' ', 'Jr.', '   ', 'here', '!']
'''
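Because whitespace runs are emitted as tokens in their own right (as the output above shows), concatenating the tokenization output reproduces the input exactly. Continuing the example above, a quick check:

# Joining either tokenization output rebuilds the original input.
assert "".join(tokenizer.sentence_tokenize(sample_text, use_abbreviation=True)) == sample_text
assert "".join(tokenizer.word_tokenize(text=sample_text, use_abbreviation=True)) == sample_text

This is what makes the pipeline transparent: any token's position can be traced back to the exact span of the source text it came from.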

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mwtokenizer-0.1.0.tar.gz (6.9 MB)

Built Distribution

mwtokenizer-0.1.0-py3-none-any.whl (6.9 MB)

File details

Details for the file mwtokenizer-0.1.0.tar.gz.

File metadata

  • Download URL: mwtokenizer-0.1.0.tar.gz
  • Upload date:
  • Size: 6.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.18

File hashes

Hashes for mwtokenizer-0.1.0.tar.gz
  • SHA256: e4226e915c4888a058b1ed3ba158aabc7f0d3019ae900d3e1fcdfb66a3b01449
  • MD5: 8c3a0a466c456c16451a1df2750c0fb0
  • BLAKE2b-256: 32a612e4310ae10550db556ef4c7dde0748311201d6b749d0448f8ebb591ebc7

See more details on using hashes here.
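To verify a downloaded file against the SHA256 digest listed above, a minimal check with Python's standard hashlib (the local file path is an assumption):

import hashlib

# Hypothetical local path to the downloaded sdist; adjust as needed.
path = "mwtokenizer-0.1.0.tar.gz"
expected = "e4226e915c4888a058b1ed3ba158aabc7f0d3019ae900d3e1fcdfb66a3b01449"

with open(path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
assert digest == expected, "hash mismatch: file may be corrupted or tampered with"

pip can also enforce these digests automatically via its hash-checking mode (pip install --require-hashes -r requirements.txt).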

File details

Details for the file mwtokenizer-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: mwtokenizer-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 6.9 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.18

File hashes

Hashes for mwtokenizer-0.1.0-py3-none-any.whl
  • SHA256: 77dc8db6fdec6537093c862cd07bdcbeb03991dd530bd8b9d21f1c90ff03418c
  • MD5: 3979ff2ac6f3600aa70ed76fdaddc31c
  • BLAKE2b-256: aa2f45f9fabb51e8698f61e9a3b38e816cf496a002ff7f2e1fcd68df9024b9d2

See more details on using hashes here.
