Wikipedia Tokenizer Utility

Project description

Wiki NLP Tools

Python package to perform language-agnostic tokenization.

Vision

  • Researchers can start with a Wikipedia article (wikitext or HTML), strip syntax to leave just paragraphs of plaintext, and then further tokenize these paragraphs into sentences and words for input into models.
  • This would be language-agnostic – i.e. the library would work equally well regardless of Wikipedia language (see https://meta.wikimedia.org/wiki/List_of_Wikipedias).
  • This would be easily accessible – i.e. each component is an open-source, pip-installable Python library that is configurable but provides good default performance out-of-the-box. Wikimedia could use it internally via PySpark UDFs on our cluster, and external organizations and researchers could incorporate it into their workflows.
  • The connections between states are transparent – i.e. for any text extracted in word tokenization, that text can be connected directly back to the original wikitext or HTML that it was derived from.

Features

  • Tokenize text into sentences and words across 300+ languages out-of-the-box
  • Abbreviations can be used to improve performance
  • The word tokenizer takes non-whitespace-delimited languages into account during tokenization
  • Input can be exactly reconstructed from the tokenization output
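
To make the abbreviation and reconstruction features concrete, here is a minimal, self-contained sketch of abbreviation-aware sentence splitting that keeps every character of the input. It is a regex-based stand-in written for illustration, not mwtokenizer's implementation: the function name split_sentences and the tiny ABBREVIATIONS set are hypothetical, and the real package ships its own per-language data.

```python
import re

# Hypothetical abbreviation list for illustration only (not the library's data).
ABBREVIATIONS = {"Co.", "St.", "Jr."}


def split_sentences(text, abbreviations=frozenset()):
    """Split after sentence-ending punctuation while keeping every character."""
    sentences = []
    start = 0
    # Candidate boundary: a run of . ! ? followed by whitespace.
    # The trailing whitespace stays attached to the sentence it follows.
    for match in re.finditer(r"[.!?]+\s+", text):
        words = text[start:match.end()].split()
        last_word = words[-1] if words else ""
        if last_word in abbreviations:
            continue  # "Co. " or "St. " does not end a sentence
        sentences.append(text[start:match.end()])
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])  # remaining tail without a boundary
    return sentences


sample_text = (
    "Have Moly and Co. made it to the shop near St. Michael's Church?? \n\t "
    "The address is written by Bohr Jr.   here!"
)

with_abbreviations = split_sentences(sample_text, ABBREVIATIONS)
without_abbreviations = split_sentences(sample_text)

# Concatenating the pieces gives back the input in both cases.
print("".join(with_abbreviations) == sample_text)
print(len(with_abbreviations), len(without_abbreviations))
```

Without the abbreviation set, the sketch breaks the sample after "Co.", "St." and "Jr." as well, which is the kind of over-splitting that abbreviation handling is meant to avoid.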

Installation

$ pip install mwtokenizer

Basic Usage

from mwtokenizer.tokenizer import Tokenizer
# instantiate a tokenizer for English ("en")
tokenizer = Tokenizer(language_code="en")
sample_text = '''Have Moly and Co. made it to the shop near St. Michael's Church?? \n\t The address is written by Bohr Jr.   here!'''
# split into sentences; with use_abbreviation=True, "Co.", "St." and "Jr." do not end a sentence
print(list(tokenizer.sentence_tokenize(sample_text, use_abbreviation=True)))
'''
[output] ["Have Moly and Co. made it to the shop near St. Michael's Church?? \n\t ", 'The address is written by Bohr Jr.   here!']
'''
# split into words; whitespace runs are returned as tokens, so the input can be rebuilt
print(list(tokenizer.word_tokenize(text=sample_text, use_abbreviation=True)))
'''
[output] ['Have', ' ', 'Moly', ' ', 'and', ' ', 'Co.', ' ', 'made', ' ', 'it', ' ', 'to', ' ', 'the', ' ', 'shop', ' ', 'near', ' ', 'St.', ' ', "Michael's", ' ', 'Church', '??', ' \n\t ', 'The', ' ', 'address', ' ', 'is', ' ', 'written', ' ', 'by', ' ', 'Bohr', ' ', 'Jr.', '   ', 'here', '!']
'''
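
Because whitespace is kept as tokens, the word-level output can be mapped back to character positions in the input. The sketch below does not call the library: it copies the token list from the output shown above and works on it with plain Python. The helper name with_offsets is hypothetical and not part of mwtokenizer.

```python
# Token list copied from the word_tokenize output shown above.
tokens = [
    'Have', ' ', 'Moly', ' ', 'and', ' ', 'Co.', ' ', 'made', ' ', 'it', ' ',
    'to', ' ', 'the', ' ', 'shop', ' ', 'near', ' ', 'St.', ' ', "Michael's",
    ' ', 'Church', '??', ' \n\t ', 'The', ' ', 'address', ' ', 'is', ' ',
    'written', ' ', 'by', ' ', 'Bohr', ' ', 'Jr.', '   ', 'here', '!',
]

sample_text = (
    "Have Moly and Co. made it to the shop near St. Michael's Church?? \n\t "
    "The address is written by Bohr Jr.   here!"
)


def with_offsets(tokens):
    """Attach (start, end) character offsets to each token, in order."""
    spans = []
    position = 0
    for token in tokens:
        spans.append((token, position, position + len(token)))
        position += len(token)
    return spans


# Joining the tokens reproduces the input exactly.
assert "".join(tokens) == sample_text

spans = with_offsets(tokens)

# Drop whitespace-only tokens for model input, but keep their source offsets.
words = [(tok, start, end) for tok, start, end in spans if not tok.isspace()]

# Every kept token can be traced back to the slice it came from.
assert all(sample_text[start:end] == tok for tok, start, end in words)
print(words[:3])
```

Keeping the offsets alongside the filtered tokens is one way to preserve the link between model input and source text after whitespace has been discarded.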

Download files

Source Distribution

mwtokenizer-0.2.0.tar.gz (6.9 MB)

Uploaded Source

Built Distribution

mwtokenizer-0.2.0-py3-none-any.whl (6.9 MB)

Uploaded Python 3

File details

Details for the file mwtokenizer-0.2.0.tar.gz.

File metadata

  • Download URL: mwtokenizer-0.2.0.tar.gz
  • Size: 6.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.18

File hashes

Hashes for mwtokenizer-0.2.0.tar.gz
Algorithm      Hash digest
SHA256         95c496172e6915814edbed261bc64829b8661622fcae2947e530b63aad7bc4ec
MD5            781a3a64665b5c360ac588736c2feeae
BLAKE2b-256    0d713097b66d99807c97babcaf2db08abbcee3157c00e7f76337565431c48e5c

File details

Details for the file mwtokenizer-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: mwtokenizer-0.2.0-py3-none-any.whl
  • Size: 6.9 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.18

File hashes

Hashes for mwtokenizer-0.2.0-py3-none-any.whl
Algorithm      Hash digest
SHA256         84c87ea1968761fa7ad2d774d87bdbf440abba9a56c5e972710b4b504ad1f1b9
MD5            02a3beba69e796719d424cbf313b4722
BLAKE2b-256    25292aad1f38a7b70d7291c716e17958e43ab17416cd041910a1dbfa15a82773
