Wikipedia Tokenizer Utility
Wiki NLP Tools
Python package to perform language-agnostic tokenization.
Vision
- Researchers can start with a Wikipedia article (wikitext or HTML), strip syntax to leave just paragraphs of plaintext, and then further tokenize this plaintext into sentences and words for input into models.
- This would be language-agnostic – i.e. the library would work equally well regardless of Wikipedia language (see https://meta.wikimedia.org/wiki/List_of_Wikipedias).
- This would be easily accessible – i.e. each component is an open-source, pip-installable Python library that is configurable but provides good default performance out-of-the-box, which Wikimedia could use internally via PySpark UDFs on our cluster (see the sketch after this list) and external organizations/researchers could incorporate into their workflows.
- The connections between stages are transparent – i.e. for any text extracted in word tokenization, that text can be connected directly back to the original wikitext or HTML that it was derived from.
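As a concrete illustration of the PySpark point above, here is a minimal sketch (not part of the library) of wrapping the tokenizer in a UDF; it assumes mwtokenizer is installed on every executor and that "en" is the desired language code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
from mwtokenizer.tokenizer import Tokenizer

_TOKENIZER = None

def _get_tokenizer():
    # Construct one Tokenizer lazily per executor process instead of
    # shipping a pickled instance from the driver.
    global _TOKENIZER
    if _TOKENIZER is None:
        _TOKENIZER = Tokenizer(language_code="en")
    return _TOKENIZER

@udf(returnType=ArrayType(StringType()))
def sentence_tokenize_udf(text):
    if text is None:
        return []
    return list(_get_tokenizer().sentence_tokenize(text, use_abbreviation=True))

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Have Moly and Co. made it to the shop?",)], ["text"])
df.withColumn("sentences", sentence_tokenize_udf("text")).show(truncate=False)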
Features
- Tokenize text into sentences and words across 300+ languages out-of-the-box
- Abbreviations can be used to improve performance
- Word tokenizer takes non-whitespace-delimited languages into account during tokenization
- Input can be exactly reconstructed from the tokenization output (see the check after the Basic Usage example below)
Installation
$ pip install mwtokenizer
Basic Usage
from mwtokenizer.tokenizer import Tokenizer
# initialize a tokenizer for "en" (English)
tokenizer = Tokenizer(language_code="en")
sample_text = '''Have Moly and Co. made it to the shop near St. Michael's Church?? \n\t The address is written by Bohr Jr. here!'''
print(list(tokenizer.sentence_tokenize(sample_text, use_abbreviation=True)))
'''
[output] ["Have Moly and Co. made it to the shop near St. Michael's Church?? \n\t ", 'The address is written by Bohr Jr. here!']
'''
print(list(tokenizer.word_tokenize(text=sample_text, use_abbreviation=True)))
'''
[output] ['Have', ' ', 'Moly', ' ', 'and', ' ', 'Co.', ' ', 'made', ' ', 'it', ' ', 'to', ' ', 'the', ' ', 'shop', ' ', 'near', ' ', 'St.', ' ', "Michael's", ' ', 'Church', '??', ' \n\t ', 'The', ' ', 'address', ' ', 'is', ' ', 'written', ' ', 'by', ' ', 'Bohr', ' ', 'Jr.', ' ', 'here', '!']
'''
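Because the input can be exactly reconstructed from the tokenization output (the last feature listed above), concatenating the word tokens returns the original string. Continuing the snippet above:

assert "".join(tokenizer.word_tokenize(text=sample_text, use_abbreviation=True)) == sample_text

The same API applies to non-whitespace-delimited languages. A hedged sketch, assuming "ja" (Japanese) is among the supported language codes; the exact token boundaries depend on the library's word-boundary model:

ja_tokenizer = Tokenizer(language_code="ja")
print(list(ja_tokenizer.word_tokenize(text="東京は日本の首都です。", use_abbreviation=False)))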
Download files
Source Distribution
mwtokenizer-0.1.0.tar.gz (6.9 MB, see file details below)
Built Distribution
mwtokenizer-0.1.0-py3-none-any.whl (6.9 MB, see file details below)
File details
Details for the file mwtokenizer-0.1.0.tar.gz.
File metadata
- Download URL: mwtokenizer-0.1.0.tar.gz
- Upload date:
- Size: 6.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.18
File hashes
Algorithm | Hash digest
---|---
SHA256 | e4226e915c4888a058b1ed3ba158aabc7f0d3019ae900d3e1fcdfb66a3b01449
MD5 | 8c3a0a466c456c16451a1df2750c0fb0
BLAKE2b-256 | 32a612e4310ae10550db556ef4c7dde0748311201d6b749d0448f8ebb591ebc7
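To confirm that a downloaded file matches the published digest, the SHA256 hash above can be checked with Python's standard hashlib (this sketch assumes the file sits in the current directory):

import hashlib

with open("mwtokenizer-0.1.0.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
assert digest == "e4226e915c4888a058b1ed3ba158aabc7f0d3019ae900d3e1fcdfb66a3b01449"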
File details
Details for the file mwtokenizer-0.1.0-py3-none-any.whl.
File metadata
- Download URL: mwtokenizer-0.1.0-py3-none-any.whl
- Upload date:
- Size: 6.9 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.18
File hashes
Algorithm | Hash digest
---|---
SHA256 | 77dc8db6fdec6537093c862cd07bdcbeb03991dd530bd8b9d21f1c90ff03418c
MD5 | 3979ff2ac6f3600aa70ed76fdaddc31c
BLAKE2b-256 | aa2f45f9fabb51e8698f61e9a3b38e816cf496a002ff7f2e1fcd68df9024b9d2