Wikipedia Tokenizer Utility
Project description
Wiki NLP Tools
Python package to perform language-agnostic tokenization.
Vision
- Researchers can start with a Wikipedia article (wikitext or HTML), strip syntax to leave just paragraphs of plaintext, and then further tokenize that plaintext into sentences and words for input into models.
- This would be language-agnostic – i.e. the library would work equally well regardless of Wikipedia language (see the List of Wikipedias: https://meta.wikimedia.org/wiki/List_of_Wikipedias).
- This would be easily accessible – i.e. each component is an open-source, pip-installable Python library that is configurable but provides good default performance out-of-the-box, which Wikimedia could use internally via PySpark UDFs on our cluster and external organizations/researchers could incorporate into their workflows (see the sketch after this list).
- The connections between stages are transparent – i.e. any text extracted in word tokenization can be connected directly back to the original wikitext or HTML it was derived from.
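As an illustration of the internal-use case above, here is one way the tokenizer could be wrapped in a PySpark UDF. This is a sketch only: the DataFrame, the column name, and the assumption that Tokenizer instances serialize cleanly to executors are all mine, not part of the library.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
from mwtokenizer.tokenizer import Tokenizer

spark = SparkSession.builder.getOrCreate()
# assumption: a Tokenizer instance can be pickled and shipped to executors
tokenizer = Tokenizer(language_code="en")

# hypothetical UDF splitting an article's plaintext into sentences
@udf(returnType=ArrayType(StringType()))
def sentence_split(text):
    return list(tokenizer.sentence_tokenize(text, use_abbreviation=True))

df = spark.createDataFrame([("Dr. Smith arrived. He left.",)], ["text"])
df.select(sentence_split("text").alias("sentences")).show(truncate=False)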
Features
- Tokenize text into sentences and words across 300+ languages out-of-the-box
- Abbreviations can be used to improve performance
- The word tokenizer takes non-whitespace-delimited languages into account during tokenization
- Input can be exactly reconstructed from the tokenization output (see the sketch after this list)
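A minimal sketch of what the reconstruction guarantee means in practice, using the API shown under Basic Usage below (the sample string here is my own, not from the package's tests):

from mwtokenizer.tokenizer import Tokenizer

tokenizer = Tokenizer(language_code="en")
text = "Dr. Smith went to Washington. He stayed for a week."

# word_tokenize yields whitespace and punctuation tokens alongside
# words, so concatenating the token stream reproduces the input exactly
tokens = list(tokenizer.word_tokenize(text=text, use_abbreviation=True))
assert "".join(tokens) == text

# the same round trip holds for sentence tokenization
sentences = list(tokenizer.sentence_tokenize(text, use_abbreviation=True))
assert "".join(sentences) == text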
Installation
$ pip install mwtokenizer
Basic Usage
from mwtokenizer.tokenizer import Tokenizer
# initialize a tokenizer for English ("en")
tokenizer = Tokenizer(language_code="en")
sample_text = '''Have Moly and Co. made it to the shop near St. Michael's Church?? \n\t The address is written by Bohr Jr. here!'''
print(list(tokenizer.sentence_tokenize(sample_text, use_abbreviation=True)))
'''
[output] ["Have Moly and Co. made it to the shop near St. Michael's Church?? \n\t ", 'The address is written by Bohr Jr. here!']
'''
print(list(tokenizer.word_tokenize(text=sample_text, use_abbreviation=True)))
'''
[output] ['Have', ' ', 'Moly', ' ', 'and', ' ', 'Co.', ' ', 'made', ' ', 'it', ' ', 'to', ' ', 'the', ' ', 'shop', ' ', 'near', ' ', 'St.', ' ', "Michael's", ' ', 'Church', '??', ' \n\t ', 'The', ' ', 'address', ' ', 'is', ' ', 'written', ' ', 'by', ' ', 'Bohr', ' ', 'Jr.', ' ', 'here', '!']
'''
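The same API applies to non-whitespace-delimited languages. A hedged sketch, assuming "ja" (Japanese) is among the 300+ supported language codes; the sample text is an illustration, and no specific output is claimed:

from mwtokenizer.tokenizer import Tokenizer

# assumption: "ja" is a supported language code
ja_tokenizer = Tokenizer(language_code="ja")
ja_text = "ウィキペディアは百科事典です。誰でも編集できます。"
# boundaries are detected without relying on whitespace; abbreviation
# handling is turned off here since it targets cases like "St." above
for sentence in ja_tokenizer.sentence_tokenize(ja_text, use_abbreviation=False):
    print(list(ja_tokenizer.word_tokenize(text=sentence, use_abbreviation=False)))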
Download files
Source Distribution
mwtokenizer-0.2.0.tar.gz (6.9 MB)
Built Distribution
mwtokenizer-0.2.0-py3-none-any.whl (6.9 MB)
File details
Details for the file mwtokenizer-0.2.0.tar.gz.
File metadata
- Download URL: mwtokenizer-0.2.0.tar.gz
- Upload date:
- Size: 6.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.18
File hashes
Algorithm | Hash digest
---|---
SHA256 | 95c496172e6915814edbed261bc64829b8661622fcae2947e530b63aad7bc4ec
MD5 | 781a3a64665b5c360ac588736c2feeae
BLAKE2b-256 | 0d713097b66d99807c97babcaf2db08abbcee3157c00e7f76337565431c48e5c
File details
Details for the file mwtokenizer-0.2.0-py3-none-any.whl.
File metadata
- Download URL: mwtokenizer-0.2.0-py3-none-any.whl
- Upload date:
- Size: 6.9 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.18
File hashes
Algorithm | Hash digest
---|---
SHA256 | 84c87ea1968761fa7ad2d774d87bdbf440abba9a56c5e972710b4b504ad1f1b9
MD5 | 02a3beba69e796719d424cbf313b4722
BLAKE2b-256 | 25292aad1f38a7b70d7291c716e17958e43ab17416cd041910a1dbfa15a82773