Wikipedia Tokenizer Utility
Wiki NLP Tools
Python package to perform language-agnostic tokenization.
Vision
- Researchers can start with a Wikipedia article (wikitext or HTML), strip the markup to leave just paragraphs of plaintext, and then tokenize those paragraphs into sentences and words for input into models.
- This would be language-agnostic – i.e. the library works equally well regardless of the language of the Wikipedia edition (https://meta.wikimedia.org/wiki/List_of_Wikipedias).
- This would be easily accessible – i.e. each component is an open-source, pip-installable Python library that is configurable but provides good default performance out of the box. Wikimedia could use it internally via PySpark UDFs on our cluster (a sketch follows this list), and external organizations and researchers could incorporate it into their workflows.
- The connections between stages are transparent – i.e. any text extracted during word tokenization can be traced directly back to the original wikitext or HTML from which it was derived.
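As a rough sketch of the PySpark use case above (not part of the package): this assumes the Tokenizer object can be pickled to Spark executors, and it reuses the sentence_tokenize API shown under Basic Usage below; the session setup and column names are illustrative.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
from mwtokenizer.tokenizer import Tokenizer

spark = SparkSession.builder.getOrCreate()
tokenizer = Tokenizer(language_code="en")

# Wrap sentence tokenization as a UDF; the tokenizer is captured in the
# closure and shipped to the executors (assumes the object is picklable).
sentence_udf = udf(
    lambda text: list(tokenizer.sentence_tokenize(text, use_abbreviation=True)),
    ArrayType(StringType()),
)

df = spark.createDataFrame([("St. Michael's Church is old. It still stands.",)], ["text"])
df.withColumn("sentences", sentence_udf("text")).show(truncate=False)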
Features
- Tokenize text into sentences and words across 300+ languages out of the box
- Abbreviation lists can be used to improve sentence-tokenization performance
- The word tokenizer takes non-whitespace-delimited languages into account (see the example just after this list)
- Input can be exactly reconstructed from the tokenization output (demonstrated after Basic Usage below)
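A hypothetical sketch of the non-whitespace-delimited case, using the API shown under Basic Usage below. That "ja" is a supported language code and that use_abbreviation accepts False are assumptions here (the package advertises 300+ languages out of the box):

from mwtokenizer.tokenizer import Tokenizer

# Assumes Japanese ("ja") is among the 300+ supported languages.
tokenizer_ja = Tokenizer(language_code="ja")

# Japanese is written without spaces between words, so the tokenizer
# must segment the text itself rather than splitting on whitespace.
text_ja = "今日は良い天気です。"
tokens = list(tokenizer_ja.word_tokenize(text=text_ja, use_abbreviation=False))

# The lossless-reconstruction property should hold here as well.
assert "".join(tokens) == text_ja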
Installation
$ pip install mwtokenizer
Basic Usage
from mwtokenizer.tokenizer import Tokenizer
# instantiate a tokenizer for "en" (English)
tokenizer = Tokenizer(language_code="en")
sample_text = '''Have Moly and Co. made it to the shop near St. Michael's Church?? \n\t The address is written by Bohr Jr. here!'''
print(list(tokenizer.sentence_tokenize(sample_text, use_abbreviation=True)))
'''
[output] ["Have Moly and Co. made it to the shop near St. Michael's Church?? \n\t ", 'The address is written by Bohr Jr. here!']
'''
print(list(tokenizer.word_tokenize(text=sample_text, use_abbreviation=True)))
'''
[output] ['Have', ' ', 'Moly', ' ', 'and', ' ', 'Co.', ' ', 'made', ' ', 'it', ' ', 'to', ' ', 'the', ' ', 'shop', ' ', 'near', ' ', 'St.', ' ', "Michael's", ' ', 'Church', '??', ' \n\t ', 'The', ' ', 'address', ' ', 'is', ' ', 'written', ' ', 'by', ' ', 'Bohr', ' ', 'Jr.', ' ', 'here', '!']
'''
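Note in the word-tokenization output above that runs of whitespace (e.g. ' \n\t ') are emitted as tokens too; that is what makes exact reconstruction possible. Continuing the snippet above:

# Concatenating the word tokens reproduces the original input exactly.
assert "".join(tokenizer.word_tokenize(text=sample_text, use_abbreviation=True)) == sample_text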
Download files
Source Distribution
mwtokenizer-0.0.2.tar.gz (7.1 MB)
Built Distribution
mwtokenizer-0.0.2-py3-none-any.whl (7.1 MB)
File details
Details for the file mwtokenizer-0.0.2.tar.gz.
File metadata
- Download URL: mwtokenizer-0.0.2.tar.gz
- Upload date:
- Size: 7.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | c8080477a7917d8cdd9ae7d193616cd42858e13b08cad5a32ad34dae75428902
MD5 | 87c27cd333c72782c9e6e4fe7e737ebe
BLAKE2b-256 | 7b06cfa57d330b102458eaf4c78e0629bc465160989f4231fdc843abf58caf29
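To check a downloaded file against the published SHA256 digest above, a minimal sketch using only Python's standard library (the local file path is assumed):

import hashlib

EXPECTED_SHA256 = "c8080477a7917d8cdd9ae7d193616cd42858e13b08cad5a32ad34dae75428902"

# Hash the archive in chunks and compare against the published digest.
digest = hashlib.sha256()
with open("mwtokenizer-0.0.2.tar.gz", "rb") as f:
    for chunk in iter(lambda: f.read(8192), b""):
        digest.update(chunk)
assert digest.hexdigest() == EXPECTED_SHA256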
File details
Details for the file mwtokenizer-0.0.2-py3-none-any.whl.
File metadata
- Download URL: mwtokenizer-0.0.2-py3-none-any.whl
- Upload date:
- Size: 7.1 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 31c5a39e680cf78331eed06305b6dc5c42c16f9f8edc291f4b9abc42f455f020
MD5 | c0f8dfaf991b652785a64debbb19f227
BLAKE2b-256 | f5b9b045767ee5b08f8d9678aa179970c5c24812f63034125b736a64390d48e7