Wikipedia Tokenizer Utility
Project description
Wiki NLP Tools
Python package to perform language-agnostic tokenization.
Vision
- Researchers can start with a Wikipedia article (wikitext or HTML), strip syntax to leave just paragraphs of plaintext, and then further tokenize that plaintext into sentences and words for input into models.
- This would be language-agnostic – i.e. the library would work equally well regardless of Wikipedia language (see the List of Wikipedias: https://meta.wikimedia.org/wiki/List_of_Wikipedias).
- This would be easily accessible – i.e. each component is an open-source, pip-installable Python library that is configurable but provides good default performance out-of-the-box, which Wikimedia could use internally via PySpark UDFs on our cluster and external organizations/researchers could incorporate into their workflows (see the sketch after this list).
- The connections between stages are transparent – i.e. any text extracted in word tokenization can be connected directly back to the original wikitext or HTML it was derived from.
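As an illustration of the internal-use case above, here is one way the tokenizer could be wrapped in a PySpark UDF. This is a sketch only: the DataFrame, the column name, and the assumption that Tokenizer instances serialize cleanly to executors are all mine, not part of the library.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
from mwtokenizer.tokenizer import Tokenizer

spark = SparkSession.builder.getOrCreate()
# assumption: a Tokenizer instance can be pickled and shipped to executors
tokenizer = Tokenizer(language_code="en")

# hypothetical UDF splitting an article's plaintext into sentences
@udf(returnType=ArrayType(StringType()))
def sentence_split(text):
    return list(tokenizer.sentence_tokenize(text, use_abbreviation=True))

df = spark.createDataFrame([("Dr. Smith arrived. He left.",)], ["text"])
df.select(sentence_split("text").alias("sentences")).show(truncate=False)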
Features
- Tokenize text into sentences and words across 300+ languages out-of-the-box
- Abbreviations can be used to improve performance
- The word tokenizer takes non-whitespace-delimited languages into account during tokenization
- Input can be exactly reconstructed from the tokenization output (see the sketch after this list)
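A minimal sketch of what the reconstruction guarantee means in practice, using the API shown under Basic Usage below (the sample string here is my own, not from the package's tests):

from mwtokenizer.tokenizer import Tokenizer

tokenizer = Tokenizer(language_code="en")
text = "Dr. Smith went to Washington. He stayed for a week."

# word_tokenize yields whitespace and punctuation tokens alongside
# words, so concatenating the token stream reproduces the input exactly
tokens = list(tokenizer.word_tokenize(text=text, use_abbreviation=True))
assert "".join(tokens) == text

# the same round trip holds for sentence tokenization
sentences = list(tokenizer.sentence_tokenize(text, use_abbreviation=True))
assert "".join(sentences) == text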
Installation
$ pip install mwtokenizer
Basic Usage
from mwtokenizer.tokenizer import Tokenizer
# initialize a tokenizer for English ("en")
tokenizer = Tokenizer(language_code="en")
sample_text = '''Have Moly and Co. made it to the shop near St. Michael's Church?? \n\t The address is written by Bohr Jr. here!'''
print(list(tokenizer.sentence_tokenize(sample_text, use_abbreviation=True)))
'''
[output] ["Have Moly and Co. made it to the shop near St. Michael's Church?? \n\t ", 'The address is written by Bohr Jr. here!']
'''
print(list(tokenizer.word_tokenize(text=sample_text, use_abbreviation=True)))
'''
[output] ['Have', ' ', 'Moly', ' ', 'and', ' ', 'Co.', ' ', 'made', ' ', 'it', ' ', 'to', ' ', 'the', ' ', 'shop', ' ', 'near', ' ', 'St.', ' ', "Michael's", ' ', 'Church', '??', ' \n\t ', 'The', ' ', 'address', ' ', 'is', ' ', 'written', ' ', 'by', ' ', 'Bohr', ' ', 'Jr.', ' ', 'here', '!']
'''
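The same API applies to non-whitespace-delimited languages. A hedged sketch, assuming "ja" (Japanese) is among the 300+ supported language codes; the sample text is an illustration, and no specific output is claimed:

from mwtokenizer.tokenizer import Tokenizer

# assumption: "ja" is a supported language code
ja_tokenizer = Tokenizer(language_code="ja")
ja_text = "ウィキペディアは百科事典です。誰でも編集できます。"
# boundaries are detected without relying on whitespace; abbreviation
# handling is turned off here since it targets cases like "St." above
for sentence in ja_tokenizer.sentence_tokenize(ja_text, use_abbreviation=False):
    print(list(ja_tokenizer.word_tokenize(text=sentence, use_abbreviation=False)))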
Download files
Source Distribution
mwtokenizer-0.2.0.tar.gz (6.9 MB)
Built Distribution
mwtokenizer-0.2.0-py3-none-any.whl (6.9 MB)
File details
Details for the file mwtokenizer-0.2.0.tar.gz.
File metadata
- Download URL: mwtokenizer-0.2.0.tar.gz
- Upload date:
- Size: 6.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.18
File hashes
Algorithm | Hash digest
---|---
SHA256 | 95c496172e6915814edbed261bc64829b8661622fcae2947e530b63aad7bc4ec
MD5 | 781a3a64665b5c360ac588736c2feeae
BLAKE2b-256 | 0d713097b66d99807c97babcaf2db08abbcee3157c00e7f76337565431c48e5c
File details
Details for the file mwtokenizer-0.2.0-py3-none-any.whl.
File metadata
- Download URL: mwtokenizer-0.2.0-py3-none-any.whl
- Upload date:
- Size: 6.9 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.18
File hashes
Algorithm | Hash digest
---|---
SHA256 | 84c87ea1968761fa7ad2d774d87bdbf440abba9a56c5e972710b4b504ad1f1b9
MD5 | 02a3beba69e796719d424cbf313b4722
BLAKE2b-256 | 25292aad1f38a7b70d7291c716e17958e43ab17416cd041910a1dbfa15a82773