MWParserFromHell is a parser for MediaWiki wikicode.

Project description

mwparserfromhell (the MediaWiki Parser from Hell) is a Python package that provides an easy-to-use and outrageously powerful parser for MediaWiki wikicode. It supports Python 2 and Python 3.

Developed by Earwig with contributions from Σ, Legoktm, and others. Full documentation is available on ReadTheDocs. Development occurs on GitHub.

Installation

The easiest way to install the parser is through the Python Package Index; you can install the latest release with pip install mwparserfromhell. Make sure your pip is up to date first, especially on Windows.

Alternatively, get the latest development version:

git clone https://github.com/earwig/mwparserfromhell.git
cd mwparserfromhell
python setup.py install

You can run the comprehensive unit testing suite with python setup.py test -q.

Usage

Normal usage is rather straightforward (where text is page text):

>>> import mwparserfromhell
>>> wikicode = mwparserfromhell.parse(text)

wikicode is a mwparserfromhell.Wikicode object, which acts like an ordinary str object (or unicode in Python 2) with some extra methods. For example:

>>> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?"
>>> wikicode = mwparserfromhell.parse(text)
>>> print(wikicode)
I has a template! {{foo|bar|baz|eggs=spam}} See it?
>>> templates = wikicode.filter_templates()
>>> print(templates)
['{{foo|bar|baz|eggs=spam}}']
>>> template = templates[0]
>>> print(template.name)
foo
>>> print(template.params)
['bar', 'baz', 'eggs=spam']
>>> print(template.get(1).value)
bar
>>> print(template.get("eggs").value)
spam

Since nodes can contain other nodes, getting nested templates is trivial:

>>> text = "{{foo|{{bar}}={{baz|{{spam}}}}}}"
>>> mwparserfromhell.parse(text).filter_templates()
['{{foo|{{bar}}={{baz|{{spam}}}}}}', '{{bar}}', '{{baz|{{spam}}}}', '{{spam}}']

You can also pass recursive=False to filter_templates() and explore templates manually. This is possible because nodes can contain additional Wikicode objects:

>>> code = mwparserfromhell.parse("{{foo|this {{includes a|template}}}}")
>>> print(code.filter_templates(recursive=False))
['{{foo|this {{includes a|template}}}}']
>>> foo = code.filter_templates(recursive=False)[0]
>>> print(foo.get(1).value)
this {{includes a|template}}
>>> print(foo.get(1).value.filter_templates()[0])
{{includes a|template}}
>>> print(foo.get(1).value.filter_templates()[0].get(1).value)
template

Templates can be easily modified to add, remove, or alter params. Wikicode objects can be treated like lists, with append(), insert(), remove(), replace(), and more. They also have a matches() method for comparing page or template names, which takes care of capitalization and whitespace:

>>> text = "{{cleanup}} '''Foo''' is a [[bar]]. {{uncategorized}}"
>>> code = mwparserfromhell.parse(text)
>>> for template in code.filter_templates():
...     if template.name.matches("Cleanup") and not template.has("date"):
...         template.add("date", "July 2012")
...
>>> print(code)
{{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{uncategorized}}
>>> code.replace("{{uncategorized}}", "{{bar-stub}}")
>>> print(code)
{{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{bar-stub}}
>>> print(code.filter_templates())
['{{cleanup|date=July 2012}}', '{{bar-stub}}']

You can then convert code back into a regular str object (for saving the page!) by calling str() on it:

>>> text = str(code)
>>> print(text)
{{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{bar-stub}}
>>> text == code
True

Likewise, use unicode(code) in Python 2.

Limitations

While the MediaWiki parser generates HTML and has access to the contents of templates, among other things, mwparserfromhell acts as a direct interface to the source code only. This has several implications:

  • Syntax elements produced by a template transclusion cannot be detected. For example, imagine a hypothetical page "Template:End-bold" that contained the text </b>. While MediaWiki would correctly understand that <b>foobar{{end-bold}} translates to <b>foobar</b>, mwparserfromhell has no way of examining the contents of {{end-bold}}. Instead, it would treat the bold tag as unfinished, possibly extending further down the page.

  • Templates adjacent to external links, as in http://example.com{{foo}}, are considered part of the link. In reality, this would depend on the contents of the template.

  • When different syntax elements cross over each other, as in {{echo|''Hello}}, world!'', the parser gets confused because this cannot be represented by an ordinary syntax tree. Instead, the parser will treat the first syntax construct as plain text. In this case, only the italic tag would be properly parsed.

    Workaround: Since this commonly occurs with text formatting and text formatting is often not of interest to users, you may pass skip_style_tags=True to mwparserfromhell.parse(). This treats '' and ''' as plain text.

    A future version of mwparserfromhell may include multiple parsing modes to get around this restriction more sensibly.

Additionally, the parser lacks awareness of certain wiki-specific settings:

  • Word-ending links are not supported, since the linktrail rules are language-specific.

  • Localized namespace names aren’t recognized, so file links (such as [[File:...]]) are treated as regular wikilinks.

  • Anything that looks like an XML tag is treated as a tag, even if it is not a recognized tag name, since the list of valid tags depends on loaded MediaWiki extensions.

Integration

mwparserfromhell is used by and originally developed for EarwigBot; Page objects have a parse method that essentially calls mwparserfromhell.parse() on page.get().

If you’re using Pywikibot, your code might look like this:

import mwparserfromhell
import pywikibot

def parse(title):
    site = pywikibot.Site()
    page = pywikibot.Page(site, title)
    text = page.get()
    return mwparserfromhell.parse(text)

If you’re not using a library, you can parse any page using the following Python 3 code (via the API):

import json
from urllib.parse import urlencode
from urllib.request import urlopen

import mwparserfromhell

API_URL = "https://en.wikipedia.org/w/api.php"

def parse(title):
    # Request the content of the latest revision of the page
    data = {"action": "query", "prop": "revisions", "rvlimit": 1,
            "rvprop": "content", "format": "json", "titles": title}
    # Passing a request body makes urlopen() send a POST request
    raw = urlopen(API_URL, urlencode(data).encode()).read()
    res = json.loads(raw)
    # dict.values() is a view in Python 3, so convert it before indexing
    text = list(res["query"]["pages"].values())[0]["revisions"][0]["*"]
    return mwparserfromhell.parse(text)
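
The nested lookup in the response is the fragile part of this approach, so it may help to isolate it. The sketch below is an assumption-laden illustration: extract_content is a hypothetical helper, and both sample dicts are hand-written to mimic the response shape the code above expects, not captured from a real request.

```python
def extract_content(res):
    """Return the wikitext of the first page in a decoded query response.

    Returns None when the page entry carries no revisions, which is assumed
    here to be how a nonexistent title is reported.
    """
    pages = res["query"]["pages"]
    # "pages" is keyed by page ID, which is not known in advance
    page = next(iter(pages.values()))
    revisions = page.get("revisions")
    if not revisions:
        return None
    return revisions[0]["*"]


# Hand-written samples for illustration only
found = {"query": {"pages": {"12345": {"revisions": [{"*": "{{foo}} text"}]}}}}
missing = {"query": {"pages": {"-1": {"title": "Nope", "missing": ""}}}}

found_text = extract_content(found)
missing_text = extract_content(missing)
print(found_text, missing_text)
```

Returning None instead of raising keeps the caller in control of how to treat missing pages before the text is handed to the parser.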
