MWParserFromHell is a parser for MediaWiki wikicode.

mwparserfromhell (the MediaWiki Parser from Hell) is a Python package that provides an easy-to-use and outrageously powerful parser for MediaWiki wikicode. It supports Python 2 and Python 3.

Developed by Earwig with contributions from Σ, Legoktm, and others. Full documentation is available on ReadTheDocs. Development occurs on GitHub.

Installation

The easiest way to install the parser is through the Python Package Index: run pip install mwparserfromhell to get the latest release. Alternatively, get the latest development version:

git clone https://github.com/earwig/mwparserfromhell.git
cd mwparserfromhell
python setup.py install

If you get error: Unable to find vcvarsall.bat while installing, this is because Windows can’t find the compiler for C extensions. Consult this StackOverflow question for help. You can also set ext_modules in setup.py to an empty list to prevent the extension from building.

You can run the comprehensive unit test suite with python setup.py test -q.

Usage

Normal usage is rather straightforward (where text is page text):

>>> import mwparserfromhell
>>> wikicode = mwparserfromhell.parse(text)

wikicode is a mwparserfromhell.Wikicode object, which acts like an ordinary unicode object (or str in Python 3) with some extra methods. For example:

>>> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?"
>>> wikicode = mwparserfromhell.parse(text)
>>> print wikicode
I has a template! {{foo|bar|baz|eggs=spam}} See it?
>>> templates = wikicode.filter_templates()
>>> print templates
['{{foo|bar|baz|eggs=spam}}']
>>> template = templates[0]
>>> print template.name
foo
>>> print template.params
['bar', 'baz', 'eggs=spam']
>>> print template.get(1).value
bar
>>> print template.get("eggs").value
spam
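When a template's parameters are needed as plain data, they can be copied into a dictionary. The following is a minimal sketch, written for Python 3; params_to_dict is a name made up here, not part of mwparserfromhell, and it assumes each entry in template.params exposes name and value attributes, as the get() calls above suggest:

```python
def params_to_dict(template):
    """Map each parameter name to its value, both as plain strings.

    Positional parameters are expected to appear under "1", "2", and so on.
    Whitespace around names is stripped; values are kept verbatim.
    """
    return {str(param.name).strip(): str(param.value)
            for param in template.params}
```

Called on the template from the example above, this should give a dictionary with the keys "1", "2", and "eggs". If a name occurs twice, the later value overwrites the earlier one.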

Since nodes can contain other nodes, getting nested templates is trivial:

>>> text = "{{foo|{{bar}}={{baz|{{spam}}}}}}"
>>> mwparserfromhell.parse(text).filter_templates()
['{{foo|{{bar}}={{baz|{{spam}}}}}}', '{{bar}}', '{{baz|{{spam}}}}', '{{spam}}']

You can also pass recursive=False to filter_templates() and explore templates manually. This is possible because nodes can contain additional Wikicode objects:

>>> code = mwparserfromhell.parse("{{foo|this {{includes a|template}}}}")
>>> print code.filter_templates(recursive=False)
['{{foo|this {{includes a|template}}}}']
>>> foo = code.filter_templates(recursive=False)[0]
>>> print foo.get(1).value
this {{includes a|template}}
>>> print foo.get(1).value.filter_templates()[0]
{{includes a|template}}
>>> print foo.get(1).value.filter_templates()[0].get(1).value
template
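The manual exploration above can be wrapped in a small recursive generator. This is only a sketch: walk_templates is a hypothetical helper, not part of the library, and it descends only into parameter values, so templates nested inside a template name or a parameter name are not visited.

```python
def walk_templates(code, depth=0):
    """Yield (depth, template) pairs, outermost templates first.

    `code` is expected to be a Wikicode object, i.e. something with
    filter_templates(recursive=False).
    """
    for template in code.filter_templates(recursive=False):
        yield depth, template
        for param in template.params:
            # Each parameter value is itself a Wikicode object,
            # so the same function can be applied to it.
            for item in walk_templates(param.value, depth + 1):
                yield item
```

To use it, pass the result of mwparserfromhell.parse() as code and loop over the pairs; the depth value makes it easy to indent nested templates when printing them.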

Templates can be easily modified to add, remove, or alter params. Wikicode objects can be treated like lists, with append(), insert(), remove(), replace(), and more. They also have a matches() method for comparing page or template names, which takes care of capitalization and whitespace:

>>> text = "{{cleanup}} '''Foo''' is a [[bar]]. {{uncategorized}}"
>>> code = mwparserfromhell.parse(text)
>>> for template in code.filter_templates():
...     if template.name.matches("Cleanup") and not template.has("date"):
...         template.add("date", "July 2012")
...
>>> print code
{{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{uncategorized}}
>>> code.replace("{{uncategorized}}", "{{bar-stub}}")
>>> print code
{{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{bar-stub}}
>>> print code.filter_templates()
['{{cleanup|date=July 2012}}', '{{bar-stub}}']

You can then convert code back into a regular unicode object (for saving the page!) by calling unicode() on it:

>>> text = unicode(code)
>>> print text
{{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{bar-stub}}
>>> text == code
True

Likewise, use str(code) in Python 3.
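A script that has to run under both interpreters can pick the conversion function once and reuse it. A minimal sketch; text_type and to_text are names chosen for illustration, not something mwparserfromhell provides:

```python
import sys

if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 (this name only exists on Python 2)


def to_text(code):
    """Convert a Wikicode object (or anything else) to the native text type."""
    return text_type(code)
```

With this in place, to_text(code) stands in for unicode(code) on Python 2 and str(code) on Python 3.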

Integration

mwparserfromhell is used by, and was originally developed for, EarwigBot; its Page objects have a parse method that essentially calls mwparserfromhell.parse() on page.get().

If you’re using Pywikipedia, your code might look like this:

import mwparserfromhell
import wikipedia as pywikibot

def parse(title):
    # Fetch the page text through Pywikipedia, then hand it to the parser.
    site = pywikibot.getSite()
    page = pywikibot.Page(site, title)
    text = page.get()
    return mwparserfromhell.parse(text)

If you’re not using a library, you can parse templates in any page using the following code (via the API):

import json
import urllib
import mwparserfromhell

API_URL = "https://en.wikipedia.org/w/api.php"

def parse(title):
    # Python 2 only: urllib.urlopen() and indexing dict.values() do not work in Python 3.
    data = {"action": "query", "prop": "revisions", "rvlimit": 1,
            "rvprop": "content", "format": "json", "titles": title}
    raw = urllib.urlopen(API_URL, urllib.urlencode(data)).read()
    res = json.loads(raw)
    # The response holds a single page, keyed by its page ID.
    text = res["query"]["pages"].values()[0]["revisions"][0]["*"]
    return mwparserfromhell.parse(text)
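The snippet above depends on Python 2's urllib. Under Python 3 the same request might look like the sketch below, which sends the same query and reads the same response layout. The helper names (build_query, extract_text, fetch_text) and the USER_AGENT string are placeholders invented for this example; the header is included on the assumption that the server expects clients to identify themselves.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
# Placeholder value: replace it with something that identifies your own tool.
USER_AGENT = "example-wikitext-fetcher/0.1 (hypothetical)"


def build_query(title):
    """Return the URL-encoded POST body asking for the latest revision text."""
    params = {"action": "query", "prop": "revisions", "rvlimit": 1,
              "rvprop": "content", "format": "json", "titles": title}
    # urlopen() needs bytes, not str, for POST data in Python 3.
    return urllib.parse.urlencode(params).encode("utf-8")


def extract_text(response):
    """Pull the wikitext out of a decoded API response."""
    pages = response["query"]["pages"]
    # dict.values() is a view in Python 3, so it cannot be indexed directly.
    page = next(iter(pages.values()))
    return page["revisions"][0]["*"]


def fetch_text(title):
    """Fetch the current wikitext of `title` over the network."""
    request = urllib.request.Request(API_URL, data=build_query(title),
                                     headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as handle:
        return extract_text(json.loads(handle.read().decode("utf-8")))
```

The string returned by fetch_text(title) can then be passed to mwparserfromhell.parse(). Keeping the request building and response handling in separate functions lets both be checked without network access.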
