A set of utilities for processing MediaWiki XML dump data.

# MediaWiki XML

This library contains a collection of utilities for efficiently processing MediaWiki’s XML database dumps. It addresses two concerns: the complexity and the performance of streaming XML parsing. The library enables memory-efficient stream processing of XML dumps with a simple [iterator](https://pythonhosted.org/mwxml/iteration.html) strategy. It also implements a distributed processing strategy (see [map()](https://pythonhosted.org/mwxml/map.html)) that processes many XML dump files in parallel.
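To make the idea of memory-efficient streaming concrete, here is a minimal sketch using only the standard library's `xml.etree.ElementTree.iterparse`. It is not how mwxml is implemented internally (the source does not say); the sample XML, `local_name`, and `iter_revision_ids` are hypothetical names chosen for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a MediaWiki XML dump.
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Foo</title>
    <id>1</id>
    <revision><id>10</id><text>first</text></revision>
    <revision><id>11</id><text>second</text></revision>
  </page>
  <page>
    <title>Bar</title>
    <id>2</id>
    <revision><id>20</id><text>only</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_revision_ids(stream):
    """Yield (page_title, revision_id) pairs while reading the stream incrementally."""
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            rev_id = next(int(c.text) for c in elem if local_name(c.tag) == "id")
            yield title, rev_id
            elem.clear()  # drop the revision's children once it has been consumed
        elif name == "page":
            elem.clear()  # drop the finished page's remaining children
            title = None


pairs = list(iter_revision_ids(io.BytesIO(SAMPLE_DUMP)))
print(pairs)
```

Because the function is a generator, a caller can loop over revisions one at a time instead of building the full list as the last two lines do. Clearing each element after use keeps most of the parsed tree from accumulating, which is the general idea behind stream processing of large dumps.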

## Example

>>> import mwxml
>>>
>>> dump = mwxml.Dump.from_file(open("dump.xml"))
>>> print(dump.site_info.name, dump.site_info.dbname)
Wikipedia enwiki
>>>
>>> for page in dump:
...     for revision in page:
...         print(revision.id)
...
1
2
3
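The example above reads a single file. The description also mentions processing many dump files at the same time; the sketch below illustrates that general pattern with the standard library only. It does not use or reproduce mwxml's `map()` (its signature is not shown here), and `count_revisions`, `map_dumps`, and `write_sample` are hypothetical helpers that operate on tiny generated files.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def count_revisions(path):
    """Stream one dump file and return (path, number of <revision> elements)."""
    count = 0
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag.rsplit("}", 1)[-1] == "revision":
            count += 1
            elem.clear()  # release the revision's children after counting it
    return path, count


def map_dumps(process, paths, workers=2):
    """Apply `process` to every path concurrently, yielding results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        yield from pool.map(process, paths)


def write_sample(directory, name, n_revisions):
    """Write a tiny fake dump with one page and `n_revisions` revisions."""
    revisions = "".join(
        f"<revision><id>{i}</id></revision>" for i in range(1, n_revisions + 1)
    )
    path = os.path.join(directory, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"<mediawiki><page><title>T</title>{revisions}</page></mediawiki>")
    return path


with tempfile.TemporaryDirectory() as tmp:
    paths = [write_sample(tmp, "a.xml", 3), write_sample(tmp, "b.xml", 5)]
    counts = [count for _path, count in map_dumps(count_revisions, paths)]

print(counts)
```

Threads keep this sketch simple and portable. For CPU-bound parsing of real dumps, a process pool (for example `concurrent.futures.ProcessPoolExecutor`) is likely the better fit, since separate processes are not limited by the interpreter lock.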

## Author

* Aaron Halfaker – https://github.com/halfak

## See also

* http://dumps.wikimedia.org/
* http://community.wikia.com/wiki/Help:Database_download


Source Distribution

mwxml-0.2.2.tar.gz (13.9 kB view details)

Uploaded Source

File details

Details for the file mwxml-0.2.2.tar.gz.

File metadata

  • Download URL: mwxml-0.2.2.tar.gz
  • Upload date:
  • Size: 13.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for mwxml-0.2.2.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 5a92ed56eceb5a282d68fade4451583bbd0eda411b831003f6e7b087471cb5d5 |
| MD5 | cedb7d210b883afbe21fb6b81f5e5bce |
| BLAKE2b-256 | e6712f2c1c72f9293b663e17bba6d714cc78dbb1972a2106857eca20048a716a |
