A set of utilities for processing MediaWiki XML dump data.

Project description

# MediaWiki XML

This library contains a collection of utilities for efficiently processing MediaWiki’s XML database dumps. It addresses two concerns: the complexity and the performance of streaming XML parsing. It enables memory-efficient stream processing of XML dumps with a simple [iterator](https://pythonhosted.org/mwxml/iteration.html) strategy, and it implements a distributed processing strategy (see [map()](https://pythonhosted.org/mwxml/map.html)) that enables parallel processing of many XML dump files at the same time.

## Example

>>> import mwxml
>>>
>>> dump = mwxml.Dump.from_file(open("dump.xml"))
>>> print(dump.site_info.name, dump.site_info.dbname)
Wikipedia enwiki
>>>
>>> for page in dump:
...     for revision in page:
...         print(revision.id)
...
1
2
3
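
To illustrate the general idea of memory-efficient stream processing mentioned in the introduction, here is a minimal sketch that uses only Python's standard library (`xml.etree.ElementTree.iterparse`). It is not mwxml's implementation: the helper names `local_name` and `iter_revision_ids` and the tiny `SAMPLE` document are assumptions made for this illustration, and real dumps contain many more fields.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def iter_revision_ids(source):
    """Yield (page_title, revision_id) pairs while streaming through the XML.

    Hypothetical helper for illustration only; it is not part of mwxml.
    """
    title = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            # Look only at direct children so a contributor's <id> is ignored.
            rev_id = next(
                (child.text for child in elem if local_name(child.tag) == "id"),
                None,
            )
            if rev_id is not None:
                yield title, int(rev_id)
            # Drop the revision's children so its text does not stay in memory.
            elem.clear()
        elif name == "page":
            elem.clear()


# A made-up, heavily simplified dump used only for this example.
SAMPLE = b"""<mediawiki>
  <page>
    <title>Foo</title>
    <id>10</id>
    <revision><id>1</id><contributor><id>99</id></contributor></revision>
    <revision><id>2</id></revision>
  </page>
  <page>
    <title>Bar</title>
    <id>11</id>
    <revision><id>3</id></revision>
  </page>
</mediawiki>"""

for page_title, revision_id in iter_revision_ids(io.BytesIO(SAMPLE)):
    print(page_title, revision_id)
```

Because the function is a generator, only one revision element needs to be fully materialized at a time, which is the property that makes iterating over very large dump files practical.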

## Author

* Aaron Halfaker – https://github.com/halfak

## See also

* http://dumps.wikimedia.org/
* http://community.wikia.com/wiki/Help:Database_download

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mwxml-0.3.2.tar.gz (16.3 kB view details)

Uploaded Source

Built Distribution

mwxml-0.3.2-py2.py3-none-any.whl (32.6 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file mwxml-0.3.2.tar.gz.

File metadata

  • Download URL: mwxml-0.3.2.tar.gz
  • Size: 16.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for mwxml-0.3.2.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | c8b992db787c84efc4e1342dbe102a4d4f245c318e9afdacf1e27b4db80b20f9 |
| MD5 | 3d61e567173c2d518f7613cd803cb348 |
| BLAKE2b-256 | af92bc8f93824a1b6106e2b41f14ef934b3f4995f5b3bcc99abc90ea61fb34ec |

File details

Details for the file mwxml-0.3.2-py2.py3-none-any.whl.

File hashes

Hashes for mwxml-0.3.2-py2.py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | c3994bbae73ba3504474336697a3d825f7b882d12afed529cdf2789bce8fd6d6 |
| MD5 | 71e2084b81ce2f2d006074099c8a677c |
| BLAKE2b-256 | 9debc469cb2d3f3cebf97ea4429b5baf1aba5d8c0ada84f017cc83197b6c7684 |
