A set of utilities for processing MediaWiki XML dump data.

# MediaWiki XML

This library contains a collection of utilities for efficiently processing MediaWiki’s XML database dumps. It addresses two concerns in streaming XML parsing: complexity and performance. The library enables memory-efficient stream processing of XML dumps with a simple [iterator](https://pythonhosted.org/mwxml/iteration.html) strategy. It also implements a distributed processing strategy (see [map()](https://pythonhosted.org/mwxml/map.html)) that processes many XML dump files in parallel.
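To illustrate what memory-efficient stream processing means in general, here is a minimal sketch that uses only the Python standard library (`xml.etree.ElementTree.iterparse`), not mwxml itself. It is not how mwxml is implemented. The `SAMPLE` document and the `iter_revision_ids` helper are made up for illustration, and the sample omits the XML namespace that real dumps carry.

```python
import io
import xml.etree.ElementTree as ET

# A tiny, made-up dump fragment (assumption: no XML namespace, unlike real dumps).
SAMPLE = b"""<mediawiki>
  <page>
    <title>Foo</title>
    <revision><id>1</id></revision>
    <revision><id>2</id></revision>
  </page>
  <page>
    <title>Bar</title>
    <revision><id>3</id></revision>
  </page>
</mediawiki>"""


def iter_revision_ids(stream):
    """Yield (page title, revision id) pairs, handling one page at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            title = elem.findtext("title")
            for rev in elem.iter("revision"):
                yield title, int(rev.findtext("id"))
            # Drop the finished page's children so they can be garbage collected.
            elem.clear()


pairs = list(iter_revision_ids(io.BytesIO(SAMPLE)))
print(pairs)
```

The point of the pattern is that each page is processed and discarded as soon as its closing tag is read, so the whole dump never has to be held in memory as a single tree.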

## Example

>>> import mwxml
>>>
>>> dump = mwxml.Dump.from_file(open("dump.xml"))
>>> print(dump.site_info.name, dump.site_info.dbname)
Wikipedia enwiki
>>>
>>> for page in dump:
...     for revision in page:
...         print(revision.id)
...
1
2
3
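
The idea of processing many dump files at the same time, mentioned in the introduction, can be sketched with the standard library alone. This is a hedged illustration of the general pattern, not mwxml's `map()` (see the linked documentation for its actual interface). The `count_revisions` and `write_sample` helpers and the file names are hypothetical, and threads stand in for whatever worker mechanism the library uses.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def count_revisions(path):
    """Stream one dump file and return (path, number of <revision> elements)."""
    count = 0
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "revision":
            count += 1
        elif elem.tag == "page":
            # Discard the finished page's children before reading the next page.
            elem.clear()
    return path, count


def write_sample(directory, name, n_revisions):
    """Write a tiny made-up dump file with one page and n_revisions revisions."""
    revisions = "".join(
        "<revision><id>%d</id></revision>" % i for i in range(1, n_revisions + 1)
    )
    path = os.path.join(directory, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write("<mediawiki><page><title>T</title>%s</page></mediawiki>" % revisions)
    return path


with tempfile.TemporaryDirectory() as tmp:
    paths = [write_sample(tmp, "a.xml", 2), write_sample(tmp, "b.xml", 3)]
    # Each file is parsed by its own worker; results come back in input order.
    with ThreadPoolExecutor(max_workers=2) as pool:
        counts = {os.path.basename(p): n for p, n in pool.map(count_revisions, paths)}

print(counts)
```

Each worker streams its own file independently, so the per-file function stays as simple as the single-file case and the results are merged only at the end.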

## Author

* Aaron Halfaker – https://github.com/halfak

## See also

* http://dumps.wikimedia.org/
* http://community.wikia.com/wiki/Help:Database_download

Download files

Source Distribution

mwxml-0.3.0.tar.gz (16.0 kB view details)

Uploaded Source

Built Distribution

mwxml-0.3.0-py2.py3-none-any.whl (32.3 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file mwxml-0.3.0.tar.gz.

File metadata

  • Download URL: mwxml-0.3.0.tar.gz
  • Upload date:
  • Size: 16.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for mwxml-0.3.0.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | bf1978536784ae3f5cc373eefff918c3f60190ddca2f672e962774eb4076bfac |
| MD5 | a1c54983707fe0173c70cbfd3ea9ba07 |
| BLAKE2b-256 | e7a0ca47e8e2c80563efeb9e1792a12f4c319cae72dd0e9e7e6b20a3f736f362 |

File details

Details for the file mwxml-0.3.0-py2.py3-none-any.whl.

File hashes

Hashes for mwxml-0.3.0-py2.py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 8c2f0027e4d30e3a415b0b8eff8e88ecb8d828738a8820a7420214ab61f545f5 |
| MD5 | 4c2ae1e75cb4b0683be75e700bb9e35a |
| BLAKE2b-256 | f42d067ca56ed7750a26649d7930b847d81e7fdbc111ece01d74cf1e28e1067d |
