A set of utilities for processing MediaWiki XML dump data.
# MediaWiki XML
This library contains a collection of utilities for efficiently processing MediaWiki’s XML database dumps. It is meant to address two concerns: the complexity and the performance of streaming XML parsing. It enables memory-efficient stream processing of XML dumps with a simple [iterator](https://pythonhosted.org/mwxml/iteration.html) strategy. It also implements a distributed processing strategy (see [map()](https://pythonhosted.org/mwxml/map.html)) that processes many XML dump files in parallel.
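To make the streaming idea concrete, here is a minimal sketch that uses only the standard library's `xml.etree.ElementTree.iterparse`. It is not how mwxml is implemented and it is no substitute for the library. The `SAMPLE` document and the `iter_revision_ids` helper are hypothetical, and the sample omits the XML namespace that real dumps carry.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written stand-in for a dump (hypothetical data). Real dumps are
# namespaced and far larger than anything you would keep in a string.
SAMPLE = """<mediawiki>
  <page>
    <title>Foo</title>
    <revision><id>1</id></revision>
    <revision><id>2</id></revision>
  </page>
  <page>
    <title>Bar</title>
    <revision><id>3</id></revision>
  </page>
</mediawiki>"""


def iter_revision_ids(fileobj):
    """Yield (page title, revision id) pairs as each page finishes parsing."""
    for event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == "page":
            title = elem.findtext("title")
            for rev in elem.iter("revision"):
                yield title, int(rev.findtext("id"))
            # Drop the finished page's children so its content can be freed.
            elem.clear()


pairs = list(iter_revision_ids(io.BytesIO(SAMPLE.encode("utf-8"))))
print(pairs)  # [('Foo', 1), ('Foo', 2), ('Bar', 3)]
```

Even this small version has to deal with parse events, tag matching and manual cleanup, which is the kind of complexity the library's iterator interface is meant to hide.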
* Installation: `pip install mwxml`
* Documentation: https://pythonhosted.org/mwxml
* Repository: https://github.com/mediawiki-utilities/python-mwxml
* License: MIT
## Example
    >>> import mwxml
    >>>
    >>> dump = mwxml.Dump.from_file(open("dump.xml"))
    >>> print(dump.site_info.name, dump.site_info.dbname)
    Wikipedia enwiki
    >>>
    >>> for page in dump:
    ...     for revision in page:
    ...         print(revision.id)
    ...
    1
    2
    3
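The `map()` strategy mentioned in the introduction could be used roughly as follows. This is a sketch based on my reading of the linked documentation, which describes `mwxml.map(process, paths, threads=...)` as calling `process(dump, path)` for each file and yielding whatever that function yields. Treat the exact signature as an assumption and check it against the documentation. The names `page_info`, `run`, `FakePage` and the file names are hypothetical.

```python
from collections import namedtuple


def page_info(dump, path):
    """Per-file worker: takes a parsed dump and its path, yields one row per page."""
    for page in dump:
        yield page.id, page.namespace, page.title


def run(paths, threads=2):
    """Fan page_info out over several dump files and collect the yielded rows."""
    # Imported lazily so page_info can be tried without mwxml installed.
    import mwxml

    # Assumed signature: mwxml.map(process, paths, threads=None)
    return list(mwxml.map(page_info, paths, threads=threads))


# Dry run of the worker with stand-in objects instead of a real dump.
FakePage = namedtuple("FakePage", ["id", "namespace", "title"])
fake_dump = [FakePage(1, 0, "Foo"), FakePage(2, 1, "Talk:Foo")]
rows = list(page_info(fake_dump, "fake.xml"))
print(rows)  # [(1, 0, 'Foo'), (2, 1, 'Talk:Foo')]
```

With mwxml installed and real dump files on disk, a call such as `run(["dump1.xml", "dump2.xml"])` would return the rows from both files. The order in which rows from different files arrive should not be relied on.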
## Author
* Aaron Halfaker – https://github.com/halfak

## See also
* http://dumps.wikimedia.org/
* http://community.wikia.com/wiki/Help:Database_download