A collection of scripts and utilities to support the stream-processing of MediaWiki data.
Project description
A set of utilities for stream-processing MediaWiki data.
Usage
mwstream (-h | --help)
mwstream <utility> [-h|--help]
Data processing utilities
- diffs2persistence
Generates token persistence statistics using revision JSON blobs with diff information.
- dump2json
Converts an XML dump to a stream of revision JSON blobs.
- dump2diffs
Computes diffs directly from an XML dump.
- json2diffs
Computes and adds a “diff” field to a stream of revision JSON blobs.
- persistence2stats
Aggregates token persistence statistics into revision statistics.
- wikihadoop2json
Converts a Wikihadoop-processed stream of XML pages to JSON blobs.
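To illustrate the idea behind json2diffs (adding a “diff” field to each blob in a stream), here is a minimal sketch using only the Python standard library. It is not the mwstreaming implementation: the `page_id` and `text` field names, the whitespace tokenization, and the shape of the “diff” value are all assumptions made for this example.

```python
import difflib
import json


def add_diffs(lines):
    """Yield revision dicts with an added "diff" field.

    Assumes one JSON object per line with hypothetical "page_id" and
    "text" fields, ordered so that revisions of a page are adjacent.
    """
    last_page_id = None
    last_tokens = []
    for line in lines:
        revision = json.loads(line)
        if revision["page_id"] != last_page_id:
            # New page: diff its first revision against an empty document.
            last_page_id = revision["page_id"]
            last_tokens = []
        # Naive whitespace tokenization (an assumption for this sketch).
        tokens = revision["text"].split()
        matcher = difflib.SequenceMatcher(
            None, last_tokens, tokens, autojunk=False
        )
        # Record each operation with its token ranges in old (a) and new (b).
        revision["diff"] = [
            {"op": tag, "a": [a1, a2], "b": [b1, b2]}
            for tag, a1, a2, b1, b2 in matcher.get_opcodes()
        ]
        last_tokens = tokens
        yield revision
```

Because the function is a generator over lines, it can be fed `sys.stdin` and its results written to standard output one blob per line, which mirrors the stream-processing style of the utilities above.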
General utilities
- json2tsv
Converts a stream of JSON blobs to tab-separated values based on a set of field names.
- normalize
Normalizes old versions of RevisionDocument JSON schemas to correspond to the most recent schema version.
- validate
Validates JSON against a provided schema.
- truncate_text
Truncates the ‘text’ field of JSON blobs to a limited length in Unicode characters and adds a boolean ‘truncated’ field (addresses content dump vandalism issues).
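The behavior described for json2tsv and truncate_text can be sketched in a few lines of standard-library Python. This is an illustration only, not the package's code: the function names `json_to_tsv` and `truncate_text`, their parameters, and the choice to JSON-encode each cell are assumptions.

```python
import json


def json_to_tsv(lines, fieldnames):
    """Yield one tab-separated row per JSON blob, in fieldnames order."""
    for line in lines:
        blob = json.loads(line)
        # JSON-encoding each value escapes embedded tabs and newlines,
        # so every blob stays on exactly one output row.
        yield "\t".join(json.dumps(blob.get(name)) for name in fieldnames)


def truncate_text(blob, max_chars):
    """Return a copy of blob with 'text' cut to max_chars characters.

    Python's len() and slicing on str count Unicode code points, so the
    limit is expressed in characters rather than bytes.
    """
    text = blob.get("text") or ""
    truncated = len(text) > max_chars
    result = dict(blob)
    if truncated:
        result["text"] = text[:max_chars]
    # Always record whether the text was shortened.
    result["truncated"] = truncated
    return result
```

In this sketch a missing field is emitted as `null`; the real utility may handle missing values differently.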
Installation
pip install mwstreaming
Download files
Source Distributions
File details
Details for the file mwstreaming-0.5.1.zip.
File metadata
- Download URL: mwstreaming-0.5.1.zip
- Upload date:
- Size: 19.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | c50039d9f2699ef717237e81523f7b9a5d89b68559c82a30313c18eae84fda54
MD5 | 81d352281aa4785a522a1b60afc7fe43
BLAKE2b-256 | 5125609f767dc7a90fa615dc9e375e881f51e104d28c0807d6ad72cccb3cfe3c
File details
Details for the file mwstreaming-0.5.1.tar.gz.
File metadata
- Download URL: mwstreaming-0.5.1.tar.gz
- Upload date:
- Size: 10.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | d03d83c7f8cf006f8e6ac81df62c10eaa79f4f0c89eccd1dd8a96c870c226252
MD5 | e98d4775e8a74a22257a294f8f513c0d
BLAKE2b-256 | 1e61c9d2900e3b4e291163d431070738f8e31227c2eb1ff6736d3b533a725191
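A downloaded archive can be checked against the SHA256 digests listed above. The following is a minimal sketch using Python's standard `hashlib` module; the helper name `sha256_of` and the local file path in the usage note are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, assuming the archive was saved as `mwstreaming-0.5.1.tar.gz` in the current directory, `sha256_of("mwstreaming-0.5.1.tar.gz")` should equal the SHA256 value shown for that file if the download is intact.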