
A collection of scripts and utilities to support the stream-processing of MediaWiki data.

Project description

A set of utilities for stream-processing MediaWiki data.

Usage

mwstream (-h | --help)

mwstream <utility> [-h|--help]

Data processing utilities

diffs2persistence

Generates token persistence statistics using revision JSON blobs with diff information.

dump2json

Converts an XML dump to a stream of revision JSON blobs.

dump2diffs

Computes diffs directly from an XML dump.

json2diffs

Computes and adds a “diff” field to a stream of revision JSON blobs.

persistence2stats

Aggregates token persistence statistics into revision statistics.

wikihadoop2json

Converts a Wikihadoop-processed stream of XML pages to JSON blobs.
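
As an illustration of what converting an XML dump to a stream of revision JSON blobs involves, here is a minimal sketch using only the Python standard library. It is not the mwstreaming implementation: the function names (`revisions`, `dump_to_json_lines`) and the output field names (`page`, `id`, `timestamp`, `text`) are assumptions chosen for the example, and the real RevisionDocument schema may differ.

```python
import io
import json
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace from a tag, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def child_text(elem, name):
    """Return the text of the first direct child called `name`, or None."""
    for child in elem:
        if local(child.tag) == name:
            return child.text
    return None


def revisions(xml_file):
    """Yield one dict per <revision> while streaming through the dump."""
    title = None
    for _, elem in ET.iterparse(xml_file, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            # <title> closes before any <revision> of the same page.
            title = elem.text
        elif name == "revision":
            yield {
                "page": {"title": title},  # hypothetical field layout
                "id": int(child_text(elem, "id")),
                "timestamp": child_text(elem, "timestamp"),
                "text": child_text(elem, "text") or "",
            }
            elem.clear()  # free the revision's children once emitted
        elif name == "page":
            elem.clear()


def dump_to_json_lines(xml_file):
    """Serialize every revision as one JSON blob per line."""
    return [json.dumps(rev, sort_keys=True) for rev in revisions(xml_file)]


# A tiny hand-written dump used only for this example.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <id>1</id>
    <revision>
      <id>10</id>
      <timestamp>2015-01-01T00:00:00Z</timestamp>
      <contributor><username>Alice</username><id>7</id></contributor>
      <text>Hello</text>
    </revision>
    <revision>
      <id>11</id>
      <timestamp>2015-01-02T00:00:00Z</timestamp>
      <text>Hello world</text>
    </revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    for line in dump_to_json_lines(io.BytesIO(SAMPLE.encode("utf-8"))):
        print(line)
```

Emitting one blob per line is what lets the later utilities be chained as stream filters: each stage can read a line, transform it, and write it out without holding the whole dump in memory.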

General utilities

json2tsv

Converts a stream of JSON blobs to tab-separated values based on a set of field names.

normalize

Normalizes old versions of RevisionDocument JSON schemas to correspond to the most recent schema version.

validate

Validates JSON against a provided schema.

truncate_text

Truncates the ‘text’ field of JSON blobs to a limited length in Unicode characters (this addresses content dump vandalism issues) and adds a boolean ‘truncated’ field.
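
The json2tsv and truncate_text descriptions above can be pictured as simple per-blob transforms. The sketch below is illustrative only and is not taken from the mwstreaming source: the helper names `truncate_text` and `to_tsv_row`, the `max_len` parameter, the escaping rules, and the choice to render missing fields as empty cells are all assumptions.

```python
import json


def truncate_text(doc, max_len):
    """Return a copy of `doc` with 'text' cut to `max_len` characters.

    Python string slicing counts Unicode code points, so the limit is in
    characters rather than bytes. A boolean 'truncated' field records
    whether anything was cut.
    """
    text = doc.get("text") or ""
    out = dict(doc)
    out["text"] = text[:max_len]
    out["truncated"] = len(text) > max_len
    return out


def to_tsv_row(doc, fieldnames):
    """Render the chosen fields of one JSON blob as a tab-separated line."""
    cells = []
    for name in fieldnames:
        value = doc.get(name)
        if value is None:
            cell = ""  # assumption: missing fields become empty cells
        elif isinstance(value, str):
            cell = value
        else:
            cell = json.dumps(value)  # numbers, booleans, nested values
        # Escape characters that would otherwise break the row layout.
        cell = cell.replace("\\", "\\\\")
        cell = cell.replace("\t", "\\t").replace("\n", "\\n")
        cells.append(cell)
    return "\t".join(cells)


if __name__ == "__main__":
    lines = [
        '{"id": 10, "text": "Hello world"}',
        '{"id": 11, "text": "Hi"}',
    ]
    for line in lines:
        doc = truncate_text(json.loads(line), max_len=5)
        print(to_tsv_row(doc, ["id", "truncated", "text"]))
```

Escaping tabs and newlines inside cell values keeps each blob on exactly one output row, which matters when the 'text' field holds raw wiki markup.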

Installation

pip install mwstreaming


