A collection of scripts and utilities to support the stream-processing of MediaWiki data.

Project description

A set of utilities for stream-processing MediaWiki data.

Usage

mwstream (-h | --help)

mwstream <utility> [-h|--help]

Data processing utilities

diffs2persistence

Generates token persistence statistics using revision JSON blobs with diff information.

dump2json

Converts an XML dump to a stream of revision JSON blobs.

dump2diffs

Computes diffs directly from an XML dump.

json2diffs

Computes a “diff” field and adds it to each blob in a stream of revision JSON blobs.

mend_diffs

Mends diffs that were computed in chunks and out of order.

persistence2stats

Aggregates token persistence statistics into revision statistics.

wikihadoop2json

Converts a Wikihadoop-processed stream of XML pages to JSON blobs
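
To make the idea behind json2diffs more concrete, here is a minimal sketch of adding a “diff” field to a stream of revision blobs. It is not the package's implementation: it uses Python's standard difflib in place of whatever diff engine mwstreaming uses, and the field names page_id and text, the tokenizer, and the layout of the “diff” value are all assumptions made for illustration.

```python
import difflib
import re


def tokenize(text):
    """Split text into runs of whitespace and non-whitespace (assumed tokenizer)."""
    return re.findall(r"\s+|\S+", text)


def add_diffs(revisions):
    """Yield each revision with a 'diff' against the previous revision of its page.

    'page_id' and 'text' are hypothetical field names; the real schema may differ.
    """
    last_tokens = {}  # page_id -> tokens of the most recent revision seen
    for rev in revisions:
        tokens = tokenize(rev.get("text") or "")
        previous = last_tokens.get(rev["page_id"], [])
        matcher = difflib.SequenceMatcher(None, previous, tokens, autojunk=False)
        ops = [
            {"name": tag, "a1": a1, "a2": a2, "b1": b1, "b2": b2}
            for tag, a1, a2, b1, b2 in matcher.get_opcodes()
        ]
        last_tokens[rev["page_id"]] = tokens
        # Copy the blob so the caller's input is left untouched.
        yield dict(rev, diff={"ops": ops})


if __name__ == "__main__":
    stream = [
        {"page_id": 1, "text": "a b"},
        {"page_id": 1, "text": "a c"},
    ]
    for rev in add_diffs(stream):
        print(rev["diff"]["ops"])
```

The sketch assumes revisions of a page arrive in order. When they do not, or when a page is split across chunks, the first diff of each chunk is computed against the wrong base, which is the kind of situation mend_diffs is described as handling.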

General utilities

json2tsv

Converts a stream of JSON blobs to tab-separated values based on a set of field names.
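
As an illustration of the kind of conversion json2tsv performs, the following sketch turns one-JSON-blob-per-line input into tab-separated rows for a chosen list of field names. It is a stand-alone example, not the utility's code: the function names, the escaping rules, and the choice to emit an empty cell for missing fields are assumptions.

```python
import io
import json


def encode_value(value):
    """Render one field as a TSV cell; a missing value becomes an empty cell."""
    if value is None:
        return ""
    # Strings are used as-is; numbers, booleans and nested values are JSON-encoded.
    text = value if isinstance(value, str) else json.dumps(value)
    # Escape characters that would otherwise break the row/column layout.
    return text.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")


def json_lines_to_tsv(lines, fieldnames, out):
    """Write one TSV row per JSON blob, keeping only the requested fields."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the stream
        blob = json.loads(line)
        cells = (encode_value(blob.get(name)) for name in fieldnames)
        out.write("\t".join(cells) + "\n")


if __name__ == "__main__":
    source = io.StringIO('{"id": 10, "page": "Foo"}\n{"id": 11}\n')
    sink = io.StringIO()
    json_lines_to_tsv(source, ["id", "page"], sink)
    print(sink.getvalue(), end="")
```

Processing line by line keeps memory use constant regardless of input size, which matches the stream-processing approach the package describes.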

normalize

Normalizes old versions of RevisionDocument JSON schemas to correspond to the most recent schema version.

validate

Validates JSON against a provided schema.

truncate_text

Truncates the ‘text’ field of JSON blobs to a limited length in Unicode characters and adds a boolean ‘truncated’ field. This addresses content dump vandalism issues.
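
The behaviour described for truncate_text can be sketched as below. This is an illustrative example under stated assumptions, not the utility's source: the function name and the max_chars parameter are hypothetical, and the sketch assumes the ‘truncated’ flag is written on every blob, whether or not the text was cut.

```python
def truncate_text(blob, max_chars):
    """Return a copy of blob with 'text' cut to max_chars and a 'truncated' flag."""
    result = dict(blob)  # shallow copy so the input blob is not modified
    text = result.get("text")
    if isinstance(text, str) and len(text) > max_chars:
        # Python str slicing counts Unicode code points, not bytes.
        result["text"] = text[:max_chars]
        result["truncated"] = True
    else:
        result["truncated"] = False
    return result


if __name__ == "__main__":
    print(truncate_text({"id": 1, "text": "héllo wörld"}, 5))
    print(truncate_text({"id": 2, "text": "hi"}, 5))
```

Measuring the limit in characters rather than bytes means a multi-byte character is never split in the middle.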

Installation

pip install mwstreaming


Download files

Source Distributions

mwstreaming-0.5.3.zip (23.2 kB)

Uploaded Source

mwstreaming-0.5.3.tar.gz (12.5 kB)

Uploaded Source

File details

Details for the file mwstreaming-0.5.3.zip.

File metadata

  • Download URL: mwstreaming-0.5.3.zip
  • Size: 23.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for mwstreaming-0.5.3.zip
Algorithm Hash digest
SHA256 15f973c4b5a47d185723d4e8534b8f23ff8c795c0d5d78cc7c3c8b02233e5394
MD5 d55ecd713a23523380a8f332ec001807
BLAKE2b-256 1e6cb5d929ad1dba8fd2cfdaa8c38963c55e25e61754820eb39ca4e5cd0e0dab

File details

Details for the file mwstreaming-0.5.3.tar.gz.

File metadata

  • Download URL: mwstreaming-0.5.3.tar.gz
  • Size: 12.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for mwstreaming-0.5.3.tar.gz
Algorithm Hash digest
SHA256 f20f5de8d99b160868887f612bfbfd3a434cfb8c536cff81e77fc4e2a39fe58d
MD5 4c7e01d1d9f598f221c3d6c5a1f74c55
BLAKE2b-256 fdfbe745f7bc3ee378e6b4c42e1d5aac013c141618d7c42e482d5f4398381ef4
