mwsql is a set of utilities for processing MediaWiki SQL dump data

Project description

Overview

mwsql provides utilities for working with Wikimedia SQL dump files. It supports Python 3.9 and later versions.

mwsql abstracts the messiness of working with SQL dump files. Each Wikimedia SQL dump file contains one database table. The most common use case for mwsql is to convert this table into a more user-friendly Python Dump class instance. This lets you access the table’s metadata (database names, field names, data types, etc.) as attributes, and its content, the table rows, as a generator. Because a generator produces rows lazily, one at a time, you can process datasets that are larger than memory.
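To illustrate why a generator helps with larger-than-memory data, here is a minimal sketch that sums one column while holding a single row in memory at a time. It does not use mwsql itself: `fake_rows` is a hypothetical stand-in for a row generator such as the one returned by `dump.rows()`, and `sum_column` is an illustrative helper, not part of the library.

```python
from typing import Iterable, Iterator, List


def fake_rows() -> Iterator[List[str]]:
    """Hypothetical stand-in for a row generator; yields one row at a time."""
    yield ["1", "mw-replace", "0", "10453"]
    yield ["2", "visualeditor", "0", "309141"]
    yield ["3", "mw-undo", "0", "59767"]


def sum_column(rows: Iterable[List[str]], index: int) -> int:
    """Add up one integer column, consuming the rows lazily."""
    total = 0
    for row in rows:
        # Only the current row is held in memory here.
        total += int(row[index])
    return total


total_count = sum_column(fake_rows(), 3)
print(total_count)
```

The same loop would work unchanged on any iterable of rows, which is the point of exposing the table content as a generator.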

mwsql also provides a method to convert SQL dump files into CSV. You can find more information on how to use mwsql in the usage examples.
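The description above does not show mwsql's own CSV conversion method, so the following is only a sketch of the general idea using the standard `csv` module: write a header, then stream rows out one by one. The function name `write_rows_to_csv` and the sample rows are assumptions made for illustration, not mwsql's API.

```python
import csv
import io
from typing import Iterable, List, Sequence, TextIO


def write_rows_to_csv(
    col_names: Sequence[str], rows: Iterable[List[str]], out: TextIO
) -> int:
    """Write a header and then each row to `out`; return the number of rows."""
    writer = csv.writer(out)
    writer.writerow(col_names)
    written = 0
    for row in rows:
        # Rows are written as they arrive, so the input can be a generator.
        writer.writerow(row)
        written += 1
    return written


# An in-memory buffer keeps the example self-contained; a real file
# opened with open(path, "w", newline="") would work the same way.
buffer = io.StringIO()
written = write_rows_to_csv(
    ["ctd_id", "ctd_name", "ctd_user_defined", "ctd_count"],
    iter([["4", "mw-rollback", "0", "71585"], ["5", "mobile edit", "0", "234682"]]),
    buffer,
)
csv_lines = buffer.getvalue().splitlines()
```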

Installation

You can install mwsql with pip:

$ pip install mwsql

Basic Usage

>>> from mwsql import Dump
>>> dump = Dump.from_file('simplewiki-latest-change_tag_def.sql.gz')
>>> dump.head(5)
['ctd_id', 'ctd_name', 'ctd_user_defined', 'ctd_count']
['1', 'mw-replace', '0', '10453']
['2', 'visualeditor', '0', '309141']
['3', 'mw-undo', '0', '59767']
['4', 'mw-rollback', '0', '71585']
['5', 'mobile edit', '0', '234682']
>>> dump.dtypes
{'ctd_id': int, 'ctd_name': str, 'ctd_user_defined': int, 'ctd_count': int}
>>> rows = dump.rows(convert_dtypes=True)
>>> next(rows)
[1, 'mw-replace', 0, 10453]
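
Conceptually, converting data types means applying each column's type to the matching string value. The sketch below shows that idea with a mapping shaped like `dump.dtypes`. It is not mwsql's implementation: `convert_row` is a hypothetical helper, and it ignores special cases such as NULL values.

```python
from typing import Any, Callable, Dict, List, Sequence


def convert_row(
    row: Sequence[str],
    col_names: Sequence[str],
    dtypes: Dict[str, Callable[[str], Any]],
) -> List[Any]:
    """Apply each column's type to the corresponding string value."""
    return [dtypes[name](value) for name, value in zip(col_names, row)]


# Same shape as the dump.dtypes output shown above.
dtypes = {"ctd_id": int, "ctd_name": str, "ctd_user_defined": int, "ctd_count": int}
col_names = list(dtypes)

converted = convert_row(["1", "mw-replace", "0", "10453"], col_names, dtypes)
print(converted)
```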

Known Issues

Encoding errors

Wikimedia SQL dumps use UTF-8 encoding. Unfortunately, some fields contain characters that cannot be decoded as UTF-8, which raises an encoding error when the dump file is parsed. If this happens while reading in the file, try again with a different encoding. latin-1 sometimes solves the problem; if it does not, try other encodings. If the error occurs while iterating over the rows, you don’t need to recreate the dump: just set a new encoding via the dump.encoding attribute.
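The "try another encoding" advice can be sketched at the byte level with the standard library alone. The helper `decode_with_fallback` below is hypothetical and is not part of mwsql; it only shows the fallback pattern. Note that latin-1 maps every possible byte to a character, so it never raises an error, which also means a successful decode does not guarantee that the text is correct.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    data: bytes, encodings: Sequence[str] = ("utf-8", "latin-1")
) -> Tuple[str, str]:
    """Try each encoding in order; return the text and the encoding that worked."""
    last_error = None
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError as err:
            # Remember the failure and move on to the next candidate.
            last_error = err
    raise ValueError(f"none of {list(encodings)} could decode the data") from last_error


# b"caf\xe9" is not valid UTF-8, so the helper falls back to latin-1.
text, used = decode_with_fallback(b"caf\xe9")
print(text, used)
```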

Parsing errors

Some Wikimedia SQL dumps contain string-type fields that are sometimes not correctly parsed, resulting in fields being split up into several parts. This is more likely to happen when parsing dumps containing file names from Wikimedia Commons or containing external links with many query parameters. If you’re parsing any of the other dumps, you’re unlikely to run into this issue.

In most cases, this issue affects a very small proportion of the total rows parsed. For instance, the Wikimedia Commons page dump contains approximately 99 million entries, of which ~13,000 (about 0.013%) are incorrectly parsed. The Wikimedia Commons page links dump, on the other hand, contains ~760 million records, of which only 20 are wrongly parsed.

This issue is most commonly caused by the parser mistaking a single quote (or apostrophe, as they’re identical) within a string for the single quote that marks the end of said string. There’s currently no known workaround other than manually removing the rows that contain more fields than expected, or if they are relatively few, manually merging the split fields.
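A minimal sketch of the manual workaround, assuming you know how many fields each row should have: pass through rows with the expected field count and set the rest aside for inspection or manual merging. The helper `well_formed_rows` and the sample rows are hypothetical and not part of mwsql.

```python
from typing import Iterable, Iterator, List


def well_formed_rows(
    rows: Iterable[List[str]], expected_fields: int, rejected: List[List[str]]
) -> Iterator[List[str]]:
    """Yield rows with the expected field count; collect the others in `rejected`."""
    for row in rows:
        if len(row) == expected_fields:
            yield row
        else:
            rejected.append(row)


# Made-up rows: the second one has been split into five fields instead of four.
sample = [
    ["1", "mw-replace", "0", "10453"],
    ["2", "O", "Reilly", "0", "17"],
    ["3", "mw-undo", "0", "59767"],
]

rejected: List[List[str]] = []
kept = list(well_formed_rows(sample, expected_fields=4, rejected=rejected))
print(len(kept), len(rejected))
```

Because the helper is itself a generator, it can wrap a lazy row source without loading the whole table; only the rejected rows accumulate in memory.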

Future versions of mwsql will improve the parser to correctly identify when single quotes should be treated as string delimiters and when they should be escaped. For now, it’s essential to be aware that this problem exists.

Project information

mwsql is released under the GPLv3. You can find the complete documentation at Read the Docs. If you run into bugs, you can file them in our issue tracker. Have ideas on how to make mwsql better? Contributions are most welcome – we have put together a guide on how to get started.

Download files

Source Distribution

mwsql-1.0.0.tar.gz (22.4 kB)

Uploaded Source

Built Distribution

mwsql-1.0.0-py3-none-any.whl (22.4 kB)

Uploaded Python 3

File details

Details for the file mwsql-1.0.0.tar.gz.

File metadata

  • Download URL: mwsql-1.0.0.tar.gz
  • Upload date:
  • Size: 22.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.9.17 Linux/6.2.0-1019-azure

File hashes

Hashes for mwsql-1.0.0.tar.gz
Algorithm Hash digest
SHA256 a3919f2f2c5fd64ae25979d5eec3422beea98b283acc77cdb43c12b5753d61c6
MD5 cc43b87e47df92be7edf8deb873a0edc
BLAKE2b-256 9d373aa990633cfbfb61ff037dc6c67852f9af06fb7ffb2653d602638fb3bdbc

See more details on using hashes here.

File details

Details for the file mwsql-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: mwsql-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 22.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.9.17 Linux/6.2.0-1019-azure

File hashes

Hashes for mwsql-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 4999358dcd0cb20e2bd6ddb2163e4d501de85e89af13a1f52924f08bf926911f
MD5 744e013b9d98cea6578053a1a63f9518
BLAKE2b-256 3a7ce45b53aa58db3b76c3419e1fd2e437c88b4e46e36a30cde6f37ead134d25
