Edit diffs and type detection for Wikipedia (simple)

mwsimpleedittypes

Edit diffs and type detection for Wikipedia (simple). The goal is to transform unstructured edits to Wikipedia articles into a structured summary of the actions taken in each edit. This is a simplified version of mwedittypes, the structure-aware counterpart of this library, which can also detect content moves and identify edit types more directly.

Installation

You can install mwsimpleedittypes with pip:

$ pip install mwsimpleedittypes

Example

If one revision of wikitext is as follows:

{{Short description|Austrian painter}}
'''Karl Josef Aigen''' (8 October 1684 – 22 October 1762) was a landscape painter, born at Olomouc.

and a second revision of wikitext is as follows:

{{Short description|Austrian landscape painter}}
'''Karl Josef Aigen''' (8 October 1684 – 22 October 1762) was a landscape painter, born at [[Olomouc]].

The changes that happened would be:

  • The addition of landscape to the short description template -- this would be registered as a Template change.
  • The conversion of Olomouc into a wikilink -- this would be registered as a Wikilink insert.
  • Notably, despite this change to the template and addition of a link, the "Text" of the article has not changed.

The library would return these changes in the following structure: {'Template': {'change': 1}, 'Wikilink': {'insert': 1}}.

Basic Usage

>>> from mwsimpleedittypes import EditTypes
>>> prev_wikitext = '{{Short description|Austrian painter}}'
>>> curr_wikitext = '{{Short description|Austrian [[landscape painter]]}}'
>>> et = EditTypes(prev_wikitext, curr_wikitext, lang='en', timeout=5)
>>> et.get_diff()
{'Wikilink': {'insert': 1}, 'Template': {'change': 1}}
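
The result is a plain nested dictionary of node type to action to count, so it can be post-processed with ordinary Python. As a minimal sketch (the helper name `summarize_actions` is hypothetical and not part of the library), the following totals each action across node types, using the output shown above as input:

```python
from collections import Counter


def summarize_actions(diff):
    """Total up each action (insert, change, ...) across all node types.

    `diff` is assumed to have the shape shown above:
    {node_type: {action: count}}.
    """
    totals = Counter()
    for actions in diff.values():
        for action, count in actions.items():
            totals[action] += count
    return dict(totals)


result = {'Wikilink': {'insert': 1}, 'Template': {'change': 1}}
print(summarize_actions(result))  # {'insert': 1, 'change': 1}
```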

Development

We are happy to receive contributions, though we will default to keeping the code here relatively general (not overly customized to individual use-cases). Please reach out or open an issue describing the changes you would like to merge so that we can discuss them beforehand.

Code Summary

The code for computing diffs and running edit-type detection can be found in a single file, mwsimpleedittypes/differ.py.

The bulk of the library parses a wikitext document into a bag of nodes (Templates, Wikilinks, etc.). This is almost all done via the amazing mwparserfromhell library with a few tweaks:

  • We use link namespace prefixes -- e.g., Category:, Image: -- to separate out categories and media from other wikilinks.
  • We identify some additional media files that are transcluded via templates -- e.g., infoboxes -- or gallery tags.
  • We also add some custom logic for parsing <gallery> tags to identify nested links, etc., which otherwise are treated as text by mwparserfromhell.
  • We use custom logic for converting wikitext into text to best match what words show up in the text of the article.
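
The namespace-prefix tweak in the first bullet can be illustrated with a rough sketch. This is not the library's code: it uses a simple regex in place of mwparserfromhell so that it runs with the standard library alone, and the prefix sets below are assumed English-only examples (the real lists would vary by language).

```python
import re

# Assumed, English-only prefixes for illustration; not taken from the library.
CATEGORY_PREFIXES = {'category'}
MEDIA_PREFIXES = {'image', 'file', 'media'}

# Simplified stand-in for a real wikitext parser: captures the link target
# of [[Target]] or [[Target|label...]] and ignores nested links.
WIKILINK_RE = re.compile(r'\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]')


def classify_link(title):
    """Return 'Category', 'Media', or 'Wikilink' based on the namespace prefix."""
    prefix, sep, _ = title.partition(':')
    if sep:
        prefix = prefix.strip().lower()
        if prefix in CATEGORY_PREFIXES:
            return 'Category'
        if prefix in MEDIA_PREFIXES:
            return 'Media'
    return 'Wikilink'


def link_types(wikitext):
    """Classify every link found in a wikitext string, in document order."""
    return [classify_link(m.group(1)) for m in WIKILINK_RE.finditer(wikitext)]


sample = '[[Olomouc]] [[File:Aigen.jpg|thumb|Portrait]] [[Category:Austrian painters]]'
print(link_types(sample))  # ['Wikilink', 'Media', 'Category']
```

A link written with a leading colon, such as [[:Category:Austrian painters]], has an empty prefix in this sketch and so falls through to 'Wikilink', which matches how such links render as ordinary links rather than category assignments.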

The diffing component simply takes the symmetric difference of the nodes associated with each wikitext document to identify what has changed.
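
As a minimal sketch of that idea, assuming each node is represented as a hypothetical (type, wikitext) tuple, a bag can be modelled with `collections.Counter`, and the two halves of the symmetric difference give the removed and inserted nodes. The function name `diff_nodes` is illustrative only.

```python
from collections import Counter


def diff_nodes(prev_nodes, curr_nodes):
    """Symmetric difference of two bags of nodes, split by direction.

    Nodes present only in the previous revision are reported as removed;
    nodes present only in the current revision are reported as inserted.
    """
    prev_bag, curr_bag = Counter(prev_nodes), Counter(curr_nodes)
    removed = prev_bag - curr_bag   # Counter subtraction keeps positive counts only
    inserted = curr_bag - prev_bag
    return {
        'remove': sorted(removed.elements()),
        'insert': sorted(inserted.elements()),
    }


prev = [('Template', '{{Short description|Austrian painter}}')]
curr = [('Template', '{{Short description|Austrian landscape painter}}'),
        ('Wikilink', '[[Olomouc]]')]
print(diff_nodes(prev, curr))
```

Here the edited template appears once under 'remove' and once under 'insert'. Pairing such nodes so that they are reported as a single change, as in the example output earlier in this README, is a further step that this sketch does not attempt.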

To describe the scale of textual changes accurately but efficiently, the node differ also uses some regexes and heuristics to estimate how much text was changed in an edit. This is generally the toughest part of diffing text, but because we do not need to describe the diff visually, only estimate how much changed, we can use relatively simple methods. We break text changes down into five categories and identify how much of each changed: paragraphs, sentences, words, punctuation, and whitespace.
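
The following is a rough sketch of how such an estimate could work, not the library's actual heuristics: the regexes, the function names `text_bags` and `text_change_scale`, and the output shape are all assumptions made for illustration. Each revision's text is split into a bag of items per category, and the bags are compared to count how many items were removed or inserted.

```python
import re
from collections import Counter


def text_bags(text):
    """Split text into a bag (Counter) of items for each of the five categories."""
    return {
        'paragraphs': Counter(p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()),
        'sentences': Counter(s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()),
        'words': Counter(re.findall(r'\w+', text)),
        'punctuation': Counter(re.findall(r'[^\w\s]', text)),
        'whitespace': Counter(re.findall(r'\s+', text)),
    }


def text_change_scale(prev_text, curr_text):
    """Count removed and inserted items per category between two texts."""
    prev, curr = text_bags(prev_text), text_bags(curr_text)
    scale = {}
    for unit in prev:
        removed = sum((prev[unit] - curr[unit]).values())
        inserted = sum((curr[unit] - prev[unit]).values())
        if removed or inserted:
            scale[unit] = {'remove': removed, 'insert': inserted}
    return scale


prev_text = 'He was a painter.'
curr_text = 'He was a landscape painter, born at Olomouc.'
print(text_change_scale(prev_text, curr_text))
```

Because only the counts matter, no alignment between the two texts is needed, which is what keeps this kind of estimate cheap compared with a visual diff.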

Testing

The tests for node/tree differs are contained within the tests directory and can be run via pytest. We are not even close to full coverage yet, given the numerous node types (template, text, etc.), the four actions (insert/remove/change/move), and the varying languages for nodes such as Text or Category/Media, but we are working on expanding coverage.

Download files

Source Distribution: mwsimpleedittypes-1.2.1.tar.gz (21.2 kB)

Built Distribution: mwsimpleedittypes-1.2.1-py3-none-any.whl (20.4 kB, Python 3)

