
A collection of scripts and utilities for extracting citations to academic literature from Wikipedia's XML database dumps.

Project description

This project contains a utility for extracting academic citation identifiers.

NOTE: mwcites requires Python 3 because one of its dependencies (Mediawiki-Utilities) requires it.

pip install mwcites

Usage

This package provides a single utility, mwcitations.

$ mwcitations extract enwiki-20150112-pages-meta-history*.xml*.bz2 > citations.tsv
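Once citations.tsv has been written, it can be loaded with Python's standard csv module. The following is a minimal sketch, not part of mwcites: it assumes the columns appear in the order listed in the help output further down, and that the file may or may not begin with a header row. The name read_citations is hypothetical, and the sample row is assembled from the example values in the help text.

```python
import csv
import io

# Column order as listed in the help output; assumed to match the file.
FIELDS = ["page_id", "page_title", "rev_id", "timestamp", "type", "id"]


def read_citations(lines):
    """Yield one dict per citation row from an iterable of TSV lines."""
    # QUOTE_NONE: page titles may contain quote characters, so treat them literally.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row == FIELDS:
            # Skip blank lines and a header row, if one is present (assumption).
            continue
        record = dict(zip(FIELDS, row))
        record["page_id"] = int(record["page_id"])
        record["rev_id"] = int(record["rev_id"])
        yield record


# Sample row built from the example values in the help text.
sample = io.StringIO(
    "1325125\tClub cell\t282470030\t2009-04-08T01:52:20Z\tdoi\t"
    "10.1183/09031936.00213411\n"
)
for citation in read_citations(sample):
    print(citation["page_title"], citation["type"], citation["id"])
```

To read a real file, pass an open file handle instead of the sample, for example open("citations.tsv", encoding="utf-8", newline="").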

Documentation

Documentation is provided by running $ mwcitations extract -h:

Extracts academic citations from articles from the history of Wikipedia
articles by processing a pages-meta-history XML dump and matching regular
expressions to revision content.

Currently supported identifiers include:

 * PubMed
 * DOI
 * ISBN

Outputs a TSV file with the following fields:

 * page_id: The identifier of the Wikipedia article (int), e.g. 1325125
 * page_title: The title of the Wikipedia article (utf-8), e.g. Club cell
 * rev_id: The Wikipedia revision where the citation was first added (int),
           e.g. 282470030
 * timestamp: The timestamp of the revision where the citation was first added.
              (ISO 8601 datetime), e.g. 2009-04-08T01:52:20Z
 * type: The type of identifier, e.g. pmid
 * id: The id of the cited scholarly article (utf-8),
       e.g. 10.1183/09031936.00213411

Usage:
    mwcites extract -h | --help
    mwcites extract <dump_file>...

Options:
    -h --help        Shows this documentation
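
The help text above describes matching regular expressions against revision content and reporting the revision where each citation was first added. The following is a rough sketch of that idea, not mwcites' actual implementation. The patterns are simplified illustrations (ISBN is omitted for brevity), first_citations is a hypothetical name, and the revisions are invented stand-ins for a parsed dump.

```python
import re

# Illustrative patterns only; the expressions mwcites really uses may differ.
PATTERNS = {
    "doi": re.compile(r"\b(10\.\d{4,9}/[^\s|\]}<>]+)"),
    "pmid": re.compile(r"\bpmid\s*=\s*(\d+)", re.IGNORECASE),
}


def first_citations(revisions):
    """Map (type, id) to the (rev_id, timestamp) where it first appears.

    `revisions` is an iterable of (rev_id, timestamp, text) tuples in
    chronological order, a hypothetical stand-in for one page's history.
    """
    first_seen = {}
    for rev_id, timestamp, text in revisions:
        for id_type, pattern in PATTERNS.items():
            for match in pattern.finditer(text):
                key = (id_type, match.group(1))
                # setdefault keeps the earliest revision for each identifier.
                first_seen.setdefault(key, (rev_id, timestamp))
    return first_seen


# Invented revisions for illustration; not taken from a real dump.
revisions = [
    (1, "2009-04-08T00:00:00Z", "Intro text with no references."),
    (2, "2009-04-09T00:00:00Z", "{{cite journal |doi=10.1000/example |pmid=42}}"),
    (3, "2009-04-10T00:00:00Z", "{{cite journal |doi=10.1000/example |pmid=42}} more"),
]
for (id_type, identifier), (rev_id, timestamp) in first_citations(revisions).items():
    print(id_type, identifier, rev_id, timestamp)
```

Because the revisions are scanned in order and only the first sighting is stored, an identifier that persists through later revisions is reported once, against the revision that introduced it.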
