A collection of scripts and utilities for extracting citations to academic literature from Wikipedia's XML database dumps.
Project description
This project contains a utility for extracting academic citation identifiers.
NOTE: mwcites requires Python 3 because one of its dependencies (Mediawiki-Utilities) does.
pip install mwcites
Usage
This package provides a single utility, mwcitations.
$ mwcitations extract enwiki-20150112-pages-meta-history*.xml*.bz2 > citations.tsv
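Once citations.tsv exists, it can be loaded with Python's standard csv module. The following is a minimal sketch, not part of mwcites: the function names read_citations and count_by_type are hypothetical, the column order is taken from the tool's help text, and the presence of a header line is an assumption (pass has_header=False if your file starts directly with data). The sample row is illustrative only, assembled from the example values in the help text.

```python
import csv
import io
from collections import Counter

# Column order as listed in the tool's help text.
FIELDS = ["page_id", "page_title", "rev_id", "timestamp", "type", "id"]


def read_citations(handle, has_header=True):
    """Yield one dict per citation row from a tab-separated stream.

    ASSUMPTION: the first line is a header naming the columns.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the header line
    for row in reader:
        if len(row) != len(FIELDS):
            continue  # ignore blank or malformed lines
        record = dict(zip(FIELDS, row))
        record["page_id"] = int(record["page_id"])
        record["rev_id"] = int(record["rev_id"])
        yield record


def count_by_type(records):
    """Count citations per identifier type (doi, pmid, isbn, ...)."""
    return Counter(r["type"] for r in records)


# Illustrative in-memory sample; real data would come from citations.tsv.
sample = (
    "page_id\tpage_title\trev_id\ttimestamp\ttype\tid\n"
    "1325125\tClub cell\t282470030\t2009-04-08T01:52:20Z\tdoi\t"
    "10.1183/09031936.00213411\n"
)
records = list(read_citations(io.StringIO(sample)))
counts = count_by_type(records)
```

To read the real output instead of the sample, open the file with open("citations.tsv", encoding="utf-8", newline="") and pass the handle to read_citations.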
Documentation
Documentation is provided by running $ mwcitations extract -h.
Extracts academic citations from articles from the history of Wikipedia articles by processing a pages-meta-history XML dump and matching regular expressions to revision content.

Currently supported identifiers include:
* PubMed
* DOI
* ISBN
* arXiv

Outputs a TSV file with the following fields:
* page_id: The identifier of the Wikipedia article (int), e.g. 1325125
* page_title: The title of the Wikipedia article (utf-8), e.g. Club cell
* rev_id: The Wikipedia revision where the citation was first added (int), e.g. 282470030
* timestamp: The timestamp of the revision where the citation was first added. (ISO 8601 datetime), e.g. 2009-04-08T01:52:20Z
* type: The type of identifier, e.g. pmid, pmcid, doi, isbn or arxiv
* id: The id of the cited scholarly article (utf-8), e.g 10.1183/09031936.00213411

Usage:
    mwcites extract -h | --help
    mwcites extract <dump_file>...

Options:
    -h --help    Shows this documentation
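The help text describes matching regular expressions against revision content and reporting the revision where each citation was first added. A minimal sketch of that idea follows. It is not the mwcites implementation: the patterns are simplified illustrations, the function names are hypothetical, the input is assumed to be (rev_id, timestamp, text) tuples in chronological order, and the PMID in the sample revisions is made up.

```python
import re

# ASSUMPTION: simplified illustrative patterns, not the ones mwcites uses.
PATTERNS = {
    "doi": re.compile(r"\b(10\.\d{4,9}/[^\s|\]}<>]+)"),
    "pmid": re.compile(r"\bpmid\s*=\s*(\d+)", re.IGNORECASE),
    "arxiv": re.compile(r"\barxiv\s*=\s*([\w.\-/]+)", re.IGNORECASE),
}


def extract_ids(text):
    """Return the set of (type, id) pairs found in one revision's text."""
    found = set()
    for id_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.add((id_type, match.group(1)))
    return found


def first_appearances(revisions):
    """Map each (type, id) to the first revision that contains it.

    `revisions` is an iterable of (rev_id, timestamp, text) tuples in
    chronological order (hypothetical input shape).
    """
    first_seen = {}
    for rev_id, timestamp, text in revisions:
        for key in extract_ids(text):
            # setdefault keeps the earliest revision for each identifier
            first_seen.setdefault(key, (rev_id, timestamp))
    return first_seen


# Made-up revision history for a single page.
revisions = [
    (1, "2009-04-01T00:00:00Z", "A stub with no references."),
    (2, "2009-04-08T01:52:20Z",
     "{{cite journal |doi=10.1183/09031936.00213411 |pmid=12345}}"),
    (3, "2009-05-01T00:00:00Z",
     "{{cite journal |doi=10.1183/09031936.00213411 |pmid=12345}} More."),
]
result = first_appearances(revisions)
```

Because an identifier is recorded only the first time it is seen, later revisions that still contain the same citation do not produce additional rows.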
Download files
Source Distributions
mwcites-0.2.0.zip (17.7 kB)
mwcites-0.2.0.tar.gz (10.5 kB)
File details
Details for the file mwcites-0.2.0.zip.
File metadata
- Download URL: mwcites-0.2.0.zip
- Upload date:
- Size: 17.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7670124a4ab55b856022949f0046a05d38a12d14a6f156e1a654747987485e2e
MD5 | 93e15cb66654777667dd497edb91ca4f
BLAKE2b-256 | ecd5e9df07872b866e44a7ab90e0c6fd472e531a5107393e6a1233a58c6017a1
File details
Details for the file mwcites-0.2.0.tar.gz.
File metadata
- Download URL: mwcites-0.2.0.tar.gz
- Upload date:
- Size: 10.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 8229377609e2d9ebcd3d8dc3ba8f8283a05029f91062a16c6b715dbd6bd7a536
MD5 | 035947b31aaaf640e12828e90a18b964
BLAKE2b-256 | 28c89bb13c1198d47aa59209b5f3551ad3ddd02eaa7de4c9d44fe94e4f634e11