A collection of scripts and utilities for extracting citations to academic literature from Wikipedia's XML database dumps.
Project description
This project contains a utility for extracting academic citation identifiers.
NOTE: mwcites requires Python 3 because one of its dependencies (Mediawiki-Utilities) does.
pip install mwcites
Usage
This package provides a single command-line utility, mwcitations.
$ mwcitations extract enwiki-20150112-pages-meta-history*.xml*.bz2 > citations.tsv
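Once the command above has written citations.tsv, the output can be loaded for further analysis. The following is a minimal Python sketch, not part of mwcites. It assumes the six tab-separated columns named in the tool's help text (page_id, page_title, rev_id, timestamp, type, id) and tolerates an optional header row, since the source does not say whether one is written. The function names read_citations and count_by_type are illustrative, and the "doi" type label in the sample row is an assumption.

```python
import csv
import io
from collections import Counter

# Column names as listed in the tool's help text.
FIELDS = ["page_id", "page_title", "rev_id", "timestamp", "type", "id"]


def read_citations(stream):
    """Yield one dict per citation row from a tab-separated text stream."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row == FIELDS:
            # Skip blank lines and a header row, if one is present (assumption).
            continue
        record = dict(zip(FIELDS, row))
        record["page_id"] = int(record["page_id"])
        record["rev_id"] = int(record["rev_id"])
        yield record


def count_by_type(records):
    """Count citations per identifier type."""
    return Counter(record["type"] for record in records)


# Sample row built from the example values in the help text; the "doi"
# type label is an assumption about how the tool names that identifier.
SAMPLE = (
    "page_id\tpage_title\trev_id\ttimestamp\ttype\tid\n"
    "1325125\tClub cell\t282470030\t2009-04-08T01:52:20Z\tdoi\t"
    "10.1183/09031936.00213411\n"
)

if __name__ == "__main__":
    # For a real file: open("citations.tsv", encoding="utf-8", newline="")
    print(count_by_type(read_citations(io.StringIO(SAMPLE))))
```

For a real dump the file can be large, so the generator reads one row at a time instead of loading everything into memory.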
Documentation
Documentation is provided by running $ mwcitations extract -h.
Extracts academic citations from articles from the history of Wikipedia articles by processing a pages-meta-history XML dump and matching regular expressions to revision content.

Currently supported identifiers include:

* PubMed
* DOI
* ISBN

Outputs a TSV file with the following fields:

* page_id: The identifier of the Wikipedia article (int), e.g. 1325125
* page_title: The title of the Wikipedia article (utf-8), e.g. Club cell
* rev_id: The Wikipedia revision where the citation was first added (int), e.g. 282470030
* timestamp: The timestamp of the revision where the citation was first added. (ISO 8601 datetime), e.g. 2009-04-08T01:52:20Z
* type: The type of identifier, e.g. pmid
* id: The id of the cited scholarly article (utf-8), e.g. 10.1183/09031936.00213411

Usage:
    mwcites extract -h | --help
    mwcites extract <dump_file>...

Options:
    -h --help        Shows this documentation
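The help text describes matching regular expressions against revision content and recording the revision in which each identifier was first added. The following Python sketch illustrates that idea only. The patterns are simplified assumptions, not the expressions mwcites actually uses, and the revisions are supplied as in-memory (rev_id, timestamp, text) tuples instead of being parsed from a dump. The names extract_ids and first_appearances are hypothetical.

```python
import re

# Simplified, illustrative patterns (assumptions, not the ones mwcites ships).
PATTERNS = {
    "doi": re.compile(r"\b(10\.\d{4,9}/[^\s|{}\[\]<>\"]+)"),
    "pmid": re.compile(r"\bpmid\s*[=:]?\s*(\d+)", re.IGNORECASE),
    "isbn": re.compile(
        r"\bisbn\s*[=:]?\s*([0-9][0-9\-]{8,15}[0-9Xx])\b", re.IGNORECASE
    ),
}


def extract_ids(text):
    """Return the set of (type, id) pairs found in one revision's text."""
    found = set()
    for id_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.add((id_type, match.group(1)))
    return found


def first_appearances(revisions):
    """Yield (rev_id, timestamp, type, id) the first time an identifier appears.

    `revisions` must be an iterable of (rev_id, timestamp, text) tuples in
    chronological order for a single page.
    """
    seen = set()
    for rev_id, timestamp, text in revisions:
        current = extract_ids(text)
        # Only identifiers not seen in any earlier revision are reported.
        for id_type, value in sorted(current - seen):
            yield rev_id, timestamp, id_type, value
        seen |= current


if __name__ == "__main__":
    # Hypothetical revision history; only the middle revision's id, timestamp,
    # and DOI are taken from the examples in the help text.
    revisions = [
        (1, "2009-04-01T00:00:00Z", "No citations yet."),
        (282470030, "2009-04-08T01:52:20Z",
         "{{cite journal | doi = 10.1183/09031936.00213411 }}"),
        (3, "2009-04-09T00:00:00Z",
         "{{cite journal | doi = 10.1183/09031936.00213411 }} more text"),
    ]
    for row in first_appearances(revisions):
        print("\t".join(str(field) for field in row))
```

Because identifiers are remembered once seen, a citation that is removed and later restored is still attributed to the revision that first introduced it.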
Download files
Source Distributions
mwcites-0.1.0.zip (17.2 kB)
mwcites-0.1.0.tar.gz (10.8 kB)
File details
Details for the file mwcites-0.1.0.zip.
File metadata
- Download URL: mwcites-0.1.0.zip
- Upload date:
- Size: 17.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 2ce481bc739f9b2511ed2037f4aa78fdf1672e5c1c5ba9a09d697e64bf9dab2f |
MD5 | bf6c5c7143d6d7d56fb58b19ab6a6c8b |
BLAKE2b-256 | 029a8452dbf1011e0588c43e8ba2301d88e611d7faf408ec58653f7775dcf488 |
File details
Details for the file mwcites-0.1.0.tar.gz.
File metadata
- Download URL: mwcites-0.1.0.tar.gz
- Upload date:
- Size: 10.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 275b1a05b6e6714c214608dc5b934f9ac47bc01c21846cf4ae7040334a19e2c8 |
MD5 | 52ff70a14e9c32a953905fb76204195b |
BLAKE2b-256 | b2b428bab6a5436594f71679cc4bf7dfe1a1f1382ef94b072e35e80fe1cc1d7b |
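
The SHA256 digests listed for the two source distributions can be used to check a downloaded archive. A minimal sketch using Python's standard hashlib module follows. The helper names sha256_of and verify are illustrative, and the local file path is whatever location the archive was saved to.

```python
import hashlib

# SHA256 digests copied from the file hash tables for each distribution.
EXPECTED_SHA256 = {
    "mwcites-0.1.0.zip":
        "2ce481bc739f9b2511ed2037f4aa78fdf1672e5c1c5ba9a09d697e64bf9dab2f",
    "mwcites-0.1.0.tar.gz":
        "275b1a05b6e6714c214608dc5b934f9ac47bc01c21846cf4ae7040334a19e2c8",
}


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, distribution_name):
    """Return True if the file at `path` matches the listed digest."""
    return sha256_of(path) == EXPECTED_SHA256[distribution_name]
```

For example, verify("mwcites-0.1.0.tar.gz", "mwcites-0.1.0.tar.gz") returns True only when the local file's contents match the published digest.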