waybackprov

Checks the provenance of a URL in the Wayback Machine

Project description

Give waybackprov a URL, and optionally a --start and --end year, and it uses an undocumented Internet Archive API call to fetch the provenance data behind the Wayback Machine's calendar view. It then summarizes which Internet Archive collections save the URL most often.

Install

pip install waybackprov

Use

Here is how it works:

% waybackprov https://twitter.com/EPAScottPruitt
364 https://archive.org/details/focused_crawls
306 https://archive.org/details/edgi_monitor
151 https://archive.org/details/www3.epa.gov
 60 https://archive.org/details/epa.gov4
 47 https://archive.org/details/epa.gov5
...
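
If you want to work with the summary programmatically, the text output above is easy to parse. The following is a minimal sketch, assuming each line has the form "count, whitespace, collection URL" as in the sample; parse_summary is a hypothetical helper, not part of waybackprov.

```python
def parse_summary(text):
    """Parse lines of the form '<count> <collection URL>' into (count, url) pairs."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and anything that is not "<integer> <url>", such as "...".
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        rows.append((int(parts[0]), parts[1]))
    return rows


# Sample output copied from the example above.
sample = """\
364 https://archive.org/details/focused_crawls
306 https://archive.org/details/edgi_monitor
151 https://archive.org/details/www3.epa.gov
 60 https://archive.org/details/epa.gov4
 47 https://archive.org/details/epa.gov5
"""

rows = parse_summary(sample)
# Map the collection identifier (last path segment) to its count.
counts = {url.rsplit("/", 1)[-1]: count for count, url in rows}
print(counts["edgi_monitor"])
```

Because collections can overlap, the counts should not be added together to get a total number of crawls.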

One thing to remember when interpreting this data is that collections can contain other collections. For example, the edgi_monitor collection is a subcollection of focused_crawls.

If you use the --collapse option, only the most specific collection will be reported for a given crawl. So if coll1 is part of coll2, which is part of coll3, only coll1 will be reported instead of coll1, coll2 and coll3. This requires collection metadata lookups against the Internet Archive API, so it slows performance significantly.
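
The collapsing idea can be illustrated with a short sketch. This is not waybackprov's actual implementation: it assumes the hierarchy is already available as a hypothetical parent_of dictionary with a single parent per collection, whereas the real tool looks this information up from Internet Archive metadata.

```python
def most_specific(collections, parent_of):
    """Drop every collection that is an ancestor of another collection in the list."""
    ancestors = set()
    for coll in collections:
        seen = set()
        parent = parent_of.get(coll)
        # Walk up the chain; 'seen' guards against cycles in the metadata.
        while parent is not None and parent not in seen:
            seen.add(parent)
            ancestors.add(parent)
            parent = parent_of.get(parent)
    return [c for c in collections if c not in ancestors]


# Hypothetical hierarchy: coll1 is part of coll2, which is part of coll3.
parent_of = {"coll1": "coll2", "coll2": "coll3"}
print(most_specific(["coll1", "coll2", "coll3"], parent_of))  # ['coll1']
```

Each crawl's collection list would be reduced this way before counting, so a crawl saved under a subcollection is not also counted under its parents.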

If you would rather see the raw data as JSON or CSV, use the --format option. With either of these formats you will see the metadata for each crawl rather than a summary.

If you use --verbose, a log of what waybackprov is doing will be written to waybackprov.log.
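
To call waybackprov from a script, the options described above can be assembled into a command line. This is a rough sketch: build_command and run are hypothetical helpers, and passing the format as a lowercase value such as "json" or "csv" is an assumption, so check waybackprov --help for the exact spelling.

```python
import subprocess


def build_command(url, start=None, end=None, collapse=False, fmt=None, verbose=False):
    """Assemble a waybackprov argv list from the options described above."""
    cmd = ["waybackprov"]
    if start is not None:
        cmd += ["--start", str(start)]
    if end is not None:
        cmd += ["--end", str(end)]
    if collapse:
        cmd.append("--collapse")
    if fmt is not None:
        # Assumption: the format is passed as a lowercase value such as "json" or "csv".
        cmd += ["--format", fmt]
    if verbose:
        cmd.append("--verbose")
    cmd.append(url)
    return cmd


def run(url, **options):
    """Run waybackprov (it must be installed and on PATH) and return its stdout."""
    result = subprocess.run(build_command(url, **options),
                            capture_output=True, text=True, check=True)
    return result.stdout


print(build_command("https://twitter.com/EPAScottPruitt", start=2017, end=2018, fmt="json"))
```

Building the argument list separately from running it keeps the command easy to inspect before any network requests are made.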

Test

To run the tests, first install pytest and then run:

pytest test.py

Project details

Release history

0.0.9

May 19, 2022

0.0.8

Jan 23, 2021

0.0.7

Jul 30, 2018

0.0.6

Jul 24, 2018

0.0.5

Jul 24, 2018

0.0.4

Jul 23, 2018

0.0.3

Jul 21, 2018

0.0.2

Jul 12, 2018

This version

0.0.1

Jul 12, 2018

Download files

Source Distribution

waybackprov-0.0.1.tar.gz (3.6 kB)

Uploaded Jul 12, 2018 Source

Hashes for waybackprov-0.0.1.tar.gz

Algorithm	Hash digest
SHA256	`0d458d2f6d25631973197e8f0d758bb662f2dd399700d565f84c225e1fa1cf51`
MD5	`23a012865a4abc53b1c88117cd7ccfb5`
BLAKE2b-256	`c08bfe0afdef204a6c0203eeb3adea537b8d7ae218603237bab4c70cfe76359a`
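
If you download the source distribution by hand, you can compare its digest with the SHA256 value listed above. The following is a minimal sketch using only the standard library; sha256_of and verify are hypothetical helper names, and the file path is whatever location you saved the archive to.

```python
import hashlib

# SHA256 digest listed above for waybackprov-0.0.1.tar.gz.
EXPECTED_SHA256 = "0d458d2f6d25631973197e8f0d758bb662f2dd399700d565f84c225e1fa1cf51"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True if the file at 'path' has the expected SHA256 digest."""
    return sha256_of(path) == expected.lower()
```

For example, verify("waybackprov-0.0.1.tar.gz") should return True for an intact copy of the archive saved in the current directory.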