
A simple web spider with pluggable recursion strategies

Project description

A simple web spider with several recursion strategies. The home page is at http://github.com/slinkp/spydey.

It doesn’t do much except follow links and report status. I mostly use it for quick and dirty smoke testing and link checking.

The only unusual feature is the --traversal=pattern option, which does recursive traversal in an unusual order: it tries to recognize patterns in URLs, and follows URLs with novel patterns before URLs whose patterns it has seen before. When there are no novel patterns left to follow, it follows random links to URLs of known patterns. If you use this for smoke-testing a typical modern web app that maps URL patterns to views/controllers, it will very quickly hit all your views/controllers at least once… usually. But it’s not very interesting when pointed at a website with arbitrarily deep trees (static files, VCS repositories, and the like).
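
Here’s a minimal sketch of the idea (not spydey’s actual code; the helper names are made up): reduce each URL to a rough pattern by wildcarding variable-looking path segments, then prefer URLs whose pattern hasn’t been seen yet.

    import random
    import re
    from urllib.parse import urlparse

    def url_pattern(url):
        # Wildcard path segments that look like variable data
        # (numbers, long hex ids); keep the fixed segments.
        segments = urlparse(url).path.split('/')
        return '/'.join('*' if re.fullmatch(r'\d+|[0-9a-f]{8,}', s) else s
                        for s in segments)

    def pick_next(queue, seen_patterns):
        # Prefer a URL with a novel pattern; failing that, follow a
        # random link to a URL of a known pattern.
        for i, url in enumerate(queue):
            if url_pattern(url) not in seen_patterns:
                return queue.pop(i)
        return queue.pop(random.randrange(len(queue)))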

Spydey was originally written to experiment with different recursive crawling strategies, so it’s designed to make adding a new strategy trivial. Read the source.
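
To give a feel for why (these class names are hypothetical, not spydey’s actual interface): a traversal strategy is essentially just a policy for which queued URL to fetch next, so breadth-first vs. depth-first reduces to a FIFO vs. LIFO frontier.

    from collections import deque

    class BreadthFirst:
        # FIFO frontier: crawl the site level by level.
        def __init__(self):
            self.frontier = deque()
        def push(self, url):
            self.frontier.append(url)
        def pop(self):
            return self.frontier.popleft()

    class DepthFirst:
        # LIFO frontier: follow one branch all the way down
        # before backtracking.
        def __init__(self):
            self.frontier = []
        def push(self, url):
            self.frontier.append(url)
        def pop(self):
            return self.frontier.pop()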

Oh, and if you install Fabulous, console output is in color.

For smoke testing, I typically run it like this:

spydey -r --max-requests=100 --traversal=pattern --profile --log-referrer URL

There are a number of other command-line options, many stolen from wget. Use --help to see them; currently they are:

Usage: spydey [options] URL

Options:
 -h, --help            show this help message and exit
 -r, --recursive       recur into subdirectories
 -p, --page-requisites
                       get all images, etc. needed to display HTML page.
 --no-parent           don't ascend to the parent directory.
 -R REJECT, --reject=REJECT
                       Regex for filenames to reject. May be given multiple
                       times.
 -A ACCEPT, --accept=ACCEPT
                       Regex for filenames to accept. May be given multiple
                       times.
 -t TRAVERSAL, --traversal=TRAVERSAL
                       Recursive traversal strategy. Choices are: breadth-
                       first, depth-first, hybrid, pattern, random
 -H, --span-hosts      go to foreign hosts when recursive.
 -w WAIT, --wait=WAIT  wait SECONDS between retrievals.
 --random-wait=RANDOM_WAIT
                       wait from 0...2*WAIT secs between retrievals.
 --loglevel=LOGLEVEL   Log level.
 --log-referrer        Log referrer URL for each request.
 --transient-log       Use Fabulous transient logging config.
 --max-redirect=MAX_REDIRECT
                       Maximum number of redirections to follow for a
                       resource.
 --max-requests=MAX_REQUESTS
                       Maximum number of requests to make before exiting.
 -T TIMEOUT, --timeout=TIMEOUT
                       Set the network timeout in seconds. 0 means no
                       timeout.
 -P, --profile         Print the time to download each resource, and a
                       summary of the 20 slowest at the end.
 -v, --version         Print version information and exit.
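
For example, a polite link check confined to one section of a site might combine several of these options (the URL is just a placeholder):

spydey -r --no-parent --wait=1 --max-requests=500 http://example.com/docs/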

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spydey-0.3.tar.gz (9.0 kB)


File details

Details for the file spydey-0.3.tar.gz.

File metadata

  • Download URL: spydey-0.3.tar.gz
  • Upload date:
  • Size: 9.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for spydey-0.3.tar.gz
 Algorithm    Hash digest
 SHA256       3fed456104cff28e10de27eed7bdd6012612d410efdd9850ab544a5df41aa24c
 MD5          d0a8f7da11283942b2fd30e4db1b4bd9
 BLAKE2b-256  7e0bcecc7d85d02f4e0e239dfc1284ebe88a26fc7968be593be674bd8b2b4db2

See the PyPI documentation for more details on using hashes.
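
If you want to check a downloaded copy yourself, one quick way (assuming the tarball is in the current directory) is to recompute the SHA256 digest and compare it to the value above:

    import hashlib

    EXPECTED = "3fed456104cff28e10de27eed7bdd6012612d410efdd9850ab544a5df41aa24c"

    with open("spydey-0.3.tar.gz", "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    print("OK" if digest == EXPECTED else "MISMATCH")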
