A minimalistic, recursive web crawling library for Python.

Project description

The solitary and lucid spectator of a multiform, instantaneous and almost intolerably precise world.

Funes the Memorious, Jorge Luis Borges

memorious is a lightweight web scraping toolkit. It supports scrapers that collect structured or unstructured data. Its goals are to:

  • Make crawlers modular and simple tasks re-usable

  • Provide utility functions for common tasks such as data storage and HTTP session management

  • Integrate crawlers with the Aleph and FollowTheMoney ecosystem

  • Get out of your way as much as possible

Design

When writing a scraper, you often need to paginate through an index page, then download an HTML page for each result, and finally parse that page and insert or update a record in a database.

memorious handles this by managing a set of crawlers, each of which can be composed of multiple stages. Each stage is implemented using a Python function, which can be re-used across different crawlers.
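To make the idea of stages concrete, here is a minimal, self-contained sketch of a crawler built from reusable stage functions. It does not use the memorious API: the stage signature (data, emit), the run helper, and the PIPELINE mapping are all hypothetical names chosen for illustration, and the "fetch" stage fabricates HTML instead of touching the network.

```python
from collections import deque


# Hypothetical stage functions: each takes a data dict and an emit callback.
def seed(data, emit):
    # Emit one item per index page (the page numbers are made up).
    for page in range(1, 3):
        emit({"page": page})


def fetch(data, emit):
    # Stand-in for an HTTP download; no network access happens here.
    emit({**data, "html": f"<h1>Result {data['page']}</h1>"})


def parse(data, emit):
    # Pull the title out of the fake HTML and pass the record on.
    title = data["html"].removeprefix("<h1>").removesuffix("</h1>")
    emit({"page": data["page"], "title": title})


def run(pipeline, first_stage):
    """Run stages in queue order, routing each emitted item to the next stage."""
    records = []
    queue = deque([(first_stage, {})])
    while queue:
        name, data = queue.popleft()
        func, next_stage = pipeline[name]

        def emit(item, target=next_stage):
            # Items emitted by the last stage are collected as final records.
            if target is None:
                records.append(item)
            else:
                queue.append((target, item))

        func(data, emit)
    return records


# Each entry maps a stage name to (function, name of the following stage).
PIPELINE = {
    "seed": (seed, "fetch"),
    "fetch": (fetch, "parse"),
    "parse": (parse, None),
}

records = run(PIPELINE, "seed")
```

Because every stage only sees its input and an emit callback, the same fetch or parse function could be wired into a different pipeline without changes, which is the kind of reuse described above.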

The basic steps of writing a Memorious crawler:

  1. Create a YAML crawler configuration file

  2. Add different stages

  3. Write code for stage operations (optional)

  4. Test, rinse, repeat
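As a rough illustration of the "test" step, the sketch below checks that every stage referenced in a crawler configuration is actually defined. The configuration is written as the Python dict a YAML parser might return. Its keys (name, pipeline, method, handle) and the dangling_targets helper are assumptions made for this example, not a confirmed description of the memorious schema, so consult the project documentation for the real format.

```python
# Hypothetical crawler configuration, shown as the dict a YAML parser would return.
# All key names and method names here are assumptions for illustration only.
CONFIG = {
    "name": "example_crawler",
    "pipeline": {
        "init": {"method": "seed", "handle": {"pass": "fetch"}},
        "fetch": {"method": "fetch", "handle": {"pass": "parse"}},
        "parse": {"method": "example.module:parse", "handle": {}},
    },
}


def dangling_targets(config):
    """Return (stage, rule, target) triples whose target stage is not defined."""
    stages = config["pipeline"]
    return [
        (name, rule, target)
        for name, stage in stages.items()
        for rule, target in stage.get("handle", {}).items()
        if target not in stages
    ]


# A well-formed configuration has no dangling references.
problems = dangling_targets(CONFIG)

# A configuration that hands off to a stage it never defines.
BROKEN = {
    "name": "broken_crawler",
    "pipeline": {"init": {"method": "seed", "handle": {"pass": "fetch"}}},
}
broken_problems = dangling_targets(BROKEN)
```

A check like this catches misspelled stage names before a crawl starts, which shortens the edit-and-rerun loop.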

Documentation

The documentation for Memorious is available at alephdata.github.io/memorious. Feel free to edit the source files in the docs folder and send pull requests for improvements.

To build the documentation, run make html inside the docs folder.

You’ll find the resulting HTML files in /docs/_build/html.

Project details


This version

2.6.2

Download files

Source Distribution

memorious-2.6.2.tar.gz (40.6 kB)

Uploaded Source

Built Distribution

memorious-2.6.2-py2.py3-none-any.whl (52.4 kB)

Uploaded Python 2 Python 3

File details

Details for the file memorious-2.6.2.tar.gz.

File metadata

  • Download URL: memorious-2.6.2.tar.gz
  • Size: 40.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.11.3

File hashes

Hashes for memorious-2.6.2.tar.gz
Algorithm Hash digest
SHA256 472ad48836198abab7e99648e8d4040a73beb2fdb92bf95c28281ca2ac5cd342
MD5 e01a88951e9fa4a087f3baccd77ea1d2
BLAKE2b-256 782b14133a6d9e36865798207cab1fcdb37563135ccba509fd45b629683c3960

File details

Details for the file memorious-2.6.2-py2.py3-none-any.whl.

File metadata

  • Download URL: memorious-2.6.2-py2.py3-none-any.whl
  • Size: 52.4 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.11.3

File hashes

Hashes for memorious-2.6.2-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 de7ed50089066903efe629e703de6384d42b9b1eb19a53a3e3c91da01fdb2860
MD5 95781b40c92899c6b482b91c17d7b05f
BLAKE2b-256 2217428dc8f4cc787047ae2ca730fef66d16dd01e3b2aca3d41c65308124f783
