
Crawler for Schul-Cloud Resources

Project description


This crawler fetches resources from URLs and posts them to a server.

Purpose

The purpose of this crawler is:

  • We can provide test data to the API.

  • It can crawl resources that are not active themselves and cannot post to the API.

  • Other crawl services can use this crawler to upload their conversions.

  • It has the full crawler logic but does not transform resources into other formats.

    • Maybe we can derive recommendations or a library for other crawlers from this case.

Requirements

The crawler should work as follows:

  • Provide URLs

    • as command line arguments

    • as a link to a file with one URL per line (see the example below)

  • Provide resources

    • as one resource in a file

    • as a list of resources

The crawler must be invoked to crawl.
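
For example, a link to a plain-text file with one resource URL per line can be passed instead of the resource URLs themselves. The following is only a sketch: https://example.org/url-list.txt is a placeholder for wherever such a file is hosted, and it assumes the link is passed like any other URL argument.

python3 -m ressource_url_crawler http://localhost:8080 \
        https://example.org/url-list.txt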

Example

This example fetches a resource from the given URL and posts it to the API:

python3 -m ressource_url_crawler http://localhost:8080 \
        https://raw.githubusercontent.com/schul-cloud/ressources-api-v1/master/schemas/ressource/examples/valid/example-website.json

Authentication

You can specify authentication like this:

  • --basic=username:password for basic authentication (see the example below)

  • --apikey=apikey for API key authentication
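
For example, the call from the Example section could be combined with basic authentication. This is only a sketch: username and password are placeholders, and it assumes the option can be given alongside the positional arguments as shown.

python3 -m ressource_url_crawler --basic=username:password \
        http://localhost:8080 \
        https://raw.githubusercontent.com/schul-cloud/ressources-api-v1/master/schemas/ressource/examples/valid/example-website.json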

Further Requirements

  • The crawler does not post resources twice. This can be implemented by

    • caching the resources locally to see whether they changed

      • comparing the resource content

      • comparing the timestamp

    • removing the old resources from the database after the updated ones have been posted.

This may require some form of state for the crawler. The state could be added to the resources in an X-Ressources-Url-Crawler-Source field. This allows local caching and requires fetching the objects from the database.
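
A minimal sketch of the local-caching idea, not the crawler's actual implementation (the function and variable names are hypothetical): a resource is posted again only if its content fingerprint differs from the one remembered for its source URL.

import hashlib
import json


def resource_fingerprint(resource):
    # Return a stable hash of the resource's JSON content.
    canonical = json.dumps(resource, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_resources(resources_by_url, cache):
    # Yield only resources that are new or whose content changed since the
    # last crawl. resources_by_url maps source URL -> resource dict, and
    # cache maps source URL -> fingerprint from the previous crawl.
    for url, resource in resources_by_url.items():
        fingerprint = resource_fingerprint(resource)
        if cache.get(url) != fingerprint:
            cache[url] = fingerprint
            yield url, resource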

Download files

Download the file for your platform.

Source Distribution

schul_cloud_url_crawler-1.0.13.tar.gz (14.1 kB, Source)

Built Distribution

schul_cloud_url_crawler-1.0.13-py3-none-any.whl (19.9 kB, Python 3)

Hashes for schul_cloud_url_crawler-1.0.13.tar.gz:

  SHA256       222b21e1b8c3df2fb334932b8b0304b45eda8cc951a4b1977b12f2814ff7265b
  MD5          03b77b1c62d7602049f1cc4ab4c36bdb
  BLAKE2b-256  9a3ff3b27bfe50f8d2b6236f626cba49dbb97c007a7de6c264019d4bbdeda069

Hashes for schul_cloud_url_crawler-1.0.13-py3-none-any.whl:

  SHA256       0e25e2951770e21f6ab8fe474204f7aec0fee8cb593651225db8e9f996bba1c9
  MD5          76bfc6ee9f2715c1e207a00352bf47a6
  BLAKE2b-256  5457b2556126f7ad6fd9a11419c1fcb2f3d93d8fcd36937437c3aefbc5c78b41
