Crawler for Schul-Cloud Ressources
Project description
This crawler fetches resources from URLs and posts them to a server.
Purpose
This crawler serves several purposes:
- It provides test data to the API.
- It can crawl resources which are not active and cannot post themselves.
- Other crawl services can use this crawler to upload their conversions.
- It has the full crawler logic but does not transform resources into other formats.
- Maybe we can create recommendations or a library for crawlers from this case.
Requirements
The crawler should work as follows:
- Provide URLs
  - as command line arguments
  - as a link to a file with one URL per line
- Provide resources
  - as one resource in a file
  - as a list of resources
The crawler must be invoked to crawl.
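The two ways of providing URLs listed above could be combined as in the following minimal sketch. The function names `parse_url_lines` and `collect_urls` are hypothetical and not part of this package. Fetching the linked file is left out, so the sketch starts from the text of that file.

```python
def parse_url_lines(text):
    """Return the non-empty, stripped lines of a one-URL-per-line document."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def collect_urls(cli_urls, url_list_text=None):
    """Combine URLs given as arguments with URLs listed in a fetched file.

    `url_list_text` is assumed to be the already downloaded content of the
    linked file; how it is downloaded is not covered here.
    """
    urls = list(cli_urls)
    if url_list_text is not None:
        urls.extend(parse_url_lines(url_list_text))
    return urls
```

Blank lines and surrounding whitespace are ignored here. Whether the real crawler does the same is an assumption.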
Example
This example gets a resource from the URL and posts it to the API.
python3 -m ressource_url_crawler http://localhost:8080 \
https://raw.githubusercontent.com/schul-cloud/ressources-api-v1/master/schemas/ressource/examples/valid/example-website.json
Authentication
You can specify the authentication like this:
- --basic=username:password for basic authentication
- --apikey=apikey for API key authentication
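As an illustration of how these two options might be read, here is a minimal sketch using `argparse` from the standard library. The helper names `parse_auth` and `basic_auth_header` are hypothetical, and the sketch does not reflect how this package parses its arguments internally. The header construction follows standard HTTP Basic authentication. How the API key is transmitted to the server is not stated here, so the sketch only extracts it.

```python
import argparse
import base64


def parse_auth(argv):
    """Parse --basic and --apikey from a list of command line arguments."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--basic", metavar="username:password")
    parser.add_argument("--apikey", metavar="apikey")
    # parse_known_args leaves the URLs and any other arguments untouched.
    options, remaining = parser.parse_known_args(argv)
    return options, remaining


def basic_auth_header(credentials):
    """Build the Authorization header value for 'username:password'."""
    if ":" not in credentials:
        raise ValueError("expected credentials in the form username:password")
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return "Basic " + token
```

For example, `parse_auth(["--basic=username:password", "http://localhost:8080"])` yields options with `basic` set and leaves the URL in the remaining arguments.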
Further Requirements
The crawler does not post resources twice. This can be implemented by
- caching the resources locally, to see if they changed
  - compare the resource
  - compare the timestamp
- removing the resources from the database if they are updated after posting new resources.
This may require some form of state for the crawler. The state could be added to the resources in an X-Ressources-Url-Crawler-Source field. This allows local caching and requires getting the objects from the database.
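The "compare the resource" variant of local caching could look like the following minimal sketch. It assumes resources are JSON-serializable dictionaries and keeps the cache as a plain in-memory dictionary keyed by source URL. The names `fingerprint` and `needs_posting` are hypothetical, and persisting the cache between runs is not covered.

```python
import hashlib
import json


def fingerprint(resource):
    """Hash a JSON-serializable resource; key order does not affect the result."""
    canonical = json.dumps(resource, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_posting(cache, source_url, resource):
    """Return True and update the cache if the resource is new or has changed."""
    digest = fingerprint(resource)
    if cache.get(source_url) == digest:
        return False  # unchanged since the last crawl, so do not post again
    cache[source_url] = digest
    return True
```

Hashing a canonical JSON form means that a resource with reordered keys is still recognized as unchanged. Comparing timestamps instead would avoid downloading the content but depends on the source reporting them reliably.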
File details
Details for the file schul_cloud_url_crawler-1.0.6.tar.gz.
File metadata
- Download URL: schul_cloud_url_crawler-1.0.6.tar.gz
- Upload date:
- Size: 4.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9a5d58a05cb0634135611f70de2b3e88eb84c6e86f41d1a1f80071603601d03b
MD5 | 2557da9a7e90101cd31775525ee5853b
BLAKE2b-256 | 2e69795d3afa1af1dd9c9a07298c4a49fd985b524522fe529dd7a7c6c72604f8
File details
Details for the file schul_cloud_url_crawler-1.0.6-py3-none-any.whl.
File metadata
- Download URL: schul_cloud_url_crawler-1.0.6-py3-none-any.whl
- Upload date:
- Size: 8.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 26c84ed1073e5946a8464851dd07998638cd99c781756bda0541553d29257c72
MD5 | aca9e8f319796229d4369cefc5839685
BLAKE2b-256 | 23431bd113dcecbc1a1b50354f8eb747f394a3faef0b001540e5f68572474194