Crawler for Schul-Cloud Resources

Project description


This crawler fetches resources from URLs and posts them to a server.

Purpose

The purpose of this crawler is:

  • We can provide test data to the API.

  • It can crawl resources which are not active and cannot post themselves.

  • Other crawl services can use this crawler to upload their conversions.

  • It has the full crawler logic but does not transform into other formats.

    • Perhaps recommendations or a library for crawlers can be derived from this case.

Requirements

The crawler should work as follows:

  • Provide URLs

    • as command line arguments

    • as a link to a file with one URL per line (see the sketch below)

  • Provide resources

    • as a single resource in a file

    • as a list of resources

The crawler must be invoked to crawl.
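
The following is a minimal sketch of how such URL collection could look. collect_urls and the convention of treating arguments that end in .txt as links to URL lists are illustrative assumptions, not part of the package's documented interface.

import sys
import urllib.request

def collect_urls(arguments):
    """Yield crawlable URLs from command line arguments.

    An argument ending in .txt is treated as a link to a file
    with one URL per line; anything else is taken as a URL itself.
    """
    for argument in arguments:
        if argument.endswith(".txt"):
            # fetch the linked file and yield one URL per non-empty line
            with urllib.request.urlopen(argument) as response:
                for line in response.read().decode("utf-8").splitlines():
                    line = line.strip()
                    if line:
                        yield line
        else:
            yield argument

if __name__ == "__main__":
    for url in collect_urls(sys.argv[1:]):
        print(url)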

Example

This example fetches a resource from the given URL and posts it to the API.

python3 -m ressource_url_crawler http://localhost:8080 \
        https://raw.githubusercontent.com/schul-cloud/ressources-api-v1/master/schemas/ressource/examples/valid/example-website.json
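
Roughly the same fetch-and-post can be sketched with the requests library. The POST path /ressources is an assumption about the API layout for illustration, not a documented endpoint.

import requests

API = "http://localhost:8080"
SOURCE = ("https://raw.githubusercontent.com/schul-cloud/ressources-api-v1/"
          "master/schemas/ressource/examples/valid/example-website.json")

ressource = requests.get(SOURCE).json()  # fetch the resource description
response = requests.post(API + "/ressources", json=ressource)  # post it
response.raise_for_status()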

Authentication

You can specify the authentication like this:

  • --basic=username:password for basic authentication

  • --apikey=apikey for API key authentication
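
A sketch of how the two options could translate to HTTP requests follows; the exact header format for the API key is an assumption for illustration.

import requests

# --basic=username:password becomes HTTP basic authentication
requests.post("http://localhost:8080/ressources", json={},
              auth=("username", "password"))

# --apikey=apikey becomes an Authorization header carrying the key
# (the "api-key" scheme shown here is assumed, not documented)
requests.post("http://localhost:8080/ressources", json={},
              headers={"Authorization": "api-key apikey"})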

Further Requirements

  • The crawler does not post resources twice. This can be implemented by

    • caching the resources locally to see whether they changed

      • compare the resource

      • compare the timestamp

    • removing the resources from the database if they are updated, after posting the new resources.

This may require some form of state for the crawler. The state could be added to the resources in an X-Ressources-Url-Crawler-Source field. This allows local caching and requires getting the objects from the database.
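
Below is a minimal sketch of the local caching idea, assuming resources are JSON-serializable dictionaries and that a content hash per source URL is enough to detect changes; the cache layout is an illustration only, not the crawler's actual state format.

import hashlib
import json

cache = {}  # maps a source URL to the hash of the last posted resource

def should_post(source_url, ressource):
    """Return True if the resource changed since it was last posted."""
    digest = hashlib.sha256(
        json.dumps(ressource, sort_keys=True).encode("utf-8")
    ).hexdigest()
    if cache.get(source_url) == digest:
        return False  # unchanged, do not post it twice
    cache[source_url] = digest
    return True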

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

schul_cloud_url_crawler-1.0.10.tar.gz (10.6 kB)

Uploaded Source

Built Distribution

schul_cloud_url_crawler-1.0.10-py3-none-any.whl (15.5 kB)

Uploaded Python 3

File details

Details for the file schul_cloud_url_crawler-1.0.10.tar.gz.

File metadata

File hashes

Hashes for schul_cloud_url_crawler-1.0.10.tar.gz

Algorithm    Hash digest
SHA256       fdc12e243154f1258afeca04dc6f49e18650270ec6081f74fb553f6a46453a62
MD5          00946199f87c34fe97d32a9bd23db9c6
BLAKE2b-256  a1cec4a4f216c5a0993ee52166cccc3fe7aa5faaf95980c373f9d59118fae1b8

See the Python packaging documentation for more details on using hashes.
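
As an aside, checking a downloaded file against the SHA256 digest above can be done with the standard library:

import hashlib

EXPECTED = "fdc12e243154f1258afeca04dc6f49e18650270ec6081f74fb553f6a46453a62"

with open("schul_cloud_url_crawler-1.0.10.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

assert digest == EXPECTED, "hash mismatch - do not install this file"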

File details

Details for the file schul_cloud_url_crawler-1.0.10-py3-none-any.whl.

File metadata

File hashes

Hashes for schul_cloud_url_crawler-1.0.10-py3-none-any.whl

Algorithm    Hash digest
SHA256       e9a2d76044b24b45bfbf344f0962fdefbcb1d17be374842cd161712c9ba6651f
MD5          439ae5151c5f4b546cdfa1a9e0cfdba8
BLAKE2b-256  d362c1f6db7cd853dffc8a90cbc15439bc7ca2e524adb77631eeb3d24adb2cf7

See the Python packaging documentation for more details on using hashes.
