Skip to main content

Async crawler and datalake service for data.gouv.fr

Project description

udata-hydra 🦀

udata-hydra is an async metadata crawler for data.gouv.fr.

URLs are crawled via aiohttp, catalog and crawled metadata are stored in a PostgreSQL database.

CLI

Create database structure

Install udata-hydra dependencies and cli. poetry install

poetry run udata-hydra migrate

Load (UPSERT) latest catalog version from data.gouv.fr

udata-hydra load-catalog

Crawler

udata-hydra-crawl

It will crawl (forever) the catalog according to config set in config.py.

BATCH_SIZE URLs are queued at each loop run.

The crawler will start with URLs never checked and then proceed with URLs crawled before SINCE interval. It will then wait until something changes (catalog or time).

There's a by-domain backoff mecanism. The crawler will wait when, for a given domain in a given batch, BACKOFF_NB_REQ is exceeded in a period of BACKOFF_PERIOD seconds. It will retry until the backoff is lifted.

If an URL matches one of the EXCLUDED_PATTERNS, it will never be checked.

Worker

A job queuing system is used to process long-running tasks. Launch the worker with the following command:

poetry run rq worker -c udata_hydra.worker

API

Run

poetry install
poetry run adev runserver udata_hydra/app.py

Get latest check

Works with ?url={url} and ?resource_id={resource_id}.

$ curl -s "http://localhost:8000/api/checks/latest/?url=http://opendata-sig.saintdenis.re/datasets/661e19974bcc48849bbff7c9637c5c28_1.csv" | json_pp
{
   "status" : 200,
   "catalog_id" : 64148,
   "deleted" : false,
   "error" : null,
   "created_at" : "2021-02-06T12:19:08.203055",
   "response_time" : 0.830198049545288,
   "url" : "http://opendata-sig.saintdenis.re/datasets/661e19974bcc48849bbff7c9637c5c28_1.csv",
   "domain" : "opendata-sig.saintdenis.re",
   "timeout" : false,
   "id" : 114750,
   "dataset_id" : "5c34944606e3e73d4a551889",
   "resource_id" : "b3678c59-5b35-43ad-9379-fce29e5b56fe",
   "headers" : {
      "content-disposition" : "attachment; filename=\"xn--Dlimitation_des_cantons-bcc.csv\"",
      "server" : "openresty",
      "x-amz-meta-cachetime" : "191",
      "last-modified" : "Wed, 29 Apr 2020 02:19:04 GMT",
      "content-encoding" : "gzip",
      "content-type" : "text/csv",
      "cache-control" : "must-revalidate",
      "etag" : "\"20415964703d9ccc4815d7126aa3a6d8\"",
      "content-length" : "207",
      "date" : "Sat, 06 Feb 2021 12:19:08 GMT",
      "x-amz-meta-contentlastmodified" : "2018-11-19T09:38:28.490Z",
      "connection" : "keep-alive",
      "vary" : "Accept-Encoding"
   }
}

Get all checks for an URL or resource

Works with ?url={url} and ?resource_id={resource_id}.

$ curl -s "http://localhost:8000/api/checks/all/?url=http://www.drees.sante.gouv.fr/IMG/xls/er864.xls" | json_pp
[
   {
      "domain" : "www.drees.sante.gouv.fr",
      "dataset_id" : "53d6eadba3a72954d9dd62f5",
      "timeout" : false,
      "deleted" : false,
      "response_time" : null,
      "error" : "Cannot connect to host www.drees.sante.gouv.fr:443 ssl:True [SSLCertVerificationError: (1, \"[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.drees.sante.gouv.fr'. (_ssl.c:1122)\")]",
      "catalog_id" : 232112,
      "url" : "http://www.drees.sante.gouv.fr/IMG/xls/er864.xls",
      "headers" : {},
      "id" : 165107,
      "created_at" : "2021-02-06T14:32:47.675854",
      "resource_id" : "93dfd449-9d26-4bb0-a6a9-ee49b1b8a4d7",
      "status" : null
   },
   {
      "timeout" : false,
      "deleted" : false,
      "response_time" : null,
      "error" : "Cannot connect to host www.drees.sante.gouv.fr:443 ssl:True [SSLCertVerificationError: (1, \"[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.drees.sante.gouv.fr'. (_ssl.c:1122)\")]",
      "domain" : "www.drees.sante.gouv.fr",
      "dataset_id" : "53d6eadba3a72954d9dd62f5",
      "created_at" : "2020-12-24T17:06:58.158125",
      "resource_id" : "93dfd449-9d26-4bb0-a6a9-ee49b1b8a4d7",
      "status" : null,
      "catalog_id" : 232112,
      "url" : "http://www.drees.sante.gouv.fr/IMG/xls/er864.xls",
      "headers" : {},
      "id" : 65092
   }
]

Get crawling status

$ curl -s "http://localhost:8000/api/status/crawler/" | json_pp
{
   "fresh_checks_percentage" : 0.4,
   "pending_checks" : 142153,
   "total" : 142687,
   "fresh_checks" : 534,
   "checks_percentage" : 0.4
}

Get worker status

$ curl -s "http://localhost:8000/api/status/worker/" | json_pp
{
   "queued" : {
      "default" : 0,
      "high" : 825,
      "low" : 655
   }
}

Get crawling stats

$ curl -s "http://localhost:8000/api/stats/" | json_pp
{
   "status" : [
      {
         "count" : 525,
         "percentage" : 98.3,
         "label" : "ok"
      },
      {
         "label" : "error",
         "percentage" : 1.3,
         "count" : 7
      },
      {
         "label" : "timeout",
         "percentage" : 0.4,
         "count" : 2
      }
   ],
   "status_codes" : [
      {
         "code" : 200,
         "count" : 413,
         "percentage" : 78.7
      },
      {
         "code" : 501,
         "percentage" : 12.4,
         "count" : 65
      },
      {
         "percentage" : 6.1,
         "count" : 32,
         "code" : 404
      },
      {
         "code" : 500,
         "percentage" : 2.7,
         "count" : 14
      },
      {
         "code" : 502,
         "count" : 1,
         "percentage" : 0.2
      }
   ]
}

Using Webhook integration

** Set the config values**

Create a config.toml where your service and commands are launched, or specify a path to a TOML file via the HYDRA_SETTINGS environment variable. config.toml or equivalent will override values from udata_hydra/config_default.toml, lookup there for values that can/need to be defined.

UDATA_URI = "https://dev.local:7000/api/2"
UDATA_URI_API_KEY = "example.api.key"
SENTRY_DSN = "https://{my-sentry-dsn}"

The webhook integration sends HTTP messages to udata when resources are analyzed or checked to fill resources extras.

Regarding analysis, there is a phase called "change detection". It will try to guess if a resource has been modified based on different criterions:

  • harvest modified date in catalog
  • content-length and last-modified headers
  • checksum comparison over time

The payload should look something like:

{
   "analysis:filesize": 91661,
   "analysis:mime-type": "application/zip",
   "analysis:checksum": "bef1de04601dedaf2d127418759b16915ba083be",
   "analysis:last-modified-at": "2022-11-27T23:00:54.762000",
   "analysis:last-modified-detection": "harvest-resource-metadata",
}

Development

docker-compose

Multiple docker-compose files are provided:

  • a minimal docker-compose.yml with PostgreSQL
  • docker-compose.broker.yml adds a Redis broker
  • docker-compose.test.yml launches a test DB, needed to run tests

NB: you can launch compose from multiple files like this: docker-compose -f docker-compose.yml -f docker-compose.test.yml up

Logging & Debugging

The log level can be adjusted using the environment variable LOG_LEVEL. For example, to set the log level to DEBUG when initializing the database, use LOG_LEVEL="DEBUG" udata-hydra init_db .

Writing a migration

  1. Add a file named migrations/{YYYYMMDD}_{from}_up_{to}.sql and write the SQL you need to perform migration. from should be the revision from before (eg rev1), to the revision you're aiming at (eg rev2)
  2. Modify the latest revision (eg rev2) in migrations/_LATEST_REVISION
  3. udata-hydra migrate will use the info from _LATEST_REVISION to upgrade to rev2. You can also specify udata-hydra migrate --revision rev2

Deployment

3 services need to be deployed for the full stack to run:

  • worker
  • api / app
  • crawler

Refer to each section to learn how to launch them. The only differences from dev to prod are:

  • use HYDRA_SETTINGS env var to point to your custom config.toml
  • use HYDRA_APP_SOCKET_PATH to configure where aiohttp should listen to a reverse proxy connection (eg nginx) and use udata-hydra-app to launch the app server

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

udata_hydra-1.0.0rc1703.tar.gz (23.1 kB view details)

Uploaded Source

Built Distribution

udata_hydra-1.0.0rc1703-py3-none-any.whl (24.5 kB view details)

Uploaded Python 3

File details

Details for the file udata_hydra-1.0.0rc1703.tar.gz.

File metadata

  • Download URL: udata_hydra-1.0.0rc1703.tar.gz
  • Upload date:
  • Size: 23.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.2.2 CPython/3.9.15 Linux/5.15.0-1021-aws

File hashes

Hashes for udata_hydra-1.0.0rc1703.tar.gz
Algorithm Hash digest
SHA256 9b41bcaba7a594a46fa5277bdf0577e5dfa3e84821c5382b737934ad9fa766ea
MD5 81f68fe4945d91a6addf3e34c376fbb1
BLAKE2b-256 975e22d62592587295d694969cd95fbfa73f506dd28badc07e00e77eada0a7ba

See more details on using hashes here.

Provenance

File details

Details for the file udata_hydra-1.0.0rc1703-py3-none-any.whl.

File metadata

  • Download URL: udata_hydra-1.0.0rc1703-py3-none-any.whl
  • Upload date:
  • Size: 24.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.2.2 CPython/3.9.15 Linux/5.15.0-1021-aws

File hashes

Hashes for udata_hydra-1.0.0rc1703-py3-none-any.whl
Algorithm Hash digest
SHA256 feeff68acb80e97a0309bda951cc3f1c28d7f8a355f93a411229b30ec585ea5d
MD5 2045569a83941ccb0f054af56f498351
BLAKE2b-256 403c1aaa5233819c0cecbdb63fad99a4e185cfb212751a68da61c4d8e1ddd674

See more details on using hashes here.

Provenance

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page