DCAT to Intake Catalog translation layer

Project description

intake-dcat

This is an intake data source for DCAT catalogs.

DCAT catalogs are a standardized format for describing metadata and access information for public datasets, as described here. Many Socrata and ESRI data portals publish data.json files in this format describing their catalogs. Two examples of these can be found at:

https://data.lacity.org/data.json

http://geohub.lacity.org/data.json

This project provides an opinionated way for users to load datasets from these catalogs into the scientific Python ecosystem. At the moment it loads CSVs into Pandas dataframes, and GeoJSON files and ESRI Shapefiles into GeoDataFrames. Future formats could include plain JSON and Parquet.
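To make the format-to-loader idea concrete, here is a minimal sketch of choosing which distribution of a dataset to load. The sample record follows the usual data.json layout (a `dataset` list whose items carry `distribution` entries with `mediaType` and `downloadURL`). The `pick_distribution` helper, the media type strings, and the preference order are assumptions for illustration, not the package's actual selection logic.

```python
# Hypothetical sketch: choose which distribution of a DCAT dataset to load.
# The media types and their order are assumptions, not intake-dcat's real logic.
PREFERRED_MEDIA_TYPES = [
    "application/vnd.geo+json",  # GeoJSON -> GeoDataFrame
    "application/zip",           # zipped Shapefile -> GeoDataFrame
    "text/csv",                  # CSV -> Pandas dataframe
]


def pick_distribution(dataset):
    """Return the first distribution matching the preference order, or None."""
    distributions = dataset.get("distribution", [])
    for media_type in PREFERRED_MEDIA_TYPES:
        for dist in distributions:
            if dist.get("mediaType") == media_type:
                return dist
    return None


# A made-up record in the shape of one item of a data.json "dataset" list.
sample_dataset = {
    "identifier": "example-id",
    "title": "Example dataset",
    "distribution": [
        {"mediaType": "text/csv",
         "downloadURL": "https://example.com/data.csv"},
        {"mediaType": "application/vnd.geo+json",
         "downloadURL": "https://example.com/data.geojson"},
    ],
}

chosen = pick_distribution(sample_dataset)
print(chosen["downloadURL"])
```

Preferring a geospatial distribution over a CSV is only one possible policy; a real catalog may expose other media types that this sketch ignores.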

Requirements

intake >= 0.4.4
intake_geopandas >= 0.2.2
geopandas >= 0.5.0
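If you want to confirm that an environment meets these minimums, a rough check along the following lines may help. It is a sketch only: the `meets_minimum` helper is hypothetical and compares just the leading numeric parts of each version, so a proper version parser would be more robust for pre-releases.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the requirements list above.
MINIMUMS = {
    "intake": "0.4.4",
    "intake_geopandas": "0.2.2",
    "geopandas": "0.5.0",
}


def meets_minimum(installed, minimum):
    """Compare the first three numeric components of two version strings."""
    def as_tuple(text):
        return tuple(int(part) for part in re.findall(r"\d+", text)[:3])
    return as_tuple(installed) >= as_tuple(minimum)


for package, minimum in MINIMUMS.items():
    try:
        installed = version(package)
    except PackageNotFoundError:
        print(f"{package}: not installed (need >= {minimum})")
        continue
    status = "ok" if meets_minimum(installed, minimum) else "too old"
    print(f"{package}: {installed} ({status}, need >= {minimum})")
```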

Installation

intake-dcat is published on PyPI. You can install it by running the following in your terminal:

pip install intake-dcat

You can test the functionality by opening the example notebooks in the examples/ directory.

Usage

The package can be imported using

from intake_dcat import DCATCatalog

Loading a catalog

You can load data from a DCAT catalog by providing the URL to the data.json file:

catalog = DCATCatalog('http://geohub.lacity.org/data.json', name='geohub')
len(list(catalog))

You can display the items in the catalog:

for entry_id, entry in catalog.items():
    display(entry)

If the catalog has too many entries to comfortably print all at once, you can narrow it by searching for a term (e.g. 'district'):

for entry_id, entry in catalog.search('district').items():
    display(entry)
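Another way to avoid printing a large catalog all at once is to look only at the first few entry names. The sketch below assumes nothing beyond the `.items()` iteration shown above; `preview_entries` is a hypothetical helper, and a plain dict stands in for the catalog so the example runs without network access.

```python
from itertools import islice


def preview_entries(catalog, limit=5):
    """Return the names of the first `limit` entries without listing them all."""
    return [entry_id for entry_id, _entry in islice(catalog.items(), limit)]


# Stand-in for a catalog: any mapping with .items() works for this sketch.
fake_catalog = {f"dataset_{i}": object() for i in range(20)}
print(preview_entries(fake_catalog, limit=3))
# ['dataset_0', 'dataset_1', 'dataset_2']
```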

Loading a dataset

Once you have identified a dataset, you can load it into a dataframe using read():

df = entry.read()

This will automatically load that dataset into a Pandas dataframe, or a GeoDataFrame, depending on the source format.

Command Line Interface

intake-dcat provides a small command line interface for some common operations. These are invoked using intake-dcat <subcommand> <options>.

The mirror command

This command loads a manifest file that lists a set of DCAT entries, uploads them to a specified S3 bucket, and outputs a new catalog with identical entries pointing to the bucket.

An example manifest is given by

# Name of the LA open data portal
la-open-data:
  # URL to the open data portal catalog
  url: https://data.lacity.org/data.json
  # The s3 bucket to upload the data to
  bucket_uri: s3://my-bucket
  # A list of data resources to mirror
  items:
    lapd_metrics: https://data.lacity.org/api/views/t6kt-2yic
# Name of the LA GeoHub data portal
la-geohub:
  # URL to the open data portal catalog
  url: http://geohub.lacity.org/data.json
  # The s3 bucket to upload the data to
  bucket_uri: s3://my-bucket
  # A list of data resources to mirror
  items:
    bikeways: http://geohub.lacity.org/datasets/2602345a7a8549518e8e3c873368c1d9_0 
    city_boundary: http://geohub.lacity.org/datasets/09f503229d37414a8e67a7b6ceb9ec43_7
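Before mirroring, it can be useful to sanity-check a manifest. The sketch below works on the nested dict that a YAML parser would produce for the manifest above (only the first portal is reproduced). The `list_mirror_targets` helper and its checks are hypothetical and are not part of the intake-dcat command line tool.

```python
# The first portal of the example manifest, as a parsed nested dict.
manifest = {
    "la-open-data": {
        "url": "https://data.lacity.org/data.json",
        "bucket_uri": "s3://my-bucket",
        "items": {
            "lapd_metrics": "https://data.lacity.org/api/views/t6kt-2yic",
        },
    },
}

REQUIRED_KEYS = ("url", "bucket_uri", "items")


def list_mirror_targets(manifest):
    """Validate each portal block and return (portal, name, identifier) tuples."""
    targets = []
    for portal, config in manifest.items():
        missing = [key for key in REQUIRED_KEYS if key not in config]
        if missing:
            raise ValueError(f"{portal}: missing keys {missing}")
        if not config["bucket_uri"].startswith("s3://"):
            raise ValueError(f"{portal}: bucket_uri should start with s3://")
        for name, identifier in config["items"].items():
            targets.append((portal, name, identifier))
    return targets


for portal, name, identifier in list_mirror_targets(manifest):
    print(f"{portal}: {name} <- {identifier}")
```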

This can be mirrored using the command

intake-dcat mirror manifest.yml > new-catalog.yml

This command uses the boto3 library and assumes it can find AWS credentials. For more information see this documentation.
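As a quick pre-flight check, the sketch below looks in two of the common places boto3 searches for credentials: the `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables and the shared credentials file. It is not exhaustive (boto3 also supports other sources, such as instance roles), and `credential_hints` is a hypothetical helper that does not call boto3 itself.

```python
import os
from pathlib import Path


def credential_hints(env=None, home=None):
    """List which common credential sources appear to be present."""
    env = os.environ if env is None else env
    home = Path(os.path.expanduser("~")) if home is None else Path(home)
    hints = []
    if "AWS_ACCESS_KEY_ID" in env and "AWS_SECRET_ACCESS_KEY" in env:
        hints.append("environment variables")
    # The shared credentials file location can be overridden by an env var.
    credentials_file = Path(
        env.get("AWS_SHARED_CREDENTIALS_FILE", home / ".aws" / "credentials")
    )
    if credentials_file.is_file():
        hints.append("shared credentials file")
    return hints


found = credential_hints()
print(found if found else "No credentials found in the checked locations")
```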

The create command

This command creates a new intake catalog from a DCAT catalog and writes it to standard output. An example command is given by

intake-dcat create data.lacity.org/data.json > catalog.yml

