DCAT to Intake Catalog translation layer
Project description
intake-dcat
This is an intake data source for DCAT (Data Catalog Vocabulary) catalogs.
These catalogs use a standardized format for describing metadata and access information
for public datasets.
Many Socrata and ESRI data portals publish data.json files in this format describing their catalogs.
Two examples of these can be found at
https://data.lacity.org/data.json
http://geohub.lacity.org/data.json
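To make the shape of a data.json file concrete, here is a minimal sketch that walks a tiny, hypothetical catalog using only the standard library. The sample document and the `list_distributions` helper are illustrative assumptions, not part of intake-dcat; the field names (`dataset`, `title`, `distribution`, `mediaType`, `downloadURL`) are the ones these catalogs commonly use.

```python
import json

# Hypothetical, heavily trimmed data.json document (not copied from a real portal).
SAMPLE_DATA_JSON = """
{
  "dataset": [
    {
      "identifier": "https://example.org/datasets/abc",
      "title": "Council Districts",
      "distribution": [
        {"mediaType": "text/csv",
         "downloadURL": "https://example.org/abc.csv"},
        {"mediaType": "application/vnd.geo+json",
         "downloadURL": "https://example.org/abc.geojson"}
      ]
    },
    {
      "identifier": "https://example.org/datasets/def",
      "title": "Street Trees",
      "distribution": []
    }
  ]
}
"""


def list_distributions(catalog_text):
    """Return (title, mediaType, downloadURL) for every downloadable distribution."""
    catalog = json.loads(catalog_text)
    rows = []
    for dataset in catalog.get("dataset", []):
        for dist in dataset.get("distribution", []):
            url = dist.get("downloadURL")
            if url:  # skip distributions that have no direct download link
                rows.append((dataset.get("title"), dist.get("mediaType"), url))
    return rows


rows = list_distributions(SAMPLE_DATA_JSON)
for row in rows:
    print(row)
```

Each dataset can carry several distributions of the same data in different formats, which is why a loader has to pick one of them.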
This project provides an opinionated way for users to load datasets from these catalogs into the scientific Python ecosystem. At the moment it loads CSVs into Pandas dataframes, and GeoJSON files and ESRI Shapefiles into GeoDataFrames. Future formats could include plain JSON and Parquet.
Requirements
intake >= 0.4.4
intake_geopandas >= 0.2.2
geopandas >= 0.5.0
Installation
intake-dcat is published on PyPI.
You can install it by running the following in your terminal:
pip install intake-dcat
You can test the functionality by opening the example notebooks in the examples/ directory.
Usage
The package can be imported using:
from intake_dcat import DCATCatalog
Loading a catalog
You can load data from a DCAT catalog by providing the URL to the data.json
file:
catalog = DCATCatalog('http://geohub.lacity.org/data.json', name='geohub')
len(list(catalog))
You can display the items in the catalog:
for entry_id, entry in catalog.items():
    display(entry)
If the catalog has too many entries to comfortably print all at once, you can narrow it by searching for a term (e.g. 'district'):
for entry_id, entry in catalog.search('district').items():
    display(entry)
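As a rough mental model of what narrowing by a term does, the following sketch filters a plain dictionary of entries with a case-insensitive substring match. It is an assumption-laden illustration, not intake's actual search implementation; the `search_entries` helper and the entry names and descriptions are hypothetical.

```python
def search_entries(entries, term):
    """Keep entries whose name or description contains `term`, ignoring case."""
    needle = term.lower()
    return {
        name: description
        for name, description in entries.items()
        if needle in name.lower() or needle in description.lower()
    }


# Hypothetical entry names and descriptions, for illustration only.
entries = {
    "council_districts": "Boundaries of city council districts",
    "bikeways": "Existing and planned bike lanes",
    "schools": "Public schools by school district",
}

matches = search_entries(entries, "district")
print(sorted(matches))
```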
Loading a dataset
Once you have identified a dataset, you can load it into a dataframe using read():
df = entry.read()
This will automatically load that dataset into a Pandas dataframe, or a GeoDataFrame, depending on the source format.
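The choice of container can be pictured as a lookup keyed on the source format. The sketch below is only an illustration of that idea under stated assumptions: the media-type strings and the `expected_container` helper are hypothetical, and the package's own dispatch logic may differ.

```python
# Assumed mapping from a distribution's media type to the container it loads into.
# The media-type strings are illustrative; real catalogs may label formats differently.
CONTAINER_BY_MEDIA_TYPE = {
    "text/csv": "pandas.DataFrame",
    "application/vnd.geo+json": "geopandas.GeoDataFrame",
    "application/geo+json": "geopandas.GeoDataFrame",
    "application/zip": "geopandas.GeoDataFrame",  # e.g. a zipped ESRI Shapefile
}


def expected_container(media_type):
    """Name the container a distribution would load into, or None if unsupported."""
    return CONTAINER_BY_MEDIA_TYPE.get(media_type.lower())


print(expected_container("text/csv"))
print(expected_container("application/pdf"))
```

In practice you can simply check `type(df)` after calling `read()` to see which container you received.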
Command Line Interface
intake-dcat provides a small command line interface for some common operations.
These are invoked using intake-dcat <subcommand> <options>.
The mirror command
This command loads a manifest file that lists a set of DCAT entries, uploads them to a specified S3 bucket, and outputs a new catalog with identical entries pointing to the bucket.
An example manifest is given by:
# Name of the LA open data portal
la-open-data:
  # URL to the open data portal catalog
  url: https://data.lacity.org/data.json
  # The s3 bucket to upload the data to
  bucket_uri: s3://my-bucket
  # A list of data resources to mirror
  items:
    lapd_metrics: https://data.lacity.org/api/views/t6kt-2yic
# Name of the LA GeoHub data portal
la-geohub:
  # URL to the open data portal catalog
  url: http://geohub.lacity.org/data.json
  # The s3 bucket to upload the data to
  bucket_uri: s3://my-bucket
  # A list of data resources to mirror
  items:
    bikeways: http://geohub.lacity.org/datasets/2602345a7a8549518e8e3c873368c1d9_0
    city_boundary: http://geohub.lacity.org/datasets/09f503229d37414a8e67a7b6ceb9ec43_7
This can be mirrored using the command:
intake-dcat mirror manifest.yml > new-catalog.yml
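Before running the mirror, it can help to sanity-check the manifest's shape. The sketch below works on the nested mapping a YAML parser would return for the first manifest entry. The `check_manifest` helper and its list of required keys are assumptions inferred from the example manifest, not the tool's own validation.

```python
# The first manifest entry, written as the mapping a YAML parser would return.
manifest = {
    "la-open-data": {
        "url": "https://data.lacity.org/data.json",
        "bucket_uri": "s3://my-bucket",
        "items": {
            "lapd_metrics": "https://data.lacity.org/api/views/t6kt-2yic",
        },
    },
}

# Keys every entry in the example manifest carries (an inference, not a spec).
REQUIRED_KEYS = ("url", "bucket_uri", "items")


def check_manifest(manifest):
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    for name, entry in manifest.items():
        for key in REQUIRED_KEYS:
            if key not in entry:
                problems.append(f"{name}: missing '{key}'")
        bucket = entry.get("bucket_uri", "")
        if bucket and not bucket.startswith("s3://"):
            problems.append(f"{name}: bucket_uri should start with 's3://'")
        if "items" in entry and not entry["items"]:
            problems.append(f"{name}: no items to mirror")
    return problems


print(check_manifest(manifest))
```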
This command uses the boto3 library and assumes it can find AWS credentials.
For more information, see the boto3 documentation on configuring credentials.
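If the mirror fails with a credentials error, a quick check of the two most common credential locations can narrow things down. This is a minimal standard-library sketch, and the `find_credential_hints` helper is hypothetical. boto3 itself searches more places than these two (for example named profiles and instance roles), so an empty result here does not prove that credentials are missing.

```python
import os
from pathlib import Path


def find_credential_hints(environ=None, home=None):
    """Report which common AWS credential locations appear to be populated."""
    environ = os.environ if environ is None else environ
    home = Path.home() if home is None else Path(home)
    hints = []
    # Access key pair supplied through environment variables.
    if environ.get("AWS_ACCESS_KEY_ID") and environ.get("AWS_SECRET_ACCESS_KEY"):
        hints.append("environment variables")
    # Default location of the shared credentials file.
    if (home / ".aws" / "credentials").is_file():
        hints.append("shared credentials file")
    return hints


print(find_credential_hints())
```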
The create command
This command creates a new intake catalog from a DCAT catalog and writes it to standard output. An example command is given by:
intake-dcat create data.lacity.org/data.json > catalog.yml
Project details
Download files
File details
Details for the file intake-dcat-0.2.3.tar.gz.
File metadata
- Download URL: intake-dcat-0.2.3.tar.gz
- Upload date:
- Size: 11.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 386be34f0357cdaae5a24bfd0628720397118a56036b5bde1a7ee4520f50934f
MD5 | ac847074cee6205a49ce17cdb07d28a8
BLAKE2b-256 | aa0c694a5ec18866617735c94dad7a96eaddd14d4ad04d37eb54e9ee4db1966a
File details
Details for the file intake_dcat-0.2.3-py3-none-any.whl.
File metadata
- Download URL: intake_dcat-0.2.3-py3-none-any.whl
- Upload date:
- Size: 13.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9fbe229b5e33228107b2049d725884fecce5c55913179bb19899afe5ac7bea3a
MD5 | 9f9243a3f3c9f64b12d46482cca1016a
BLAKE2b-256 | a9a72ac91601a3822ff3559f282de16d66c9a043c7d0eb3a1a3aaabf7a37e35c