Downloads and caches files for knowledge graph ETL
KG-Hub Downloader
Overview
This is a configuration-based, file-caching downloader with initial support for HTTP requests and queries against Elasticsearch.
Installation
KG-Hub Downloader is available to install via pip:
pip install kghub-downloader
Configure
The downloader requires a YAML file which contains a list of target URLs to download, and local names to save those downloads.
For an example, see example/download.yaml
Available options are:
- *url: The URL to download from. Currently supported:
  - http(s)
  - ftp
    - with glob: option to download files with specific extensions (only with ftp as of now and looks recursively).
  - Google Cloud Storage (gs://)
  - Google Drive (gdrive:// or https://drive.google.com/...). The file must be publicly accessible.
  - Amazon AWS S3 bucket (s3://)
  - GitHub Release Assets (git://RepositoryOwner/RepositoryName)

  If the URL includes a name in {CURLY_BRACES}, it will be expanded from environment variables.
- glob: An optional glob pattern to limit downloading files (FTP only)
- local_name: The name to save the file as locally
- tag: A tag to use to filter downloads
- api: The API to use to download the file. Currently supported: elasticsearch
- elastic search options
  - query_file: The file containing the query to run against the index
  - index: The elastic search index for query
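As an illustration of how these options fit together, the sketch below writes a small download.yaml from Python. The URLs, file names, and tag values are placeholders invented for this example, and the helper `write_sample_config` is hypothetical; only the keys `url`, `local_name`, and `tag` come from the option list above.

```python
from pathlib import Path

# Placeholder entries: the URLs, file names, and tags are made up for
# illustration. Only the keys (url, local_name, tag) are documented options.
SAMPLE_CONFIG = """\
---
- url: "https://example.com/genes.tsv"
  local_name: genes.tsv
  tag: genes
- url: "https://example.com/phenotypes.tsv"
  local_name: phenotypes.tsv
  tag: phenotypes
"""


def write_sample_config(path: str = "download.yaml") -> Path:
    """Write the sample configuration to `path` and return it as a Path."""
    target = Path(path)
    target.write_text(SAMPLE_CONFIG, encoding="utf-8")
    return target
```

Calling `write_sample_config()` creates download.yaml in the current directory, which is the file name the downloader looks for by default.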
* Note:
Google Cloud Storage URLs require that you have set up your credentials as described here. You must:
- create a service account
- add the service account to the relevant bucket and
- download a JSON key for that service account.
Then, set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to that file.
Mirroring local files to an Amazon AWS S3 bucket requires the following:
- Create an AWS account
- Create an IAM user in AWS: This enables getting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY needed for authentication. These two should be stored as environment variables in the user's system.
- Create an S3 bucket: This will be the destination for pushing local files.
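A quick pre-flight check can catch missing credentials before a long download run starts. The following is a minimal sketch, not part of the downloader itself: `missing_credentials` and the `REQUIRED_ENV` table are hypothetical names, and the mapping from URL scheme to variables is an assumption based on the notes above.

```python
import os
from typing import List, Mapping

# Assumed mapping from URL scheme to the environment variables named in
# the notes above. This table is illustrative, not taken from the library.
REQUIRED_ENV = {
    "gs://": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "s3://": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
}


def missing_credentials(url: str, env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the credential variables that are unset or empty for `url`."""
    for scheme, names in REQUIRED_ENV.items():
        if url.startswith(scheme):
            return [name for name in names if not env.get(name)]
    # Other schemes need no credentials in this sketch.
    return []
```

For example, `missing_credentials("s3://my-bucket/data")` returns an empty list only when both AWS variables are set in the current environment.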
You can also include any secrets like API keys you have set as environment variables using {VARIABLE_NAME}, for example:
---
- url: "https://example.com/myfancyfile.json?key={YOUR_SECRET}"
  local_name: myfancyfile.json
Note: YOUR_SECRET MUST be set as an environment variable, and be sure to include the {curly braces} in the url string.
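To make the substitution concrete, here is a minimal sketch of how a `{NAME}` placeholder could be expanded from the environment. It is an illustration, not the downloader's actual implementation: `expand_placeholders` is a hypothetical helper, and raising an error for an unset variable is an assumption about the desired behavior.

```python
import os
import re
from typing import Mapping

# Matches {NAME}, where NAME looks like an environment variable name.
PLACEHOLDER = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(url: str, env: Mapping[str, str] = os.environ) -> str:
    """Replace each {NAME} in `url` with env[NAME]; raise KeyError if unset."""

    def substitute(match):
        name = match.group(1)
        if name not in env:
            # Assumed behavior for this sketch: fail loudly on a missing variable.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return PLACEHOLDER.sub(substitute, url)
```

With `YOUR_SECRET` exported in the shell, `expand_placeholders("https://example.com/myfancyfile.json?key={YOUR_SECRET}")` returns the URL with the secret filled in.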
Usage
The downloader can be used directly in Python or via the command line.
In Python
from kghub_downloader.download_utils import download_from_yaml
download_from_yaml(yaml_file="download.yaml", output_dir="data")
Command Line
To download files listed in a download.yaml file:
$ downloader [OPTIONS] [YAML_FILE]
Arguments:
[YAML_FILE]: List of files to download in YAML format [default: download.yaml]
Options:
--output-dir TEXT: Path to output directory [default: .]
--ignore-cache / --no-ignore-cache: Ignore already downloaded files and download again [default: no-ignore-cache]
--progress / --no-progress: Show progress for individual downloads [default: progress]
--fail-on-error / --no-fail-on-error: Do not attempt to download more files if one raises an error [default: no-fail-on-error]
--snippet-only / --no-snippet-only: Only download a snippet of the file. HTTP(S) resources only. [default: no-snippet-only]
--verbose / --no-verbose: Show verbose output [default: no-verbose]
--tags TEXT: Optional list of tags to limit downloading to
--mirror TEXT: Optional remote storage URL to mirror download to. Supported buckets: Google Cloud Storage
Examples:
$ downloader --output-dir example_output --tags zfin_gene_to_phenotype example.yaml
$ downloader --output-dir example_output --mirror gs://your-bucket/desired/directory
# Note that if your YAML file is named `download.yaml`,
# the argument can be omitted from the CLI call.
$ downloader --output-dir example_output
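When the CLI is driven from a Python script (for example in a pipeline step), it can help to assemble the argument list in one place. The sketch below uses the flag spellings from the Options list; `build_downloader_command` and `run_downloader` are hypothetical helpers, and passing `--tags` once per tag is an assumption about how the CLI accepts multiple tags.

```python
import subprocess
from typing import List, Optional, Sequence


def build_downloader_command(
    yaml_file: str = "download.yaml",
    output_dir: str = ".",
    tags: Sequence[str] = (),
    mirror: Optional[str] = None,
    ignore_cache: bool = False,
) -> List[str]:
    """Assemble the argv list for the `downloader` CLI."""
    cmd = ["downloader", "--output-dir", output_dir]
    if ignore_cache:
        cmd.append("--ignore-cache")
    for tag in tags:
        # Assumption: the option is repeated once per tag.
        cmd += ["--tags", tag]
    if mirror:
        cmd += ["--mirror", mirror]
    cmd.append(yaml_file)
    return cmd


def run_downloader(**kwargs) -> None:
    """Run the CLI and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_downloader_command(**kwargs), check=True)
```

For instance, `run_downloader(yaml_file="example.yaml", output_dir="example_output", tags=["zfin_gene_to_phenotype"])` corresponds to the first example command above.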
Development
Install
git clone https://github.com/monarch-initiative/kghub-downloader.git
cd kghub-downloader
poetry install
Run tests
poetry run pytest
NOTE: The tests require gcloud credentials to be set up as described above, using the Monarch GitHub Actions service account.