cgp-dss-data-loader

Simple data loader for CGP HCA Data Store

Common Setup

  1. (optional) We recommend using a Python 3 virtual environment.

  2. Run:

    pip3 install cgp-dss-data-loader

Setup for Development

  1. Clone the repo:

    git clone https://github.com/DataBiosphere/cgp-dss-data-loader.git

  2. Go to the root directory of the cloned project:

    cd cgp-dss-data-loader

  3. Make sure you are on the develop branch.

  4. Run (ideally in a new virtual environment):

    make develop

Cloud Credentials Setup

Because this program uses Amazon Web Services and Google Cloud Platform, you will need to set up credentials for both of these before you can run the program.

AWS credentials

  1. If you haven't already, create an IAM user and a new access key for it. Instructions are in the AWS IAM documentation.

  2. Next, store your credentials where Boto can find them. Instructions are in the Boto credentials documentation.

GCP credentials

  1. Follow the steps in the Google Cloud authentication documentation to set up your Google credentials.
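
Before running the loader, it can help to confirm that both sets of credentials are discoverable. The following is a minimal sketch, not part of this package: the function name `missing_credentials` is hypothetical, and it only checks the most common locations (the `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables, the shared AWS credentials file, and a key file referenced by `GOOGLE_APPLICATION_CREDENTIALS`). Other mechanisms, such as instance roles or gcloud application-default credentials, are assumed not to be in use and are not detected.

```python
import os
from pathlib import Path


def missing_credentials(env=None, home=None):
    """Return a list of problems; an empty list means both credential sets were found.

    Hypothetical helper: it checks only common credential locations.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    problems = []

    # AWS: look for the standard environment variables or the shared credentials file.
    has_aws_env = "AWS_ACCESS_KEY_ID" in env and "AWS_SECRET_ACCESS_KEY" in env
    aws_file = Path(env.get("AWS_SHARED_CREDENTIALS_FILE", home / ".aws" / "credentials"))
    if not (has_aws_env or aws_file.is_file()):
        problems.append("AWS credentials not found in the environment or in %s" % aws_file)

    # GCP: assumes a key file referenced by GOOGLE_APPLICATION_CREDENTIALS.
    gcp_key = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not gcp_key or not Path(gcp_key).is_file():
        problems.append("GOOGLE_APPLICATION_CREDENTIALS is unset or does not point to a file")

    return problems
```

Calling `missing_credentials()` with no arguments inspects the current environment and home directory; printing the returned list shows what still needs to be set up.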

Running Tests

Run:

    make test

Getting Data from Gen3 and Loading it

  1. First, extract the Gen3 data you want using the sheepdog exporter. The TopMed public data extracted from sheepdog is available on the release page under Assets. If you use this data, you will have a file called topmed-public.json.

  2. Make sure you are running the virtual environment you set up in the Setup instructions.

  3. Next, transform the data into the 'standard' loader format using the newt-transformer. Follow its common setup, then its section on transforming data from sheepdog.

  4. Now that you have the transformed output, you can run it through the loader.

    If you used the standard transformer, use the command:

    dssload --no-dry-run --dss-endpoint MY_DSS_ENDPOINT --staging-bucket NAME_OF_MY_S3_BUCKET transformed-topmed-public.json
    
  5. You did it!
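
The loading step above can also be scripted. The following is a minimal sketch, assuming only the flags shown in step 4; the function name `build_dssload_command` is hypothetical and not part of this package. It checks that the transformed file parses as JSON before assembling the argument list. It also assumes that omitting `--no-dry-run` gives a dry run, which the flag name suggests but which this page does not state.

```python
import json
from pathlib import Path


def build_dssload_command(input_json, dss_endpoint, staging_bucket, dry_run=True):
    """Assemble the dssload argument list from the flags shown in step 4.

    Hypothetical helper. Raises FileNotFoundError if the input file is missing
    and ValueError if it is not valid JSON.
    """
    path = Path(input_json)
    with path.open() as handle:
        json.load(handle)  # fail early on a truncated or malformed file

    command = ["dssload"]
    if not dry_run:
        # Assumption: without this flag the loader performs a dry run.
        command.append("--no-dry-run")
    command += [
        "--dss-endpoint", dss_endpoint,
        "--staging-bucket", staging_bucket,
        str(path),
    ]
    return command
```

For example, `build_dssload_command("transformed-topmed-public.json", "MY_DSS_ENDPOINT", "NAME_OF_MY_S3_BUCKET", dry_run=False)` returns the same arguments as the command in step 4, ready to pass to `subprocess.run`.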

Download files

Source Distribution

cgp-dss-data-loader-1.1.0.tar.gz (16.1 kB, source)

File details

Details for the file cgp-dss-data-loader-1.1.0.tar.gz.

File metadata

  • Download URL: cgp-dss-data-loader-1.1.0.tar.gz
  • Upload date:
  • Size: 16.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.19.1 setuptools/40.4.3 requests-toolbelt/0.8.0 tqdm/4.26.0 CPython/3.6.3

File hashes

Hashes for cgp-dss-data-loader-1.1.0.tar.gz

    Algorithm     Hash digest
    SHA256        d67433cd93a1bb9336cf7505e9ad8a68ec111994d9ae5388b7bf764d0201a275
    MD5           1010730468564d023cd28a179b78b30b
    BLAKE2b-256   99afc5d322e3f6ff620ab5f8e4de6423e6ff64d9453c87c934b7e8545e4c1c3e
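
If you download the source distribution by hand, you can compare it against the SHA256 digest listed above. The following is a minimal sketch using only the Python standard library; the function names `sha256_of` and `matches_published_hash` are hypothetical, and the expected digest is copied from the hash list above.

```python
import hashlib

# SHA256 digest listed above for cgp-dss-data-loader-1.1.0.tar.gz.
EXPECTED_SHA256 = "d67433cd93a1bb9336cf7505e9ad8a68ec111994d9ae5388b7bf764d0201a275"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected digest."""
    return sha256_of(path) == expected.lower()
```

Calling `matches_published_hash("cgp-dss-data-loader-1.1.0.tar.gz")` returns True only when the downloaded file is byte-for-byte identical to the one the digest was computed from.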
