cgp-dss-data-loader

Simple data loader for CGP HCA Data Store

Common Setup

(Optional) We recommend installing into a Python 3 virtual environment.
Run:

pip3 install cgp-dss-data-loader
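
As a quick check before installing, the following sketch reports whether the current interpreter is running inside a virtual environment. It relies only on the standard library and assumes an environment created with `venv` or a recent `virtualenv`; the function name `in_virtualenv` is illustrative and not part of this package.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```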

Setup for Development

Clone the repo:

git clone https://github.com/DataBiosphere/cgp-dss-data-loader.git
Go to the root directory of the cloned project:

cd cgp-dss-data-loader
Make sure you are on the `develop` branch.
Then run (ideally in a new virtual environment):

make develop

Cloud Credentials Setup

Because this program uses Amazon Web Services and Google Cloud Platform, you will need to set up credentials for both of these before you can run the program.

AWS credentials

If you haven't already, create an IAM user and a new access key for it, following the AWS IAM documentation.
Next, store your credentials where Boto can find them, as described in the Boto credentials documentation.

GCP credentials

Follow Google Cloud's instructions to set up your Google credentials.
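
Once both sets of credentials are in place, a small check like the one below can confirm they sit where the usual client libraries look for them. This is only a sketch: it assumes the default locations (the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` variables or `~/.aws/credentials` for AWS, and `GOOGLE_APPLICATION_CREDENTIALS` or the gcloud application default credentials file for GCP), and the helper name `find_credentials` is hypothetical. It checks for presence only and does not validate the credentials.

```python
import os
from pathlib import Path


def find_credentials(env=None, home=None):
    """Report which standard credential sources appear to be present.

    Only checks that variables are set or files exist; nothing is validated.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Assumed default locations (Linux/macOS layout for the gcloud file).
    aws_file = Path(env.get("AWS_SHARED_CREDENTIALS_FILE", home / ".aws" / "credentials"))
    gcp_key = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    gcp_adc = home / ".config" / "gcloud" / "application_default_credentials.json"

    return {
        "aws_env": "AWS_ACCESS_KEY_ID" in env and "AWS_SECRET_ACCESS_KEY" in env,
        "aws_file": aws_file.is_file(),
        "gcp_env": bool(gcp_key) and Path(gcp_key).is_file(),
        "gcp_adc": gcp_adc.is_file(),
    }


if __name__ == "__main__":
    for source, present in find_credentials().items():
        print(f"{source}: {'found' if present else 'missing'}")
```

At least one AWS source and one GCP source should be reported as found before running the loader.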

Running Tests

Run:

make test

Getting Data from Gen3 and Loading it

The first step is to extract the Gen3 data you want using the sheepdog exporter. The TOPMed public data extracted from sheepdog is available on the release page under Assets. Assuming you use this data, you will now have a file called topmed-public.json.
Make sure the virtual environment you set up in the Setup instructions is active.
Next, transform the data, either to the outdated Gen3 format or to the new standard format.
- For the standard format, follow the instructions at newt-transformer.
- For the outdated Gen3 format, run this from the root of the project:
```
python transformer/gen3_transformer.py /path/to/topmed-public.json --output-json transformed-topmed-public.json
```
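
Before running either transformer, it can help to confirm that the exported file parses as JSON. The sketch below makes no assumption about the sheepdog export schema; it only reports the top-level type and item count. The helper name `summarize_json` is hypothetical.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Parse a JSON file and return (top-level type name, number of top-level items)."""
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)  # raises json.JSONDecodeError on malformed input
    # Scalars have no length, so count them as a single item.
    size = len(data) if isinstance(data, (dict, list)) else 1
    return type(data).__name__, size
```

For example, `summarize_json("topmed-public.json")` returns a tuple such as the type name and the number of top-level entries, or raises an error if the file is truncated or malformed.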

Now that we have the transformed output, we can pass it to the loader.

If you used the standard transformer, run:

dssload --no-dry-run --dss-endpoint MY_DSS_ENDPOINT --staging-bucket NAME_OF_MY_S3_BUCKET standard --json-input-file transformed-topmed-public.json

Otherwise, for the outdated Gen3 format, run:

dssload --no-dry-run --dss-endpoint MY_DSS_ENDPOINT --staging-bucket NAME_OF_MY_S3_BUCKET gen3 --json-input-file transformed-topmed-public.json
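
The two invocations differ only in the format subcommand, so a script that drives the loader could assemble the argument list as in the sketch below. It uses only the flags shown in the commands above; the function name `build_dssload_command` is hypothetical, and the endpoint and bucket values are the same placeholders used in this README.

```python
import shlex


def build_dssload_command(dss_endpoint, staging_bucket, input_format, json_input_file):
    """Assemble the dssload argument list for the given inputs."""
    if input_format not in ("standard", "gen3"):
        raise ValueError(f"unsupported input format: {input_format!r}")
    return [
        "dssload",
        "--no-dry-run",
        "--dss-endpoint", dss_endpoint,
        "--staging-bucket", staging_bucket,
        input_format,
        "--json-input-file", json_input_file,
    ]


if __name__ == "__main__":
    command = build_dssload_command(
        "MY_DSS_ENDPOINT",
        "NAME_OF_MY_S3_BUCKET",
        "standard",
        "transformed-topmed-public.json",
    )
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(part) for part in command))
    # To actually run it: subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a path contains spaces.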

You did it!

Project details

Development Status
- 3 - Alpha
Intended Audience
- Science/Research
License
- OSI Approved :: Apache Software License
Natural Language
- English
Programming Language
- Python :: 3.6
Topic
- Scientific/Engineering :: Bio-Informatics

Release history

1.1.0

Sep 25, 2018

This version

0.1.0

Jul 12, 2018

0.0.2

Jun 29, 2018

0.0.1

Jun 29, 2018

Download files

Source Distribution

cgp-dss-data-loader-0.1.0.tar.gz (14.9 kB)

Uploaded Jul 12, 2018 Source

File details

Details for the file cgp-dss-data-loader-0.1.0.tar.gz.

File metadata

Download URL: cgp-dss-data-loader-0.1.0.tar.gz
Upload date: Jul 12, 2018
Size: 14.9 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for cgp-dss-data-loader-0.1.0.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `4ea4978cf44a8d80725208768c853f06e12c2f79d0950422a0ef6b0b5e5cd6e7` |
| MD5 | `2742e9aa6aee1143eef1a2b374d1f864` |
| BLAKE2b-256 | `ce6e957ca4761a5dfc58bb8fab9361642713f7af7497d7cd13fa667edcf95d3a` |