Accessioning tool to submit genomics pipeline outputs to the ENCODE Portal

accession

Python module and command-line tool for submitting genomics pipeline analysis output files and metadata to the ENCODE Portal.

Installation

Install the module with pip:

$ pip install accession

Setting environment variables

You will need ENCODE DCC credentials from the ENCODE Portal. Set them in your shell:

$ export DCC_API_KEY=XXXXXXXX
$ export DCC_SECRET_KEY=yyyyyyyyyyy

You will also need Google Application Credentials in your environment. Obtain and set your service account credentials:

$ export GOOGLE_APPLICATION_CREDENTIALS=<path_to_service_account_file>
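
Before running the tool, it can help to confirm that all three variables are visible to Python. The snippet below is a minimal sketch, not part of the accession package: the helper name missing_credentials is hypothetical, and only the variable names come from the export commands above.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = (
    "DCC_API_KEY",
    "DCC_SECRET_KEY",
    "GOOGLE_APPLICATION_CREDENTIALS",
)


def missing_credentials(environ=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; it only inspects the
    environment and does not contact the ENCODE Portal or Google.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All credentials are set.")
```

An empty list means every variable is set. The check does not verify that the keys are valid or that the service account file exists.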

Usage

$ accession --accession-metadata metadata.json \
            --accession-steps steps.json \
            --server dev \
            --lab /labs/encode-processing-pipeline/ \
            --award U41HG007000
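
If you drive the tool from a Python script, the same invocation can be assembled as an argument list. This is a sketch that uses only the flags shown above; the function name build_accession_command is hypothetical, and the command is printed rather than executed.

```python
import shlex


def build_accession_command(metadata, steps, server, lab, award):
    """Assemble the argument list for the accession command line tool.

    Hypothetical wrapper; the flags are the ones shown in the usage
    example above.
    """
    return [
        "accession",
        "--accession-metadata", metadata,
        "--accession-steps", steps,
        "--server", server,
        "--lab", lab,
        "--award", award,
    ]


command = build_accession_command(
    metadata="metadata.json",
    steps="steps.json",
    server="dev",
    lab="/labs/encode-processing-pipeline/",
    award="U41HG007000",
)

# Print the command for review. To run it, pass the list to
# subprocess.run(command, check=True).
print(shlex.join(command))
```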

Arguments

Metadata JSON

This file is an output of a pipeline analysis run. The example file lists all of the tasks and the files they produced.

Accession Steps

The accessioning steps configuration file specifies the task and file names in the output metadata JSON and the order in which the files and metadata will be submitted. The accessioning code submits only the specified files to the ENCODE Portal. A single step is configured in the following way:

{
        "dcc_step_version":     "/analysis-step-versions/kundaje-lab-atac-seq-trim-align-filter-step-v-1-0/",
        "dcc_step_run":         "atac-seq-trim-align-filter-step-run-v1",
        "wdl_task_name":        "filter",
        "wdl_files":            [
            {
                "filekey":                  "nodup_bam",
                "output_type":              "alignments",
                "file_format":              "bam",
                "quality_metrics":          ["cross_correlation", "samtools_flagstat"],
                "derived_from_files":       [{
                    "derived_from_task":        "trim_adapter",
                    "derived_from_filekey":     "fastqs",
                    "derived_from_inputs":      "true"
                }]
            }
        ]
}

dcc_step_version and dcc_step_run must exist on the portal.

wdl_task_name is the name of the task that has the files to be accessioned.

wdl_files specifies the set of files to be accessioned.

filekey is a variable that stores the file path in the metadata file.

output_type, file_format, and file_format_type are ENCODE-specific metadata that are required by the Portal.

quality_metrics is a list of methods that will be called during accessioning to attach quality metrics to the file.

possible_duplicate indicates that there could be files with identical content. If the possible_duplicate flag is set and the file being accessioned has an md5sum identical to that of another file in the same task, the current file will not be accessioned. Optimal IDR peaks and conservative IDR peaks are an example of files that can have an identical md5sum.
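
The duplicate rule can be sketched as follows. This is an illustration of the comparison described above, not the package's implementation; md5sum and should_skip are hypothetical helper names, and sibling_md5s stands for the checksums of the other files in the same task.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def should_skip(current_md5, sibling_md5s, possible_duplicate):
    """Return True when the current file should not be accessioned.

    Assumption: a file is skipped only if its step entry sets
    possible_duplicate and its checksum matches another file in the
    same task.
    """
    return bool(possible_duplicate) and current_md5 in sibling_md5s
```

For example, if the optimal and conservative IDR peak files have the same bytes, should_skip returns True for the second one when possible_duplicate is set, and False when it is not.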

derived_from_files specifies the list of files the current file being accessioned derives from. The parent files must have been accessioned before the current file can be submitted.

derived_from_inputs is used when indicating that the parent files were not produced during the pipeline analysis. Instead, these files are initial inputs to the pipeline. Raw fastqs and genome references are examples of such files.

derived_from_output_type is required when the parent file has a possible duplicate.
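
A quick structural check of a step before submission can catch missing keys early. The sketch below is not part of the accession package. The set of keys treated as mandatory is an assumption based on the example step above; file_format_type is left out of the check because the example omits it.

```python
# Keys assumed mandatory, based on the example step above.
STEP_KEYS = ("dcc_step_version", "dcc_step_run", "wdl_task_name", "wdl_files")
FILE_KEYS = ("filekey", "output_type", "file_format")
PARENT_KEYS = ("derived_from_task", "derived_from_filekey")


def check_step(step):
    """Return a list of human-readable problems found in one step."""
    problems = [f"step is missing '{key}'" for key in STEP_KEYS if key not in step]
    for i, wdl_file in enumerate(step.get("wdl_files", [])):
        problems += [
            f"wdl_files[{i}] is missing '{key}'"
            for key in FILE_KEYS
            if key not in wdl_file
        ]
        for j, parent in enumerate(wdl_file.get("derived_from_files", [])):
            problems += [
                f"wdl_files[{i}].derived_from_files[{j}] is missing '{key}'"
                for key in PARENT_KEYS
                if key not in parent
            ]
    return problems


# The example step from above, written as a Python dict.
example_step = {
    "dcc_step_version": "/analysis-step-versions/kundaje-lab-atac-seq-trim-align-filter-step-v-1-0/",
    "dcc_step_run": "atac-seq-trim-align-filter-step-run-v1",
    "wdl_task_name": "filter",
    "wdl_files": [
        {
            "filekey": "nodup_bam",
            "output_type": "alignments",
            "file_format": "bam",
            "quality_metrics": ["cross_correlation", "samtools_flagstat"],
            "derived_from_files": [
                {
                    "derived_from_task": "trim_adapter",
                    "derived_from_filekey": "fastqs",
                    "derived_from_inputs": "true",
                }
            ],
        }
    ],
}

print(check_step(example_step))
```

The check is purely local. It cannot confirm that dcc_step_version and dcc_step_run exist on the Portal.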

Server

prod and dev indicate the server to which the files are accessioned. dev points to test.encodedcc.org. The server parameter can also be passed explicitly as test.encodedcc.org or encodeproject.org.
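
The alias handling can be pictured as a simple lookup. This is a sketch, not the package's code: resolve_server is a hypothetical name, and mapping prod to encodeproject.org is inferred from the sentence above rather than stated outright.

```python
# dev -> test.encodedcc.org is stated above; prod -> encodeproject.org
# is an inference from the list of accepted explicit hosts.
SERVER_ALIASES = {
    "dev": "test.encodedcc.org",
    "prod": "encodeproject.org",
}


def resolve_server(server):
    """Map a short alias to a host name; pass explicit hosts through."""
    return SERVER_ALIASES.get(server, server)


print(resolve_server("dev"))
```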

Lab and Award

These are unique identifiers that must already exist on the ENCODE Portal.


