
Python package wrapping ENCODE epigenomic data for a number of reference cell lines.


How do I install this package?

As usual, just install it using pip:

pip install epigenomic_dataset

Tests Coverage

Since different coverage tools sometimes report slightly different results, here are three of them:

Coveralls Coverage, SonarCloud Coverage, Code Climate Coverage

Preprocessed data for cis-regulatory regions

We have already downloaded the data and computed the maximum window value for each promoter and enhancer region, for the cell lines A549, GM12878, H1, HEK293, HepG2, K562 and MCF-7 in the Fantom dataset and for the cell lines A549, GM12878, H1, HepG2 and K562 in the Roadmap dataset, taking into consideration all the target features listed in the complete table of epigenomes.
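To make "max window value" concrete, here is a minimal sketch of the idea in plain Python. It assumes the window is centred on the region and that the signal is available as one value per base pair; the function name `max_window_value` and the toy numbers are hypothetical and are not part of the package, which reads the signal from bigWig files.

```python
def max_window_value(signal, center, window_size):
    """Return the maximum of `signal` inside a window centred on `center`.

    Hypothetical helper for illustration only: the window is clipped
    to the bounds of the signal.
    """
    half = window_size // 2
    start = max(0, center - half)
    end = min(len(signal), center + half)
    return max(signal[start:end])


# Toy signal with one value per base pair (made-up numbers).
signal = [0.0, 0.5, 2.0, 1.0, 0.0, 3.5, 0.2, 0.1]

# Window of 4 bp centred on position 3 covers positions 1 to 4.
print(max_window_value(signal, center=3, window_size=4))
```

With a window size of 200 or 1000, the same operation would yield one value per region and per epigenomic feature.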

The thresholds used to classify the activation of enhancers and promoters in Fantom are the defaults described in the sister pipeline CRR labels, which handles the download and preprocessing of the data from Fantom and Roadmap.

Dataset   Cell line   Promoters (window size)   Enhancers (window size)
Fantom    A549        200, 1000                 200, 1000
Fantom    GM12878     200, 1000                 200, 1000
Fantom    H1          200, 1000                 200, 1000
Fantom    HEK293      200, 1000                 200, 1000
Fantom    HepG2       200, 1000                 200, 1000
Fantom    K562        200, 1000                 200, 1000
Fantom    MCF-7       200, 1000                 200, 1000
Roadmap   A549        200, 1000                 200, 1000
Roadmap   GM12878     200, 1000                 200, 1000
Roadmap   H1          200, 1000                 200, 1000
Roadmap   HepG2       200, 1000                 200, 1000
Roadmap   K562        200, 1000                 200, 1000

Here are the labels for all the considered cell lines.

Dataset   Promoters (window size)   Enhancers (window size)
Fantom    200, 1000                 200, 1000
Roadmap   200, 1000                 200, 1000

TODO: align promoters and enhancers in a reference labels dataset.

The complete pipeline used to retrieve the CRR epigenomic data is available here.

Automatic retrieval of preprocessed data

You can automatically retrieve the data as follows:

from epigenomic_dataset import load_epigenomes

X, y = load_epigenomes(
    cell_line="K562",
    dataset="fantom",
    regions="promoters",
    window_size=200,
    root="datasets"  # Path where the data is downloaded
)
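If you want to retrieve several of the preprocessed datasets, the combinations listed in the tables above can be enumerated as in the sketch below. Only `"fantom"`, `"promoters"`, `"K562"` and the window size 200 appear in the example call; the values `"roadmap"` and `"enhancers"`, the window size 1000 as an argument, and the exact cell line spellings (for instance `"MCF-7"`) are assumptions inferred from the tables and may need adjusting. The helper `preprocessed_combinations` is hypothetical and not part of the package.

```python
from itertools import product

# Cell lines per dataset, as listed in the table of preprocessed data.
# The lowercase dataset names and the cell line spellings are assumptions.
CELL_LINES = {
    "fantom": ["A549", "GM12878", "H1", "HEK293", "HepG2", "K562", "MCF-7"],
    "roadmap": ["A549", "GM12878", "H1", "HepG2", "K562"],
}
REGIONS = ["promoters", "enhancers"]  # "enhancers" is an assumed value
WINDOW_SIZES = [200, 1000]


def preprocessed_combinations():
    """Yield one dictionary of keyword arguments per preprocessed dataset."""
    for dataset, cell_lines in CELL_LINES.items():
        for cell_line, regions, window_size in product(
            cell_lines, REGIONS, WINDOW_SIZES
        ):
            yield {
                "cell_line": cell_line,
                "dataset": dataset,
                "regions": regions,
                "window_size": window_size,
            }


combinations = list(preprocessed_combinations())
print(len(combinations))

# Each dictionary could then be passed on, for example:
# X, y = load_epigenomes(**combinations[0], root="datasets")
```

Keeping the enumeration separate from the download makes it possible to inspect or filter the combinations before any data is fetched.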

Pipeline for epigenomic data

The raw data considered come from this query from the ENCODE project.

You can find the complete table of the available epigenomes here. These datasets were selected to have (at the time of writing, 07/02/2020) the fewest known problems, such as low read resolution.

You can run the pipeline as follows. Suppose you want to extract the epigenomic features for the cell lines HepG2 and H1:

from epigenomic_dataset import build

build(
    bed_path="path/to/my/bed/file.bed",
    cell_lines=["HepG2", "H1"]
)

If you want to specify where to store the files, use:

from epigenomic_dataset import build

build(
    bed_path="path/to/my/bed/file.bed",
    cell_lines=["HepG2", "H1"],
    path="path/to/my/target"
)

By default, the downloaded bigWig files are not deleted. You can choose to delete the files as follows:

from epigenomic_dataset import build

build(
    bed_path="path/to/my/bed/file.bed",
    cell_lines=["HepG2", "H1"],
    path="path/to/my/target",
    clear_download=True
)
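The `bed_path` argument in the calls above points to a BED file describing the regions of interest. As a minimal sketch, assuming a plain three-column BED file (chromosome, 0-based start, end) is sufficient, such a file could be written as follows. Whether `build` expects additional columns is not stated here, and the helper `write_bed` and the coordinates are hypothetical.

```python
import csv
import os
import tempfile


def write_bed(regions, bed_path):
    """Write (chrom, start, end) tuples to a tab-separated BED file."""
    with open(bed_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for chrom, start, end in regions:
            # BED intervals are 0-based and half-open, so end must exceed start.
            if start < 0 or end <= start:
                raise ValueError(f"Invalid region: {chrom}:{start}-{end}")
            writer.writerow([chrom, start, end])


# Made-up regions, 200 bp wide.
regions = [("chr1", 1000, 1200), ("chr2", 5000, 5200)]

# Write to a temporary directory and read the file back to show its content.
with tempfile.TemporaryDirectory() as tmp:
    bed_path = os.path.join(tmp, "regions.bed")
    write_bed(regions, bed_path)
    with open(bed_path) as handle:
        content = handle.read()

print(content)
```

In practice you would write the file to a permanent location and pass that path as `bed_path`.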

