ucsc-genomes-downloader

Python package to quickly download genomes from the UCSC.

Python package to quickly download and work with genomes from the UCSC.

How do I install this package?

As usual, install it with pip:

pip install ucsc_genomes_downloader

Getting the COVID-19 genome

To download the genome of SARS-CoV-2, the virus that causes COVID-19, just run:

from ucsc_genomes_downloader import Genome
covid = Genome("wuhCor1")

genome = covid["NC_045512v2"]

Usage examples

Simply instantiate a new genome

To download and load into memory the chromosomes of a given genomic assembly you can use the following code snippet:

from ucsc_genomes_downloader import Genome
hg19 = Genome(assembly="hg19")

Downloading selected chromosomes

If you want to download only a subset of chromosomes, pass them through the “chromosomes” parameter:

from ucsc_genomes_downloader import Genome
hg19 = Genome("hg19", chromosomes=["chr1", "chr2"])

Getting gap regions

The gaps method returns a DataFrame in bed-like format that contains the regions where only n or N nucleotides are present.

all_gaps = hg19.gaps() # Returns gaps (regions formed of Ns) for all chromosomes
# Returns gaps for chromosome chrM
chrM_gaps = hg19.gaps(chromosomes=["chrM"])
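
To make the notion of a gap concrete, here is a minimal pure-Python sketch that finds runs of n or N in a toy sequence. It is not the package's implementation: the helper `find_gaps` and the chromosome name `chrToy` are hypothetical, and the coordinates are assumed to be 0-based and half-open, as in the BED format.

```python
import re


def find_gaps(chromosome, sequence):
    """Return (chrom, start, end) for every run of n/N (hypothetical helper).

    Coordinates are 0-based and half-open, following the BED convention.
    """
    return [
        (chromosome, match.start(), match.end())
        for match in re.finditer(r"[nN]+", sequence)
    ]


# Toy sequence with three runs of unknown nucleotides.
toy_gaps = find_gaps("chrToy", "NNacgtnnNNacgtN")
print(toy_gaps)
# [('chrToy', 0, 2), ('chrToy', 6, 10), ('chrToy', 14, 15)]
```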

Getting filled regions

The filled method returns a DataFrame in bed-like format that contains the regions where no unknown nucleotides are present, that is, the complement of the regions returned by the gaps method.

all_filled = hg19.filled() # Returns filled for all chromosomes
# Returns filled for chromosome chrM
chrM_filled = hg19.filled(chromosomes=["chrM"])
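
The relationship between filled regions and gaps can be sketched as an interval complement. The helper `complement_regions` below is hypothetical and only illustrates the idea on toy coordinates; it assumes the gaps are sorted, non-overlapping and expressed as 0-based half-open intervals.

```python
def complement_regions(gaps, length):
    """Return the intervals of [0, length) not covered by the given gaps.

    Hypothetical helper: gaps must be sorted, non-overlapping (start, end)
    pairs in 0-based half-open coordinates.
    """
    filled = []
    cursor = 0
    for start, end in gaps:
        if start > cursor:
            # Everything between the previous gap and this one is filled.
            filled.append((cursor, start))
        cursor = end
    if cursor < length:
        # Trailing filled region after the last gap.
        filled.append((cursor, length))
    return filled


# A toy chromosome of length 15 with three gaps.
toy_filled = complement_regions([(0, 2), (6, 10), (14, 15)], length=15)
print(toy_filled)
# [(2, 6), (10, 14)]
```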

Removing genome’s cache

To delete the cache of the genome, including chromosomes and metadata, you can use the delete method:

hg19.delete()

Genome objects representation

When printed, a Genome object has a human-readable representation. This allows you to print lists of Genome objects as follows:

print([
    hg19,
    hg38,
    mm10
])

# >>> [
#    Human, Homo sapiens, hg19, 2009-02-28, 25 chromosomes,
#    Human, Homo sapiens, hg38, 2013-12-29, 25 chromosomes,
#    Mouse, Mus musculus, mm10, 2011-12-29, 22 chromosomes
# ]
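
Printing a list shows the `repr` of each element, which is why a readable representation matters here. The toy class below is a hypothetical stand-in, not the package's `Genome` class, and its output format is simplified.

```python
class ToyGenome:
    """Hypothetical stand-in for a genome object with a readable repr."""

    def __init__(self, organism, assembly):
        self.organism = organism
        self.assembly = assembly

    def __repr__(self):
        # Lists call repr() on their elements when printed.
        return f"{self.organism}, {self.assembly}"


listing = str([ToyGenome("Human", "hg19"), ToyGenome("Mouse", "mm10")])
print(listing)
# [Human, hg19, Mouse, mm10]
```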

Obtaining the sequences of a given bed file

Given a pandas DataFrame in bed-like format, you can obtain the corresponding genomic sequences for the loaded assembly using the bed_to_sequence method:

import pandas as pd

# Load the bed-like file as a DataFrame, then extract the matching sequences
my_bed = pd.read_csv("path/to/my/file.bed", sep="\t")
sequences = hg19.bed_to_sequence(my_bed)
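
Conceptually, mapping bed regions to sequences amounts to slicing each chromosome string between a start and an end coordinate. The sketch below shows this on toy data; `extract_sequences` is a hypothetical helper rather than the package's code, and it assumes 0-based half-open coordinates as in the BED format.

```python
def extract_sequences(chromosomes, bed_rows):
    """Slice every (chrom, start, end) row out of a dict of chromosome strings.

    Hypothetical helper; coordinates are 0-based and half-open.
    """
    return [chromosomes[chrom][start:end] for chrom, start, end in bed_rows]


toy_chromosomes = {"chrToy": "acgtacgtac"}
toy_bed = [("chrToy", 0, 4), ("chrToy", 6, 10)]
toy_sequences = extract_sequences(toy_chromosomes, toy_bed)
print(toy_sequences)
# ['acgt', 'gtac']
```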

Properties

A Genome object has the following properties:

hg19.assembly # Returns "hg19"
hg19.date # Returns "2009-02-28" as datetime object
hg19.organism # Returns "Human"
hg19.scientific_name # Returns "Homo sapiens"
hg19.description # Returns the brief description as provided from UCSC
hg19.path # Returns path where genome is cached

Utilities

Retrieving a list of the available genomes

You can get a complete list of the genomes available from the UCSC website with the following method:

from ucsc_genomes_downloader.utils import get_available_genomes
all_genomes = get_available_genomes()

Tessellating bed files

Tessellate the regions of a bed-like pandas DataFrame into windows of a given size.

Available alignments are to the left, right or center.

from ucsc_genomes_downloader.utils import tessellate_bed
import pandas as pd

my_bed = pd.read_csv("path/to/my/file.bed", sep="\t")
tessellated = tessellate_bed(
    my_bed,
    window_size=200,
    alignment="left"
)
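
One plausible reading of the alignment option is sketched below for a single region. This is an assumption about the semantics, not the behaviour documented for `tessellate_bed`: the hypothetical helper `tessellate_region` cuts as many full windows as fit and places the unused remainder on the right (left alignment), on the left (right alignment), or split on both sides (center alignment).

```python
def tessellate_region(start, end, window_size, alignment="left"):
    """Split [start, end) into full windows of window_size (hypothetical helper).

    The leftover bases that do not fill a whole window are dropped; the
    alignment decides on which side they are dropped.
    """
    length = end - start
    n_windows = length // window_size
    remainder = length - n_windows * window_size
    if alignment == "left":
        offset = 0
    elif alignment == "right":
        offset = remainder
    elif alignment == "center":
        offset = remainder // 2
    else:
        raise ValueError(f"Unknown alignment: {alignment!r}")
    first = start + offset
    return [
        (first + i * window_size, first + (i + 1) * window_size)
        for i in range(n_windows)
    ]


left_windows = tessellate_region(100, 750, 200, "left")
right_windows = tessellate_region(100, 750, 200, "right")
center_windows = tessellate_region(100, 750, 200, "center")
print(left_windows)    # [(100, 300), (300, 500), (500, 700)]
print(right_windows)   # [(150, 350), (350, 550), (550, 750)]
print(center_windows)  # [(125, 325), (325, 525), (525, 725)]
```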

Expanding bed file regions

Expand a given dataframe in bed-like format using selected alignment.

Available alignments are to the left, right or center.

from ucsc_genomes_downloader.utils import expand_bed_regions
import pandas as pd

my_bed = pd.read_csv("path/to/my/file.bed", sep="\t")
expanded = expand_bed_regions(
    my_bed,
    window_size=1000,
    alignment="left"
)
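
As an illustration of what an alignment could mean when expanding, the sketch below resizes a single region to a fixed window. The semantics are assumed, not taken from `expand_bed_regions`: the hypothetical helper `expand_region` keeps the start fixed for left alignment, the end fixed for right alignment, and the midpoint fixed for center alignment. It does not clip the result to chromosome boundaries.

```python
def expand_region(start, end, window_size, alignment="left"):
    """Resize [start, end) to exactly window_size bases (hypothetical helper)."""
    if alignment == "left":
        # Keep the start, move the end.
        return start, start + window_size
    if alignment == "right":
        # Keep the end, move the start.
        return end - window_size, end
    if alignment == "center":
        # Keep the midpoint, grow on both sides.
        midpoint = (start + end) // 2
        new_start = midpoint - window_size // 2
        return new_start, new_start + window_size
    raise ValueError(f"Unknown alignment: {alignment!r}")


left_expanded = expand_region(5000, 5200, 1000, "left")
right_expanded = expand_region(5000, 5200, 1000, "right")
center_expanded = expand_region(5000, 5200, 1000, "center")
print(left_expanded)    # (5000, 6000)
print(right_expanded)   # (4200, 5200)
print(center_expanded)  # (4600, 5600)
```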

Wiggling bed file regions

Generate new bed regions based on a given bed file by wiggling the initial regions.

from ucsc_genomes_downloader.utils import wiggle_bed_regions
import pandas as pd

my_bed = pd.read_csv("path/to/my/file.bed", sep="\t")
wiggled = wiggle_bed_regions(
    my_bed,
    max_wiggle_size=100, # Maximum amount to wiggle region
    wiggles=10, # Number of wiggled samples to introduce
    seed=42 # Random seed for reproducibility
)
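
The idea of wiggling can be sketched as shifting a region by a random offset several times. The helper `wiggle_region` below is hypothetical and uses only the standard library; it assumes that each wiggle shifts the start and the end by the same amount, drawn uniformly from -max_wiggle_size to max_wiggle_size, which may differ from what `wiggle_bed_regions` actually does.

```python
import random


def wiggle_region(chrom, start, end, max_wiggle_size, wiggles, seed):
    """Return `wiggles` randomly shifted copies of a region (hypothetical helper)."""
    rng = random.Random(seed)  # Local generator, so the seed fixes the output.
    wiggled = []
    for _ in range(wiggles):
        shift = rng.randint(-max_wiggle_size, max_wiggle_size)
        # Shift both ends together so the region keeps its width.
        wiggled.append((chrom, start + shift, end + shift))
    return wiggled


toy_wiggled = wiggle_region(
    "chrToy", 1000, 1200, max_wiggle_size=100, wiggles=10, seed=42
)
print(len(toy_wiggled))  # 10
```

Using a dedicated `random.Random(seed)` instance keeps the result reproducible without touching the global random state.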

This version: 1.1.16 (Mar 19, 2020)
