
Python REST API for Entrez E-Utilities: stateless, easy to use, reliable.


easy-entrez


Python REST API for Entrez E-Utilities, aiming to be easy to use and reliable.

Easy-entrez:

  • makes common tasks easy thanks to a simple Pythonic API,
  • is typed and integrates well with mypy,
  • is tested on Windows, Mac and Linux across Python 3.7, 3.8, 3.9 and 3.10,
  • is limited in scope, which keeps the focus on the reliability of the core code,
  • does not use the stateful API, which is error-prone, as seen in the alternative entrezpy.

Examples

from easy_entrez import EntrezAPI

entrez_api = EntrezAPI(
    'your-tool-name',
    'e@mail.com',
    # optional
    return_type='json'
)

# find up to 10 000 results for cancer in human
result = entrez_api.search('cancer AND human[organism]', max_results=10_000)

# data will be populated with JSON or XML (depending on the `return_type` value)
result.data

See more in the Demo notebook and documentation.

For a real-world example (used for this publication), see the notebooks in the multi-omics-state-of-the-field repository.

Fetching genes for a variant from dbSNP

Fetch the SNP record for rs6311:

rs6311 = entrez_api.fetch(['rs6311'], max_results=1, database='snp').data[0]
rs6311

Display the result:

from easy_entrez.parsing import xml_to_string

print(xml_to_string(rs6311))

Find the gene names for rs6311:

namespaces = {'ns0': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'}
genes = [
    name.text
    for name in rs6311.findall('.//ns0:GENE_E/ns0:NAME', namespaces)
]
print(genes)

['HTR2A']
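
The `ns0` prefix in the path above is only a local alias: what matters is that it maps to the namespace URI declared in the XML. The following is a minimal sketch of that lookup using only the standard library. The XML fragment is hand-written and heavily simplified (an assumption for illustration, not a real Entrez response); only the namespace and the `GENE_E`/`NAME` element names come from the example above.

```python
from xml.etree import ElementTree

# Hand-written, simplified fragment for illustration only; a real dbSNP
# document summary contains many more fields.
xml = """
<DocumentSummary xmlns="https://www.ncbi.nlm.nih.gov/SNP/docsum" uid="6311">
  <GENES>
    <GENE_E>
      <NAME>HTR2A</NAME>
    </GENE_E>
  </GENES>
</DocumentSummary>
"""

document_summary = ElementTree.fromstring(xml)
namespaces = {'ns0': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'}

# Without the namespace mapping the path matches nothing,
# because every element lives in the default namespace.
without_namespace = document_summary.findall('.//GENE_E/NAME')

# With the mapping, the prefixed path finds the gene name.
with_namespace = [
    name.text
    for name in document_summary.findall('.//ns0:GENE_E/ns0:NAME', namespaces)
]
print(without_namespace, with_namespace)
```

If a query unexpectedly returns an empty list, a missing or mistyped namespace URI is a likely cause.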

Fetch data for multiple variants at once:

result = entrez_api.fetch(['rs6311', 'rs662138'], max_results=10, database='snp')
gene_names = {
    'rs' + document_summary.get('uid'): [
        element.text
        for element in document_summary.findall('.//ns0:GENE_E/ns0:NAME', namespaces)
    ]
    for document_summary in result.data
}
print(gene_names)

{'rs6311': ['HTR2A'], 'rs662138': ['SLC22A1']}

Obtaining the chromosomal position from the SNP rs ID number

from pandas import DataFrame

result = entrez_api.fetch(['rs6311', 'rs662138'], max_results=10, database='snp')

variant_positions = DataFrame([
    {
        'id': 'rs' + document_summary.get('uid'),
        'chromosome': chromosome,
        'position': position
    }
    for document_summary in result.data
    for chrom_and_position in document_summary.findall('.//ns0:CHRPOS', namespaces)
    for chromosome, position in [chrom_and_position.text.split(':')]
])

variant_positions

         id  chromosome   position
0    rs6311          13   46897343
1  rs662138           6  160143444
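
The last `for` clause in the comprehension above iterates over a one-element list so that both halves of the `chromosome:position` text can be bound to names in a single step. A minimal sketch of the same idiom without pandas or a network call is shown below; the `raw_positions` dictionary is a stand-in built from the values in the table, not something returned by the library.

```python
# Stand-in for the CHRPOS text of each variant, in 'chromosome:position' format.
raw_positions = {'rs6311': '13:46897343', 'rs662138': '6:160143444'}

rows = [
    {'id': rs_id, 'chromosome': chromosome, 'position': int(position)}
    for rs_id, text in raw_positions.items()
    # Iterating over a one-element list unpacks the split result into two names.
    for chromosome, position in [text.split(':')]
]
print(rows)
```

Note that the chromosome is kept as a string here, since values such as `X` or `MT` cannot be converted to integers.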

Converting full variation/mutation data to tabular format

The parsing utilities can quickly extract the data into a VariantSet object holding pandas DataFrames with coordinates and alternative allele frequencies:

from easy_entrez.parsing import parse_dbsnp_variants

variants = parse_dbsnp_variants(result)
variants

<VariantSet with 2 variants>

To get the coordinates:

variants.coordinates

rs_id     ref  alts  chrom  pos        chrom_prev  pos_prev   consequence
rs6311    C    A,T   13     46897343   13          47471478   upstream_transcript_variant,intron_variant,genic_upstream_transcript_variant
rs662138  C    G     6      160143444  6           160564476  intron_variant

For frequencies:

variants.alt_frequencies.head(5)  # using head to only display first 5 for brevity

   rs_id   allele  source_frequency  total_count  study        count
0  rs6311  T       0.44349           2221         1000Genomes  984.991
1  rs6311  T       0.411261          1585         ALSPAC       651.849
2  rs6311  T       0.331696          1486         Estonian     492.9
3  rs6311  T       0.35              14           GENOME_DK    4.9
4  rs6311  T       0.402529          56309        GnomAD       22666
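
In the rows shown above, the `count` column appears to equal `source_frequency` multiplied by `total_count`. This is an observation from the displayed values rather than a documented guarantee; the sketch below simply checks it for those five rows.

```python
# The five rows displayed above: (study, source_frequency, total_count, count).
rows = [
    ('1000Genomes', 0.44349, 2221, 984.991),
    ('ALSPAC', 0.411261, 1585, 651.849),
    ('Estonian', 0.331696, 1486, 492.9),
    ('GENOME_DK', 0.35, 14, 4.9),
    ('GnomAD', 0.402529, 56309, 22666),
]

# Absolute difference between frequency * total and the reported count.
differences = {
    study: abs(frequency * total - count)
    for study, frequency, total, count in rows
}
print(differences)
```

The small residual differences are consistent with the displayed values being rounded.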

Obtaining the SNP rs ID number from chromosomal position

You can use the query string directly:

results = entrez_api.search(
    '13[CHROMOSOME] AND human[ORGANISM] AND 31873085[POSITION]',
    database='snp',
    max_results=10
)
print(results.data['esearchresult']['idlist'])

['59296319', '17076752', '7336701', '4']
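
The identifiers in `idlist` are returned as strings without the `rs` prefix, so they need to be prefixed before being used as rs IDs (as done in the fetch example further down). A minimal sketch, assuming a hand-written JSON payload that contains only the two keys used above (real responses carry more fields):

```python
import json

# Hand-written stand-in for a JSON search response; not a real Entrez payload.
response_text = (
    '{"esearchresult": {"idlist": ["59296319", "17076752", "7336701", "4"]}}'
)
data = json.loads(response_text)

# Prefix each numeric identifier to obtain the familiar rs ID form.
rs_ids = ['rs' + uid for uid in data['esearchresult']['idlist']]
print(rs_ids)
```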

Or pass a dictionary (the arguments are not validated and are joined with AND):

results = entrez_api.search(
    dict(chromosome=13, organism='human', position=31873085),
    database='snp',
    max_results=10
)
print(results.data['esearchresult']['idlist'])

['59296319', '17076752', '7336701', '4']
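
Both calls return the same identifiers because the dictionary is equivalent to the query string. The sketch below illustrates one way such a translation could work; `dict_to_term` is a hypothetical helper written for this illustration and is not claimed to be how easy-entrez implements it.

```python
def dict_to_term(query: dict) -> str:
    """Hypothetical helper: join key/value pairs into one query with AND."""
    return ' AND '.join(
        # The upper-cased key becomes the field tag in square brackets.
        f'{value}[{field.upper()}]'
        for field, value in query.items()
    )


term = dict_to_term(dict(chromosome=13, organism='human', position=31873085))
print(term)
```

Because nothing is validated, a misspelled key would end up as an unknown field tag in the query rather than raising an error on the Python side.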

The base position should use the latest genome assembly (GRCh38 at the time of writing); to use a position in the coordinates of the previous assembly, replace POSITION with POSITION_GRCH37. For more information on the arguments accepted by the SNP database, see the Entrez help page on the NCBI website.

Obtaining amino acid change information for variants in a given range

First, search for the dbSNP rs identifiers of variants in a given region:

dbsnp_ids = (
    entrez_api
    .search(
        '12[CHROMOSOME] AND human[ORGANISM] AND 21178600:21178720[POSITION]',
        database='snp',
        max_results=100
    )
    .data
    ['esearchresult']
    ['idlist']
)

Then fetch the variant data for identifiers:

variant_data = entrez_api.fetch(
    ['rs' + rs_id for rs_id in dbsnp_ids],
    max_results=10,
    database='snp'
)

Finally, parse the data and extract the HGVS entries from the summary:

from easy_entrez.parsing import parse_dbsnp_variants
from pandas import Series


def select_protein_hgvs(items):
    return [
        [sequence, hgvs]
        for entry in items
        for sequence, hgvs in [entry.split(':')]
        if hgvs.startswith('p.')
    ]


protein_hgvs = (
    parse_dbsnp_variants(variant_data)
    .summary
    .HGVS
    .apply(select_protein_hgvs)
    .explode()
    .dropna()
    .apply(Series)
    .rename(columns={0: 'sequence', 1: 'hgvs'})
)
protein_hgvs.head()

rs_id         sequence     hgvs
rs1940853486  NP_006437.3  p.Gly203Ter
rs1940853414  NP_006437.3  p.Glu202Gly
rs1940853378  NP_006437.3  p.Glu202Lys
rs1940853299  NP_006437.3  p.Lys201Thr
rs1940852987  NP_006437.3  p.Asp198Glu
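
To see what `select_protein_hgvs` does to a single row, it can be run on a plain list without pandas. In the sketch below the function is repeated so the snippet runs on its own; the genomic entry in `items` is made up for illustration, while the protein entry is taken from the table above.

```python
def select_protein_hgvs(items):
    # Same logic as above: keep only 'sequence:hgvs' pairs describing proteins.
    return [
        [sequence, hgvs]
        for entry in items
        for sequence, hgvs in [entry.split(':')]
        if hgvs.startswith('p.')
    ]


# Hypothetical HGVS list for one variant (the genomic entry is illustrative).
items = ['NC_000012.12:g.21178615G>T', 'NP_006437.3:p.Gly203Ter']
selected = select_protein_hgvs(items)
print(selected)

# A variant with no protein-level entry yields an empty list.
only_genomic = select_protein_hgvs(['NC_000012.12:g.21178615G>T'])
print(only_genomic)
```

Rows for which the function returns an empty list become missing values after `.explode()`, which is why the pipeline above follows it with `.dropna()`.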

Find PubMed ID from DOI

When searching the GWAS Catalog, a PMID is needed rather than a DOI. You can convert one to the other using:

def doi_term(doi: str) -> str:
    """Prepare DOI for PubMed search"""
    doi = (
        doi
        .replace('http://', 'https://')
        .replace('https://doi.org/', '')
    )
    return f'"{doi}"[Publisher ID]'


result = entrez_api.search(
    doi_term('https://doi.org/10.3389/fcell.2021.626821'),
    database='pubmed',
    max_results=1
)
result.data['esearchresult']['idlist']

['33834021']
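
The `doi_term` helper normalises the DOI before building the search term, so several common ways of writing the same DOI lead to the same query. The sketch below repeats the function so it runs on its own and checks this without contacting PubMed.

```python
def doi_term(doi: str) -> str:
    """Prepare DOI for PubMed search (same function as above)."""
    doi = (
        doi
        .replace('http://', 'https://')
        .replace('https://doi.org/', '')
    )
    return f'"{doi}"[Publisher ID]'


# Three spellings of the same DOI: https URL, http URL and the bare DOI.
terms = {
    doi_term(form)
    for form in [
        'https://doi.org/10.3389/fcell.2021.626821',
        'http://doi.org/10.3389/fcell.2021.626821',
        '10.3389/fcell.2021.626821',
    ]
}
print(terms)
```

Other resolver prefixes, such as `https://dx.doi.org/`, are not stripped by this function and would need an extra replacement.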

Installation

Requires Python 3.6+. Install with:

pip install easy-entrez

If you wish to enable (optional, tqdm-based) progress bars use:

pip install easy-entrez[with_progress_bars]

If you wish to enable (optional, pandas-based) parsing utilities use:

pip install easy-entrez[with_parsing_utils]

Alternatives

You might want to try:

  • biopython.Entrez - biopython is a heavy dependency, but probably a good choice if you already use it
  • pubmedpy - provides interesting utilities for parsing the responses
  • entrez - appears to have a comparable scope but a quite different API
  • entrezpy - this one did not work well for me (hence this package), but may have improved since
