Skip to main content

Download data from URLs quickly, with integrity

Project description

getm: Fast reads with integrity for data URLs

getm provides fast binary reads for HTTP URLs using multiprocessing and shared memory.

Data is downloaded in background processes and made availabe as references to shared memory. There are no buffer copies, but memory references must be released by the caller, which makes working with getm a bit different than typical Python IO streams. But still easy, and fast. In the case of part iteration, memoryview objects are released for you.

Python API methods accept a parameter, concurrency, which controls the mode of operation of mget:

  1. Default concurrency == 1: Download data in a single background process, using a single HTTP request that is kept alive during the course of the download.
  2. concurrency > 1: Up to concurrency HTTP range requests will be made concurrently, each in a separate background process.
  3. concurrency == None: Data is read on the main process. In this mode, getm is a wrapper for requests.

Python API

import getm

# Readable stream:
with getm.urlopen(url) as fh:
    data = fh.read(size)
	data.release()

# Process data in parts:
for part in getm.iter_content(url, chunk_size=1024 * 1024):
    my_chunk_processor(part)
	# Note that 'part.release()' is not needed in an iterator context

CLI

getm https://my-cool-url my-local-file

Testing

During tests, signed URLs are generated that point to data in S3 and GS buckets. Data is repopulated during each test. You must have credentials available to read and write to the test buckets, and to generate signed URLs.

Set the following environment variables to the GS and S3 test bucket names, respectively:

  • GETM_GS_TEST_BUCKET
  • GETM_S3_TEST_BUCKET

GCP Credentials

Generating signed URLs during tests requires service account credentials, which are made available to the test suite by setting the environment variable

export GETM_GOOGLE_APPLICATION_CREDENTIALS=my-creds.json

AWS Credentials

Follow these instructions for configuring the AWS CLI.

Installation

pip install getm

Links

Project home page GitHub
Package distribution PyPI

Bugs

Please report bugs, issues, feature requests, etc. on GitHub.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

getm-0.0.3.tar.gz (24.9 kB view details)

Uploaded Source

File details

Details for the file getm-0.0.3.tar.gz.

File metadata

  • Download URL: getm-0.0.3.tar.gz
  • Upload date:
  • Size: 24.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.7.10

File hashes

Hashes for getm-0.0.3.tar.gz
Algorithm Hash digest
SHA256 4be9d64ee889972403563e6c2c82bf3bef50955e5aad6cfa92f1d41c46de4f5e
MD5 99d412800f5d25325b5b523f01ed80b9
BLAKE2b-256 5c3954dba0905d3dc14f3bdc79f2c829122d1747422e672197fb7ef3c6f8f11f

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page