Skip to main content

Download data from URLs quickly, with integrity

Project description

getm: Fast reads with integrity for data URLs

getm provides fast binary reads for HTTP URLs using multiprocessing and shared memory.

Data is downloaded in background processes and made availabe as references to shared memory. There are no buffer copies, but memory references must be released by the caller, which makes working with getm a bit different than typical Python IO streams. But still easy, and fast.

Python API methods accept a parameter, concurrency, which controls the mode of operation of mget:

  1. Default concurrency == 1: Download data in a single background process, using a single HTTP request that is kept alive during the course of the download.
  2. concurrency > 1: Up to concurrency HTTP range requests will be made concurrently, each in a separate background process.
  3. concurrency == None: Data is read on the main process. In this mode, getm is a wrapper for requests.

Python API

import getm

# Readable stream:
with getm.urlopen(url) as fh:
    data = fh.read(size)
	data.release()

# Process data in parts:
for part in getm.iter_content(url, chunk_size=1024 * 1024):
    my_chunk_processor(part)
	part.release()

CLI

getm https://my-cool-url my-local-file

Testing

During tests, signed URLs are generated that point to data in S3 and GS buckets. The data is repopulated during each test.

Credentials

To run tests you must be properly credentialed. For S3 you must have access to the test bucket, and sufficient privliages to upload data and generate signed URLs. For GS you must have service account credentials privlidged with similar access to the GS bucket. The service account are made available by setting the environment variable

export GOOGLE_APPLICATION_CREDENTIALS=my-creds.json

Installation

pip install getm

Links

Project home page GitHub
Package distribution PyPI

Bugs

Please report bugs, issues, feature requests, etc. on GitHub.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

getm-0.0.1.tar.gz (19.2 kB view details)

Uploaded Source

File details

Details for the file getm-0.0.1.tar.gz.

File metadata

  • Download URL: getm-0.0.1.tar.gz
  • Upload date:
  • Size: 19.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.7.10

File hashes

Hashes for getm-0.0.1.tar.gz
Algorithm Hash digest
SHA256 3add74305eccf0cff8785fc9480a6d0dd48dc167672e1867940c53da37a972e0
MD5 fd7b54756574c37217e1ffc84277b835
BLAKE2b-256 e12f115d217fe76b52e7286d368daa442085b839be88ad07366b85f4f328672d

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page