Git-Annex Mass Downloader and Metadata-er

Project Status: Unsupported. The project has reached a stable, usable state but the author(s) have ceased all work on it. A new maintainer may be desired. MIT License.

GitHub | PyPI | Issues | Changelog

gamdam is the Git-Annex Mass Downloader and Metadata-er. It takes a stream of JSON Lines describing what to download and what metadata each file has, downloads them in parallel to a git-annex repository, attaches the metadata using git-annex’s metadata facilities, and commits the results.

This program was written as an experiment/proof-of-concept for a larger program and is no longer maintained. However, the author has also produced a Rust translation of this program at <https://github.com/jwodder/gamdam-rust>, which is currently being maintained.

Installation

gamdam requires Python 3.8 or higher. Just use pip for Python 3 (you have pip, right?) to install gamdam and its dependencies:

python3 -m pip install gamdam

gamdam also requires git-annex v10.20220222 or higher to be installed separately in order to run.

Usage

gamdam [<options>] [<input-file>]

gamdam reads a series of JSON entries from a file (or from standard input if no file is specified) following the input format described below. It feeds the URLs and output paths to git-annex addurl, and once each file has finished downloading, it attaches any listed metadata and extra URLs using git-annex metadata and git-annex registerurl, respectively.

Note that the latter step can only be performed on files tracked by git-annex; if you, say, have configured git-annex to not track text files, then any text files downloaded will not have any metadata or alternative URLs registered.

Options

--addurl-opts OPTIONS

Extra options to pass to the git-annex addurl command. Note that multiple options & arguments need to be quoted as a single string, which must also use proper shell quoting internally; e.g., --addurl-opts="--user-agent 'gamdam via git-annex'".
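
To see why the inner quoting matters, the sketch below tokenizes the option string with Python's shlex.split, which follows POSIX shell rules. That gamdam splits the string in exactly this way is an assumption; the point is only how shell-style quoting groups words into arguments.

```python
import shlex

# The value given to --addurl-opts, after the outer shell quotes are removed.
opts = "--user-agent 'gamdam via git-annex'"

# Shell-style splitting keeps the single-quoted phrase together as one argument.
args = shlex.split(opts)
print(args)

# Without the inner quotes, the phrase falls apart into separate arguments.
unquoted = shlex.split("--user-agent gamdam via git-annex")
print(unquoted)
```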

-C DIR, --chdir DIR

The directory in which to download files; defaults to the current directory. If the directory does not exist, it will be created. If the directory does not belong to a Git or git-annex repository, it will be initialized as one.

-F FILE, --failures FILE

If any files fail to download, write their input records back out to FILE.

-J INT, --jobs INT

Number of parallel jobs for git-annex addurl to use; by default, the process is instructed to use one job per CPU core.

-l LEVEL, --log-level LEVEL

Set the log level to the given value. Possible values are “CRITICAL”, “ERROR”, “WARNING”, “INFO”, “DEBUG” (all case-insensitive) and their Python integer equivalents. [default: INFO]

-m TEXT, --message TEXT

The commit message to use when saving. This may contain a {downloaded} placeholder which will be replaced with the number of files successfully downloaded.
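
As a rough illustration of the placeholder, the sketch below fills in a message template. The template text and the render_message helper are hypothetical, and the use of str.format-style substitution is an assumption about how the replacement behaves, not a description of gamdam's internals.

```python
# Hypothetical commit message template; only the {downloaded} placeholder
# comes from the option description above.
template = "Downloaded {downloaded} files"


def render_message(template: str, downloaded: int) -> str:
    # Assumption: substitution behaves like str.format with one named field.
    return template.format(downloaded=downloaded)


message = render_message(template, 3)
print(message)
```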

--no-save-on-fail

Don’t commit the downloaded files if any files failed to download.

--save, --no-save

Whether to commit the downloaded files once they’ve all been downloaded. [default: --save]

Input Format

Input is a series of JSON objects, one per line (a.k.a. “JSON Lines”). Each object has the following fields:

url

(required) A URL to download.

path

(required) A relative path where the contents of the URL should be saved. If an entry with a given path is encountered while another entry with the same path is being downloaded, the later entry is discarded, and a warning is emitted.

If a file already exists at a given path, git-annex will try to register the URL as an additional location for the file, failing if the resource at the URL is not the same size as the extant file.

metadata

A collection of metadata in the form used by git-annex metadata, i.e., a dict mapping key names to lists of string values.

extra_urls

A list of alternative URLs for the resource, to be attached to the downloaded file with git-annex registerurl.

If a given input line is invalid, it is discarded, and an error message is emitted.
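
The format above can be produced with a few lines of Python. This is a minimal sketch: the make_entry helper is hypothetical, and the URLs, paths, and metadata values are placeholders chosen for illustration.

```python
import json
from typing import Dict, List, Optional


def make_entry(
    url: str,
    path: str,
    metadata: Optional[Dict[str, List[str]]] = None,
    extra_urls: Optional[List[str]] = None,
) -> str:
    """Serialize one input record as a single line of JSON."""
    entry: dict = {"url": url, "path": path}
    # Optional fields are only written when supplied.
    if metadata is not None:
        entry["metadata"] = metadata
    if extra_urls is not None:
        entry["extra_urls"] = extra_urls
    return json.dumps(entry)


# Placeholder URLs and paths, for illustration only.
lines = [
    make_entry(
        "https://example.com/data/a.bin",
        "data/a.bin",
        metadata={"project": ["demo"], "tag": ["raw", "2024"]},
        extra_urls=["https://mirror.example.org/a.bin"],
    ),
    make_entry("https://example.com/data/b.bin", "data/b.bin"),
]

# One JSON object per line, ready to be written to a file or piped to gamdam.
text = "\n".join(lines) + "\n"
print(text, end="")
```

The resulting text could be saved to a file and given as the <input-file> argument, or piped to gamdam on standard input.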

Library Usage

gamdam can also be used as a Python library. It exports the following:

async def download(
    repo: pathlib.Path,
    objects: AsyncIterator[Downloadable],
    jobs: Optional[int] = None,
    addurl_opts: Optional[List[str]] = None,
    subscriber: Optional[anyio.abc.ObjectSendStream[DownloadResult]] = None,
) -> Report

Download the items yielded by the async iterator objects to the directory repo (which must be part of a git-annex repository) and set their metadata. jobs is the number of parallel jobs for the git-annex addurl process to use; a value of None means to use one job per CPU core. addurl_opts contains any additional arguments to append to the git-annex addurl command.

If subscriber is supplied, it will be sent a DownloadResult (see below) for each completed download, both successful and failed. This can be used to implement custom post-processing of downloads.
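
A minimal sketch of calling download() from a script, based only on the signature shown above. The helper names (parse_records, run_download, main), the two-job setting, and the example.com URL are illustrative assumptions. gamdam and anyio are imported lazily so that the parsing helper can be used without them installed; calling main() additionally requires git-annex and a current directory that is part of a git-annex repository.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterable, List


def parse_records(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Turn JSON Lines text into plain dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


async def run_download(repo: Path, records: List[Dict[str, Any]]):
    # Lazy import: gamdam is only needed once a download is actually started.
    from gamdam import Downloadable, download

    async def objects():
        # download() expects an async iterator of Downloadable instances.
        for rec in records:
            yield Downloadable(**rec)

    # jobs=2 is an arbitrary choice; None would mean one job per CPU core.
    return await download(repo, objects(), jobs=2)


def main() -> None:
    import anyio  # the async library named in the download() signature

    records = parse_records(
        ['{"url": "https://example.com/a.bin", "path": "a.bin"}']
    )
    report = anyio.run(run_download, Path("."), records)
    print(f"downloaded={report.downloaded} failed={report.failed}")
```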

class Downloadable(pydantic.BaseModel):
    path: pathlib.Path
    url: pydantic.AnyHttpUrl
    metadata: Optional[Dict[str, List[str]]] = None
    extra_urls: Optional[List[pydantic.AnyHttpUrl]] = None

Downloadable is a pydantic model used to represent files to download; see Input Format above for the meanings of the fields.

class DownloadResult(pydantic.BaseModel):
    downloadable: Downloadable
    success: bool
    key: Optional[str] = None
    error_messages: Optional[List[str]] = None

DownloadResult is a pydantic model used to represent a completed download. It contains the original Downloadable, a flag to indicate download success, the downloaded file’s git-annex key (only set if the download was successful and the file is tracked by git-annex) and any error messages from the addurl process (only set if the download failed).

@dataclass
class Report:
    downloaded: int
    failed: int

Report is used as the return value of download(); it contains the number of files successfully downloaded and the number of failed downloads.
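
The relationship between per-file results and the final counts can be sketched with standard-library stand-ins. FakeDownloadable and FakeResult below are hypothetical substitutes that only mirror the field names documented above; they are not gamdam's pydantic models, and summarize is an illustrative helper showing the kind of post-processing a subscriber might do.

```python
import json
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FakeDownloadable:
    # Stand-in for gamdam's Downloadable, reduced to two fields.
    path: str
    url: str


@dataclass
class FakeResult:
    # Stand-in for gamdam's DownloadResult, using the same field names.
    downloadable: FakeDownloadable
    success: bool
    key: Optional[str] = None
    error_messages: Optional[List[str]] = None


def summarize(results: List[FakeResult]) -> Tuple[int, int, List[str]]:
    """Count successes and failures and collect failed records as JSON lines."""
    downloaded = 0
    failed_lines: List[str] = []
    for r in results:
        if r.success:
            downloaded += 1
        else:
            failed_lines.append(
                json.dumps({"url": r.downloadable.url, "path": r.downloadable.path})
            )
    return downloaded, len(failed_lines), failed_lines


# "KEY-a" is a placeholder string, not a real git-annex key.
results = [
    FakeResult(FakeDownloadable("a.bin", "https://example.com/a.bin"), True, key="KEY-a"),
    FakeResult(
        FakeDownloadable("b.bin", "https://example.com/b.bin"),
        False,
        error_messages=["download failed"],
    ),
]
downloaded, failed, failed_lines = summarize(results)
print(downloaded, failed, failed_lines)
```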
