Context manager for enforcing links between data pipeline outputs and git history.

gittrail - Linking data pipeline outputs to git history

Versioning code with git is easy; versioning data pipeline inputs and outputs is hard.

GitTrail helps you to maintain a traceable data lineage by enforcing a link between data files and the commit history of your processing code.

Like blockchain, but easier.

How it works

GitTrail is used as a context manager around the code that executes your data processing:

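from gittrail import GitTrail  # import path assumed from the package name

with GitTrail(
    repo="/path/to/my_data_processing_code",
    data="/path/to/my_data_storage",
):
    # TODO: download the pipeline inputs to [data]
    ...  # placeholder so the block is valid Python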

In between GitTrail sessions you may edit your pipeline code, make commits, and so on.

When your next data processing stage is ready:

with GitTrail(
    repo="/path/to/my_data_processing_code",
    data="/path/to/my_data_storage",
):
    # TODO: run data analysis on inputs from [data]
    # TODO: save results to [data]
    ...  # placeholder so the block is valid Python

Upon entering the context, GitTrail attaches a log handler that routes all logging into a *.log file in a subdirectory of [data]. When the context exits, the handler is detached and session metadata is stored in a *.json file. The metadata includes the current git commit of your [repo], as well as MD5 hashes of the files inside [data].
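To make the session lifecycle concrete, here is a minimal sketch of the same idea using only the standard library. It is not GitTrail's implementation: the names session_sketch, md5_of, current_commit, the get_commit parameter, the "_trail" subdirectory, and the JSON layout are all assumptions made for illustration.

```python
import hashlib
import json
import logging
import subprocess
from contextlib import contextmanager
from datetime import datetime, timezone
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_commit(repo: str) -> str:
    """Ask git for the commit hash that HEAD points to in `repo`."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


@contextmanager
def session_sketch(repo: str, data: str, get_commit=current_commit):
    """Hypothetical stand-in for one session (not the library's code)."""
    data_dir = Path(data)
    trail = data_dir / "_trail"  # assumed name for the audit subdirectory
    trail.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")

    # Attach a file handler to the root logger for the session's duration.
    handler = logging.FileHandler(trail / f"{stamp}.log")
    root = logging.getLogger()
    root.addHandler(handler)
    try:
        yield
    finally:
        # Detach the handler, then record the commit and file hashes.
        root.removeHandler(handler)
        handler.close()
        files = {
            p.relative_to(data_dir).as_posix(): md5_of(p)
            for p in sorted(data_dir.rglob("*"))
            if p.is_file() and trail not in p.parents
        }
        meta = {"commit": get_commit(repo), "files": files}
        (trail / f"{stamp}.json").write_text(json.dumps(meta, indent=2))
```

The commit lookup is passed in as a function so the sketch can be tried without a git repository. Note that the root logger only records WARNING and above unless its level is lowered.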

Within the context, the following two rules are enforced:

  1. The working tree of your code [repo] must be clean (no uncommitted changes).
  2. All files currently found in [data] must have been created/changed in a previous GitTrail context.

Taken together this means that:

  • You're not allowed to add, edit, or otherwise change anything in [data] by hand.
  • Your data processing code may continue to evolve as you move forward through your pipeline.
  • You can amend, rewind, or rewrite git commits of your processing code, but then the corresponding files in [data] and the audit trail session file must be deleted.
  • All files in [data] are linked to the processing code that produced them.
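The two enforced rules can be sketched as plain checks. This is an illustration under assumptions, not GitTrail's code: is_clean and unknown_or_modified are hypothetical names, and each earlier session is assumed to be available as a mapping from relative file path to MD5 hash.

```python
import subprocess
from typing import Dict, Iterable, List


def is_clean(repo: str) -> bool:
    """Rule 1: True if `git status --porcelain` prints nothing.

    Note: this also counts untracked files as "not clean"; whether
    GitTrail does the same is not stated in this document.
    """
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() == ""


def unknown_or_modified(
    current: Dict[str, str],
    sessions: Iterable[Dict[str, str]],
) -> List[str]:
    """Rule 2: list files whose (path, hash) pair no earlier session recorded."""
    known = set()
    for files in sessions:
        known.update(files.items())
    return sorted(
        path for path, digest in current.items()
        if (path, digest) not in known
    )
```

An empty list from unknown_or_modified means every file in [data] matches a state that some previous session recorded; any entry in the list points to a file that was added or edited outside a session.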

Limitations

GitTrail can't police everything, so keep the following in mind:

  • Data outside of [data], for example a database, is not tracked. If you're reading or writing data outside of [data], think about how you can trace that in your git history and/or the [data] audit trail.
  • Code outside of [repo] is not tracked. Unless your [repo] specifies exact dependency versions, your code may not be 100% reproducible.
  • Audit trail files are not cryptographically signed, so tampering with them goes undetected.
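One possible way to narrow the dependency gap is to write the installed package versions into [data] during a session, so the snapshot gets hashed along with everything else. The sketch below is a suggestion, not a GitTrail feature; snapshot_environment and the file name in the usage note are hypothetical.

```python
import json
from importlib import metadata
from pathlib import Path


def snapshot_environment(target: Path) -> dict:
    """Write {package name: version} for all installed distributions to `target`."""
    versions = {
        dist.metadata["Name"]: dist.version
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip distributions with broken metadata
    }
    target.write_text(json.dumps(versions, indent=2, sort_keys=True))
    return versions
```

Called inside the context, for example as snapshot_environment(Path("/path/to/my_data_storage") / "environment.json"), the resulting file would be treated like any other pipeline output.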
