Skip to main content

Context manager for enforcing links between data pipeline outputs and git history.

Project description

PyPI version pipeline coverage

gittrail - Linking data pipeline outputs to git history

Versioning of code with git is easy, versioning data pipeline inputs/outputs is hard.

GitTrail helps you to maintain a traceable data lineage by enforcing a link between data files and the commit history of your processing code.

Like blockchain, but easier.

How it works

GitTrail is used as a context manager around the code that executes your data processing:

with GitTrail(
    repo="/path/to/my_data_processing_code",
    data="/path/to/my_data_storage",
):
    # TODO: download the pipeline inputs to [data]

Inbetween GitTrail sessions you may edit your pipeline code, make commits etc.

When your next data processing stage is ready:

with GitTrail(
    repo="/path/to/my_data_processing_code",
    data="/path/to/my_data_storage",
):
    # TODO: run data analysis on inputs from [data]
    # TODO: save results to [data]

Upon entering the context GitTrail attaches a log handler to re-route all logging into a *.log file in a subdirectory of [data]. When the context exits, the logger is detached and session metadata is stored in a *.json file. The metadata includes the current git commit of your [repo], as well MD5 hashes of the files inside [data].

Within the context, the following two rules are enforced:

  1. The working tree of your code [repo] must be clean (no uncommitted changes).
  2. All files currently found in [data] must have been created/changed in a previous GitTrail context.

Taken together this means that:

  • You're not allowed to add/edit/anything in [data] by hand.
  • Your data processing code may continue to evolve as you're moving forward through your pipeline.
  • You can amend/rewind/rewrite git commits of your processing code, but the corresponding files in [data] and the audit trail session file must be deleted.
  • All files in the [data] are linked to the processing code that produced them.

Limitations

GitTrail can't police everything, so keep the following in mind:

  • Data outside of [data], for example a database, is not tracked. If you're reading/writing data outside of [data] think about how you can trace that in your git history and/or [data] audit trail.
  • Code outside of [repo] is not tracked. Unless your [repo] specifies exact dependency versions, your code may not be 100 % reproducible.
  • Audit trail files are not cryptographically signed, so if you mess with them that's not tracked.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gittrail-0.1.1.tar.gz (24.0 kB view hashes)

Uploaded Source

Built Distribution

gittrail-0.1.1-py3-none-any.whl (24.4 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page