Context manager for enforcing links between data pipeline outputs and git history.
gittrail - Linking data pipeline outputs to git history
Versioning code with git is easy; versioning data pipeline inputs/outputs is hard.
GitTrail helps you maintain a traceable data lineage by enforcing a link between data files and the commit history of your processing code.
Like blockchain, but easier.
How it works
GitTrail is used as a context manager around the code that executes your data processing:
    with GitTrail(
        repo="/path/to/my_data_processing_code",
        data="/path/to/my_data_storage",
    ):
        # TODO: download the pipeline inputs to [data]
        ...
In between GitTrail sessions you may edit your pipeline code, make commits, etc.
When your next data processing stage is ready:
    with GitTrail(
        repo="/path/to/my_data_processing_code",
        data="/path/to/my_data_storage",
    ):
        # TODO: run data analysis on inputs from [data]
        # TODO: save results to [data]
        ...
Upon entering the context, GitTrail attaches a log handler to re-route all logging into a *.log file in a subdirectory of [data].
When the context exits, the logger is detached and session metadata is stored in a *.json file.
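The attach/detach behaviour described above can be illustrated with a minimal sketch built on the standard `logging` module. This is not GitTrail's actual implementation; the names `route_logging_to_file` and `demo_route_logging`, the log format, and the file name `session.log` are assumptions made for illustration.

```python
import logging
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def route_logging_to_file(log_path):
    """Attach a file handler to the root logger and detach it on exit.

    Hypothetical helper, not part of the gittrail API.
    """
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    root = logging.getLogger()
    root.addHandler(handler)
    try:
        yield handler
    finally:
        # Detach and close the handler even if the wrapped code raised.
        root.removeHandler(handler)
        handler.close()


def demo_route_logging():
    """Log once inside and once outside the context; return the log file text."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = Path(tmp) / "session.log"  # assumed file name
        logger = logging.getLogger("pipeline_demo")
        logger.setLevel(logging.INFO)
        with route_logging_to_file(log_path):
            logger.info("inside the session")
        logger.info("outside the session")
        return log_path.read_text(encoding="utf-8")
```

Only the message logged inside the `with` block ends up in the file, because the handler is removed again in the `finally` clause.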
The metadata includes the current git commit of your [repo], as well as MD5 hashes of the files inside [data].
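As a rough sketch of what such session metadata could look like, the following code hashes every file in a directory with MD5 and combines the result with a commit ID. The field names `commit_id` and `files`, and all function names, are assumptions; the actual layout of GitTrail's *.json file is not described here.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_data_dir(data_dir):
    """Map each file's path (relative to data_dir) to its MD5 digest."""
    data_dir = Path(data_dir)
    return {
        path.relative_to(data_dir).as_posix(): md5_of_file(path)
        for path in sorted(data_dir.rglob("*"))
        if path.is_file()
    }


def build_session_metadata(commit_id, data_dir):
    """Combine a commit ID with file hashes (hypothetical metadata layout).

    The commit ID is passed in by the caller; it could be obtained with
    `git rev-parse HEAD`, for example.
    """
    return {"commit_id": commit_id, "files": hash_data_dir(data_dir)}


def demo_metadata():
    """Build metadata for a temporary directory holding one small file."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "inputs.csv").write_bytes(b"hello")
        metadata = build_session_metadata("0" * 40, tmp)  # placeholder commit ID
        return json.dumps(metadata, indent=2)
```

Storing paths relative to the data directory keeps the record valid if the storage location is moved as a whole.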
Within the context, the following two rules are enforced:
- The working tree of your code [repo] must be clean (no uncommitted changes).
- All files currently found in [data] must have been created/changed in a previous GitTrail context.
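The two rules could be checked along the following lines. This is a sketch under stated assumptions, not GitTrail's own code: both function names are hypothetical, the clean-tree check shells out to `git status --porcelain` (which also reports untracked files, so this sketch treats those as "not clean"), and the file check assumes hashes are available as plain dictionaries mapping paths to digests.

```python
import subprocess


def working_tree_is_clean(repo):
    """Return True if `git status --porcelain` prints nothing for repo.

    Requires the git command line tool; raises CalledProcessError if
    repo is not a git repository.
    """
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == ""


def find_unexplained_files(current_hashes, recorded_hashes):
    """Return paths that are new or changed compared to the recorded hashes.

    Both arguments map a file path to its hash digest. Any path returned
    here would violate the second rule.
    """
    return sorted(
        path
        for path, digest in current_hashes.items()
        if recorded_hashes.get(path) != digest
    )
```

For example, comparing `{"a": "1", "b": "2", "c": "3"}` against a record of `{"a": "1", "b": "9"}` flags `b` (changed) and `c` (new).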
Taken together this means that:
- You're not allowed to add, edit, or otherwise change anything in [data] by hand.
- Your data processing code may continue to evolve as you're moving forward through your pipeline.
- You can amend/rewind/rewrite git commits of your processing code, but then the corresponding files in [data] and the audit trail session file must be deleted.
- All files in [data] are linked to the processing code that produced them.
Limitations
GitTrail can't police everything, so keep the following in mind:
- Data outside of [data], for example a database, is not tracked. If you're reading/writing data outside of [data] think about how you can trace that in your git history and/or [data] audit trail.
- Code outside of [repo] is not tracked. Unless your [repo] specifies exact dependency versions, your code may not be 100% reproducible.
- Audit trail files are not cryptographically signed, so if you mess with them that's not tracked.
File details
Details for the file gittrail-0.1.1.tar.gz.
File metadata
- Download URL: gittrail-0.1.1.tar.gz
- Upload date:
- Size: 24.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | ffb96c6d2eb8b4a011a299aae02dacff80492e5bcb2aff0333639c558af9a09c
MD5 | e9a10585fb2b8ad7fad6b9a2816adb84
BLAKE2b-256 | b04e6328a009cf39b4007bc2d2ea7d24d7af5896e778a70c71f69651f51c2c66
File details
Details for the file gittrail-0.1.1-py3-none-any.whl.
File metadata
- Download URL: gittrail-0.1.1-py3-none-any.whl
- Upload date:
- Size: 24.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | db8f399d94aff522a0dd53c4cd24f79f9ea2caee079d8f06cea7ff341f3847eb
MD5 | e4d36d3ac7385dec1cc4070650109e3c
BLAKE2b-256 | 6e331e7c8f1a5ef051e86af133777e2e2cfb05cecea865f0d489fb58d6c2bdca