cas-manifest allows developers to store artifacts in a _content-addressable_ store using a self-describing _manifest_.
CAS-Manifest
This package facilitates storing artifacts in Content Addressable Storage (CAS) via the hashfs library. In a CAS regime, the hash of the artifact's contents is used as the key.
It further requires that artifacts are pydantic models - this allows for stable serialization of the artifacts, and for data to be self-describing.
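To make the "hash as key" idea concrete, here is a minimal sketch that uses only the standard library. The `InMemoryCAS` class is a hypothetical stand-in for illustration; it is not the hashfs API and not how cas-manifest is implemented.

```python
import hashlib


class InMemoryCAS:
    """Toy content-addressable store (hypothetical): key = SHA-256 of the bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Storing the same bytes twice is a no-op: same content, same key.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


store = InMemoryCAS()
key_a = store.put(b"a,b\n1,2\n")
key_b = store.put(b"a,b\n1,2\n")  # identical content
key_c = store.put(b"a,b\n1,3\n")  # one byte differs
```

Identical contents map to the same key (`key_a == key_b`), while any change to the contents yields a different key (`key_c`).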
Consider an example usage profile: let's say that your application works with datasets, some of which are serialized as CSV files, others of which are serialized as TSV files. Some have header rows, and some do not. Rather than write data-loading code that tries to infer the correct way to deserialize a dataset file, cas-manifest serializes all relevant attributes of the dataset along with the data file itself. Your code might look like this:
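from hashfs import HashFS
from cas_manifest.registry import Registry
from my_classes import CSVDataset, TSVDataset

# Open the content-addressable store rooted at this directory
fs = HashFS('/path/to/data')
# Key of the stored dataset
dataset_hash = '5fef4a'
# Register the classes that a stored artifact may deserialize into
registry = Registry(fs, [CSVDataset, TSVDataset])
obj = registry.load(dataset_hash)
# obj is an instance of either CSVDataset or TSVDataset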
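The example above relies on the stored artifact describing itself well enough for the right class to be chosen. The following is a rough sketch of how such dispatch could work, using only the standard library. The manifest layout (`"type"` and `"fields"` keys), the dataclass definitions, and `load_manifest` are all assumptions made for illustration; they are not cas-manifest's actual format or API, and dataclasses stand in for pydantic models.

```python
import json
from dataclasses import dataclass


@dataclass
class CSVDataset:
    data_hash: str       # key of the raw data file in the store
    has_header: bool
    delimiter: str = ","


@dataclass
class TSVDataset:
    data_hash: str
    has_header: bool
    delimiter: str = "\t"


def load_manifest(raw: str, classes):
    """Pick the class named in the manifest and build it from the stored fields."""
    doc = json.loads(raw)
    by_name = {cls.__name__: cls for cls in classes}
    cls = by_name[doc["type"]]  # raises KeyError if the type is not registered
    return cls(**doc["fields"])


# Hypothetical manifest contents, as they might be read back from the store
manifest = json.dumps(
    {"type": "TSVDataset", "fields": {"data_hash": "5fef4a", "has_header": False}}
)
obj = load_manifest(manifest, [CSVDataset, TSVDataset])
```

Because the manifest records both the type and the parsing attributes, the loading code never has to guess the delimiter or whether a header row is present.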
Why CAS?
In short, CAS enforces immutability. When using CAS, a key's contents can never be changed. The following comes naturally:
- No more data_final__2_new files - all objects are uniquely specified
- No cache invalidation - cache objects freely, knowing that their contents will never change upstream
- No more provenance questions - models can be robustly linked to the datasets used to train them
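The cache-invalidation point can be illustrated with a small sketch, assuming a plain dictionary (`BLOBS`) as a stand-in for the store. The names `put`, `row_count`, and `PARSE_CALLS` are hypothetical. Since a key can only ever refer to one set of bytes, a result cached under that key never goes stale.

```python
import hashlib
from functools import lru_cache

BLOBS = {}       # stand-in for a content-addressable store
PARSE_CALLS = 0  # counts how often the "expensive" work actually runs


def put(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    BLOBS[key] = data
    return key


@lru_cache(maxsize=None)
def row_count(key: str) -> int:
    """Result is cached per content hash and never needs invalidating."""
    global PARSE_CALLS
    PARSE_CALLS += 1
    return len(BLOBS[key].splitlines())


v1 = put(b"x\n1\n2\n")
v2 = put(b"x\n1\n2\n3\n")  # an "edit" produces a new key; v1 is untouched
counts = (row_count(v1), row_count(v1), row_count(v2))
```

The second call with `v1` is served from the cache, and the edited data is computed separately because it lives under a different key.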
Why manifests?
In a CAS regime, keys are deliberately opaque. Manifests make artifacts self-describing: a manifest can include instructions for deserialization, links to other artifacts, and any other metadata you can think up. In combination with CAS, this ensures that your metadata and underlying data never go out of sync, since the metadata holds an immutable reference to the underlying data.
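As a rough sketch of manifests linking to other artifacts, the snippet below stores raw data, a dataset manifest that points to it, and a model manifest that points to the dataset. The JSON layout, the `"trained_on"` field, and the helper names are assumptions chosen for illustration, not cas-manifest's actual schema.

```python
import hashlib
import json

STORE = {}  # stand-in for a content-addressable store


def put_bytes(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    STORE[key] = data
    return key


def put_manifest(doc: dict) -> str:
    # sort_keys gives a stable serialization, so equal manifests hash equally
    return put_bytes(json.dumps(doc, sort_keys=True).encode("utf-8"))


data_key = put_bytes(b"a\tb\n1\t2\n")
dataset_key = put_manifest(
    {"type": "TSVDataset", "data": data_key, "has_header": True}
)
model_key = put_manifest({"type": "Model", "trained_on": dataset_key})

# Follow the links back from the model to the raw bytes.
model = json.loads(STORE[model_key])
dataset = json.loads(STORE[model["trained_on"]])
raw = STORE[dataset["data"]]
```

Each link is a content hash, so the model manifest always resolves to exactly the dataset it was written against; a modified dataset would get a new key and leave the existing chain intact.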