cas-manifest allows developers to store artifacts in a content-addressable store using a self-describing manifest
cas-manifest provides a means to store data that is always immutable, stable when you want it to be, and flexible when you need it to be.
cas-manifest stores data artifacts in content-addressable storage (CAS). It facilitates the use of CAS with standard, serializable wrappers that coexist with and support data.
Why CAS?
In short, CAS enforces immutability. When using CAS, a key's contents can never be changed. The following comes naturally:
- No more data_final__2_new files - all objects are uniquely specified
- No cache invalidation - cache objects freely, knowing that their contents will never change upstream
- No more provenance questions - models can be robustly linked to the datasets used to train them
In a CAS store, instead of put-ing a Value at a Key, you put a Value and get back the Key uniquely determined by that value.
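As a rough illustration of that contract, here is a minimal in-memory sketch. TinyCAS is a hypothetical name invented for this example; it is not part of cas-manifest or hashfs, and it assumes SHA-256 as the hash function.

```python
import hashlib


class TinyCAS:
    """Toy in-memory content-addressable store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._blobs: dict = {}

    def put(self, value: bytes) -> str:
        # The key is derived from the value itself, so the caller never chooses it.
        key = hashlib.sha256(value).hexdigest()
        self._blobs[key] = value
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


store = TinyCAS()
key_a = store.put(b"col1,col2\n1,2\n")
key_b = store.put(b"col1,col2\n1,2\n")  # same bytes, same key
key_c = store.put(b"col1,col2\n1,3\n")  # different bytes, different key
```

Because the key is a function of the content, "changing" a value can only ever produce a new key; the old key keeps pointing at the old bytes.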
Why manifests?
It's all well and good to stuff some data into a binary artifact and keep a key that references it. The hard part is, when you're given a key, how do you know what's stored there? If you do settle on a standard serialization method, how do you let it evolve while maintaining backward compatibility?
cas-manifest encourages the use of manifest classes to address these challenges. These manifest classes include code to serialize and deserialize artifacts. They provide a place to store metadata about the artifacts - this may be used for deserialization, used to indicate how the loaded data should be used, or may simply be informational. Finally, fields in the manifest class may reference other objects in CAS, allowing objects to be composed and reused. In combination with CAS, you can ensure that your metadata and underlying data never go out of sync, since your metadata will refer to an immutable reference to underlying data.
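The idea of a manifest that references immutable data can be sketched without any library at all. The snippet below is an assumption-laden toy, not how cas-manifest stores things internally: the dict-based store, the put helper, and the "type" tag are all hypothetical.

```python
import hashlib
import json

blobs = {}  # toy CAS: hex digest -> bytes


def put(value: bytes) -> str:
    key = hashlib.sha256(value).hexdigest()
    blobs[key] = value
    return key


# Store the raw data first, then a manifest that points at it by hash.
data_key = put(b"1,2\n3,4\n")
manifest = {
    "type": "CSVManifest",       # hypothetical tag saying how to read the data
    "column_names": ["a", "b"],  # metadata needed for deserialization
    "path": data_key,            # reference to another object in the store
}
manifest_key = put(json.dumps(manifest, sort_keys=True).encode("utf-8"))

# Reading back: the manifest key alone is enough to find and interpret the data.
loaded = json.loads(blobs[manifest_key])
raw = blobs[loaded["path"]]
```

Since the manifest embeds the hash of the data, and the manifest's own key is a hash of the manifest, neither can drift away from the other without producing a different key.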
Example
Implementing Serialization
Let's say that we wish to store datasets that we represent in memory as pandas DataFrames. We'll create a subclass of cas_manifest.Serializable to represent what we wish to store:
class CSVSerializable(Serializable[pd.DataFrame]):
    column_names: List[str]
    path: Ref

    @classmethod
    def pack(cls, inst: pd.DataFrame, fs: HashFS) -> CSVSerializable:
        with tempfile.TemporaryDirectory() as tmpdir:
            tmp_path = Path(tmpdir) / 'tmp.csv'
            with open(tmp_path, mode='w') as f:
                inst.to_csv(f, header=False, index=False)
            csv_addr = fs.put(tmp_path)
        return CSVSerializable(path=Ref(csv_addr.id), column_names=inst.columns.to_list())

    def unpack(self, fs: HashFS) -> pd.DataFrame:
        addr = fs.get(self.path.hash_str)
        df = pd.read_csv(addr.abspath, names=self.column_names)
        return df
Let's break down the items one by one:
class CSVSerializable(Serializable[pd.DataFrame]):
The type parameter of Serializable is the type that we use in memory. Whatever it is you use in application code, that's what you put here.
column_names: List[str]
path: Ref
cas-manifest uses pydantic to define its manifest classes. This allows you to specify class fields at the top level. To ensure that they can be serialized, these fields need to be simple types or other pydantic classes. Ref is a special wrapper class used to refer to other objects in CAS.
@classmethod
def pack(cls, inst: pd.DataFrame, fs: HashFS) -> CSVSerializable:
    with tempfile.TemporaryDirectory() as tmpdir:
        tmp_path = Path(tmpdir) / 'tmp.csv'
        with open(tmp_path, mode='w') as f:
            inst.to_csv(f, header=False, index=False)
        csv_addr = fs.put(tmp_path)
    return CSVSerializable(path=Ref(csv_addr.id), column_names=inst.columns.to_list())
pack is a required method; this specifies how your data should be serialized. Take note of the types of the arguments.
In the body of this method, we save our dataframe as a csv, without a header row. We then put that into the HashFS instance that pack takes as an argument. HashFS provides the implementation of CAS that we use. That put operation returns the address (or key) for our csv. We can then construct an instance of our wrapper class, which contains a Ref to the csv file, and the column names as fields in the manifest class.
def unpack(self, fs: HashFS) -> pd.DataFrame:
    addr = fs.get(self.path.hash_str)
    df = pd.read_csv(addr.abspath, names=self.column_names)
    return df
unpack is another required method; this indicates how your serialized data should be deserialized. Again, note the types of the arguments and the return types. In this case, we get the location on the filesystem for the csv file that we saved. We then call pandas.read_csv to read it, supplying the column names stored in the manifest class.
Note that, when this pattern is followed in the real world, it can often be confusing to keep track of whether column names are stored as a header row or kept elsewhere, whether there should be an index column, etc. This sounds silly, but it's a real problem - especially if you ever want to change your mind! cas-manifest standardizes these decisions by embedding the logic in the manifest class.
Storing and retrieving data
Now that we've implemented all that, how do we use it?
First, we need to put something into CAS. Let's say that we have a DataFrame named df. We'll also need an instance of HashFS from the hashfs package. AWS users may wish to make use of S3HashFS in this package, which provides an implementation of HashFS backed by S3. We can then do the following:
import pandas as pd
from hashfs import HashFS, HashAddress
df: pd.DataFrame = ...
fs_instance: HashFS = ...
addr: HashAddress = CSVSerializable.dump(df, fs_instance)
Given an instance of HashFS, we can call dump on CSVSerializable to serialize our DataFrame and store it in fs_instance. The returned object is a HashAddress, which includes the immutable hash of the serialized object, as well as helper information like a path to its location on disk. If we wanted access to the serialized representation, we could also have called CSVSerializable.pack(df, fs_instance) to get an instance of CSVSerializable.
Now, how do we retrieve our serialized object from storage? Again with our instance of HashFS, we'll do the following:
hash_str = addr.id
# We create a Registry that knows what classes to expect
registry: SerializableRegistry[pd.DataFrame] = \
    SerializableRegistry(fs=fs_instance, classes=[CSVSerializable])
# We can `open` a hash address to get access to the dataframe
with registry.open(hash_str) as df:
    pass  # df is the DataFrame that we saved before
# Or we can get the serialized form directly
serialized: CSVSerializable = registry.load(addr.id)
Why is open a context manager? Some implementations of Serializable may create temporary resources that need to be cleaned up, so we treat open like opening and closing a file.
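To make the cleanup argument concrete, here is a small standard-library sketch of a context manager that materializes data in a temporary directory and removes it on exit. open_artifact is a hypothetical helper written for this illustration; it is not the implementation behind the registry's open.

```python
import contextlib
import os
import shutil
import tempfile
from typing import Iterator


@contextlib.contextmanager
def open_artifact(payload: bytes) -> Iterator[str]:
    """Hypothetical stand-in for an `open` that materializes data in a temp dir."""
    tmpdir = tempfile.mkdtemp()
    try:
        path = os.path.join(tmpdir, "artifact.bin")
        with open(path, "wb") as f:
            f.write(payload)
        yield path  # the caller works with the file while the block is active
    finally:
        shutil.rmtree(tmpdir)  # temporary resources are removed on exit


with open_artifact(b"hello") as path:
    with open(path, "rb") as f:
        contents = f.read()
    existed_inside = os.path.exists(path)

exists_after = os.path.exists(path)
```

The try/finally ensures the temporary directory is removed even if the body of the with block raises.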
Evolving the serialization schema
Now, let's imagine that we decide we want to change our serialization format. Perhaps we'd like to make use of numpy's serialization methods to store the data in our dataframe. We can create another subclass of Serializable, implementing pack and unpack as before:
class NPYSerializable(Serializable[pd.DataFrame]):
    ...
We'll skip the implementation for brevity here, but one is available in tests/dataset.py.
We can serialize a dataframe in this new format just as we did with CSVSerializable. When we want to load data, that's where things get interesting:
registry_2: SerializableRegistry[pd.DataFrame] = \
    SerializableRegistry(fs=fs_instance, classes=[CSVSerializable, NPYSerializable])
registry_2 now knows how to deserialize data stored in either format. You can pass it a hash string corresponding to either format, and it will correctly deserialize it into a DataFrame.
This means that you won't have to implement code to sniff out how data is stored on disk and sprinkle it around your codebase. You can consolidate your serde logic in a class, and let cas-manifest sort out how to handle it from there.
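One way to picture this kind of dispatch is a lookup from a type tag stored alongside the data to the code that can decode it. The sketch below is a simplified assumption about the general pattern, not a description of how SerializableRegistry works internally; TinyRegistry, the "type" tag, and both decoders are hypothetical.

```python
import json
from typing import Callable, Dict, List

Rows = List[List[str]]


# Two hypothetical decoders that produce the same in-memory type (a list of rows).
def load_csv(payload: str) -> Rows:
    return [line.split(",") for line in payload.splitlines()]


def load_json(payload: str) -> Rows:
    return json.loads(payload)


class TinyRegistry:
    """Maps a type tag stored in the manifest to the code that can decode it."""

    def __init__(self, decoders: Dict[str, Callable[[str], Rows]]) -> None:
        self._decoders = decoders

    def load(self, manifest: dict) -> Rows:
        tag = manifest["type"]
        if tag not in self._decoders:
            raise KeyError(f"no decoder registered for {tag!r}")
        return self._decoders[tag](manifest["payload"])


registry = TinyRegistry({"csv": load_csv, "json": load_json})
old = registry.load({"type": "csv", "payload": "1,2\n3,4"})
new = registry.load({"type": "json", "payload": '[["1", "2"], ["3", "4"]]'})
```

Callers only ever see rows; which format sat on disk is decided in one place.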
Gotchas
- Regarding portability and schema evolution: keep in mind that your code is not serialized. So, in order to load an object of type X, you must still have X available in your codebase. Instantiating your registry should make this part fairly clear.
- Related to the above, if you make changes to a class, you must ensure that they are backward-compatible (e.g. adding optional fields) in order to be able to load older data.
- Typing: I've done my best to supply correct type annotations, but mypy struggles to infer return types of some generic functions. Explicit type annotations can be helpful.
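The backward-compatibility point above (adding optional fields) can be sketched with a plain dataclass standing in for a manifest class. CSVManifestV2 and its delimiter field are hypothetical; cas-manifest itself uses pydantic, where an optional field with a default would play the same role.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CSVManifestV2:
    """Hypothetical manifest; `delimiter` was added later with a default."""

    column_names: List[str]
    path: str
    delimiter: Optional[str] = None  # new optional field: old records omit it


old_record = json.loads('{"column_names": ["a", "b"], "path": "abc123"}')
new_record = json.loads('{"column_names": ["a"], "path": "def456", "delimiter": ";"}')

old_manifest = CSVManifestV2(**old_record)  # still loads: missing field falls back to None
new_manifest = CSVManifestV2(**new_record)
```

Had delimiter been added as a required field, constructing the class from old_record would fail, which is the situation the gotcha warns about.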