Easy sqlite-backed persistent cache for dataclasses
What is Cachew?
TLDR: cachew lets you cache function calls into an sqlite database on your disk with a single decorator (similar to functools.lru_cache). The difference from functools.lru_cache is that cached data is persisted between program runs, so the next time you call your function, it only has to read from the cache.
The cache is invalidated automatically if your function's arguments change, so you don't have to think about maintaining it.
In order to be cacheable, your function needs to return an Iterator (that is, a generator, tuple or list) of simple data types:
- primitive types: str/int/float/datetime
- NamedTuples
- dataclasses
That allows cachew to automatically infer the schema from type hints (PEP 526), so you don't have to think about serializing/deserializing.
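To give a rough idea of what inferring a schema from type hints can look like, here is a minimal sketch using only the standard library. The `Measurement` class, the `infer_columns` helper and the `SQLITE_TYPES` mapping are made up for illustration; this is not cachew's actual code, and its real type mapping may differ.

```python
from datetime import datetime
from typing import List, NamedTuple, Tuple, get_type_hints

# Assumed mapping, for illustration only; cachew's real mapping may differ.
SQLITE_TYPES = {str: 'TEXT', int: 'INTEGER', float: 'REAL', datetime: 'TEXT'}


class Measurement(NamedTuple):
    # hypothetical datatype, not part of cachew
    dt: datetime
    temperature: float
    humidity: float


def infer_columns(cls: type) -> List[Tuple[str, str]]:
    """Derive (column name, sqlite type) pairs from the class annotations."""
    return [(name, SQLITE_TYPES[tp]) for name, tp in get_type_hints(cls).items()]


print(infer_columns(Measurement))
```

The point is that the field names and types are already written down in the class, so no separate schema definition is needed.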
Motivation
I often find myself processing big chunks of data, merging data together, computing some aggregates on it or extracting the few bits I'm interested in. While I'm trying to utilize the REPL as much as I can, some things are still fragile and often you just have to rerun the whole thing in the process of development. This can be frustrating if data parsing and processing takes seconds, let alone minutes in some cases.
The conventional way of dealing with it is serializing results along with some sort of hash (e.g. md5) of input files, comparing on the next run and returning cached data if nothing changed.
Simple as it sounds, it is pretty tedious to do every time you need to memoize some data; it contaminates your code with routine and distracts you from your main task.
Examples
Processing Wikipedia
Imagine you're working on a data analysis pipeline for some huge dataset, say, extracting urls and their titles from a Wikipedia archive.
Parsing it (the extract_links function) takes hours; however, as long as the archive is the same you will always get the same results. So it would be nice to be able to cache the results somehow.
With this library you can achieve it through a single @cachew decorator.
>>> from typing import NamedTuple, Iterator
>>> from cachew import cachew
>>> class Link(NamedTuple):
...     url : str
...     text: str
...
>>> @cachew
... def extract_links(archive_path: str) -> Iterator[Link]:
...     for i in range(5):
...         # simulate slow IO
...         # this function runs for five seconds for the purpose of demonstration, but realistically it might take hours
...         import time; time.sleep(1)
...         yield Link(url=f'http://link{i}.org', text=f'text {i}')
...
>>> list(extract_links(archive_path='wikipedia_20190830.zip')) # that would take about 5 seconds on first run
[Link(url='http://link0.org', text='text 0'), Link(url='http://link1.org', text='text 1'), Link(url='http://link2.org', text='text 2'), Link(url='http://link3.org', text='text 3'), Link(url='http://link4.org', text='text 4')]
>>> from timeit import Timer
>>> res = Timer(lambda: list(extract_links(archive_path='wikipedia_20190830.zip'))).timeit(number=1)
... # second run is cached, so should take less time
>>> print(f"call took {int(res)} seconds")
call took 0 seconds
>>> res = Timer(lambda: list(extract_links(archive_path='wikipedia_20200101.zip'))).timeit(number=1)
... # now file has changed, so the cache will be discarded
>>> print(f"call took {int(res)} seconds")
call took 5 seconds
When you call extract_links with the same archive, you start getting results in a matter of milliseconds, as fast as sqlite reads it.
When you use a newer archive, archive_path changes, which makes cachew invalidate the old cache and recompute it, so you don't need to think about maintaining it separately.
Incremental data exports
This is my most common use case for cachew, which I'll illustrate with an example.
I'm using an environment sensor to log stats about temperature and humidity. Data is synchronized via bluetooth into an sqlite database, which is easy to access. However, the sensor has limited memory (e.g. the 1000 latest measurements). That means that I end up with a new database every few days, each of them containing only a slice of the data I need, e.g.:
...
20190715100026.db
20190716100138.db
20190717101651.db
20190718100118.db
20190719100701.db
...
To access all of the historic temperature data, I have two options:
- Go through all the data chunks every time I want to access them and 'merge' them into a unified stream of measurements, e.g. something like:

      def measurements(chunks: List[Path]) -> Iterator[Measurement]:
          for chunk in chunks:
              # read measurements from 'chunk' and yield unseen ones
              ...

  This is very easy, but slow, and you waste CPU for no reason every time you need data.
- Keep a 'master' database and write code to merge chunks into it.
  This is very efficient, but tedious:
  - requires serializing/deserializing data -- boilerplate
  - requires manually managing sqlite database -- error prone, hard to get right every time
  - requires careful scheduling, ideally you want to access new data without having to refresh cache
Cachew gives me the best of both worlds and makes it easy and efficient. The only thing you have to do is decorate your function:
@cachew("/data/cache/measurements.sqlite")
def measurements(chunks: List[Path]) -> Iterator[Measurement]:
    # ...
- as long as chunks stay the same, data stays the same, so you always read from the sqlite cache, which is very fast
- you don't need to maintain the database, the cache is automatically refreshed when chunks change (i.e. you got new data)
All the complexity of handling the database is hidden in the cachew implementation.
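For completeness, here is one way the elided body of measurements could look. This is only a sketch under assumptions: the table name `data`, its columns `dt` and `temperature`, and the simplified `Measurement` type are all made up, since the real sensor schema isn't shown here. The cachew decorator is left out so the snippet runs with the standard library alone.

```python
import sqlite3
from contextlib import closing
from pathlib import Path
from typing import Iterator, List, NamedTuple, Set


class Measurement(NamedTuple):
    # hypothetical, simplified datatype
    dt: str            # timestamp kept as a string for simplicity
    temperature: float


def measurements(chunks: List[Path]) -> Iterator[Measurement]:
    seen: Set[str] = set()
    # chunk files are named by timestamp, so sorting gives chronological order
    for chunk in sorted(chunks):
        with closing(sqlite3.connect(str(chunk))) as db:
            # 'data', 'dt' and 'temperature' are assumed names
            rows = db.execute('SELECT dt, temperature FROM data ORDER BY dt').fetchall()
        for dt, temperature in rows:
            if dt not in seen:  # consecutive chunks overlap, skip duplicates
                seen.add(dt)
                yield Measurement(dt=dt, temperature=temperature)
```

Decorated with @cachew as shown above, a function like this would only redo the merge when the list of chunks changes.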
How it works
Basically, your data objects get flattened out and Python types are mapped onto sqlite types and back.
When the function is called, cachew computes the hash of your function's arguments and compares it against the previously stored hash value.
- If they match, it deserializes and yields whatever is stored in the cache database
- If the hash mismatches, the original function is called and the new data is stored along with the new hash
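The cycle described above can be sketched in a few lines with the standard sqlite3 module. To be clear, this is a toy illustration and not cachew's real implementation: the `cached_links` helper, the table layout and the use of `repr` as a hash are all assumptions made for the example.

```python
import sqlite3
from typing import Callable, Iterator, List, NamedTuple


class Link(NamedTuple):
    url: str
    text: str


def cached_links(db: sqlite3.Connection,
                 func: Callable[[str], Iterator[Link]],
                 arg: str) -> List[Link]:
    """Toy compare-hash-then-read-or-recompute cycle (hypothetical helper)."""
    db.execute('CREATE TABLE IF NOT EXISTS hash (value TEXT)')
    db.execute('CREATE TABLE IF NOT EXISTS data (url TEXT, link_text TEXT)')
    new_hash = repr(arg)  # stand-in for hashing the function's arguments
    row = db.execute('SELECT value FROM hash').fetchone()
    if row is not None and row[0] == new_hash:
        # hashes match: map the stored rows back onto the NamedTuple
        return [Link(*r) for r in db.execute('SELECT url, link_text FROM data')]
    # hash mismatch: call the original function and replace the cached rows
    items = list(func(arg))
    with db:  # commit everything in a single transaction
        db.execute('DELETE FROM hash')
        db.execute('DELETE FROM data')
        db.execute('INSERT INTO hash VALUES (?)', (new_hash,))
        db.executemany('INSERT INTO data VALUES (?, ?)', items)
    return items
```

Calling it twice with the same argument runs the wrapped function once; a different argument triggers a recompute and overwrites the old rows.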
Features
- supported types:
  - primitive: str, int, float, bool, datetime, date, dict
  - Optional types
  - Union types
  - Exceptions (experimental, enabled by calling cachew.experimental.enable_exceptions)
    Enables support for caching Exceptions. Exception arguments are going to be serialized as strings.
    It's useful for defensive error handling, in case of cachew in particular for preserving error state.
    I elaborate on it here: mypy-driven error handling.
- detects datatype schema changes and discards old data automatically
Performance
Updating the cache has some overhead, but how much depends on how complicated your datatype is in the first place, so I'd suggest measuring if you're not sure.
When reading the cache, all that happens is reading rows from sqlite and mapping them onto your target datatype, so the only overhead would be from reading sqlite, which is quite fast.
I haven't set up formal benchmarking/regression tests yet, so I don't want to make specific claims; however, it would almost certainly make your program faster if computations take more than several seconds.
Using
See docstring for up-to-date documentation on parameters and return types. You can also use extensive unit tests as a reference.
Some useful arguments of the @cachew decorator:
- cache_path can be a filename, or you can specify a callable that returns a path and depends on the function's arguments.
  It's not required to specify the path (it will be created in /tmp) but recommended.
- hashf is a function that determines whether your arguments have changed.
  By default it just uses the string representation of the arguments; you can also specify a custom callable.
  For instance, it can be used to discard the cache if the input file was modified.
- cls is the type that would be serialized. It is inferred from return type annotations by default, but can be specified if you don't control the code you want to cache.
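As an illustration of the first two arguments, here is a sketch of what custom callables might look like. The helper names `mtime_hash` and `cache_for` are hypothetical, and I'm assuming both get called with the same arguments as the decorated function; check the docstring for the exact contract before relying on this.

```python
import os
import tempfile
from pathlib import Path


def mtime_hash(path: str) -> str:
    """Hypothetical hashf: the value changes whenever the input file is modified."""
    return f'{path}.{os.stat(path).st_mtime_ns}'


def cache_for(path: str) -> Path:
    """Hypothetical cache_path callable: one cache file per input file."""
    return Path(tempfile.gettempdir()) / f'{Path(path).name}.cache.sqlite'


# Intended use (not executed here; assumes cachew is installed and passes
# the decorated function's arguments to both helpers):
#
#   @cachew(cache_path=cache_for, hashf=mtime_hash)
#   def extract_links(archive_path: str) -> Iterator[Link]:
#       ...
```

With a hash like this, overwriting the archive in place under the same name would still discard the cache, which the default string representation of the arguments wouldn't catch.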
Installing
The package is available on PyPI.
pip install cachew
Developing
I'm using tox to run tests, and circleci.
Implementation
- why tuples and dataclasses?
  Tuples are natural in Python for quickly grouping together return results. NamedTuple and dataclass specifically provide a very straightforward and self-documenting way to represent data in Python. The very compact syntax makes them extremely convenient even for one-off means of communicating between a couple of functions.
  If you want to find out more about why you should use more dataclasses in your code, I suggest these links:
- why not pickle?
  Pickling is a bit heavyweight for a plain data class. There are many reports of pickle being slower than even JSON, and it's also a security risk. Lastly, it can only be loaded via Python, whereas sqlite has numerous bindings and tools to explore and interface with it.
- why sqlite database for storage?
  It's pretty efficient, and a sequence of namedtuples maps onto database rows in a very straightforward manner.
- why not pandas.DataFrame?
  DataFrames are great and can be serialised to csv or pickled. They are good to have as one of the ways you can interface with your data; however, they are hardly convenient to reason about abstractly due to their dynamic nature. They also can't be nested.
- why not ORM?
  ORMs tend to be pretty invasive, which might complicate your scripts or even ruin performance. They are also somewhat of an overkill for such a specific purpose.
  - E.g. SQLAlchemy requires you to use custom sqlalchemy-specific types and inherit from a base class. Also, it doesn't support nested types.
- why not marshmallow?
  Marshmallow is a common way to map data into a db-friendly format, but it requires an explicit schema, which is an overhead when you already have it in the form of type annotations. I've looked at existing projects that utilize type annotations, but didn't find them covering all I wanted:
Tips and tricks
Optional dependency
You can benefit from cachew even if you don't want to bloat your app's dependencies. Just use the following snippet:
def mcachew(*args, **kwargs):
    """
    Stands for 'Maybe cachew'.
    Defensive wrapper around @cachew to make it an optional dependency.
    """
    try:
        import cachew
    except ModuleNotFoundError:
        import warnings
        warnings.warn('cachew library not found. You might want to install it to speed things up. See https://github.com/karlicoss/cachew')
        return lambda orig_func: orig_func
    else:
        return cachew.cachew(*args, **kwargs)
Now you can use @mcachew in place of @cachew, and be certain things don't break if cachew is missing.