Easy sqlite-backed persistent cache for dataclasses
Cachew: quick NamedTuple/dataclass cache
TLDR: cachew can persistently cache any sequence (an Iterator) over NamedTuples or dataclasses into an sqlite database on your disk. Database schema is automatically inferred from type annotations (PEP 526).
It works in a similar manner to functools.lru_cache: caching your data is just a matter of decorating it.
The difference from functools.lru_cache is that data is preserved between program runs.
Motivation
I often find myself processing big chunks of data, computing some aggregates on it or extracting only the bits I'm interested in. While I try to use the REPL as much as I can, some things are still fragile, and often you just have to rerun the whole thing during development. This can be frustrating if data parsing and processing takes seconds, let alone minutes in some cases.
The conventional way of dealing with this is to serialize the results along with some sort of hash (e.g. md5) of the input files, compare the hash on the next run, and return the cached data if nothing has changed.
Simple as it sounds, it is pretty tedious to do every time you need to memoize some data; it contaminates your code with routine and distracts you from your main task.
Example
Imagine you're working on a data analysis pipeline for some huge dataset, say, extracting URLs and their titles from a Wikipedia archive.
Parsing it (the extract_links function) takes hours; however, the archive is presumably not updated very frequently.
With this library you can cache the parsed results with a single @cachew decorator.
>>> from typing import NamedTuple, Iterator
>>> from cachew import cachew
>>> class Link(NamedTuple):
...     url : str
...     text: str
...
>>> @cachew
... def extract_links(archive: str) -> Iterator[Link]:
...     for i in range(5):
...         import time; time.sleep(1) # simulate slow IO
...         yield Link(url=f'http://link{i}.org', text=f'text {i}')
...
>>> list(extract_links(archive='wikipedia_20190830.zip')) # that would take about 5 seconds on first run
[Link(url='http://link0.org', text='text 0'), Link(url='http://link1.org', text='text 1'), Link(url='http://link2.org', text='text 2'), Link(url='http://link3.org', text='text 3'), Link(url='http://link4.org', text='text 4')]
>>> from timeit import Timer
>>> res = Timer(lambda: list(extract_links(archive='wikipedia_20190830.zip'))).timeit(number=1) # second run is cached, so should take less time
>>> print(f"took {int(res)} seconds to query cached items")
took 0 seconds to query cached items
How it works
Basically, your data objects get flattened out, and Python types are mapped onto sqlite types and back.
When the function is called, cachew computes the hash of your function's arguments and compares it against the previously stored hash value.
If they match, it deserializes and yields whatever is stored in the cache database. If the hash mismatches, the original data provider is called and the new data is stored along with the new hash.
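The flow described above can be sketched with the standard sqlite3 module. This is a minimal illustration, not cachew's actual implementation: the function cached_links, the table names, and the use of repr as a stand-in for the argument hash are all assumptions made for the example.

```python
import sqlite3
from typing import Callable, Iterator, List, NamedTuple


class Link(NamedTuple):
    url: str
    text: str


def cached_links(db: sqlite3.Connection,
                 provider: Callable[[str], Iterator[Link]],
                 archive: str) -> List[Link]:
    """Hypothetical helper: reuse rows stored in `db` when the argument hash matches."""
    db.execute('CREATE TABLE IF NOT EXISTS hash (value TEXT)')
    db.execute('CREATE TABLE IF NOT EXISTS links (url TEXT, link_text TEXT)')

    new_hash = repr((archive,))  # stand-in for hashing the function's arguments
    row = db.execute('SELECT value FROM hash').fetchone()
    if row is not None and row[0] == new_hash:
        # Hash matches: turn the stored rows back into Link objects.
        return [Link(*r) for r in db.execute('SELECT url, link_text FROM links')]

    # Hash mismatch (or empty cache): call the provider and store fresh data.
    items = list(provider(archive))
    with db:  # commit the new rows and the new hash together
        db.execute('DELETE FROM hash')
        db.execute('DELETE FROM links')
        db.executemany('INSERT INTO links VALUES (?, ?)', items)
        db.execute('INSERT INTO hash VALUES (?)', (new_hash,))
    return items


calls = []  # records how often the slow provider really runs


def slow_provider(archive: str) -> Iterator[Link]:
    calls.append(archive)
    for i in range(3):
        yield Link(url=f'http://link{i}.org', text=f'text {i}')


db = sqlite3.connect(':memory:')
first = cached_links(db, slow_provider, 'wikipedia_20190830.zip')
second = cached_links(db, slow_provider, 'wikipedia_20190830.zip')  # served from the database
```

After the second call, calls still holds a single entry, since the provider was skipped. Calling with a different archive name changes the hash and triggers a recompute.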
Features
- supports primitive types: str, int, float, bool, datetime, date
- supports Optional
- supports nested datatypes
- supports return type inference
- detects datatype schema changes and discards old data automatically
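To give a feel for how a schema might be inferred from type annotations, here is a rough sketch that flattens a dataclass (including Optional fields and nested dataclasses) into sqlite column definitions. The helper names, the column naming scheme, and the exact type mapping are assumptions for illustration; cachew's real mapping may differ.

```python
from dataclasses import dataclass, is_dataclass
from datetime import date, datetime
from typing import List, Optional, Tuple, Union, get_args, get_origin, get_type_hints

# Assumed mapping from Python primitives to sqlite column types.
SQLITE_TYPES = {str: 'TEXT', int: 'INTEGER', float: 'REAL', bool: 'INTEGER',
                datetime: 'TEXT', date: 'TEXT'}


def strip_optional(tp):
    """Return (inner_type, True) for Optional[X], and (tp, False) otherwise."""
    if get_origin(tp) is Union:
        args = get_args(tp)
        inner = [a for a in args if a is not type(None)]
        if len(args) == 2 and len(inner) == 1:
            return inner[0], True
    return tp, False


def columns(cls, prefix: str = '') -> List[Tuple[str, str]]:
    """Flatten a dataclass into (column_name, sqlite_type) pairs."""
    cols = []
    for name, tp in get_type_hints(cls).items():
        tp, _ = strip_optional(tp)
        if is_dataclass(tp):
            # Nested dataclass: recurse and prefix its columns with the field name.
            cols.extend(columns(tp, prefix=f'{prefix}{name}_'))
        else:
            cols.append((prefix + name, SQLITE_TYPES[tp]))
    return cols


@dataclass
class Author:
    name: str
    born: Optional[date]


@dataclass
class Post:
    title: str
    score: float
    author: Author


schema = columns(Post)
```

Because the column list is derived purely from the annotations, comparing it against the columns of an existing table is one plausible way to notice that a datatype has changed and the old data should be discarded.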
Using
See the docstring for up-to-date documentation on parameters and return types. You can also use the extensive unit tests as a reference.
Some highlights:
- cache_path can be a filename, or you can specify a callable that returns a path depending on the function's arguments. It's not required to specify the path (it will be created in /tmp), but it is recommended.
- hashf by default just hashes all the arguments; you can also specify a custom callable. For instance, it can be used to discard the cache if the input file was modified.
- cls is deduced from return type annotations by default, but can be specified if you don't control the code you want to cache.
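As an illustration of the first two parameters, the sketch below defines a hash function that changes whenever the input file is modified, and a callable that derives a cache path from the function's argument. Both helpers (mtime_hash and cache_path_for) are hypothetical, and it is an assumption here that such callables receive the same arguments as the decorated function; check the docstring for the exact contract.

```python
import tempfile
from pathlib import Path


def mtime_hash(archive) -> str:
    """Hypothetical hashf: the value changes when the file's mtime or size changes."""
    st = Path(archive).stat()
    return f'{archive}:{st.st_mtime_ns}:{st.st_size}'


def cache_path_for(archive) -> Path:
    """Hypothetical cache_path callable: one database file per input archive."""
    return Path(tempfile.gettempdir()) / (Path(archive).stem + '.sqlite')


# Intended use (parameter names as described above; callable signatures assumed):
# @cachew(cache_path=cache_path_for, hashf=mtime_hash)
# def extract_links(archive: str) -> Iterator[Link]: ...

with tempfile.TemporaryDirectory() as td:
    archive = Path(td) / 'archive.zip'
    archive.write_bytes(b'v1')
    before = mtime_hash(archive)
    unchanged = mtime_hash(archive)      # same file state, same hash
    archive.write_bytes(b'version 2')    # modify the input file
    after = mtime_hash(archive)          # different size, so a different hash
```

Including the file size alongside the modification time keeps the hash sensitive to edits even on filesystems with coarse timestamp resolution.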
Installing
Package is available on pypi.
pip install cachew
Developing
I'm using tox to run tests, and CircleCI for continuous integration.
Implementation
- why tuples and dataclasses?
  Tuples are natural in Python for quickly grouping together return results. NamedTuple and dataclass specifically provide a very straightforward and self-documenting way to represent a bit of data in Python. The very compact syntax makes them extremely convenient even as a one-off means of communicating between a couple of functions. If you want to find out more about why you should use more dataclasses in your code, I suggest these links: What are data classes?, basic data classes.
- why not pickle?
  Pickling is a bit heavyweight for a plain data class. There are many reports of pickle being slower than even JSON, and it's also a security risk. Lastly, it can only be loaded via Python.
- why sqlite database for storage?
  It's pretty efficient, and a sequence of namedtuples maps onto database rows in a very straightforward manner.
- why not pandas.DataFrame?
  DataFrames are great and can be serialised to csv or pickled. They are good to have as one of the ways you can interface with your data, but they are hardly convenient to reason about abstractly due to their dynamic nature. They also can't be nested.
- why not ORM?
  ORMs tend to be pretty invasive, which might complicate your scripts or even ruin performance. An ORM is also somewhat of an overkill for such a specific purpose.
  - E.g. SQLAlchemy requires you to use custom SQLAlchemy-specific types and inherit from a base class. It also doesn't support nested types.
- why not marshmallow?
  Marshmallow is a common way to map data into a db-friendly format, but it requires an explicit schema, which is an overhead when you already have it in the form of type annotations. I've looked at existing projects that utilise type annotations, but didn't find any that covered everything I wanted.