Easy sqlite-backed persistent cache for dataclasses
Cachew: quick NamedTuple/dataclass cache
TLDR: cachew can persistently cache any sequence (an Iterator) over NamedTuples or dataclasses into an sqlite database on your disk. Database schema is automatically inferred from type annotations (PEP 526).
It works in a similar manner to functools.lru_cache: caching your data is just a matter of decorating your function. The difference from functools.lru_cache is that the data is preserved between program runs.
Motivation
I often find myself processing big chunks of data, computing some aggregates on it or extracting only the bits I'm interested in. While I try to use the REPL as much as I can, some things are still fragile and often you just have to rerun the whole thing in the process of development. This can be frustrating if data parsing and processing takes seconds, let alone minutes in some cases.
The conventional way of dealing with it is serializing results along with some sort of hash (e.g. md5) of the input files, comparing it on the next run and returning the cached data if nothing has changed.
Simple as it sounds, it is pretty tedious to do every time you need to memoize some data; it clutters your code with routine and distracts you from your main task.
Example
Imagine you're working on a data analysis pipeline for some huge dataset, say, extracting urls and their titles from a Wikipedia archive.
Parsing it (the extract_links function) takes hours, however, the archive is presumably updated fairly infrequently.
With this library you can achieve it through a single @cachew decorator.
>>> from cachew import cachew
>>> from typing import NamedTuple, Iterator
>>> class Link(NamedTuple):
...     url: str
...     text: str
...
>>> @cachew
... def extract_links(archive: str) -> Iterator[Link]:
...     for i in range(5):
...         import time; time.sleep(1)  # simulate slow IO
...         yield Link(url=f'http://link{i}.org', text=f'text {i}')
...
>>> list(extract_links(archive='wikipedia_20190830.zip')) # that would take about 5 seconds on first run
[Link(url='http://link0.org', text='text 0'), Link(url='http://link1.org', text='text 1'), Link(url='http://link2.org', text='text 2'), Link(url='http://link3.org', text='text 3'), Link(url='http://link4.org', text='text 4')]
>>> from timeit import Timer
>>> res = Timer(lambda: list(extract_links(archive='wikipedia_20190830.zip'))).timeit(number=1) # second run is cached, so should take less time
>>> print(f"took {int(res)} seconds to query cached items")
took 0 seconds to query cached items
How it works
Basically, your data objects get flattened out, and Python types are mapped onto sqlite types and back.
When the function is called, cachew computes the hash of your function's arguments and compares it against the previously stored hash value.
If they match, it deserializes and yields whatever is stored in the cache database; if the hashes don't match, the original data provider is called and the new data is stored along with the new hash.
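To make this concrete, here is a minimal, hypothetical sketch of the same idea. It is not cachew's actual implementation; the table layout and function names are made up for illustration, and the hashing is deliberately naive.

import hashlib
import sqlite3
from typing import Iterator, NamedTuple

class Link(NamedTuple):
    url: str
    text: str

def compute_links(archive: str) -> Iterator[Link]:
    # stand-in for the slow original data provider
    for i in range(5):
        yield Link(url=f'http://link{i}.org', text=f'text {i}')

def cached_links(archive: str, db_path: str = '/tmp/cachew_sketch.sqlite') -> Iterator[Link]:
    new_hash = hashlib.md5(repr(archive).encode()).hexdigest()
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS hash  (value TEXT)')
    conn.execute('CREATE TABLE IF NOT EXISTS links (url TEXT, text TEXT)')
    stored = conn.execute('SELECT value FROM hash').fetchone()
    if stored is not None and stored[0] == new_hash:
        # hash matches: replay the flattened rows straight from sqlite
        for url, text in conn.execute('SELECT url, text FROM links'):
            yield Link(url=url, text=text)
    else:
        # hash mismatch (or first run): recompute, overwrite the rows and the stored hash
        conn.execute('DELETE FROM links')
        conn.execute('DELETE FROM hash')
        for link in compute_links(archive):
            conn.execute('INSERT INTO links VALUES (?, ?)', link)
            yield link
        conn.execute('INSERT INTO hash VALUES (?)', (new_hash,))
        conn.commit()
    conn.close()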
Features
- supports primitive types: str, int, float, bool, datetime, date
- supports Optional
- supports nested datatypes
- supports return type inference: 1, 2
- detects datatype schema changes and discards old data automatically
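For example, something along these lines (a hypothetical illustration, following the same pattern as the example above) should be cacheable thanks to the Optional and nested datatype support:

from dataclasses import dataclass
from datetime import datetime
from typing import Iterator, Optional

from cachew import cachew

@dataclass
class Author:
    name: str
    born: Optional[datetime]  # Optional fields are supported

@dataclass
class Article:
    title: str
    author: Author            # nested datatypes are supported too

@cachew
def articles() -> Iterator[Article]:  # the return type annotation is used to infer the schema
    yield Article(title='On caching', author=Author(name='Alice', born=None))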
Using
See the docstring for up-to-date documentation on parameters and return types. You can also use the extensive unit tests as a reference.
Some highlights:
- cache_path can be a filename, or you can specify a callable that returns the path depending on the function's arguments. It's not required to specify the path (it will be created in /tmp), but it is recommended.
- hashf by default just hashes all the arguments; you can also specify a custom callable. For instance, it can be used to discard the cache if the input file was modified (see the sketch after this list).
- cls is deduced from the return type annotation by default, but can be specified explicitly if you don't control the code you want to cache.
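As a rough sketch of how these parameters might be combined (the exact callable signatures may differ between versions, so treat this as an assumption and check the docstring):

from pathlib import Path
from typing import Iterator, NamedTuple

from cachew import cachew

class Link(NamedTuple):
    url: str
    text: str

# hypothetical sketch based on the parameters described above;
# the cache path depends on the argument, and the custom hash
# discards the cache when the input file's mtime changes
@cachew(
    cache_path=lambda archive: f'/tmp/links_{Path(archive).stem}.sqlite',
    hashf=lambda archive: (archive, Path(archive).stat().st_mtime),
)
def extract_links(archive: str) -> Iterator[Link]:
    # ... parse the archive here ...
    yield Link(url='http://example.org', text='example')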
Installing
The package is available on PyPI.
pip install cachew
Developing
I'm using tox to run the tests, and CircleCI.
Implementation
- why tuples and dataclasses?
Tuples are natural in Python for quickly grouping together return results. NamedTuple and dataclass specifically provide a very straightforward and self-documenting way to represent a bit of data in Python. The very compact syntax makes them extremely convenient even as a one-off means of communication between a couple of functions. If you want to find out more about why you should use dataclasses in your code, I suggest these links: What are data classes?, basic data classes.
- why not pickle?
Pickling is a bit heavyweight for plain data classes. There are many reports of pickle being slower than even JSON, and it's also a security risk. Lastly, it can only be loaded via Python.
- why an sqlite database for storage?
It's pretty efficient, and a sequence of namedtuples maps onto database rows in a very straightforward manner.
- why not pandas.DataFrame?
DataFrames are great and can be serialised to csv or pickled. They are good to have as one of the ways you can interface with your data; however, they are hardly convenient to reason about abstractly due to their dynamic nature. They also can't be nested.
- why not an ORM?
ORMs tend to be pretty invasive, which might complicate your scripts or even ruin performance. It's also somewhat overkill for such a specific purpose.
- E.g. SQLAlchemy requires using custom sqlalchemy-specific types and inheriting a base class. It also doesn't support nested types.
- why not marshmallow?
Marshmallow is a common way to map data into a db-friendly format, but it requires an explicit schema, which is an overhead when you already have it in the form of type annotations. I've looked at existing projects that utilise type annotations, but didn't find them covering everything I wanted.