
Framework agnostic backends for MCMC sample storage


Where do you want to store your MCMC draws? In memory? On disk? Or in a database running in a datacenter?

No matter where you want to put them, or which PPL generates them: McBackend takes care of your MCMC samples.

Quickstart

The mcbackend package consists of three parts:

Part 1: A schema for MCMC run & chain metadata

No matter which programming language your favorite PPL is written in, the ProtocolBuffers from McBackend can be used to generate code in languages like C++, C#, Python and many more to represent commonly used metadata about MCMC runs, chains and model variables.

The definitions in protobufs/meta.proto are designed to maximize compatibility with ArviZ objects, making it easy to transform MCMC draws stored according to the McBackend schema to InferenceData objects for plotting & analysis.
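A hypothetical, simplified Python mirror of this kind of metadata can illustrate the idea; the class and field names below are made up for illustration, and the authoritative definitions live in protobufs/meta.proto:

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch only; the real schema is defined in protobufs/meta.proto.
@dataclass
class Variable:
    """Metadata about one model variable."""
    name: str
    dtype: str
    shape: List[int] = field(default_factory=list)
    dims: List[str] = field(default_factory=list)

@dataclass
class RunMeta:
    """Metadata about one MCMC run."""
    run_id: str
    variables: List[Variable] = field(default_factory=list)

meta = RunMeta(
    run_id="example-linear-model",
    variables=[Variable(name="beta", dtype="float64", shape=[3], dims=["predictor"])],
)
```

Because the dims/shape information travels with the draws, a consumer such as ArviZ can label the resulting arrays without re-inspecting the model.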

Part 2: A storage backend interface

The draws and stats created by MCMC sampling algorithms at runtime need to be stored somewhere.

This "somewhere" is called the storage backend in PPLs/MCMC frameworks like PyMC or emcee.

Most storage backends must be initialized with metadata about the model variables so they can, for example, pre-allocate memory for the draws and stats they are about to receive. After receiving thousands of draws and stats, they must provide methods by which the draws/stats can be retrieved.

The mcbackend.core module has classes such as Backend, Run, and Chain to define these interfaces for any storage backend, no matter if it's an in-memory, filesystem or database storage. Although this implementation is currently Python-only, the interface signatures should be portable to, for example, C++.
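To make the Backend → Run → Chain layering concrete, here is a hypothetical, minimal in-memory illustration of the pattern. The class and method names are invented for this sketch; the real interfaces are defined in mcbackend.core.

```python
# Illustrative sketch of the Backend/Run/Chain pattern; not the real API.
class DictChain:
    """Stores draws per variable name in plain Python lists."""

    def __init__(self):
        self._draws = {}

    def append(self, draw: dict):
        # One draw is a mapping from variable name to value.
        for name, value in draw.items():
            self._draws.setdefault(name, []).append(value)

    def get_draws(self, var_name: str) -> list:
        return self._draws[var_name]


class DictRun:
    """One MCMC run groups several chains."""

    def __init__(self):
        self._chains = []

    def init_chain(self) -> DictChain:
        chain = DictChain()
        self._chains.append(chain)
        return chain

    def get_chains(self) -> tuple:
        return tuple(self._chains)


class DictBackend:
    """Maps run IDs to runs; a real backend would talk to disk or a database."""

    def __init__(self):
        self._runs = {}

    def init_run(self, run_id: str) -> DictRun:
        run = DictRun()
        self._runs[run_id] = run
        return run

    def get_run(self, run_id: str) -> DictRun:
        return self._runs[run_id]
```

A real backend swaps the dictionaries for NumPy arrays, files, or database tables, but keeps the same layering: the backend hands out runs, runs hand out chains, and chains receive and serve draws.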

Via mcbackend.backends the McBackend package then provides backend implementations. Currently you may choose from:

backend = mcbackend.NumPyBackend()
backend = mcbackend.ClickHouseBackend(client=clickhouse_driver.Client("localhost"))

# All that matters:
isinstance(backend, mcbackend.Backend)
# >>> True

Part 3: PPL adapters

Anything that is a Backend can be wrapped by an adapter that makes it compatible with your favorite PPL.

In the example below, a ClickHouseBackend is initialized to store MCMC draws from a PyMC model in a ClickHouse database. See below for how to run it in Docker.

import clickhouse_driver
import mcbackend
import pymc as pm

# 1. Create _any_ kind of backend
ch_client = clickhouse_driver.Client("localhost")
backend = mcbackend.ClickHouseBackend(ch_client)

with pm.Model():
    # 2. Create your model
    ...
    # 3. Wrap the PyMC adapter around an `mcbackend.Backend`.
    #    This generates and prints a short `trace.run_id` by which
    #    this MCMC run is identified in the (database) backend.
    trace = mcbackend.pymc.TraceBackend(backend)

    # 4. Hit the inference button ™
    pm.sample(trace=trace)

Instead of using PyMC's built-in NumPy backend, the MCMC draws now end up in ClickHouse.

Retrieving the draws & stats

Continuing the example from above, we can now retrieve draws from the backend.

Note that because this example wrote the draws to ClickHouse, we could run the code below on another machine, and even while the model above is still sampling!

backend = mcbackend.ClickHouseBackend(ch_client)

# Fetch the run from the database (downloads just metadata)
run = backend.get_run(trace.run_id)

# Get all draws from a chain
chain = run.get_chains()[0]
chain.get_draws("my favorite variable")
# >>> array([ ... ])

# Convert everything to `InferenceData`
idata = run.to_inferencedata()
print(idata)
# >>> Inference data with groups:
# >>> 	> posterior
# >>> 	> sample_stats
# >>> 	> observed_data
# >>> 	> constant_data
# >>>
# >>> Warmup iterations saved (warmup_*).
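ArviZ is the natural consumer of this data, but because get_draws returns plain arrays, quick summaries can also be computed directly. A sketch with made-up draw values standing in for the retrieved chain:

```python
import statistics

# Stand-in values for what chain.get_draws("my favorite variable") returns.
draws = [0.12, 0.31, 0.22, 0.41, 0.27]

posterior_mean = statistics.fmean(draws)
posterior_sd = statistics.stdev(draws)
```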

Contributing what's next

McBackend is a young project and is looking for contributions. For example:

  • Schema discussion: Which metadata is needed? (related: PyMC #5160)
  • Interface discussion: How should Backend/Run/Chain evolve?
  • Python Backends for disk storage (HDF5, *.proto, ...)
  • An emcee adapter (#11).
  • C++ Backend/Run/Chain interfaces
  • C++ ClickHouse backend (via clickhouse-cpp)
  • A webinterface that goes beyond the Streamlit proof-of-concept (see mcbackend-server)

As the schema and API stabilize, a mid-term goal might be to replace PyMC's BaseTrace/MultiTrace entirely and rely on mcbackend instead.

Getting rid of MultiTrace was a long-term goal behind making pm.sample(return_inferencedata=True) the default.

Development

First clone the repository and install mcbackend locally:

pip install -e .

To run the tests:

pip install -r requirements-dev.txt
pytest -v

Some tests need a ClickHouse database server running locally. To start one in Docker:

docker run --detach --rm --name mcbackend-db -p 9000:9000 --ulimit nofile=262144:262144 yandex/clickhouse-server

Compiling the ProtocolBuffers

If you don't already have it, first install the protobuf compiler:

conda install protobuf

To compile the *.proto files for languages other than Python, check the ProtocolBuffers documentation.

The following script compiles them for Python using the betterproto compiler plugin to get nice-looking dataclasses. It also copies the generated files to the right place in mcbackend.

python protobufs/generate.py

Experimental: mcbackend-server

This repository also includes an experimental Streamlit app for querying the ClickHouse backend and creating ArviZ plots while an MCMC run is still in progress.

⚠ This part will eventually move into its own repository. ⚠

First build the Docker image:

docker build -t mcbackend-server:0.1.0 .

Then start the container. The following commands should be executed in the root path of the repository.

⚠ You may need to adapt the DB_HOST line. ⚠

On Windows:

docker run ^
  --rm --name mcbackend-server ^
  -p 8501:8501 ^
  -e DB_HOST=%COMPUTERNAME% ^
  -v %cd%:/mcbackend ^
  -v %cd%/mcbackend-server/app.py:/mcbackend-server/app.py ^
  mcbackend-server:0.1.0

On Linux:

docker run \
  --rm --name mcbackend-server \
  -p 8501:8501 \
  -e DB_HOST=$(hostname) \
  -v $(pwd):/mcbackend \
  -v $(pwd)/mcbackend-server/app.py:/mcbackend-server/app.py \
  mcbackend-server:0.1.0
