beakers

beakers is an experimental, lightweight, declarative ETL framework for Python.

Right now this is an experiment to explore some ideas around ETL.

It is still very experimental, with no stability guarantees. If you're interested in poking around, thoughts and feedback are welcome. Please reach out before contributing code, though, as a lot is still in flux.

(Intended) Features

  • Declarative ETL graph composed of Python functions & Pydantic models
  • Developer-friendly CLI for running processes
  • Synchronous mode for ease of debugging or simple pipelines
  • Data checkpoints stored in local database for intermediate caching & resuming interrupted runs
  • Asynchronous task execution
  • Support for multiple backends (sqlite, postgres, etc.)
  • Robust error handling, including retries
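
To make the checkpointing feature concrete, here is a minimal sketch of the general idea using only the standard library. It is not beakers' actual storage layer: the table name, the `process` function, and `run_with_checkpoints` are all hypothetical, chosen just to show how storing each result in a local database lets a second run skip work that already finished.

```python
import sqlite3


def process(item: str) -> str:
    """Stand-in for an expensive step (hypothetical)."""
    return item.upper()


def run_with_checkpoints(conn: sqlite3.Connection, items: list[str]) -> int:
    """Process only the items with no stored result; return how many were computed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results (item TEXT PRIMARY KEY, output TEXT)"
    )
    computed = 0
    for item in items:
        already_done = conn.execute(
            "SELECT 1 FROM results WHERE item = ?", (item,)
        ).fetchone()
        if already_done is not None:
            continue  # checkpoint hit: skip work finished in an earlier run
        conn.execute(
            "INSERT INTO results (item, output) VALUES (?, ?)",
            (item, process(item)),
        )
        conn.commit()  # commit per item so an interruption loses at most one result
        computed += 1
    return computed


conn = sqlite3.connect(":memory:")  # a file path would persist between runs
first = run_with_checkpoints(conn, ["a", "b"])
second = run_with_checkpoints(conn, ["a", "b", "c"])  # only "c" is new
```

Committing after every item trades some throughput for the ability to resume from the exact point of interruption.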

Guiding Principles

  • Lightweight - Writing a single python file should be enough to get started. It should be as easy to use as a script in that sense.
  • Data-centric - Looking at the definition should make it clear what data exists at what step.
  • Modern Python - Take full advantage of recent additions to Python, including type hints, asyncio, and libraries like pydantic.
  • Developer Experience - The focus should be on the developer experience, including a nice CLI and helpful error messages.

Anti-Principles

Unlike most tools in this space, this is not a complete "enterprise grade" ETL solution.

It isn't a perfect analogy by any means, but beakers strives to be to Luigi what Flask is to Django. If you are building your entire business around ETL, it makes sense to invest in the infrastructure & tooling to make that work. Structuring your code around beakers may still make it easier to migrate to one of those tools later than if you had written a bespoke script. Plus, beakers is Python, so you can always start by running it from within a bigger framework.

Concepts

Like most ETL tools, beakers is built around a directed acyclic graph (DAG).

The nodes on this graph are known as "beakers", and the edges are often called "transforms".

(Note: These names aren't final, suggestions welcome.)
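
As a rough illustration of the graph idea (not the beakers API), the sketch below describes a small pipeline as a mapping from each node to the nodes it reads from, and lets the standard library's `graphlib` work out a valid processing order. The node names `url`, `html`, and `article` are made up for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical beaker names. Each key maps to the set of beakers it reads from,
# so every (upstream, downstream) pair is one edge in the DAG.
dependencies = {
    "url": set(),         # no upstream beaker
    "html": {"url"},      # edge: url -> html
    "article": {"html"},  # edge: html -> article
}

# static_order() yields every node after all of the nodes it depends on.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)  # ['url', 'html', 'article']
```

Because the graph must be acyclic, `TopologicalSorter` raises `graphlib.CycleError` if an edge is added that points back upstream.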

Beakers

Each node in the graph is called a "beaker". A beaker is a container for some data.

Each beaker has a name and a type. The name is used to refer to the beaker elsewhere in the graph. The type, represented by a pydantic model, defines the structure of the data. By leveraging pydantic we get a lot of nice features for free, like validation and serialization.
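
A minimal sketch of what "a name and a type" can look like, using pydantic directly rather than any beakers-specific API. The `Page` model and the `"page"` name are hypothetical; the point is only that the model rejects records that do not match the declared structure.

```python
from pydantic import BaseModel, ValidationError


class Page(BaseModel):
    """Hypothetical record type for a beaker named "page"."""

    url: str
    status_code: int


# A beaker pairs a name with the model that describes its records.
beaker_name, beaker_type = "page", Page

good = beaker_type(url="https://example.com", status_code=200)

try:
    beaker_type(url="https://example.com", status_code="not a number")
    rejected = False
except ValidationError:
    rejected = True  # pydantic refuses data that does not fit the model
```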

Transform

Edges in the graph represent dataflow between beakers. Each edge has a concept of a "source beaker" and a "destination beaker".

These come in two main flavors:

  • Transform - A transform places new data in the destination beaker based on data already in the source beaker. An example of this might be a transform that takes a list of URLs and downloads the HTML for each one, placing the results in a new beaker.

  • Filter - A filter can be used to stop the flow of data from one beaker to another based on some criteria.
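
The two kinds of edge can be sketched with plain Python lists standing in for beakers. The helper names `run_transform` and `run_filter` are hypothetical and not part of beakers, and the "download" step is faked with string formatting so the example runs offline.

```python
from typing import Callable, Iterable


def run_transform(source: Iterable[str], func: Callable[[str], str]) -> list[str]:
    """Build the destination beaker by applying func to every source item."""
    return [func(item) for item in source]


def run_filter(source: Iterable[str], keep: Callable[[str], bool]) -> list[str]:
    """Pass along only the source items for which keep() returns True."""
    return [item for item in source if keep(item)]


urls = ["https://example.com/a", "http://example.com/b"]

# Filter edge: stop non-HTTPS URLs from flowing any further.
secure_urls = run_filter(urls, lambda u: u.startswith("https://"))

# Transform edge: a stand-in for "download the HTML for each URL".
pages = run_transform(secure_urls, lambda u: f"<html>{u}</html>")
```

Neither helper modifies its input, so the source beaker's contents remain available to any other edge that reads from it.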

Seed

A concept somewhat unique to beakers is the "seed". A seed is a function that returns initial data for a beaker.

This is useful for things like starting the graph with a list of URLs to scrape, or a list of images to process.

A beaker can have any number of seeds. For example, one seed might provide a short list of URLs for testing, while another reads from a database.
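
One way to picture several seeds feeding the same beaker is a small registry of named functions. Everything here (`seeds_by_beaker`, `run_seed`, the seed names) is a hypothetical illustration rather than the beakers interface, and the "database" seed simply generates URLs so the sketch stays self-contained.

```python
from typing import Callable, Iterable

SeedFunc = Callable[[], Iterable[str]]

# Hypothetical registry: beaker name -> seed name -> seed function.
seeds_by_beaker: dict[str, dict[str, SeedFunc]] = {"url": {}}


def testing_urls() -> list[str]:
    """A short list that is handy while developing."""
    return ["https://example.com/"]


def all_urls() -> list[str]:
    """Stand-in for a seed that would read from a database."""
    return [f"https://example.com/page/{n}" for n in range(1, 4)]


seeds_by_beaker["url"]["testing"] = testing_urls
seeds_by_beaker["url"]["all"] = all_urls


def run_seed(beaker: str, seed_name: str, store: dict[str, list[str]]) -> int:
    """Load one seed's items into the named beaker; return how many were added."""
    items = list(seeds_by_beaker[beaker][seed_name]())
    store.setdefault(beaker, []).extend(items)
    return len(items)


store: dict[str, list[str]] = {}
added = run_seed("url", "testing", store)
```

Choosing the seed by name at run time is what makes it cheap to switch between a small test input and the full data set.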

