Framework to develop data pipelines from files on disk to a full dissemination API

Maggma

What is Maggma

Maggma is a framework to build scientific data processing pipelines from data stored in a variety of formats -- databases, Azure Blobs, files on disk, etc., all the way to a REST API. The rest of this README contains a brief, high-level overview of what maggma can do. For more, please refer to the documentation.

Installation from PyPI

Maggma is published on the Python Package Index (PyPI). The preferred tool for installing packages from PyPI is pip, which is included with all modern versions of Python.

Open your terminal and run the following command.

pip install --upgrade maggma
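To confirm that the install succeeded, one option is to ask Python's packaging metadata for the installed version. The sketch below uses only the standard library; the helper name `installed_version` is made up for this example and is not part of maggma.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("maggma")
    print(f"maggma {found}" if found else "maggma is not installed")
```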

Basic Concepts

maggma's core classes -- Store and Builder -- provide building blocks for modular data pipelines. Data resides in one or more Stores and is processed by a Builder. The results of the processing are saved in another Store, and so on:

flowchart LR
    s1(Store 1) --Builder 1--> s2(Store 2) --Builder 2--> s3(Store 3)
    s2 --Builder 3--> s4(Store 4)
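The branching layout in the flowchart can be mimicked in a few lines of plain Python. This is only a toy sketch: each "store" is a named list of dicts and each "builder" is a function call. The names `stores` and `run_step` are invented for illustration and do not come from maggma.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]

# Each "store" is just a named list of documents (an assumption for this sketch).
stores: Dict[str, List[Doc]] = {
    "store_1": [{"n": 1}, {"n": 2}, {"n": 3}],
    "store_2": [],
    "store_3": [],
    "store_4": [],
}


def run_step(source: str, target: str, transform: Callable[[Doc], Doc]) -> None:
    """Read every document from `source`, transform it, and write it to `target`."""
    stores[target] = [transform(doc) for doc in stores[source]]


# Builder 1: Store 1 -> Store 2
run_step("store_1", "store_2", lambda d: {"n": d["n"], "square": d["n"] ** 2})
# Builder 2 and Builder 3 both read from Store 2 but write to different targets.
run_step("store_2", "store_3", lambda d: {"n": d["n"], "even_square": d["square"] % 2 == 0})
run_step("store_2", "store_4", lambda d: {"n": d["n"], "cube": d["square"] * d["n"]})
```

Because Store 2 is only read by the later steps, Builder 2 and Builder 3 do not depend on each other and could run in either order.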

Store

A major challenge in building scalable data pipelines is dealing with the many different types of data sources. Maggma's Store class provides a consistent, unified interface for querying data from arbitrary data sources. It was originally built around MongoDB, so its interface closely resembles PyMongo syntax. However, Maggma makes it possible to use that same syntax to query other kinds of data sources, such as Amazon S3, GridFS, files on disk, and many others. Stores implement methods to connect, query, find distinct values, group by fields, update documents, and remove documents.

The example below demonstrates inserting 4 documents (Python dicts) into a MongoStore with update, then accessing the data using count, query_one, and distinct.

>>> from maggma.stores import MongoStore
>>> turtles = [{"name": "Leonardo", "color": "blue", "tool": "sword"},
...            {"name": "Donatello", "color": "purple", "tool": "staff"},
...            {"name": "Michelangelo", "color": "orange", "tool": "nunchuks"},
...            {"name": "Raphael", "color": "red", "tool": "sai"},
...            ]
>>> store = MongoStore(database="my_db_name",
...                    collection_name="my_collection_name",
...                    username="my_username",
...                    password="my_password",
...                    host="my_hostname",
...                    port=27017,
...                    key="name",
...                    )
>>> store.connect()  # keep the connection open for the calls below
>>> store.update(turtles)
>>> store.count()
4
>>> store.query_one({})
{'_id': ObjectId('66746d29a78e8431daa3463a'), 'name': 'Leonardo', 'color': 'blue', 'tool': 'sword'}
>>> store.distinct('color')
['purple', 'orange', 'blue', 'red']
>>> store.close()
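The `key` argument matters because `update` matches documents on that field, so writing a document whose key already exists replaces the old one instead of adding a duplicate. The toy class below sketches that behaviour in memory, with no database required. `ToyStore` is a hypothetical stand-in written for this example; it is not maggma's implementation and supports only exact-match criteria.

```python
from typing import Any, Dict, Iterator, List, Optional

Doc = Dict[str, Any]


class ToyStore:
    """In-memory stand-in for a Store; NOT maggma's implementation."""

    def __init__(self, key: str) -> None:
        self.key = key
        self._docs: Dict[Any, Doc] = {}

    def update(self, docs: List[Doc]) -> None:
        # Upsert: a document whose key already exists replaces the old one.
        for doc in docs:
            self._docs[doc[self.key]] = dict(doc)

    def query(self, criteria: Optional[Doc] = None) -> Iterator[Doc]:
        # Only exact-match criteria are supported in this sketch.
        criteria = criteria or {}
        for doc in self._docs.values():
            if all(doc.get(k) == v for k, v in criteria.items()):
                yield dict(doc)

    def count(self, criteria: Optional[Doc] = None) -> int:
        return sum(1 for _ in self.query(criteria))

    def distinct(self, field: str) -> List[Any]:
        # Sorted here for a stable result; a real database may return any order.
        return sorted({doc[field] for doc in self._docs.values() if field in doc})


store = ToyStore(key="name")
store.update([
    {"name": "Leonardo", "color": "blue", "tool": "sword"},
    {"name": "Raphael", "color": "red", "tool": "sai"},
])
# Updating with an existing key overwrites rather than duplicates.
store.update([{"name": "Raphael", "color": "red", "tool": "twin sai"}])
```

After the second `update`, the store still holds two documents, and the Raphael entry carries the new `tool` value.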

Builder

Builders represent a data processing step, analogous to an extract-transform-load (ETL) operation in a data warehouse model. Much like Store provides a consistent interface for accessing data, the Builder classes provide a consistent interface for transforming it. Each Builder transformation is broken into 3 phases: get_items, process_item, and update_targets:

  1. get_items: Retrieve items from the source Store(s) for processing by the next phase
  2. process_item: Manipulate the input item and create an output document that is sent to the next phase for storage.
  3. update_targets: Add the processed item to the target Store(s).

Both get_items and update_targets can perform IO (input/output) to the data stores. process_item is expected to not perform any IO so that it can be parallelized by Maggma. Builders can be chained together into an array and then saved as a JSON file to be run on a production system.
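The split between the IO phases and the pure phase can be sketched with the standard library alone. In the toy class below, plain lists stand in for the source and target Stores, and a thread pool stands in for whatever parallel execution Maggma uses. `ToyBuilder` and its `run` method are hypothetical and written only for this illustration; they are not maggma's Builder class.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, Iterator, List

Doc = Dict[str, Any]


class ToyBuilder:
    """Illustrative three-phase builder; NOT maggma's Builder class."""

    def __init__(self, source: List[Doc], target: List[Doc]) -> None:
        self.source = source
        self.target = target

    def get_items(self) -> Iterator[Doc]:
        # Phase 1 (IO allowed): read items from the source.
        yield from self.source

    def process_item(self, item: Doc) -> Doc:
        # Phase 2 (no IO): depends only on its input, so calls can run in parallel.
        return {"name": item["name"], "name_length": len(item["name"])}

    def update_targets(self, items: List[Doc]) -> None:
        # Phase 3 (IO allowed): write processed items to the target.
        self.target.extend(items)

    def run(self, workers: int = 2) -> None:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Executor.map returns results in input order.
            processed = list(pool.map(self.process_item, self.get_items()))
        self.update_targets(processed)


source = [{"name": "Leonardo"}, {"name": "Donatello"}]
target: List[Doc] = []
ToyBuilder(source, target).run()
```

Keeping `process_item` free of IO is what makes it safe to hand to the pool: each call depends only on its argument, so the order in which the workers run does not affect the result.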

Origin and Maintainers

Maggma has been developed and is maintained by the Materials Project team at Lawrence Berkeley National Laboratory and the Materials Project Software Foundation.

Maggma is written in Python and supports Python 3.9+.
