A runtime system for NMDC data management and orchestration

Service Status

http://nmdcstatus.polyneme.xyz/

How It Fits In

  • nmdc-metadata tracks issues related to NMDC metadata, which may necessitate work across multiple repos.

  • nmdc-schema houses the LinkML schema specification, as well as generated artifacts (e.g. JSON Schema).

  • nmdc-server houses code specific to the data portal -- its database, back-end API, and front-end application.

  • workflow_documentation references workflow code, spread across several repositories, that takes source data and produces computed data.

  • This repo (nmdc-runtime)

    • houses code that takes source data and computed data, and transforms them to broadly accommodate downstream applications such as the data portal
    • manages execution of the above (i.e., lightweight data transformations) and also of computationally- and data-intensive workflows performed at other sites, ensuring that claimed jobs have access to needed configuration and data resources.

Data exports

The NMDC metadata as of 2021-10 is available here:

https://drs.microbiomedata.org/ga4gh/drs/v1/objects/sys086d541

The link returns a GA4GH DRS API bundle object record, with the NMDC metadata collections (study_set, biosample_set, etc.) as contents, each a DRS API blob object.

For example, the blob for the study_set collection export, named "study_set.jsonl.gz", is listed with DRS API ID "sys0xsry70". Thus, it is retrievable via

https://drs.microbiomedata.org/ga4gh/drs/v1/objects/sys0xsry70

The returned blob object record lists https://nmdc-runtime.files.polyneme.xyz/nmdcdb-mongoexport/2021-10-14/study_set.jsonl.gz as the url for an access method.
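
The lookup described above can be sketched in Python. This is a minimal illustration, not code from nmdc-runtime: the helper names are made up for this example, and the `access_methods` / `access_url` field names are assumed from the GA4GH DRS specification rather than copied from an actual server response. The `example_record` dictionary is hand-written to mirror the text above.

```python
import json
from urllib.request import urlopen

DRS_BASE = "https://drs.microbiomedata.org/ga4gh/drs/v1/objects"


def drs_object_url(object_id: str) -> str:
    """Build the DRS API URL for an object ID such as "sys0xsry70"."""
    return f"{DRS_BASE}/{object_id}"


def access_urls(record: dict) -> list:
    """Collect the URL of every access method listed in a blob object record."""
    urls = []
    for method in record.get("access_methods", []):
        # Field names assumed from the GA4GH DRS specification.
        url = (method.get("access_url") or {}).get("url")
        if url:
            urls.append(url)
    return urls


def fetch_record(object_id: str) -> dict:
    """Fetch a DRS object record over HTTP (needs network access)."""
    with urlopen(drs_object_url(object_id)) as response:
        return json.load(response)


# Hand-written illustration of a blob record, not a captured response.
example_record = {
    "id": "sys0xsry70",
    "name": "study_set.jsonl.gz",
    "access_methods": [
        {
            "type": "https",
            "access_url": {
                "url": "https://nmdc-runtime.files.polyneme.xyz/"
                "nmdcdb-mongoexport/2021-10-14/study_set.jsonl.gz"
            },
        }
    ],
}

print(drs_object_url("sys0xsry70"))
print(access_urls(example_record))
```

With network access, `access_urls(fetch_record("sys0xsry70"))` would perform the same lookup against the live service.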

The 2021-10 exports are currently all accessible at https://nmdc-runtime.files.polyneme.xyz/nmdcdb-mongoexport/2021-10-14/${COLLECTION_NAME}.jsonl.gz. However, the DRS API indirection allows these direct links to change in the future (for example, to support mirroring via other URLs). For that reason, share the DRS API links rather than the direct file URLs.
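
As a rough sketch of working with one of these exports, the snippet below fills in the URL pattern for a collection name and reads a gzipped export one document at a time. It assumes, based only on the ".jsonl.gz" extension, that each file is gzip-compressed with one JSON document per line. The function names are hypothetical, and the sample documents are made up so the example runs without network access.

```python
import gzip
import io
import json

EXPORT_URL_TEMPLATE = (
    "https://nmdc-runtime.files.polyneme.xyz/nmdcdb-mongoexport/"
    "2021-10-14/{collection_name}.jsonl.gz"
)


def export_url(collection_name: str) -> str:
    """Fill in the direct export URL for a collection such as "study_set"."""
    return EXPORT_URL_TEMPLATE.format(collection_name=collection_name)


def iter_jsonl_gz(binary_stream):
    """Yield one parsed JSON document per non-empty line of a gzipped stream."""
    with gzip.open(binary_stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)


# Made-up documents standing in for a real export file.
sample = gzip.compress(b'{"id": "example:1"}\n{"id": "example:2"}\n')
docs = list(iter_jsonl_gz(io.BytesIO(sample)))

print(export_url("study_set"))
print(docs)
```

For a real file, the same generator could be fed the response object returned by `urllib.request.urlopen`, ideally using a URL obtained through the DRS API rather than the direct pattern.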

Overview

The runtime features:

  1. Dagster orchestration:

    • dagit - a web UI to monitor and manage the running system.
    • dagster-daemon - a service that triggers pipeline runs based on time or external state.
    • PostgreSQL database - for storing run history, event logs, and scheduler state.
    • workspace code
      • Code to run is loaded into a Dagster workspace. This code is loaded from one or more Dagster repositories. Each Dagster repository may be run with a different Python virtual environment if need be, and may be loaded from a local Python file or pip-installed from an external source. In our case, each Dagster repository is simply loaded from a Python file local to the nmdc-runtime GitHub repository, and all code is run in the same Python environment.
      • A Dagster repository consists of solids and pipelines, and optionally schedules and sensors.
        • solids represent individual units of computation
        • pipelines are built up from solids
        • schedules trigger recurring pipeline runs based on time
        • sensors trigger pipeline runs based on external state
      • Each pipeline can declare dependencies on any runtime resources or additional configuration. There are TerminusDB and MongoDB resources defined, as well as preset configuration definitions for both "dev" and "prod" modes. The presets tell Dagster to look to a set of known environment variables to load resource configurations, depending on the mode.
  2. A TerminusDB database supporting revision control of schema-validated data.

  3. A MongoDB database supporting write-once, high-throughput internal data storage by the nmdc-runtime FastAPI instance.

  4. A FastAPI service to interface with the orchestrator and database, as a hub for data management and workflow automation.
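
The idea of mode-dependent presets that read resource configuration from known environment variables can be sketched in plain Python. This is not Dagster's preset API and not the configuration used by nmdc-runtime: the variable names, the config keys, and `load_resource_config` are all hypothetical, chosen only to show the lookup pattern.

```python
import os

# Hypothetical environment variable names, one set per mode.
ENV_VARS_BY_MODE = {
    "dev": {
        "host": "EXAMPLE_DEV_MONGO_HOST",
        "dbname": "EXAMPLE_DEV_MONGO_DBNAME",
    },
    "prod": {
        "host": "EXAMPLE_PROD_MONGO_HOST",
        "dbname": "EXAMPLE_PROD_MONGO_DBNAME",
    },
}


def load_resource_config(mode, environ=None):
    """Build a resource config for a mode from its known environment variables."""
    environ = os.environ if environ is None else environ
    try:
        names = ENV_VARS_BY_MODE[mode]
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None
    # Fail early and name every missing variable, rather than the first one.
    missing = sorted(name for name in names.values() if name not in environ)
    if missing:
        raise RuntimeError(f"missing variables for {mode!r} mode: {missing}")
    return {key: environ[name] for key, name in names.items()}


# Demonstration with an explicit mapping instead of the real environment.
fake_environ = {
    "EXAMPLE_DEV_MONGO_HOST": "localhost:27017",
    "EXAMPLE_DEV_MONGO_DBNAME": "example_dev",
}
print(load_resource_config("dev", fake_environ))
```

Passing the environment as an argument keeps the sketch easy to test; in a running service the default of `os.environ` would apply.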

Local Development

Ensure Docker (and Docker Compose) are installed.

Ensure you have a .env file for the Docker services to source from. You may copy .env.example to .env (which is gitignore'd) to get started.

# To load env in your shell session
# export $(grep -v '^#' .env | xargs)
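
If you need the same variables inside a Python session rather than a shell, a small loader along these lines may help. It is a simplified sketch, not part of nmdc-runtime: `parse_env_lines` and `load_env_file` are hypothetical names, and the parser only handles plain KEY=VALUE lines (no quoting, no `export` prefix, no variable expansion), so it is not an exact equivalent of the shell command above.

```python
import os


def parse_env_lines(lines):
    """Parse KEY=VALUE lines, skipping blank lines and lines starting with '#'."""
    pairs = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        pairs[key.strip()] = value.strip()
    return pairs


def load_env_file(path=".env", environ=os.environ):
    """Read a .env file and copy its variables into the given environment."""
    with open(path, encoding="utf-8") as handle:
        pairs = parse_env_lines(handle)
    environ.update(pairs)
    return pairs


# Demonstration on in-memory lines with made-up variable names.
print(parse_env_lines(["# a comment", "", "EXAMPLE_HOST=localhost", "EXAMPLE_OPTS=a=b"]))
```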

If you are connecting to resources that require an SSH tunnel (for example, a MongoDB instance that is only accessible on the NERSC network),

make nersc-ssh-tunnel

could be useful for you, directly or as a template.

Finally,

make up-dev

Docker Compose is used to start local MongoDB and PostgreSQL (used by Dagster) instances, as well as a Dagster web server (dagit) and daemon (dagster-daemon).

The Dagit web server is viewable at http://localhost:3000/.

The FastAPI service is viewable at http://localhost:8000/ -- e.g., rendered documentation at http://localhost:8000/redoc/.
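
After `make up-dev`, the services can take a moment to start. The sketch below polls the two local URLs mentioned above until they respond. It is only an illustration: `is_up` and `wait_until_up` are hypothetical helpers, not part of nmdc-runtime, and the retry counts are arbitrary.

```python
import time
from urllib.request import urlopen

SERVICE_URLS = {
    "dagit": "http://localhost:3000/",
    "fastapi": "http://localhost:8000/",
}


def is_up(url, timeout=2.0):
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError and HTTPError are both subclasses of OSError.
        return False


def wait_until_up(urls, probe=is_up, attempts=30, delay=1.0, sleep=time.sleep):
    """Poll each service until it responds; return names still down at the end."""
    pending = dict(urls)
    for _ in range(attempts):
        pending = {name: url for name, url in pending.items() if not probe(url)}
        if not pending:
            return []
        sleep(delay)
    return sorted(pending)


if __name__ == "__main__":
    still_down = wait_until_up(SERVICE_URLS)
    print("all services up" if not still_down else f"not responding: {still_down}")
```

The `probe` and `sleep` parameters exist so the polling loop can be exercised without real services or real waiting.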

Local Testing

Tests can be found in tests/ and are run with the following commands:

make up-test
make test

As you create Dagster solids and pipelines, add tests in tests/ to check that your code behaves as desired and does not break over time.

For hints on how to write tests for solids and pipelines in Dagster, see their documentation tutorial on Testing.
