Data Science Operations on top of dvc

DSO: data science operations

DSO is a command-line helper for building reproducible data analysis projects with ease. It builds on dvc for data versioning and provides project templates, linting checks, and hierarchical overlay of configuration files, and it integrates with quarto and jupyter notebooks.

At Boehringer Ingelheim, we introduced DSO to meet the high quality standards required for biomarker analysis in clinical trials. DSO is still under early development and we value community feedback.

Getting started

What is DVC?

DVC is like "git for data". It can version large data files and data directories alongside source code tracked with git. In addition to versioning files, dvc can be used to run analyses in a reproducible way by declaring input and output files as well as commands to be executed in a dvc.yaml configuration file. After executing an analysis, timestamps and checksums of all input and output files are stored in a lock file, providing a provenance record. Different analysis tasks are organized in stages. Since input and output files of each stage are declared, dvc can build a dependency graph of the stages to re-execute stages as appropriate if input data or preprocessing steps have been updated.
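The idea of re-running only what is affected by a change can be sketched in a few lines of Python. This is a simplified illustration, not DVC's implementation: the stage names, file names and the `stale_stages` helper are hypothetical, and SHA-256 is used here only as an example of a content checksum.

```python
import hashlib
from graphlib import TopologicalSorter

# Hypothetical stages: name -> declared input ("deps") and output ("outs") files.
STAGES = {
    "01_preprocessing": {"deps": ["raw.csv"], "outs": ["clean.csv"]},
    "02_quality_control": {"deps": ["clean.csv"], "outs": ["qc.html"]},
    "03_report": {"deps": ["notes.md"], "outs": ["report.html"]},
}


def checksum(content: bytes) -> str:
    """Return a hex digest that identifies the file content."""
    return hashlib.sha256(content).hexdigest()


def stage_graph(stages):
    """Map each stage to the stages that produce one of its inputs."""
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    return {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }


def stale_stages(stages, recorded, current):
    """Return the stages to re-run, in execution order.

    A stage is stale if the checksum of one of its inputs differs from the
    recorded one, or if a stage it depends on is stale.
    """
    graph = stage_graph(stages)
    stale = []
    # static_order() yields upstream stages before the stages that use them.
    for name in TopologicalSorter(graph).static_order():
        changed = any(recorded.get(d) != current.get(d) for d in stages[name]["deps"])
        if changed or any(up in stale for up in graph[name]):
            stale.append(name)
    return stale


# Checksums recorded after the last run versus checksums of the files now.
recorded = {
    "raw.csv": checksum(b"v1"),
    "clean.csv": checksum(b"clean"),
    "notes.md": checksum(b"notes"),
}
current = dict(recorded, **{"raw.csv": checksum(b"v2")})
print(stale_stages(STAGES, recorded, current))
```

In this sketch, changing `raw.csv` marks the preprocessing stage and the quality-control stage that consumes its output as stale, while the unrelated report stage is left alone.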

Creating a project from a template

There are three types of DSO templates: projects, folders and stages. A project is the root directory of your analysis and is always a git repository. It can be created using dso init. A stage is an executable step of your analysis (usually one script with defined inputs and outputs) organized in a folder. Stages cannot be nested. A folder is used to organize stages hierarchically within the project.

You can use dso init to create a new project:

$> dso init
Please enter the name of the project, e.g. "single_cell_lung_atlas": my_cool_project
Please add a short description of the project: This analysis solves *all* the problems!

Within a project, you can use dso create to initialize folders and stages from a predefined template:

$> dso create stage
? Choose a template: (Use arrow keys)
   bash
 » quarto
Please enter the name of the stage, e.g. "01_preprocessing": 02_quality_control
Please add a short description of the stage: Make a PCA to detect outliers

How to write and use config files

The config files in a project, subfolder or stage are the cornerstone of any reproducible analysis because they minimise configuration errors across related scripts. Config files also reduce the time needed to modify your scripts when changing settings such as p-value cutoffs, excluded samples, output directories or input data.

A config file of a project, subfolder, or stage contains all parameters that should be consistent across the analyses. Parameters are therefore changed in the config files rather than individually in each analysis script.

DSO uses two parameter files, params.in.yaml and params.yaml. params.yaml is an autogenerated YAML file containing all the parameters specified in the params.in.yaml and other params.yaml files in its parent directories (see the figure below for an example of how this works in practice). params.yaml is compiled when running dso compile-config.

[Figure: Hierarchical configuration schema]

$> dso compile-config
[08/22/24 20:53:43] INFO     Detected /home/grst/my_cool_project as project root.
                    INFO     Compiling a total of 2 config files.
                    INFO     Configuration compiled successfully.
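To make the hierarchical overlay more concrete, here is a minimal Python sketch of merging parameter layers from the project root down to a stage. It is an illustration under assumptions, not DSO's implementation: the `overlay` and `compile_params` helpers and the parameter names are hypothetical, the YAML files are represented as plain dicts, and the merge rules of DSO itself (for example, how lists are combined) may differ.

```python
from copy import deepcopy


def overlay(base: dict, override: dict) -> dict:
    """Recursively merge `override` onto `base`; values in `override` win."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = overlay(merged[key], value)
        else:
            # Scalars and lists are replaced as a whole (an assumption).
            merged[key] = deepcopy(value)
    return merged


def compile_params(layers: list[dict]) -> dict:
    """Merge parameter layers ordered from project root down to the stage."""
    result: dict = {}
    for layer in layers:
        result = overlay(result, layer)
    return result


# Hypothetical parameter layers at three levels of the hierarchy.
project = {"thresholds": {"p_value": 0.05}, "samples": {"exclude": []}}
folder = {"samples": {"exclude": ["S07"]}}
stage = {"thresholds": {"p_value": 0.01, "fold_change": 2.0}}

params = compile_params([project, folder, stage])
print(params)
```

Here the stage overrides the project-wide p-value cutoff and adds a parameter of its own, while the sample exclusion list set at folder level is inherited unchanged.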

Linting checks

DSO provides linting checks that detect common errors in analysis projects. Only a few checks are implemented so far, but more will be added in the future.

To run the linting checks manually, execute

$> dso lint
[08/22/24 20:53:43] INFO     Compiled a list of 22 to be linted

However, it is preferable to execute linting checks as pre-commit hooks and/or as continuous integration checks. A .pre-commit-config.yaml comes with the DSO project template. Simply activate it using pre-commit install.

Reproducing projects

To reproduce/execute all stages within a project, run

$> dso repro

This is a thin wrapper around dvc repro that compiles all configuration files beforehand. DVC only re-runs stages defined in dvc.yaml whose inputs have changed. If the dependencies of an earlier stage have changed, that stage is re-run as well.
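Conceptually, the wrapper amounts to running two commands in sequence. The following Python sketch illustrates that idea only; it is not DSO's code, and the `run_steps` and `shell_runner` helpers are hypothetical names introduced for this example.

```python
import subprocess
from typing import Callable, Sequence

# The two steps described above: compile configuration, then reproduce.
REPRO_STEPS: list[list[str]] = [["dso", "compile-config"], ["dvc", "repro"]]


def run_steps(
    steps: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], int],
) -> bool:
    """Run each command in order and stop at the first non-zero exit code."""
    for command in steps:
        if runner(command) != 0:
            return False
    return True


def shell_runner(command: Sequence[str]) -> int:
    """Execute a command in the current directory and return its exit code."""
    return subprocess.run(list(command), check=False).returncode


# Usage inside a DSO project, with dso and dvc available on PATH:
#     ok = run_steps(REPRO_STEPS, shell_runner)
```

Passing the runner as an argument keeps the sequencing logic separate from process execution, so the order of the steps can be checked without invoking either tool.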

Integration with quarto

DSO provides some additional tooling around quarto documents for generating reproducible reports. When you create a quarto stage via dso create stage --template quarto, you are all set to use this tooling:

  • Render quarto stages to html via dso exec quarto .
  • Inherit quarto configuration through the project from the params.yaml files. Quarto configuration can be placed in dso.quarto, e.g.
    dso:
      quarto:
        author:
          - Jane Doe
        execute:
          warning: false
    
  • Add a disclaimer box and watermarks to all plots (e.g. to mark them as drafts) by adding additional settings
    dso:
      quarto:
        watermark:
          text: DRAFT
        disclaimer:
          title: This document is a DRAFT
          text: Please do not share!
    

To access stage parameters and resolve file paths relative to the stage directory from within R, use the companion package dso-r, which provides two functions: read_params(stage_name) and stage_here(path).

Installation

DSO requires Python 3.10 or later.

You can install the latest version with pip using

pip install dso-core

Alternatively, you can install the development version from GitHub:

pip install git+https://github.com/Boehringer-Ingelheim/dso.git@main

Release notes

See the changelog.

Credits

dso was initially developed by

DSO depends on many great open source projects, most notably dvc, hiyapyco and jinja2.
