LaminDB: Manage R&D data & analyses.

LaminDB: Data lake for biology

LaminDB is an API layer for your existing infrastructure to manage your existing data.

Public beta: Currently only recommended for collaborators as we still make breaking changes.

Update 2023-06-14:

- We completed a major migration from SQLAlchemy/SQLModel to Django, available in 0.42.0.
- The last version that is fully compatible with SQLAlchemy/SQLModel is 0.41.2.

Features

Free:

  • Track data lineage across notebooks, pipelines & apps.
  • Manage biological registries, ontologies & features.
  • Persist, load & stream data objects with a single line of code.
  • Query for anything, define & manage custom schemas.
  • Manage data on your laptop, server or cloud infra.
  • Use a mesh of distributed LaminDB instances for different teams and purposes.
  • Share instances through a Hub akin to GitHub.

Enterprise:

  • Explore & share data, submit samples & track lineage with LaminApp (deployable in your infra).
  • Receive support, code templates & services for a BioTech data & analytics platform.

Usage overview

Use the CLI to initialize a data lake with local or cloud default storage:

$ lamin init --storage ./myartifacts  # or s3://my-bucket, gs://my-bucket, etc.

Within Python, import lamindb:

import lamindb as ln

Store, query, search & load data artifacts

Store a DataFrame in default storage:

import pandas as pd

df = pd.DataFrame({"feat1": [1, 2], "feat2": [3, 4]})  # AnnData works, too

ln.File(df, name="My dataset1").save()  # create a File artifact and save it

You'll have the full power of unconstrained SQL to query for metadata, but the simplest query for an artifact is:

file = ln.File.select(name="My dataset1").one()  # get exactly one result
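
To illustrate what "exactly one result" implies, here is a minimal, library-independent sketch. The helper `one`, its error messages and the sample records are hypothetical; they are not LaminDB's implementation, and the source only states that `.one()` returns exactly one result.

```python
from typing import Iterable, TypeVar

T = TypeVar("T")


def one(results: Iterable[T]) -> T:
    """Return the single element of `results`; raise if there are 0 or 2+."""
    items = list(results)
    if len(items) == 0:
        raise LookupError("expected exactly one result, found none")
    if len(items) > 1:
        raise LookupError(f"expected exactly one result, found {len(items)}")
    return items[0]


# Hypothetical in-memory records standing in for file metadata
records = [{"name": "My dataset1"}, {"name": "My dataset2"}]
match = one(r for r in records if r["name"] == "My dataset1")
print(match["name"])  # prints: My dataset1
```

Failing loudly on zero or several matches is what makes such a call safe to use when later code assumes a single artifact.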

If you don't have specific metadata in mind, search for the artifact:

ln.File.search("dataset1")

Load the artifact back into memory:

df = file.load()

Or get a backed accessor to stream its content from the cloud:

conn = file.backed()  # currently works only for AnnData, not yet for DataFrame

Track & query data lineage

ln.File.select(created_by__handle="lizlemon").df()   # all files ingested by lizlemon
ln.File.select().order_by("-updated_at").first()  # latest updated file
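
The keyword `created_by__handle` follows the Django convention of using a double underscore to traverse a relation, here from a file to the user who created it. A minimal sketch of that convention on plain dictionaries, assuming equality lookups only; `resolve`, `select` and the sample records are hypothetical and not part of LaminDB:

```python
from typing import Any


def resolve(record: dict, path: str) -> Any:
    """Follow a double-underscore path such as 'created_by__handle'."""
    value: Any = record
    for key in path.split("__"):
        value = value[key]
    return value


def select(records: list[dict], **filters: Any) -> list[dict]:
    """Keep records whose resolved fields equal every filter value."""
    return [
        r for r in records
        if all(resolve(r, path) == expected for path, expected in filters.items())
    ]


# Hypothetical records: each file embeds the user who created it
files = [
    {"name": "a.parquet", "created_by": {"handle": "lizlemon"}},
    {"name": "b.parquet", "created_by": {"handle": "jackdonaghy"}},
]
print([f["name"] for f in select(files, created_by__handle="lizlemon")])
# prints: ['a.parquet']
```

The sketch leaves out lookup suffixes such as `__contains`, `__year` or `__in`, which appear in other queries in this README.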

Notebooks

Track a Jupyter Notebook:

ln.track()  # auto-detect & save notebook metadata
ln.File("my_artifact.parquet").save()  # this file is an output of the notebook run

When you query this file later on you'll know from which notebook it came:

file = ln.File.select(name="my_artifact.parquet").one()  # query for a file
file.transform  # notebook with id, title, filename, version, etc.
file.run  # the notebook run that created the file
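
Conceptually, this lineage is a chain of references: a file points to the run that produced it, and the run points to its transform (the notebook). A minimal sketch with plain dataclasses; the class and attribute names mirror those used above, but the code is an illustration under that assumption and not LaminDB's schema:

```python
from dataclasses import dataclass


@dataclass
class Transform:
    name: str
    type: str  # e.g. "notebook" or "pipeline"


@dataclass
class Run:
    transform: Transform


@dataclass
class File:
    name: str
    run: Run

    @property
    def transform(self) -> Transform:
        # In this sketch, a file's transform is reached through its run
        return self.run.transform


notebook = Transform(name="My analysis", type="notebook")
run = Run(transform=notebook)
file = File(name="my_artifact.parquet", run=run)
print(file.transform.name)  # prints: My analysis
```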

Or query for notebooks directly:

transforms = ln.Transform.select(  # all notebooks with 'T cell' in the title created in 2022
    name__contains="T cell", type="notebook", created_at__year=2022
).all()
ln.File.select(transform__in=transforms).all()  # data artifacts created by these notebooks

Pipelines

To save a pipeline to the Transform registry, call:

ln.Transform(name="Awesom-O", version="0.41.2").save()  # save a pipeline, optionally with metadata

Track a pipeline run:

transform = ln.Transform.select(name="Awesom-O", version="0.41.2").one()  # select pipeline from the registry
ln.track(transform)  # create a new global run context
ln.File("s3://my_samples01/my_artifact.fastq.gz").save()  # link file against run & transform

Now you can query, for example, for the latest pipeline runs:

ln.Run.select(transform__name="Awesom-O").order_by("-created_at").df()  # get the latest pipeline runs

Auto-complete categoricals

When you're unsure about spellings, use a lookup object:

lookup = ln.Transform.lookup()
ln.Run.select(transform=lookup.awesom_o)  # attribute derived from the name "Awesom-O"
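
A lookup object of this kind can be thought of as a namespace whose attribute names are derived from record names, so that an editor can auto-complete them. A minimal sketch, assuming names are lower-cased and runs of non-alphanumeric characters become underscores; `to_identifier`, `build_lookup` and this normalization rule are hypothetical, and LaminDB's actual rule may differ:

```python
import re
from types import SimpleNamespace


def to_identifier(name: str) -> str:
    """Turn a free-text name into a lowercase, attribute-friendly key."""
    key = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()
    # Identifiers cannot start with a digit, so prefix those with "_"
    return f"_{key}" if key[:1].isdigit() else key


def build_lookup(names: list[str]) -> SimpleNamespace:
    """Expose each name as an attribute under its identifier-style key."""
    return SimpleNamespace(**{to_identifier(n): n for n in names})


# Hypothetical transform names
lookup = build_lookup(["Cell Ranger", "nf-core/rnaseq", "10x QC"])
print(lookup.cell_ranger)     # prints: Cell Ranger
print(lookup.nf_core_rnaseq)  # prints: nf-core/rnaseq
print(lookup._10x_qc)         # prints: 10x QC
```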

Load your data lake instance from anywhere

Let other users access your work including all lineage & metadata via a single line:

$ lamin load myaccount/myartifacts

Manage biological registries

$ lamin init --storage ./myobjects --schema bionty

...

Track biological features

...

Track biological samples

...

Manage custom schemas

  1. Create a GitHub repository with Django ORMs similar to github.com/laminlabs/lnschema-lamin1
  2. Create & deploy migrations via lamin migrate create and lamin migrate deploy

It's fastest if we do this for you based on our templates within an enterprise plan, but you can fully manage the process yourself.

Installation

pip install lamindb  # basic data lake
pip install 'lamindb[bionty]'  # biological entities
pip install 'lamindb[nbproject]'  # Jupyter notebook tracking
pip install 'lamindb[aws]'  # AWS dependencies (s3fs, etc.)
pip install 'lamindb[gcp]'  # GCP dependencies (gcsfs, etc.)

Quick setup

Why do I have to sign up?

  • Data lineage requires a user identity (who modified which data when?).
  • Collaboration requires a user identity (who shares this with me?).

Signing up takes 1 min.

We do not store any of your data, only basic metadata about you (email address, etc.) & your LaminDB instances (S3 bucket names, etc.).

  • Sign up: lamin signup <email>
  • Log in: lamin login <handle>

How does it work?

LaminDB builds semantics of R&D and biology onto well-established tools:

  • SQLite & Postgres for SQL databases using Django ORM (previously: SQLModel)
  • S3, GCS & local storage for object storage using fsspec
  • Configurable storage formats: pyarrow, anndata, zarr, etc.
  • Biological knowledge sources & ontologies: see Bionty

LaminDB is open source.

Architecture

LaminDB consists of the lamindb Python package (repository here) with its components:

  • bionty: Basic biological entities (usable standalone).
  • lamindb-setup: Set up & configure LaminDB, client for Lamin Hub.
  • lnschema-core: Core schema, ORMs to model data objects & data lineage.
  • lnschema-bionty: Bionty schema, ORMs that are coupled to Bionty's entities.
  • lnschema-lamin1: Exemplary configured schema to track samples, treatments, etc.
  • nbproject: Parse metadata from Jupyter notebooks.

LaminHub & LaminApp are not open-sourced, and neither are templates that model lab operations.

Lamin's packages build on the infrastructure listed above. Previously, they were based on SQLAlchemy/SQLModel instead of Django, and cloudpathlib instead of fsspec.

Notebooks

  • Find all guide notebooks here.
  • You can run these notebooks in hosted versions of JupyterLab (e.g., Saturn Cloud, Google Vertex AI) or on Google Colab.
  • JupyterLab & Jupyter Notebook offer a fully interactive experience; VS Code & others require using the CLI (lamin track my-notebook.ipynb).

Documentation

Read the docs.
