LaminDB: Manage R&D data & analyses.

LaminDB

Open-source data lake, warehouse & feature store for biology.

Manage your existing data & analyses in your existing infrastructure.

Public beta: Currently only recommended for collaborators as we still make breaking changes.

Update 2023-06-14:

- We completed a major migration from SQLAlchemy/SQLModel to Django, available in 0.42.0.
- The last version before the migration is 0.41.2.

What?

LaminDB is a free & open-source Python library allowing you to:

You can combine LaminDB with LaminApp & consulting services on an enterprise plan:

  • LaminApp: Explore & collaborate on data in a UI (deployable in your infrastructure).
  • Services: Support & code templates for a BioTech data & analytics platform.

Usage overview

Import lamindb and initialize a data lake instance with local or cloud default storage:

import lamindb as ln

ln.setup.init(storage="./mydata")  # or s3://my-bucket, gs://my-bucket, etc.

Store, query, search & load data objects

Store a DataFrame in default storage:

import pandas as pd

df = pd.DataFrame({"feat1": [1, 2], "feat2": [3, 4]})  # AnnData works, too

ln.File(df, name="My dataset1").save()  # create a File object and save/upload it

You have the full power of SQL to query for metadata, but the simplest query for a file is:

file = ln.File.select(name="My dataset1").one()  # get exactly one result

If you don't have specific metadata in mind, run a search:

ln.File.search("dataset1")

Once you have queried or searched for a file, load it back into memory:

df = file.load()

Or get a backed accessor to stream its content from the cloud:

backed = file.backed()  # currently works for AnnData, zarr, HDF5, not yet for DataFrame

Store, query & search files

The same API works for any file:

file = ln.File("s3://my-bucket/images/image001.jpg")  # or a local path
file.save()  # register the file

Query by key (the relative path within your storage):

ln.File.select(key__startswith="images/").df()  # all files in folder "images/" in default storage
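
To make the notion of a key concrete, the sketch below uses only the Python standard library, not LaminDB itself. It assumes a hypothetical storage root and a few made-up file paths, derives each file's key as its path relative to that root, and then filters the keys by prefix, which is the kind of match a key-prefix query expresses.

```python
from pathlib import PurePosixPath

# Hypothetical storage root and file paths, for illustration only
storage_root = PurePosixPath("my-bucket")
paths = [
    PurePosixPath("my-bucket/images/image001.jpg"),
    PurePosixPath("my-bucket/images/image002.jpg"),
    PurePosixPath("my-bucket/tables/dataset1.parquet"),
]

# A key is the path of a file relative to the storage root
keys = [str(p.relative_to(storage_root)) for p in paths]

# Prefix filter: keep only the keys that live under "images/"
image_keys = [k for k in keys if k.startswith("images/")]
print(image_keys)
```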

Auto-complete categoricals

When you're unsure about spellings, use a lookup object:

users = ln.User.lookup()
ln.File.select(created_by=users.lizlemon)
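
The idea behind a lookup object can be illustrated in a few lines of plain Python. This is only a sketch of the concept, not LaminDB's implementation: it assumes a made-up list of user handles and a hypothetical helper make_lookup that exposes each handle as an attribute, so that an editor or notebook can auto-complete the exact spelling.

```python
from types import SimpleNamespace

# Made-up user handles, for illustration only
handles = ["lizlemon", "jackdonaghy", "kenneth-parcell"]


def make_lookup(values):
    """Expose each value as an attribute so tab completion can find it."""
    # Attribute names must be valid identifiers, so every character that is
    # not alphanumeric is replaced with an underscore; the value is unchanged
    attrs = {
        "".join(c if c.isalnum() else "_" for c in v).lower(): v for v in values
    }
    return SimpleNamespace(**attrs)


users = make_lookup(handles)
print(users.lizlemon)         # the stored spelling of the handle
print(users.kenneth_parcell)  # sanitized attribute name, original value
```

The sketch does not handle values that start with a digit or that collide after sanitizing; a real lookup would need a rule for both.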

Track & query data lineage

In addition to basic provenance information (created_by, created_at), you can track which notebooks, pipelines & apps transformed files.

Notebooks

Track a Jupyter Notebook:

ln.track()  # auto-detect & save notebook metadata
ln.File("my_artifact.parquet").save()  # this file is now aware that it was saved in this notebook

When you query the file later on, you'll know which notebook it came from:

file = ln.File.select(name="my_artifact.parquet").one()  # query for a file
file.transform  # the notebook with id, title, filename, version, etc.
file.run  # the specific run of the notebook that created the file

Alternatively, you can query for notebooks and find the files written by them:

transforms = ln.Transform.select(  # all notebooks with 'T cell' in the title created in 2022
    name__contains="T cell", type="notebook", created_at__year=2022
).all()
ln.File.select(transform__in=transforms).df()  # the files created by these notebooks
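
Because the metadata lives in a SQL database, a lineage query of this kind boils down to joins between file, run and transform records. The following is a minimal sketch using Python's built-in sqlite3 module. The table and column names are hypothetical and chosen for illustration; they are not LaminDB's actual schema, and the records are made up.

```python
import sqlite3

# In-memory database with a hypothetical, simplified lineage schema
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE transform (id INTEGER PRIMARY KEY, name TEXT, type TEXT, created_at TEXT);
    CREATE TABLE run (id INTEGER PRIMARY KEY, transform_id INTEGER REFERENCES transform(id));
    CREATE TABLE file (id INTEGER PRIMARY KEY, name TEXT, run_id INTEGER REFERENCES run(id));
    """
)

# Made-up records: two notebooks and one pipeline, one run each, one file per run
con.executemany(
    "INSERT INTO transform VALUES (?, ?, ?, ?)",
    [
        (1, "T cell clustering", "notebook", "2022-05-01"),
        (2, "B cell analysis", "notebook", "2022-06-01"),
        (3, "Awesom-O", "pipeline", "2023-01-15"),
    ],
)
con.executemany("INSERT INTO run VALUES (?, ?)", [(1, 1), (2, 2), (3, 3)])
con.executemany(
    "INSERT INTO file VALUES (?, ?, ?)",
    [(1, "clusters.parquet", 1), (2, "b_cells.parquet", 2), (3, "reads.fastq.gz", 3)],
)

# Files written by notebooks with 'T cell' in the title that were created in 2022
rows = con.execute(
    """
    SELECT file.name, transform.name
    FROM file
    JOIN run ON file.run_id = run.id
    JOIN transform ON run.transform_id = transform.id
    WHERE transform.type = 'notebook'
      AND transform.name LIKE '%T cell%'
      AND strftime('%Y', transform.created_at) = '2022'
    ORDER BY file.id
    """
).fetchall()
print(rows)
```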

Pipelines

This works as it does for notebooks, except that you need to provide the pipeline metadata yourself.

To save a pipeline to the Transform registry, call:

ln.Transform(name="Awesom-O", version="0.41.2").save()  # save a pipeline, optionally with metadata

Track a pipeline run:

transform = ln.Transform.select(name="Awesom-O", version="0.41.2").one()  # select pipeline from the registry
ln.track(transform)  # create a new global run context
ln.File("s3://my_samples01/my_artifact.fastq.gz").save()  # file gets auto-linked against run & transform

Now, you can query for the latest pipeline runs:

ln.Run.select(transform=transform).order_by("-created_at").df()  # get the latest pipeline runs

Run inputs

To track run inputs, pass is_run_input to any File accessor: .stage(), .load() or .backed(). For instance,

file.load(is_run_input=True)

You can also track inputs by default by setting ln.settings.track_run_inputs = True.

Load your data lake from anywhere

If provided with access, others can load your data lake via a single line:

$ lamin load myaccount/myartifacts

Manage biological registries

lamin init --storage ./bioartifacts --schema bionty

...

Track biological features

...

Track biological samples

...

Manage custom schemas

  1. Create a GitHub repository with Django ORMs similar to github.com/laminlabs/lnschema-lamin1
  2. Create & deploy migrations via lamin migrate create and lamin migrate deploy

It's fastest if we do this for you based on our templates within an enterprise plan, but you can fully manage the process yourself.

Installation

pip install lamindb  # basic data lake
pip install 'lamindb[jupyter]'  # Jupyter notebook tracking
pip install 'lamindb[bionty]'  # basic biological entities
pip install 'lamindb[fcs]'  # .fcs files (flow cytometry)
pip install 'lamindb[zarr]'  # zarr storage (streaming arrays)
pip install 'lamindb[aws]'  # AWS (s3fs, etc.)
pip install 'lamindb[gcp]'  # Google Cloud (gcsfs, etc.)

Sign up

Why do I have to sign up?

  • Data lineage requires a user identity (who modified which data when?).
  • Collaboration requires a user identity (who shares this with me?).

Signing up takes 1 min.

We do not store any of your data, but only basic metadata about you (email address, etc.) & your LaminDB instances (S3 bucket names, etc.).

  • Sign up: lamin signup <email>
  • Log in: lamin login <handle>

How does it work?

LaminDB builds semantics of R&D and biology onto well-established tools:

  • SQLite & Postgres for SQL databases using Django ORM (previously: SQLModel)
  • S3, Google Cloud Storage & local storage for object storage using fsspec
  • Configurable storage formats: pyarrow, anndata, zarr, etc.
  • Biological knowledge sources & ontologies: see Bionty
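
The division of labor between the first two items, a SQL database for metadata and object storage for the data itself, can be sketched with the standard library alone. This is an illustration of the general pattern under stated assumptions, not LaminDB's implementation: the register helper, the file table and its columns are hypothetical, and a temporary local directory stands in for a bucket.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path

# A temporary directory stands in for object storage; SQLite holds the metadata
storage = Path(tempfile.mkdtemp())
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE file (key TEXT PRIMARY KEY, name TEXT, size INTEGER, hash TEXT)")


def register(name: str, key: str, content: bytes) -> None:
    """Write the content to storage and record its metadata in the database."""
    path = storage / key
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    digest = hashlib.sha256(content).hexdigest()
    db.execute("INSERT INTO file VALUES (?, ?, ?, ?)", (key, name, len(content), digest))


register("My dataset1", "tables/dataset1.csv", b"feat1,feat2\n1,3\n2,4\n")

# Query the metadata with SQL, then load the data from storage via its key
key, size = db.execute(
    "SELECT key, size FROM file WHERE name = ?", ("My dataset1",)
).fetchone()
data = (storage / key).read_bytes()
print(key, size)
```

Keeping only the key in the database means the storage location can change (local directory, S3, Google Cloud Storage) without touching the metadata queries.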

LaminDB is open source.

Architecture

LaminDB consists of the lamindb Python package (repository here) with its components:

  • bionty: Basic biological entities (usable standalone).
  • lamindb-setup: Setup & configure LaminDB, client for Lamin Hub.
  • lnschema-core: Core schema, ORMs to model data objects & data lineage.
  • lnschema-bionty: Bionty schema, ORMs that are coupled to Bionty's entities.
  • lnschema-lamin1: Exemplary configured schema to track samples, treatments, etc.
  • nbproject: Parse metadata from Jupyter notebooks.

LaminHub & LaminApp are not open-sourced, and neither are templates that model lab operations.

Lamin's packages build on the infrastructure listed above.

Notebooks

  • Find all guide notebooks here.
  • You can run these notebooks in hosted versions of JupyterLab, e.g., Saturn Cloud, Google Vertex AI, Google Colab, and others.
  • Jupyter Lab & Notebook offer a fully interactive experience; VS Code & others require using the CLI (lamin track my-notebook.ipynb).

Documentation

Read the docs.

Download files

  • Source distribution: lamindb-0.46a1.tar.gz (290.3 kB)
  • Built distribution: lamindb-0.46a1-py3-none-any.whl (54.0 kB, Python 3)
