
LaminDB: Manage R&D data & analyses

Curate, store, track, query, integrate, and learn from biological data.

LaminDB is an open-source data lake for R&D in biology. It manages indexed object storage (local directories, S3, GCP) with a mapped SQL database (SQLite, Postgres, and soon, BigQuery).

One cool thing is that you can readily create distributed LaminDB instances at any scale. Get started on your laptop, deploy in the cloud, or work with a mesh of instances for different teams and purposes.

Public beta: Currently only recommended for collaborators as we still make breaking changes.

Installation

LaminDB is a Python package supporting Python 3.8+.

pip install lamindb

If you need the bionty (feature parsing) and wetlab schema modules:

pip install 'lamindb[bionty,wetlab]'

Import

In your Python script, import LaminDB as:

import lamindb as ln

Quick setup

Quick setup on the command line:

  • Sign up via lamin signup <email>
  • Log in via lamin login <handle>
  • Set up an instance via lamin init --storage <storage> --schema <schema_modules>

:::{dropdown} Example code

lamin signup testuser1@lamin.ai
lamin login testuser1
lamin init --storage ./mydata --schema bionty,wetlab

:::

See {doc}`guide/setup` for more.

Track & query data

Track data sources, data, and metadata

::::{tab-set}
:::{tab-item} Within an interactive notebook

import lamindb as ln
import pandas as pd

ln.Run() # data source (a run record) is created
#> ℹ️ Instance: testuser1/mydata
#> ℹ️ User: testuser1
#> ℹ️ Loaded notebook: Notebook(id='sePTpDsGJRq3', v='0', name='04-memory', title='Track in-memory data objects', created_by='bKeW4T6E', created_at=datetime.datetime(2023, 3, 15, 16, 14, 42))
#> ℹ️ Loaded run:
#> Run(id='L1oBMKW60ndt5YtjRqav', notebook_id='sePTpDsGJRq3', notebook_v='0', created_by='bKeW4T6E', created_at=datetime.datetime(2023, 3, 14, 21, 49, 36))

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# create a data object with SQL metadata record including hash
# link run record
dobject = ln.DObject(df, name="My dataframe")
#> DObject(id='dZvGD7YUKCKG4X4aLd5K', name='My dataframe', suffix='.parquet', size=2240, hash='R2_kKlH1nBGesMdyulMYkA', source_id='L1oBMKW60ndt5YtjRqav', storage_id='wor0ul6c')

# upload serialized version to the configured storage
# commit a DObject record to the SQL database
ln.add(dobject)
#> DObject(id='dZvGD7YUKCKG4X4aLd5K', name='My dataframe', suffix='.parquet', size=2240, hash='R2_kKlH1nBGesMdyulMYkA', source_id='L1oBMKW60ndt5YtjRqav', storage_id='wor0ul6c', created_at=datetime.datetime(2023, 3, 14, 21, 49, 46))

:::
:::{tab-item} Within a regular pipeline

import lamindb as ln
import lamindb.schema as lns  # schema module exposing ORM records such as Pipeline
import pandas as pd

# create (or query) a pipeline record
pipeline = lns.Pipeline(name="My pipeline")
#> Pipeline(id='fhn5Zydf', v='1', name='My pipeline', created_by='bKeW4T6E')

# create a run from the above pipeline as the data source
run = ln.Run(pipeline=pipeline)
#> Run(id='2aaKWH8dwBE6hnj3n9K9', pipeline_id='fhn5Zydf', pipeline_v='1', created_by='bKeW4T6E')

# access pipeline from run via
print(run.pipeline)
#> Pipeline(id='fhn5Zydf', v='1', name='My pipeline', created_by='bKeW4T6E')

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# create a data object with SQL metadata record including hash and link run record
dobject = ln.DObject(df, name="My dataframe", source=run)
#> DObject(id='dZvGD7YUKCKG4X4aLd5K', name='My dataframe', suffix='.parquet', size=2240, hash='R2_kKlH1nBGesMdyulMYkA', source_id='2aaKWH8dwBE6hnj3n9K9', storage_id='wor0ul6c')

# Tip: if you work with a single thread, you can pass `global_context=True` to ln.Run() and omit source=run; see the sketch below

# upload serialized version to the configured storage
# commit a DObject record to the SQL database
ln.add(dobject)
#> DObject(id='dZvGD7YUKCKG4X4aLd5K', name='My dataframe', suffix='.parquet', size=2240, hash='R2_kKlH1nBGesMdyulMYkA', source_id='2aaKWH8dwBE6hnj3n9K9', storage_id='wor0ul6c', created_at=datetime.datetime(2023, 3, 14, 21, 49, 46))
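
The tip above in action: a minimal sketch of the single-threaded shortcut, assuming `global_context=True` behaves as described and registers the run as the global data source.

# register the run globally (single-threaded use only)
ln.Run(pipeline=pipeline, global_context=True)

# source=run can now be omitted
dobject = ln.DObject(df, name="My dataframe")
ln.add(dobject)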

:::
::::

Query & load data

dobject = ln.select(ln.DObject, name="My dataframe").one()
#> DObject(id='dZvGD7YUKCKG4X4aLd5K', name='My dataframe', suffix='.parquet', size=2240, hash='R2_kKlH1nBGesMdyulMYkA', source_id='L1oBMKW60ndt5YtjRqav', storage_id='wor0ul6c', created_at=datetime.datetime(2023, 3, 14, 21, 49, 46))
df = dobject.load()
#>    a  b
#> 0  1  3
#> 1  2  4

Get the data ingested by the latest run:

run = ln.select(ln.Run).order_by(ln.Run.created_at.desc()).first()
#> Run(id='L1oBMKW60ndt5YtjRqav', notebook_id='sePTpDsGJRq3', notebook_v='0', created_by='bKeW4T6E', created_at=datetime.datetime(2023, 3, 14, 21, 49, 36))
dobjects = ln.select(ln.DObject).where(ln.DObject.source == run).all()
#> [DObject(id='dZvGD7YUKCKG4X4aLd5K', name='My dataframe', suffix='.parquet', size=2240, hash='R2_kKlH1nBGesMdyulMYkA', source_id='L1oBMKW60ndt5YtjRqav', storage_id='wor0ul6c', created_at=datetime.datetime(2023, 3, 14, 21, 49, 46))]
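
To work with the content of all returned records, a minimal sketch relying on the .load() call shown above:

# load every data object ingested by the latest run
dataframes = [record.load() for record in dobjects]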

See {doc}`guide/track` for more.

Track biological metadata

Track biological features

import bionty as bt  # Lamin's manager for biological knowledge
import lamindb as ln

ln.Run()  # assume we're in a notebook, so no pipeline needs to be passed

# a sample single cell RNA-seq dataset
adata = ln.dev.datasets.anndata_mouse_sc_lymph_node()

# Create a reference
# - Ensembl id as the standardized id
# - mouse as the species
reference = bt.Gene(species="mouse")

# parse gene identifiers from data and map on reference
features = ln.Features(adata, reference)
#> 🔶 id column not found, using index as features.
#> ✅ 10000 terms (100.0%) are mapped.
#> 🔶 0 terms (0.0%) are not mapped.
# The result is a hashed feature set record:
print(features)
#> Features(id='2Mv3JtH-ScBVYHilbLaQ', type='gene', created_by='bKeW4T6E')
# gene records can be accessed via:
print(features.genes[:3])
#> [Gene(id='ENSMUSG00000020592', species_id='NCBI_10090'),
#>  Gene(id='ENSMUSG00000034931', species_id='NCBI_10090'),
#>  Gene(id='ENSMUSG00000071005', species_id='NCBI_10090')]

# track data with features
dobject = ln.DObject(adata, name="Mouse Lymph Node scRNA-seq", features=features)

# access linked gene references
print(dobject.features.genes[:3])
#> [Gene(id='ENSMUSG00000020592', species_id='NCBI_10090'),
#>  Gene(id='ENSMUSG00000034931', species_id='NCBI_10090'),
#>  Gene(id='ENSMUSG00000071005', species_id='NCBI_10090')]

# upload serialized data to configured storage
# commit a DObject record to the SQL database
# commit all linked features to the SQL database
ln.add(dobject)
#> DObject(id='VRu0Mg93d5l6NLb4znCD', name='Mouse Lymph Node scRNA-seq', suffix='.h5ad', size=17341245, hash='Qprqj0O23197Ko-VobaZiw', source_id='EB78Sl5KPG6wW6XcOlsm', storage_id='0Xt6BY40', created_at=datetime.datetime(2023, 3, 17, 6, 49, 39))
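
Once committed, the linked metadata can be queried back; a minimal sketch composing the select and accessor patterns shown above:

# query the stored data object and walk to its linked gene records
dobject = ln.select(ln.DObject, name="Mouse Lymph Node scRNA-seq").one()
print(dobject.features.genes[:3])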

See {doc}`guide/features` for more.

- Each page in this guide is a Jupyter Notebook, which you can download [here](https://github.com/laminlabs/lamindb/tree/main/docs/guide).
- You can run these notebooks in hosted versions of JupyterLab, e.g., [Saturn Cloud](https://github.com/laminlabs/run-lamin-on-saturn), Google Vertex AI, and others.
- We recommend using [JupyterLab](https://jupyterlab.readthedocs.io/) for the best notebook-tracking experience.

📬 Reach out to report issues, or to learn about the data modules that connect your assays, pipelines & workflows under our data platform's enterprise plan.
