# LaminDB: Manage R&D data & analyses

Curate, store, track, query, integrate, and learn from biological data.

LaminDB is an open-source data lake for R&D in biology. It manages indexed object storage (local directories, S3, GCP) with a mapped SQL database (SQLite, Postgres, and soon, BigQuery).

One cool thing is that you can readily create distributed LaminDB instances at any scale. Get started on your laptop, deploy in the cloud, or work with a mesh of instances for different teams and purposes.

**Public beta:** currently only recommended for collaborators, as we still make breaking changes.

## Installation

LaminDB is a Python package that requires Python 3.8+.

```shell
pip install lamindb
```

## Import

In your Python script, import LaminDB as:

```python
import lamindb as ln
```

## Quick setup

Quick setup on the command line:

- Sign up via `lamin signup <email>`
- Log in via `lamin login <handle>`
- Set up an instance via `lamin init --storage <storage> --schema <schema_modules>`

:::{dropdown} Example code

```shell
lamin signup testuser1@lamin.ai
lamin login testuser1
lamin init --storage ./mydata --schema bionty,wetlab
```

:::

See {doc}`/guide/setup` for more.

## Track & query data

### Track data sources, data, and metadata

::::{tab-set}

:::{tab-item} Within an interactive notebook

```python
import lamindb as ln
import pandas as pd

ln.Run()  # a data source (a run record) is created
#> ℹ️ Instance: testuser2/mydata
#> ℹ️ User: testuser2
#> ℹ️ Loaded run:
#> Run(id='L1oBMKW60ndt5YtjRqav', notebook_id='sePTpDsGJRq3', notebook_v='0', created_by='bKeW4T6E', created_at=datetime.datetime(2023, 3, 14, 21, 49, 36))

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# create a data object with a SQL metadata record (including a content hash)
# and link it to the run record
dobject = ln.DObject(df, name="My dataframe")
#> DObject(id='dZvGD7YUKCKG4X4aLd5K', name='My dataframe', suffix='.parquet', size=2240, hash='R2_kKlH1nBGesMdyulMYkA', source_id='L1oBMKW60ndt5YtjRqav', storage_id='wor0ul6c')

# upload the serialized object to the configured storage
# and commit the DObject record to the SQL database
ln.add(dobject)
#> DObject(id='dZvGD7YUKCKG4X4aLd5K', name='My dataframe', suffix='.parquet', size=2240, hash='R2_kKlH1nBGesMdyulMYkA', source_id='L1oBMKW60ndt5YtjRqav', storage_id='wor0ul6c', created_at=datetime.datetime(2023, 3, 14, 21, 49, 46))
```

:::

:::{tab-item} Within a regular pipeline

```python
import lamindb as ln
import lamindb.schema as lns  # schema module exposing Pipeline & other registries
import pandas as pd

# create (or query) a pipeline record
pipeline = lns.Pipeline(name="My pipeline")
#> Pipeline(id='fhn5Zydf', v='1', name='My pipeline', created_by='bKeW4T6E')

# create a run with the above pipeline as the data source
run = ln.Run(pipeline=pipeline)
#> Run(id='2aaKWH8dwBE6hnj3n9K9', pipeline_id='fhn5Zydf', pipeline_v='1', created_by='bKeW4T6E')

# access the pipeline from the run via
print(run.pipeline)
#> Pipeline(id='fhn5Zydf', v='1', name='My pipeline', created_by='bKeW4T6E')

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# create a data object with a SQL metadata record (including a content hash)
# and link it to the run record
dobject = ln.DObject(df, name="My dataframe", source=run)
#> DObject(id='dZvGD7YUKCKG4X4aLd5K', name='My dataframe', suffix='.parquet', size=2240, hash='R2_kKlH1nBGesMdyulMYkA', source_id='2aaKWH8dwBE6hnj3n9K9', storage_id='wor0ul6c')

# tip: if you work with a single thread, pass global_context=True to ln.Run(),
# which lets you omit source=run (see the sketch after this tab-set)

# upload the serialized object to the configured storage
# and commit the DObject record to the SQL database
ln.add(dobject)
#> DObject(id='dZvGD7YUKCKG4X4aLd5K', name='My dataframe', suffix='.parquet', size=2240, hash='R2_kKlH1nBGesMdyulMYkA', source_id='2aaKWH8dwBE6hnj3n9K9', storage_id='wor0ul6c', created_at=datetime.datetime(2023, 3, 14, 21, 49, 46))
```

:::

::::
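The `global_context=True` tip above deserves a concrete illustration. Here is a minimal sketch, assuming `global_context=True` registers the created run as an implicit data source exactly as the tip describes; the data mirrors the examples above:

```python
import lamindb as ln
import pandas as pd

# create a run and register it as the global data source for this thread
# (assumption: behaves as described in the tip above)
ln.Run(global_context=True)

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# source=run can now be omitted; the global run record is linked automatically
dobject = ln.DObject(df, name="My dataframe")
ln.add(dobject)
```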

### Query & load data

```python
dobject = ln.select(ln.DObject, name="My dataframe").one()
#> DObject(id='dZvGD7YUKCKG4X4aLd5K', name='My dataframe', suffix='.parquet', size=2240, hash='R2_kKlH1nBGesMdyulMYkA', source_id='L1oBMKW60ndt5YtjRqav', storage_id='wor0ul6c', created_at=datetime.datetime(2023, 3, 14, 21, 49, 46))
df = dobject.load()
#>    a  b
#> 0  1  3
#> 1  2  4
```

Get the data ingested by the latest run:

```python
run = ln.select(ln.Run).order_by(ln.Run.created_at.desc()).first()
#> Run(id='L1oBMKW60ndt5YtjRqav', notebook_id='sePTpDsGJRq3', notebook_v='0', created_by='bKeW4T6E', created_at=datetime.datetime(2023, 3, 14, 21, 49, 36))
dobjects = ln.select(ln.DObject).where(ln.DObject.source == run).all()
#> [DObject(id='dZvGD7YUKCKG4X4aLd5K', name='My dataframe', suffix='.parquet', size=2240, hash='R2_kKlH1nBGesMdyulMYkA', source_id='L1oBMKW60ndt5YtjRqav', storage_id='wor0ul6c', created_at=datetime.datetime(2023, 3, 14, 21, 49, 46))]
```
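These queries compose. As a hedged sketch (assuming fields that appear in the record representations above, such as `suffix` and `created_at`, can be filtered and ordered in the same way as `source`):

```python
# hypothetical: all parquet objects, most recent first
dobjects = (
    ln.select(ln.DObject)
    .where(ln.DObject.suffix == ".parquet")
    .order_by(ln.DObject.created_at.desc())
    .all()
)
```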

See {doc}`/guide/track` for more.

## Track biological metadata

### Track biological features

```python
import bionty as bt  # Lamin's manager for biological knowledge
import lamindb as ln

ln.Run()  # assume we're in a notebook, hence no pipeline needs to be passed

# a sample single-cell RNA-seq dataset
adata = ln.dev.datasets.anndata_mouse_sc_lymph_node()

# create a gene reference with
# - Ensembl id as the standardized id
# - mouse as the species
reference = bt.Gene(species="mouse")

# parse gene identifiers from the data and map them against the reference
features = ln.Features(adata, reference)
#> 🔶 id column not found, using index as features.
#> ✅ 0 terms (0.0%) are mapped.
#> 🔶 10000 terms (100.0%) are not mapped.

# the result is a hashed feature-set record
print(features)
#> Features(id='2Mv3JtH-ScBVYHilbLaQ', type='gene', created_by='bKeW4T6E')

# gene records can be accessed via
print(features.genes[:3])
#> [Gene(id='ENSMUSG00000020592', species_id='NCBI_10090'),
#>  Gene(id='ENSMUSG00000034931', species_id='NCBI_10090'),
#>  Gene(id='ENSMUSG00000071005', species_id='NCBI_10090')]

# track the data together with its features
dobject = ln.DObject(adata, name="Mouse Lymph Node scRNA-seq", features=features)

# access linked gene references
print(dobject.features.genes[:3])
#> [Gene(id='ENSMUSG00000020592', species_id='NCBI_10090'),
#>  Gene(id='ENSMUSG00000034931', species_id='NCBI_10090'),
#>  Gene(id='ENSMUSG00000071005', species_id='NCBI_10090')]

# upload the serialized data to the configured storage,
# commit a DObject record to the SQL database,
# and commit all linked features to the SQL database
ln.add(dobject)
```
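To close the loop, here is a hedged sketch of retrieving the dataset later by name and inspecting its linked features; it reuses only the query API shown in the previous section (the name string matches the example above):

```python
# query the dataset back by name and inspect its linked gene records
dobject = ln.select(ln.DObject, name="Mouse Lymph Node scRNA-seq").one()
print(dobject.features.genes[:3])

# load the AnnData object back from storage
adata = dobject.load()
```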

See {doc}`/guide/features` for more.

- Each page in this guide is a Jupyter Notebook, which you can download [here](https://github.com/laminlabs/lamindb/tree/main/docs/guide).
- You can run these notebooks in hosted versions of JupyterLab, e.g., [Saturn Cloud](https://github.com/laminlabs/run-lamin-on-saturn), Google Vertex AI, and others.
- We recommend using [JupyterLab](https://jupyterlab.readthedocs.io/) for the best notebook-tracking experience.

📬 Reach out to report issues, or to learn about the data modules that connect your assays, pipelines & workflows within our enterprise data platform plan.
