LaminDB: Manage R&D data & analyses.
LaminDB: Data lake for biology
LaminDB is an API layer on top of your existing infrastructure for managing your existing data.
Public beta: Currently only recommended for collaborators as we still make breaking changes.
Update 2023-06-14:
- We completed a major migration from SQLAlchemy/SQLModel to Django, available in 0.42.0.
- The last version before the migration is 0.41.2.
Features
Free:
- Track data lineage across notebooks, pipelines & apps.
- Manage biological registries, ontologies & features.
- Query, search & look up anything, manage & migrate custom schemas.
- Persist, load & stream data objects with a single line of code.
- Idempotent and ACID operations.
- Use a mesh of LaminDB instances and share them in a hub akin to GitHub.
Enterprise:
- Explore & share data, submit samples (to come) & track lineage with LaminApp (deployable in your infrastructure).
- Receive support, code templates & services for a BioTech data & analytics platform.
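The idempotency guarantee above is commonly implemented by content-addressing: hash the bytes of an artifact and skip ingestion when that hash already exists. A minimal stdlib sketch of the idea (an illustration only, not LaminDB's actual implementation):

```python
import hashlib

def ingest(registry: dict, name: str, content: bytes) -> str:
    """Idempotent ingestion keyed on a content hash (toy sketch)."""
    digest = hashlib.sha256(content).hexdigest()
    # If the same bytes were ingested before, keep the existing record.
    registry.setdefault(digest, {"name": name, "size": len(content)})
    return digest

registry = {}
h1 = ingest(registry, "dataset1", b"feat1,feat2\n1,3\n2,4\n")
h2 = ingest(registry, "dataset1-copy", b"feat1,feat2\n1,3\n2,4\n")  # same bytes
# h1 == h2, and the registry still holds a single record
```

Re-running the same ingestion is therefore a no-op, which is what makes pipeline re-execution safe.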
Usage overview
Use the CLI to initialize a data lake with local or cloud default storage:
$ lamin init --storage ./myartifacts # or s3://my-bucket, gs://my-bucket, etc.
Within Python, import lamindb:
import lamindb as ln
Store, query, search & load data artifacts
Store a DataFrame in default storage:
df = pd.DataFrame({"feat1": [1, 2], "feat2": [3, 4]}) # AnnData works, too
ln.File(df, name="My dataset1").save() # create a File object and save it
You'll have the full power of SQL to query for metadata, but the simplest query for a file is:
file = ln.File.select(name="My dataset1").one() # get exactly one result
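The .one() call enforces an exactly-one contract: zero or multiple matches raise instead of silently returning something. A rough stdlib illustration of that semantics (not LaminDB's code):

```python
def one(results: list):
    """Return the single element of `results`, raising otherwise (toy sketch)."""
    if len(results) != 1:
        raise ValueError(f"expected exactly one result, got {len(results)}")
    return results[0]

one(["file-abc"])  # returns "file-abc"
# one([]) and one(["a", "b"]) both raise ValueError
```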
If you don't have specific metadata in mind, search for the file:
ln.File.search("dataset1")
Load the file back into memory:
df = file.load()
Or get a backed accessor to stream its content from the cloud:
backed = file.backed() # currently works for AnnData, zarr, HDF5, not yet for DataFrame
Track & query data lineage
user = ln.User.select(handle="lizlemon").one()
ln.File.select(created_by=user).df() # all files ingested by lizlemon
ln.File.select().order_by("-updated_at").first() # latest updated file
Notebooks
Track a Jupyter Notebook:
ln.track() # auto-detect & save notebook metadata
ln.File("my_artifact.parquet").save() # this file is an output of the notebook run
When you query this file later on, you'll know which notebook it came from:
file = ln.File.select(name="my_artifact.parquet").one() # query for a file
file.transform # notebook with id, title, filename, version, etc.
file.run # the notebook run that created the file
Or query for notebooks directly:
transforms = ln.Transform.select( # all notebooks with 'T cell' in the title created in 2022
name__contains="T cell", type="notebook", created_at__year=2022
).all()
ln.File.select(transform__in=transforms).all() # data artifacts created by these notebooks
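Filters like name__contains and created_at__year use Django's double-underscore lookup syntax: the part before __ names a field, the part after names an operator. A toy stdlib re-implementation over plain dicts, purely to illustrate the syntax (not how Django or LaminDB resolve it):

```python
import datetime

def matches(record: dict, key: str, value) -> bool:
    """Resolve one Django-style lookup against a dict (toy sketch)."""
    field, _, op = key.partition("__")
    if op == "contains":
        return value in record[field]
    if op == "year":
        return record[field].year == value
    if op == "in":
        return record[field] in value
    return record[key] == value  # no operator: plain equality

def select(records, **lookups):
    return [r for r in records if all(matches(r, k, v) for k, v in lookups.items())]

transforms = [
    {"name": "T cell atlas", "type": "notebook",
     "created_at": datetime.date(2022, 5, 1)},
    {"name": "B cell QC", "type": "notebook",
     "created_at": datetime.date(2023, 1, 1)},
]
select(transforms, name__contains="T cell", created_at__year=2022)  # first record only
```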
Pipelines
This works just like it does for notebooks, except that you provide pipeline metadata yourself.
To save a pipeline to the Transform registry, call:
ln.Transform(name="Awesom-O", version="0.41.2").save() # save a pipeline, optionally with metadata
Track a pipeline run:
transform = ln.Transform.select(name="Awesom-O", version="0.41.2").one() # select pipeline from the registry
ln.track(transform) # create a new global run context
ln.File("s3://my_samples01/my_artifact.fastq.gz").save() # link file against run & transform
Now you can query, e.g., for the latest pipeline runs:
ln.Run.select(transform__name="Awesom-O").order_by("-created_at").df() # get the latest pipeline runs
Run inputs
To track run inputs, pass is_run_input to any File accessor: .stage(), .load() or .backed(). For instance:
file.load(is_run_input=True)
Alternatively, you can track all files accessed through these methods by setting ln.settings.track_run_inputs = True.
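Conceptually, this amounts to a global run context that files get linked against on access. A toy sketch of the pattern (assumed names; not LaminDB internals):

```python
current_run = None  # module-level run context (toy sketch)

def track(transform: str):
    """Open a new run context for `transform`."""
    global current_run
    current_run = {"transform": transform, "inputs": [], "outputs": []}

def load_file(name: str, is_run_input: bool = False) -> str:
    """'Load' a file, optionally linking it as an input of the current run."""
    if is_run_input and current_run is not None:
        current_run["inputs"].append(name)
    return f"<contents of {name}>"

track("Awesom-O")
load_file("my_artifact.fastq.gz", is_run_input=True)
# current_run["inputs"] now contains "my_artifact.fastq.gz"
```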
Auto-complete categoricals
When you're unsure about spellings, use a lookup object:
lookup = ln.Transform.lookup()
ln.Run.select(transform=lookup.awesom_o)
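A lookup object is essentially a namespace whose attribute names are sanitized versions of the registry's names, so your editor can auto-complete them. A minimal stdlib sketch of the idea (not LaminDB's implementation):

```python
import re
from types import SimpleNamespace

def make_lookup(names):
    """Map names to valid Python identifiers for attribute access (toy sketch)."""
    def slug(name: str) -> str:
        return re.sub(r"\W+", "_", name).strip("_").lower()
    return SimpleNamespace(**{slug(n): n for n in names})

lookup = make_lookup(["Awesom-O", "scRNA-seq pipeline"])
lookup.awesom_o            # "Awesom-O"
lookup.scrna_seq_pipeline  # "scRNA-seq pipeline"
```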
Load your data lake instance from anywhere
Let other users access your work, including all lineage & metadata, via a single line:
$ lamin load myaccount/myartifacts
Manage biological registries
lamin init --storage ./bioartifacts --schema bionty
...
Track biological features
...
Track biological samples
...
Manage custom schemas
- Create a GitHub repository with Django ORMs similar to github.com/laminlabs/lnschema-lamin1
- Create & deploy migrations via lamin migrate create and lamin migrate deploy
It's fastest if we do this for you based on our templates within an enterprise plan, but you can fully manage the process yourself.
Installation
pip install lamindb # basic data lake
pip install 'lamindb[jupyter]' # Jupyter notebook tracking
pip install 'lamindb[bionty]' # basic biological entities
pip install 'lamindb[fcs]' # .fcs files (flow cytometry)
pip install 'lamindb[aws]' # AWS (s3fs, etc.)
pip install 'lamindb[gcp]' # Google Cloud (gcfs, etc.)
Quick setup
Why do I have to sign up?
- Data lineage requires a user identity (who modified which data when?).
- Collaboration requires a user identity (who shares this with me?).
Signing up takes 1 min.
We do not store any of your data, but only basic metadata about you (email address, etc.) & your LaminDB instances (S3 bucket names, etc.).
- Sign up:
lamin signup <email>
- Log in:
lamin login <handle>
How does it work?
LaminDB builds semantics of R&D and biology onto well-established tools:
- SQLite & Postgres for SQL databases using Django ORM (previously: SQLModel)
- S3, GCP & local storage for object storage using fsspec
- Configurable storage formats: pyarrow, anndata, zarr, etc.
- Biological knowledge sources & ontologies: see Bionty
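Configurable storage formats boil down to a dispatch table from file suffix to a serializer. A toy stdlib sketch of such a registry (illustrative only; LaminDB's actual backends are pyarrow, anndata, zarr, etc.):

```python
import json
import pathlib
import tempfile

# Suffix -> (writer, reader) registry (toy sketch)
FORMATS = {
    ".json": (lambda obj, p: p.write_text(json.dumps(obj)),
              lambda p: json.loads(p.read_text())),
}

def save_artifact(obj, path: str):
    p = pathlib.Path(path)
    writer, _ = FORMATS[p.suffix]
    writer(obj, p)

def load_artifact(path: str):
    p = pathlib.Path(path)
    _, reader = FORMATS[p.suffix]
    return reader(p)

with tempfile.TemporaryDirectory() as d:
    save_artifact({"feat1": [1, 2]}, f"{d}/my.json")
    restored = load_artifact(f"{d}/my.json")
# restored == {"feat1": [1, 2]}
```

Registering a new format is then just adding another suffix entry, which is the design that makes storage backends pluggable.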
LaminDB is open source.
Architecture
LaminDB consists of the lamindb Python package (repository here) with its components:
- bionty: Basic biological entities (usable standalone).
- lamindb-setup: Setup & configure LaminDB, client for Lamin Hub.
- lnschema-core: Core schema, ORMs to model data objects & data lineage.
- lnschema-bionty: Bionty schema, ORMs that are coupled to Bionty's entities.
- lnschema-lamin1: Exemplary configured schema to track samples, treatments, etc.
- nbproject: Parse metadata from Jupyter notebooks.
LaminHub & LaminApp are not open-sourced, and neither are templates that model lab operations.
Lamin's packages build on the infrastructure listed above.
Notebooks
- Find all guide notebooks here.
- You can run these notebooks in hosted versions of JupyterLab, e.g., Saturn Cloud, Google Vertex AI, Google Colab, and others.
- JupyterLab & Notebook offer a fully interactive experience; VS Code & others require using the CLI (lamin track my-notebook.ipynb).
Documentation
Read the docs.
Project details
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file lamindb-0.44.1.tar.gz
File metadata
- Download URL: lamindb-0.44.1.tar.gz
- Upload date:
- Size: 215.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: python-requests/2.31.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | e200467d8ea7377c7b1e2bab26a51f6bb40830aaad396417a4704d7273a98b16
MD5 | 0babe9e64776743438b8fc8ba5e27346
BLAKE2b-256 | dc5f958da9a94eae714164e73fafab7864a215609c4c06b06c23da5582dd87e6
Provenance
File details
Details for the file lamindb-0.44.1-py3-none-any.whl
File metadata
- Download URL: lamindb-0.44.1-py3-none-any.whl
- Upload date:
- Size: 52.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: python-requests/2.31.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4559908f52d5b67e96d34ad6e0966e3fb8e1e0eef132e1fe79445b8728e30d0e
MD5 | ccbe00bfef040b07df91b269e34195ee
BLAKE2b-256 | 90df51484313dece4eab900091f63175cab6d7c9a0d135cfe8716be1329bd4ec