LaminDB: Manage R&D data & analyses.
LaminDB: Data lake for biology
LaminDB is an API layer for your existing infrastructure to manage your existing data & analyses.
Public beta: Currently only recommended for collaborators as we still make breaking changes.
Update 2023-06-05: We completed a major migration from SQLAlchemy/SQLModel to Django, available in pre-releases of v0.42.
Features
Free:
- Track data lineage across notebooks, pipelines & apps.
- Manage biological registries, ontologies & features.
- Persist, load & stream data objects with a single line of code.
- Query for anything, define & manage your own schemas.
- Manage data on your laptop, on your server or in your cloud infra.
- Use a mesh of distributed LaminDB instances for different teams and purposes.
- Share instances through a Hub akin to GitHub.
Enterprise:
- Explore & share data, submit samples & track lineage with LaminApp (deployable in your infra).
- Receive support, code templates & services for a BioTech data & analytics platform.
How does it work?
LaminDB builds semantics of R&D and biology onto well-established tools:
- SQLite & Postgres for SQL databases
- S3, GCP & local storage for object storage
- Django ORM and fsspec
- Configurable storage formats: pyarrow, anndata, zarr, etc.
- Biological knowledge resources & ontologies: see Bionty
LaminDB is open source. For details, see Architecture.
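The division of labor in the list above (metadata in a SQL database, data objects in a storage location) can be sketched with the standard library alone. This is a minimal illustration, not LaminDB's implementation: the `file` table, its columns, and the helper names `init_registry`, `save_object`, and `load_object` are all hypothetical.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def init_registry(db_path):
    """Create a tiny metadata registry (hypothetical schema, not LaminDB's)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file ("
        "id INTEGER PRIMARY KEY, name TEXT, storage_key TEXT, sha256 TEXT)"
    )
    return conn


def save_object(conn, storage_root, name, data):
    """Write bytes to the storage location and register metadata in SQL."""
    digest = hashlib.sha256(data).hexdigest()
    key = f"{digest[:16]}.bin"  # storage key derived from the content hash
    (Path(storage_root) / key).write_bytes(data)
    conn.execute(
        "INSERT INTO file (name, storage_key, sha256) VALUES (?, ?, ?)",
        (name, key, digest),
    )
    conn.commit()


def load_object(conn, storage_root, name):
    """Look up the storage key by name in SQL, then read the bytes back."""
    row = conn.execute(
        "SELECT storage_key FROM file WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    return (Path(storage_root) / row[0]).read_bytes()


def demo():
    """Round-trip one object through a temporary storage directory."""
    with tempfile.TemporaryDirectory() as root:
        conn = init_registry(":memory:")
        save_object(conn, root, "My dataframe", b"a,b\n1,3\n2,4\n")
        data = load_object(conn, root, "My dataframe")
        conn.close()
        return data
```

The point of the split is that queries only touch the small SQL registry, while the potentially large objects stay in storage until they are explicitly loaded.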
Installation
pip install lamindb # basic data lake
pip install 'lamindb[bionty]' # biological entities
pip install 'lamindb[nbproject]' # Jupyter notebook tracking
pip install 'lamindb[aws]' # AWS dependencies (s3fs, etc.)
pip install 'lamindb[gcp]' # GCP dependencies (gcsfs, etc.)
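After installing, it can help to check which optional dependencies are actually importable. The following is a small sketch using only the standard library. The mapping from extras to module names is an assumption inferred from the comments above (for example, that the `aws` extra provides `s3fs` and the `gcp` extra provides `gcsfs`), and `check_optional_modules` is a hypothetical helper, not part of LaminDB.

```python
from importlib.util import find_spec

# Extra -> top-level module name. These names are assumptions inferred
# from the installation comments, not an official list.
OPTIONAL_MODULES = {
    "lamindb": "lamindb",
    "lamindb[bionty]": "bionty",
    "lamindb[nbproject]": "nbproject",
    "lamindb[aws]": "s3fs",
    "lamindb[gcp]": "gcsfs",
}


def check_optional_modules(modules=OPTIONAL_MODULES):
    """Return {label: bool} indicating whether each module is importable."""
    return {
        label: find_spec(module) is not None
        for label, module in modules.items()
    }


if __name__ == "__main__":
    for label, available in check_optional_modules().items():
        print(f"{label:<20} {'installed' if available else 'missing'}")
```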
Quick setup
Why do I have to sign up?
- Data lineage requires a user identity (who modified which data when?).
- Collaboration requires a user identity (who shares this with me?).
Signing up takes 1 min.
We do not store any of your data, but only basic metadata about you (email address, etc.) & your instances (S3 bucket names, etc.).
- Sign up via lamin signup <email>.
- Log in via lamin login <handle>.
- Init an instance via lamin init --storage <storage>.
Usage overview
Track & query data lineage
import lamindb as ln
ln.track() # auto-detect a notebook & register as a Transform
ln.File("my_artifact.parquet").save() # link Transform & Run objects to File object
Now, you can query, e.g., for
ln.File.select(created_by__handle="user1").df() # a DataFrame of all files ingested by user1
ln.File.select().order_by("-updated_at").first() # latest updated file
Or for
transforms = ln.Transform.select( # all notebooks with 'T cell' in the title created in 2022
name__contains="T cell", type="notebook", created_at__year=2022
).all()
ln.File.select(transform=transforms[1]).all() # files ingested by the second notebook in transforms
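The keyword arguments in these queries follow the Django ORM convention of `field__lookup` (for example `name__contains` or `created_at__year`). To make that convention concrete, here is a plain-Python sketch that applies the same kind of filter to an in-memory list. It is not how LaminDB or Django evaluate queries (they translate lookups to SQL); the `Transform` dataclass, `matches`, and `select` below are hypothetical stand-ins, and lookups that span relations such as `created_by__handle` are not supported in this sketch.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Transform:
    """Hypothetical stand-in for a registry record."""
    name: str
    type: str
    created_at: datetime


def matches(record, **lookups):
    """Check a record against Django-style 'field__lookup' keyword filters.

    Supports only 'exact' (the default), 'contains', and 'year'.
    """
    for key, expected in lookups.items():
        field, _, lookup = key.partition("__")
        value = getattr(record, field)
        if lookup in ("", "exact"):
            ok = value == expected
        elif lookup == "contains":
            ok = expected in value
        elif lookup == "year":
            ok = value.year == expected
        else:
            raise ValueError(f"unsupported lookup: {lookup}")
        if not ok:
            return False
    return True


def select(records, **lookups):
    """Return all records that satisfy every lookup."""
    return [r for r in records if matches(r, **lookups)]


def demo():
    registry = [
        Transform("T cell atlas", "notebook", datetime(2022, 3, 1)),
        Transform("T cell QC", "pipeline", datetime(2022, 5, 1)),
        Transform("B cell atlas", "notebook", datetime(2022, 7, 1)),
        Transform("T cell followup", "notebook", datetime(2023, 1, 15)),
    ]
    return select(
        registry, name__contains="T cell", type="notebook", created_at__year=2022
    )
```

All lookups are combined with a logical AND, so `demo()` keeps only records that match the name, the type, and the year at once.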
Or, if you'd like to track a run of a registered pipeline (here, "Cell Ranger"):
transform = ln.Transform.select(name="Cell Ranger", version="0.7.1").one() # select a pipeline from the registry
ln.track(transform) # create a new global run context
ln.File("s3://my_samples01/my_artifact.fastq.gz").save() # link file against run & transform
Now, you can query, e.g., for
runs = ln.select(ln.Run, transform__name="Cell Ranger").order_by("-created_at").df() # DataFrame of Cell Ranger pipeline runs, latest first
# query files by selected runs, etc.
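Querying files by runs and transforms amounts to following foreign keys from file to run to transform. The sketch below shows that relational view with the standard-library `sqlite3` module. The table and column names are assumptions chosen for illustration and do not reproduce LaminDB's actual schema; `build_demo_db` and `files_for_transform` are hypothetical helpers.

```python
import sqlite3

# Hypothetical three-table lineage schema: file -> run -> transform.
SCHEMA = """
CREATE TABLE transform (id INTEGER PRIMARY KEY, name TEXT, version TEXT);
CREATE TABLE run (
    id INTEGER PRIMARY KEY,
    transform_id INTEGER REFERENCES transform(id),
    created_at TEXT
);
CREATE TABLE file (
    id INTEGER PRIMARY KEY,
    name TEXT,
    run_id INTEGER REFERENCES run(id)
);
"""


def build_demo_db():
    """Create an in-memory database with two transforms and three runs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO transform VALUES (1, 'Cell Ranger', '0.7.1')")
    conn.execute("INSERT INTO transform VALUES (2, 'QC notebook', '1')")
    conn.executemany(
        "INSERT INTO run VALUES (?, ?, ?)",
        [(1, 1, "2023-06-01"), (2, 1, "2023-06-03"), (3, 2, "2023-06-02")],
    )
    conn.executemany(
        "INSERT INTO file VALUES (?, ?, ?)",
        [(1, "a.fastq.gz", 1), (2, "b.fastq.gz", 2), (3, "qc.parquet", 3)],
    )
    return conn


def files_for_transform(conn, transform_name):
    """Files produced by runs of the named transform, newest run first."""
    return conn.execute(
        """
        SELECT file.name, run.created_at
        FROM file
        JOIN run ON file.run_id = run.id
        JOIN transform ON run.transform_id = transform.id
        WHERE transform.name = ?
        ORDER BY run.created_at DESC
        """,
        (transform_name,),
    ).fetchall()
```

Calling `files_for_transform(build_demo_db(), "Cell Ranger")` returns the two FASTQ files, ordered by the creation date of the run that produced them.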
Persist & load data objects
import pandas as pd
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
ln.File(df, name="My dataframe").save()
Get it back:
file = ln.select(ln.File, name="My dataframe").one() # query for it
df = file.load() # load it into memory
   a  b
0  1  3
1  2  4
Manage biological registries
lamin init --storage ./myobjects --schema bionty
...
Track biological features
...
Track biological samples
...
Manage custom schemas
- Create a GitHub repository with Django ORMs similar to github.com/laminlabs/lnschema-lamin1
- Create & deploy migrations via lamin migrate create and lamin migrate deploy.
It's fastest if we do this for you based on our templates within an enterprise plan, but you can fully manage the process yourself.
Notebooks
- Find all guide notebooks here.
- You can run these notebooks in hosted versions of JupyterLab (e.g., Saturn Cloud, Google Vertex AI, and others) or on Google Colab.
- JupyterLab & Jupyter Notebook offer a fully interactive experience; VS Code & others require using the CLI (lamin track my-notebook.ipynb).
Architecture
LaminDB consists of the lamindb Python package, which builds on a number of open-source packages:
- bionty: Basic biological entities (usable standalone).
- lamindb-setup: Setup & configure LaminDB, client for Lamin Hub.
- lnschema-core: Core schema, ORMs to model data objects & data lineage.
- lnschema-bionty: Bionty schema, ORMs that are coupled to Bionty's entities.
- lnschema-lamin1: Exemplary configured schema to track samples, treatments, etc.
- nbproject: Parse metadata from Jupyter notebooks.
LaminHub & LaminApp are not open-sourced, and neither are the templates that model lab operations.
Lamin's packages build on the infrastructure listed above. Previously, they were based on SQLAlchemy/SQLModel instead of Django, and cloudpathlib instead of fsspec.
Documentation
Read the docs.