
Python wrapper for the Lance columnar format

Project description

Python bindings for Lance Data Format

Warning: Under heavy development


Lance is a new columnar data format for data science and machine learning

Why you should use Lance

  1. Is an order of magnitude faster than Parquet for point queries and nested data structures common to DS/ML
  2. Comes with a fast vector index that delivers sub-millisecond nearest-neighbor search performance
  3. Is automatically versioned and supports lineage and time travel for full reproducibility
  4. Is already integrated with DuckDB, pandas, and Polars, and converts from/to Parquet in 2 lines of code

Quick start

Installation

pip install pylance

Make sure you have recent versions of pandas (1.5+), pyarrow (10.0+), and DuckDB (0.7.0+).
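A quick way to confirm those minimums before running the examples is to read the installed versions from package metadata. This is a minimal sketch using only the standard library; the helper names `parse_version` and `check_requirements` are illustrative and not part of Lance.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions taken from the note above; the helper names are illustrative.
MINIMUMS = {"pandas": (1, 5), "pyarrow": (10, 0), "duckdb": (0, 7)}


def parse_version(text):
    """Turn '1.5.3' or '0.7.0.dev1' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_requirements(minimums=MINIMUMS):
    """Return {package: problem} for every package that is missing or too old."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if parse_version(installed) < minimum:
            problems[name] = f"found {installed}, need {'.'.join(map(str, minimum))}+"
    return problems


if __name__ == "__main__":
    for name, problem in check_requirements().items():
        print(f"{name}: {problem}")
```

An empty result means all three packages meet the stated minimums.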

Converting to Lance

import lance

import pandas as pd
import pyarrow as pa
import pyarrow.dataset

df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')

parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")

Reading Lance data

dataset = lance.dataset("/tmp/test.lance")
assert isinstance(dataset, pa.dataset.Dataset)

Pandas

df = dataset.to_table().to_pandas()

DuckDB

import duckdb

# If this segfaults, make sure you have duckdb v0.7+ installed
duckdb.query("SELECT * FROM dataset LIMIT 10").to_df()

Vector search

Download the sift1m subset

wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz

Convert it to Lance

import lance
from lance.vector import vec_to_table
import numpy as np

nvecs = 1000000
ndims = 128
# Each .fvecs record is a 4-byte dimension header followed by `ndims` float32 values,
# so read (ndims + 1) 4-byte slots per vector and drop the header column.
raw = np.fromfile("sift/sift_base.fvecs", dtype="<f4").reshape((nvecs, ndims + 1))
data = raw[:, 1:]
# Map row id -> vector
dd = dict(zip(range(nvecs), data))

table = vec_to_table(dd)
uri = "vec_data.lance"
sift1m = lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)
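For reference, the .fvecs files in this corpus store every vector as a little-endian int32 dimension followed by that many float32 values, so a 128-dimensional vector occupies 4 + 4 * 128 = 516 bytes. The following standard-library sketch round-trips a tiny synthetic buffer in that layout; `pack_fvecs` and `unpack_fvecs` are hypothetical helpers written for illustration, not part of Lance.

```python
import struct


def pack_fvecs(vectors):
    """Serialize vectors in .fvecs layout: int32 dimension, then that many float32s."""
    out = bytearray()
    for vec in vectors:
        out += struct.pack("<i", len(vec))
        out += struct.pack(f"<{len(vec)}f", *vec)
    return bytes(out)


def unpack_fvecs(buf):
    """Parse an .fvecs byte string back into a list of float lists."""
    vectors = []
    offset = 0
    while offset < len(buf):
        # Every record starts with its own dimension header.
        (dim,) = struct.unpack_from("<i", buf, offset)
        offset += 4
        vectors.append(list(struct.unpack_from(f"<{dim}f", buf, offset)))
        offset += 4 * dim
    return vectors


# Values chosen to be exactly representable in float32.
sample = [[1.0, 2.0, 3.0], [0.5, 0.25, 0.125]]
encoded = pack_fvecs(sample)
decoded = unpack_fvecs(encoded)
```

The point to notice is that the dimension header repeats before every vector, not just once at the start of the file.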

Build the index

sift1m.create_index("vector",
                    index_type="IVF_PQ", 
                    num_partitions=256,  # IVF
                    num_sub_vectors=16)  # PQ
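To get a feel for what these two parameters imply for SIFT1M, the arithmetic can be worked through directly. This is a back-of-the-envelope sketch; the 8-bit code size is an assumption made here for illustration and is not stated above.

```python
nvecs = 1_000_000       # vectors in SIFT1M
ndims = 128             # dimensions per vector
num_partitions = 256    # IVF partitions, as in the call above
num_sub_vectors = 16    # PQ sub-vectors, as in the call above
bits_per_code = 8       # assumption: one byte per PQ code

# IVF: the vectors are spread over the partitions, so probing one
# partition scans roughly this many vectors instead of all of them.
avg_partition_size = nvecs / num_partitions

# PQ: each vector is cut into equal sub-vectors, each replaced by a short code.
assert ndims % num_sub_vectors == 0, "dimension must divide evenly"
dims_per_sub_vector = ndims // num_sub_vectors
raw_bytes = ndims * 4                               # float32 storage
pq_bytes = num_sub_vectors * bits_per_code // 8     # compressed storage
compression_ratio = raw_bytes / pq_bytes

print(f"~{avg_partition_size:.0f} vectors per partition")
print(f"{dims_per_sub_vector} dims per sub-vector")
print(f"{raw_bytes} B raw vs {pq_bytes} B PQ ({compression_ratio:.0f}x smaller)")
```

More partitions shrink the number of vectors scanned per probe, and more sub-vectors trade index size for accuracy.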

Search the dataset

# Get top 10 similar vectors
import duckdb

dataset = lance.dataset(uri)

# Sample 100 query vectors. If this segfaults, make sure you have duckdb v0.7+ installed
sample = duckdb.query("SELECT vector FROM dataset USING SAMPLE 100").to_df()
query_vectors = np.array([np.array(x) for x in sample.vector])

# Get nearest neighbors for all of them
rs = [dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})      
      for q in query_vectors]

*More distance metrics, HNSW, and distributed support are on the roadmap
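Results from an approximate index are commonly sanity-checked against an exact search on a small sample. The sketch below is a plain-Python brute-force top-k with a recall measure. It assumes squared L2 distance as the metric, and `brute_force_knn` and `recall_at_k` are hypothetical helpers, not part of Lance.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_knn(vectors, query, k):
    """Return [(row_index, distance)] of the k vectors closest to `query`."""
    scored = ((squared_l2(vec, query), idx) for idx, vec in enumerate(vectors))
    return [(idx, dist) for dist, idx in heapq.nsmallest(k, scored)]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of exact neighbors that the approximate result also found."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)


vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
exact = brute_force_knn(vectors, query=[0.9, 0.1], k=2)
exact_ids = [idx for idx, _ in exact]
```

Comparing the row ids returned by the indexed search with `exact_ids` through `recall_at_k` gives a simple quality number for a given index configuration.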

Python package details

Install from PyPI: pip install pylance # >=0.3.0 is the new rust-based implementation
Install from source: maturin develop (under the /python directory)
Run unit tests: make test
Run integration tests: make integtest

Import via: import lance

The Python integration is built with pyo3 plus custom Python code:

  1. We make wrapper classes in Rust for Dataset/Scanner/RecordBatchReader that are exposed to Python.
  2. These are then used by the LanceDataset / LanceScanner implementations, which extend pyarrow Dataset/Scanner for DuckDB compatibility.
  3. Data is delivered via the Arrow C Data Interface.

Motivation

Why do we need a new format for data science and machine learning?

1. Reproducibility is a must-have

Versioning and experimentation support should be built into the dataset instead of requiring multiple tools.
It should also be efficient and not require expensive copying every time you want to create a new version.
We call this "zero-copy versioning" in Lance. It makes versioning data easy without increasing storage costs.

2. Cloud storage is now the default

Remote object storage is now the default for data science and machine learning, and the performance characteristics of cloud storage are fundamentally different from those of local disks.
The Lance format is optimized to be cloud native. Common operations like filter-then-take can be an order of magnitude faster with Lance than with Parquet, especially for ML data.

3. Vectors must be a first class citizen, not a separate thing

The majority of reasonable-scale workflows should not require the added complexity and cost of a specialized database just to compute vector similarity. Lance integrates optimized vector indices into a columnar format, so no additional infrastructure is required to get low-latency top-K similarity search.

4. Open standards are a requirement

The DS/ML ecosystem is incredibly rich, and data must be easily accessible across different languages, tools, and environments. Lance makes Apache Arrow integration its primary interface, which means converting to or from Lance takes 2 lines of code, your code does not need to change after conversion, and nothing is locked up to force you to pay for vendor compute. We need open source, not fauxpen source.



