
Python wrapper for the Lance columnar format

Project description

Python bindings for Lance Data Format

Warning: under heavy development


Lance is a new columnar data format for data science and machine learning

Why you should use Lance

  1. Is an order of magnitude faster than Parquet for point queries and for the nested data structures common in DS/ML
  2. Comes with a fast vector index that delivers sub-millisecond nearest-neighbor search
  3. Is automatically versioned and supports lineage and time travel for full reproducibility
  4. Is already integrated with DuckDB, pandas, and Polars, and converts from/to Parquet in 2 lines of code

Quick start

Installation

pip install pylance

Make sure you have recent versions of pandas (1.5+), pyarrow (10.0+), and DuckDB (0.7.0+).

Converting to Lance

import lance

import pandas as pd
import pyarrow as pa
import pyarrow.dataset

df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')

parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")

Reading Lance data

dataset = lance.dataset("/tmp/test.lance")
assert isinstance(dataset, pa.dataset.Dataset)

Pandas

df = dataset.to_table().to_pandas()

DuckDB

import duckdb

# If this segfaults, make sure you have duckdb v0.7+ installed
duckdb.query("SELECT * FROM dataset LIMIT 10").to_df()

Vector search

Download the sift1m subset

wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz

Convert it to Lance

import lance
from lance.vector import vec_to_table
import numpy as np

nvecs = 1000000
ndims = 128

# Each .fvecs record is a 4-byte dimension header followed by `ndims` float32 values,
# so read every record as ndims + 1 four-byte slots and drop the header slot.
raw = np.fromfile("sift/sift_base.fvecs", dtype="<f4").reshape((nvecs, ndims + 1))
data = raw[:, 1:]
dd = dict(zip(range(nvecs), data))

table = vec_to_table(dd)
uri = "vec_data.lance"
sift1m = lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)
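
The SIFT files use the .fvecs layout: every vector is stored as a 4-byte little-endian integer giving its dimension, followed by that many 4-byte floats. The sketch below reads that layout with the standard library only and checks it against a tiny in-memory file. `read_fvecs` and `write_fvecs` are hypothetical helper names for illustration, not part of Lance.

```python
import io
import struct


def read_fvecs(fobj):
    """Yield one vector (a tuple of floats) per .fvecs record."""
    while True:
        header = fobj.read(4)
        if not header:
            break  # clean end of file
        if len(header) < 4:
            raise ValueError("truncated dimension header")
        (dim,) = struct.unpack("<i", header)
        payload = fobj.read(4 * dim)
        if len(payload) != 4 * dim:
            raise ValueError("truncated vector payload")
        yield struct.unpack(f"<{dim}f", payload)


def write_fvecs(vectors):
    """Serialize vectors to .fvecs bytes (only used to build the demo input)."""
    out = bytearray()
    for vec in vectors:
        out += struct.pack("<i", len(vec))
        out += struct.pack(f"<{len(vec)}f", *vec)
    return bytes(out)


# Two 3-dimensional vectors: each record is 4 header bytes + 12 payload bytes.
demo = write_fvecs([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)])
vectors = list(read_fvecs(io.BytesIO(demo)))
```

Because the header repeats before every vector, a reader that skips only the first 4 bytes and treats the rest as contiguous floats will misalign the data after the first vector.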

Build the index

sift1m.create_index("vector",
                    index_type="IVF_PQ", 
                    num_partitions=256,  # IVF
                    num_sub_vectors=16)  # PQ
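
To make the two parameters above concrete, here is a toy standard-library sketch of the general IVF and PQ ideas. It is not Lance's implementation; the helper names and the hand-written centroids are assumptions for illustration. IVF assigns each vector to the nearest of `num_partitions` coarse centroids. PQ splits each vector into `num_sub_vectors` equal chunks and stores, for each chunk, the id of its nearest codebook centroid.

```python
import math


def split_sub_vectors(vector, num_sub_vectors):
    """Split a vector into equally sized contiguous chunks."""
    if len(vector) % num_sub_vectors != 0:
        raise ValueError("dimension must be divisible by num_sub_vectors")
    width = len(vector) // num_sub_vectors
    return [vector[i:i + width] for i in range(0, len(vector), width)]


def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to vector (L2 distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vector, centroids[i]))


def pq_encode(vector, codebooks):
    """codebooks[m] holds the centroids for the m-th sub-vector."""
    parts = split_sub_vectors(vector, len(codebooks))
    return [nearest_centroid(part, book) for part, book in zip(parts, codebooks)]


# A 128-dimensional vector with 16 sub-vectors gives chunks of 8 dimensions.
parts = split_sub_vectors([float(i) for i in range(128)], 16)

# IVF step on toy data: two coarse centroids, pick the closer one.
ivf_centroids = [[0.0] * 4, [10.0] * 4]
partition = nearest_centroid([9.0, 9.5, 10.0, 11.0], ivf_centroids)

# PQ step on toy data: a 4-dim vector, 2 sub-vectors, 2 centroids per codebook.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [5.0, 5.0]],
]
code = pq_encode([0.9, 1.1, 4.0, 6.0], codebooks)
```

In a real index the centroids are learned from the data (typically with k-means) rather than written by hand. The vector dimension has to be divisible by the number of sub-vectors, which holds here since 128 / 16 = 8.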

Search the dataset

# Get top 10 similar vectors
import duckdb

dataset = lance.dataset(uri)

# Sample 100 query vectors. If this segfaults, make sure you have duckdb v0.7+ installed
sample = duckdb.query("SELECT vector FROM dataset USING SAMPLE 100").to_df()
query_vectors = np.array([np.array(x) for x in sample.vector])

# Get nearest neighbors for all of them
rs = [dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})      
      for q in query_vectors]
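
An IVF_PQ index returns approximate neighbors, so it can be useful to compare a few queries against an exact brute-force search. The sketch below uses only the standard library and plain Python lists. `exact_top_k` and `recall` are hypothetical helpers, and mapping Lance result rows back to row ids is left out.

```python
import heapq
import math


def exact_top_k(query, vectors, k):
    """Return the indices of the k vectors closest to query (L2 distance)."""
    return heapq.nsmallest(
        k, range(len(vectors)), key=lambda i: math.dist(query, vectors[i])
    )


def recall(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate result found."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact.intersection(approx_ids)) / len(exact)


vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
exact = exact_top_k([0.1, 0.0], vectors, k=2)

# Pretend an approximate index returned ids 0 and 2 for the same query.
score = recall([0, 2], exact)
```

Brute force is linear in the dataset size per query, so this check is only practical on a small sample of queries.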

*More distance metrics, HNSW, and distributed support are on the roadmap.

Python package details

Install from PyPI: pip install pylance # >=0.3.0 is the new Rust-based implementation
Install from source: maturin develop (under the /python directory)
Run unit tests: make test
Run integration tests: make integtest

Import via: import lance

The Python integration is done via pyo3 plus custom Python code:

  1. We make wrapper classes in Rust for Dataset/Scanner/RecordBatchReader that are exposed to Python.
  2. These are then used by the LanceDataset / LanceScanner implementations, which extend pyarrow Dataset/Scanner for DuckDB compatibility.
  3. Data is delivered via the Arrow C Data Interface.

Motivation

Why do we need a new format for data science and machine learning?

1. Reproducibility is a must-have

Versioning and experimentation support should be built into the dataset instead of requiring multiple tools.
It should also be efficient and not require expensive copying every time you want to create a new version.
We call this "Zero copy versioning" in Lance. It makes versioning data easy without increasing storage costs.

2. Cloud storage is now the default

Remote object storage is now the default for data science and machine learning, and the performance characteristics of cloud storage are fundamentally different.
The Lance format is optimized to be cloud native. Common operations like filter-then-take can be an order of magnitude faster using Lance than Parquet, especially for ML data.

3. Vectors must be a first class citizen, not a separate thing

The majority of reasonable scale workflows should not require the added complexity and cost of a specialized database just to compute vector similarity. Lance integrates optimized vector indices into a columnar format so no additional infrastructure is required to get low latency top-K similarity search.

4. Open standards are a requirement

The DS/ML ecosystem is incredibly rich and data must be easily accessible across different languages, tools, and environments. Lance makes Apache Arrow integration its primary interface, which means converting to or from Lance takes 2 lines of code, your code does not need to change after conversion, and nothing is locked up to force you to pay for vendor compute. We need open-source, not fauxpen-source.



