Python wrapper for the Lance columnar format

Python bindings for Lance Data Format

Warning: under heavy development

Lance is a new columnar data format for data science and machine learning.

Why you should use Lance

  1. Is an order of magnitude faster than Parquet for point queries and for the nested data structures common in DS/ML
  2. Comes with a fast vector index that delivers sub-millisecond nearest-neighbor search
  3. Is automatically versioned and supports lineage and time travel for full reproducibility
  4. Already integrates with DuckDB, pandas, and Polars, and converts from/to Parquet in 2 lines of code

Quick start

Installation

pip install pylance

Make sure you have recent versions of pandas (1.5+), pyarrow (10.0+), and DuckDB (0.7.0+).

Converting to Lance

import lance

import pandas as pd
import pyarrow as pa
import pyarrow.dataset

df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')

parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")

Reading Lance data

dataset = lance.dataset("/tmp/test.lance")
assert isinstance(dataset, pa.dataset.Dataset)

Pandas

df = dataset.to_table().to_pandas()

DuckDB

import duckdb

# If this segfaults, make sure you have duckdb v0.7+ installed
duckdb.query("SELECT * FROM dataset LIMIT 10").to_df()

Vector search

Download the sift1m subset

wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz

Convert it to Lance

import lance
from lance.vector import vec_to_table
import numpy as np

nvecs = 1000000
ndims = 128
# Each .fvecs record is a 4-byte int32 dimension header followed by ndims float32 values,
# so read ndims + 1 four-byte columns per row and drop the header column.
raw = np.fromfile("sift/sift_base.fvecs", dtype="<f4").reshape((nvecs, ndims + 1))
data = raw[:, 1:]
# Map row id -> vector
dd = dict(zip(range(nvecs), data))

table = vec_to_table(dd)
uri = "vec_data.lance"
sift1m = lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)
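
For readers unfamiliar with the .fvecs layout, here is a minimal standard-library sketch of how such a file can be parsed. It assumes the usual TEXMEX convention: each record is a little-endian int32 dimension followed by that many float32 values. The helper name `read_fvecs` and the file name `tiny.fvecs` are made up for illustration and are not part of Lance.

```python
import os
import struct
import tempfile


def read_fvecs(path):
    """Parse an .fvecs file into a list of float lists (hypothetical helper)."""
    vectors = []
    with open(path, "rb") as fobj:
        while True:
            header = fobj.read(4)
            if not header:
                break  # clean end of file
            (dim,) = struct.unpack("<i", header)  # per-vector dimension header
            payload = fobj.read(4 * dim)
            if len(payload) != 4 * dim:
                raise ValueError("truncated .fvecs record")
            vectors.append(list(struct.unpack(f"<{dim}f", payload)))
    return vectors


# Write a tiny two-vector file in the same layout, then read it back.
sample = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.fvecs")
    with open(path, "wb") as fobj:
        for vec in sample:
            fobj.write(struct.pack("<i", len(vec)))
            fobj.write(struct.pack(f"<{len(vec)}f", *vec))
    parsed = read_fvecs(path)
```

The point to notice is the 4-byte header in front of every vector: a reader that treats the file as one contiguous float array will be off by one value per row.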

Build the index

sift1m.create_index("vector",
                    index_type="IVF_PQ", 
                    num_partitions=256,  # IVF
                    num_sub_vectors=16)  # PQ
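
As a rough guide to what these two parameters imply for the SIFT1M data, the arithmetic can be sketched as follows. The one-byte-per-code figure is an assumption (a common product quantization setting), not something this README states about Lance's index.

```python
nvecs = 1_000_000
ndims = 128
num_partitions = 256   # IVF: number of coarse partitions
num_sub_vectors = 16   # PQ: number of sub-vectors per vector

# IVF: on average, each partition holds this many vectors.
avg_partition_size = nvecs / num_partitions

# PQ: the vector is split into equal-width slices, so ndims must divide evenly.
assert ndims % num_sub_vectors == 0
dims_per_sub_vector = ndims // num_sub_vectors

# Storage per vector: raw float32 values versus one code per sub-vector.
BYTES_PER_CODE = 1  # assumption: 8-bit PQ codes
raw_bytes = ndims * 4
pq_bytes = num_sub_vectors * BYTES_PER_CODE
compression_ratio = raw_bytes / pq_bytes
```

Under these assumptions each partition holds about 3,900 vectors, each sub-vector covers 8 dimensions, and a 512-byte vector compresses to 16 bytes.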

Search the dataset

# Get top 10 similar vectors
import duckdb

dataset = lance.dataset(uri)

# Sample 100 query vectors. If this segfaults, make sure you have duckdb v0.7+ installed
sample = duckdb.query("SELECT vector FROM dataset USING SAMPLE 100").to_df()
query_vectors = np.array([np.array(x) for x in sample.vector])

# Get nearest neighbors for all of them
rs = [dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})      
      for q in query_vectors]
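
Because an IVF_PQ index returns approximate neighbors, it can be useful to compare its results against an exact baseline on a small sample. The following is a minimal pure-Python sketch of that idea. It assumes squared L2 distance, and the function names are hypothetical rather than part of the Lance API.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_top_k(vectors, query, k):
    """Return (row_index, distance) pairs for the k closest vectors, nearest first."""
    scored = ((squared_l2(v, query), i) for i, v in enumerate(vectors))
    return [(i, d) for d, i in heapq.nsmallest(k, scored)]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate search also found."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
neighbors = exact_top_k(vectors, [0.9, 0.1], k=2)
exact_ids = [i for i, _ in neighbors]
```

A brute-force scan like this is only practical for small samples, but it gives a ground truth against which the recall of indexed results can be measured.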

*More distance metrics, HNSW, and distributed support are on the roadmap.

Python package details

Install from PyPI: pip install pylance  # >=0.3.0 is the new Rust-based implementation
Install from source: maturin develop (under the /python directory)
Run unit tests: make test
Run integration tests: make integtest

Import via: import lance

The python integration is done via pyo3 + custom python code:

  1. We make wrapper classes in Rust for Dataset/Scanner/RecordBatchReader that are exposed to Python.
  2. These are then used by the LanceDataset / LanceScanner implementations, which extend pyarrow Dataset/Scanner for DuckDB compatibility.
  3. Data is delivered via the Arrow C Data Interface.

Motivation

Why do we need a new format for data science and machine learning?

1. Reproducibility is a must-have

Versioning and experimentation support should be built into the dataset instead of requiring multiple tools.
It should also be efficient and not require expensive copying every time you want to create a new version.
We call this "Zero copy versioning" in Lance. It makes versioning data easy without increasing storage costs.
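
One way to picture how versioning can avoid copying data is a chain of small manifests that each list immutable data files. The sketch below is a simplified conceptual model only. It is not Lance's actual on-disk layout, and the names `Manifest`, `append`, and `checkout` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    """One dataset version: a version number plus the data files it references."""
    version: int
    fragments: tuple  # names of immutable data files (illustrative only)


def append(history, new_fragment):
    """Create a new version that reuses every existing fragment and adds one more."""
    latest = history[-1]
    history.append(Manifest(latest.version + 1, latest.fragments + (new_fragment,)))


def checkout(history, version):
    """Look up the manifest for a given version (time travel)."""
    return next(m for m in history if m.version == version)


history = [Manifest(1, ("frag-0",))]
append(history, "frag-1")
append(history, "frag-2")

# Three versions exist, but only three data files are stored in total.
stored_files = {name for m in history for name in m.fragments}
```

In this model, creating a version costs one small manifest rather than a copy of the data, and any earlier version stays readable because its files are never modified.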

2. Cloud storage is now the default

Remote object storage is now the default for data science and machine learning, and the performance characteristics of cloud storage are fundamentally different from those of local disks.
The Lance format is optimized to be cloud native. Common operations like filter-then-take can be an order of magnitude faster with Lance than with Parquet, especially for ML data.

3. Vectors must be a first class citizen, not a separate thing

The majority of reasonable scale workflows should not require the added complexity and cost of a specialized database just to compute vector similarity. Lance integrates optimized vector indices into a columnar format so no additional infrastructure is required to get low latency top-K similarity search.

4. Open standards are a requirement

The DS/ML ecosystem is incredibly rich, and data must be easily accessible across different languages, tools, and environments. Lance makes Apache Arrow integration its primary interface, which means converting to or from Lance takes 2 lines of code, your code does not need to change after conversion, and nothing is locked up to force you to pay for vendor compute. We need open source, not fauxpen source.

Download files

Download the file for your platform.

Source Distributions

No source distribution files available for this release.

Built Distributions

  • pylance-0.9.7-cp38-abi3-win_amd64.whl (19.2 MB): CPython 3.8+, Windows x86-64
  • pylance-0.9.7-cp38-abi3-manylinux_2_24_aarch64.whl (17.4 MB): CPython 3.8+, manylinux (glibc 2.24+), ARM64
  • pylance-0.9.7-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.7 MB): CPython 3.8+, manylinux (glibc 2.17+), x86-64
  • pylance-0.9.7-cp38-abi3-macosx_11_0_arm64.whl (16.3 MB): CPython 3.8+, macOS 11.0+, ARM64
  • pylance-0.9.7-cp38-abi3-macosx_10_15_x86_64.whl (17.7 MB): CPython 3.8+, macOS 10.15+, x86-64
