
ColBERT Live!

ColBERT Live! implements efficient ColBERT search on top of vector indexes that support live updates (without rebuilding the entire index).

Background

ColBERT (Contextualized Late Interaction over BERT) is a state-of-the-art semantic search model that combines the effectiveness of BERT-based language models with the performance required for practical, large-scale search applications.
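The "late interaction" at the heart of ColBERT can be summarized as MaxSim scoring: each query token is compared against every document token, and the best match per query token is summed. Below is a minimal, self-contained sketch of that idea (the function name and shapes are ours for illustration, not part of any library API); it assumes per-token embeddings are already L2-normalized, so dot products are cosine similarities.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> float:
    """MaxSim late-interaction score.

    query_emb: (num_query_tokens, dim), doc_emb: (num_doc_tokens, dim),
    both rows L2-normalized. For each query token, take its best (max)
    cosine similarity over all document tokens, then sum those maxima.
    """
    sim = query_emb @ doc_emb.T  # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum().item()

# Toy example: 2 query tokens vs. 3 document tokens in 4 dimensions.
q = torch.nn.functional.normalize(torch.randn(2, 4), dim=1)
d = torch.nn.functional.normalize(torch.randn(3, 4), dim=1)
print(maxsim_score(q, d))
```

Because every query token independently finds its best-matching document token, rare terms and short queries are scored on their strongest evidence rather than being averaged away into a single passage vector.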

Compared to traditional dense passage retrieval (i.e., vector-per-passage), ColBERT is particularly strong at handling unusual terms and short queries.

It's reasonable to think of ColBERT as combining the best of semantic vector search with traditional keyword search à la BM25, but without having to tune the weighting of hybrid search or deal with corner cases where the vector and keyword sides play poorly together.

However, the initial ColBERT implementation is designed around a custom index that cannot be updated incrementally. This means that adding, modifying, or removing documents from the search system requires reindexing the entire collection, which can be prohibitively slow for large datasets.

ColBERT Live!

ColBERT Live! implements ColBERT on any vector index. This means you can add, modify, or remove documents from your search system without costly reindexing of the entire collection, making it ideal for dynamic content environments. It also means you can easily apply other predicates from your database, such as access controls or metadata filters, to your vector searches.

ColBERT Live! features

  • Efficient ColBERT search implementation
  • Support for live updates to the vector index
  • Abstraction layer for database backends, starting with AstraDB
  • State-of-the-art ColBERT techniques, including:
    • Answer.AI ColBERT model for higher relevance
    • Document embedding pooling for reduced storage requirements
    • Query embedding pooling for improved search performance
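To illustrate how pooling cuts storage and search cost: multiple per-token embeddings are merged into fewer vectors. The sketch below uses the simplest possible strategy, averaging groups of consecutive token embeddings and re-normalizing; it is a hedged illustration of the idea, not ColBERT Live!'s actual pooling algorithm, and `pool_embeddings` is a name we made up.

```python
import torch

def pool_embeddings(embeddings: torch.Tensor, pool_factor: int = 2) -> torch.Tensor:
    """Reduce (num_tokens, dim) to (ceil(num_tokens / pool_factor), dim).

    Groups of `pool_factor` consecutive token embeddings are averaged,
    then re-normalized so downstream cosine scoring still applies.
    """
    chunks = embeddings.split(pool_factor)  # tuples of consecutive rows
    pooled = torch.stack([c.mean(dim=0) for c in chunks])
    return torch.nn.functional.normalize(pooled, dim=1)

emb = torch.nn.functional.normalize(torch.randn(8, 4), dim=1)
print(pool_embeddings(emb).shape)  # → torch.Size([4, 4]): half as many vectors to store
```

A pool factor of 2 roughly halves the per-document embedding storage, at the cost of some scoring fidelity; the same trick applied to query embeddings reduces the number of ANN searches per query.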

Installation

You can install ColBERT Live! using pip:

pip install colbert-live

Usage

  • Subclass your database backend and implement the required methods for retrieving embeddings
  • Initialize ColbertLive(db)
  • Call ColbertLive.search(query_str, top_k)

Here's the code from the cmdline example, which implements adding and searching multi-chunk documents from the command line.

class CmdlineDB(AstraDB):
    def prepare(self, embedding_dim: int):
        self.query_ann_stmt = ...
        self.query_chunks_stmt = ...

    def process_ann_rows(self, result: ResultSet) -> list[tuple[Any, float]]:
        ...

    def process_chunk_rows(self, result: ResultSet) -> list[torch.Tensor]:
        ...

def add_document(db, colbert_live, title, chunks):
    doc_id = db.add_document(title, chunks)
    chunk_embeddings = colbert_live.encode_chunks(chunks)
    db.add_embeddings(doc_id, chunk_embeddings)
    print(f"Document added with ID: {doc_id}")


def search_documents(db, colbert_live, query, k=5):
    results = colbert_live.search(query, k=k)
    print("\nSearch results:")
    for i, (chunk_pk, score) in enumerate(results, 1):
        doc_id, chunk_id = chunk_pk
        rows = db.session.execute(f"SELECT title FROM {db.keyspace}.documents WHERE id = %s", [doc_id])
        title = rows.one().title
        print(f"{i}. {title} (Score: {score:.4f})")


def main():
    args = ... # arg parsing skipped, see cmdline/main.py for details

    db = CmdlineDB('colbertlive',
                   'answerdotai/answerai-colbert-small-v1',
                   os.environ.get('ASTRA_DB_ID'),
                   os.environ.get('ASTRA_DB_TOKEN'))
    colbert_live = ColbertLive(db)

    if args.command == "add":
        add_document(db, colbert_live, args.title, args.chunks)
    elif args.command == "search":
        search_documents(db, colbert_live, args.query, args.k)

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

