Efficient vector DB using binary & int8 embeddings

BinaryVectorDB - Efficient Search on Large Datasets

This repository contains a Binary Vector Database for efficient search on large datasets, intended for educational purposes.

Most embedding models represent their vectors as float32. These vectors consume a lot of memory, and searching over them is very slow. At Cohere, we introduced the first embedding model with native int8 and binary support, which gives you excellent search quality for a fraction of the cost:

| Model | Search Quality (MIRACL) | Time to Search 1M docs | Memory Needed for 250M Wikipedia Embeddings | Price on AWS (x2gd instance) |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 44.9 | 680 ms | 1431 GB | $65,231 / yr |
| OpenAI text-embedding-3-large | 54.9 | 1240 ms | 2861 GB | $130,463 / yr |
| Cohere Embed v3 (Multilingual) | | | | |
| Embed v3 - float32 | 66.3 | 460 ms | 954 GB | $43,488 / yr |
| Embed v3 - binary | 62.8 | 24 ms | 30 GB | $1,359 / yr |
| Embed v3 - binary + int8 rescore | 66.3 | 28 ms | 30 GB + 240 GB disk | $1,589 / yr |
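
The memory figures for Embed v3 can be approximated with simple arithmetic. The sketch below is only an estimate under two assumptions that the table does not state: that Embed v3 vectors have 1024 dimensions, and that "GB" means 2**30 bytes. It counts raw vector storage only and ignores any index overhead.

```python
def embedding_memory_gib(num_vectors: int, dims: int, bits_per_dim: int) -> float:
    """Raw storage needed for the vectors alone, in GiB (2**30 bytes)."""
    total_bytes = num_vectors * dims * bits_per_dim / 8
    return total_bytes / 2**30


NUM_DOCS = 250_000_000  # 250M Wikipedia embeddings, as in the table
DIMS = 1024             # assumption: Embed v3 dimensionality

float32_gib = embedding_memory_gib(NUM_DOCS, DIMS, bits_per_dim=32)
int8_gib = embedding_memory_gib(NUM_DOCS, DIMS, bits_per_dim=8)
binary_gib = embedding_memory_gib(NUM_DOCS, DIMS, bits_per_dim=1)

print(f"float32: {float32_gib:.1f} GiB")
print(f"int8:    {int8_gib:.1f} GiB")
print(f"binary:  {binary_gib:.1f} GiB")
```

Under these assumptions, binary embeddings need 32 times less space than float32 and int8 embeddings need 4 times less, which is in line with the rounded values in the table.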

Setup

The setup is easy:

pip install BinaryVectorDB

To use some of the examples below, you need a Cohere API key (free or paid) from https://cohere.com/

Usage - Load an Existing Binary Vector Database

We will cover how to build your own vector database later. To start, let us use a pre-built binary vector database. We host various pre-built databases on https://huggingface.co/Cohere/BinaryVectorDB. You can download these and use them locally.

Let us use the Simple English version of Wikipedia to get started:

wget https://huggingface.co/datasets/Cohere/BinaryVectorDB/resolve/main/wikipedia-2023-11-simple.zip

And then unzip this file:

unzip wikipedia-2023-11-simple.zip

Load the Vector Database

You can load the database easily by pointing it to the unzipped folder from the previous step:

from BinaryVectorDB import CohereBinaryVectorDB

# Point it to the unzipped folder from the previous step
# Ensure that you have set your Cohere API key via: export COHERE_API_KEY=<<YOUR_KEY>>
db = CohereBinaryVectorDB("wikipedia-2023-11-simple/")

query = "Who is the founder of Facebook"
print("Query:", query)
hits = db.search(query)
for hit in hits[0:3]:
    print(hit)

The database has 646,424 embeddings and a total size of XXX MB. However, only the XXX MB of binary embeddings are loaded into memory. The documents and their int8 embeddings are kept on disk and are only loaded when needed.

This split, with binary embeddings in memory and int8 embeddings and documents on disk, allows us to scale to very large datasets without needing large amounts of memory.
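
To make the two-stage idea concrete, here is a minimal sketch of binary search followed by int8 rescoring. It is not the code used by BinaryVectorDB: all function names are hypothetical, the data is random, everything lives in memory, and it uses plain Python so that it runs without dependencies. A real implementation would use a vectorized library and read the int8 vectors from disk.

```python
import math
import random


def to_binary(vec):
    """Keep only the sign of each dimension, packed into one Python int."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def to_int8(vec, scale):
    """Map floats in [-scale, scale] linearly onto integers in [-127, 127]."""
    return [max(-127, min(127, round(x / scale * 127))) for x in vec]


def hamming(a, b):
    """Number of bit positions in which two packed vectors differ."""
    return bin(a ^ b).count("1")


def search(query, doc_bits, doc_int8, top_k=3, oversample=10):
    # Stage 1: shortlist candidates by Hamming distance on the binary codes.
    q_bits = to_binary(query)
    n_candidates = min(top_k * oversample, len(doc_bits))
    by_distance = sorted(range(len(doc_bits)), key=lambda i: hamming(q_bits, doc_bits[i]))
    candidates = by_distance[:n_candidates]
    # Stage 2: rescore only the shortlist with float query x int8 document.
    scored = [(i, sum(q * d for q, d in zip(query, doc_int8[i]))) for i in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def random_unit_vector(dims):
    v = [random.gauss(0, 1) for _ in range(dims)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Toy data: 1,000 random unit vectors with 64 dimensions.
random.seed(0)
docs = [random_unit_vector(64) for _ in range(1000)]
scale = max(abs(x) for v in docs for x in v)
doc_bits = [to_binary(v) for v in docs]
doc_int8 = [to_int8(v, scale) for v in docs]

# The query is a slightly perturbed copy of document 42.
query = [x + random.gauss(0, 0.01) for x in docs[42]]
hits = search(query, doc_bits, doc_int8)
print(hits)
```

The `oversample` factor controls how many binary candidates are rescored. A larger value recovers more of the quality lost to binary quantization but requires more int8 reads.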

Build your own Binary Vector Database

It is quite easy to build your own Binary Vector Database.

from BinaryVectorDB import CohereBinaryVectorDB
import os
import gzip
import json

simplewiki_file = "simple-wikipedia-example.jsonl.gz"

# Download the example file if it does not exist yet
if not os.path.exists(simplewiki_file):
    cmd = "wget https://huggingface.co/datasets/Cohere/BinaryVectorDB/resolve/main/simple-wikipedia-example.jsonl.gz"
    os.system(cmd)

# Create the vector DB with an empty folder
# Ensure that you have set your Cohere API key via: export COHERE_API_KEY=<<YOUR_KEY>>
db_folder = "path_to_an_empty_folder/"
db = CohereBinaryVectorDB(db_folder)

if len(db) > 0:
    exit(f"The database {db_folder} is not empty. Please provide an empty folder to create a new database.")

# Read all docs from the jsonl.gz file
docs = []
with gzip.open(simplewiki_file) as fIn:
    for line in fIn:
        docs.append(json.loads(line))

#Limit it to 10k docs to make the next step a bit faster
docs = docs[0:10_000]

# Add all documents to the DB
# docs2text defines a function that maps our documents to a string
# This string is then embedded with the state-of-the-art Cohere embedding model
db.add_documents(docs, docs2text=lambda doc: doc['title']+" "+doc['text'])

A document can be any serializable Python object. You need to provide a function for docs2text that maps each document to a string. In the example above, we concatenate the title and text fields. This string is sent to the embedding model to produce the text embeddings.
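
If your documents are not uniform, a named function can be easier to maintain than a lambda. The helper below is a hypothetical example, not part of the library. It assumes documents are either plain strings or dicts with optional title and text fields.

```python
def doc_to_text(doc) -> str:
    """Map a document (a string, or a dict with optional 'title'/'text') to the string to embed."""
    if isinstance(doc, str):
        return doc
    title = doc.get("title", "").strip()
    text = doc.get("text", "").strip()
    # Join the two fields and drop the extra space if one of them is empty
    return f"{title} {text}".strip()


print(doc_to_text({"title": "Facebook", "text": "Facebook is a social network."}))
print(doc_to_text("A document that is already a string"))
```

Such a function can then be passed in place of the lambda, for example as docs2text=doc_to_text.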

Add more documents

It is easy to add more documents to an existing database:

from BinaryVectorDB import CohereBinaryVectorDB

db_folder = "path_to_an_empty_folder/"
db = CohereBinaryVectorDB(db_folder)

print(f"The DB has currently {len(db)} docs stored")

new_docs = [
    "BinaryVectorDB is an amazing example how binary & int8 embeddings allows scaling to large datasets",
    "To learn more about BinaryVectorDB visit cohere.com"
]

db.add_documents(new_docs, docs2text=lambda doc: doc)

print(f"The DB has currently {len(db)} docs stored")

Updating & Deleting Documents

Is this a real Vector Database?

Not really. This repository is meant mostly for educational purposes, to show techniques for scaling to large datasets.

If you actually want to go into production, use a proper vector database such as Vespa.ai, which allows you to achieve similar results.
