
Efficient vector DB on large datasets from disk, using minimal memory.

DiskVectorIndex - Ultra-Low Memory Vector Search on Large Datasets

Indexing large datasets (100M+ embeddings) requires a lot of memory in most vector databases: for 100M documents/embeddings, most vector databases require about 500GB of memory, which drives up your server costs accordingly.

This repository offers methods to search very large datasets (100M+ embeddings) with about 300MB of memory, making semantic search on such datasets feasible for Memory-Poor developers.

We provide various pre-built indices that can be used for semantic search and to power your RAG applications.

Pre-Built Indices

Below you will find the available pre-built indices. The embeddings are downloaded on the first call; the download size is listed under Index Size. Most of the embeddings are memory-mapped from disk: for the Cohere/trec-rag-2024-index corpus, for example, you need 15 GB of disk space but just 380 MB of memory to load the index.
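To illustrate why memory mapping keeps the footprint small, here is a minimal sketch using only the Python standard library. The file layout (fixed-size 128-byte records), the file name, and the helper names write_toy_codes and read_code are assumptions made for this example; they are not the package's actual on-disk format or API.

```python
import mmap
import os
import tempfile

CODE_SIZE = 128  # assumed bytes per compressed embedding (hypothetical layout)


def write_toy_codes(path, num_vectors):
    """Write num_vectors fixed-size records; record i is filled with byte i % 256."""
    with open(path, "wb") as f:
        for i in range(num_vectors):
            f.write(bytes([i % 256]) * CODE_SIZE)


def read_code(path, idx):
    """Return record idx by memory-mapping the file instead of reading it all."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = idx * CODE_SIZE
            # Only the pages backing this slice need to be resident in memory.
            return mm[start:start + CODE_SIZE]


tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy_codes.bin")
write_toy_codes(toy_path, num_vectors=1000)
code = read_code(toy_path, idx=300)
print(len(code), code[0])
```

The operating system pages in only the parts of the file that are actually touched, so the resident memory can stay far below the file size.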

| Name | Description | #Docs | Index Size | Memory Needed |
|---|---|---|---|---|
| Cohere/trec-rag-2024-index | Segmented corpus for TREC RAG 2024 | 113,520,750 | 15GB | 380MB |
| fineweb-edu-10B-index (soon) | 10B token sample from fineweb-edu embedded and indexed on document level. | 9,267,429 | 1.4GB | 230MB |
| fineweb-edu-100B-index (soon) | 100B token sample from fineweb-edu embedded and indexed on document level. | 69,672,066 | 9.2GB | 380MB |
| fineweb-edu-350B-index (soon) | 350B token sample from fineweb-edu embedded and indexed on document level. | 160,198,578 | 21GB | 380MB |
| fineweb-edu-index (soon) | Full 1.3T token dataset fineweb-edu embedded and indexed on document level. | 324,322,256 | 42GB | 285MB |

Each index comes with its respective corpus, which is split into smaller chunks. These chunks are downloaded on demand and reused for later queries.

Getting Started

Get your free Cohere API key from cohere.com. You must set this API key as an environment variable:

export COHERE_API_KEY=your_api_key

Make sure you have wget installed so that files can be downloaded on the fly from the HuggingFace hub.
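If you want to verify both prerequisites before running a search, a small check along these lines may help. The helper name missing_prerequisites is hypothetical and not part of the DiskVectorIndex package; it only checks that the environment variable is set and that a wget executable can be found on your PATH.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready to go."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("COHERE_API_KEY"):
        problems.append("COHERE_API_KEY is not set")
    if which("wget") is None:
        problems.append("wget was not found on PATH")
    return problems


for problem in missing_prerequisites():
    print("Missing prerequisite:", problem)
```

The env and which parameters exist only so the check can be exercised without touching the real environment.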

Install the package:

pip install DiskVectorIndex

You can then search via:

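from DiskVectorIndex import DiskVectorIndex

# The index is downloaded on the first call and then reused.
index = DiskVectorIndex("Cohere/trec-rag-2024-index")

# Simple interactive loop; stop it with Ctrl+C.
while True:
    query = input("\n\nEnter a question: ")
    # Retrieve the 3 best-matching documents for the query.
    docs = index.search(query, top_k=3)
    for doc in docs:
        print(doc)
        print("=========")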

You can also load a fully downloaded index from disk via:

from DiskVectorIndex import DiskVectorIndex

index = DiskVectorIndex("path/to/index")

How does it work?

The Cohere embeddings have been optimized to work well in compressed vector space, as detailed in our Cohere int8 & binary Embeddings blog post. The embeddings have been trained to work not only in float32, which requires a lot of memory, but also with int8, binary and Product Quantization (PQ) compression.

The above indices use Product Quantization (PQ) to go from the original 1024*4=4096 bytes per embedding to just 128 bytes per embedding, reducing your memory requirement 32x.
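The arithmetic behind these numbers can be checked with a few lines of Python. The figures come from the text and the table above; the only assumption is that GB means 10^9 bytes.

```python
DIMS = 1024                 # embedding dimensions (from the text above)
FLOAT32_BYTES = 4           # bytes per float32 value
PQ_BYTES = 128              # bytes per PQ-compressed embedding (from the text above)
NUM_DOCS = 113_520_750      # Cohere/trec-rag-2024-index, from the table

raw_bytes_per_vec = DIMS * FLOAT32_BYTES
compression = raw_bytes_per_vec // PQ_BYTES
raw_total_gb = NUM_DOCS * raw_bytes_per_vec / 1e9  # assumes 1 GB = 10^9 bytes
pq_total_gb = NUM_DOCS * PQ_BYTES / 1e9

print(f"float32: {raw_bytes_per_vec} bytes/vector, {raw_total_gb:.1f} GB total")
print(f"PQ:      {PQ_BYTES} bytes/vector, {pq_total_gb:.1f} GB total")
print(f"compression: {compression}x")
```

For the TREC RAG 2024 corpus this gives roughly 465 GB for raw float32 vectors versus about 14.5 GB of PQ codes, which is in line with the 15GB index size listed in the table.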

Further, we use faiss with a memory-mapped IVF index: in this case, only a small fraction of the embeddings (between 32,768 and 131,072) must be loaded into memory.
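The idea behind an IVF (inverted file) index can be sketched in plain Python. This is a toy illustration and not the faiss implementation used by this package: the centroids are sampled at random instead of being trained with k-means, vectors are kept uncompressed in memory rather than PQ-encoded and memory-mapped, and the names build_ivf, ivf_search and nprobe-style probing are chosen for this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign every vector id to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists


def ivf_search(query, vectors, centroids, lists, nprobe=2, top_k=3):
    """Scan only the nprobe inverted lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in lists[c]]
    ranked = sorted(candidates, key=lambda vid: sq_dist(query, vectors[vid]))
    return ranked[:top_k], len(candidates)


rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(2000)]
centroids = rng.sample(vectors, 16)  # toy stand-in for trained k-means centroids
lists = build_ivf(vectors, centroids)

hits, scanned = ivf_search(vectors[42], vectors, centroids, lists)
print(f"top hits: {hits}, scanned {scanned} of {len(vectors)} vectors")
```

Because a query only touches the few inverted lists nearest to it, the remaining lists can stay on disk, which is what makes a memory-mapped layout attractive.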

Need Semantic Search at Scale?

At Cohere, we have helped customers run Semantic Search on tens of billions of embeddings at a fraction of the cost. Feel free to reach out to Nils Reimers if you need a solution that scales.
