Efficient vector DB on large datasets from disk, using minimal memory.
DiskVectorIndex - Ultra-Low Memory Vector Search on Large Datasets
Indexing large datasets (100M+ embeddings) requires a lot of memory in most vector databases: for 100M documents/embeddings, most vector databases require about 500GB of memory, which drives up your server costs accordingly.
This repository offers methods to search very large datasets (100M+ embeddings) with just 300MB of memory, making semantic search on such large datasets suitable for Memory-Poor developers.
We provide various pre-built indices that can be used for semantic search and for powering your RAG applications.
Pre-Built Indices
Below you will find the available pre-built indices. The embeddings are downloaded on the first call; their size is listed under Index Size. Most of the embeddings are memory mapped from disk: for the Cohere/trec-rag-2024-index corpus, for example, you need 15 GB of disk but just 380 MB of memory to load the index.
Name | Description | #Docs | Index Size | Memory Needed |
---|---|---|---|---|
Cohere/trec-rag-2024-index | Segmented corpus for TREC RAG 2024 | 113,520,750 | 15GB | 380MB |
fineweb-edu-10B-index (soon) | 10B token sample from fineweb-edu embedded and indexed on document level. | 9,267,429 | 1.4GB | 230MB |
fineweb-edu-100B-index (soon) | 100B token sample from fineweb-edu embedded and indexed on document level. | 69,672,066 | 9.2GB | 380MB |
fineweb-edu-350B-index (soon) | 350B token sample from fineweb-edu embedded and indexed on document level. | 160,198,578 | 21GB | 380MB |
fineweb-edu-index (soon) | Full 1.3T token dataset fineweb-edu embedded and indexed on document level. | 324,322,256 | 42GB | 285MB |
Each index comes with its respective corpus, which is chunked into smaller parts. These chunks are downloaded on demand and reused for subsequent queries.
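The download-once-then-reuse behaviour can be pictured with a small cache helper. This is only an illustrative sketch, not the package's actual code: `get_chunk`, the `fetch` callable, and the chunk file name are all hypothetical.

```python
import os
import tempfile


def get_chunk(chunk_name, cache_dir, fetch):
    """Return the local path of a corpus chunk, fetching it only if missing.

    `fetch(chunk_name, destination)` is a hypothetical callable that writes
    the chunk to `destination`, for example by invoking wget.
    """
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, chunk_name)
    if not os.path.exists(local_path):
        # Download to a temporary name first so that an interrupted
        # download is never mistaken for a complete chunk.
        partial_path = local_path + ".part"
        fetch(chunk_name, partial_path)
        os.replace(partial_path, local_path)
    return local_path


# Demo with a fake fetcher that records how often it is called.
calls = []


def fake_fetch(name, destination):
    calls.append(name)
    with open(destination, "w") as f:
        f.write("chunk contents")


with tempfile.TemporaryDirectory() as cache:
    get_chunk("chunk_0.jsonl", cache, fake_fetch)
    get_chunk("chunk_0.jsonl", cache, fake_fetch)  # served from the cache

print(len(calls))  # the fake fetcher ran only once
```

Passing the fetcher in as an argument keeps the caching logic testable without network access.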
Getting Started
Get your free Cohere API key from cohere.com. You must set this API key as an environment variable:
export COHERE_API_KEY=your_api_key
Make sure you have wget installed so that files can be downloaded on-the-fly from the Hugging Face Hub.
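Both prerequisites can be checked up front before the first search. The following is a minimal sketch; the helper name `missing_prerequisites` is hypothetical and not part of the DiskVectorIndex package.

```python
import os
import shutil


def missing_prerequisites(env=None):
    """Return a list of human-readable problems; empty if everything is set.

    Hypothetical helper, not part of the DiskVectorIndex package.
    """
    env = os.environ if env is None else env
    problems = []
    # The API key is expected in the COHERE_API_KEY environment variable.
    if not env.get("COHERE_API_KEY"):
        problems.append("COHERE_API_KEY is not set")
    # wget must be discoverable on PATH for on-the-fly downloads.
    if shutil.which("wget") is None:
        problems.append("wget was not found on PATH")
    return problems


for problem in missing_prerequisites():
    print("Missing prerequisite:", problem)
```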
Install the package:
pip install DiskVectorIndex
You can then search via:
from DiskVectorIndex import DiskVectorIndex

index = DiskVectorIndex("Cohere/trec-rag-2024-index")

while True:
    query = input("\n\nEnter a question: ")
    docs = index.search(query, top_k=3)
    for doc in docs:
        print(doc)
        print("=========")
You can also load a fully downloaded index from disk via:
from DiskVectorIndex import DiskVectorIndex
index = DiskVectorIndex("path/to/index")
How does it work?
The Cohere embeddings have been optimized to work well in compressed vector space, as detailed in our Cohere int8 & binary Embeddings blog post. The embeddings have been trained not only to work in float32, which requires a lot of memory, but also to operate well with int8, binary, and Product Quantization (PQ) compression.
The above indices use Product Quantization (PQ) to go from the original 1024*4=4096 bytes per embedding to just 128 bytes per embedding, reducing your memory requirement 32x.
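The arithmetic can be checked against the Cohere/trec-rag-2024-index row of the table. This is a back-of-the-envelope sketch that counts only the embedding payload, in decimal gigabytes and without any index overhead:

```python
NUM_DOCS = 113_520_750   # #Docs for Cohere/trec-rag-2024-index
DIMENSIONS = 1024        # embedding dimensions
FLOAT32_BYTES = 4        # bytes per float32 value
PQ_BYTES = 128           # bytes per embedding after Product Quantization

float32_bytes_per_embedding = DIMENSIONS * FLOAT32_BYTES
compression = float32_bytes_per_embedding // PQ_BYTES

float32_total_gb = NUM_DOCS * float32_bytes_per_embedding / 1e9
pq_total_gb = NUM_DOCS * PQ_BYTES / 1e9

print(f"float32: {float32_bytes_per_embedding} bytes/embedding, "
      f"{float32_total_gb:.0f} GB total")
print(f"PQ:      {PQ_BYTES} bytes/embedding, "
      f"{pq_total_gb:.1f} GB total ({compression}x smaller)")
```

The compressed payload of roughly 14.5 GB is in line with the 15GB index size listed in the table, while the uncompressed payload of roughly 465 GB is close to the 500GB figure quoted for conventional vector databases.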
Further, we use faiss with a memory mapped IVF: in this case, only a small fraction of the embeddings (between 32,768 and 131,072) must be loaded in memory.
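The idea behind a memory mapped IVF (inverted file) index can be illustrated with a toy example in plain Python. It is a sketch of the general technique, not of the faiss implementation: all names are hypothetical, the centroids are a random sample rather than trained with k-means, and the lists hold raw float32 vectors instead of PQ codes. Only the centroids and list offsets are kept in memory; the vectors stay in a file that is memory mapped and read only for the lists being probed.

```python
import mmap
import os
import random
import struct
import tempfile

DIM = 4                               # toy dimensionality
RECORD = struct.Struct(f"<i{DIM}f")   # one on-disk record: doc id + vector


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build(vectors, centroids, path):
    """Write vectors to disk grouped by nearest centroid; return list offsets."""
    lists = [[] for _ in centroids]
    for doc_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: sq_dist(vec, centroids[c]))
        lists[nearest].append((doc_id, vec))
    offsets = []  # (start byte, number of records) per inverted list
    with open(path, "wb") as f:
        for members in lists:
            offsets.append((f.tell(), len(members)))
            for doc_id, vec in members:
                f.write(RECORD.pack(doc_id, *vec))
    return offsets


def search(query, centroids, offsets, mm, top_k=3, nprobe=2):
    """Scan only the nprobe inverted lists whose centroids are closest."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    hits = []
    for c in probe:
        start, count = offsets[c]
        for i in range(count):
            doc_id, *vec = RECORD.unpack_from(mm, start + i * RECORD.size)
            hits.append((sq_dist(query, vec), doc_id))
    return [doc_id for _, doc_id in sorted(hits)[:top_k]]


random.seed(0)
# Integer-valued coordinates survive the float32 round trip exactly.
vectors = [[float(random.randint(-50, 50)) for _ in range(DIM)]
           for _ in range(200)]
centroids = random.sample(vectors, 8)  # stand-in for trained centroids

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lists.bin")
    offsets = build(vectors, centroids, path)
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        result = search(vectors[0], centroids, offsets, mm)

print(result)  # document 0 is its own nearest neighbour, so it comes first
```

Raising `nprobe` scans more lists, trading query speed for recall; memory use stays bounded by the centroids and offsets because the operating system pages in only the parts of the file that are touched.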
Need Semantic Search at Scale?
At Cohere, we have helped customers run Semantic Search on tens of billions of embeddings at a fraction of the cost. Feel free to reach out to Nils Reimers if you need a solution that scales.