Project description

Self-hosted semantic search and QA

Have a conversation with your own private documents. Want to know what your doctor said at your last exam? What about the name of that cute person with the mischievous smile that you saw at San Diego Tech Coffee? You probably have that info in a text file on your laptop somewhere, but you probably never used the word "mischievous" in the notes you jotted down in a rush. You may forget those details unless you have a tool like knowt to help you resurface them. All you need to do is put your text notes into the "data/corpus" directory and knowt will take care of the rest.

Under the hood, knowt implements RAG (Retrieval-Augmented Generation). Knowt first processes your private text files to create a searchable index of each passage of text you provide. This lets it perform "semantic search" on the indexed data blazingly fast, without using approximations. See the project [final report](docs/Information Retrieval Systems.pdf) for more details. Indexing 10k documents should take less than a minute, and adding new documents takes seconds. Answers to your questions take milliseconds. Even a general question about some fact on Wikipedia would take less than a second (though indexing those 10M text strings took 3-4 hours on my two-year-old laptop).
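To make "semantic search without approximations" concrete, here is a minimal sketch of exact nearest-neighbor search: the query vector is scored against every passage vector by cosine similarity, with no approximate index in between. This is not knowt's actual code. The function names and the tiny 3-dimensional vectors are made up for illustration; a real embedding model such as paraphrase-MiniLM-L6-v2 produces 384-dimensional vectors.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def exact_search(query_vec, passage_vecs, top_k=2):
    """Score the query against every passage (no approximation) and return the best matches."""
    q = normalize(query_vec)
    scores = []
    for i, vec in enumerate(passage_vecs):
        p = normalize(vec)
        scores.append((sum(a * b for a, b in zip(q, p)), i))
    scores.sort(reverse=True)
    return [(i, score) for score, i in scores[:top_k]]

# Hypothetical passages and hand-made "embeddings" (assumptions for this sketch).
passages = [
    "Doctor said to cut back on salt.",
    "Met someone with a playful grin at the coffee meetup.",
    "Renew car registration in May.",
]
passage_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.2],
    [0.0, 0.2, 0.9],
]
query_vec = [0.2, 0.8, 0.1]  # stand-in for the embedding of "mischievous smile"

results = exact_search(query_vec, passage_vecs, top_k=2)
best_index = results[0][0]
print(passages[best_index])
```

Because every passage is scored, the top result is guaranteed to be the true nearest neighbor under cosine similarity. The cost grows linearly with the number of passages.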

Installation

Python virtual environment

To set up the project environment, follow these steps:

Clone the project repository or download the project files to your local machine.
Navigate to the project directory.
Create a Python virtual environment in the project directory:

pip install virtualenv
python -m virtualenv .venv

Activate the virtual environment (macOS/Linux):

source .venv/bin/activate

Install dependencies

Now that you have a virtual environment, you're ready to install some Python packages and download language models (spaCy and BERT).

Install knowt and its required packages in editable mode:

pip install -e .

Download the small BERT embedding model (you can use whichever open source model you like):

python -c 'from sentence_transformers import SentenceTransformer; sbert = SentenceTransformer("paraphrase-MiniLM-L6-v2")'

Quick start

You can search an example corpus of nutrition and health documents by running the search_engine.py script.

Search your personal docs

Replace the text files in data/corpus with your own.
Start the command-line search engine with:

python search_engine.py --refresh

The --refresh flag ensures that a fresh index is created based on your documents. Otherwise it may ignore the data/corpus directory and reuse an existing index and corpus in the data/cache directory.
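The cache-versus-refresh behavior described above can be sketched as follows. This is a hypothetical illustration, not knowt's implementation: `build_index`, `load_or_build_index`, and the JSON cache file are assumptions, and the demo uses a temporary directory instead of the real data/corpus and data/cache directories.

```python
import json
import tempfile
from pathlib import Path

def build_index(corpus_dir):
    """Hypothetical stand-in for indexing: map each file name to its text."""
    return {p.name: p.read_text(encoding="utf-8")
            for p in sorted(Path(corpus_dir).glob("*.txt"))}

def load_or_build_index(corpus_dir, cache_path, refresh=False):
    """Reuse the cached index unless it is missing or a refresh is requested."""
    cache_path = Path(cache_path)
    if cache_path.exists() and not refresh:
        return json.loads(cache_path.read_text(encoding="utf-8"))
    index = build_index(corpus_dir)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(index), encoding="utf-8")
    return index

# Demonstrate with a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus"
    corpus.mkdir()
    cache = Path(tmp) / "cache" / "index.json"
    (corpus / "a.txt").write_text("first note", encoding="utf-8")
    first = load_or_build_index(corpus, cache)                # builds and caches
    (corpus / "b.txt").write_text("second note", encoding="utf-8")
    stale = load_or_build_index(corpus, cache)                # cache hit: b.txt is ignored
    fresh = load_or_build_index(corpus, cache, refresh=True)  # rebuilt from the corpus
    print(sorted(stale), sorted(fresh))
```

The second call shows why a new document can go unnoticed without a refresh: the cached index is returned before the corpus directory is ever read.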

The search_engine.py script will first segment the text files into sentences. Then it will create a "reverse index" (often called an inverted index) by counting up words and character patterns in your documents. It will also create semantic embeddings so that you can ask questions about vague concepts without knowing any of the words you used in your documents.
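The first two steps can be sketched in a few lines of standard-library Python. This is a simplified illustration under stated assumptions, not the code in search_engine.py: the regex sentence splitter, the choice of character trigrams, and all function names here are hypothetical.

```python
import re
from collections import Counter, defaultdict

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokens(sentence, n=3):
    """Yield lowercase words plus the character n-grams of each word."""
    for word in re.findall(r"[a-z0-9]+", sentence.lower()):
        yield word
        for i in range(len(word) - n + 1):
            yield "#" + word[i:i + n]  # "#" marks a character n-gram

def build_inverted_index(sentences):
    """Map each token to a Counter of {sentence_id: count}."""
    index = defaultdict(Counter)
    for sid, sentence in enumerate(sentences):
        for tok in tokens(sentence):
            index[tok][sid] += 1
    return index

def search(index, query, top_k=1):
    """Rank sentences by how many query tokens (words and n-grams) they share."""
    scores = Counter()
    for tok in tokens(query):
        for sid, count in index.get(tok, {}).items():
            scores[sid] += count
    return [sid for sid, _ in scores.most_common(top_k)]

text = "Blood pressure was normal. Eat more leafy greens! Next exam is in June."
sentences = split_sentences(text)
index = build_inverted_index(sentences)
hits = search(index, "examination")
print(sentences[hits[0]])
```

The query "examination" never appears as a word in the text, but it shares the character patterns "exa" and "xam" with "exam", which is why counting character patterns alongside whole words helps. Matching truly vague concepts still requires the semantic embeddings.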

Contributing

Contributions to this project are welcome!

License

This project is licensed under the MIT License.

Release history

0.1.5: Mar 17, 2024
0.1.4: Mar 16, 2024
0.1.3: Mar 16, 2024
0.1.1: Feb 26, 2024 (this version)
0.1.0: Feb 17, 2024

Download files

Source Distribution

knowt-0.1.1.tar.gz (17.4 kB)

Uploaded Feb 26, 2024 Source

Built Distribution

knowt-0.1.1-py3-none-any.whl (18.7 kB)

Uploaded Feb 26, 2024 Python 3

Hashes for knowt-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`3a3d672d668212d98d247d554c78a043ec03e19aed41dbc4ccc618f227d6ed45`
MD5	`a1a628d58e0c237883caba1f7641cb84`
BLAKE2b-256	`01535527bcbb8426dabde655cb986ae916b4597c20dae9f201779ff6d47df3b9`

Hashes for knowt-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5ee6434710293c2df7927b6475a7c253366a4d29539ebdc9074c9dd8d73b6186`
MD5	`3ce5ed5995c02d06f5ea22f014408532`
BLAKE2b-256	`19c3af7916a0e0bd10d451366b8621c13583cec8c958e38ea8bf5e42c08d2ed1`