Private, personalized searchable knowledge base, from your own notes.
Text Search Engine
This project implements a simple text search engine using Python. It processes text files to create a searchable index of sentences, allowing users to perform semantic searches against this indexed data. See the project [final report](docs/Information Retrieval Systems.pdf) for more details.
Installation
Python virtual environment
To set up the project environment, follow these steps:
- Clone the project repository or download the project files to your local machine.
- Navigate to the project directory.
- Create a Python virtual environment in the project directory:
pip install virtualenv
python -m virtualenv .venv
- Activate the virtual environment (mac/linux):
source .venv/bin/activate
Install dependencies
Now that you have a virtual environment, you're ready to install the required Python packages and download the language models (spaCy and BERT).
- Install the required packages using the requirements.txt file:
pip install -r requirements.txt
- Download the small spaCy language model (for sentence segmentation):
python -m spacy download en_core_web_sm
- Download the small BERT embedding model:
python -c 'from sentence_transformers import SentenceTransformer; sbert = SentenceTransformer("paraphrase-MiniLM-L6-v2")'
Quick start
You can search an example corpus of nutrition and health documents by running the search_engine.py script.
Search your personal docs
- Replace the text files in data/corpus with your own.
- Start the command-line search engine with:
python search_engine.py --refresh
The --refresh flag ensures that a fresh index is created from your documents.
Without it, the script may ignore the data/corpus directory and reuse an existing index and corpus from the data/cache directory.
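The cache-or-rebuild behavior described above can be pictured with a minimal sketch. This is an illustration only, not the actual code in search_engine.py: the function names `load_or_build_index` and `parse_args`, the JSON cache file, and the toy `build` function are all assumptions made for the example. Only the `--refresh` flag itself comes from this README.

```python
import argparse
import json
import tempfile
from pathlib import Path


def load_or_build_index(corpus_dir, cache_file, refresh, build_index):
    """Reuse the cached index unless a refresh is requested or no cache exists."""
    if cache_file.exists() and not refresh:
        return json.loads(cache_file.read_text()), "cache"
    index = build_index(corpus_dir)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(index))
    return index, "rebuilt"


def parse_args(argv=None):
    """Hypothetical CLI with a single --refresh switch."""
    parser = argparse.ArgumentParser(description="Toy search CLI (illustrative only)")
    parser.add_argument("--refresh", action="store_true",
                        help="rebuild the index from the corpus")
    return parser.parse_args(argv)


def build(corpus_dir):
    """Stand-in for real indexing: just record which text files exist."""
    return {"files": sorted(p.name for p in Path(corpus_dir).glob("*.txt"))}


with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus"
    corpus.mkdir()
    (corpus / "a.txt").write_text("hello")
    cache = Path(tmp) / "cache" / "index.json"

    # First run: no cache yet, so the index is built.
    first = load_or_build_index(corpus, cache, parse_args([]).refresh, build)

    # A new document is added, but without --refresh the stale cache is reused.
    (corpus / "b.txt").write_text("world")
    stale = load_or_build_index(corpus, cache, parse_args([]).refresh, build)

    # With --refresh the new document is picked up.
    fresh = load_or_build_index(corpus, cache, parse_args(["--refresh"]).refresh, build)
```

In this sketch the second call misses b.txt because the cache wins, which is the situation the --refresh flag is meant to avoid after you swap in your own documents.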
The search_engine.py script will first segment the text files into sentences.
Then it will create an inverted index to provide context for any retrieved information.
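As a rough illustration of the indexing idea, the sketch below maps each token to the file names and line numbers where it occurs, so a hit can be traced back to its source. It is a simplified stand-in, not the project's implementation: the helper names (`tokenize`, `build_inverted_index`, `lookup`), the regex tokenizer, and the sample documents are assumptions for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each token to the set of (filename, line_number) pairs containing it.

    `docs` maps a filename to its list of lines; line numbers are 1-based.
    """
    index = defaultdict(set)
    for filename, lines in docs.items():
        for line_number, line in enumerate(lines, start=1):
            for token in tokenize(line):
                index[token].add((filename, line_number))
    return index


def lookup(index, query):
    """Return the locations that contain every query token, in sorted order."""
    tokens = tokenize(query)
    if not tokens:
        return []
    postings = [index.get(token, set()) for token in tokens]
    return sorted(set.intersection(*postings))


# Made-up sample corpus for the example.
docs = {
    "vitamins.txt": [
        "Vitamin C supports the immune system.",
        "Vitamin D aids calcium absorption.",
    ],
    "minerals.txt": ["Calcium strengthens bones."],
}
index = build_inverted_index(docs)
hits = lookup(index, "calcium")
```

Here `hits` holds one location in each file, and a two-word query such as "vitamin calcium" narrows the result to the single line that contains both words.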
It will also create embedding vectors and locality-sensitive hashes for experimenting with vector databases and RAG (retrieval-augmented generation).
Finally, it will let you run search requests, returning the top matching sentences along with their filenames and line numbers.
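To give a feel for locality-sensitive hashing over embedding vectors, here is a minimal sketch of one common scheme, random-hyperplane hashing, in plain Python. This README does not say which LSH scheme the project uses, so the technique choice, the function names, the 16-bit signature size, and the tiny 4-dimensional vectors are all assumptions; real sentence embeddings have hundreds of dimensions.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw one random Gaussian normal vector per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    bits = []
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)


def hamming(a, b):
    """Count the positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


planes = make_hyperplanes(num_bits=16, dim=4)

v = [0.2, -0.5, 0.9, 0.1]       # stand-in for a sentence embedding
scaled = [2.0 * x for x in v]   # same direction, different length
opposite = [-x for x in v]      # opposite direction

sig_v = lsh_signature(v, planes)
sig_scaled = lsh_signature(scaled, planes)
sig_opposite = lsh_signature(opposite, planes)
```

Vectors pointing in the same direction share a signature, and vectors pointing in opposite directions disagree on every bit, so the Hamming distance between signatures acts as a cheap proxy for the angle between embeddings.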
Contributing
Contributions to this project are welcome.
License
This project is licensed under the MIT License.