Building a Smart Semantic Search App for Document Queries: A Step-by-Step Guide

6 min read · Sep 2, 2024

Introduction

In today’s data-driven world, professionals like project managers, researchers, and other knowledge workers often need to sift through large volumes of documents to find specific information. This task can be time-consuming, especially when dealing with various file formats like .docx and .txt. While cloud-based solutions offer powerful search capabilities, they often raise concerns about data privacy and security. What if you could have the power of semantic search in a local app, ensuring that your sensitive information stays on your machine?

In this article, we’ll walk through the process of building a local app that allows you to search through your documents using semantic search. We’ll explore the benefits of encoding, how it helps with accurate querying, and why this local solution is ideal for professionals who need to protect sensitive information.

What is Semantic Search?

Semantic search goes beyond simple keyword matching. It understands the context and meaning of your query, allowing it to find relevant documents even if they don’t contain the exact keywords you used. This is particularly useful when searching through documents that might phrase information differently than your query.

Why Encoding?

Encoding is the process of transforming text data into numerical representations, known as embeddings. These embeddings capture the semantic meaning of the text, allowing the search algorithm to compare the meanings of different documents and find the most relevant ones. By encoding your documents, you enable the semantic search functionality that powers accurate and context-aware queries.
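
To make the idea of comparing embeddings concrete, here is a minimal sketch of cosine similarity, the measure used later in this article, written in plain Python. The three-dimensional vectors and their names are made-up illustrations, not model output; real embeddings from all-MiniLM-L6-v2 have 384 dimensions.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (hand-picked, not produced by a model)
query = [0.9, 0.1, 0.0]
budget_doc = [0.8, 0.2, 0.1]   # points in nearly the same direction as the query
recipe_doc = [0.0, 0.1, 0.9]   # points in a very different direction

score_budget = cosine_sim(query, budget_doc)
score_recipe = cosine_sim(query, recipe_doc)
print(score_budget > score_recipe)
```

A score close to 1 means the two vectors point in almost the same direction, which is how the search decides that two texts are about similar things.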

The Use Case

Imagine you’re a project manager overseeing multiple projects, each with its own set of documentation. These documents might include meeting notes, project plans, research papers, and more. Finding a specific piece of information can become a daunting task, especially as the volume of documents grows.

With our app, you can simply enter a query and the app will return the top 5 documents that most closely match the meaning of your query. The best part? The search runs locally on your machine, so your project documents are never sent to third-party cloud services.

Project Setup

Before diving into the code, ensure you have the following installed:

Python 3.11
PyQt5
sentence-transformers
scikit-learn
python-docx
NumPy

If you don’t have these, you can install them using pip:

pip install pyqt5 sentence-transformers scikit-learn python-docx numpy

Step 1: Loading and Encoding the Documents

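The first step is to load and encode your .docx files. The code snippet below handles this task. It reads each .docx file in the specified directory, extracts the text, and encodes it using a pre-trained model from the sentence-transformers library. The model weights are downloaded once on first use and cached locally, so the first run needs an internet connection; after that, encoding runs entirely on your machine.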

import os
import numpy as np
from sentence_transformers import SentenceTransformer
from docx import Document

# Load the model
model_name = 'all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)

# Function to load and encode .docx files
def load_and_encode(directory_path):
    encoded_data = []
    file_names = []

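    # sorted() keeps the file order stable between runs
    for filename in sorted(os.listdir(directory_path)):
        # Skip the "~$" lock files Word creates while a document is open
        if filename.lower().endswith(".docx") and not filename.startswith("~$"):
            file_path = os.path.join(directory_path, filename)
            doc = Document(file_path)
            # Only body paragraphs are extracted; text inside tables is skipped
            full_text = "\n".join([para.text for para in doc.paragraphs])
            encoded_text = model.encode(full_text, convert_to_tensor=False)
            encoded_data.append(encoded_text)
            file_names.append(file_path)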

    return np.array(encoded_data), np.array(file_names)
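
One caveat with encoding a whole document as a single string: according to its model card, all-MiniLM-L6-v2 truncates input longer than 256 word pieces, so the embedding mostly reflects the start of a long document. A common workaround is to split the text into overlapping chunks and encode each chunk. The sketch below shows only the splitting step in plain Python; the function name `chunk_words` and the chunk sizes are assumptions for illustration, not part of the original app.

```python
def chunk_words(text, chunk_size=150, overlap=30):
    """Split text into overlapping chunks of at most chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

# A made-up 400-word document to show the chunk boundaries
sample = " ".join(f"w{i}" for i in range(400))
chunks = chunk_words(sample)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk could then be passed to model.encode, and the chunk vectors either averaged into one document vector or kept separately. Which option works better depends on your documents.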

Step 2: Storing Encoded Data

Once the documents are encoded, you don’t want to re-encode them every time you query the app. The following snippet checks if the directory has changed since the last encoding and, if not, loads the existing encoded data. This optimization reduces the app’s response time.

import hashlib

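# Function to calculate the hash of the .docx files in a directory.
# Only the documents are hashed: the cache files written below live in the
# same directory, and hashing them too would change the hash on every run.
def calculate_directory_hash(directory_path):
    hash_md5 = hashlib.md5()
    for file in sorted(os.listdir(directory_path)):
        if not file.lower().endswith(".docx") or file.startswith("~$"):
            continue
        # Include the name so that renaming a file also changes the hash
        hash_md5.update(file.encode("utf-8"))
        file_path = os.path.join(directory_path, file)
        with open(file_path, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                hash_md5.update(chunk)
    return hash_md5.hexdigest()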

# Function to load embeddings and file names if the directory has not changed
def load_data_if_unchanged(directory_path):
    current_hash = calculate_directory_hash(directory_path)
    hash_file_path = os.path.join(directory_path, 'dir_hash.txt')

    if os.path.exists(hash_file_path):
        with open(hash_file_path, 'r') as hash_file:
            saved_hash = hash_file.read()

        if saved_hash == current_hash:
            print("Directory unchanged. Loading existing data.")
            return np.load(os.path.join(directory_path, 'text_embeddings.npy')), \
                   np.load(os.path.join(directory_path, 'file_names.npy'))

    print("Directory changed or first time loading. Encoding data.")
    text_embeddings, file_names = load_and_encode(directory_path)
    np.save(os.path.join(directory_path, 'text_embeddings.npy'), text_embeddings)
    np.save(os.path.join(directory_path, 'file_names.npy'), file_names)

    with open(hash_file_path, 'w') as hash_file:
        hash_file.write(current_hash)

    return text_embeddings, file_names
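
Hashing file contents means every document is read in full before each search. A lighter alternative, sketched below under stated assumptions, is to fingerprint only each file's name, size, and modification time. The helper name `directory_fingerprint` is hypothetical, and the trade-off is that an edit which preserves both size and timestamp would go unnoticed.

```python
import hashlib
import os
import tempfile

def directory_fingerprint(directory_path, extension=".docx"):
    """Hash the name, size and modification time of each matching file."""
    digest = hashlib.sha256()
    for name in sorted(os.listdir(directory_path)):
        if not name.lower().endswith(extension):
            continue  # ignore cache files and other formats
        info = os.stat(os.path.join(directory_path, name))
        digest.update(f"{name}|{info.st_size}|{info.st_mtime_ns}\n".encode("utf-8"))
    return digest.hexdigest()

# Small demonstration in a throwaway directory (file contents are placeholders)
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.docx"), "wb") as f:
        f.write(b"one")
    before = directory_fingerprint(tmp)

    # Writing a cache file must not change the fingerprint
    with open(os.path.join(tmp, "dir_hash.txt"), "w") as f:
        f.write("cache")
    after_cache = directory_fingerprint(tmp)

    # Adding a document must change it
    with open(os.path.join(tmp, "b.docx"), "wb") as f:
        f.write(b"two")
    after_new_doc = directory_fingerprint(tmp)
```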

Step 3: Querying the Model

With the documents encoded and stored, the next step is to query them. The snippet below takes a user query, encodes it, and calculates the cosine similarity between the query and each document’s embedding. It returns one similarity score per document; the UI code in Step 4 then picks the five highest scores.

from sklearn.metrics.pairwise import cosine_similarity

# Function to query the model
def query_model(query, text_embeddings):
    query_embedding = model.encode([query], convert_to_tensor=False)
    similarities = cosine_similarity(query_embedding, text_embeddings)[0]
    return similarities
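
The scores returned by query_model still have to be ranked. As a minimal sketch of that ranking step in plain Python, the helper below (the name `top_k` is hypothetical) pairs each score with its file name and keeps the best few. The scores and file names are made up for illustration.

```python
import heapq

def top_k(scores, names, k=5):
    """Return the k (name, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, zip(names, scores), key=lambda pair: pair[1])

# Made-up similarity scores for seven hypothetical documents
scores = [0.12, 0.87, 0.45, 0.91, 0.30, 0.66, 0.05]
names = [f"doc{i}.docx" for i in range(len(scores))]

best = top_k(scores, names)
for name, score in best:
    print(f"{score:.2f}  {name}")
```

If there are fewer than k documents, the helper simply returns all of them in ranked order.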

Step 4: Building the User Interface

Now, let’s create a simple UI using PyQt. The UI allows the user to select a directory of .docx files, enter a query, and view the five most relevant matches.

from PyQt5.QtWidgets import QApplication, QWidget, QVBoxLayout, QPushButton, QLabel, QFileDialog, QTextEdit

# Function to run the UI
def run_app():
    app = QApplication([])

    window = QWidget()
    window.setWindowTitle("Document Search")

    layout = QVBoxLayout()

    directory_label = QLabel("Select Directory:")
    layout.addWidget(directory_label)

    directory_button = QPushButton("Choose Directory")
    layout.addWidget(directory_button)

    query_label = QLabel("Enter your query:")
    layout.addWidget(query_label)

    query_text = QTextEdit()
    layout.addWidget(query_text)

    search_button = QPushButton("Search")
    layout.addWidget(search_button)

    result_label = QLabel("Results:")
    layout.addWidget(result_label)

    def choose_directory():
        directory = QFileDialog.getExistingDirectory(window, "Select Directory")
        if directory:
            directory_label.setText(f"Selected Directory: {directory}")
            window.directory_path = directory

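    def search_documents():
        query = query_text.toPlainText().strip()
        # directory_path is only set once the user has picked a directory
        directory_path = getattr(window, "directory_path", None)
        if not directory_path or not query:
            result_label.setText("Please choose a directory and enter a query.")
            return

        text_embeddings, file_names = load_data_if_unchanged(directory_path)
        if len(file_names) == 0:
            result_label.setText("No .docx files found in the selected directory.")
            return
        similarities = query_model(query, text_embeddings)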

        top_indices = similarities.argsort()[-5:][::-1]
        results = ""
        for index in top_indices:
            doc_path = file_names[index]
            doc = Document(doc_path)
            full_text = "\n".join([para.text for para in doc.paragraphs])
            preview = full_text[:500]  # Preview of the first 500 characters
            results += f"File: {doc_path}\nPreview: {preview}\n\n"

        result_label.setText(results)

    directory_button.clicked.connect(choose_directory)
    search_button.clicked.connect(search_documents)

    window.setLayout(layout)
    window.show()
    app.exec_()

if __name__ == "__main__":
    run_app()

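Here’s what the UI looks like: a single window titled “Document Search” with a directory picker, a query box, a Search button, and a results area.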

Step 5: Packaging the App

To make the app easier to distribute, you can package it into an executable using PyInstaller:

pyinstaller --onefile --windowed your_script_name.py

This command creates a standalone executable that runs without a separate installation of Python or the required libraries. PyInstaller builds for the operating system it runs on, so run the command on Windows to get a Windows executable.

For the complete working code of this app, see the GitHub repository: https://github.com/monzurularash/Semantic-Search-App

How It Works

Loading and Encoding Documents: The app first checks whether the directory containing your documents has changed since the last time you ran it. If the directory hasn't changed, it loads the precomputed embeddings, saving you time. If there are changes, it re-encodes the documents and updates the embeddings. Note that the snippets in this article read .docx files only; .txt files need their own loader.
Semantic Search: When you enter a query, the app encodes the query into an embedding and compares it with the embeddings of your documents using cosine similarity. This process identifies the documents most relevant to your query.
Displaying Results: The top 5 documents are displayed with a preview of their content, helping you quickly find the information you’re looking for.
Local Data Processing: Since all operations occur locally on your machine, you don’t have to worry about sensitive data being sent to the cloud.
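
The introduction mentions .txt files, while load_and_encode in Step 1 reads only .docx. As a hedged sketch of how plain-text files could be read before encoding, the helper below tries UTF-8 first and falls back to Latin-1. The name `read_txt` and the fallback encoding are assumptions for illustration, not taken from the original app.

```python
import os
import tempfile

def read_txt(file_path):
    """Read a text file as UTF-8, falling back to Latin-1 if decoding fails."""
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError:
        with open(file_path, "r", encoding="latin-1") as f:
            return f.read()

# Demonstration with two throwaway files in different encodings
with tempfile.TemporaryDirectory() as tmp:
    utf8_path = os.path.join(tmp, "notes.txt")
    with open(utf8_path, "w", encoding="utf-8") as f:
        f.write("Budget review: café meeting")

    latin_path = os.path.join(tmp, "legacy.txt")
    with open(latin_path, "wb") as f:
        f.write("café".encode("latin-1"))

    utf8_text = read_txt(utf8_path)
    latin_text = read_txt(latin_path)
```

The returned string can be passed to model.encode in the same way as the text extracted from a .docx file.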

Use Case and Benefits

This app is ideal for project managers, researchers, and other professionals who work with large volumes of documents. For example, a project manager might need to quickly find specific information across multiple project documents. Instead of manually searching through each file, the project manager can enter a query and retrieve the most relevant documents in one step. This saves time and reduces the chance that important information is overlooked.

Similarly, a researcher dealing with a vast collection of research papers, notes, and articles can use this app to quickly locate specific studies or data points. The ability to semantically search through documents means that even if the exact keywords aren’t present, the app can still find documents that are conceptually related to the query.

Conclusion

In a world where data privacy and efficiency are increasingly important, having a powerful local tool like this semantic search app is invaluable. Whether you’re a project manager looking to streamline your workflow, a researcher needing quick access to specific studies, or any professional dealing with a large volume of documents, this app offers a secure, efficient, and intelligent solution.

By understanding and implementing the concepts of encoding and semantic search, you can significantly enhance your ability to manage and retrieve information from your document collections — all while ensuring that your sensitive data remains secure.