Table of Contents
Introduction
Environment Setup
Setup Custom Embeddings Class
Milvus Store Creation
Document Loading
Insertion Methods
Single Insertions
Bulk Insertion
Batched Insertion
Main Execution and Results
Conclusion
1. Introduction
In this article, we'll explore the performance differences between single, bulk, and batched insertions when working with Milvus, a powerful vector database. We'll use LangChain, a framework for developing applications powered by language models, to interact with Milvus and perform our experiments.
2. Environment Setup
Before diving into the code, let's set up our environment. We need to install the following packages for working with embeddings, Milvus, and LangChain, as well as utilities for progress tracking and for converting PDFs to text:
$ pip install sentence-transformers langchain-milvus langchain-core tqdm pymilvus PyMuPDF
Within your directory, create a new Python file named pdf_converter.py. Then create two subdirectories: pdfs and books. Visit Project Gutenberg and download 10 or so books in PDF format, saving them all in the pdfs directory. Run python pdf_converter.py; the script will process each PDF file and save the extracted text as .txt files in the books directory.
home/
│
├── pdfs/ # Put your downloaded PDFs here
│ ├── book1.pdf
│ ├── book2.pdf
│ └── ...
│
├── books/ # Extracted text files will appear here
│ ├── book1.txt
│ ├── book2.txt
│ └── ...
└── pdf_converter.py # Our code for converting pdfs to txt
└── main.py # Our code for testing insertion methods
3. Setup Custom Embeddings Class
We start by defining a custom embeddings class that utilizes the SentenceTransformer model:
Our SentenceTransformerEmbeddings class encapsulates the embedding functionality, allowing us to generate embeddings for both individual queries and lists of documents. By default, it uses the all-MiniLM-L6-v2 model, which offers a good balance between speed and quality across many common NLP tasks.
4. Milvus Store Creation
Next, let's define a function to create a fresh Milvus store:
Our create_milvus_store function initializes a new Milvus collection using LangChain's Milvus integration. It sets up the connection to a Milvus Lite database and ensures we start with a clean slate by dropping any existing collection with the same name.
5. Document Loading
To simulate a real-world scenario, we create a function to load documents from text files:
The load_documents function reads all .txt files from a specified directory, splits them into paragraphs, and creates LangChain Document objects. Each document includes the paragraph text and metadata about its source file.
6. Insertion Methods
We implement three different insertion methods to compare their performance:
6.a Single Insertions
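A minimal sketch of the single-insertion helper (the function name insert_single and the timing approach are assumptions):

```python
import time

from tqdm import tqdm


def insert_single(store, documents) -> float:
    """Insert documents one at a time; returns elapsed seconds."""
    start = time.time()
    for doc in tqdm(documents, desc="Single insertions"):
        # One add_documents call per document simulates incremental arrival.
        store.add_documents([doc])
    return time.time() - start
```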
This function inserts documents one at a time, simulating a scenario where documents are processed individually as they become available.
6.b Bulk Insertion
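A sketch of the bulk variant (insert_bulk is an assumed name); the whole list goes through one add_documents call:

```python
import time


def insert_bulk(store, documents) -> float:
    """Insert all documents in a single add_documents call; returns elapsed seconds."""
    start = time.time()
    store.add_documents(documents)
    return time.time() - start
```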
The bulk insertion method adds all documents to Milvus in a single operation, which can be more efficient for large datasets that are available all at once.
6.c Batched Insertion
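A sketch of the batched variant (the name insert_batched and the default batch size of 100 are assumptions):

```python
import time

from tqdm import tqdm


def insert_batched(store, documents, batch_size: int = 100) -> float:
    """Insert documents in fixed-size batches; returns elapsed seconds."""
    start = time.time()
    for i in tqdm(range(0, len(documents), batch_size), desc="Batched insertion"):
        # Each slice becomes one add_documents call.
        store.add_documents(documents[i : i + batch_size])
    return time.time() - start
```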
Batched insertion strikes a balance between single and bulk insertions by processing documents in smaller groups. This can be useful when dealing with large datasets that exceed memory constraints or when you want to report progress more frequently.
7. Main Execution and Results
Let's tie everything together in a main function:
We load our documents, run each insertion method, and finally report the results, including total time, insertion speed, and speedup factors relative to single insertions.
8. Conclusion
Comparing the performance of different insertion strategies when working with Milvus and LangChain is crucial, especially in production environments. By analyzing the results, we can make informed decisions about the most efficient way to insert data into our vector database based on our specific use case and data characteristics.
Typically, bulk insertions offer the best performance for large datasets that can be processed all at once, while batched insertions provide a good compromise between memory usage and speed for very large datasets or streaming scenarios. Single insertions, while slower, might be necessary for real-time or incremental updates to the database.
By understanding these trade-offs, developers can optimize their Milvus-based applications for better performance and resource utilization.
Here's the full code for the embeddings script and a helper script for converting PDFs to text.