Table of Contents
Introduction
Environment Setup
Setup Custom Embeddings Class
Milvus Store Creation
Document Loading
Insertion Methods
Single Insertions
Bulk Insertion
Batched Insertion
Main Execution and Results
Conclusion
1. Introduction
In this article, we'll explore the performance differences between single, bulk, and batched insertions when working with Milvus, a powerful vector database. We'll use LangChain, a framework for developing applications powered by language models, to interact with Milvus and perform our experiments.
2. Environment Setup
Before diving into the code, let's set up our environment. We need to install the following packages for working with embeddings, Milvus, and LangChain, as well as utilities for progress tracking and for converting PDFs to text:
$ pip install sentence-transformers langchain-milvus langchain-core tqdm pymilvus PyMuPDF
Within your directory, create a new Python file named pdf_converter.py. Then create two subdirectories: pdfs and books. Visit Project Gutenberg and download 10 or so books in PDF format, saving them all in the pdfs directory. Run python pdf_converter.py; the script will process each PDF file and save the extracted text as .txt files in the books directory.
home/
│
├── pdfs/ # Put your downloaded PDFs here
│ ├── book1.pdf
│ ├── book2.pdf
│ └── ...
│
├── books/ # Extracted text files will appear here
│ ├── book1.txt
│ ├── book2.txt
│ └── ...
└── pdf_converter.py # Our code for converting pdfs to txt
└── main.py # Our code for testing insertion methods
3. Setup Custom Embeddings Class
We start by defining a custom embeddings class that utilizes the SentenceTransformer model:
Our SentenceTransformerEmbeddings class encapsulates the embedding functionality, allowing us to generate embeddings for both individual queries and lists of documents. By default, it uses the all-MiniLM-L6-v2 model, which offers a good balance between speed and quality across many common NLP tasks.
4. Milvus Store Creation
Next, let's define a function to create a fresh Milvus store:
Our create_milvus_store function initializes a new Milvus collection using LangChain's Milvus integration. It sets up the connection to a Milvus Lite database and ensures we start with a clean slate by dropping any existing collection with the same name.
5. Document Loading
To simulate a real-world scenario, we create a function to load documents from text files:
The load_documents function reads all .txt files from a specified directory, splits them into paragraphs, and creates LangChain Document objects. Each document includes the paragraph text and metadata about its source file.
6. Insertion Methods
We implement three different insertion methods to compare their performance:
6.a Single Insertions
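A minimal sketch of the single-insertion helper (the function name insert_single and the timing approach are assumptions):

```python
import time

from tqdm import tqdm


def insert_single(store, documents) -> float:
    """Insert documents one at a time; returns elapsed seconds."""
    start = time.time()
    for doc in tqdm(documents, desc="Single insertions"):
        # One add_documents call per document simulates incremental arrival.
        store.add_documents([doc])
    return time.time() - start
```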
This function inserts documents one at a time, simulating a scenario where documents are processed individually as they become available.
6.b Bulk Insertion
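A sketch of the bulk variant (insert_bulk is an assumed name); the whole list goes through one add_documents call:

```python
import time


def insert_bulk(store, documents) -> float:
    """Insert all documents in a single add_documents call; returns elapsed seconds."""
    start = time.time()
    store.add_documents(documents)
    return time.time() - start
```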
The bulk insertion method adds all documents to Milvus in a single operation, which can be more efficient for large datasets that are available all at once.
6.c Batched Insertion
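A sketch of the batched variant (the name insert_batched and the default batch size of 100 are assumptions):

```python
import time

from tqdm import tqdm


def insert_batched(store, documents, batch_size: int = 100) -> float:
    """Insert documents in fixed-size batches; returns elapsed seconds."""
    start = time.time()
    for i in tqdm(range(0, len(documents), batch_size), desc="Batched insertion"):
        # Each slice becomes one add_documents call.
        store.add_documents(documents[i : i + batch_size])
    return time.time() - start
```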
Batched insertion strikes a balance between single and bulk insertions by processing documents in smaller groups. This can be useful when dealing with large datasets that exceed memory constraints or when you want to report progress more frequently.
7. Main Execution and Results
Let's tie everything together in a main function:
We load our documents, run each insertion method, and finally report the results, including total time, insertion speed, and speedup factors relative to single insertions.
8. Conclusion
Comparing the performance of different insertion strategies when working with Milvus and LangChain is crucial, especially in production environments. By analyzing the results, we can make informed decisions about the most efficient way to insert data into our vector database based on our specific use case and data characteristics.
Typically, bulk insertions offer the best performance for large datasets that can be processed all at once, while batched insertions provide a good compromise between memory usage and speed for very large datasets or streaming scenarios. Single insertions, while slower, might be necessary for real-time or incremental updates to the database.
By understanding these trade-offs, developers can optimize their Milvus-based applications for better performance and resource utilization.
Here's the full code for the embeddings script and a helper script for converting PDFs to text.