The Stack
Our RAG Embedding Space Visualizer is built on five libraries. Streamlit provides the interactive web application. LangChain handles document loading and text splitting. HuggingFace sentence transformers generate the embeddings. UMAP reduces the high-dimensional embeddings to three dimensions. Plotly renders the interactive 3D visualization. Let's break down the core components and their roles in our application. As always, the entire code base is at the bottom.
Getting Started
$ pip install streamlit pandas numpy langchain sentence-transformers umap-learn plotly PyPDF2
Document Processing with LangChain
We use LangChain's document loaders to handle various file formats. This flexible approach allows us to process both PDF and text files. The load_and_chunk_document function uses either PyPDFLoader or TextLoader depending on the file type, and then applies text splitting logic to create manageable chunks.
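As a rough illustration of the loading and chunking step, here is a minimal sketch. The import paths (langchain_community and langchain_text_splitters) vary between LangChain versions, and the chunk_size and chunk_overlap values and the detect_file_kind helper are assumptions made for this example, not settings taken from the full code. PyPDFLoader also relies on the pypdf package, which may need to be installed separately.

```python
from pathlib import Path


def detect_file_kind(file_path: str) -> str:
    """Hypothetical helper: map a file extension to 'pdf' or 'text'."""
    suffix = Path(file_path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix == ".txt":
        return "text"
    raise ValueError(f"Unsupported file type: {suffix or 'no extension'}")


def load_and_chunk_document(file_path: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Load a PDF or text file and split it into overlapping chunks."""
    # Imported lazily so the extension check works without LangChain installed.
    # These import paths are assumptions and depend on the LangChain version.
    from langchain_community.document_loaders import PyPDFLoader, TextLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    kind = detect_file_kind(file_path)
    loader = PyPDFLoader(file_path) if kind == "pdf" else TextLoader(file_path)
    documents = loader.load()

    # Sizes are measured in characters; both values are assumed defaults.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    return splitter.split_documents(documents)
```

The overlap keeps a little shared context between neighboring chunks, so a sentence cut at a chunk boundary still appears whole in at least one of them.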
Embedding Generation
For embedding generation, we use HuggingFace's sentence transformers. The generate_embeddings function uses the all-MiniLM-L6-v2 model, which provides a good balance between embedding quality and speed. The function takes the chunk texts as input and returns an array of embeddings, one 384-dimensional vector per chunk, using the embed_documents method.
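A minimal sketch of what this step might look like follows. It assumes the HuggingFaceEmbeddings wrapper from langchain_community (newer LangChain releases also ship it in a separate langchain_huggingface package), and chunk_texts is a hypothetical helper added here so the function accepts either raw strings or LangChain documents.

```python
def chunk_texts(chunks):
    """Hypothetical helper: return plain strings from Documents or raw strings."""
    return [getattr(chunk, "page_content", chunk) for chunk in chunks]


def generate_embeddings(chunks):
    """Embed every chunk and return a 2D array with one row per chunk."""
    # Imported lazily because these are heavy, optional dependencies.
    import numpy as np
    from langchain_community.embeddings import HuggingFaceEmbeddings

    # The model is downloaded on first use and cached locally.
    model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

    # embed_documents takes a list of strings and returns a list of float lists.
    vectors = model.embed_documents(chunk_texts(chunks))
    return np.array(vectors)  # shape: (number of chunks, 384)
```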
Dimensionality Reduction with UMAP
To visualize high-dimensional embeddings in 3D space, we employ UMAP. The reduce_dimensions function takes the high-dimensional embeddings and reduces them to three dimensions. UMAP is well suited to this task because it preserves local neighborhoods while retaining much of the global structure of the data. We set n_neighbors, which controls how much weight is given to local versus global structure, and min_dist, which controls how tightly points are packed together, to tune the visualization.
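The reduction step could be sketched roughly as follows. The parameter values shown are UMAP's library defaults (n_neighbors=15, min_dist=0.1), not necessarily the ones used in the full code, and the fixed random_state and the safe_n_neighbors helper are assumptions added for this example.

```python
def safe_n_neighbors(n_samples: int, requested: int = 15) -> int:
    """Hypothetical helper: cap n_neighbors at n_samples - 1, never below 2."""
    return max(2, min(requested, n_samples - 1))


def reduce_dimensions(embeddings, n_neighbors: int = 15, min_dist: float = 0.1,
                      random_state: int = 42):
    """Project high-dimensional embeddings down to three dimensions."""
    import umap  # provided by the umap-learn package

    reducer = umap.UMAP(
        n_components=3,  # three output dimensions for the 3D plot
        n_neighbors=safe_n_neighbors(len(embeddings), n_neighbors),
        min_dist=min_dist,
        random_state=random_state,  # assumed: fixed seed for repeatable layouts
    )
    return reducer.fit_transform(embeddings)
```

Capping n_neighbors matters for short documents, where there may be fewer chunks than the requested neighborhood size. A document with only a handful of chunks may still be too small to produce a meaningful projection.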
Interactive Visualization with Plotly
We use Plotly to create an interactive 3D scatter plot of our reduced embeddings. The visualize_embeddings function takes the reduced embeddings, named chunks, and full chunks as input. It creates a DataFrame with the 3D coordinates and chunk information, then uses Plotly Express to generate an interactive 3D scatter plot. This implementation allows users to interact with the 3D plot, zooming in on specific clusters and hovering over points to view chunk details.
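One way this function might be written is sketched below. The column names, the marker size, and the hover_preview helper that shortens chunk text for the tooltip are all assumptions made for illustration.

```python
def hover_preview(text: str, limit: int = 200) -> str:
    """Hypothetical helper: collapse whitespace and truncate long chunk text."""
    flat = " ".join(text.split())
    return flat if len(flat) <= limit else flat[:limit].rstrip() + "..."


def visualize_embeddings(reduced_embeddings, chunk_names, full_chunks):
    """Build an interactive 3D scatter plot with one point per chunk."""
    import pandas as pd
    import plotly.express as px

    # One row per chunk: 3D coordinates plus the label and a text preview.
    df = pd.DataFrame(reduced_embeddings, columns=["x", "y", "z"])
    df["chunk"] = list(chunk_names)
    df["preview"] = [hover_preview(text) for text in full_chunks]

    fig = px.scatter_3d(
        df,
        x="x",
        y="y",
        z="z",
        hover_name="chunk",
        # Show the text preview on hover and hide the raw coordinates.
        hover_data={"preview": True, "x": False, "y": False, "z": False},
    )
    fig.update_traces(marker=dict(size=4))
    return fig
```

Truncating the hover text keeps tooltips readable when chunks run to a thousand characters or more.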
Putting It All Together with Streamlit
Streamlit ties all these components together into a clean web application. The main function orchestrates the whole pipeline: it handles the file upload, processes the document, generates embeddings, reduces dimensions, and finally visualizes the embedding space. If any step fails, error handling reports the problem back to the user.
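A simplified sketch of how such a main function might be structured is shown below. Here build_figure is a hypothetical callable that stands in for the load, embed, reduce, and plot steps, and save_upload is a helper assumed for this example. Neither name comes from the full code.

```python
import os
import tempfile


def save_upload(file_name: str, data: bytes) -> str:
    """Write uploaded bytes to a temporary file, keeping the original extension."""
    suffix = os.path.splitext(file_name)[1]
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(data)
        return tmp.name


def main(build_figure):
    """Run the app; build_figure(path) is assumed to return a Plotly figure."""
    import streamlit as st

    st.title("RAG Embedding Space Visualizer")
    uploaded = st.file_uploader("Upload a PDF or text file", type=["pdf", "txt"])
    if uploaded is None:
        return  # nothing to do until the user picks a file

    # The document loaders expect a path on disk, so persist the upload first.
    path = save_upload(uploaded.name, uploaded.getvalue())
    try:
        with st.spinner("Processing document..."):
            fig = build_figure(path)
        st.plotly_chart(fig)
    except Exception as exc:
        # Surface any failure in the page instead of a raw traceback.
        st.error(f"Could not process the file: {exc}")
    finally:
        os.remove(path)  # clean up the temporary file
```

Saved as a script that calls main with a suitable callable, this would be launched with streamlit run followed by the script name.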
The Importance of This Visualization
For RAG systems, understanding the embedding space is crucial. Combining dimensionality reduction with interactive 3D plotting gives RAG developers and researchers a way to gain insight into their document collections, refine their retrieval strategies, and ultimately improve the performance of their RAG systems. By visualizing how different parts of a document, or different documents, relate to each other in the embedding space, we can reason about query-document relevance, assess content diversity, detect outliers, and refine chunking strategies.
The visualization can show whether a document collection covers a diverse range of topics or is heavily focused on specific areas. It can highlight unique or anomalous content, which may be important for certain queries or may indicate noise in the dataset. It also shows how different chunks of the same document relate to each other, which helps in refining text splitting strategies to produce more semantically coherent chunks. This simple view of the embedding space gives developers an easy way to make informed decisions about their RAG system's architecture and fine-tuning.