Welcome back! In the previous lesson, we explored how to generate embeddings for document chunks using OpenAI and LangChain. Today, we will build on that knowledge by diving into vector databases and how they enable the efficient retrieval of relevant information through similarity search.
Vector databases are specialized storage systems designed to handle high-dimensional vector data, such as the embeddings we generated in the last lesson. They are crucial for performing similarity searches, which allow us to find document chunks that are semantically similar to a given query. In this lesson, we will focus on using FAISS, a powerful tool developed by Facebook AI, to create a local vector store. This will enable us to efficiently store and search through our embeddings, paving the way for advanced document retrieval tasks.
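To make "semantically similar" concrete: each embedding is just a long vector of numbers, and closeness is measured with a metric such as cosine similarity (or the Euclidean/L2 distance that LangChain's FAISS wrapper uses by default). The sketch below uses made-up three-dimensional vectors rather than real embeddings, which typically have 1,536 dimensions, to show what such a comparison computes:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Return the cosine of the angle between two vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" -- real embeddings are far higher-dimensional
query_vec = np.array([0.9, 0.1, 0.0])
chunk_a = np.array([0.8, 0.2, 0.1])   # points in nearly the same direction as the query
chunk_b = np.array([0.0, 0.1, 0.9])   # points in a very different direction

print(cosine_similarity(query_vec, chunk_a))  # high score, ~0.98
print(cosine_similarity(query_vec, chunk_b))  # low score, ~0.01
```

A vector database like FAISS does essentially this comparison, but organized so that searching millions of vectors stays fast.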
Before we can perform a similarity search, we need to prepare our document and initialize our embedding model.
Here's a quick recap of how to do it:
```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Create a loader for our document
loader = PyPDFLoader(file_path)

# Load the document
docs = loader.load()

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = text_splitter.split_documents(docs)

# Initialize the OpenAI embedding model
embedding_model = OpenAIEmbeddings()
```
This code snippet demonstrates how to load a document, split it into chunks, and initialize our embedding model, preparing everything for further processing.
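One practical note: `OpenAIEmbeddings` reads your API key from the `OPENAI_API_KEY` environment variable, so make sure it is set before running the code. It can also be useful to sanity-check the split before spending tokens on embeddings; a quick inspection like the one below works (the exact chunk count depends on your PDF):

```python
# Quick sanity check before generating embeddings
print(f"Number of chunks: {len(split_docs)}")
print(f"First chunk preview: {split_docs[0].page_content[:200]}")
print(f"First chunk metadata: {split_docs[0].metadata}")
```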
With our document chunks ready and embedding model initialized, the next step is to generate embeddings and create a vector store. As you learned in the previous lesson, embeddings are numerical representations of text that capture semantic meaning.
We'll use FAISS (Facebook AI Similarity Search) to create a vector store. Think of this as a specialized database designed specifically for storing and searching through embeddings efficiently.
```python
from langchain_community.vectorstores import FAISS

# Generate embeddings for all the document chunks and create a vector store
vectorstore = FAISS.from_documents(split_docs, embedding_model)
```
This single line of code handles all the complex work of embedding generation and storage, making it easy for us to perform similarity searches in the next step. Let's break down what's happening in this code:
- We import the `FAISS` class from LangChain's vector store collection.
- We call `FAISS.from_documents()` and pass two important parameters:
  - `split_docs`: our list of document chunks that we want to search through later.
  - `embedding_model`: our OpenAI embedding model that will convert each text chunk into a vector.
Behind the scenes, this method:
- Takes each document chunk from `split_docs`.
- Uses the embedding model to convert each chunk's text into a numerical vector.
- Organizes all these vectors in a FAISS index for efficient searching.
- Returns a ready-to-use vector store that maintains the connection between the vectors and their original text.
It's worth noting that the vector store preserves the association between each embedding vector and its original document object, including its metadata. This matters because it enables the system not just to retrieve matching text chunks, but also to surface metadata such as the page number or source file, which is critical in multi-document applications and user-facing interfaces.
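One more practical point: the vector store we just built lives in memory, so it disappears when the script exits. If you want to reuse the index without regenerating (and re-paying for) the embeddings, LangChain's FAISS wrapper can persist it to disk. Here is a minimal sketch; the folder name `faiss_index` is arbitrary, and the `allow_dangerous_deserialization` flag is required in recent `langchain_community` releases because loading the index unpickles data:

```python
# Save the index and its document store to a local folder
vectorstore.save_local("faiss_index")

# Later -- or in a separate script -- reload it with the same embedding model
reloaded_store = FAISS.load_local(
    "faiss_index",
    embedding_model,
    allow_dangerous_deserialization=True,  # acknowledge that loading unpickles local data
)
```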
Now that we have our vector store, we can perform a similarity search to retrieve relevant documents. Similarity search involves finding document chunks whose embeddings are closest to a given query's embedding. This allows us to extract information that is semantically similar to the query.
Here's how we perform a similarity search:
```python
# Define our search query
query = "What was the main clue?"

# Perform similarity search to find the top 3 most relevant document chunks
retrieved_docs = vectorstore.similarity_search(query, k=3)

# Loop through each retrieved document
for doc in retrieved_docs:
    # Print the first 300 characters of each document chunk
    print(doc.page_content[:300], "...\n")
```
When we run this code with our Sherlock Holmes story, we get the following output:
```text
The little man stood glancing from one to the
other of us with half-frightened, half-hopeful eyes,
as one who is not sure whether he is on the verge
of a windfall or of a catastrophe. Then he stepped
into the cab, and in half an hour we were back in
the sitting-room at Baker Street. Nothing had been ...

less innocent aspect. Here is the stone; the stone
came from the goose, and the goose came from Mr.
Henry Baker, the gentleman with the bad hat and
all the other characteristics with which I have bored
you. So now we must set ourselves very seriously
to finding this gentleman and ascertaining what
pa ...

she found matters as described by the last
witness. Inspector Bradstreet, B division,
gave evidence as to the arrest of Horner,
who struggled frantically, and protested his
innocence in the strongest terms. Evidence
of a previous conviction for robbery having
been given against the prisoner, the mag ...
```
As you can see, the similarity search has retrieved three document chunks that are semantically related to our query about the "main clue" in the story. Even though the exact phrase "main clue" might not appear in the text, the system has identified passages that discuss evidence, the stone (the blue carbuncle), and the investigation - all relevant to our query about clues in the mystery.
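If you also want to see how close each match is, or which page a chunk came from, the vector store offers `similarity_search_with_score`, which returns `(document, score)` pairs. For the default index built by `FAISS.from_documents`, the score is an L2 distance, so lower values mean closer matches. A short sketch:

```python
# Retrieve documents together with their distance scores
results = vectorstore.similarity_search_with_score(query, k=3)

for doc, score in results:
    # Lower L2 distance means a closer match; metadata shows where the chunk came from
    print(f"score={score:.4f}, page={doc.metadata.get('page')}")
    print(doc.page_content[:150], "...\n")
```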
In this lesson, you learned how to create a local vector store with FAISS and perform a similarity search to retrieve relevant information from documents. We built on your knowledge of document loading, splitting, and embedding to enable efficient document retrieval.
As you move on to the practice exercises, I encourage you to experiment with different documents and queries to solidify your understanding. This hands-on practice will prepare you for the next unit, where we will continue to build on these skills. Keep up the great work, and I look forward to seeing you in the next lesson!