Lesson 1
Creating a Document Processor for Contextual Retrieval

Welcome to the first lesson of our course on building a RAG-powered chatbot with LangChain and Python! In this course, we'll be creating a complete Retrieval-Augmented Generation (RAG) system that can intelligently answer questions based on your documents.

At the heart of any RAG system is the document processor. This component is responsible for taking your raw documents, processing them into a format that can be efficiently searched, and retrieving the most relevant information when a query is made. Think of it as the librarian of your RAG system — organizing information and fetching exactly what you need when you ask for it.

Understanding the Document Processor

The document processing pipeline we'll build today consists of several key steps:

  1. Loading documents from files (like PDFs)
  2. Splitting these documents into smaller, manageable chunks
  3. Creating vector embeddings for each chunk
  4. Storing these embeddings in a vector database
  5. Retrieving the most relevant chunks when a query is made

This document processor will serve as the foundation for our RAG chatbot. In later units, we'll build a chat engine that can maintain conversation history, and then integrate both components into a complete RAG system. By the end of this course, you'll have a powerful chatbot that can answer questions based on your document collection with remarkable accuracy.

Let's start building our document processor!

Setting Up the Document Processor Class

First, we need to create a class that will handle all our document processing needs. This class will encapsulate the functionality for loading, processing, and retrieving information from documents.

Let's start by setting up the basic structure of our DocumentProcessor class:

Python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

class DocumentProcessor:
    def __init__(self):
        self.chunk_size = 1000
        self.chunk_overlap = 100
        self.embedding_model = OpenAIEmbeddings()
        self.vectorstore = None

In this initialization method, we're setting up several important parameters:

  • chunk_size: This determines how large each document chunk will be (measured in characters). We're using 1000 characters as a default, which is a good balance between context size and specificity.
  • chunk_overlap: This specifies how much overlap there should be between consecutive chunks. Overlap helps maintain context across chunk boundaries.
  • embedding_model: We're using OpenAI's embedding model to convert our text chunks into vector representations.
  • vectorstore: This will hold our FAISS vector store, which we'll initialize later when we process our first document.

These parameters can be adjusted based on your specific needs. For example, if you're working with technical documents where context is crucial, you might want to increase the chunk size and overlap.
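
For example, one way to make these settings tunable is to accept them as constructor arguments instead of hard-coding them. The sketch below is a variation on the class above, not the lesson's exact code; the default values mirror the ones we set in __init__, and the larger values at the end are just an illustration of what you might choose for dense technical material.

Python
from langchain_openai import OpenAIEmbeddings

# A variation on the lesson's class: expose chunk size and overlap as
# constructor arguments so they can be tuned per use case.
class DocumentProcessor:
    def __init__(self, chunk_size=1000, chunk_overlap=100):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.embedding_model = OpenAIEmbeddings()
        self.vectorstore = None

# For dense technical documents, larger chunks with more overlap may work better
processor = DocumentProcessor(chunk_size=1500, chunk_overlap=200)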

Implementing Document Loading and Chunking

Now that we have our class structure set up, let's implement the methods for loading documents and splitting them into chunks.

First, we'll create a method to load documents based on their file type:

Python
def load_document(self, file_path):
    """Load a document based on its file type"""
    if file_path.endswith('.pdf'):
        loader = PyPDFLoader(file_path)
    else:
        raise ValueError("Unsupported file format")

    return loader.load()

This method checks the file extension and uses the appropriate loader. Currently, we're only supporting PDF files, but you could easily extend this to support other file types like text files, Word documents, or HTML.
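
As a sketch of what such an extension might look like, the version below adds branches for plain-text and Word files using TextLoader and Docx2txtLoader from langchain_community (the latter needs the docx2txt package installed). This is an illustration, not part of the lesson's implementation; it would replace the load_document method inside the class.

Python
from langchain_community.document_loaders import PyPDFLoader, TextLoader, Docx2txtLoader

def load_document(self, file_path):
    """Load a document based on its file type"""
    if file_path.endswith('.pdf'):
        loader = PyPDFLoader(file_path)
    elif file_path.endswith('.txt'):
        loader = TextLoader(file_path)
    elif file_path.endswith('.docx'):
        # Requires the docx2txt package
        loader = Docx2txtLoader(file_path)
    else:
        raise ValueError("Unsupported file format")

    return loader.load()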

Next, let's implement the method that will process a document and add it to our vector store:

Python
def process_document(self, file_path):
    """Process a document and add it to the vector store"""
    # Load the document
    docs = self.load_document(file_path)

    # Split the document into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=self.chunk_size,
        chunk_overlap=self.chunk_overlap
    )
    split_docs = text_splitter.split_documents(docs)

    # Create or update the vector store
    if self.vectorstore is None:
        self.vectorstore = FAISS.from_documents(split_docs, self.embedding_model)
    else:
        self.vectorstore.add_documents(split_docs)

This method does several important things:

  1. It loads the document using our load_document method.
  2. It creates a RecursiveCharacterTextSplitter with our specified chunk size and overlap.
  3. It splits the loaded document into chunks.
  4. It either creates a new FAISS vector store (if this is the first document) or adds the chunks to our existing vector store.

In our process_document method, we check whether self.vectorstore is None. If it is, we create a new FAISS vector store from the document chunks using the from_documents method. If not, we call add_documents to fold the new chunks into the existing vector store without rebuilding the entire index. This lets us expand the knowledge base incrementally: we can start with a single document and add more over time.
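
As a quick illustration of that incremental behavior, processing a second file simply adds its chunks to the same vector store. The second file name below is a placeholder, not one of the lesson's files:

Python
processor = DocumentProcessor()

# The first call builds the FAISS index from scratch
processor.process_document("data/a_scandal_in_bohemia.pdf")

# Later calls add new chunks to the existing index
processor.process_document("data/another_document.pdf")  # hypothetical second file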

Implementing Context Retrieval Functionality

Now that we can process documents and store their embeddings, we need a way to retrieve relevant context when a query is made. This is where the "retrieval" part of RAG comes into play.

Let's implement a method to retrieve relevant document chunks for a given query:

Python
def retrieve_relevant_context(self, query, k=3):
    """Retrieve relevant document chunks for a query"""
    if self.vectorstore is None:
        return []

    return self.vectorstore.similarity_search(query, k=k)

This method takes a query string and an optional parameter k which specifies how many chunks to retrieve. It then performs a similarity search in our vector store to find the k most relevant chunks. If we haven't processed any documents yet (self.vectorstore is None), we simply return an empty list.
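
For instance, you can ask for more context by raising k. This small usage sketch assumes a document has already been processed; the query is just an example:

Python
# Retrieve the five most relevant chunks instead of the default three
relevant_docs = processor.retrieve_relevant_context("Who is Irene Adler?", k=5)

for doc in relevant_docs:
    # Each result is a LangChain Document; print a short preview of its text
    print(doc.page_content[:100])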

Resetting the Vector Store

Finally, let's add a utility method to reset our document processor:

Python
def reset(self):
    """Reset the document processor"""
    self.vectorstore = None

This method simply sets self.vectorstore to None, effectively clearing our knowledge base. This can be useful if you want to start fresh with a new set of documents.

Putting It All Together: Using the Document Processor

Now that we've built all the components of our document processor, let's see how to use it in a complete RAG workflow. We'll create a simple example that:

  1. Initializes our document processor
  2. Processes a PDF document
  3. Retrieves relevant context for a query
  4. Uses a chat model to generate a response based on this context

Here's the complete example:

Python
from document_processor import DocumentProcessor
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

# Initialize the document processor
processor = DocumentProcessor()

# Process a document
file_path = "data/a_scandal_in_bohemia.pdf"
processor.process_document(file_path)

# Initialize the chat model
chat = ChatOpenAI()

# Define a query
query = "What is the main mystery in the story?"

# Retrieve relevant context
relevant_docs = processor.retrieve_relevant_context(query)
context = "\n\n".join([doc.page_content for doc in relevant_docs])

# Create a prompt template for RAG
prompt_template = ChatPromptTemplate.from_template(
    "Answer the following question based on the provided context.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)

# Format the prompt with our context and query
prompt = prompt_template.format(context=context, question=query)

# Get the response from the model
response = chat.invoke(prompt)

# Print the question and the AI's answer
print(f"Question: {query}")
print(f"Answer: {response.content}")

This example demonstrates the complete RAG workflow:

  1. We process a document and store its chunks in our vector store
  2. We define a query about the document
  3. We retrieve the most relevant chunks for this query
  4. We create a prompt that includes both the query and the retrieved context
  5. We send this prompt to a language model to generate a response

When you run this code with the PDF of "A Scandal in Bohemia" (a Sherlock Holmes story), you might get output like:

Plain text
Question: What is the main mystery in the story?
Answer: The main mystery in "A Scandal in Bohemia" revolves around retrieving a compromising photograph that Irene Adler possesses. This photograph shows her with the King of Bohemia, and the king fears it could damage his reputation and upcoming marriage to a Scandinavian princess. He hires Sherlock Holmes to recover this photograph before Irene can use it to blackmail him or reveal their past relationship.

The language model's response is based on the specific context we provided, which helps ensure that it's accurate and relevant to our document.

Summary and Next Steps

In this lesson, we've built a powerful document processor for our RAG chatbot. We've learned how to:

  • Create a DocumentProcessor class that encapsulates document processing functionality
  • Load documents from PDF files
  • Split documents into manageable chunks with appropriate overlap
  • Create and manage a vector store for efficient similarity search
  • Retrieve relevant context for user queries
  • Integrate our document processor with a chat model for a basic RAG workflow

In the next unit, we'll build on this foundation by creating a chat engine that can maintain conversation history. This will allow our chatbot to have more natural, contextual conversations with users. Eventually, we'll integrate both components into a complete RAG system that can intelligently answer questions based on your documents while maintaining conversational context.

Get ready to practice what you've learned and take your RAG chatbot to the next level!
