Welcome to the final lesson in our Scaling Up RAG with Vector Databases course! Previously, we explored how to chunk large documents for efficient retrieval, store these chunks in a vector database (such as ChromaDB), and then retrieve them to build prompts for Large Language Models (LLMs). Remember, chunking and storing these text fragments provided the basic scaffolding for a Retrieval-Augmented Generation (RAG) pipeline.
In this lesson, we will expand on that foundation by introducing metadata-based filtering, which allows you to target specific attributes — like category or date — and make your content searches significantly more precise. By the end, you will be able to create queries that focus only on the metadata you care about, such as retrieving documents from specific categories.
Before we get hands-on, let's talk about the intuition behind metadata:
What is Metadata, and Why Does It Matter?
Metadata includes any labeled information that describes your text chunks. Common examples are category, date, or title. When you have a large collection of documents, a plain text-based similarity search might return results you don't actually want. By selectively filtering on metadata, you can drastically reduce irrelevant results and ensure only the most pertinent information is retrieved.
Real-World Example
Imagine a large enterprise knowledge base spanning different departments (e.g., Human Resources, Technology, Finance). If you only want to see technology-related documents, applying a simple metadata filter on the category field ensures that your search never strays into HR or Finance content. This becomes particularly useful when you have specialized queries that are domain-specific and need accurate, fast retrieval.
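To build intuition before touching ChromaDB, here is a minimal plain-Python sketch of the same idea. The documents, texts, and field names below are made up for illustration; they are not part of the course corpus:

```python
# Toy knowledge base: each entry pairs text with descriptive metadata.
documents = [
    {"text": "Our new cloud platform reduces deployment times.", "category": "Technology"},
    {"text": "Updated parental leave policy takes effect in June.", "category": "HR"},
    {"text": "Q3 revenue exceeded projections by 12 percent.", "category": "Finance"},
    {"text": "Migrating services to Kubernetes cut infrastructure costs.", "category": "Technology"},
]

def filter_by_category(docs, categories):
    """Keep only documents whose category label appears in the allowed list."""
    return [d for d in docs if d["category"] in categories]

tech_docs = filter_by_category(documents, ["Technology"])
for doc in tech_docs:
    print(doc["text"])
```

A vector database applies exactly this kind of restriction before (or alongside) the similarity search, so HR and Finance chunks never even compete for the top results.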
Let's move to coding, first focusing on the metadata filter:
```python
def metadata_enhanced_search(query, collection, categories=None, top_k=3):
    # If categories are provided, build the filter
    where = {"category": {"$in": categories}} if categories else None
```
Here, we take an optional list of categories and build a `where` clause:

- A `where` query in ChromaDB acts like a targeted filter on the collection (similar to a `WHERE` clause in SQL). By specifying `{"category": {"$in": categories}}`, only documents with a matching category will be returned.
- The `$` symbol in `$in` denotes a special operator in ChromaDB's query language. It's similar to MongoDB's query syntax, where operators like `$in`, `$eq`, `$gt`, etc. are prefixed with a dollar sign to distinguish them from regular field names. In this case, `$in` checks if the category value is contained within the provided list of categories.
- If no categories are passed in, the filter is set to `None`, which tells ChromaDB to run a broader, unfiltered search.
Now, let's see how the search is actually performed and how results are organized:
```python
    # Perform the query with an optional metadata filter
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        where=where
    )

    # Compile the results
    retrieved_chunks = []
    for i in range(len(results['documents'][0])):
        retrieved_chunks.append({
            "chunk": results['documents'][0][i],
            "doc_id": results['metadatas'][0][i]['doc_id'],
            "category": results['metadatas'][0][i].get('category'),
            "distance": results['distances'][0][i]
        })

    return retrieved_chunks
```
- The `collection.query` method retrieves the documents most similar to the given `query_texts`. We control how many matching chunks to retrieve per query by passing `n_results`, while the `where` parameter is the optional dictionary defining our metadata filter.
- After running the query, we gather the relevant chunks, document IDs, categories, and distance scores into a structured output, making it simpler to handle the results in subsequent steps of your pipeline.
- The `distance` value represents how semantically similar the retrieved chunk is to the query; in ChromaDB, this is calculated using cosine distance between the query's embedding and each document's embedding. While cosine similarity ranges from -1 to 1, ChromaDB returns cosine distance (1 - cosine similarity), which ranges from 0 to 2, so larger values indicate less similarity.
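The relationship between similarity and distance can be checked with a few lines of plain Python. The toy 3-dimensional vectors below stand in for real embeddings, which have hundreds of dimensions:

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))   # identical direction -> 0.0
print(cosine_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))   # orthogonal -> 1.0
print(cosine_distance([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]))  # opposite direction -> 2.0
```

This is why the distances in the output below hover around 1: the query and the chunks are related but far from identical in embedding space.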
Next, we can integrate this metadata-based search into our workflow. Let's try running a sample query with and without metadata filtering:
```python
# Example query
query = "Recent advancements in AI and their impact on teaching"

print("======== WITHOUT CATEGORY FILTER ========")
no_filter_results = metadata_enhanced_search(query, collection, categories=None, top_k=3)
for res in no_filter_results:
    print(f"Doc ID: {res['doc_id']}, Category: {res['category']}, Distance: {res['distance']:.4f}")
    print(f"Chunk: {res['chunk']}\n")

print("======== WITH CATEGORY FILTER (Education) ========")
edu_filter_results = metadata_enhanced_search(query, collection, categories=["Education"], top_k=3)
for res in edu_filter_results:
    print(f"Doc ID: {res['doc_id']}, Category: {res['category']}, Distance: {res['distance']:.4f}")
    print(f"Chunk: {res['chunk']}\n")
```
When you run the example query above against our `data/corpus.json`, you obtain the following output:
```text
======== WITHOUT CATEGORY FILTER ========
Doc ID: 64, Category: Education, Distance: 1.0530
Chunk: The integration of technology in education is revolutionizing traditional teaching methods. Digital tools and interactive platforms are making learning more engaging. Educators are adapting to these changes to enhance student outcomes.

Doc ID: 24, Category: Education, Distance: 1.1431
Chunk: Universities are rethinking traditional education models to better prepare students for a dynamic global job market. Innovative teaching methods, including online and hybrid courses, are gaining traction. These reforms aim to create more engaging and effective learning environments.

Doc ID: 1, Category: Technology, Distance: 1.1532
Chunk: Artificial intelligence is transforming the way we approach complex problems in computing. Recent breakthroughs in machine learning have enabled faster data processing and smarter algorithms. The future of technology is expected to integrate AI into every facet of life.

======== WITH CATEGORY FILTER (Education) ========
Doc ID: 64, Category: Education, Distance: 1.0530
Chunk: The integration of technology in education is revolutionizing traditional teaching methods. Digital tools and interactive platforms are making learning more engaging. Educators are adapting to these changes to enhance student outcomes.

Doc ID: 24, Category: Education, Distance: 1.1431
Chunk: Universities are rethinking traditional education models to better prepare students for a dynamic global job market. Innovative teaching methods, including online and hybrid courses, are gaining traction. These reforms aim to create more engaging and effective learning environments.

Doc ID: 63, Category: Education, Distance: 1.2630
Chunk: Modern classrooms are benefiting from innovative pedagogical approaches that encourage active learning. Educators are integrating technology to create interactive lessons. These methods aim to foster critical thinking and creativity among students.
```
By default (no filter), you see documents that span both “Education” and “Technology.” Notice how Doc ID 1, despite mentioning AI, focuses more on broad computing rather than teaching. Once the “Education” filter is applied, Doc ID 1 is excluded and Doc ID 63 appears instead, emphasizing “modern classrooms” and strategies for integrating technology. Given the query about “AI and its impact on teaching,” Doc ID 63 is more closely aligned with education-focused content, underscoring how metadata-based filtering can help narrow your search down to only the most relevant subsets of your data.
In this lesson, you learned how to harness metadata-based filtering to refine search results in your RAG pipeline. You've seen that by storing category information (or any other descriptor) alongside your text chunks, you can easily pinpoint the data most relevant to your query. This makes your system significantly more robust and efficient, especially as your document collection grows.
Next, you will have the chance to practice implementing these ideas on your own in the upcoming exercises. Good luck, and keep exploring the power of metadata in RAG!
