Hello there, welcome to the second lesson of our "Scaling Up RAG with Vector Databases" course! In the previous unit, you explored how to break large documents into smaller chunks and attach useful metadata (like `doc_id`, `chunk_id`, and labels such as `category`). These chunks are essential for structuring data in a way that makes retrieval easier. In this lesson, we'll build on that groundwork by showing you how to store them in a vector database. One popular choice is ChromaDB, a specialized, open-source database designed for high-speed, semantic querying of vectors. By switching from keyword-based searches to semantic searches, your RAG system will retrieve relevant information more efficiently. Let's dive in!
A vector database stores data in the form of numerical vectors that capture the semantic essence of texts (or other data). Instead of matching literal words, the database compares these vectors with similarity metrics, so conceptually similar items end up close together in the vector space. This means searches on vector databases can retrieve contextually relevant results even when exact keywords are absent. By leveraging approximate or exact nearest-neighbor strategies, vector databases can scale to handle millions or billions of vectors while still providing quick query responses. This makes them especially suitable for RAG systems, which rely on fast semantic lookups across large collections of text.
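To make the idea of a similarity metric concrete, here's a minimal sketch of cosine similarity, one of the most widely used metrics for comparing embeddings. The three-dimensional vectors and their values are invented purely for illustration; real embedding models produce hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: close to 1.0 means similar direction (similar meaning),
    # close to 0.0 means unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (real models use hundreds of dimensions)
query   = np.array([0.9, 0.1, 0.0])   # e.g., a question about storing vectors
chunk_a = np.array([0.8, 0.2, 0.1])   # a semantically close chunk
chunk_b = np.array([0.0, 0.1, 0.9])   # an unrelated chunk

print(cosine_similarity(query, chunk_a))  # high score -> retrieved
print(cosine_similarity(query, chunk_b))  # low score  -> skipped
```

A vector database applies the same idea at scale: rather than comparing the query against every stored vector, it uses nearest-neighbor indexes to find the top matches quickly.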
Before we explore how to set up a vector database, let's look at why it's a crucial component of a RAG pipeline:
- Semantic Retrieval: By embedding text into vectors, queries can match documents based on meaning rather than strict keyword matches. This yields more accurate and context-sensitive search results.
- Scalability: Specialized vector databases handle large datasets efficiently, allowing you to store and query vast libraries of text chunks without sacrificing performance.
- Richer Context: Embeddings capture nuanced relationships among chunks, ensuring that related information is surfaced even when it doesn't use the exact same terms.
- Easy Updates: Vector databases (like ChromaDB) often allow you to add and remove chunks on the fly, so your collection stays in sync with new or evolving information.
Now, let's jump into coding with ChromaDB, our chosen vector database. Here's how to set up a ChromaDB client:
```python
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions

def build_chroma_collection(chunks):
    # Use a Sentence Transformer model for embeddings
    model_name = 'sentence-transformers/all-MiniLM-L6-v2'
    embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)

    # Create a ChromaDB client with default settings
    client = Client(Settings())

    # Either get an existing collection or create a new one
    collection = client.get_or_create_collection(
        name="rag_collection",
        embedding_function=embed_func
    )
    # ... continues
```
How It Works:
- Embedding Setup: We define a `SentenceTransformerEmbeddingFunction` to generate vectors for the text chunks. The model we're using, `all-MiniLM-L6-v2`, is a lightweight but powerful sentence transformer that maps sentences to a 384-dimensional dense vector space. It's popular for RAG applications because it balances efficiency (small size, fast inference) with strong semantic understanding.
- Client Configuration: `Client(Settings())` connects to ChromaDB with default settings. By default, ChromaDB creates an in-memory store for quick experimentation. You can customize its behavior (for example, specifying a file path for persistence or enabling other features) by passing additional parameters to `Settings()`; see the sketch after this list.
- Collection Management: `get_or_create_collection` checks if a collection named `"rag_collection"` exists; if not, it creates a new one. A collection in ChromaDB is a logical container that groups related documents and their embeddings together, similar to a table in a traditional database but optimized for vector similarity operations. Collections allow you to organize your vector data into separate namespaces, making it possible to maintain multiple distinct sets of documents with different embedding models or for different use cases.
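As referenced above, here's a brief sketch of two of those ideas: verifying the model's embedding dimensionality, and switching from the in-memory default to on-disk persistence. Note that the persistence API varies by ChromaDB version; this sketch assumes a recent release that exposes `chromadb.PersistentClient`, and `./chroma_store` is just an example path, not a required location:

```python
from sentence_transformers import SentenceTransformer
import chromadb

# Verify that all-MiniLM-L6-v2 produces 384-dimensional vectors
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
vector = model.encode("Vector databases enable semantic search.")
print(vector.shape)  # (384,)

# A persistent client writes data to disk instead of keeping it in memory,
# so collections survive restarts. Assumes a recent ChromaDB version;
# "./chroma_store" is an illustrative path.
client = chromadb.PersistentClient(path="./chroma_store")
```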
After setting up your client, embedding function, and creating the collection, the next step is to prepare your chunks for insertion and add them to the collection:
```python
    # ... continues
    # Prepare the data: texts, IDs, and metadata
    texts = [c["content"] for c in chunks]
    ids = [f"chunk_{c['doc_id']}_{c['chunk_id']}" for c in chunks]
    metadatas = [
        {"doc_id": chunk["doc_id"],
         "chunk_id": chunk["chunk_id"],
         "category": chunk["category"]}
        for chunk in chunks
    ]

    # Add the documents (chunks) to the collection
    collection.add(documents=texts, metadatas=metadatas, ids=ids)
    return collection
```
Key Points:
- Data Grouping: Each chunk is mapped to its text, a unique ID, and metadata. These are used during retrieval and for future reference.
- Seamless Insertion: Calling `collection.add()` handles embedding automatically and stores everything for quick semantic searches. A quick way to confirm the insertion worked is sketched below.
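As a sanity check, you can ask the collection how many items it holds and fetch one back by its ID. This sketch reuses the `build_chroma_collection` function defined above and assumes a fresh run (so the collection starts empty); the sample chunk and printed outputs are illustrative:

```python
# Build a tiny collection with a single illustrative chunk
sample_chunks = [
    {"doc_id": 0, "chunk_id": 0, "category": "ai",
     "content": "RAG stands for Retrieval-Augmented Generation."},
]
collection = build_chroma_collection(sample_chunks)

# count() reports how many chunks are stored
print(collection.count())  # 1

# get() retrieves stored documents and metadata by ID (no similarity search)
result = collection.get(ids=["chunk_0_0"])
print(result["documents"])   # ['RAG stands for Retrieval-Augmented Generation.']
print(result["metadatas"])   # [{'category': 'ai', 'chunk_id': 0, 'doc_id': 0}]
```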
ChromaDB allows you to keep your collection up to date with new or modified information. Below is an example of adding and then deleting a "document" (or chunk) after the collection has already been created:
```python
# Example chunks to showcase adding them to a new collection
example_chunks = [
    {"doc_id": 0, "chunk_id": 0, "category": "ai", "content": "RAG stands for Retrieval-Augmented Generation."},
    {"doc_id": 0, "chunk_id": 1, "category": "ai", "content": "A crucial component of a RAG pipeline is the Vector Database."},
    {"doc_id": 1, "chunk_id": 0, "category": "finance", "content": "Accurate data is essential in finance."},
]
collection = build_chroma_collection(example_chunks)

# Prepare a new chunk to add
new_document = {
    "doc_id": 2,
    "chunk_id": 0,
    "category": "food",
    "content": "Bananas are yellow fruits rich in potassium."
}

# Construct a unique ID for the new document
# Format: "chunk_{doc_id}_{chunk_id}" (e.g., "chunk_2_0")
doc_id = f"chunk_{new_document['doc_id']}_{new_document['chunk_id']}"

# Add the new chunk to the existing collection
collection.add(
    documents=[new_document["content"]],  # The text content to be embedded
    metadatas=[{                          # Metadata for filtering and context
        "doc_id": new_document["doc_id"],
        "chunk_id": new_document["chunk_id"],
        "category": new_document["category"]
    }],
    ids=[doc_id]                          # Unique identifier for this chunk
)

# If needed, remove the chunk by its unique ID
# For example, if the information about bananas becomes outdated
collection.delete(ids=[doc_id])  # Using the same ID: "chunk_2_0"
```
Key Points:
- `example_chunks` is our initial list of text chunks with their metadata, while `new_document` is a new chunk we want to add to our existing collection. `doc_id` is a unique identifier string created by combining document ID and chunk ID (e.g., `"chunk_2_0"`).
- Why Unique IDs Matter: Each chunk needs a unique identifier so ChromaDB can reference it later for updates, deletions, or retrieval. By combining `doc_id` and `chunk_id` into a string like `"chunk_2_0"`, we ensure each chunk has a distinct ID while maintaining its relationship to the source document.
- Adding: Calling `collection.add()` again embeds the new document automatically, so your collection grows in real time.
- Deleting: When old or incorrect information becomes obsolete, such as a product description that's been updated or regulatory data that's changed, you can simply remove it by its ID. In real-world scenarios, data about policies, facts, or even entire documents can become outdated over time, and selectively deleting chunks lets you keep your database accurate without a full rebuild.
This flexibility means you can keep your RAG environment updated without fully restructuring your database.
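Deleting and re-adding isn't the only option, either: recent ChromaDB versions also provide `upsert` and `update` methods on collections. Here's a rough sketch, assuming your ChromaDB version supports them (and remembering that the example above ends by deleting the banana chunk, so `upsert` would re-insert it, while `update` requires the ID to already exist):

```python
# upsert() inserts the chunk if its ID is absent (it was deleted above),
# or overwrites the existing entry if the ID is already present
collection.upsert(
    ids=["chunk_2_0"],
    documents=["Bananas are yellow fruits rich in potassium."],
    metadatas=[{"doc_id": 2, "chunk_id": 0, "category": "food"}]
)

# update() re-embeds the new text and overwrites the entry stored
# under the same unique ID in place
collection.update(
    ids=["chunk_2_0"],
    documents=["Bananas are yellow fruits rich in potassium and vitamin B6."],
    metadatas=[{"doc_id": 2, "chunk_id": 0, "category": "food"}]
)
```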
By storing text chunks in a vector database, you've laid the foundation for faster, more semantically aware retrieval. You know how to create, update, and manage a ChromaDB collection—crucial skills for any large-scale RAG system.
In the next lesson, you'll learn how to query the vector database to fetch the most relevant chunks and feed them into a language model. That's where the real magic of producing context-rich, accurate responses shines! For now, feel free to explore different embedding models or try adding and deleting a variety of chunks. When you're ready, proceed to the practice exercises to cement these concepts and further refine your RAG workflow.
