Introduction

Hello there, welcome to the second lesson of our "Scaling Up RAG with Vector Databases" course! In the previous unit, you explored how to break large documents into smaller chunks and attach useful metadata (like doc_id, chunk_id, and labels such as category). These chunks are essential for structuring data in a way that makes retrieval easier. In this lesson, we'll build on that groundwork by showing you how to store those chunks in a vector database. One popular choice is ChromaDB, an open-source database designed for fast semantic querying of vectors. By switching from keyword-based search to semantic search, your RAG system can retrieve relevant information even when a query and a chunk share no exact keywords. Let's dive in!

Understanding Vector Databases

A vector database stores data as numerical vectors (embeddings) that capture the semantic essence of texts or other data. Instead of matching literal words, the database compares vectors with a similarity metric such as cosine similarity or Euclidean distance, so conceptually similar items, whose vectors lie close together, are returned together. This means searches can retrieve contextually relevant results even when the query's keywords are absent from the text. By using exact or approximate nearest-neighbor search, vector databases can scale to millions or even billions of vectors while still answering queries quickly. This makes them especially suitable for RAG systems, which rely on fast semantic lookups across large collections of text.
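
To make similarity-based lookup concrete, here is a minimal sketch of exact (brute-force) nearest-neighbor search using cosine similarity. The three-dimensional vectors and the cosineSimilarity and nearestNeighbors helpers are illustrative assumptions; real embeddings have hundreds or thousands of dimensions, and a vector database would typically use an index instead of scanning every item.

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|). Higher means more similar.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // convention used here for zero vectors
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Exact nearest-neighbor search: score every item, sort descending, keep the top k.
function nearestNeighbors(queryVector, items, k = 2) {
  return items
    .map((item) => ({ id: item.id, score: cosineSimilarity(queryVector, item.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy "embeddings" (hypothetical values chosen by hand for illustration).
const items = [
  { id: "cats", vector: [0.9, 0.1, 0.0] },
  { id: "kittens", vector: [0.8, 0.2, 0.1] },
  { id: "stock-market", vector: [0.0, 0.1, 0.9] },
];

const results = nearestNeighbors([1.0, 0.0, 0.0], items, 2);
console.log(results.map((r) => r.id)); // [ 'cats', 'kittens' ]
```

The query vector shares no "keyword" with any item; it is matched purely by direction in vector space, which is the property a RAG retriever relies on.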

Why We Need Vector Databases for RAG

Before we explore how to set up a vector database, let's look at why it's a crucial component of a RAG pipeline:

  1. Semantic Retrieval: By embedding text into vectors, queries can match documents based on meaning rather than strict keyword matches. This yields more accurate and context-sensitive search results.
  2. Scalability: Specialized vector databases handle large datasets efficiently, allowing you to store and query vast libraries of text chunks without sacrificing performance.
  3. Richer Context: Embeddings capture nuanced relationships among chunks, ensuring that related information is surfaced even when it doesn't use the exact same terms.
  4. Easy Updates: Vector databases (like ChromaDB) often allow you to add and remove chunks on the fly, so your collection stays in sync with new or evolving information.

Setting Up ChromaDB and Basic Configuration

Now, let's jump into coding with ChromaDB, our chosen vector database. Here's how to set up a ChromaDB client using JavaScript:
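
The following is a minimal sketch of that setup, not a definitive listing. It assumes the chromadb npm package with its 1.x-style API, a ChromaDB server listening at http://localhost:8000, and an OpenAI key in the OPENAI_API_KEY environment variable. The connectToChroma and ensureCollection helper names are hypothetical.

```javascript
// Assumption: the `chromadb` npm package (1.x-style API) is installed.
// It is loaded lazily so ensureCollection() can be used with any client
// object that exposes the same methods.
async function connectToChroma() {
  const { ChromaClient, OpenAIEmbeddingFunction } = await import("chromadb");

  // Embedding function backed by OpenAI's text-embedding-ada-002 model.
  // OPENAI_API_KEY is assumed to be set in the environment.
  const embeddingFunction = new OpenAIEmbeddingFunction({
    openai_api_key: process.env.OPENAI_API_KEY,
    openai_model: "text-embedding-ada-002",
  });

  const client = new ChromaClient({ path: "http://localhost:8000" });
  return { client, embeddingFunction };
}

// Return the named collection, creating it only if it does not exist yet.
async function ensureCollection(client, name, embeddingFunction) {
  const listed = await client.listCollections();
  // Depending on the client version, entries may be plain names or objects
  // with a `name` field, so normalize to strings before comparing.
  const names = listed.map((c) => (typeof c === "string" ? c : c.name));

  if (names.includes(name)) {
    return client.getCollection({ name, embeddingFunction });
  }
  return client.createCollection({ name, embeddingFunction });
}

// Usage (requires a running server and an OpenAI API key):
//   const { client, embeddingFunction } = await connectToChroma();
//   const collection = await ensureCollection(client, "rag_collection", embeddingFunction);
```

The client also offers getOrCreateCollection(), which folds the existence check into a single call; the explicit version is shown here because it mirrors the steps described below.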

How It Works:

  • Embedding Setup: We define an OpenAIEmbeddingFunction to generate vectors for the text chunks. The model we're using, text-embedding-ada-002, is an OpenAI embedding model that maps text to dense 1536-dimensional vectors. It's popular for RAG applications because it balances efficiency with strong semantic understanding.
  • Client Configuration: new ChromaClient({ path: "http://localhost:8000" }) connects to a ChromaDB server at the given URL. The same client code works whether the server runs locally or remotely; only the URL changes.
  • Collection Management: We check if a collection named "rag_collection" exists; if not, we create a new one. It's worth noting that, in the client version used here, listCollections() returns only the collection names, not their full metadata. So when checking for existence, make sure you're comparing against strings, not objects, to avoid mismatches. This check avoids accidentally creating duplicates and keeps your namespace clean. A collection in ChromaDB is a logical container that groups related documents and their embeddings together, similar to a table in a traditional database but optimized for vector similarity operations. Collections let you organize your vector data into separate namespaces, so you can maintain multiple distinct sets of documents with different embedding models or for different use cases.

Preparing Data and Adding Chunks to ChromaDB

After setting up your client, embedding function, and creating the collection, the next step is to prepare your chunks for insertion and add them to the collection:
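
Here is a minimal sketch of that preparation step, not a definitive listing. It assumes each chunk carries the doc_id, chunk_id, and category fields from the previous lesson plus a text field (the text field name is an assumption), and that collection is a ChromaDB collection created with an embedding function. The sample chunks and the toChromaRecords and addChunks helpers are hypothetical.

```javascript
// Hypothetical sample chunks; the `text` field name is an assumption.
const chunks = [
  { doc_id: 0, chunk_id: 0, category: "ai", text: "RAG combines retrieval with text generation." },
  { doc_id: 0, chunk_id: 1, category: "ai", text: "Vector databases store embeddings for semantic search." },
  { doc_id: 1, chunk_id: 0, category: "finance", text: "Diversification spreads risk across assets." },
];

// Convert chunks into the parallel arrays that collection.add() expects:
// one ID, one document text, and one metadata object per chunk.
function toChromaRecords(chunkList) {
  return {
    ids: chunkList.map((c) => `chunk_${c.doc_id}_${c.chunk_id}`),
    documents: chunkList.map((c) => c.text),
    metadatas: chunkList.map((c) => ({
      doc_id: c.doc_id,
      chunk_id: c.chunk_id,
      category: c.category,
    })),
  };
}

// Because the collection is assumed to have an embedding function attached,
// add() embeds the documents before storing them.
async function addChunks(collection, chunkList) {
  const records = toChromaRecords(chunkList);
  await collection.add(records);
  return records.ids;
}

console.log(toChromaRecords(chunks).ids); // [ 'chunk_0_0', 'chunk_0_1', 'chunk_1_0' ]
```

Keeping the conversion in a pure function makes it easy to inspect the IDs and metadata before anything is sent to the database.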

Key Points:

  • Data Grouping: Each chunk is mapped to its text, a unique ID, and metadata. These are used during retrieval and future reference.
  • Seamless Insertion: Calling collection.add() handles embedding and stores everything for quick semantic searches.

Keep in mind that ChromaDB handles both the embedding and storage operations inside add(). If you attempt to add a chunk with an ID that already exists in the collection, ChromaDB does not overwrite the existing entry. Depending on the version, it either raises a duplicate-ID error or ignores the new entry, so stored text can end up out of date without any obvious signal. To avoid failed or incomplete ingestion, especially during large-scale batch operations, validate the uniqueness of your IDs before calling add(), or add explicit error handling to manage such conflicts. When you do intend to replace an existing entry, use upsert() instead of add().
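
One way to validate IDs before insertion is sketched below. It assumes records has the { ids, documents, metadatas } shape passed to add() and uses collection.get() to look up which IDs are already stored. The keepOnlyNewRecords helper is hypothetical.

```javascript
// Split a batch into records that are safe to add and IDs that must be skipped,
// either because they are already stored or because they repeat within the batch.
async function keepOnlyNewRecords(collection, records) {
  // get() returns the stored entries whose IDs match; unknown IDs are simply absent.
  const uniqueIds = [...new Set(records.ids)];
  const existing = await collection.get({ ids: uniqueIds });
  const seen = new Set(existing.ids);

  const fresh = { ids: [], documents: [], metadatas: [] };
  const skipped = [];
  records.ids.forEach((id, i) => {
    if (seen.has(id)) {
      skipped.push(id);
      return;
    }
    seen.add(id); // also guards against repeats inside this batch
    fresh.ids.push(id);
    fresh.documents.push(records.documents[i]);
    fresh.metadatas.push(records.metadatas[i]);
  });
  return { fresh, skipped };
}

// Usage (inside an async function, with a real or test collection):
//   const { fresh, skipped } = await keepOnlyNewRecords(collection, records);
//   if (fresh.ids.length > 0) await collection.add(fresh);
//   if (skipped.length > 0) console.warn("Skipped duplicate IDs:", skipped);
```

For very large batches, the lookup costs one extra round trip per batch, which is usually cheaper than retrying a failed ingestion.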

Updating and Managing Documents

ChromaDB allows you to keep your collection up to date with new or modified information. Below is an example of adding and then deleting a "document" (or chunk) after the collection has already been created:
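
A minimal sketch of that flow follows, not a definitive listing. It assumes collection is an existing ChromaDB collection that already holds the initial exampleChunks, and that chunks carry a text field (an assumption). The sample newDocument and the addThenDelete helper are hypothetical.

```javascript
// Hypothetical new chunk; the `text` field name is an assumption.
const newDocument = {
  doc_id: 2,
  chunk_id: 0,
  category: "food",
  text: "Bananas are yellow fruits rich in potassium.",
};

// `collection` is assumed to already contain the initial exampleChunks.
async function addThenDelete(collection, doc) {
  // Unique ID built from the document ID and chunk ID, e.g. "chunk_2_0".
  const docId = `chunk_${doc.doc_id}_${doc.chunk_id}`;

  // Add the new chunk; the collection embeds the text before storing it.
  await collection.add({
    ids: [docId],
    documents: [doc.text],
    metadatas: [{ doc_id: doc.doc_id, chunk_id: doc.chunk_id, category: doc.category }],
  });
  const countAfterAdd = await collection.count();

  // Remove the same chunk again by its ID.
  await collection.delete({ ids: [docId] });
  const countAfterDelete = await collection.count();

  return { docId, countAfterAdd, countAfterDelete };
}

// Usage (inside an async function, with a real collection):
//   const summary = await addThenDelete(collection, newDocument);
//   console.log(summary);
```

Returning the counts makes it easy to confirm that the collection grew by one entry after add() and shrank back after delete().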

Key Points:

  • exampleChunks is our initial list of text chunks with their metadata, while newDocument is a new chunk we want to add to our existing collection.
  • docId is a unique identifier string created by combining document ID and chunk ID (e.g., "chunk_2_0").
  • Why Unique IDs Matter: Each chunk needs a unique identifier so ChromaDB can reference it later for updates, deletions, or retrieval. By combining doc_id and chunk_id into a string like "chunk_2_0", we ensure each chunk has a distinct ID while maintaining its relationship to the source document.

Conclusion and Next Steps

By storing text chunks in a vector database, you've laid the foundation for faster, more semantically aware retrieval. You know how to create, update, and manage a ChromaDB collection — crucial skills for any large-scale RAG system.

In the next lesson, you'll learn how to query the vector database to fetch the most relevant chunks and feed them into a language model. That's where the real magic of producing context-rich, accurate responses shines! For now, feel free to explore different embedding models or try adding and deleting a variety of chunks. When you're ready, proceed to the practice exercises to cement these concepts and further refine your RAG workflow.
