Introduction

Welcome to the first lesson of our course on "Scaling Up RAG with Vector Databases"! This is the third course in our series, building directly on the previous two: Course 1 introduced a basic RAG pipeline and showed how retrieval and generation can be combined, while Course 2 focused on text representation, with a particular emphasis on embeddings.

In this course, we'll focus on scaling your Retrieval-Augmented Generation (RAG) system by building and querying a vector database. You'll learn to preprocess documents, store chunk embeddings in ChromaDB, retrieve relevant chunks using advanced techniques like compound metadata filters and weighting, and construct prompts that can handle multiple context chunks. Additionally, we'll cover managing updates to your collection and large-scale ingestion using batch strategies.

Our journey begins with document chunking, a crucial preprocessing step that enhances the efficiency and effectiveness of vector databases in RAG systems. By the end of this lesson, you'll be able to break down a lengthy document into discrete segments, each tagged with essential metadata, paving the way for robust retrieval and storage in vector databases.

Understanding Document Chunking

When dealing with Retrieval-Augmented Generation, you typically feed chunks of text — along with certain metadata — into downstream components, like embedding models or vector search engines. This document chunking process is crucial for several reasons:

  1. Manageability: Smaller segments are easier to process. Many models have a maximum context length, meaning they can only handle a certain number of tokens at a time. If the input text exceeds this limit, the model may become inefficient or unable to process the request, leading to errors or truncated outputs. Additionally, even if a model can technically handle larger contexts, performance may degrade, resulting in slower processing times and reduced accuracy.
  2. Context Preservation: A well-sized chunk still retains enough local context to be meaningful. Chunks should be neither too large (leading to potential memory issues) nor too small (risking the loss of context).
  3. Enhanced Retrieval: When text is split rationally, you can retrieve only the relevant segments instead of searching through entire documents — this shortens query times and boosts accuracy.

Think of a real-world example: If you have a large reference manual, you wouldn't read the whole thing to answer a single question. Instead, you'd look for the exact section (or chunk) that pertains to your query. The same principle carries into text retrieval on a computer.

Splitting Large Text into Chunks

Here's a sample function showcasing how to slice a long piece of text into parts of a specified word count:

Python
def chunk_text(text, chunk_size=10):
    """
    Splits the given text into smaller chunks, each containing
    up to 'chunk_size' words. Returns a list of these chunk strings.
    """
    words = text.split()  # Tokenize by splitting on whitespace
    # Construct chunks by stepping through the words list in increments of chunk_size
    return [" ".join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]
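
As a quick illustration, here's the function applied to a short string (taken from the sample dataset introduced below), with the result shown as a comment:

Python
sample = "Hello world! This is a sample document used for testing chunk_text function."
print(chunk_text(sample, chunk_size=5))
# ['Hello world! This is a', 'sample document used for testing', 'chunk_text function.']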

After tokenizing the text, the function uses a list comprehension to iterate in steps of chunk_size. Each step produces a concise string of words that can be processed more easily by language models. Note that this method is a simplified approach and may not handle punctuation or sentence boundaries effectively, causing issues with context preservation; for more advanced chunking, consider using NLP libraries like NLTK or spaCy that can respect sentence boundaries.
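
If you do reach for NLTK, a minimal sketch of sentence-aware chunking might look like the following. Note that this is an illustrative alternative, not part of this lesson's core code, and it assumes nltk is installed with its punkt tokenizer data downloaded:

Python
import nltk

nltk.download("punkt", quiet=True)  # One-time download of the sentence tokenizer data

def chunk_text_by_sentence(text, max_words=10):
    """
    Groups whole sentences into chunks of up to 'max_words' words,
    so no chunk cuts a sentence in half.
    """
    chunks, current, current_len = [], [], 0
    for sentence in nltk.sent_tokenize(text):
        sentence_len = len(sentence.split())
        # Close the current chunk if adding this sentence would exceed the budget
        if current and current_len + sentence_len > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += sentence_len
    if current:
        chunks.append(" ".join(current))
    return chunks

A sentence longer than max_words will still form its own oversized chunk here; handling that case is a further refinement.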

Inspecting the Input Data

Before moving on, let's take a closer look at the dataset we'll be working with for chunking. Below is an example of two documents in JSON format:

JSON
[
  {
    "id": 1,
    "content": "Hello world! This is a sample document used for testing chunk_text function."
  },
  {
    "id": 2,
    "content": "Another sample document. This is used for verifying the chunking of text in multiple documents. It includes additional sentences to provide a more comprehensive test case. By having a longer document, we can better assess how the chunking function performs when dealing with more extensive content."
  }
]

Our dataset consists of an array of items (or a list of dictionaries, in Python's terms), where each item has an id field that identifies the document and a content field containing the main text. After applying a chunking approach to each document, our data will be represented as smaller segments of text that are easier to process downstream, as we'll see later in the lesson.
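
If your documents live in a JSON file like the one above, Python's built-in json module is the usual way to load them into the data variable used throughout this lesson (the file path below is just illustrative):

Python
import json

# Illustrative path; point this at your own dataset file
with open("data/documents.json", "r", encoding="utf-8") as f:
    data = json.load(f)  # A list of {"id": ..., "content": ...} dictionaries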

Adding Meaningful Metadata

So far, we have a method for chunking text, but in RAG systems, we often need to know where each piece came from and what it represents. That's where metadata comes in. Below is an example function that demonstrates how to loop through a structured dataset of documents, chunk each one, and store metadata for later use:

Python
def load_and_chunk_dataset(data, chunk_size=10):
    """
    Iterates over a structured dataset of documents, splits each into chunks,
    and associates metadata (doc_id and chunk_id) with every piece.
    """
    all_chunks = []
    for doc in data:
        doc_id = doc["id"]
        doc_text = doc["content"]

        # Create smaller text segments from the original document
        doc_chunks = chunk_text(doc_text, chunk_size)

        # Label each chunk with its source identifier
        for chunk_id, chunk_str in enumerate(doc_chunks):
            all_chunks.append({
                "doc_id": doc_id,
                "chunk_id": chunk_id,
                "text": chunk_str
            })

    return all_chunks

Here's how it works step-by-step:

  1. We iterate over data, which is a collection of different text entries in the dataset.
  2. For each entry, we extract the id and content fields to process.
  3. We apply the earlier chunk_text function to split the content into multiple pieces.
  4. We then store relevant information — doc_id and chunk_id — along with the chunked text. This means every piece of text is traceable back to its origin, which can be crucial for retrieval and filtering in a vector database.
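
For instance, once the chunks are built, this traceability makes it easy to recover every piece that originated from a particular document:

Python
chunks = load_and_chunk_dataset(data, chunk_size=10)

# Keep only the chunks that came from document 2
doc2_chunks = [chunk for chunk in chunks if chunk["doc_id"] == 2]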

Practical Usage Example

In this lesson, we'll use two sample documents to demonstrate the chunking process and to illustrate the concepts discussed. In the practice section, we'll be using a more realistic dataset for hands-on exercises; it can be found at src/data/corpus.json.

When we run the following code snippet with chunk_size=10:

Python
chunked_docs = load_and_chunk_dataset(data, chunk_size=10)
print(f"Loaded and chunked {len(chunked_docs)} chunks from dataset.")
for doc in chunked_docs:
    print(doc)

we get this output:

Plain text
Loaded and chunked 7 chunks from dataset.
{'doc_id': 1, 'chunk_id': 0, 'text': 'Hello world! This is a sample document used for testing'}
{'doc_id': 1, 'chunk_id': 1, 'text': 'chunk_text function.'}
{'doc_id': 2, 'chunk_id': 0, 'text': 'Another sample document. This is used for verifying the chunking'}
{'doc_id': 2, 'chunk_id': 1, 'text': 'of text in multiple documents. It includes additional sentences to'}
{'doc_id': 2, 'chunk_id': 2, 'text': 'provide a more comprehensive test case. By having a longer'}
{'doc_id': 2, 'chunk_id': 3, 'text': 'document, we can better assess how the chunking function performs'}
{'doc_id': 2, 'chunk_id': 4, 'text': 'when dealing with more extensive content.'}

This output clearly illustrates both the benefits and limitations of our simple chunking approach. While we've successfully broken down the documents into smaller pieces with proper metadata tracking, the quality of some chunks is problematic. For example, chunk 1 of document 1 contains only the fragment 'chunk_text function.', which lacks sufficient context to be meaningful on its own. Similarly, other chunks cut across sentence boundaries, creating segments that might be difficult for retrieval systems to properly interpret.

These issues highlight why more sophisticated chunking strategies that respect semantic boundaries (like sentences or paragraphs) are essential for context preservation in production RAG systems. Poor chunking can significantly reduce the effectiveness of retrieval, as chunks without adequate context may not match relevant queries or might provide incomplete information. In the practice section, we'll explore more advanced techniques that better preserve context and create more meaningful chunks.
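
One common technique along these lines, shown here as an illustrative sketch rather than this course's prescribed approach, is overlapping windows: consecutive chunks share a few words, so information near a boundary appears in both neighbors:

Python
def chunk_text_with_overlap(text, chunk_size=10, overlap=3):
    """
    Splits text into word chunks where consecutive chunks share
    'overlap' words, softening hard cuts at chunk boundaries.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

Overlap trades extra storage and embedding cost for better boundary context; sentence- or paragraph-aware splitting remains the more principled fix.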

Conclusion and Next Steps

In this lesson, we focused on why and how to chunk larger text documents. We learned that chunking:

  • Makes text blocks suitably sized for processing with language models.
  • Preserves local context by grouping words carefully.
  • Allows for the attachment of metadata that makes future retrieval more powerful.

Next, we'll embed these chunks and store them in a vector database so we can efficiently search within them. Feel free to practice splitting your own documents and confirm the results before moving on to the upcoming exercises. By mastering the chunking and metadata process, you're well on your way to creating scalable, retrieval-enhanced applications.
