Chunking Documents for Efficient Retrieval

Introduction: Why Chunking is Needed for RAG

Welcome back! In the previous lesson, you learned how to represent and load documents using Mastra. Now, we will take the next step in building a smart email assistant: splitting documents into smaller, manageable pieces, a process called chunking.

Why is chunking important? When working with retrieval-augmented generation (RAG) systems, you often need to search for relevant information inside large documents. If you pass a whole document at once, it can exceed what the model can handle in a single request, and the details that matter get buried among unrelated text. By breaking documents into smaller chunks, you make it easier to find and use the right information at the right time.

In this lesson, you will learn how to split a document into overlapping chunks using Mastra. This is a key step before you can retrieve and use information efficiently in your smart email assistant.

Quick Recall: Loading Documents in Mastra

Before we start chunking, let’s quickly review how to load a document in Mastra. You saw this in the previous lesson, but here’s a short reminder.

To load a document, you use the MDocument.fromText() method. This method takes a string (the text of your document) and creates a document object you can work with.

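For example, you might write const doc = MDocument.fromText(text), where text is a placeholder for the string that holds your document’s content.

Here, doc is now a Mastra document containing your text. You will use this document in the next steps.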

What is Chunking? Key Concepts

Now, let’s talk about chunking. Chunking means splitting a document into smaller pieces, called "chunks." Each chunk is a segment of the original text. This helps with searching and retrieving information, especially when the document is long.

There are two important ideas to understand:

Chunk Size: This is how big each chunk is, usually measured in characters or words. If the chunk is too small, you might lose important context. If it’s too big, each chunk mixes several topics, which makes matches less precise and takes up more of the model’s input.
Overlap: Sometimes, you want chunks to share some content. This is called overlap. Overlapping chunks help make sure that important information at the edge of one chunk is not missed in the next chunk.

For example, if you have a chunk size of 100 characters and an overlap of 20, each new chunk will start 80 characters after the previous one, so the last 20 characters of the previous chunk are included in the next chunk.
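
The arithmetic above can be sketched in a few lines of TypeScript. This is only an illustration of how size and overlap determine where chunks start; the helper name chunkStarts is hypothetical and is not part of Mastra.

```typescript
// Start offset of every chunk for a given text length, chunk size, and overlap.
// Illustrative arithmetic only; chunkStarts is a hypothetical helper, not a Mastra API.
function chunkStarts(textLength: number, size: number, overlap: number): number[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("Expected size > 0 and 0 <= overlap < size");
  }
  const stride = size - overlap; // how far each new chunk advances
  const starts: number[] = [];
  for (let start = 0; start < textLength; start += stride) {
    starts.push(start);
    if (start + size >= textLength) break; // this chunk already reaches the end
  }
  return starts;
}

// A 250-character text with size 100 and overlap 20 advances 80 characters at a time.
console.log(chunkStarts(250, 100, 20)); // starts at 0, 80, and 160
```

The third chunk starts at 160 and would end at 260, so it simply stops at the end of the text.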

Chunking a Document with Mastra

Let’s see how to chunk a document using Mastra. We will build this step by step.

Step 1: Prepare the Document

First, make sure you have your document loaded with MDocument.fromText(), as shown earlier, so that doc holds the text you want to split.

Step 2: Chunk the Document

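Now, let’s split the document into chunks. Mastra provides a chunk method for this, which you call on the document with an options object, for example doc.chunk({ strategy: "recursive", size: 512, overlap: 50 }). You can specify the strategy, size, and overlap: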

strategy: "recursive" tells Mastra to use its recursive chunking method, which tries to split the document at logical points (like sentences or paragraphs) when possible.
size: 512 sets the maximum size of each chunk to 512 characters.
overlap: 50 means each chunk will share about 50 characters with the previous chunk.
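
To make the idea of splitting at logical points more concrete, here is a simplified sketch of recursive splitting in plain TypeScript. It is not Mastra’s implementation: the function name recursiveSplit, the separator list, and the choice to ignore overlap are all assumptions made to keep the example short.

```typescript
// Simplified sketch of recursive splitting: try coarse separators first, and only
// fall back to finer ones for pieces that are still too long.
// Hypothetical helper for illustration; not Mastra's algorithm, and it ignores overlap.
function recursiveSplit(
  input: string,
  maxSize: number,
  separators: string[] = ["\n\n", ". ", " "] // assumed order: paragraph, sentence, word
): string[] {
  if (input.length <= maxSize) return input.length > 0 ? [input] : [];
  if (separators.length === 0) {
    // No separators left: hard-cut into fixed-size pieces.
    const pieces: string[] = [];
    for (let i = 0; i < input.length; i += maxSize) pieces.push(input.slice(i, i + maxSize));
    return pieces;
  }
  const [separator, ...finer] = separators;
  const chunks: string[] = [];
  let current = "";
  for (const part of input.split(separator)) {
    const candidate = current === "" ? part : current + separator + part;
    if (candidate.length <= maxSize) {
      current = candidate; // keep packing parts into the current chunk
      continue;
    }
    if (current !== "") chunks.push(current);
    current = "";
    if (part.length <= maxSize) {
      current = part;
    } else {
      // A single part is still too long: recurse with the finer separators.
      chunks.push(...recursiveSplit(part, maxSize, finer));
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

const sampleEmail =
  "Hi team.\n\nThe launch moved to Friday. Please update the calendar.\n\nThanks, Ana";
for (const piece of recursiveSplit(sampleEmail, 40)) {
  console.log(JSON.stringify(piece));
}
```

In this sketch the separator at a cut point is dropped, so a sentence that ends exactly at a boundary loses its trailing period. A production splitter would usually keep it.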

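You can also use other chunking strategies in Mastra. The exact set of strategy names depends on the Mastra version you have installed, so check the chunk reference for your version. The strategies discussed in this lesson are: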

strategy: "character": Splits the document into fixed-size chunks based purely on character count, without considering sentence or paragraph boundaries.
strategy: "sentence": Splits the document at sentence boundaries, creating chunks that contain whole sentences.
strategy: "paragraph": Splits the document at paragraph boundaries, so each chunk is a paragraph or a group of paragraphs.
strategy: "fixed": Splits the document into fixed-size chunks without considering sentence or paragraph boundaries.
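
The difference between cutting by size and cutting at boundaries is easiest to see on a small input. The sketch below contrasts the two ideas in plain TypeScript; splitFixed and splitSentences are hypothetical helpers written for this illustration and do not reflect how Mastra implements its strategies.

```typescript
// Cut every `size` characters, regardless of word or sentence boundaries.
function splitFixed(input: string, size: number): string[] {
  if (size <= 0) throw new Error("size must be positive");
  const pieces: string[] = [];
  for (let i = 0; i < input.length; i += size) {
    pieces.push(input.slice(i, i + size));
  }
  return pieces;
}

// Cut after runs of ., !, or ? so each piece is a whole sentence.
// A deliberately naive rule: abbreviations such as "e.g." would be split too.
function splitSentences(input: string): string[] {
  const matches: string[] = input.match(/[^.!?]+[.!?]*/g) || [];
  return matches.map((s) => s.trim()).filter((s) => s.length > 0);
}

const shippingNote =
  "Your order shipped today. Tracking arrives by email. Reply if anything is missing.";

console.log(splitFixed(shippingNote, 30)[0]); // "Your order shipped today. Trac"
console.log(splitSentences(shippingNote)); // three whole sentences
```

The fixed-size cut ends in the middle of the word "Tracking", while the sentence-based split keeps each sentence intact at the cost of uneven chunk lengths.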

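For example, to switch strategies you change only the strategy value in the options you pass to chunk, and keep size and overlap as they are.

Choose the strategy that best fits your document structure and retrieval needs.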

Step 3: Inspect the Chunks

After chunking, you can look at the resulting chunks by looping over them and printing each chunk’s .text property.

Since the example document is short, you may only get one chunk. For longer documents, you would see several overlapping chunks.

Inspecting the Resulting Chunks

Let’s look more closely at what happens when you chunk a document.

Each chunk is an object with a .text property containing a segment of the original document.
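If your document is longer, you will see multiple chunks, each overlapping with the previous one by roughly the number of characters you set in the overlap parameter. A strategy that splits at sentence or paragraph boundaries may not land on the exact count.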
Overlapping helps ensure that important information at the boundaries of chunks is not lost during retrieval.
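
One way to check the overlap yourself is to compare the end of each chunk with the start of the next. The sketch below does this on stand-in data: the TextChunk interface models only the .text property mentioned above, and slidingChunks is a hypothetical fixed-window chunker used in place of Mastra so the example runs on its own.

```typescript
// Minimal chunk shape for this sketch: only the .text property is modeled.
interface TextChunk {
  text: string;
}

// Hypothetical stand-in chunker: fixed-size windows with overlap (not Mastra's code).
function slidingChunks(input: string, size: number, overlap: number): TextChunk[] {
  const stride = size - overlap;
  if (size <= 0 || overlap < 0 || stride <= 0) {
    throw new Error("Expected size > 0 and 0 <= overlap < size");
  }
  const chunks: TextChunk[] = [];
  for (let start = 0; start < input.length; start += stride) {
    chunks.push({ text: input.slice(start, start + size) });
    if (start + size >= input.length) break; // last chunk reached the end
  }
  return chunks;
}

// Length of the longest suffix of `previous` that is also a prefix of `next`.
function sharedLength(previous: string, next: string): number {
  const max = Math.min(previous.length, next.length);
  for (let n = max; n > 0; n--) {
    if (previous.slice(previous.length - n) === next.slice(0, n)) return n;
  }
  return 0;
}

const alphabetChunks = slidingChunks("abcdefghijklmnopqrstuvwxyz", 10, 3);
for (let i = 1; i < alphabetChunks.length; i++) {
  const shared = sharedLength(alphabetChunks[i - 1].text, alphabetChunks[i].text);
  console.log(`chunk ${i - 1} -> chunk ${i}: ${shared} shared characters`); // 3 each time
}
```

With chunks produced by a boundary-aware strategy, the same check may report a smaller or slightly different number, because the splitter prefers to cut at separators.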

For example, with a longer document, the last few words of one chunk would appear again at the start of the next chunk.

This overlapping structure is important for retrieval-augmented generation because it helps the system find relevant information even if it is split between two chunks.
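
The effect on retrieval can be shown with a phrase that straddles a chunk boundary. The following sketch uses a hypothetical fixedWindows helper (plain fixed-size windows, not Mastra’s chunker) and a made-up sample sentence to compare chunking with and without overlap.

```typescript
// Hypothetical fixed-window chunker used only for this comparison.
function fixedWindows(input: string, size: number, overlap: number): string[] {
  const stride = size - overlap;
  if (size <= 0 || overlap < 0 || stride <= 0) {
    throw new Error("Expected size > 0 and 0 <= overlap < size");
  }
  const windows: string[] = [];
  for (let start = 0; start < input.length; start += stride) {
    windows.push(input.slice(start, start + size));
    if (start + size >= input.length) break;
  }
  return windows;
}

// True if at least one chunk contains the phrase in full.
function containsPhrase(chunks: string[], phrase: string): boolean {
  return chunks.some((chunk) => chunk.indexOf(phrase) !== -1);
}

const invoiceNote = "Invoice 4471 is due on March 3rd at the latest.";

// Without overlap, the boundary falls inside "March", so no chunk holds the full date.
console.log(containsPhrase(fixedWindows(invoiceNote, 26, 0), "March 3rd")); // false

// With an overlap of 10 characters, the second chunk contains the whole phrase.
console.log(containsPhrase(fixedWindows(invoiceNote, 26, 10), "March 3rd")); // true
```

For fixed-size windows like these, any phrase no longer than the overlap plus one character is guaranteed to appear whole in at least one chunk, which is a useful rule of thumb when picking an overlap value.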

Summary and What’s Next

In this lesson, you learned why chunking is important for retrieval-augmented generation and how to split a document into overlapping chunks using Mastra. You saw how to set the chunk size and overlap, and how to inspect the resulting chunks.

Next, you will get a chance to practice chunking documents yourself. You will use Mastra’s chunking utilities to break down different types of documents and see how chunk size and overlap affect the results. This hands-on practice will help you get comfortable with preparing documents for efficient retrieval in your smart email assistant.
