Welcome to the first lesson of Document Processing and Retrieval with LangChain in TypeScript! In this course, you'll learn how to work with documents programmatically, extract valuable information from them, and build systems that can intelligently interact with document content.
Document processing is a fundamental task in many applications, from search engines to question-answering systems. The process typically involves the following steps:
- Loading documents from various sources
- Splitting documents into manageable chunks
- Converting those chunks into numerical representations (called embeddings)
- Retrieving relevant information when needed
In this lesson, we'll focus on the first two steps of this pipeline: loading documents and splitting them into appropriate chunks. These steps are crucial because they form the foundation for all subsequent document processing tasks. If your documents aren't loaded correctly or split effectively, the quality of your embeddings and retrieval will suffer.
By the end of this lesson, you’ll be able to load documents from PDF files, split those documents into manageable chunks for further processing, and understand how to prepare documents for embedding and retrieval. Let's get started by exploring the document loaders available in LangChain.
LangChain simplifies document processing by providing specialized loaders for different file formats. These loaders handle the complexities of parsing various document types, allowing you to focus on working with the content. Let's look at two commonly used loaders.
For PDF files, which are one of the most common document formats, we can use the `PDFLoader`. We simply pass the file path as a string to the loader's constructor (the path in the sketch below is a placeholder):
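```typescript
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

// Placeholder path; PDFLoader also relies on the pdf-parse package being installed
const pdfLoader = new PDFLoader("data/sherlock_holmes.pdf");
```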
When working with simple text files, the `TextLoader` is the appropriate choice. Again, we specify the path to our text file (another placeholder):
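```typescript
import { TextLoader } from "langchain/document_loaders/fs/text";

// Placeholder path to a plain-text file
const textLoader = new TextLoader("data/sherlock_holmes.txt");
```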
Each loader is specifically designed to handle the nuances of its respective file format, ensuring that the document's content is properly extracted and preserved. Beyond these two, LangChain offers many other loaders for specialized formats, including `CSVLoader` for CSV files, `JSONLoader` for JSON files, `WebBaseLoader` for web pages, and more, all designed to abstract away format-specific challenges so you can concentrate on your document processing tasks.
Let's look at a concrete example of loading a document using TypeScript. We'll use a Sherlock Holmes story in PDF format.
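Here's a minimal sketch (the file name is a placeholder for whichever story you use, and the snippet assumes an ES module context so that top-level `await` is available):

```typescript
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { Document } from "@langchain/core/documents";

// Placeholder path; point it at your own copy of the story
const loader = new PDFLoader("data/sherlock_holmes.pdf");

// load() parses the PDF and returns one Document per page
const docs: Document[] = await loader.load();
```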
The `load()` method reads the file and returns an array of `Document` objects. Each `Document` contains the content of a page or section of the original PDF, along with metadata such as the source file and page number. By using TypeScript's type annotations, we ensure that the structure of the loaded documents is clear and type-safe.
After loading the documents, it's important to inspect them to understand their structure and content. TypeScript's type system helps us safely access properties and work with the loaded data.
Here's how you can inspect the loaded documents:
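```typescript
// How many Documents did the loader produce?
console.log(`Number of documents: ${docs.length}`);

// Peek at the first page's content and its metadata
console.log(`First document content: ${docs[0].pageContent.slice(0, 100)}...`);
console.log("First document metadata:", docs[0].metadata);
```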
This might output something like the following (illustrative values; the exact text and metadata depend on your PDF):
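```
Number of documents: 12
First document content: THE ADVENTURE OF THE COPPER BEECHES
by Sir Arthur Conan Doyle...
First document metadata: {
  source: 'data/sherlock_holmes.pdf',
  pdf: { version: '1.10', info: {...}, totalPages: 12 },
  loc: { pageNumber: 1 }
}
```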
From this output, we can see that:
- The PDF has been loaded as 12 documents (one per page)
- The first document contains the title and author information
- Each document includes detailed metadata about the source file
Inspecting the loaded documents helps you understand how the content is structured before proceeding with further processing.
While we've successfully loaded our document, there's a challenge: most documents are too large to process as a single unit, especially when working with language models or embedding techniques. This is where document splitting comes into play. Document splitting involves breaking down a large document into smaller, more manageable chunks. These chunks can then be processed individually, making it easier to work with large documents and improving the quality of embeddings and retrieval.
LangChain provides several text splitters, and one of the most versatile is the `RecursiveCharacterTextSplitter`. This splitter works by recursively splitting text based on a list of separators (such as paragraph breaks, newlines, and spaces) until the chunks are below a specified size.
Here's how you can initialize a text splitter in TypeScript:
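```typescript
// In older LangChain versions this import comes from "langchain/text_splitter"
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,   // maximum characters per chunk
  chunkOverlap: 100, // characters shared between adjacent chunks
});
```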
Two key parameters for the `RecursiveCharacterTextSplitter` are:

- `chunkSize`: The maximum size (in characters) of each chunk
- `chunkOverlap`: The number of characters that overlap between adjacent chunks

The overlap is important because it helps maintain context between chunks. For example, if you set a `chunkSize` of 1000 and a `chunkOverlap` of 100, each chunk will be at most 1000 characters long, and adjacent chunks will share 100 characters of content.
To better understand how document splitting works, let's visualize the process with a simple example. Imagine we have a document containing the following made-up sentence (106 characters long):
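```
Document splitting breaks big text into fragments, so adjacent pieces keep continuity for later retrieval.
```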
Now, let's say we want to split this into chunks with a `chunkSize` of 50 characters and a `chunkOverlap` of 15 characters. Here's how the splitting would work (a simplified, purely character-based view; in practice the splitter prefers to break at separators):
Chunk 1 (characters 1-50):
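```
Document splitting breaks big text into fragments,
```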
Chunk 2 (characters 36-85, starting 15 characters before the end of Chunk 1):
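```
into fragments, so adjacent pieces keep continuity
```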
Chunk 3 (characters 71-106, starting 15 characters before the end of Chunk 2):
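```
keep continuity for later retrieval.
```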
Notice how each chunk overlaps with the previous one by 15 characters. This overlap ensures that important context isn't lost when a sentence or idea spans across chunk boundaries.
The chunk size must always be larger than the overlap for the splitting to work correctly. Here's why:
- If `chunkSize = chunkOverlap`, each new chunk would start exactly where the previous chunk started, creating infinite identical chunks
- If `chunkSize < chunkOverlap`, the splitter would have to move backwards through the text, which is impossible
Think of it this way: the overlap is like taking a step backward before taking a bigger step forward. You need to move forward more than you step back, or you'll never make progress through the document.
A good rule of thumb is to keep the overlap between 10-20% of your chunk size. So if your chunk size is 1000 characters, an overlap of 100-200 characters usually works well.
With our text splitter initialized, we can now split the Sherlock Holmes document we loaded earlier:
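```typescript
// `docs` is the array of page-level Documents returned by the PDF loader above
const splitDocs = await splitter.splitDocuments(docs);
```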
The `splitDocuments` method takes our array of `Document` objects (from the PDF loader) and returns a new array in which each document has been split according to our specified parameters. The metadata from the original documents is preserved in each of the split chunks.
Let's examine the first chunk to see what it looks like:
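```typescript
console.log(`Number of chunks after splitting: ${splitDocs.length}`);

// Peek at the first chunk's content and its metadata
console.log(`First chunk content: ${splitDocs[0].pageContent.slice(0, 100)}...`);
console.log("First chunk metadata:", splitDocs[0].metadata);
```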
This might output something like:
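```
Number of chunks after splitting: 54
First chunk content: THE ADVENTURE OF THE COPPER BEECHES
by Sir Arthur Conan Doyle...
First chunk metadata: {
  source: 'data/sherlock_holmes.pdf',
  pdf: { version: '1.10', info: {...}, totalPages: 12 },
  loc: { pageNumber: 1, lines: { from: 1, to: 22 } }
}
```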
Notice that we now have more chunks than we had pages (54 chunks compared to the original 12 pages). This is because the text splitter has broken down the content based on our specified chunk size, rather than keeping the original page-based division. Each chunk is now a manageable size, making it easier to process with language models or embedding techniques.
It's worth emphasizing that effective chunking is a balance between chunk size and overlap. Too small a chunk size may fragment important ideas, while too large a size may exceed the token limit of embedding models. Similarly, too much overlap can introduce redundancy. For most applications, starting with a chunk size of 500–1000 characters and an overlap of 50–100 characters (as we did in our example) is a reasonable default, but you may need to adjust these parameters based on your specific documents and use case.
The optimal chunking strategy often depends on:
- The nature of your documents (technical papers vs. narrative text)
- The specific requirements of your downstream tasks
- The token limits of the embedding or language models you're using
Don't be afraid to experiment with different chunking parameters to find what works best for your particular application.
In this lesson, you've learned how to load documents from PDF files using LangChain's `PDFLoader` and how to split those documents into manageable chunks using the `RecursiveCharacterTextSplitter`. These are the first two steps in the document processing pipeline and form the foundation for more advanced tasks like embedding and retrieval.
Let's recap what we've covered:
- We explored the `PDFLoader` in LangChain for loading PDF files.
- We learned how to load a document and inspect its content and metadata using TypeScript.
- We discussed the importance of document splitting and how it helps in processing large documents.
- We used the `RecursiveCharacterTextSplitter` to split our documents into manageable chunks, with overlap to maintain context between chunks.
In the next lesson, we'll explore how to convert these document chunks into vector embeddings, which will allow us to perform semantic search and retrieval. You'll learn how embedding models work and how to use them effectively with LangChain. The document loading and splitting techniques you've learned here are essential prerequisites for these more advanced operations, as they ensure that your documents are properly prepared for embedding and retrieval.
