Lesson 1
Loading and Splitting Documents with LangChain
Introduction to Document Processing with LangChain

Welcome to the first lesson of Document Processing and Retrieval with LangChain in Python! In this course, you'll learn how to work with documents programmatically, extract valuable information from them, and build systems that can intelligently interact with document content.

Document processing is a fundamental task in many applications, from search engines to question-answering systems. The typical document processing pipeline consists of several key steps: loading documents from various sources, splitting them into manageable chunks, converting those chunks into numerical representations (embeddings), and finally retrieving relevant information when needed.

In this lesson, we'll focus on the first two steps of this pipeline: loading documents and splitting them into appropriate chunks. These steps are crucial because they form the foundation for all subsequent document processing tasks. If your documents aren't loaded correctly or split effectively, the quality of your embeddings and retrieval will suffer.

By the end of this lesson, you'll be able to:

  • Load documents from different file formats using LangChain
  • Split documents into manageable chunks for further processing
  • Understand how to prepare documents for embedding and retrieval

Let's get started with understanding the document loaders available in LangChain.

Setting Up PyPDF

To get started with document loading, you'll need to ensure that the pypdf package is installed in your environment. LangChain's PyPDFLoader relies on pypdf for the underlying work of reading PDF files and extracting their text, so the package must be available before you can process documents in this format.

Keep in mind that pypdf works best with text-based PDFs. If the PDF contains scanned images or handwritten text, pypdf will not be able to extract the content, as it doesn’t include OCR (Optical Character Recognition) capabilities. In such cases, a separate OCR tool would be needed.

If you're working in your own environment (rather than CodeSignal), you can install pypdf using pip:

Bash
pip install pypdf

Note that on CodeSignal, this package is pre-installed, so you won't need to run the installation command.

LangChain Document Loaders

LangChain simplifies document processing by providing specialized loaders for different file formats. These loaders handle the complexities of parsing various document types, allowing you to focus on working with the content. Let's look at three commonly used loaders.

For PDF files, which are one of the most common document formats, we can use the PyPDFLoader. We simply pass the file path as a string to the loader's constructor:

Python
from langchain_community.document_loaders import PyPDFLoader

# Create a loader for PDF files by providing the file path
pdf_loader = PyPDFLoader("document.pdf")

When working with simple text files, the TextLoader is the appropriate choice. Again, we specify the path to our text file:

Python
from langchain_community.document_loaders import TextLoader

# Create a loader for text files by providing the file path
text_loader = TextLoader("document.txt")

For more complex or less common file types, LangChain offers the versatile UnstructuredFileLoader. As with the other loaders, we initialize it with the path to our document:

Python
from langchain_community.document_loaders import UnstructuredFileLoader

# Create a general-purpose loader for various file types
general_loader = UnstructuredFileLoader("document.docx")

Each loader is specifically designed to handle the nuances of its respective file format, ensuring that the document's content is properly extracted and preserved. Beyond these three, LangChain offers many other loaders for specialized formats, including CSVLoader for CSV files, JSONLoader for JSON files, WebBaseLoader for web pages, and more - all designed to abstract away format-specific challenges so you can concentrate on your document processing tasks.
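These additional loaders follow the same construction pattern. As a quick illustration (the file path and URL below are placeholders, not files from this course), here is roughly how two of them are created:

Python
from langchain_community.document_loaders import CSVLoader, WebBaseLoader

# Create a loader for a CSV file (each row is typically returned as its own Document)
csv_loader = CSVLoader(file_path="data/records.csv")

# Create a loader that fetches and parses the content of a web page
web_loader = WebBaseLoader("https://example.com/article")

In each case, calling the loader's load() method returns a list of Document objects, just as you'll see with the PDF example below.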

Loading a Document

Let's look at a concrete example of loading a document. We'll use a Sherlock Holmes story in PDF format:

Python
from langchain_community.document_loaders import PyPDFLoader

# Define the file path to our Sherlock Holmes story
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Create a PDF loader for our document
pdf_loader = PyPDFLoader(file_path)

# Load the document
docs = pdf_loader.load()

The load() method reads the file and returns a list of Document objects. Each Document object contains the content of a page or section of the original document, along with metadata such as the source file and page number.

Inspecting Loaded Documents

After loading the documents, we can inspect them to understand their structure and content:

Python
# Print the number of document chunks loaded
print(f"Loaded {len(docs)} document chunks")

# Print the content of the first chunk
print(f"\nFirst 200 characters of the first chunk:\n{docs[0].page_content[:200]}")

# Print the metadata of the first chunk
print(f"\nMetadata of the first chunk:\n{docs[0].metadata}")

This would output:

Plain text
Loaded 12 document chunks

First 200 characters of the first chunk:
The Adventure of the Blue Carbuncle
Arthur Conan Doyle

Metadata of the first chunk:
{'producer': '', 'creator': '', 'creationdate': '2014-03-15T13:41:38+01:00', 'author': '', 'title': '', 'subject': '', 'keywords': '', 'moddate': '2014-03-15T13:41:38+01:00', 'trapped': '/False', 'ptex.fullbanner': 'This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/MacPorts 2013_5) kpathsea version 6.1.1', 'source': 'data/the_adventure_of_the_blue_carbuncle.pdf', 'total_pages': 12, 'page': 0, 'page_label': 'i'}

From this output, we can see that:

  • The PDF has been split into 12 chunks (one per page)
  • The first chunk contains the title and author information
  • Each chunk includes detailed metadata about the source document

This inspection helps us understand how the document is structured before we proceed with further processing. Now that we've successfully loaded our document, let's move on to splitting it into more manageable chunks.

Document Splitting Techniques

While we've successfully loaded our document, there's a challenge: most documents are too large to process as a single unit, especially when working with language models or embedding techniques. This is where document splitting comes into play. Document splitting involves breaking down a large document into smaller, more manageable chunks. These chunks can then be processed individually, making it easier to work with large documents and improving the quality of embeddings and retrieval.

LangChain provides several text splitters, but one of the most versatile is the RecursiveCharacterTextSplitter. This splitter works by recursively splitting text on an ordered list of separators (by default paragraph breaks, then newlines, then spaces, and finally individual characters) until the chunks are below a specified size.

Python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the text splitter with a specified chunk size and overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

Two key parameters for the RecursiveCharacterTextSplitter are:

  1. chunk_size: The maximum size (in characters) of each chunk
  2. chunk_overlap: The number of characters that overlap between adjacent chunks

The overlap is important because it helps maintain context between chunks. Without overlap, information that spans the boundary between two chunks might be lost or misinterpreted. For example, if we set a chunk_size of 1000 and a chunk_overlap of 100, each chunk will be at most 1000 characters long, and adjacent chunks will share 100 characters of content.
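To see the overlap in action on a small scale, you can experiment with the split_text method, which works directly on plain strings. The deliberately tiny chunk_size and chunk_overlap values and the sample sentence below are just for illustration:

Python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small values so the overlap between chunks is easy to spot
demo_splitter = RecursiveCharacterTextSplitter(
    chunk_size=40,
    chunk_overlap=10
)

text = "Sherlock Holmes examined the hat carefully. Watson looked on with great interest."
chunks = demo_splitter.split_text(text)

# Print each chunk; adjacent chunks share a few trailing words
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk!r}")

Printing the chunks side by side makes it easy to see how the end of one chunk reappears at the start of the next, which is exactly what preserves context across boundaries.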

Splitting the Document into Chunks

With our text splitter initialized, we can now split the Sherlock Holmes document we loaded earlier:

Python
# Split the loaded document into chunks using the text splitter
split_docs = text_splitter.split_documents(docs)

The split_documents method takes our list of Document objects (which we obtained from the PDF loader) and returns a new list where each document has been split according to our specified parameters. The metadata from the original documents is preserved in each of the split chunks.
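You can quickly confirm that the metadata carries over by checking one of the split chunks, for example:

Python
# Each split chunk keeps the source and page information from the page it came from
print(split_docs[0].metadata["source"])  # data/the_adventure_of_the_blue_carbuncle.pdf
print(split_docs[0].metadata["page"])    # 0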

Let's examine the first chunk to see what it looks like:

Python
# Print the number of chunks after splitting
print(f"After splitting: {len(split_docs)} chunks")

# Print the content of the first chunk
print(f"\nFirst chunk content:\n{split_docs[0].page_content}")

This might output something like:

Plain text
After splitting: 54 chunks

First chunk content:
The Adventure of the Blue Carbuncle
Arthur Conan Doyle

Notice that we now have more chunks than we had pages (54 chunks compared to the original 12 pages). This is because the text splitter has broken down the content based on our specified chunk size, rather than keeping the original page-based division. Each chunk is now a manageable size, making it easier to process with language models or embedding techniques.

Optimizing Chunk Size and Overlap

It's worth emphasizing that effective chunking is a balance between chunk size and overlap. Too small a chunk size may fragment important ideas, while too large a size may exceed the token limit of embedding models. Similarly, too much overlap can introduce redundancy. For most applications, starting with a chunk size of 500–1000 characters and an overlap of 50–100 characters (as we did in our example) is a reasonable default, but you may need to adjust these parameters based on your specific documents and use case.

The optimal chunking strategy often depends on:

  • The nature of your documents (technical papers vs. narrative text)
  • The specific requirements of your downstream tasks
  • The token limits of the embedding or language models you're using

Don't be afraid to experiment with different chunking parameters to find what works best for your particular application.
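One simple way to experiment is to loop over a few candidate settings and compare how many chunks each produces for your documents. The parameter pairs below are just illustrative starting points, reusing the docs list loaded earlier:

Python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Compare how different chunk sizes and overlaps affect the number of chunks
for chunk_size, chunk_overlap in [(500, 50), (1000, 100), (2000, 200)]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    chunks = splitter.split_documents(docs)
    print(f"chunk_size={chunk_size}, chunk_overlap={chunk_overlap}: {len(chunks)} chunks")

Chunk counts alone won't tell you which setting is best, but inspecting a few chunks from each run gives you a quick feel for whether ideas are being fragmented or padded with redundant overlap.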

Review and Next Steps

In this lesson, you've learned how to load documents from various file formats using LangChain's document loaders and how to split those documents into manageable chunks using the RecursiveCharacterTextSplitter. These are the first two steps in the document processing pipeline and form the foundation for more advanced tasks like embedding and retrieval.

Let's recap what we've covered:

  1. We explored different document loaders in LangChain, including PyPDFLoader for PDF files, TextLoader for text files, and UnstructuredFileLoader for various file types.
  2. We learned how to load a document and inspect its content and metadata.
  3. We discussed the importance of document splitting and how it helps in processing large documents.
  4. We used the RecursiveCharacterTextSplitter to split our documents into manageable chunks with overlap to maintain context between chunks.

In the next lesson, we'll explore how to convert these document chunks into vector embeddings, which will allow us to perform semantic search and retrieval. You'll learn how embedding models work and how to use them effectively with LangChain. The document loading and splitting techniques you've learned here are essential prerequisites for these more advanced operations, as they ensure that your documents are properly prepared for embedding and retrieval.

Enjoy this lesson? Now it's time to practice with Cosmo!
Practice is how you turn knowledge into actual skills.