Welcome to Day 4 of #30DaysOfLangChain! Over the past three days, we’ve built a solid foundation with Runnables, LCEL, prompt engineering, output parsing, and flexible LLM integration. Today, we begin our journey into Retrieval-Augmented Generation (RAG), a powerful technique that allows LLMs to access and utilize external, up-to-date information.

The first step in any RAG pipeline is preparing your data. LLMs have context windows, meaning they can only process a limited amount of text at a time. This is where Document Loaders and Text Splitters come in.


Understanding LangChain’s Document Object

Before we load and split, it’s important to know what LangChain works with. The core data structure for unstructured data in LangChain is the Document object. As the short example after this list shows, a Document typically has:

  • page_content: A string containing the actual text content of the document or a chunk.
  • metadata: A dictionary containing arbitrary key-value pairs that describe the document (e.g., source, page, author, date). This metadata is crucial for understanding the origin of the information.
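
For instance, here’s a minimal sketch of constructing a Document by hand (the file name and metadata values are made up for illustration):

from langchain_core.documents import Document

# A hypothetical document with illustrative metadata
doc = Document(
    page_content="LangChain is a framework for building LLM applications.",
    metadata={"source": "notes.txt", "page": 1},
)

print(doc.page_content)  # the raw text
print(doc.metadata)      # {'source': 'notes.txt', 'page': 1}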

1. Document Loaders: Bringing Your Data In

Imagine you have information scattered across various file types: PDFs, Word documents, web pages, CSVs, etc. A Document Loader is a tool that reads data from these diverse sources and converts it into a list of LangChain Document objects.

  • Why do we need them? They abstract away the complexity of reading different file formats. Instead of writing custom parsing logic for each file type, you use a specialized loader.
  • Examples:
    • TextLoader: For simple .txt files.
    • PyPDFLoader: For PDF documents (requires pypdf library).
    • WebBaseLoader: For loading content from web pages.
    • CSVLoader: For CSV files.
    • Many more for Notion, Slack, Google Docs, etc.

Loaders handle the initial extraction, providing the raw text content and often some basic metadata (like the source file path).
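
As a rough sketch, using a loader looks much the same regardless of the source. The file names and URL below are placeholders, and PyPDFLoader assumes pypdf is installed:

from langchain_community.document_loaders import TextLoader, PyPDFLoader, WebBaseLoader

txt_docs = TextLoader("notes.txt").load()               # one Document for the whole file
pdf_docs = PyPDFLoader("report.pdf").load()             # typically one Document per page
web_docs = WebBaseLoader("https://example.com").load()  # needs beautifulsoup4 installed

print(pdf_docs[0].metadata)  # e.g. {'source': 'report.pdf', 'page': 0}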


2. Text Splitters: Breaking Down Information

Once you’ve loaded a document, it might be too large for an LLM’s context window or too large for effective embedding (which we’ll cover on Day 5). This is where Text Splitters come into play. A text splitter takes a long Document and breaks it down into smaller, manageable Document chunks.

  • Why splitting is crucial:
    • Context Window Limits: LLMs can only accept a limited number of tokens per request; an entire book simply won’t fit in one go.
    • Relevance: Smaller chunks increase the chance that only the most relevant information is retrieved for a query, reducing noise.
    • Embedding Quality: Embeddings (numerical representations of text) often perform better on smaller, more semantically focused chunks.
  • Splitting Strategies:
    • RecursiveCharacterTextSplitter: This is one of the most commonly used and versatile splitters. It attempts to split text recursively by an ordered list of separators (by default ["\n\n", "\n", " ", ""]). If a chunk is still too large, it tries the next separator, falling back to splitting by individual characters if needed.
    • chunk_size: The maximum size of each chunk (in characters or tokens, depending on length_function).
    • chunk_overlap: The number of characters/tokens to overlap between consecutive chunks. This helps maintain context when an important piece of information spans two chunks; the sketch below shows the effect.
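
To make chunk_overlap concrete, here’s a minimal sketch (the text is a placeholder) using split_text, which returns plain strings rather than Document objects:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "First paragraph about loaders. " * 4 + "\n\n" + "Second paragraph about splitters. " * 4

splitter = RecursiveCharacterTextSplitter(chunk_size=60, chunk_overlap=15)
for piece in splitter.split_text(text):  # split_text returns plain strings
    print(repr(piece))

# Roughly the last chunk_overlap characters of one chunk reappear at the
# start of the next, so a sentence crossing a boundary keeps its context.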

With the concepts covered, let’s put them into practice. Loading and splitting documents is the crucial first step towards building powerful Retrieval-Augmented Generation (RAG) applications.


Project: Loading and Splitting a Document

Let’s demonstrate how to load a simple text file and then split it into smaller chunks using RecursiveCharacterTextSplitter. We’ll print the original document and then the resulting chunks, along with their metadata.

Before you run the code: The script below creates sample_document.txt for you (run it from your day4-document-loaders-splitters folder), so there’s nothing to prepare. In your own projects, you’d point the loader at an existing file; a longer text is better for demonstrating splitting.

import os
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


# Sample text for the demo. A longer text makes the splitting easier to see.
document_content = """Document loaders read data from sources such as text files, PDFs, web pages, and CSVs.
Each loader converts its source into a list of Document objects with page_content and metadata.

Once loaded, long documents rarely fit an LLM's context window in one piece.
Text splitters break them into smaller chunks so they can be embedded and retrieved effectively.

The RecursiveCharacterTextSplitter tries a list of separators in order, from paragraphs down to single characters.
Chunk overlap keeps a little shared context across boundaries, so information spanning two chunks is not lost."""

# Create the sample document file
file_path = "sample_document.txt"
with open(file_path, "w") as f:
    f.write(document_content)

print(f"Created '{file_path}' for demonstration.\n")

# --- Step 1: Load the Document ---
print("--- Loading Document ---")
loader = TextLoader(file_path)
documents = loader.load() # This returns a list of Document objects (usually one per file)

# Check if any documents were loaded
if documents:
    original_document = documents[0]
    print(f"Original Document Page Content (first 200 chars):\n{original_document.page_content[:200]}...\n")
    print(f"Original Document Metadata: {original_document.metadata}\n")
else:
    print("No documents loaded. Exiting.")
    exit()

# --- Step 2: Split the Document into Chunks ---
print("--- Splitting Document ---")

# Initialize the text splitter
# chunk_size: max size of each chunk
# chunk_overlap: number of characters to overlap between chunks (helps maintain context)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20,
    length_function=len, # Uses character count for length. Can be a token counter.
    add_start_index=True # Adds a 'start_index' to metadata indicating where the chunk began in the original text
)

# Split the document
chunks = text_splitter.split_documents(documents)

print(f"Original document split into {len(chunks)} chunks.\n")

# --- Step 3: Print the Chunks and their Metadata ---
print("--- Reviewing Chunks ---")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:")
    print(f"  Content (first 100 chars): {chunk.page_content[:100]}...")
    print(f"  Length: {len(chunk.page_content)} characters")
    print(f"  Metadata: {chunk.metadata}")
    print("-" * 30)

# Optional: Clean up the dummy file
# os.remove(file_path)
# print(f"\nCleaned up '{file_path}'.")

Code Explanation:

  1. Dummy File Creation: We define document_content and write it to sample_document.txt programmatically, so the demo is self-contained. In a real application, you’d point to your existing files.
  2. TextLoader: We instantiate TextLoader with our file_path. The .load() method reads the file and returns a list of Document objects. For a single text file, it will typically return one Document.
  3. RecursiveCharacterTextSplitter:
    • We configure it with chunk_size (max characters per chunk) and chunk_overlap (to ensure context isn’t lost at chunk boundaries).
    • length_function=len tells the splitter to use character count. For more precise token-based splitting, you’d use a tokenizer function (e.g., from tiktoken); a sketch follows this list.
    • add_start_index=True is a handy feature that adds metadata showing the starting character index of each chunk in the original document.
  4. split_documents(): This method takes a list of Document objects (even if it’s just one) and returns a new list of smaller Document objects, representing the chunks.
  5. Output Review: The loop prints each chunk’s content, length, and metadata, allowing you to see how the original document was broken down.
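
For the token-based splitting mentioned in point 3, RecursiveCharacterTextSplitter offers a from_tiktoken_encoder constructor. A minimal sketch, assuming tiktoken is installed (pip install tiktoken) and reusing the documents list from the script above:

from langchain.text_splitter import RecursiveCharacterTextSplitter

token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # encoding used by many OpenAI chat models
    chunk_size=100,               # measured in tokens now, not characters
    chunk_overlap=10,
)
token_chunks = token_splitter.split_documents(documents)
print(f"Split into {len(token_chunks)} token-sized chunks.")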

This project lays the groundwork for RAG by showing how to get your data into a usable format for LLMs. Next up: Embeddings and Vector Stores!


I’m Arpan

I’m a Software Engineer driven by curiosity and a deep interest in Generative AI Technologies. I believe we’re standing at the frontier of a new era—where machines not only learn but create, and I’m excited to explore what’s possible at this intersection of intelligence and imagination.

When I’m not writing code or experimenting with new AI models, you’ll probably find me travelling, soaking in new cultures, or reading a book that challenges how I think. I thrive on new ideas—especially ones that can be turned into meaningful, impactful projects. If it’s bold, innovative, and GenAI-related, I’m all in.

“The future belongs to those who believe in the beauty of their dreams.” – Eleanor Roosevelt

“Imagination is more important than knowledge. For knowledge is limited, whereas imagination embraces the entire world.” – Albert Einstein

This blog, MLVector, is my space to share technical insights, project breakdowns, and explorations in GenAI—from the models shaping tomorrow to the code powering today.

Let’s build the future, one vector at a time.
