Mastering Chunking in RAG - Techniques and Strategies
This blog explores the world of chunking in Retrieval-Augmented Generation (RAG)-based LLM systems. It
covers the significance of chunking, various types and methods, and best practices for its effective
implementation.
When preparing for exams, most of us prepare notes and divide the whole syllabus into chunks. We then
read every chunk one by one and try to master it deeply. Humans are not the only ones benefitting from
chunking; machines use it, too. Large Language Models (LLMs) also use chunking to understand the
context of input prompts better and respond to user queries efficiently.
RAG-based (Retrieval-Augmented Generation) LLM systems use chunking to enhance their capabilities.
These systems retrieve relevant pieces of information and use these chunks to generate accurate and
contextually relevant responses. RAG-based LLMs can process and integrate information more effectively
by breaking the input into smaller, manageable chunks. This approach ensures that the models maintain
context and coherence over extended interactions, leading to more precise and relevant answers. Just as
students benefit from chunking their study material, LLMs perform better and gain a deeper
understanding of the input through chunking in RAG. Now that we understand the general concept of chunking, let's
explore in depth what chunking means in the context of RAG.
Table of Contents
What is Chunking in RAG?
Why do we need Chunking in RAG?
Types of Chunking in RAG
Strategies for Chunking in RAG
Key Considerations for Implementing Chunking in RAG
Best Practices for Chunking in RAG
Learn to Build RAG systems with ProjectPro!
FAQs
A. Retrieval Phase
In the initial phase, the system retrieves relevant documents, data points, or pieces of information from a
vast corpus. This retrieval is based on the input prompt and aims to gather the most appropriate chunks of
information that can help answer the query.
B. Chunking Phase
Once the relevant information is retrieved, it is divided into smaller, coherent chunks. This segmentation is
essential because it allows the system to handle and process the data in parts rather than as a whole,
which would be computationally intensive and less efficient.
C. Generation Phase
The generative model then uses these chunks to produce a response. Integrating these smaller pieces of
information allows the model to generate a more accurate and contextually relevant answer. The chunking
method ensures that each piece of information is given appropriate attention, maintaining context and
coherence in the final response.
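As a rough illustration of how the three phases fit together, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names (retrieve, chunk, build_prompt), the keyword-overlap score, and the two-document corpus are hypothetical, and the final call to an LLM is left out. A real system would typically rank with embeddings rather than shared words.

```python
import re


def words(text):
    """Lowercased set of words in a text (toy representation)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def score(query, text):
    """Toy relevance score: number of query words that also occur in text."""
    return len(words(query) & words(text))


def retrieve(query, corpus, top_k=1):
    """Retrieval phase: keep the top_k documents sharing the most words with the query."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:top_k]


def chunk(document):
    """Chunking phase: split a retrieved document into sentence-sized chunks."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]


def build_prompt(query, chunks, top_k=1):
    """Generation phase (input side): assemble the best chunks into a prompt.

    Sending the prompt to an LLM is intentionally omitted here.
    """
    best = sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_k]
    context = "\n".join(f"- {c}" for c in best)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical two-document corpus and query, for illustration only.
corpus = [
    "Chunking splits text into pieces. Overlap keeps context between chunks.",
    "Bananas are rich in potassium. They grow in tropical climates.",
]
query = "How does overlap keep context between chunks?"

retrieved = retrieve(query, corpus)
chunks = [c for doc in retrieved for c in chunk(doc)]
prompt = build_prompt(query, chunks)
print(prompt)
```

Only the chunk that actually addresses the question reaches the prompt, which is the point of chunking: the generator sees a small, focused context instead of whole documents.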
I hope now you have a clear understanding of chunking in RAG. It's time to learn why this technique is
necessary for enhancing the performance and accuracy of these systems.
2) Contextual Relevance
By dividing the retrieved information into chunks, RAG systems can maintain context and relevance
throughout the generation process. Each chunk represents a coherent unit of information that can be
integrated into the response generation. This ensures that the generated responses are accurate and
contextually appropriate, enhancing the overall quality of the system's output.
4) Scalability and Performance
Handling large volumes of data efficiently is crucial for the scalability and performance of RAG systems.
Chunking enables the system to scale effectively by processing information in smaller increments,
optimizing memory usage and computational resources. This scalability ensures the system can handle
increasingly complex queries and promptly generate responses.
Here is an insightful brief on the importance of Chunking in RAG by Rishabh Goyal, Senior Manager
(Applied AI) at Fidelity Investments:
Chunking in RAG thus enhances the system's ability to manage and utilize retrieved information
effectively, leading to improved accuracy, contextual understanding, and performance in natural language
processing projects. Let us now look at the different types of chunking in RAG.
2) Recursive Chunking
Recursive Chunking splits the text into smaller chunks iteratively, using a hierarchical approach with
different separators or criteria. Initial splits are made using larger chunks, which are then further divided if
necessary, aiming to keep chunks similar in size. This method maintains better context within chunks and
is helpful for complex texts where contextual integrity is essential.
4) Semantic Chunking
Semantic Chunking divides the text into meaningful, semantically complete chunks based on the
relationships within the text. Each chunk represents a complete idea or topic, maintaining the integrity of
information for more accurate retrieval and generation. This method is slower and more computationally
intensive but is best for NLP applications requiring high semantic accuracy, such as summarization or
detailed question answering.
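One common way to approximate semantic chunking is to compare each sentence with the one before it and start a new chunk when their similarity drops. The sketch below is a simplified illustration under stated assumptions: the bag-of-words vectors are a toy stand-in for the sentence embeddings a real system would use, and the function names and the 0.1 threshold are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)  # a Counter returns 0 for missing words
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, threshold=0.1):
    """Group consecutive sentences; start a new chunk when similarity
    to the previous sentence falls below the threshold."""
    groups = []
    for sentence in sentences:
        if groups and cosine(
            bag_of_words(groups[-1][-1]), bag_of_words(sentence)
        ) >= threshold:
            groups[-1].append(sentence)
        else:
            groups.append([sentence])
    return [" ".join(group) for group in groups]


sentences = [
    "Cats purr when happy.",
    "Cats sleep most of the day.",
    "Stock markets fell sharply today.",
    "Markets recovered by the close.",
]
topic_chunks = semantic_chunks(sentences)
for c in topic_chunks:
    print(c)
```

With these four sentences the split falls between the cat sentences and the market sentences, because the second and third sentences share no words. Swapping the toy vectors for real embeddings is what makes the method slower and more expensive, as noted above.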
5) Agentic Chunking
Agentic Chunking is an experimental approach that processes documents in a human-like manner.
Chunks are created based on logical, human-like decisions about content organization, starting from the
beginning and proceeding sequentially, deciding chunk boundaries dynamically. This method is still being
tested and not widely implemented due to the need for multiple LLM calls and higher processing costs. It
is potentially useful for highly dynamic and complex documents where human-like understanding is
beneficial.
LangChain, a widely used framework for building LLM applications, offers various text splitters that implement
the types of chunking in RAG we have discussed so far. Below, we explore these splitters in detail and provide
examples to help you understand their functionality.
1) CharacterTextSplitter
CharacterTextSplitter is a straightforward method where text is divided into chunks based on a fixed
number of characters. Overlapping characters can be used to maintain context between chunks. (In
LangChain, this class first splits on a separator, a blank line by default, and then merges the pieces up to
the chunk size; the example below uses a plain character window for clarity.) Before we proceed with a
detailed explanation of an example, we need to understand two terms: chunk size and chunk overlap.
Chunk Size refers to the maximum number of characters in each chunk. For instance, if the chunk size is
set to 10 characters, each chunk will contain 10 characters from the text, except possibly the last one,
which may be shorter.
Chunk Overlap is the number of characters that overlap between consecutive chunks. This ensures
that the context from the end of one chunk carries over to the beginning of the next chunk, which
helps preserve the flow of information.
Example
Let's use a simple text to illustrate how chunk size and chunk overlap work:
Text = "The quick brown fox jumps over the lazy dog."
Chunk Size = 10 characters
Chunk Overlap = 5 characters
Here’s how the text would be split into chunks. Each chunk starts 5 characters after the previous one
(chunk size minus chunk overlap), so its first 5 characters repeat the last 5 characters of the chunk before it.
Note that a fixed character window can cut words in half, as in the second chunk.
First Chunk
Characters: "The quick "
Length: 10 characters
Second Chunk
Characters: "uick brown"
Starts 5 characters before the end of the first chunk ("uick "), ensuring overlap.
Length: 10 characters
Third Chunk
Characters: "brown fox "
Starts 5 characters before the end of the second chunk ("brown"), ensuring overlap.
Length: 10 characters
Fourth Chunk
Characters: " fox jumps"
Starts 5 characters before the end of the third chunk (" fox "), ensuring overlap.
Length: 10 characters
Fifth Chunk
Characters: "jumps over"
Starts 5 characters before the end of the fourth chunk ("jumps"), ensuring overlap.
Length: 10 characters
Sixth Chunk
Characters: " over the "
Starts 5 characters before the end of the fifth chunk (" over"), ensuring overlap.
Length: 10 characters
Seventh Chunk
Characters: " the lazy "
Starts 5 characters before the end of the sixth chunk (" the "), ensuring overlap.
Length: 10 characters
Eighth Chunk
Characters: "lazy dog."
Starts 5 characters before the end of the seventh chunk ("lazy "), ensuring overlap.
Length: 9 characters (since the text ends here)
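The chunk size and chunk overlap mechanics can be sketched in a few lines of plain Python. This is a minimal illustration of a strict character window, not LangChain's implementation; the function name split_by_characters is hypothetical.

```python
def split_by_characters(text, chunk_size, chunk_overlap):
    """Slide a window of chunk_size characters over the text.

    Each new window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


text = "The quick brown fox jumps over the lazy dog."
char_chunks = split_by_characters(text, chunk_size=10, chunk_overlap=5)
for number, piece in enumerate(char_chunks, start=1):
    print(number, repr(piece), len(piece))
```

For the 44-character sentence above, a 10-character window with a 5-character overlap yields eight chunks, the last of which ("lazy dog.") is 9 characters long.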
2) RecursiveCharacterTextSplitter
Example
In this example, the RecursiveCharacterTextSplitter divides the text into chunks of at most 160
characters each, with a 4-character overlap setting. The algorithm first tries to split at natural language
boundaries (paragraphs and line breaks) and recursively falls back to finer separators (spaces, then
individual characters) only when a piece is still too long, so the chunks preserve the semantic integrity of
the original text.
Sample Text:
“Natural language processing (NLP) is a field of artificial intelligence concerned with the interaction
between computers and humans using natural language.
It focuses on the understanding, interpretation, and generation of human language, allowing computers
to understand human language as it is spoken.
NLP involves several challenges such as natural language understanding, natural language generation,
and machine translation.”
The total character count for this text is roughly 430 characters, so it does not fit in a single chunk.
Initial Chunking: Start with the entire text above as one initial chunk. Because it exceeds the chunk size, it
is split at the line breaks between the sentences.
After the RecursiveCharacterTextSplitter is applied:
Chunk 1 ends after "language." (about 155 characters).
Chunk 1 = “Natural language processing (NLP) is a field of artificial intelligence concerned with the
interaction between computers and humans using natural language.”
Chunk 2 starts at "It focuses" and ends after "spoken." (about 147 characters).
Chunk 2 = “It focuses on the understanding, interpretation, and generation of human language, allowing
computers to understand human language as it is spoken.”
Chunk 3 starts at "NLP involves" and runs to the end of the text (about 125 characters).
Chunk 3 = “NLP involves several challenges such as natural language understanding, natural language
generation, and machine translation.”
Because every split falls on a sentence boundary and each sentence nearly fills a chunk, the 4-character
overlap carries no text over in this case; overlap only comes into play when a chunk is assembled from
several smaller pieces.
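The recursive idea can be sketched as follows. This is a simplified illustration, not LangChain's RecursiveCharacterTextSplitter: the function name recursive_split is hypothetical, overlap handling is left out, and the separator order (blank line, line break, space) is an assumption modeled on the behavior described above. The sample text is assumed to have one sentence per line.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters.

    Tries the coarsest separator first and only falls back to finer
    separators for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            current = ""
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


SENTENCES = [
    "Natural language processing (NLP) is a field of artificial intelligence "
    "concerned with the interaction between computers and humans using natural language.",
    "It focuses on the understanding, interpretation, and generation of human language, "
    "allowing computers to understand human language as it is spoken.",
    "NLP involves several challenges such as natural language understanding, "
    "natural language generation, and machine translation.",
]
sample_text = "\n".join(SENTENCES)

sentence_chunks = recursive_split(sample_text, chunk_size=160)
for piece in sentence_chunks:
    print(len(piece), piece)
```

With a limit of 160 characters, each sentence becomes its own chunk. Lowering the limit to 100 forces the function down to the space separator, so the sentences themselves are broken at word boundaries.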
3) MarkdownHeaderTextSplitter
The MarkdownHeaderTextSplitter is designed to split Markdown documents according to their header
structure (e.g., #, ##, ###). This method keeps header metadata intact, allowing for context-aware splitting
that maintains the document's logical structure, which is useful for tasks requiring hierarchical
organization.
Example
Consider a Markdown document:
# Introduction
This is the introduction text.
## Section 1
Content for section 1.
### Subsection 1.1
Details for subsection 1.1.
## Section 2
Content for section 2.
When split by the MarkdownHeaderTextSplitter, it would produce:
Chunk 1
# Introduction
This is the introduction text.
Chunk 2
## Section 1
Content for section 1.
Chunk 3
### Subsection 1.1
Details for subsection 1.1.
Chunk 4
## Section 2
Content for section 2.
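A header-based split like the one above can be sketched in plain Python. This is an illustrative stand-in rather than LangChain's implementation: the function name split_markdown_by_headers and the shape of the returned dictionaries are assumptions. Each chunk records the headers it sits under, which is the kind of metadata that makes the splitting context-aware.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_headers(markdown):
    """Start a new chunk at every Markdown header.

    Each chunk keeps its text plus a {level: title} mapping of the headers
    it is nested under. Text before the first header is ignored in this sketch.
    """
    chunks = []
    path = {}  # header level -> title of the currently open header
    current = None
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            level = len(match.group(1))
            title = match.group(2).strip()
            # A new header closes every header at the same or a deeper level.
            path = {lvl: t for lvl, t in path.items() if lvl < level}
            path[level] = title
            current = {"headers": dict(path), "lines": [line]}
            chunks.append(current)
        elif current is not None and line.strip():
            current["lines"].append(line)
    return [
        {"headers": c["headers"], "content": "\n".join(c["lines"])}
        for c in chunks
    ]


document = """# Introduction
This is the introduction text.
## Section 1
Content for section 1.
### Subsection 1.1
Details for subsection 1.1.
## Section 2
Content for section 2."""

md_chunks = split_markdown_by_headers(document)
for c in md_chunks:
    print(c["headers"], "->", repr(c["content"]))
```

The third chunk carries all three header levels (Introduction, Section 1, Subsection 1.1), while the fourth drops back to Introduction and Section 2, so a retriever can tell where each chunk sits in the document.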
4) TokenTextSplitter
The TokenTextSplitter divides text based on the number of tokens rather than the number of characters.
Tokens are the basic units of text used by language models, which may be words, subwords, or
punctuation marks. In English text, a token is often approximately four characters long, so a token count
tracks the model's view of the text more closely than a character count does. Because language models
typically have a maximum token limit, splitting by tokens makes it easier to guarantee that each chunk fits
within that limit.
Pros
Splits text based on the token count, aligning with how many language models, which have context windows based on the token count, process text.
It provides a more accurate representation of how the language model will process the text since tokens are often approximately four characters long.
Can handle various text lengths and adapt to different models’ token limits, ensuring efficient use of the model’s context window.
Cons
May split words into subwords, especially with certain tokenization algorithms, potentially leading to less readable chunks.
The effectiveness of this method depends on the tokenization method used by the model, which might not be uniform across different models.
Requires a good understanding of how the model generates and uses tokens, adding a layer of complexity compared to simpler splitting methods.
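To make token-based splitting concrete, here is a minimal sketch. The regex tokenizer is a toy assumption that treats every word and punctuation mark as one token; production splitters rely on the model's own tokenizer, which usually produces subword tokens. The function names toy_tokenize and split_by_tokens are hypothetical.

```python
import re


def toy_tokenize(text):
    """Toy tokenizer: each word or punctuation mark counts as one token."""
    return re.findall(r"\w+|[^\w\s]", text)


def split_by_tokens(tokens, chunk_size, chunk_overlap):
    """Group tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the remaining tokens
        start += step
    return chunks


tokens = toy_tokenize("The quick brown fox jumps over the lazy dog.")
token_chunks = split_by_tokens(tokens, chunk_size=4, chunk_overlap=1)
for piece in token_chunks:
    print(piece)
```

The sentence has 10 toy tokens, so a window of 4 tokens with 1 token of overlap gives three chunks, and no chunk ever exceeds the 4-token budget regardless of how long the individual words are.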
5) NLTKTextSplitter
The NLTKTextSplitter leverages the Natural Language Toolkit's (NLTK) tokenization capabilities to split text
based on linguistic structures such as sentences or words. This method finds sentence and word
boundaries using pre-trained models for various languages, which is more reliable than splitting on
punctuation alone. It's highly customizable, allowing new languages or domain-specific tokenizers to be
added.
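The sketch below shows the general shape of sentence-based splitting: find sentence boundaries, then pack whole sentences into chunks. The regex boundary detector is a deliberately naive stand-in for NLTK's sentence tokenizer (it would stumble on abbreviations such as "Dr."), and the function names and the 60-character limit are hypothetical.

```python
import re


def naive_sentences(text):
    """Naive sentence detector: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(sentences, max_chars):
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = (
    "Chunking helps retrieval. It keeps context intact! "
    "Does overlap matter? Yes, it often does."
)
packed = pack_sentences(naive_sentences(text), max_chars=60)
for piece in packed:
    print(piece)
```

Because chunks are built from whole sentences, no chunk ends mid-sentence, which is the main advantage of this approach over character or token windows.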
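6) SentenceTransformersTokenTextSplitter
The SentenceTransformersTokenTextSplitter is a token-based splitter designed for use with
sentence-transformer embedding models. It counts tokens with the tokenizer of the chosen model and, by
default, sizes chunks to fit that model's token window, so no chunk is truncated when it is embedded. It
does not detect semantic boundaries by itself; rather, it ensures that every chunk can be embedded in full,
which preserves its meaning and context at retrieval time. This method is particularly useful for
applications like question answering and information retrieval, where high semantic accuracy is crucial.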
We encourage you to implement the last three methods in Python with LangChain and explore their
functionality through experimentation with various inputs.
So far, we have explored various types of chunking in RAG and discussed several methods LangChain offers
to implement them. However, determining the most suitable chunking strategy can be challenging when
faced with textual data. To assist you in this process, we have compiled a list of parameters in the next
section that you can analyze to make an informed decision about which strategy to implement.
FAQs
About the Author
Manika
Manika Nagpal is a versatile professional with a strong background in both Physics
and Data Science. As a Senior Analyst at ProjectPro, she leverages her expertise in
data science and writing to create engaging and insightful blogs that help…