# Chunking Strategy
## Overview
A chunking strategy is the method used to divide documents into smaller, retrievable text segments before generating embeddings in a Retrieval-Augmented Generation (RAG) pipeline. The primary goal of a chunking strategy is to balance context preservation, retrieval accuracy, and system performance when searching knowledge bases.
In RAG systems, large documents are first split into chunks, converted into vector embeddings, and stored in a vector index. At query time, the most relevant chunks are retrieved and passed to the language model as context. Proper chunking directly affects retrieval quality, response accuracy, latency, and cost. (docs.aws.amazon.com)

Scope:

1. Applies to text-based knowledge retrieval systems
2. Used in vector databases, semantic search, and RAG pipelines
3. Independent of specific embedding or LLM providers
4. Applicable across enterprise, technical, and content-driven datasets
## Chunking Strategies
The following strategies are commonly used in RAG systems.
| Strategy | Description | Complexity | Best For |
|---|---|---|---|
| Fixed-Size | Splits text by token or character count | Low | Small or simple documents |
| Recursive | Repeatedly splits text while preserving structure | Low–Medium | Semi-structured text |
| Document-Based | Splits at document or section boundaries | Low | Structured files with clear sections |
| Semantic | Splits by meaning or topic boundaries | Medium | Technical or narrative content |
| LLM-Based | Uses a language model to determine chunk boundaries | High | Complex documents |
| Agentic | AI agent decides chunking strategy dynamically | Very High | Highly nuanced or regulatory text |
| Late Chunking | Embeds full document, then derives chunks | High | Context-dependent tasks |
| Hierarchical | Multi-level chunking (section → paragraph → sentence) | Medium | Large structured documents |
## Core Concepts and Components
| Concept | Description | Responsibility |
|---|---|---|
| Chunk | A segment of text stored in the vector index | Retrieval unit |
| Chunk Size | Maximum token or character length per chunk | Controls context and latency |
| Overlap | Shared tokens between adjacent chunks | Preserves context continuity |
| Embedding | Vector representation of chunk text | Enables similarity search |
| Vector Index | Storage for chunk embeddings | Supports fast retrieval |
| Metadata | Attributes attached to chunks | Enables filtering and traceability |
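The interaction between chunk size and overlap can be shown with a minimal, dependency-free sketch using character-based chunks (illustrative only; production systems typically count tokens, not characters). The window advances by `chunk_size - overlap`, so adjacent chunks share a boundary region:

```python
def chunk_with_overlap(text, chunk_size, overlap):
    """Split text into fixed-size character chunks with overlapping boundaries."""
    step = chunk_size - overlap  # how far the window advances per chunk
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
# Each chunk begins with the last 3 characters of the previous chunk,
# preserving continuity across chunk boundaries.
```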
## RAG Chunking Execution Flow
1. Load the source document into the ingestion pipeline.
2. Apply the selected chunking strategy to split the document into chunks.
3. Generate an embedding for each chunk.
4. Store the embeddings and metadata in the vector index.
5. Receive a user query.
6. Convert the query into an embedding.
7. Retrieve the most similar chunks from the vector index.
8. Pass the retrieved chunks to the language model as context.
9. Generate the final response using the retrieved context.
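The flow above can be sketched end to end in plain Python. This is a toy illustration: a bag-of-words cosine similarity stands in for a real embedding model, and a Python list stands in for a vector index, so every name here is an assumption, not a production API:

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': word counts. A real pipeline calls an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-4: ingest, chunk, embed, and index
chunks = ["pandas handles tabular data", "numpy handles numeric arrays"]
index = [(c, embed(c)) for c in chunks]

# Steps 5-8: embed the query, retrieve the most similar chunk
query_vec = embed("numeric arrays")
best_chunk, _ = max(index, key=lambda item: cosine(query_vec, item[1]))
# Step 9: best_chunk would be sent to the language model as context
```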
## Strategy Selection Guidelines
**Chunking strategy decision checklist**

Context: Selecting a chunking strategy directly affects retrieval accuracy and system performance.

Actions:

1. Identify the document structure (structured, semi-structured, unstructured).
2. Estimate average document length and variability.
3. Determine whether semantic boundaries are important.
4. Evaluate latency and cost constraints.
5. Select the simplest strategy that meets retrieval quality requirements.
### Decision Rules
1. Use Fixed-Size when speed and simplicity are the priority.
2. Use Recursive when documents have paragraphs or headings.
3. Use Document-Based when files already have clear sections.
4. Use Semantic when topic boundaries matter.
5. Use Hierarchical for large, structured manuals or policies.
6. Use LLM-Based when chunk boundaries require contextual understanding.
7. Use Late Chunking when full-document context is critical.
8. Use Agentic only for highly complex or domain-specific documents.
## Chunking Recommendations by Document Type
| Document Type | Characteristics | Recommended Strategy | Notes |
|---|---|---|---|
| Financial reports | Dense numeric data, sections, tables | Document-Based or Hierarchical | Preserve section boundaries |
| Presentations (slides) | Short, independent slides | Document-Based | Treat each slide as a chunk |
| Legal contracts | Long, structured clauses | Hierarchical or LLM-Based | Maintain clause context |
| Technical documentation | Headings, code blocks | Recursive or Hierarchical | Preserve logical structure |
| Research papers | Topic-driven sections | Semantic | Split by topic shifts |
| Emails and chat logs | Short, independent messages | Fixed-Size or Document-Based | Treat each message as a chunk |
| FAQs and support articles | Short question-answer pairs | Document-Based | One chunk per Q&A |
| Product catalogs | Repetitive, structured entries | Document-Based | One chunk per product |
| Policies and compliance docs | Multi-section structured text | Hierarchical | Preserve section hierarchy |
| Meeting transcripts | Long conversational flow | Semantic or Recursive | Preserve topic transitions |
**Operational recommendation**

Context: Mixed document collections.

Actions:

1. Classify documents by type during ingestion.
2. Apply a different chunking strategy per document category.
3. Store the chosen strategy as metadata for observability.
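The actions above can be sketched as a small ingestion helper. The document types, strategy names, and record shape are illustrative assumptions, not a fixed schema:

```python
def ingest(doc):
    """Route a document to a per-type chunking strategy and record
    the chosen strategy as metadata for later observability."""
    strategy_by_type = {  # illustrative mapping; extend per your corpus
        "presentation": "document_based",
        "legal": "hierarchical",
        "chat": "fixed_size",
    }
    strategy = strategy_by_type.get(doc["type"], "recursive")  # safe default
    return {
        "content": doc["content"],
        "metadata": {"doc_type": doc["type"], "chunking_strategy": strategy},
    }

record = ingest({"type": "legal", "content": "Clause 1. The parties agree..."})
# record["metadata"] now records which strategy was applied
```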
## Limitations and Edge Cases
1. Very small chunks may lose semantic context and reduce retrieval accuracy.
2. Very large chunks may exceed model context limits or increase latency.
3. Overlap improves context continuity but increases storage and cost.
4. Semantic and LLM-based chunking introduce additional processing cost. (docs.aws.amazon.com)
5. Hierarchical chunking may return fewer results if child chunks are replaced by their parent chunks during retrieval. (docs.aws.amazon.com)
6. Multimodal data may require different chunking logic at the embedding level. (docs.aws.amazon.com)
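The storage cost of overlap (point 3) is straightforward to estimate: with a window advance of `chunk_size - overlap`, the stored text grows by roughly a factor of `chunk_size / (chunk_size - overlap)` compared to non-overlapping chunks. A quick sketch of this back-of-the-envelope estimate:

```python
def storage_overhead(chunk_size, overlap):
    """Approximate factor by which overlap inflates total stored tokens,
    relative to chunking the same text with no overlap."""
    return chunk_size / (chunk_size - overlap)

# 500-token chunks with a 50-token overlap store about 11% more tokens
factor = storage_overhead(500, 50)
```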
## Appendix A: Minimal Strategy Examples
### Fixed-Size Chunking

```python
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
chunks = splitter.split_text(text)
```
### Recursive Chunking

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
chunks = splitter.split_text(text)
```
### Document-Based Chunking

```python
# section1 and section2 hold the raw text of each pre-split document section
documents = [
    {"content": section1, "metadata": {"section": "intro"}},
    {"content": section2, "metadata": {"section": "methods"}},
]
```
### Semantic Chunking (example using embeddings)

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain.embeddings import OpenAIEmbeddings  # langchain_openai.OpenAIEmbeddings in newer releases

embeddings = OpenAIEmbeddings()
splitter = SemanticChunker(embeddings)
chunks = splitter.split_text(text)
```
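### Hierarchical Chunking

Hierarchical chunking is recommended in the tables above but has no example here; the following is a minimal two-level sketch (section → paragraph) with parent links stored as metadata. The splitting rules (blank-line section breaks, single-line paragraphs) and the record shape are illustrative assumptions:

```python
def hierarchical_chunks(text):
    """Two-level chunking sketch: sections (parents) and paragraphs (children).
    Child chunks carry a parent id so retrieval can expand to section context."""
    chunks = []
    for s_idx, section in enumerate(text.split("\n\n")):  # assumed section delimiter
        chunks.append({"level": "section", "id": f"s{s_idx}", "text": section})
        for para in section.split("\n"):  # assumed paragraph delimiter
            chunks.append({"level": "paragraph", "parent": f"s{s_idx}", "text": para})
    return chunks

doc = "Intro line one.\nIntro line two.\n\nMethods line."
chunks = hierarchical_chunks(doc)
```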
## Appendix B: Strategy Selection Example
Example rule-based selector:
```python
def choose_chunking_strategy(doc_type):
    mapping = {
        "financial": "document_based",
        "presentation": "document_based",
        "legal": "hierarchical",
        "technical": "recursive",
        "research": "semantic",
        "chat": "fixed_size",
    }
    return mapping.get(doc_type, "recursive")
```
## References

1. https://docs.aws.amazon.com/bedrock/latest/userguide/kb-chunking.html
2. https://developer.nvidia.com/blog/finding-the-best-chunking-strategy-for-accurate-ai-responses/
3. https://www.geeksforgeeks.org/data-science/chunking-strategies/
4. https://community.databricks.com/t5/technical-blog/the-ultimate-guide-to-chunking-strategies-for-rag-applications/ba-p/113089