Welcome, Developer 👋
In my last post about RAG, I mentioned chunking in passing. “The chunking strategy matters more than I initially thought,” I wrote, then moved on. A few of you caught that and asked me to go deeper.
So here we are.
Chunking is one of those topics that looks like a detail but turns out to be foundational. You can have the best embedding model, the fastest vector database, the most capable LLM. If your chunks are bad, your answers will be bad. Full stop.
This post is everything I’ve figured out so far about chunking: what it is, why it’s harder than it looks, the main strategies, and the mistakes I made building my POC.
What Chunking Actually Is
When you ingest documents into a RAG pipeline, you can’t just throw a 40-page PDF straight into a vector database. You have to split it into smaller pieces first. Those pieces are called chunks.
Each chunk gets turned into a vector (an embedding), and that vector is what gets stored and searched. At query time, your search returns chunks, not whole documents. The LLM then answers based on whatever chunks came back.
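To make that flow concrete, here is a toy sketch of "embed the chunks, embed the query, return the closest chunks". It is only an illustration: `embed` is a made-up word-count stand-in for a real embedding model, and a plain Python list stands in for the vector database.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query, best match first."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


# Hypothetical chunks, as if a support document had already been split.
chunks = [
    "To reset your password, open the login page and choose Forgot password.",
    "Invoices are issued on the first day of each month.",
    "Password reset links expire after one hour.",
]
results = retrieve("how do I reset my password", chunks, top_k=2)
for chunk in results:
    print(chunk)
```

The thing to notice is the return value: the search hands back chunks, so whatever ended up inside each chunk is all the LLM will ever see.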
The question is: how do you split the document? How big should each piece be? Where do you cut?
That’s chunking strategy. And the answer, annoyingly, is “it depends.” But I can give you a framework for thinking through the decision instead of just guessing.
Why It Matters More Than You Think
Here’s the mental model that made this click for me.
Imagine you’re studying for an exam and someone gives you a set of index cards with notes on them. When you get a question, you flip through the cards and find the ones that seem relevant.
Now imagine the cards were written badly. Some of them cut off mid-sentence. Some of them have three unrelated topics smashed together. Some are so short they have no context. Some are so long they contain a dozen different ideas and it’s hard to tell what any of them is actually about.
You’d struggle to find the right card. And even when you found something close, you might not be able to use it well.
That’s exactly what happens to your LLM when your chunks are bad. The retrieval step surfaces the wrong material, or the right material without enough context, and the model either hallucinates or gives a vague answer that kind of sounds right but isn’t grounded in anything real.
Most RAG failures I’ve read about, and a few I’ve experienced in my own POC, trace back here. Not to the model. Not to the prompt. To the chunks.
The Main Strategies
There’s no single right answer, but there are a handful of well-understood approaches. Here’s how I think about them.
Fixed-Size Chunking
The simplest one. You split the document every N tokens (or characters), regardless of what’s in the text.
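from llama_index.core.node_parser import SentenceSplitter
# `documents` is assumed to be a list of LlamaIndex Document objects you loaded earlier
parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = parser.get_nodes_from_documents(documents)
512 tokens is a common starting point. The overlap (50 tokens in this case) means each chunk shares a little content with its neighbors, which helps when a relevant piece of information sits right at the boundary between two chunks.
One caveat: SentenceSplitter is not a completely blind cutter. It targets the chunk size but tries to keep sentences whole. If you want splits based strictly on token counts, TokenTextSplitter is the closer match.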
When it works: Simple documents with consistent structure. Plain text, logs, markdown without complex layouts. It’s fast and predictable.
When it breaks: Any document where meaning is tied to structure. A legal clause that spans two chunks. A code example that gets cut in half. A table where rows end up in different chunks. The split doesn’t know or care about any of that.
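If you want to see the mechanics without a library in the way, here is a minimal sketch of a strictly fixed-size splitter with overlap. It assumes the text is already a list of tokens (the `t0`, `t1`, ... strings are placeholders for real tokenizer output), and `fixed_size_chunks` is a hypothetical helper, not a LlamaIndex function.

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Placeholder tokens standing in for a 1,200-token document.
tokens = [f"t{i}" for i in range(1200)]
chunks = fixed_size_chunks(tokens, chunk_size=512, overlap=50)
print([len(c) for c in chunks])  # [512, 512, 276]
```

Nothing in that loop looks at the content, which is exactly why a clause, a code sample, or a table row can land on either side of a cut.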
Semantic Chunking
Instead of splitting by size, you split by meaning. You look at how semantically similar adjacent sentences are, and you break the chunk when there’s a meaningful shift in topic.
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.ollama import OllamaEmbedding
embed_model = OllamaEmbedding(model_name="nomic-embed-text")
parser = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model,
)
nodes = parser.get_nodes_from_documents(documents)
The result is chunks that are more coherent as individual units of meaning. A chunk about password reset doesn’t bleed into a chunk about billing. They’re separated naturally.
When it works: Long documents that cover multiple distinct topics. Blog posts, wiki articles, product documentation, support runbooks.
When it breaks: It’s slower (you’re embedding during chunking, not just counting tokens), and the chunk sizes become variable. Some chunks end up tiny, some large. That variability can complicate things downstream if your retrieval or prompt budget expects consistent sizes.
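The breakpoint idea is easier to reason about with a stripped-down version. The sketch below is a simplified stand-in, not the library's implementation: `toy_embed` is a word-count placeholder for a real embedding model, sentences are compared one at a time with no buffer, and a cut happens wherever the distance between neighbors exceeds the chosen percentile of all neighbor distances.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Word counts as a stand-in for a real sentence embedding."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_chunks(sentences, breakpoint_percentile=95):
    """Group sentences, cutting where neighbor distance exceeds the percentile."""
    if len(sentences) < 2:
        return [list(sentences)]
    vectors = [toy_embed(s) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, breakpoint_percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # topic shift: close the current chunk
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


sentences = [
    "Reset your password from the login page.",
    "The password reset link expires after one hour.",
    "Invoices are issued on the first day of each month.",
    "Each invoice lists the billing period and the amount due.",
]
chunks = semantic_chunks(sentences)
for chunk in chunks:
    print(chunk)
```

Because the threshold is relative to the document's own distances, a higher percentile means fewer, larger chunks, and a lower one means more, smaller chunks. That is also where the variable chunk sizes come from.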
Document-Structure-Aware Chunking
This one uses the actual structure of the document to decide where to split. Headers, sections, paragraphs. If you have a markdown file with ## headings, you split on those. If you have a PDF with clearly delineated sections, you respect them.
from llama_index.core.node_parser import MarkdownNodeParser
parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents(documents)
For HTML or structured docs you can do something similar with HTMLNodeParser or by pre-processing the document before it hits the parser.
When it works: Well-structured documents where sections map naturally to topics. Most technical documentation, wikis, policy documents.
When it breaks: Messy real-world documents. Scanned PDFs with no proper structure. Documents where the formatting and the content don’t align. If the structure is bad, your chunks inherit the same problems.
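For a feel of what "split on headings" means in practice, here is a small plain-Python sketch. `split_on_headings` is a hypothetical helper, not part of LlamaIndex, and it only handles `#`-style headings. It also skips `#` lines inside code fences, since a shell or Python comment there would otherwise be mistaken for a heading.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # markdown code-fence marker


def split_on_headings(markdown_text):
    """Return one section per heading as {"heading": str, "body": str}."""
    sections = []
    current = {"heading": "", "body": []}
    in_code_fence = False

    def has_content(section):
        return bool(section["heading"]) or any(ln.strip() for ln in section["body"])

    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_code_fence = not in_code_fence
        match = None if in_code_fence else HEADING.match(line)
        if match:
            if has_content(current):
                sections.append(current)
            current = {"heading": match.group(2).strip(), "body": []}
        else:
            current["body"].append(line)
    if has_content(current):
        sections.append(current)
    return [{"heading": s["heading"], "body": "\n".join(s["body"]).strip()} for s in sections]


# Hypothetical runbook used only for this example.
doc = "\n".join([
    "# Runbook",
    "Intro text.",
    "## Password reset",
    "Send the reset link.",
    FENCE,
    "# this comment is not a heading",
    FENCE,
    "## Billing",
    "Invoices go out monthly.",
])
sections = split_on_headings(doc)
for section in sections:
    print(section["heading"], "->", len(section["body"]), "chars")
```

Keeping the heading alongside the body is deliberate: it is cheap context, and it doubles as metadata later.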
Hierarchical Chunking (Parent-Child)
This one takes a bit more setup but pays off for complex retrieval scenarios.
The idea: you create two levels of chunks. Small, precise child chunks for retrieval (they match queries well because they’re specific). And larger parent chunks that get sent to the LLM for context.
You retrieve by the small chunk, but you inject the big one. The model gets the relevant detail plus enough surrounding context to reason about it properly.
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.storage.docstore import SimpleDocumentStore
# Chunk sizes per level, largest first
parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]
)
nodes = parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)
# All nodes (parents + leaves) go into the docstore so the retriever
# can walk up the hierarchy at merge time
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
storage_context = StorageContext.from_defaults(docstore=docstore)
# Only the leaf nodes are embedded and searched
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)
vector_retriever = index.as_retriever(similarity_top_k=6)
retriever = AutoMergingRetriever(
    vector_retriever, storage_context=storage_context, verbose=True
)
The AutoMergingRetriever handles the logic: if enough child chunks from the same parent come back (by default, more than 50% of a parent’s children), it merges them up and sends the parent chunk to the LLM instead. This example uses three levels (2048, 512, and 128 tokens) rather than two; the same idea applies at each level.
When it works: Long documents with dense information. Technical specs, legal contracts, research papers. Any case where you want precision on retrieval but breadth on context.
When it breaks: The complexity cost is real. More to configure, more to debug, more to reason about when something goes wrong. I wouldn’t start here.
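The merge rule itself is small enough to sketch in plain Python. This is a single-level simplification under assumed data structures (`parent_of` and `children_of` are hypothetical lookup dicts), not the retriever's actual code: if more than the threshold share of a parent's children were retrieved, the parent replaces them.

```python
from collections import defaultdict


def auto_merge(retrieved_child_ids, parent_of, children_of, ratio_threshold=0.5):
    """Swap children for their parent when enough siblings were retrieved.

    Assumes retrieved_child_ids contains no duplicates.
    """
    hits = defaultdict(list)
    for child_id in retrieved_child_ids:
        hits[parent_of[child_id]].append(child_id)

    merged, seen = [], set()
    for child_id in retrieved_child_ids:
        parent_id = parent_of[child_id]
        ratio = len(hits[parent_id]) / len(children_of[parent_id])
        result = parent_id if ratio > ratio_threshold else child_id
        if result not in seen:  # emit each parent or child only once
            seen.add(result)
            merged.append(result)
    return merged


# Hypothetical hierarchy: two parents with four children each.
children_of = {"P1": ["a", "b", "c", "d"], "P2": ["e", "f", "g", "h"]}
parent_of = {child: parent for parent, kids in children_of.items() for child in kids}

merged = auto_merge(["a", "e", "b", "c"], parent_of, children_of)
print(merged)  # ['P1', 'e']
```

Three of P1's four children came back, so P1 goes to the LLM as one block; P2 only had one hit, so its child stays as is. Exactly half does not trigger a merge here because the comparison is strictly greater than the threshold.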
The Overlap Question
Almost every strategy has an overlap parameter. It’s worth understanding what it actually does.
Overlap means each chunk shares some tokens with the chunk before it. If your chunk size is 512 and your overlap is 50, the last 50 tokens of chunk N are the first 50 tokens of chunk N+1.
Why does this matter? Because important information often sits at boundaries. A sentence that starts at the end of one chunk and finishes at the start of the next would be split in half without overlap. With overlap, any sentence no longer than the overlap is guaranteed to appear whole in at least one of the two chunks.
A common starting point is 10-15% overlap relative to chunk size. So 50-75 tokens for a 512-token chunk. Going higher than ~20% starts to add noise and bloat your index without much benefit.
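The bloat is easy to put numbers on. The sketch below assumes strictly fixed-size windows, and both helper names are made up for the example. With a chunk size of C and an overlap of O, each window advances by C - O tokens, so a long document stores roughly C / (C - O) tokens for every source token.

```python
import math


def chunk_count(total_tokens, chunk_size, overlap):
    """Number of fixed-size windows needed to cover total_tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if total_tokens <= chunk_size:
        return 1 if total_tokens > 0 else 0
    step = chunk_size - overlap
    return math.ceil((total_tokens - overlap) / step)


def duplication_factor(chunk_size, overlap):
    """Approximate stored tokens per source token for a long document."""
    return chunk_size / (chunk_size - overlap)


# Compare a few overlap settings on a hypothetical 100,000-token corpus.
for overlap in (0, 50, 75, 128):
    count = chunk_count(100_000, 512, overlap)
    factor = duplication_factor(512, overlap)
    print(f"overlap={overlap:>3}  chunks={count}  stored/source={factor:.2f}")
```

At 50 tokens of overlap the index grows by about 11% over no overlap; at 128 tokens (25%) it grows by about a third, which is where the cost starts to outweigh the boundary protection.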
What I Got Wrong in My POC
I want to be honest about where I stumbled, because I think it’s useful.
I started with fixed-size chunking and forgot about it. I set 512 tokens, ran it, the basic Q&A worked, I moved on. Then I started asking more specific questions and noticed the answers getting vague or slightly off. Went back and looked at the chunks being retrieved. Several of them were cutting right through the middle of relevant explanations. The answer was there, just split across two chunks that didn’t both get retrieved.
I underestimated how much document quality matters. I was testing with a mix of clean markdown and some PDF exports that weren’t great. The PDFs chunked terribly. The parser was splitting on formatting artifacts, not on meaning. The lesson: garbage in, garbage out, and it happens earlier in the pipeline than you think.
I didn’t look at my chunks. This sounds obvious in retrospect but I built the whole pipeline before I actually printed out what the chunks looked like. When I finally did, I found chunks that were just headers with no body, chunks that were blank, and chunks that were a random splice of two different sections. Visualising your chunks early saves a lot of debugging time later.
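A check like the following would have caught those three problems on day one. It is a rough sketch that works on plain strings: `audit_chunks` is a hypothetical helper, the header test only understands markdown `#` headings, and the 20-word cutoff is an arbitrary assumption to tune for your own documents.

```python
import re


def audit_chunks(chunks, min_words=20):
    """Flag chunks that are blank, header-only, or suspiciously short."""
    problems = []
    for i, text in enumerate(chunks):
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if not lines:
            problems.append((i, "blank"))
        elif all(re.match(r"^#{1,6}\s", ln) for ln in lines):
            problems.append((i, "header only"))
        elif len(text.split()) < min_words:
            problems.append((i, "very short"))
    return problems


# Hypothetical chunk texts, as they might come out of a parser.
chunks = [
    "## Password reset",
    "   \n",
    "To reset a password, open the login page and choose Forgot password. " * 3,
    "See above.",
]
problems = audit_chunks(chunks)
for index, reason in problems:
    print(f"chunk {index}: {reason}")
```

With LlamaIndex nodes, the same function can be fed the text of each node instead of raw strings. It does not replace reading the chunks yourself, but it tells you which ones to read first.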
A Practical Starting Point
If you’re building a RAG app and you’re not sure where to start, here’s what I’d actually do:
- Start with fixed-size at 512 tokens and 50 overlap. It’s not perfect but it’s not bad, and it’s fast to iterate on.
- Look at your chunks. Print them. Read them. Do they make sense as standalone units of information? Would a human understand each one without the rest of the document?
- Check what your retrieval is actually returning. When you ask a question, log the chunks that come back. Are they the right ones? Are they complete?
- Switch to structure-aware chunking if your documents have good structure. It’s a straightforward improvement and doesn’t add much complexity.
- Consider semantic chunking if you’re dealing with long mixed-topic documents and retrieval quality still isn’t good enough after fixing the obvious stuff.
- Try hierarchical chunking last, only if you need it. It solves a real problem but it adds real complexity.
The order matters. Don’t reach for the complex solution before you understand why the simple one isn’t working.
One More Thing: Metadata
Before I wrap up, there’s one thing that doesn’t get talked about enough alongside chunking: metadata.
When you create a chunk, you can attach metadata to it. The source document, the section it came from, the date it was created, whatever is relevant to your use case. That metadata travels with the chunk into the vector database and you can use it for filtering at retrieval time.
This matters more than it sounds. Without metadata, every query searches every chunk. With metadata, you can say: only search chunks from the last 30 days, or only search the technical documentation, not the marketing copy.
LlamaIndex handles this cleanly:
from llama_index.core import Document
doc = Document(
    text="Your document content here",
    metadata={
        "source": "support-runbook",
        "section": "password-reset",
        "last_updated": "2026-05-01",
    }
)
Add metadata during ingestion. You’ll thank yourself when you need to filter later.
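To show what filtering buys you, here is a plain-Python sketch of narrowing the candidate set before any similarity search runs. It is only an illustration of the idea: `filter_chunks` is a hypothetical helper, the chunks are simple dicts, and in a real setup the vector store would apply the filter for you.

```python
from datetime import date


def filter_chunks(chunks, *, source=None, updated_since=None):
    """Keep only chunks whose metadata matches the given constraints."""
    kept = []
    for chunk in chunks:
        meta = chunk["metadata"]
        if source is not None and meta.get("source") != source:
            continue
        if updated_since is not None:
            last_updated = meta.get("last_updated")
            # Chunks with no date are dropped when a date filter is requested.
            if last_updated is None or date.fromisoformat(last_updated) < updated_since:
                continue
        kept.append(chunk)
    return kept


# Hypothetical chunks with metadata attached at ingestion time.
chunks = [
    {"text": "Current reset steps", "metadata": {"source": "support-runbook", "last_updated": "2026-05-01"}},
    {"text": "Spring campaign copy", "metadata": {"source": "marketing", "last_updated": "2026-04-20"}},
    {"text": "Old reset steps", "metadata": {"source": "support-runbook", "last_updated": "2025-01-15"}},
]
recent_runbook = filter_chunks(chunks, source="support-runbook", updated_since=date(2026, 4, 1))
print([c["text"] for c in recent_runbook])  # ['Current reset steps']
```

Storing dates as ISO strings keeps the metadata simple and still lets you compare them reliably once parsed.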
Conclusion
Chunking is where most RAG quality issues actually start. Not the model. Not the prompt. The chunks.
Fixed-size is the easiest starting point. Structure-aware is better if your documents have clear sections. Semantic chunking is more powerful for mixed content but slower to run. Hierarchical chunking solves the context problem for dense, complex documents.
Look at your chunks early. Log your retrieval results. Tune based on what you actually see, not what a blog post (including this one) says you should use by default.
I’m still iterating on my own POC. But chunking is the thing I now spend the most time thinking about when something isn’t working. It’s almost always the right place to look first.
Stay focused, Developer!