The Dark Art of Chunking

Your Chunker Doesn't Work How You Think It Does

If you’re using a common recursive text chunker, as we are in TrustGraph, there’s a good chance your chunks don’t look the way you think they do. For one, the chunk size and overlap are measured in characters, not tokens. Give that a second to sink in. I’m seeing the best results, substantially better, when chunking at 500 characters, and that trend has held for every model I’ve tested so far. Take Gemini 1.5 Flash, with its advertised context window of 1 million tokens. Even for that model, dropping the chunk size from 1000 to 500 characters produced a substantial increase in knowledge extraction: 30%, to be exact. And this, again, for a model with an advertised 1 million token context window.
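To make that concrete, here’s a minimal sketch of that kind of configuration using LangChain’s RecursiveCharacterTextSplitter (the import path, input file, and sample parameters are illustrative assumptions, not TrustGraph’s actual pipeline code). Note that both parameters are interpreted in characters, because the default length function is plain len():

```python
# A minimal sketch, assuming LangChain's RecursiveCharacterTextSplitter
# (the exact import path varies by LangChain version).
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # measured in CHARACTERS, not tokens
    chunk_overlap=50,  # also characters
    # length_function defaults to len(), which is why the units are
    # characters; the default separators are tried in order:
    # ["\n\n", "\n", " ", ""]
)

text = open("document.txt").read()  # hypothetical input file
chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks, sizes: {sorted(len(c) for c in chunks)}")
```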

But for the moment, I’m talking about chunking. As you may have noticed, we at TrustGraph love data. We’ve tried to jam as much of it as we can into our Grafana dashboard. Let’s take chunking at 2000 characters as an example. What do we think that would look like? Chunks mostly in that range, right? Gee, I bet a histogram of the chunk sizes would be nifty right about now…
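If you don’t have a Grafana dashboard handy, a few lines of Python will produce a rough equivalent. This is just a sketch; the 250-character bin width is an arbitrary choice, and chunks is the list from the splitter sketch above:

```python
from collections import Counter

# Bucket chunk lengths into 250-character bins (the bin width is an
# arbitrary choice) and print a crude text histogram.
def chunk_histogram(chunks, bin_width=250):
    buckets = Counter((len(c) // bin_width) * bin_width for c in chunks)
    for low in sorted(buckets):
        print(f"{low:>5}-{low + bin_width - 1:<5} {'#' * buckets[low]}")

chunk_histogram(chunks)  # `chunks` from the splitter sketch above
```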

Why so many chunk sizes?

Why are there so many chunk sizes? Shouldn’t they all be roughly 2000 characters, give or take? Well, it depends on the document being chunked. Most text splitters first try to find “natural break points”: paragraph boundaries, section breaks, ends of sentences, and so on. Only then is the text chunked by the specified number of characters. Depending on the document, this process may produce very consistent chunk sizes or, as in this case, a wide range of sizes. But what if we change the chunk overlap parameter from 50 to 100 characters? Does that matter?

Yes, yes it does.

Simply increasing the amount of overlap causes the chunk sizes to become more uniform. But wait, is that what we’d expect? Increasing the overlap should increase the number of chunks for a given text corpus, but why would it alter the chunk sizes? Again, the chunking algorithm looks for natural break points first, which can produce unexpected results. What happens if we chunk larger and smaller?

Chunking at 1000 with 5% overlap.

Chunking at 4000 with 5% overlap.

While chunking at 1000 produces relatively uniform sizes, raising the chunk size to 4000 gives us 4 chunks in the 2500-character range, with 1 chunk as small as 1000. But what happens if we change the overlap?

Chunking at 4000 with 2.5% overlap.

Chunking at 4000 with 10% overlap.

Chunking at 1000 with 2.5% overlap.

Chunking at 1000 with 10% overlap.

For chunking at 4000, varying the overlap from 2.5% to 10% had no impact. For chunking at 1000, overlaps of 2.5% and 5% yielded the same number of chunks with a slightly different distribution. Chunking at 1000 with an overlap of 10%, however, changed things dramatically: we got 6 chunks of 250 characters or smaller. My assumption had always been that the overlap parameter would simply shift the break points where chunks begin and end. Instead, it can also dramatically change the size of the chunks themselves.
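This is easy to check for yourself. Here’s a rough sketch of the kind of sweep behind these charts, again assuming the LangChain splitter from earlier and a placeholder input file:

```python
# A sketch of an overlap sweep, assuming the same (hypothetical)
# LangChain splitter as above; the input file is a placeholder.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = open("document.txt").read()  # hypothetical input file

for chunk_size in (1000, 4000):
    for pct in (0.025, 0.05, 0.10):
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=int(chunk_size * pct),
        )
        sizes = [len(c) for c in splitter.split_text(text)]
        print(f"size={chunk_size} overlap={pct:.1%}: "
              f"{len(sizes)} chunks, min={min(sizes)}, max={max(sizes)}")
```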

The Impact of Chunk Sizes

As we’ve previously seen, knowledge extraction increases dramatically as chunk sizes decrease. How does chunk size impact an extraction process?

  • Variance in extracted Graph Edges per chunk

  • Variance in LLM response time per chunk

  • Positional Biasing of the knowledge extraction

Positional Biasing of the knowledge extraction? What does that mean? Well, we know that a smaller chunk will yield more Graph Edges than a larger chunk. That means the smaller chunks contribute disproportionately many Graph Edges to the total Knowledge Graph. In other words, a small chunk might produce 25 Graph Edges while a larger chunk produces only 15. In the context of the overall Knowledge Graph, there is now more information about the text in the small chunks than about the text in the large ones.
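To put rough numbers on that bias, assume (purely for illustration) that the 25-edge chunk is 500 characters and the 15-edge chunk is 4000 characters:

```python
# Hypothetical numbers: the chunk sizes (500 and 4000 characters) are
# assumptions attached to the edge counts above, purely for illustration.
small = {"chars": 500, "edges": 25}
large = {"chars": 4000, "edges": 15}

for name, c in (("small", small), ("large", large)):
    print(f"{name}: {c['edges'] / c['chars']:.4f} edges per character")

# Prints roughly:
#   small: 0.0500 edges per character
#   large: 0.0037 edges per character
# The small chunk's text ends up ~13x more densely represented in the
# final Knowledge Graph.
```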

Perhaps the smaller chunks are more important. Perhaps, but we don’t know. Until a human has assessed the text corpus - and the whole point of using AI is not having to do that - we simply don’t know. What we do know is that chunking matters, and matters in unexpected ways. We believe there are many performance gains yet to be unlocked by improved text chunking. We invite you to join us on this journey of data-driven discovery!