Chunk Smaller

Even powerful LLMs perform better with smaller chunks

The Smaller the Better

Quite possibly, the question I get asked most is: why should I use Graph RAG when context windows for LLMs keep getting longer? The short answer is: because long context windows don’t work.

Gasp! Shock! Horror! I said it! The problem has been that I haven’t had a way to prove that statement. Our general rule of thumb at TrustGraph is that, at most, 10% of the advertised context window is usable. Past that limit, you run the risk of the model missing important details in the input context.

But what are important details? Yes, we’re quickly descending into a semantic quagmire of subjectivity from which I don’t have the tools to escape. In order to save myself, we’ve begun to create a new benchmark: Number of Graph Edges Extracted for a given Text Corpus.

Creating a Test Document

Building an objective test to show the limitations of long context windows requires a controlled approach. One of the first challenges is finding a suitable text corpus that is unlikely to be “known” by an LLM. Using Claude 3.5 Sonnet, I created a faux academic paper contrasting the cosmology and culture of the Atlanteans, the Lemurians, and the alien race the Ebens. Why?

  • To attempt to create a truly unique knowledge set

  • To create a knowledge set where RAG query responses must be fact-checked against the document itself, since no one would already know the answers to a totally fictitious document

  • To make the corpus relatively simple with no formulas or formatting

  • Comparison testing can be boring, so why not?

Choosing a Test Model

In the category of “big” LLMs, we found that Claude 3 Haiku excelled at knowledge extraction. Yes, that’s right, Haiku. Not Sonnet (tests with Sonnet 3.5 coming soon) and not Opus. Haiku also has an advertised context window of 200k tokens. Surely, Haiku would breeze through a series of Naive Extractions with ease.

The Test Setup

With my synthetic document (available here), I used TrustGraph to run a Naive Extraction with Haiku. I chunked the document at 5 different sizes: 1000, 1500, 2000, 4000, and 8000. At the end of each Naive Extraction, I recorded the number of Graph Edges extracted.
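
To make the setup concrete, here is a minimal sketch of the measurement loop. It assumes, purely for illustration, non-overlapping fixed-size chunks and a hypothetical extract_fn that wraps the actual LLM extraction call and returns (subject, predicate, object) triples; this is not TrustGraph’s actual API, and the file name is a placeholder.

    from typing import Callable, Iterable, Set, Tuple

    Edge = Tuple[str, str, str]  # a Graph Edge as a (subject, predicate, object) triple

    def chunk_text(text: str, chunk_size: int) -> list[str]:
        """Split the document into fixed-size, non-overlapping chunks."""
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    def count_edges(text: str, chunk_size: int,
                    extract_fn: Callable[[str], Iterable[Edge]]) -> int:
        """Run extraction chunk by chunk and count the distinct edges produced."""
        edges: Set[Edge] = set()
        for chunk in chunk_text(text, chunk_size):
            edges.update(extract_fn(chunk))  # one LLM extraction call per chunk
        return len(edges)

    # Usage (extract_fn would wrap the real extraction pipeline):
    # document = open("synthetic_paper.txt").read()
    # for size in (1000, 1500, 2000, 4000, 8000):
    #     print(size, count_edges(document, size, extract_fn))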

For a model like Haiku, the assumption would be that chunking this small would have zero impact. Even at 8000, I’m only at 4% of the advertised context window. So, of course, I extracted the same number of Graph Edges for each run, right? Right?

Wrong. Very wrong.

The initial assumption held true when chunking from 8000 to 4000. However, I saw a significant increase in extracted Graph Edges at 2000 and an even bigger leap at 1000.

Chunking Size    Graph Edges
1000             2153
1500             1728
2000             1710
4000             1344
8000             1352

The number of extracted Graph Edges increased 59% when decreasing the chunking size from 8000 to 1000. The number increased 25% when making the small change of going from 1500 to 1000. For a model with an advertised context window of 200k tokens, decreasing the chunk size from 1500 to 1000 should have no impact on the responses.

Except that it does. Why is that? I’d be lying if I told you I know. I can speculate, but it would be nothing more than speculation. But we’ve been seeing these results consistently, and those observations drove the creation of TrustGraph.
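
For reference, the percentage figures above are plain relative increases over the larger-chunk baseline; a quick sketch reproduces them from the table:

    # Edge counts from the table above, keyed by chunk size
    edges = {1000: 2153, 1500: 1728, 2000: 1710, 4000: 1344, 8000: 1352}

    def pct_increase(smaller: int, larger: int) -> float:
        """Relative increase in edges going from the larger to the smaller chunk size."""
        return 100.0 * (edges[smaller] - edges[larger]) / edges[larger]

    print(f"8000 -> 1000: {pct_increase(1000, 8000):.0f}%")  # ~59%
    print(f"1500 -> 1000: {pct_increase(1000, 1500):.0f}%")  # ~25%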

Going Even Smaller?

At this point, I thought I was done, but then I wondered: surely going smaller than 1000 would have no impact? Right?

So wrong. So very, very, very wrong.

Chunking at 500 yielded a staggering 2975 Graph Edges. It’s worth mentioning that the test document is roughly 10k tokens, so that many Graph Edges amounts to a near-complete restructuring of the text into a knowledge graph. But still, even for a model like Haiku, does 500 make that big a difference?

  • Going from 8000 to 500 increased Graph Edges by 120%

  • Going from 2000 to 500 increased Graph Edges by 74%

  • Going from 1000 to 500 increased Graph Edges by 38%

More To Come

No, I haven’t conclusively proven that long context windows don’t work. For one, I’m using a synthetic document generated by a model in the same family as the model I’m using for the extraction. There are all sorts of odd implications of that test setup that would be fun to explore!

Empirically, these results align with what we’ve been seeing at TrustGraph: chunk sizes of 1000-2000 seem to be a “sweet spot” for Naive Extraction. We want to continue this testing with as many configurations as possible. We hope you’ll join us in using TrustGraph to build these knowledge extraction benchmarks!