A Tale of Two Trends

The Consistent Inconsistencies of LLMs

Last time, I tested Claude 3 Haiku across a range of context window sizes for a Naive Extraction process in TrustGraph. What did I find? The smaller I chunked, the more Graph Edges I extracted.

More of the Same

This time, I wanted to see if the same behavior held across a wider range of models. Here’s the test setup:

  • A 10k token synthetic PDF (available here)

  • Recursive Text Chunker

  • Chunk sizes of 500, 1000, 2000, and 4000 (see the sketch below)
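
For reference, here’s a minimal sketch of that chunking step. It assumes LangChain’s RecursiveCharacterTextSplitter as a stand-in for TrustGraph’s recursive chunker, and the zero-overlap setting is my assumption, not a detail from the test:

```python
# A minimal sketch of the chunking setup, using LangChain's
# RecursiveCharacterTextSplitter as a stand-in for TrustGraph's
# recursive text chunker (an assumption, not TrustGraph's internals).
from langchain_text_splitters import RecursiveCharacterTextSplitter

CHUNK_SIZES = [500, 1000, 2000, 4000]  # the chunk sizes tested

def chunk_document(text: str, chunk_size: int) -> list[str]:
    """Recursively split text into chunks of at most chunk_size characters."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=0,  # assumption: no overlap specified in the test
    )
    return splitter.split_text(text)

# One chunk set per size in the test matrix:
# chunks_by_size = {size: chunk_document(pdf_text, size) for size in CHUNK_SIZES}
```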

Most research focuses on the same models. Instead, I wanted to mix it up a bit. Here are the models tested:

  • Claude 3.5 Sonnet

  • Claude 3 Haiku

  • Gemini 1.5 Flash

  • Mistral Large

  • Mixtral8x7B

  • Cohere Command R+

  • Llama3.1:8B

  • AI21 Jamba 1.0

Is Haiku the only model that shows a steep drop-off in knowledge extraction performance at surprisingly small context windows?

No. No, Haiku is not.

The results are shocking - shockingly consistent. Across 8 different models from 6 different companies - AI21, Anthropic, Cohere, Google, Meta, and Mistral - the trend lines are almost identical. The similarities are even more telling when comparing the percentage drop-off in extracted Graph Edges from a chunk size of 500 to a chunk size of 4000:

Model              | % Drop-Off
Claude 3.5 Sonnet  | 52.2%
Claude 3 Haiku     | 54.8%
Gemini 1.5 Flash   | 55.5%
Mistral Large      | 50.3%
Mixtral8x7B        | 59.1%
Cohere Command R+  | 45.0%
Llama3.1:8B        | 54.0%
AI21 Jamba 1.0     | 64.8%
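
For clarity, the drop-off figures above follow this calculation (a sketch; the edge counts in the example are illustrative, not the actual run data):

```python
# How the drop-off percentages above are computed: the percentage
# decrease in extracted Graph Edges between the 500 and 4000 chunk sizes.
def drop_off(edges_small: int, edges_large: int) -> float:
    """Percentage drop from the small-chunk count to the large-chunk count."""
    return (edges_small - edges_large) / edges_small * 100

# Illustrative numbers only (not the actual run data): 3000 edges at a
# chunk size of 500 falling to 1356 edges at 4000 is a 54.8% drop-off.
print(f"{drop_off(3000, 1356):.1f}%")  # -> 54.8%
```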

With the exception of Jamba 1.0, the observed drop-off clustered between 50% and 55% - an incredibly tight range. Command R+ and Mixtral8x7B fell just outside it, by roughly 5% on opposite sides.

The Inconsistency in the Consistency

Buried in this beautifully consistent data is a shocking secret begging to be revealed. See it yet? Perhaps this will help:

Model              | Most Edges Extracted
Claude 3.5 Sonnet  | 2575
Claude 3 Haiku     | 2975
Gemini 1.5 Flash   | 3120
Mistral Large      | 2266
Mixtral8x7B        | 3226
Cohere Command R+  | 2178
Llama3.1:8B        | 2439
AI21 Jamba 1.0     | 1349

If the number of extracted Graph Edges is a proxy for model knowledge recognition performance - and we think it is - do these results match your expectations? The winner? Mixtral8x7B, beating out models like Claude 3.5 Sonnet and Gemini 1.5 Flash. In fact, Sonnet 3.5 wasn’t even Anthropic’s best performing model: Haiku extracted 15.5% more edges than the supposed flagship.

But look deeper: I tested Llama3.1 at only 8B parameters, yet it outperformed both Mistral’s and Cohere’s flagship models. That said, the overall winner was Mixtral8x7B, outperforming Mistral Large by a whopping 42.4%. However, Mixtral8x7B also saw the second highest drop-off at 59.1%, calling into question how it will perform at larger chunk sizes. In this test, Mistral is both a winner and a loser.
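
The relative comparisons in this section come straight from the peak edge counts in the table above; a quick sketch of the arithmetic:

```python
# Verifying the relative comparisons using the peak edge counts above.
def pct_more(a: int, b: int) -> float:
    """How much more (in percent) is a than b?"""
    return (a - b) / b * 100

print(f"{pct_more(2975, 2575):.1f}%")  # Haiku vs. Sonnet 3.5 -> 15.5%
print(f"{pct_more(3226, 2266):.1f}%")  # Mixtral8x7B vs. Mistral Large -> 42.4%
```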

I do want to point out that I’ve never worked with Jamba 1.0 before, so I’m not sure where it should stand in comparison to the other models. But those were the results, and I verified them with multiple runs.

But Are They Any Good?

That’s a good question. No, that’s a really good question. No, that is THE question. Oh, go ahead and go crazy with the superlatives. It truly is the question to be asked.

At the moment, we’re assuming more is better. As we well know, that’s not always the case. But this is a tricky case: in a Naive Extraction, we’re asking the models to extract as much knowledge as possible, because we don’t know in advance what’s important. We want to build as rich a knowledge graph as possible, so we can carve out subgraphs for RAG responses.
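
To make "Graph Edges" concrete: in a Naive Extraction, each edge is essentially a subject-predicate-object triple pulled from a chunk. Here’s a minimal sketch, assuming a generic LLM call and an illustrative prompt - not TrustGraph’s actual internals:

```python
import json
from dataclasses import dataclass

@dataclass
class GraphEdge:
    """A single knowledge-graph edge: a subject-predicate-object triple."""
    subject: str
    predicate: str
    obj: str

def extract_edges(chunk: str, llm_call) -> list[GraphEdge]:
    """Naively extract every triple the model can find in one chunk.
    `llm_call` is any function mapping a prompt string to a response string."""
    # Illustrative prompt only; TrustGraph's actual prompt differs.
    prompt = (
        "List every factual relationship in the text below as a JSON array "
        'of objects with "subject", "predicate", and "object" keys.\n\n'
        + chunk
    )
    response = llm_call(prompt)
    return [GraphEdge(t["subject"], t["predicate"], t["object"])
            for t in json.loads(response)]

# The metric in this post is then simply the total across all chunks:
# total_edges = sum(len(extract_edges(c, llm_call)) for c in chunks)
```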

But do more Graph Edges produce better RAG responses? Today, I don’t know. In fact, I’m not even sure how to measure it. We have some ideas, and we want to know the answer to this question as much as you do. But what I do know for certain is that we could use your help in this data endeavor!