A Tale of Two Trends

The Consistent Inconsistencies of LLMs

Last time, I tested Claude 3 Haiku across a range of context window sizes for a Naive Extraction process in TrustGraph. What did I find? The smaller I chunked, the more Graph Edges I extracted.

More of the Same

This time, I wanted to see if the same behavior held across a wider range of models. Here’s the test setup:

  • A 10k token synthetic PDF (available here)

  • Recursive Text Chunker

  • Chunk sizes of 500, 1000, 2000, and 4000 (see the sketch below)
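
For reference, here’s a minimal sketch of that chunking step. It assumes LangChain’s RecursiveCharacterTextSplitter as a stand-in for TrustGraph’s recursive chunker, and the zero-overlap setting is my assumption, not a detail from the test:

```python
# A minimal sketch of the chunking setup, using LangChain's
# RecursiveCharacterTextSplitter as a stand-in for TrustGraph's
# recursive text chunker (an assumption, not TrustGraph's internals).
from langchain_text_splitters import RecursiveCharacterTextSplitter

CHUNK_SIZES = [500, 1000, 2000, 4000]  # the chunk sizes tested

def chunk_document(text: str, chunk_size: int) -> list[str]:
    """Recursively split text into chunks of at most chunk_size characters."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=0,  # assumption: no overlap specified in the test
    )
    return splitter.split_text(text)

# One chunk set per size in the test matrix:
# chunks_by_size = {size: chunk_document(pdf_text, size) for size in CHUNK_SIZES}
```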

Most research focuses on the same models. Instead, I wanted to mix it up a bit. Here are the models tested:

  • Claude 3.5 Sonnet

  • Claude 3 Haiku

  • Gemini 1.5 Flash

  • Mistral Large

  • Mixtral8x7B

  • Cohere Command R+

  • Llama3.1:8B

  • AI21 Jamba 1.0

Is Haiku the only model that shows a steep drop-off in knowledge extraction performance at surprisingly small context windows?

No. No, Haiku is not.

The results are shocking - shockingly consistent. Across 8 different models from 6 different companies - AI21, Anthropic, Cohere, Google, Meta, and Mistral - the trend lines are almost identical. The similarities are even more telling when comparing the percentage drop-off in extracted Graph Edges from a chunk size of 500 to a chunk size of 4000:

Model              | % Drop-Off
Claude 3.5 Sonnet  | 52.2%
Claude 3 Haiku     | 54.8%
Gemini 1.5 Flash   | 55.5%
Mistral Large      | 50.3%
Mixtral8x7B        | 59.1%
Cohere Command R+  | 45.0%
Llama3.1:8B        | 54.0%
AI21 Jamba 1.0     | 64.8%
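
For clarity, the drop-off figures above follow this calculation (a sketch; the edge counts in the example are illustrative, not the actual run data):

```python
# How the drop-off percentages above are computed: the percentage
# decrease in extracted Graph Edges between the 500 and 4000 chunk sizes.
def drop_off(edges_small: int, edges_large: int) -> float:
    """Percentage drop from the small-chunk count to the large-chunk count."""
    return (edges_small - edges_large) / edges_small * 100

# Illustrative numbers only (not the actual run data): 3000 edges at a
# chunk size of 500 falling to 1356 edges at 4000 is a 54.8% drop-off.
print(f"{drop_off(3000, 1356):.1f}%")  # -> 54.8%
```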

With the exception of Jamba 1.0, the observed drop-off clustered between 50% and 55% - an incredibly tight range. Command R+ and Mixtral8x7B fell just outside it, by roughly 5% on opposite sides.

The Inconsistency in the Consistency

Buried in this beautifully consistent data is a shocking secret begging to be revealed. See it yet? Perhaps this will help:

Model              | Most Edges Extracted
Claude 3.5 Sonnet  | 2575
Claude 3 Haiku     | 2975
Gemini 1.5 Flash   | 3120
Mistral Large      | 2266
Mixtral8x7B        | 3226
Cohere Command R+  | 2178
Llama3.1:8B        | 2439
AI21 Jamba 1.0     | 1349

If the number of extracted Graph Edges is a proxy for model knowledge recognition performance - and we think it is - do these results match your expectations? The winner? Mixtral8x7B, beating out models like Claude 3.5 Sonnet and Gemini 1.5 Flash. In fact, Sonnet 3.5 wasn’t even Anthropic’s best performing model: Haiku extracted 15.5% more edges than the supposed flagship.

But look deeper: I tested Llama3.1 at only 8B parameters, yet it outperformed both Mistral’s and Cohere’s flagship models. That said, the overall winner was Mixtral8x7B, outperforming Mistral Large by a whopping 42.4%. However, Mixtral8x7B also saw the second highest drop-off at 59.1%, calling into question how it will perform at larger chunk sizes. In this test, Mistral is both a winner and a loser.
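
The relative comparisons in this section come straight from the peak edge counts in the table above; a quick sketch of the arithmetic:

```python
# Verifying the relative comparisons using the peak edge counts above.
def pct_more(a: int, b: int) -> float:
    """How much more (in percent) is a than b?"""
    return (a - b) / b * 100

print(f"{pct_more(2975, 2575):.1f}%")  # Haiku vs. Sonnet 3.5 -> 15.5%
print(f"{pct_more(3226, 2266):.1f}%")  # Mixtral8x7B vs. Mistral Large -> 42.4%
```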

I do want to point out that I’ve never worked with Jamba 1.0 before, so I’m not sure where it should stand in comparison to the other models. But those were the results, and I verified them with multiple runs.

But Are They Any Good?

That’s a good question. No, that’s a really good question. No, that is THE question. Oh, go ahead and go crazy with the superlatives. It truly is the question to be asked.

At the moment, we’re assuming more is better. As we well know, that’s not always the case. But this is a tricky case: in a Naive Extraction, we’re asking the models to extract as much knowledge as possible, because we don’t know in advance what’s important. We want to build as rich a knowledge graph as possible, so we can carve out subgraphs for RAG responses.
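
To make "Graph Edges" concrete: in a Naive Extraction, each edge is essentially a subject-predicate-object triple pulled from a chunk. Here’s a minimal sketch, assuming a generic LLM call and an illustrative prompt - not TrustGraph’s actual internals:

```python
import json
from dataclasses import dataclass

@dataclass
class GraphEdge:
    """A single knowledge-graph edge: a subject-predicate-object triple."""
    subject: str
    predicate: str
    obj: str

def extract_edges(chunk: str, llm_call) -> list[GraphEdge]:
    """Naively extract every triple the model can find in one chunk.
    `llm_call` is any function mapping a prompt string to a response string."""
    # Illustrative prompt only; TrustGraph's actual prompt differs.
    prompt = (
        "List every factual relationship in the text below as a JSON array "
        'of objects with "subject", "predicate", and "object" keys.\n\n'
        + chunk
    )
    response = llm_call(prompt)
    return [GraphEdge(t["subject"], t["predicate"], t["object"])
            for t in json.loads(response)]

# The metric in this post is then simply the total across all chunks:
# total_edges = sum(len(extract_edges(c, llm_call)) for c in chunks)
```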

But do more Graph Edges produce better RAG responses? Today, I don’t know. In fact, I’m not even sure how to measure it. We have some ideas, and we want to know the answer to this question as much as you do. But what I do know for certain is that we could use your help in this data endeavor!