Diving Down the AI Cost Rabbit Hole

Part 1: Knowledge Extraction

With all the attention on AI, you’d think there’d be an endless supply of cost analyses of AI approaches. There’s certainly an endless supply of speculation about strawberries. Cost analysis? Not so much. Well, that is, until people saw the costs of OpenAI’s new reasoning model, o1.

So far, the pricing model for LLMs is per token, with output tokens usually more expensive than input tokens. In the case of o1, 1M input tokens cost $15 and 1M output tokens cost $60. Utilizing the full 128k-token context window can push the cost of a single o1 request into multiple dollars. It doesn’t take long to see why someone would pursue a RAG solution purely for cost savings.
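
As a quick sanity check on that claim, here’s a minimal sketch of the per-token arithmetic. The function and the assumed 10k-token response length are mine for illustration; only the $15/$60 per 1M token prices come from o1’s published pricing.

```python
# Minimal sketch of per-token pricing arithmetic for a single LLM request.
# Prices are per 1M tokens; the 10k output length is an illustrative assumption.

def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the dollar cost of one request."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# o1 pricing: $15 per 1M input tokens, $60 per 1M output tokens
cost = request_cost(input_tokens=128_000, output_tokens=10_000,
                    input_price_per_m=15.00, output_price_per_m=60.00)
print(f"${cost:.2f}")  # -> $2.52 for a full-context request with a 10k-token response
```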

Except for one problem. How do you actually model the costs of RAG, GraphRAG, or HybridRAG? Good question - one that currently doesn’t have an answer. In fact, this post is Part 1 in a series to address it. How can we compare calling LLMs directly with long context to various RAG approaches?

One critical problem is that there’s no single RAG approach. TrustGraph’s GraphRAG approach builds a knowledge graph and mapped vector embeddings from a text corpus in a one-time process. Storing the output of that process as a knowledge core prevents having to run the extraction multiple times.

The current 0.9.5 version of TrustGraph extracts knowledge with 3 parallel processes that extract entities, topics, and semantic relationships. While it is debatable whether splitting the extraction into 3 processes is a good idea, this approach does present a “worst case” in terms of costs. To test TrustGraph, I created a synthetic document of approximately 10k tokens. Running a full extraction consumed the following token counts:

  • Input Tokens: 111k

  • Output Tokens: 51k

I suspect I know what you’re thinking. Didn’t you say the document was only 10k tokens and that there are 3 processes? So how did you end up with 111k input tokens? For one, depending on the chunking method, chunks will overlap. Secondly, the prompt instructions are roughly 200 tokens each, so every chunk carries an extra 200 tokens on top of its raw text. In addition, we’ve found that smaller chunks produce better results, meaning the ratio of the text chunk to the instructions could be as low as 2:1. Since there are 3 parallel processes, all of this gets multiplied by 3 - a rough sketch of the accounting follows below.
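
To make that multiplication concrete, here’s a sketch of the input-token accounting. The chunk size and overlap below are illustrative assumptions, not TrustGraph’s actual defaults; the post only pins down the ~200-token instructions, the roughly 2:1 chunk-to-instruction ratio, and the 3 processes.

```python
# Rough input-token accounting for a chunked, multi-process extraction.
# Chunk size and overlap are illustrative assumptions, not TrustGraph's settings.

def extraction_input_tokens(doc_tokens: int, chunk_tokens: int, overlap_tokens: int,
                            instruction_tokens: int, num_processes: int) -> int:
    stride = chunk_tokens - overlap_tokens        # new text consumed per chunk
    num_chunks = -(-doc_tokens // stride)         # ceiling division
    per_process = num_chunks * (chunk_tokens + instruction_tokens)
    return per_process * num_processes

# ~10k-token document, 400-token chunks with 50% overlap, ~200-token instructions,
# 3 parallel extraction processes (entities, topics, semantic relationships)
total = extraction_input_tokens(doc_tokens=10_000, chunk_tokens=400, overlap_tokens=200,
                                instruction_tokens=200, num_processes=3)
print(total)  # -> 90000, already ~9x the document and in the ballpark of the observed 111k
```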

You’re probably looking at that output token number and wondering how it ended up at 5x the tokens of the original document. The simplest way to extract knowledge is to have the model emit JSON following a fixed schema, which is then converted to RDF triples. Somewhere between 20% and 30% of those output tokens are likely just JSON schema boilerplate. In addition, the TrustGraph extraction will generate duplicate triples, since each chunk is handled independently. The current extraction process is by no means hyper-efficient, but these are the tradeoffs of a purely naive extraction process.
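
To see where that overhead comes from, here’s a toy example of what converting JSON output into triples might look like. The field names are a hypothetical shape, not TrustGraph’s actual schema; the point is just that the keys, quotes, and braces repeat for every extracted fact, and that duplicates only get collapsed after the tokens have already been generated and billed.

```python
import json

# Hypothetical extraction output for one chunk -- not TrustGraph's actual schema.
# The repeated keys and punctuation are pure token overhead relative to the facts.
raw_output = """
[
  {"subject": "Acme Corp", "predicate": "headquartered_in", "object": "Springfield"},
  {"subject": "Acme Corp", "predicate": "founded_in", "object": "1987"}
]
"""

# Convert each JSON record into a (subject, predicate, object) triple.
triples = [(r["subject"], r["predicate"], r["object"]) for r in json.loads(raw_output)]

# Deduplicate across chunks -- independent chunk processing produces repeats.
unique_triples = set(triples)
print(unique_triples)
```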

But didn’t we already see that some models extract more knowledge than others? Yes, we did. The run of 111k input tokens and 51k output tokens was with Claude Sonnet 3.5, which was average in past tests. There’s a possible deviation of roughly 20-25% in output tokens, which translates to roughly a 10% deviation in total cost. However, when you see the cost estimates, that 10% won’t have much impact.

Before we look at the extraction cost estimates, three interesting observations:

  • Every model provider has set different prices per token

  • The price difference between “big” models and “small” models is well over a full order of magnitude

  • There’s no consistency in the ratio between the costs of input tokens to output tokens

“Big” Model Costs:

| Model | Input Price (Per 1M Tokens) | Output Price (Per 1M Tokens) | Approximate Extraction Cost |
| --- | --- | --- | --- |
| OpenAI o1 | $15.00 | $60.00 | $4.73 |
| Llama3.1:405B* | $5.32 | $16.00 | $1.41 |
| gpt-4o | $5.00 | $15.00 | $1.32 |
| Claude Sonnet 3.5 | $3.00 | $15.00 | $1.10 |
| Gemini 1.5 Pro | $3.50 | $10.50 | $0.92 |
| Command R+ | $2.50 | $10.00 | $0.79 |
| Mistral Large 2 | $2.00 | $6.00 | $0.53 |

“Small” Model Costs:

| Model | Input Price (Per 1M Tokens) | Output Price (Per 1M Tokens) | Approximate Extraction Cost |
| --- | --- | --- | --- |
| Mixtral 8×7B | $0.70 | $0.70 | $0.11 |
| Claude 3 Haiku | $0.25 | $1.25 | $0.09 |
| gpt-4o-mini | $0.15 | $0.60 | $0.05 |
| Llama3.1:8B* | $0.22 | $0.22 | $0.04 |
| Gemini 1.5 Flash | $0.075 | $0.30 | $0.02 |

*Prices for Llama3.1 based on calling through AWS Bedrock
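
The “Approximate Extraction Cost” column in both tables follows directly from the measured token counts - for o1, 0.111 × $15 + 0.051 × $60 ≈ $4.73. Here’s a small sketch reproducing a few of the other rows; the prices only repeat numbers already listed above.

```python
# Reproduce a few "Approximate Extraction Cost" figures from the measured token counts.
INPUT_TOKENS = 111_000   # measured input tokens for the full extraction
OUTPUT_TOKENS = 51_000   # measured output tokens for the full extraction

# (input price, output price) per 1M tokens, as listed in the tables above
PRICES = {
    "Claude Sonnet 3.5": (3.00, 15.00),
    "Mistral Large 2":   (2.00, 6.00),
    "gpt-4o-mini":       (0.15, 0.60),
    "Gemini 1.5 Flash":  (0.075, 0.30),
}

for model, (in_price, out_price) in PRICES.items():
    cost = (INPUT_TOKENS * in_price + OUTPUT_TOKENS * out_price) / 1_000_000
    print(f"{model}: ${cost:.2f}")
# Claude Sonnet 3.5: $1.10, Mistral Large 2: $0.53, gpt-4o-mini: $0.05, Gemini 1.5 Flash: $0.02
```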

But what does it all mean? If we compare the costs of extraction versus the number of extracted graph edges, it seems there’s no correlation between knowledge extraction performance and cost. Is this enough data to draw conclusions about value? How do we take this data and understand the total costs of GraphRAG? Tune in for Part 2 as we continue to dive deeper down this rabbit hole.