
Case Study: Optimizing LLM Deployments with TrustGraph

VertexAI vs. Intel Tiber Cloud vs. Scaleway

At TrustGraph, we’ve been using VertexAI going all the way back to the PaLM2 days. Up until recently, VertexAI had been a high-throughput API for using Google’s latest and greatest LLMs. Then came the 429 errors. We assumed we needed to increase our quotas or rate limits - that’s what you’d do for almost any other cloud service. We assumed wrong.

We recently discovered that Google is taking a different approach to how they serve VertexAI (although if you’ve been using AWS Bedrock for a while, you can probably guess what I’m about to say next). Demand for Gemini is now handled dynamically and globally. What does that mean? If one region is particularly busy, your requests are likely to return 429 errors. If you can’t increase your quotas or rate limits, what can you do?

The first answer is to set your region to global. A global request is routed to whatever region has the most availability. You have no control over that routing. If all regions are busy, you could still get a 429 error. Making a global request does not guarantee a response.
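For illustration, here’s a minimal sketch of what pointing a request at the global endpoint looks like, assuming the google-genai SDK; the project ID is hypothetical, and the routing (and the possibility of a 429) is entirely out of your hands:

```python
# Minimal sketch: routing a Gemini request through the "global" location
# instead of a fixed region. Assumes the google-genai SDK; the project ID is
# hypothetical. Google decides which region serves the request, and a 429 is
# still possible if every region is busy.
from google import genai

client = genai.Client(
    vertexai=True,
    project="my-gcp-project",   # hypothetical project ID
    location="global",          # global endpoint: no control over which region responds
)

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Summarize this support ticket in one sentence.",
)
print(response.text)
```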

Then what’s the real answer? Open up those wallets and buy provisioned throughput! Provisioned throughput is essentially a reservation for “units” of LLM service. Why did I say “units”? Well, let’s look at the parameters that factor into your provisioned throughput purchase for the latest version of Gemini 2.5 Pro:

  • Percentage of queries using > 200K context window

  • Estimated queries per second

  • Input tokens per query

  • Input image tokens per query

  • Input video tokens per query

  • Input audio tokens per query

  • Output response text tokens per query

  • Output reasoning text tokens per query  

And yes, changing all of these parameters affects the provisioned throughput pricing differently.

So, I thought, what’s a reasonable amount of throughput someone might need to reserve? Without knowing how many users would be connected or the use case, how do you come up with reasonable estimates? What did I do? I guessed. The lowest queries-per-second value you can set is 1. I then picked some very conservative token numbers: 3000 for input and 500 for output. This is certainly more tokens than a customer support chatbot would use, but considerably less than agentic flows with RAG. With those conservative settings, how much provisioned throughput would I need to buy? $35,100/month. Yes, you read that right. And that doesn’t include any images, video, audio, thinking tokens, or long-context requests.
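For perspective, here’s a rough back-of-envelope sketch of what that minimum reservation represents, using the same guessed numbers (1 query/second, 3000 input + 500 output tokens) and assuming the reservation were saturated around the clock, which real traffic never is. This is not Google’s GSU burndown math, just simple arithmetic:

```python
# Back-of-envelope only: what the minimum reservation would work out to if it
# were fully utilized 24/7. These are my guessed parameters from above, not
# Google's actual provisioned throughput formula.
queries_per_second = 1
tokens_per_query = 3000 + 500          # input + output
monthly_cost_usd = 35_100

seconds_per_month = 60 * 60 * 24 * 30
tokens_per_month = queries_per_second * tokens_per_query * seconds_per_month

print(f"Reserved throughput: {tokens_per_month / 1e9:.1f}B tokens/month")
print(f"Implied cost at full utilization: "
      f"${monthly_cost_usd / (tokens_per_month / 1e6):.2f} per million tokens")
```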

My intent is not to be critical of Google. I’m merely stating the facts. I’ve been saying for quite a while now that we’ve yet to see real AI pricing. OpenAI, Anthropic, Google, et al. have been in a customer acquisition hypergrowth phase, selling AI services at a massive loss. However, now that it takes buying $35k of provisioned throughput upfront just to guarantee we won’t get 429 errors, we’re beginning to see real AI pricing. Oh, by the way, Google did suggest a third solution to Gemini 429 errors - fall back to using Claude. Seriously, they really suggested that.

We’ve recently been doing some very interesting work with Intel on supporting Intel CPUs, GPUs, and Gaudi for LLM inference in Intel Tiber Cloud with TrustGraph. We’ve been using TrustGraph to handle all of the LLM orchestration, deploying open models. The results have so far been fascinating. 
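TrustGraph’s orchestration configs are beyond the scope of this post, but as a rough sketch of the client side of an open-model deployment: vLLM and llama.cpp’s llama-server both expose an OpenAI-compatible endpoint, so a deployment can be smoke-tested with the standard openai client. The host, port, and model name below are hypothetical:

```python
# Hedged sketch, not TrustGraph's actual orchestration: smoke-testing an open
# model served behind an OpenAI-compatible endpoint (vLLM and llama.cpp's
# llama-server both expose one). Host, port, and model name are hypothetical.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # wherever the inference server listens
    api_key="not-needed-for-local",        # local servers typically ignore the key
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",   # whatever model the server loaded
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=5,
)
print(resp.choices[0].message.content)
```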

While TGI/vLLM support for Intel GPUs and Gaudi is not as mature as Nvidia support, we’re already seeing very reasonable token throughputs at more than reasonable prices. A real wildcard is using Llamafiles (llama.cpp) to deploy purely on Intel CPUs (their Granite Ridge bare metal instance, with 128 Xeon cores, has run Llama 3.3 70B for us quite well). And the pricing? Well…

  • Granite Ridge 128 core CPU instance: ~$3,300/month

  • 8x GPU bare metal instance: ~$3,400/month

  • 8x Gaudi2 instance: ~$7,500/month

We’ve also been doing some work lately deploying TrustGraph in Scaleway. For comparison, these are monthly prices for Nvidia GPUs in Scaleway:

  • H100-1-80G: ~$2,300/month

  • H100-SXM-8-80G: ~$19,200/month

At the moment, we don’t have a good feel for how the Intel services compare to Nvidia in terms of LLM performance. However, I’d feel comfortable saying Intel’s offerings are certainly in the ballpark of most of the H100 line, though perhaps not the newest ones.

But let’s not forget how we began this conversation - $35,100/month to guarantee a small amount of service for Gemini 2.5 Pro. Now the questions begin: what’s important to you?

  • How much are the latest and greatest Gemini models worth to you? Are you going to be able to run a model of Gemini’s capability with the above Intel/Nvidia options? No. But do you need it?

  • Does data sovereignty matter? If you want control over the physical location of your data, the global option in VertexAI isn’t for you. Your only option is to buy provisioned throughput or deploy open models on Intel/Nvidia.

  • Does cost predictability matter? If you think you can accurately forecast the 8 parameters needed to estimate provisioned throughput in VertexAI, please tell me how you’re doing it. I think the rest of us would immediately shrug our shoulders, close our eyes, move some sliders, and hope for the best. What strange times we live in, when it’s forecasting CPU/GPU costs that gives us certainty.

Is there one right answer? No. Depending on how you answer the above questions (and others), a particular solution will emerge as the best fit for you. We designed TrustGraph with this flexibility in mind. TrustGraph supports all major LLM API services - including the cloud-specific ones like AWS Bedrock, Azure, and VertexAI - and also provides the LLM orchestration to deploy the entire TrustGraph knowledge automation platform, along with open LLMs, in any target environment. To learn more about how TrustGraph merges data silos into AI-optimized knowledge packages in a uniform agentic platform, check out the links below: