What does LLM Temperature Actually Mean?

Spoiler: I still don't know. 🤷

At this point, I thought I knew what temperature means for an LLM. A lower temperature increases determinism, reducing the likelihood of hallucinations or inaccurate responses. Google’s definition echoes this understanding:

“The temperature controls the degree of randomness in token selection. The temperature is used for sampling during response generation, which occurs when topP and topK are applied. Lower temperatures are good for prompts that require a more deterministic or less open-ended response, while higher temperatures can lead to more diverse or creative results. A temperature of 0 is deterministic, meaning that the highest probability response is always selected.”

Google AI Studio Documentation
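For reference, here is a minimal sketch of the textbook way temperature is usually described: divide the model's raw scores (logits) by the temperature, then sample from the resulting softmax distribution. This is an illustration only, not Gemini's actual implementation, and it leaves out the topP and topK filtering mentioned in the quote. The function names are made up for this example.

```python
import math
import random


def token_probabilities(logits, temperature):
    """Softmax over logits divided by a (non-zero) temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_token(logits, temperature, rng=random):
    """Pick a token index; temperature 0 is treated as greedy decoding."""
    if temperature == 0:
        # Always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = token_probabilities(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [1.0, 3.0, 2.0]  # toy scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in token_probabilities(logits, t)])
```

Lower temperatures concentrate probability on the top-scoring token, and higher temperatures flatten the distribution. That is the intuition I was working from.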

That makes sense - but if temperature is so straightforward, why are my tests with Gemini-1.5-Flash-002 so nonsensical???

We’ve been looking into adding what we’re calling “document-level metadata” in the TrustGraph extraction process. While we did add this feature in release 0.13.2, I had been evaluating using an LLM to extract important entities and topics for the entirety of a text corpus. I normally set the temperature to 0.0, since this should produce the most accurate extraction. I ran an extraction with Gemini-1.5-Flash-002. It looked good, except for one problem: I had accidentally set the temperature to 1.0. I reran it at 0.0, and the results were worse. What’s going on?

I’ve never run comparison tests with TrustGraph where I did nothing but vary the temperature, but I decided, why not? For a single document, I did 3 runs at each temperature setting: 0.0, 0.5, 1.0, 1.5, and 2.0. Yes, the temperature of Gemini goes to 2.0. No, I don’t know why. For the other parameters, I set top_p=1.0 and top_k=40, with output tokens maxed out at 8192 for all runs. I also used a JSON schema object for the response type.
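The sweep described above can be sketched roughly like this. The `generate` argument is a hypothetical stand-in for whatever function actually calls the model and returns the output token count; the config key names are illustrative rather than a specific SDK's, and the JSON schema setting is left out for brevity.

```python
TEMPERATURES = [0.0, 0.5, 1.0, 1.5, 2.0]
RUNS_PER_TEMPERATURE = 3


def build_config(temperature):
    """Settings held constant across the sweep, apart from temperature."""
    return {
        "temperature": temperature,
        "top_p": 1.0,
        "top_k": 40,
        "max_output_tokens": 8192,
    }


def run_sweep(generate, document):
    """Call generate(document, config) 3 times per temperature.

    `generate` is a hypothetical callable that returns the number of
    output tokens for one extraction run.
    """
    results = {}
    for temperature in TEMPERATURES:
        config = build_config(temperature)
        results[temperature] = [
            generate(document, config) for _ in range(RUNS_PER_TEMPERATURE)
        ]
    return results
```

Passing the model call in as a function keeps the loop itself trivial to check with a stub before spending any API quota.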

Given my understanding of temperature, I expected Gemini to extract more information, returning more objects as the temperature increased. I would think a more deterministic response would be more conservative in how much information it extracts. Except that didn’t happen. Then again, my hypothesis wasn’t really proven wrong either. In fact, I’m not sure what these results mean.

The first document I tested was the Rogers Commission Report on the NASA Challenger disaster. That PDF extracts to 176k tokens, 17.6% of Gemini-1.5-Flash’s advertised context window. For each run, here’s the number of output tokens:

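| Temp  | 0.0  | 0.5  | 1.0  | 1.5  | 2.0  |
|-------|------|------|------|------|------|
| Run 1 | 8192 | 8192 | 8192 | 1818 | 319  |
| Run 2 | 1511 | 1425 | 8192 | 507  | 2147 |
| Run 3 | 8192 | 8192 | 1303 | 8191 | 393  |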

The second document was another NASA report, this one on the decision making behind the Columbia disaster. That PDF extracts to 24.4k tokens, 2.4% of the advertised context window.

| Temp  | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
|-------|-----|-----|-----|-----|-----|
| Run 1 | 340 | 427 | 290 | 379 | 747 |
| Run 2 | 340 | 415 | 375 | 364 | 361 |
| Run 3 | 340 | 696 | 355 | 379 | 331 |

The inconsistency of the first set of test runs is inexplicable. In 6 of the 15 runs, Gemini tried to extract more than the maximum 8192 tokens, returning an incomplete and invalid JSON object. Yet, what about run 2, when even at a temperature of 0.0 Gemini returned only 1511 tokens? Why did increasing the temperature to 2.0 decrease the output so dramatically? The data is so inconsistent that I don’t know where to begin to draw any conclusions.
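One way to separate "ran out of room" from "just produced bad JSON" is to check whether the response parses and whether it used the full token budget. This is a small sketch using only the standard library; the function name and the three labels are my own, and if your API reports a finish reason, that is a more direct signal than the token count.

```python
import json

MAX_OUTPUT_TOKENS = 8192  # the output cap used in these runs


def classify_response(text, output_tokens, limit=MAX_OUTPUT_TOKENS):
    """Return 'ok', 'truncated', or 'invalid' for one extraction response."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        # Broken JSON at the token cap most likely means the output was cut off.
        return "truncated" if output_tokens >= limit else "invalid"
    return "ok"


print(classify_response('{"entities": ["O-ring"]}', 12))
print(classify_response('{"entities": ["O-ring", "Sol', 8192))
```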

The data for the second document is more consistent. For instance, at a temperature of 0.0, it returned the same number of tokens all 3 times. When increasing the temperature to 0.5, the response sizes did increase as I predicted. And then there’s temperature 1.0, where the response sizes go down. Beyond 1.0, the responses mostly go down, with one outlier at 2.0 where the response was roughly 2x the size of the others.
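To make the comparison less eyeball-driven, the second document's numbers can be summarized per temperature. The values below are copied from the table for the Columbia report; the `summarize` helper is just an illustration, and three runs per setting is far too few to read much into the means.

```python
from statistics import mean

TEMPERATURES = [0.0, 0.5, 1.0, 1.5, 2.0]

# Output tokens for the Columbia report: one row per run,
# one column per temperature (same order as TEMPERATURES).
COLUMBIA_RUNS = [
    [340, 427, 290, 379, 747],
    [340, 415, 375, 364, 361],
    [340, 696, 355, 379, 331],
]


def summarize(runs, temperatures=TEMPERATURES):
    """Mean, min, max, and spread of output tokens at each temperature."""
    summary = {}
    for column, temperature in enumerate(temperatures):
        values = [row[column] for row in runs]
        summary[temperature] = {
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
            "spread": max(values) - min(values),
        }
    return summary


for temperature, stats in summarize(COLUMBIA_RUNS).items():
    print(f"{temperature:.1f}  mean={stats['mean']:.0f}  spread={stats['spread']}")
```

The spread column is the more telling one here: it is 0 at a temperature of 0.0 and 281 at 0.5.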

With this data, can I draw any meaningful conclusions? Yes, I think I can.

  • Long context windows still aren’t reliable. Even at only 17.6% of Gemini’s advertised context window, the behavior is shockingly inconsistent.

  • At a much smaller context, the temperature behavior seems to be more consistent, but still a bit mysterious.

  • For knowledge extraction tasks, temperature doesn’t work the way we think it should.

Sure, the consistency of those 3 runs at a temperature of 0.0, where it returned the same output all 3 times, seems great, but what if we want more? For knowledge extraction and graph building in TrustGraph, we’re trying to extract every important detail from the input document. We don’t want just facts, but any meaningful statements or opinions described in the text. It appears that allowing the LLM to introduce some randomness in the response tokens produces more objects for information extraction. Bizarrely, I also noticed that higher temperatures seemed to return more people than lower temperatures did. Based on cursory glances, none of the responses seemed to contain hallucinations, but confirming that will require more testing.