Ask an AI chatbot what the capital of France is, and it will say Paris every time. Ask it to write you a cover letter, and you will get something different on every single attempt. Most people chalk this up to vague notions of the AI ‘being creative’ or ‘learning from the conversation,’ but neither of those is what’s actually happening. The real explanation lives in a single floating-point number baked into almost every language model deployment, and once you understand it, you will never look at AI output the same way again.
This connects to a broader pattern worth understanding: AI systems are finding patterns in data that human brains are physically incapable of seeing, and the mechanisms they use to generate output are just as counterintuitive as the patterns they discover.
What the Model Is Actually Doing When It ‘Thinks’
Let’s back up to first principles. A large language model doesn’t compose sentences the way you do. It doesn’t have a thought, then words. Instead, it operates token by token, where a token is roughly a word or a word fragment, and at each step it produces a probability distribution over its entire vocabulary. Think of it like a weighted slot machine with 50,000 slots, where each slot is a possible next token and the weights reflect how likely that token is given everything that came before it.
For a factual question like ‘What is the capital of France,’ the model produces a distribution that looks something like this (simplified for illustration):
'Paris' → 0.97
'Lyon' → 0.01
'Berlin' → 0.005
'the' → 0.004
... (remaining probability mass scattered across vocabulary)
The model is overwhelmingly confident. In practice, it will pick ‘Paris’ almost every time (with a 97% weight, sampling still occasionally lands elsewhere unless the randomness is turned all the way down). But for a prompt like ‘Write an opening sentence for a mystery novel,’ the distribution might be nearly flat across hundreds of plausible options. This is where the interesting engineering begins.
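The difference between always taking the top token and sampling from the distribution can be sketched in a few lines. This is a toy illustration using the made-up numbers from the example above; the function names are mine, not any library’s API:

```python
import random

# Toy next-token distribution from the example above (a real
# vocabulary has tens of thousands of entries).
next_token_probs = {"Paris": 0.97, "Lyon": 0.01, "Berlin": 0.005, "the": 0.004}

def greedy_pick(probs):
    # Always take the single most likely token.
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    # Draw one token at random, weighted by its probability.
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights)[0]

greedy_pick(next_token_probs)                   # → 'Paris', every single time
sample_pick(next_token_probs, random.Random())  # usually 'Paris', occasionally not
```

The greedy pick is deterministic; the sampled pick is where run-to-run variation comes from.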
Temperature: The Knob Nobody Told You About
Before the model samples from that probability distribution, it applies a transformation controlled by a parameter called temperature. The name is borrowed from statistical thermodynamics (yes, really) and it works like this: the model takes the raw output scores (called logits), divides them by the temperature value, and then converts them into probabilities using a softmax function.
When temperature is low (say, 0.1 or 0.2), dividing by a small number makes the high-scoring tokens score even higher relative to the rest. The distribution becomes sharper, more peaked. The model almost always picks the most likely token. Responses are consistent, predictable, and, frankly, a bit boring.
When temperature is high (say, 1.5 or 2.0), dividing by a large number flattens everything out. Tokens that were unlikely become comparably likely to tokens that were highly favored. The model starts picking surprising, lower-probability tokens. Responses become varied, sometimes imaginative, sometimes incoherent.
At temperature 0, you get fully deterministic output: the model simply takes the single most likely token every time (implementations special-case this rather than literally dividing by zero). Same input, same output, every time. At temperature 1, you’re sampling from the raw distribution as the model learned it. Above 1, you’re actively flattening that distribution and introducing more randomness than the model itself would naturally produce.
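The arithmetic described above fits in a few lines. This is a from-scratch sketch of temperature-scaled softmax, not any particular framework’s implementation, and the logits are invented for illustration:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide the raw scores by the temperature, then convert to
    # probabilities with a softmax (subtracting the max first for
    # numerical stability).
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 2.0, 1.0]  # hypothetical raw scores for three tokens

sharp = softmax_with_temperature(logits, 0.2)  # low temperature: peaked
flat = softmax_with_temperature(logits, 2.0)   # high temperature: flattened
# 'sharp' puts almost all of the probability mass on the first token;
# 'flat' spreads it far more evenly across all three.
```

Dividing by 0.2 multiplies every score by 5, stretching the gaps between them before the softmax exponentiates; dividing by 2.0 compresses those gaps, which is exactly the sharpening and flattening the text describes.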
This is why customer service bots sound robotic and repetitive (low temperature, by design) while creative writing tools feel alive and unpredictable (higher temperature, also by design). It’s not magic. It’s arithmetic.
The Related Concept You Should Also Know: Top-p Sampling
Temperature isn’t the only dial. Many production systems also use something called top-p sampling (sometimes called nucleus sampling). Instead of scaling the whole distribution, top-p works by truncating it: the model considers only the smallest set of tokens whose cumulative probability reaches at least p, then samples from just that set.
Set p to 0.9, and the model considers only the tokens that together account for 90% of the probability mass. The long tail of weird, low-probability tokens gets cut off entirely. This is often combined with temperature to get responses that are varied but not deranged.
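Here is a toy sketch of that truncation step, reusing the illustrative ‘capital of France’ numbers from earlier. Real implementations operate on tensors over the full vocabulary, but the logic is the same:

```python
def top_p_filter(probs, p):
    # Keep the smallest set of highest-probability tokens whose
    # cumulative mass reaches p, then renormalize over that set.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

dist = {"Paris": 0.97, "Lyon": 0.01, "Berlin": 0.005, "the": 0.004}
nucleus = top_p_filter(dist, 0.9)
# 'Paris' alone already covers more than 90% of the mass, so for this
# peaked distribution the nucleus collapses to a single token.
```

Notice how top-p adapts to the distribution: on a confident, peaked distribution it keeps one or two tokens, while on a flat creative-writing distribution the same p value might keep hundreds.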
The practical upshot is that developers tuning an AI assistant are essentially mixing two independent controls: how peaked or flat the distribution is (temperature) and how much of the vocabulary is even in play at any given step (top-p). Getting this combination right is genuinely difficult. Too conservative and the model sounds like it’s reading from a script. Too permissive and it goes sideways in ways that are hard to predict, which matters enormously when, as we’ve explored before, AI systems make confident predictions about things they’ve never seen before and that confidence doesn’t necessarily track with accuracy.
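Putting the two dials together, a complete toy sampling step might look like the sketch below. This is a minimal illustration under the assumptions already laid out (invented logits, my own function names), not production code:

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_p=0.9, rng=None):
    # logits: dict mapping token -> raw score. Temperature reshapes
    # the distribution; top-p then cuts off its tail before sampling.
    rng = rng or random.Random()
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    z = sum(exps.values())
    ranked = sorted(((tok, e / z) for tok, e in exps.items()),
                    key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    tokens = [tok for tok, _ in kept]
    weights = [prob for _, prob in kept]  # relative weights; renormalizing is unnecessary
    return rng.choices(tokens, weights=weights)[0]

# Conservative settings all but guarantee the top token:
sample_next_token({"Paris": 5.0, "Lyon": 1.0, "Berlin": 0.5},
                  temperature=0.2, top_p=0.9)  # → 'Paris'
```

Tuning an assistant amounts to choosing the `temperature` and `top_p` arguments of a loop like this one, run once per generated token.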
Why This Matters Beyond Trivia
Understanding temperature has real consequences for how you use and build AI-powered tools.
If you are using an API directly, you can set temperature yourself. For tasks like code generation, data extraction, or answering factual questions, you want low temperature (0.0 to 0.3). For brainstorming, drafting, or creative tasks, higher temperature (0.7 to 1.0) often produces more interesting results. The model’s capabilities don’t change. Only its sampling behavior does.
If you are evaluating AI tools, the inconsistency you see between runs is a feature of the configuration, not necessarily a flaw in the underlying model. A competitor’s chatbot might seem ‘smarter’ simply because it uses a higher temperature for the kind of prompts you’re testing. This is the sort of subtle variable that makes apples-to-apples comparisons genuinely hard.
And if you are building AI-powered products, the decision about what temperature to deploy at is a product decision as much as a technical one. It shapes user experience, trust, and the nature of errors your system makes. A high-temperature system fails loudly and creatively. A low-temperature system fails quietly and repetitively. Neither failure mode is obviously better. It depends entirely on the use case, and as with so many product decisions, the right answer is context-specific rather than universal.
The Deeper Lesson
There’s something philosophically interesting buried in all of this. The ‘creativity’ of an AI system is, at one level, just controlled randomness applied to learned probability distributions. That shouldn’t diminish it entirely, because the distributions themselves encode an enormous amount of structure about language, reasoning, and knowledge. But it does reframe what we mean when we say an AI is ‘thinking’ versus ‘generating.’
The model’s knowledge is fixed after training. What changes between runs is which path through the probability space it happens to walk. Temperature is the parameter that decides how adventurous those steps are. Every surprising turn of phrase, every unexpectedly elegant explanation, every nonsensical hallucination, all of it flows from this one simple number and the probability distributions it shapes.
Next time a chatbot gives you a weird answer, before you blame the model, it’s worth asking what temperature someone decided to run it at. The answer might be more illuminating than you expect.