Most people treat prompting like sending an email. You write something, the model reads it, the model responds. Clean, linear, intuitive. This mental model is wrong, and the gap between it and reality explains a surprising number of frustrating AI interactions.
Before a large language model generates a single token of response, your prompt has already been through several significant transformations. Understanding what happens in that pre-generation window isn’t academic. It changes how you write prompts, why certain inputs fail, and why the model you think you’re talking to is never quite the model you’re actually talking to.
Tokenization breaks your meaning before anything else
The first thing that happens to your prompt is tokenization. Your text gets split into tokens, which are chunks that may be whole words, common word fragments, or individual characters, depending on the vocabulary. GPT-style models typically use byte-pair encoding, where the tokenization is learned from training data rather than being linguistically principled.
This creates real problems. The word “unfortunately” might become three tokens. A technical term your domain uses frequently might be split at an unexpected boundary. Numerals, punctuation, and whitespace get handled in non-obvious ways. The model never sees “unfortunately” as a single concept in the way you do. It sees a sequence of sub-word units, each with its own embedding.
Why does this matter? Because the model’s understanding is built on those token sequences, not on your original string. Unusual spellings, invented terminology, and anything that falls outside common training patterns gets carved up in ways that scatter meaning across multiple tokens, making the model’s job harder before it even starts.
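To make the fragmentation concrete, here is a toy greedy longest-match tokenizer over an invented vocabulary. Real BPE works by merging byte pairs learned from data, and real vocabularies hold tens of thousands of entries; this sketch only illustrates how a word you see as one unit can arrive at the model as several.

```python
# Invented toy vocabulary -- not any real model's token set.
TOY_VOCAB = {"un", "fortun", "ately", "re", "factor", "the", "bug", " "}

def tokenize(text: str) -> list[str]:
    """Greedily take the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):  # try longest substring first
            if text[i:j] in TOY_VOCAB:
                match = text[i:j]
                break
        if match is None:  # fall back to a single character
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("unfortunately"))     # ['un', 'fortun', 'ately']
print(tokenize("refactor the bug"))  # ['re', 'factor', ' ', 'the', ' ', 'bug']
```

Note that “refactor” fragments differently than “fix” would: the model’s view of your verb depends on what the vocabulary happened to learn, not on your intent.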
Your prompt gets embedded into a space shaped by everything the model learned
After tokenization, each token is mapped to a high-dimensional vector, its embedding. These embeddings encode semantic relationships learned during training. “King” and “queen” occupy nearby positions. “Paris” and “France” have a relationship that mirrors “Tokyo” and “Japan.”
Here’s the part most people miss: your prompt doesn’t exist in isolation in this space. It arrives loaded with the statistical neighborhood of every word you used. When you write “optimize,” the model is working with an embedding shaped by every context in which “optimize” appeared in training, across code, business writing, biology papers, and wherever else. The meaning your prompt carries is partly your meaning and partly the aggregate of every similar sentence ever written.
This is why word choice matters so precisely. Asking the model to “fix” code versus “refactor” code versus “improve” code activates different regions of the model’s learned associations, even when you mean the same thing.
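The geometry behind this can be sketched with cosine similarity, the standard measure of how close two embeddings are. The vectors below are invented purely for illustration (real embeddings have hundreds or thousands of dimensions, learned from data), but the computation is the real one.

```python
import math

# Hypothetical 4-dimensional embeddings. These numbers are made up to
# illustrate the geometry; they are not from any actual model.
embeddings = {
    "fix":      [0.9, 0.1, 0.3, 0.0],
    "refactor": [0.7, 0.6, 0.2, 0.1],
    "improve":  [0.5, 0.3, 0.8, 0.2],
}

def cosine_similarity(a, b):
    """Angle-based closeness of two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

for word in ("refactor", "improve"):
    sim = cosine_similarity(embeddings["fix"], embeddings[word])
    print(f"fix vs {word}: {sim:.3f}")
```

Nearby words pull the model toward nearby regions of its learned associations, which is why “fix” and “refactor” are not interchangeable even when you mean the same thing.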
The system prompt and context window set the frame you can’t see
When you interact with a commercial AI product, you’re almost never sending a bare prompt. There’s a system prompt you didn’t write, and often conversation history, retrieved documents, or injected metadata sitting above your message in the context window. The prompt you write is not the prompt the model reads, and the system prompt is the most consequential part you have no visibility into.
The model doesn’t experience your message as “user asks question.” It experiences the full concatenation: system instructions, any retrieved context, prior conversation turns, then your message. The attention mechanism the model uses to process this input allows any part of the context to influence any other. A restriction buried in the system prompt can suppress an answer to a question you ask three thousand tokens later.
The practical upshot is that the same prompt can produce dramatically different outputs on different platforms, not because the underlying model changed, but because the context frame around your message did.
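A minimal sketch of that assembly step looks like this. The role tags and layout here are invented for illustration; every vendor uses its own template, which is exactly the point, since the template is part of what makes the same prompt behave differently across platforms.

```python
# Illustrative context assembly -- the structure, not any vendor's format.
def assemble_context(system_prompt, history, user_message, retrieved=None):
    """Concatenate everything the model actually reads, in order."""
    parts = [f"[system]\n{system_prompt}"]
    if retrieved:
        parts.append("[retrieved context]\n" + "\n".join(retrieved))
    for role, text in history:  # prior conversation turns
        parts.append(f"[{role}]\n{text}")
    parts.append(f"[user]\n{user_message}")
    return "\n\n".join(parts)

full_input = assemble_context(
    system_prompt="You are a concise assistant. Never give medical advice.",
    history=[("user", "Hi"), ("assistant", "Hello!")],
    user_message="What does this error mean?",
)
print(full_input)  # the model sees all of this, not just your last message
```

Everything above your message in that string can shape the answer, including instructions you never saw.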
Attention is computed across the entire context simultaneously
Once your prompt is tokenized, embedded, and assembled with its context, the transformer architecture processes the entire sequence at once through its attention layers. This is not reading in any sequential sense. The model computes relationships between every token and every other token in the window simultaneously.
This is computationally expensive (it scales quadratically with context length, which is why extending context windows is technically hard), but it means the model has access to the full relational structure of your input before it generates anything. The structure of your prompt, including what comes first, how ideas are grouped, and which terms appear near each other, influences the attention weights that shape the model’s internal representation.
Long, rambling prompts aren’t just harder to read. They create a noisier attention pattern where the signal of what you actually want gets diluted by irrelevant tokens competing for weight.
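The shape of that computation can be shown in a few lines. This is scaled dot-product attention in pure Python over toy vectors; real models run it per head, per layer, over learned query/key projections, but the quadratic structure is the same: n tokens produce an n-by-n matrix of weights.

```python
import math

def attention_weights(queries, keys):
    """Softmax of scaled dot products: one row of weights per query token."""
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)                      # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights  # n rows of n weights each: quadratic in sequence length

# Three toy token vectors standing in for a three-token prompt.
vecs = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
w = attention_weights(vecs, vecs)
print(len(w), len(w[0]))  # 3 3 -- every token attends to every token
```

Each row sums to 1, so every irrelevant token in the context claims a slice of weight that could have gone to the tokens that matter.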
The counterargument
Some would argue this level of detail is unnecessarily technical for practical use. Plenty of people get useful work out of LLMs without understanding tokenization or attention mechanisms, and that’s true. Good instincts about clarity and specificity will get you far without any of this.
But “you can ignore it” and “it doesn’t matter” are different claims. The people who hit systematic failures with LLMs, who can’t figure out why their careful prompt keeps producing the wrong output, are often running into exactly these mechanisms without knowing it. Understanding why invented proper nouns confuse models, why very long contexts produce worse answers, or why the same question on different platforms gives contradictory responses requires knowing what happens before the first token generates. The mechanism explains the behavior.
Prompting is more like compiling than conversing
The mental model I’d argue for is this: prompting is closer to writing code than sending a message. Your text gets transformed through a deterministic pipeline before the generative process begins, and the quality of what comes out depends heavily on what you put in and how it survives those transformations.
Treat token boundaries as something to work with, not through. Keep context tight and relevant. Understand that your words arrive pre-loaded with associations you didn’t choose. The model you’re talking to is processing a transformed, contextualized, attention-weighted version of what you wrote. Knowing that changes what “good prompting” actually means.