The Simple Version

Your prompt gets converted into a list of numbers, and the model never reads words at all. Everything it does, it does with math on those numbers.

Step One: Your Words Disappear

The first thing that happens when you hit send is tokenization. The model doesn’t receive your sentence, it receives a sequence of tokens, which are chunks of text that usually correspond to whole words or common word fragments. The word “unbelievable” might become two tokens: “unbel” and “ievable”. The word “cat” is probably one. A space before a word often gets folded into the token itself.

This matters more than it sounds. Because tokenization is learned from the training data, common English words get clean, single tokens. Rare words, names, and technical jargon get split into awkward fragments. When you ask a model to count the letters in a word, it often fails not because it can’t count, but because it never sees the letters, only the token. The word “strawberry” is famously difficult for models to spell backwards for exactly this reason.

OpenAI’s tiktoken library is public, and you can run your own text through it to see exactly how GPT-4 chops up your input. Try it once and you’ll never assume words are the unit of processing again.

Step Two: Tokens Become Vectors

Each token gets looked up in an embedding table and converted into a high-dimensional vector, which is just a long list of floating-point numbers (imagine a list of 768 or 1536 numbers instead of a word). This is the embedding.

The interesting property of embeddings is that they encode meaning through proximity. Vectors for “king” and “queen” sit closer together in this high-dimensional space than vectors for “king” and “carburetor”. The model learned these positions entirely from statistics: words that appear in similar contexts end up with similar vectors. No human labeled any of it.

At this point, your prompt is a matrix. A sequence of vectors, one per token, stacked together. This is the actual input the transformer processes.

Attention weight heatmap showing how words in a sentence relate to each other during transformer processing
Each cell in an attention heatmap represents how strongly one token 'attends' to another when building its contextual representation. Brighter cells mean stronger attention.

Step Three: Position Gets Encoded

Here’s a subtle problem: a matrix of vectors doesn’t inherently carry order. The phrase “dog bites man” would produce the same set of vectors as “man bites dog” if you didn’t do something about it.

The solution is positional encoding. Before processing begins, each token’s vector gets modified based on its position in the sequence. Token 1 gets adjusted one way, token 2 another way, and so on. The specific method varies by architecture (learned embeddings vs. sinusoidal functions vs. more recent approaches like RoPE, or Rotary Position Embedding), but the goal is the same: give the model a way to distinguish “A follows B” from “B follows A”.

This step is why context window limits exist and matter. The positional encoding scheme only covers a finite sequence length. Beyond that limit, the model has no way to represent position, so inputs get truncated. Your 50,000-word document doesn’t get processed in full if the model’s context window is 32,000 tokens. It gets cut.

Step Four: The Attention Mechanism Runs

Now the actual computation happens. The transformer architecture processes your vector sequence through multiple layers of attention, and this is where the model does something genuinely interesting.

Attention, roughly speaking, lets each token look at every other token and decide how much to “attend” to it when building its own representation. The token for “it” in the sentence “The cat sat on the mat because it was tired” needs to figure out that “it” refers to “cat”, not “mat”. Attention is the mechanism that does this. Every token asks, in effect, which other tokens are relevant to understanding me?

This runs in parallel across all tokens simultaneously (unlike older recurrent architectures that processed tokens one at a time), which is part of why transformers are fast to run on GPUs, which are built for parallel matrix math.

Each layer of the network builds a richer representation. Early layers tend to capture syntax and local relationships. Later layers capture more abstract semantic relationships. By the time your prompt has passed through all the layers, every token’s vector has been updated to reflect its relationship to every other token in the context.

Step Five: The Model Starts Predicting

Only after all of this processing does the model produce output, and it does so one token at a time. It takes the fully processed representation of your entire prompt and asks: given all of this, what token is most likely to come next?

It produces a probability distribution over its entire vocabulary (often 50,000 to 100,000 tokens), selects one based on that distribution, appends it to the context, and then runs the whole attention calculation again on the now-longer sequence to predict the next token after that.

This is why generation is slower than you might expect, and why it gets slower as the response gets longer. Each new token requires processing the entire context so far. As the reply grows, the computation grows with it.

It’s also why the model doesn’t “know” what it’s going to say before it says it. The word at position 50 of the response depends on the words at positions 1 through 49, which weren’t chosen until the model got there. The output is constructed incrementally, not retrieved from storage.

Understanding this changes how you think about prompting. The model doesn’t read your instructions and then think. It processes everything, instructions and examples and context together, as a unified sequence of vectors, and the output falls out the other end as a series of probabilistic choices. Structure, order, and specificity in your prompt aren’t just politeness. They literally shape the vector representations that drive every prediction. This is also why the prompt you actually submit may not be the prompt the model processes, since many systems inject additional context, system instructions, or safety wrappers around your text before it ever reaches the model.

The machinery is genuinely strange. It is not reading your words. It is doing linear algebra on compressed statistical summaries of all the text it was trained on. The fact that this produces coherent, useful answers is, if you think about it honestly, pretty remarkable.