A List of Numbers That Somehow Understands Language

An embedding is a list of floating-point numbers. That’s the complete technical description. You feed in a word, a sentence, an image, or a product listing, and a model returns something like [0.21, -0.84, 0.03, 0.67, ...], continuing for hundreds or thousands of dimensions. Those numbers are coordinates in a geometric space.

That’s not a metaphor. The model is literally placing objects at positions in a high-dimensional space, and the position encodes meaning.

The question worth sitting with is: why does this work? Why would translating language or images into coordinates preserve anything semantically useful? The answer reveals something deep about the structure of meaning itself, and about what neural networks are actually learning.

What the Coordinates Actually Represent

Imagine you could summarize every book ever written by plotting each one in a three-dimensional space. You might put crime thrillers near one corner, literary fiction near another, and technical manuals somewhere else entirely. Books that share themes or tone end up close together. The coordinates don’t describe the books directly; they encode their relationships to everything else.

Embeddings work on the same principle, just in many more dimensions. When a model is trained, it never explicitly learns rules like “synonyms should be close together” or “antonyms should be far apart.” Instead, it learns coordinates that minimize prediction error across enormous amounts of training data. The semantic structure emerges as a side effect of getting good at the task.

This is the part people often gloss over. The coordinates aren’t designed. They’re discovered. A well-trained model learns that “dog” and “puppy” should be nearby in its coordinate space because that placement consistently helps it predict language correctly. The geometry is a compressed record of statistical patterns across billions of examples.

[Diagram: how different training objectives produce different geometric arrangements of the same concepts in embedding space. The same concepts, positioned differently depending on what the model was trained to predict; the training objective is the cartographer.]

The Geometry of Meaning

The famous demonstration is the word2vec result published by researchers at Google in 2013: king - man + woman ≈ queen. You take the vector for “king”, subtract the vector for “man”, add the vector for “woman”, and the nearest point in the space corresponds to “queen”.

This works because the model learned a consistent direction in its space that encodes the concept of gender across many word pairs. “Actor” and “actress” are separated by approximately the same vector offset as “king” and “queen”. The relationships between concepts become directions you can navigate.
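
If you want to try this yourself, the gensim library ships the original pretrained vectors. A minimal sketch, assuming gensim is installed and you’re willing to wait for the sizeable word2vec-google-news-300 download:

```python
# Analogy arithmetic with pretrained word2vec vectors.
# Assumes the gensim library and its downloadable word2vec-google-news-300
# vectors (a large one-time download).
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")

# "king" - "man" + "woman": gensim adds the positive vectors, subtracts the
# negative one, and returns the nearest words by cosine similarity.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```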

This arithmetic isn’t a party trick. It’s evidence that the model has learned something structurally real about language, not just a lookup table of words and their associations. The representation is compositional. You can combine and manipulate concepts geometrically in ways that correspond to meaningful semantic operations.

The same pattern shows up in image embeddings. Models trained on images learn directions that correspond to “more colorful”, “closer to a human face”, “outdoor vs. indoor”, without anyone labeling those dimensions explicitly.

The Curse and the Gift of High Dimensions

Our intuition about space comes from three dimensions. In high dimensions, geometry behaves strangely, and most of those strange behaviors actually help embeddings work better.

In three-dimensional space, two random points can sit almost anywhere relative to each other. In a 1,536-dimensional space (the output size of OpenAI’s text-embedding-3-small model), something counterintuitive happens: random points end up nearly the same distance from one another and nearly orthogonal, so small differences in angle carry real signal. When two vectors are meaningfully close, that closeness means something. The noise floor drops because coincidental proximity becomes rare.
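
You can see the concentration effect with nothing but random vectors, which stand in here for unrelated embeddings (a simplification; real embedding spaces are not this uniform):

```python
# In high dimensions, independent random vectors are almost orthogonal:
# pairwise cosine similarities concentrate near zero.
import numpy as np

rng = np.random.default_rng(0)
for dim in (3, 1536):
    x = rng.standard_normal((1000, dim))
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # unit vectors
    sims = x @ x.T                                  # pairwise cosine similarities
    off_diag = sims[~np.eye(len(sims), dtype=bool)] # drop self-similarities
    print(f"dim={dim}: mean |cos|={np.abs(off_diag).mean():.3f}, "
          f"max |cos|={np.abs(off_diag).max():.3f}")
```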

This is why cosine similarity, which measures the angle between two vectors rather than the raw distance, works so well for comparing embeddings. Two documents can be at very different scales (one long, one short), but if they’re about similar things, their vectors will point in roughly the same direction in the high-dimensional space.
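
Cosine similarity itself is nothing exotic: it’s the dot product of the two vectors divided by the product of their lengths, the cosine of the angle between them. A minimal NumPy version:

```python
# Cosine similarity: normalized dot product, i.e. the cosine of the angle
# between the two vectors. Ranges from -1 (opposite) to 1 (same direction).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```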

The practical consequence: you can take two arbitrary pieces of text, compute their embeddings independently, calculate cosine similarity, and get a meaningful measure of semantic relatedness without any shared context. This is what powers semantic search. You don’t need the query to share keywords with the documents. You need them to be near each other in embedding space.
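
Here’s a sketch of that pipeline, assuming the OpenAI Python client and the text-embedding-3-small model mentioned above; any other embedding model slots in the same way:

```python
# Semantic search sketch: embed documents and a query independently,
# then rank documents by cosine similarity to the query.
# Assumes the OpenAI Python client and an API key in OPENAI_API_KEY.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

docs = [
    "How to fix a leaking kitchen faucet",
    "A beginner's guide to training word embeddings",
    "Quarterly earnings report for fiscal year 2024",
]
doc_vecs = embed(docs)
query_vec = embed(["teaching a model to represent words as vectors"])[0]

# Cosine similarity between the query and every document; no shared keywords needed.
scores = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```

In production you would precompute and index the document vectors (often behind an approximate nearest-neighbor index) rather than scoring every document on every query, but the geometry is the same.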

Why Training Objectives Shape the Space

Not all embeddings are equal, and the difference comes down to what the model was trained to do.

Word2vec was trained to predict words from their neighbors (or neighbors from a word, depending on the variant). BERT was trained to predict masked tokens in bidirectional context. Models like Sentence-BERT were fine-tuned specifically on pairs of semantically similar and dissimilar sentences. Each training objective carves up the space differently.

If your model learned coordinates by predicting next tokens, items that co-occur frequently end up close together. If it learned by distinguishing similar from dissimilar sentence pairs, the space is more calibrated for semantic similarity. Why shrinking an AI model often makes it more useful touches on a related point: the task shapes the representation, and a model optimized for one thing may not be the right tool for another.

This has practical implications. An embedding model trained on general web text will produce a different geometric structure than one fine-tuned on medical records or legal documents. Using the wrong embedding model for your domain doesn’t just give you slightly worse results. It can give you a space where semantically important distinctions in your field (“benign” vs. “malignant”, say) are geometrically muddled.

The Compression Argument

There’s another way to understand why embeddings work: they’re forced to be efficient.

A vocabulary of 50,000 words could be represented as 50,000-dimensional one-hot vectors (one slot per word, every slot zero except one). This representation contains no information about relationships. “Cat” and “kitten” are as far apart as “cat” and “antibiotic”.
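
A quick illustration of how little a one-hot representation knows:

```python
# One-hot vectors carry no relational information: every pair of distinct
# words is orthogonal, so cosine similarity is 0 whether or not the words
# are related. (Toy three-word vocabulary for illustration.)
import numpy as np

vocab = {"cat": 0, "kitten": 1, "antibiotic": 2}

def one_hot(word: str) -> np.ndarray:
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(one_hot("cat"), one_hot("kitten")))      # 0.0
print(cosine(one_hot("cat"), one_hot("antibiotic")))  # 0.0
```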

When you force all meaning into, say, 300 dimensions, the model must share coordinate space across all concepts. Similar things have to live near each other simply because there isn’t room for everything to be isolated. The bottleneck creates structure. This is sometimes called the information bottleneck principle: compression forces abstraction.

This is also why larger embedding dimensions don’t always win. A 3,072-dimensional embedding has more room to encode nuance, but it also has more room to store noise. A well-trained smaller embedding can outperform a poorly trained larger one because the compression forced the smaller model to find cleaner generalizations.

What Happens When the Space Breaks Down

Embeddings fail in predictable ways once you understand their structure.

They fail at negation. “No evidence of cancer” and “cancer” are semantically opposite, but their surface text is similar enough that many embedding models place them nearby. The geometric representation doesn’t naturally encode negation as distance.

They fail at rare concepts. If a term appears infrequently in training data, its position in the space is less reliable, essentially determined by a few examples rather than robust statistical patterns.

They fail at domain shift. A model trained on Wikipedia embeds the word “discharge” based on its common general usage. In a medical context, that word carries a specific meaning that will be poorly positioned relative to clinically relevant neighbors.

And they fail silently. Unlike a system that throws an error, an embedding model always returns a number. A badly calibrated embedding for a rare or out-of-domain term looks identical in format to a well-calibrated one. This is arguably the most dangerous failure mode. What most explanations get wrong about embeddings covers some of these pitfalls in more detail.
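
These blind spots are easy to probe directly. A minimal check of the negation failure, assuming the sentence-transformers library and its all-MiniLM-L6-v2 checkpoint (any general-purpose embedding model could be swapped in):

```python
# Probe the negation blind spot: embed a negated clinical phrase and the
# bare term, then see how close they land. Assumes the sentence-transformers
# library and its all-MiniLM-L6-v2 model; results vary by model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
a, b = model.encode(["no evidence of cancer", "cancer"])

similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {similarity:.3f}")  # often higher than opposite meanings would suggest
```

If the score comes back high for the distinctions that matter in your domain, that is a signal to fine-tune or add explicit handling rather than trust raw similarity.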

What This Means

Embeddings work because they solve the right problem. Raw text is discrete and categorical. Meaning is continuous and relational. Translating text into geometry creates a space where the distance between points can encode semantic proximity, where directions can encode consistent relationships, and where arithmetic on concepts produces meaningful results.

The numbers themselves are arbitrary. What matters is the structure of the space they create. That structure is not designed by hand. It’s learned from the statistical regularities of enormous amounts of human-generated content, which means it reflects the structure of meaning as humans actually use it, including all the ambiguities and cultural biases that come with that.

The reason embeddings generalize well is that language is not random. Words that appear in similar contexts carry related meanings. Images with similar visual statistics depict related things. The world has structure, and enough exposure to that structure lets a model learn coordinates that reflect it.

When you use cosine similarity to find related documents, you’re not doing a clever keyword trick. You’re navigating a learned map of meaning, one where proximity was earned through millions of examples. The coordinates are just numbers. The geometry they form is something more interesting.