Why Similarity Scores Rarely Reach 0

One subtle point about cosine similarity with text embeddings is that scores rarely fall all the way to 0, even when two passages appear unrelated. This happens partly because natural language itself is highly structured. English sentences typically share common grammatical and conceptual patterns: subjects and verbs, agents and actions, causes and outcomes. As a result, even unrelated stories often contain overlapping conceptual features such as time passing, attempts, observations, or results. Modern embedding models capture these shared structures.

Embedding systems are trained on enormous corpora of real language and learn statistical patterns about how ideas co-occur. During training, semantically related words and phrases move closer together in vector space, but even distant topics often remain within the same broad semantic manifold. Two coherent English narratives therefore tend to have some baseline similarity, because they share elements of human activity and narrative structure. In practice, this means that completely unrelated passages might still score somewhere around 0.10–0.20, rather than near zero.
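One way to see this baseline effect is with toy vectors: if two passages load on different topic dimensions but also share a few dimensions standing in for common narrative structure, their cosine similarity lands in roughly this range. The feature layout below is purely illustrative, not how any real embedding model arranges its dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy layout: dims 0-1 are two distinct topics; dims 2-3 stand in for
# shared "narrative structure" features (time passing, attempts, results).
story_a = np.array([1.0, 0.0, 0.3, 0.3])  # topic A + shared structure
story_b = np.array([0.0, 1.0, 0.3, 0.3])  # topic B + same shared structure

print(cosine_similarity(story_a, story_b))  # ~0.15: nonzero despite disjoint topics

# Only fully disjoint feature sets reach a true zero.
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0
```

The shared dimensions contribute to the dot product no matter how different the topics are, which is exactly the "similarity floor" described above.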

This effect was already observed in early work on distributional semantics and word embeddings. Researchers found that vector representations of language tend to cluster around shared linguistic patterns rather than spreading uniformly across space. The result is a kind of background similarity floor: meaningful comparisons happen above that baseline, but reaching a true zero is rare unless the text is random, nonsensical, or in a completely different domain or language.

For this reason, when interpreting cosine similarity between embeddings, it is usually more informative to compare relative scores across candidates than to expect unrelated text to produce values near zero.
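In code, this relative interpretation amounts to ranking candidates against each other rather than testing scores against an absolute threshold near zero. The query and candidate vectors below are made-up stand-ins for real embedding output.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: a query and three candidate passages.
query = np.array([0.9, 0.1, 0.3])
candidates = {
    "related passage":    np.array([0.80, 0.2, 0.3]),
    "tangential passage": np.array([0.30, 0.7, 0.3]),
    "unrelated passage":  np.array([0.05, 0.9, 0.3]),
}

scores = {name: cosine_similarity(query, vec) for name, vec in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

# What matters is the ordering and the gaps between scores, not whether
# the lowest score is close to zero (it usually will not be).
for name in ranked:
    print(f"{name}: {scores[name]:.2f}")
```

Note that even the "unrelated" candidate scores well above zero here; the ranking, not the absolute value, carries the signal.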

References
  • Tomas Mikolov et al. (2013). Efficient Estimation of Word Representations in Vector Space.
  • Tomas Mikolov et al. (2013). Linguistic Regularities in Continuous Space Word Representations.
  • Yoav Goldberg (2017). Neural Network Methods for Natural Language Processing.
  • Omer Levy & Yoav Goldberg (2014). Neural Word Embedding as Implicit Matrix Factorization.