What Is a Vector, and Why Is It the Language of Similarity?

We have a goal, and it is a big one. We want to build a movie recommendation system, and we want it to be so good it makes the Netflix one look like it was designed in a cave. Fine words, but right now we have a very concrete problem sitting in front of us. A recommendation system runs on a computer, and a computer does calculations. Our raw material, on the other hand, is movies. So before we recommend anything, we must answer a much humbler question: how do we turn a movie into mathematics?

Let us pause for a second and ask something even more basic: what does a recommendation system actually do? When someone finishes a movie and we want to suggest the next one, we are not pulling titles out of a hat. We are looking for movies that resemble what they just watched, or resemble what people like them tend to enjoy. The whole thing, at its core, is a search for similarity. If we cannot measure how similar two movies or two users are, we have nothing to recommend. So the goal, "recommend movies", quietly turns into a sharper goal: find a way to measure similarity between users or movies.

“Similarity is about direction. Not about where a vector starts. Not about how long it is. Direction.”

And here is where our humbler question comes back with teeth. Similarity between two movies is a feeling, a judgement, a conversation at dinner. Similarity for a computer has to be a number. Ideally one clean number on a clear scale, where higher means more alike and lower means less alike. Our job, then, is to turn each movie into something a computer can chew on, and to find an operation that takes two of those somethings and returns a single similarity score. That is the whole plan of the post, and of the next few.

If a friend asks us whether two movies are similar, we know what to say. "They are both sci-fi", or "they were both huge flops", or "everyone hated them on social media". Nice answer for a friend, terrible answer for a computer. A computer does not know what "flop" means, and it certainly does not know what "similar" means, until we hand it numbers. So our first job, before anything else, is to describe a movie in a way that a machine can actually work with.

What if we summarised each movie with a single number? Say, its box office profit. That is one number, clean and simple, and similarity would be just as simple: two movies are alike if their profits are close, different if their profits are far apart. Simple! I am a sucker for simplicity, however let us hit the brakes as I sense trouble with this approach. For example, two movies can have the exact same profit and be nothing alike. A cult film that barely broke even and a quiet indie drama that barely broke even would now look identical to us. A movie is not one thing, it has many sides, and a single number throws them all away. Any similarity score built on top of it would be lying to us.

So we need more than one number per movie. What about two? The truth is that for this post and the next few, we are going to be working with 2 or 3 numbers per movie. ATTENTION AS WE HAVE NEW LINGO! We do not call them "numbers per movie", we call them dimensions. Dimensions is just a fancy word for how many entries we use to describe a movie. Cool, now why 2 or 3? Simple. So we can "see" what we are doing on a plot. Everything we build will scale to as many dimensions as we want later on.

Let us pick two: the profit (in millions) and some sentiment score from social media (in thousands of mentions, positive minus negative). These are not the deepest features in the world, and later we will add others that feel more relevant, but for now they are enough because they let us keep our feet on the ground and plot what we are doing.

Already we face a very small but very real question. If we write the pair $(7, 8)$ , how does the computer know which entry is profit and which is sentiment? It does not. To the computer, those are just two numbers in a row. We are the ones who have to decide, and once we decide, we can never swap them again. Position has to matter. If we mix up the slots, every comparison we ever do will be garbage.

This little object we just built, a list of numbers where the position of each entry carries meaning, has a name. We call it a vector. There is nothing scary about it. We write our first one like this:

$\vec{v} = \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = \begin{pmatrix} 7 \\ 8 \end{pmatrix} = \begin{pmatrix} \text{profit} \\ \text{sentiment} \end{pmatrix}$

That is one movie translated into mathematics. Seven million in profit, eight thousand more positive mentions than negative ones. That is not a bad evening at the cinema. The little arrow on top of the $v$ is just a reminder that $\vec{v}$ is a vector and not a lonely number, we will use that notation from now on.

Because we promised ourselves not to be sloppy, let us also write the general version. A vector with $n$ real entries:

$\vec{v} = (v_1, \ldots, v_n)$

lives in a space that mathematicians write with a double-struck $R$ and a little superscript, and in symbols we say:

$\vec{v} \in \mathbb{R}^n$

Do not let those symbols intimidate you. The $\mathbb{R}$ stands for real numbers (the ordinary numbers you already know, with or without decimals), the double-struck styling is just the convention for "the whole set of them", and the superscript tells us how many slots our vector has. So it reads, out loud, as "lists of $n$ real numbers". Our own $\vec{v}$ with two entries therefore lives in a two-dimensional version of that space. Which brings us to that wonky e, $\in$ . It just means "belongs to". So the whole line reads: $\vec{v}$ is a list of $n$ real numbers. For now we are living in the two-dimensional case, which is just a fancy way of saying our vectors have two real numbers in them. We can go to three, to ten, to a thousand. Nothing in the definition stops us, and later we will happily stack more features on top.

Here is the nice part. A pair of numbers is also a point. The point $(7, 8)$ sits seven units to the right and eight units up, which means we can pin it down on a plane with two axes. And once we have a point, we can draw an arrow from the origin to it:

Loss curve over 100 epochs

Why start the arrow at the origin? Two good reasons. The first is convenience. If the tail is at $(0, 0)$ and the tip is at $(7, 8)$ , then the vector is completely described by those two endpoint numbers, no fuss. The second reason is comparison. When we place two arrows at the same starting point, we can immediately eyeball which one is longer and how their directions differ. If they were scattered all over the plane, we would have to slide them around before we could say anything useful about them. Nothing in the world forbids a vector from living somewhere else on the plane, we are just being tidy.

So what has the arrow bought us? From the list view, two things were not obvious, but the arrow gives them to us for free: a vector has a direction (where it points) and a magnitude (how long it is). For two vectors to be the same, both their direction and their magnitude must agree:

Loss curve over 100 epochs

So far, we took a movie and translated it into mathematics. We chose a couple of dimensions, fixed their order, wrote them down as a vector $\vec{v}$ , and we even drew it as an arrow with a direction and a magnitude. For one movie, the translation is done. But similarity, the thing we actually came here for, is a relationship between two movies, not a property of one. You know what we need for comparison? At least a pair of things. And even with a pair sitting in front of us, staring at two arrows will not be enough. Our eyes will have an opinion, sure, but we promised the computer a number, one clean number that says how similar two movies are.

So the question that closes this post and opens the next is this: given two vectors, what operation takes both of them and gives us back a single number on a sensible scale?