The Intuition Behind Self-Attention

Attention has largely replaced recurrent neural networks (RNNs) for sequence modeling, and it is the core mechanism that powers large language models (LLMs). Attention offers three distinct advantages over RNNs:

Parallelism: RNNs process tokens sequentially, because each hidden state depends on the previous one. Attention has no such dependency, so all tokens can be processed in parallel.
Constant Path Length: For an RNN to relate token 1 to token 100, information has to pass through 99 intermediate hidden states. In attention, any two tokens are connected directly in a single step.
No Fixed-Size Information Bottleneck: In RNNs, you have to cram all the past information into a single fixed-size hidden vector. Attention keeps one vector per token, which removes this bottleneck.

Queries, Keys, and Values

Attention is based on the core premise that the true meaning of a token is context-dependent. For example, assume we want to represent the meaning of “plane” in the following two sentences:

The plane will be landing at 9:00 PM tonight.
The 2D plane is defined by the set of all vectors in $\mathbb{R}^2$.

The word “plane” means something very different in each sentence. Attention allows us to update the representation of a token based on its surrounding context using queries, keys, and values:

Query: A learned vector representation of a token that is used to “ask” every token in the sequence, itself included, whether it is important to the current token’s meaning.
Key: A learned vector representation of a token that is used to “answer” a query (from any token, including its own) about whether it’s important to the asking token’s meaning.
Value: A learned vector representation of a token that holds the actual content it contributes. After relevance has been determined from the queries and keys, the tokens’ values are blended together, weighted by that relevance, to form the new representation.
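To make the three roles concrete, here is a minimal sketch of how one token could get its query, key, and value. It assumes the common setup in which each is a linear projection of the token's embedding; the matrices `W_q`, `W_k`, `W_v`, the sizes, and the embedding values are all illustrative placeholders, not something specified above. In a trained model the matrices would be learned rather than random.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_k, d_v = 4, 3, 2  # illustrative sizes (assumptions)

# Hypothetical projection matrices, one per role.
W_q = random_matrix(d_k, d_model, rng)
W_k = random_matrix(d_k, d_model, rng)
W_v = random_matrix(d_v, d_model, rng)

embedding = [0.5, -1.0, 0.25, 2.0]  # stand-in embedding for one token

query = matvec(W_q, embedding)  # used when this token "asks"
key = matvec(W_k, embedding)    # used when this token "answers"
value = matvec(W_v, embedding)  # the content this token contributes

print(len(query), len(key), len(value))
```

The same embedding yields three different vectors because it passes through three different matrices, which is what lets a token behave differently in each role.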

Having three different representations is intentional. We want the token to behave differently based on its current role: asking questions, answering them, or generating a new representation.

The above visual describes how the attention mechanism works for a single token. Let’s return to our “plane” example from earlier. Each token in the sequence has its own query, key, and value vectors. To determine the meaning of “plane”, we take the dot product of its query vector with the key vector of every token in the sequence. This produces one real-valued score per token, which answers the question: how relevant is this token for determining the meaning of “plane”? We then take the softmax of the scores so that they are non-negative and sum to 1, forming a probability distribution.

The final step is to multiply each token’s value vector by its softmax-normalized score and sum the results. This produces a new representation of the word “plane” that incorporates the surrounding context.
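The steps for a single token can be sketched in a few lines of plain Python. This is an illustrative toy, not a reference implementation: the function name `attend` and the two-dimensional keys, values, and query are made up for the example, and the scores are left unscaled to match the description so far.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """New representation for one token, given every token's key and value."""
    scores = [dot(query, k) for k in keys]  # one relevance score per token
    weights = softmax(scores)               # scores -> probability distribution
    d_v = len(values[0])
    # Weighted sum of the value vectors, computed one component at a time.
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
    return weights, output

# Toy sequence of three tokens (all numbers are made up).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]  # query of the token being updated

weights, output = attend(query, keys, values)
print(weights)
print(output)
```

The first and third keys have the same dot product with the query, so they receive equal weight, and both outweigh the second key, whose dot product is zero.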

Matrix Representation

One of the major benefits of attention is that it can be parallelized. We can see this in practice by looking at the matrix form for attention:

$$Z = \operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $n$ is the number of tokens in the sequence, $d_k$ is the size of the query and key vectors, $d_v$ is the size of the value vectors, and $Q \in \mathbb{R}^{n \times d_k}, K \in \mathbb{R}^{n \times d_k}, V \in \mathbb{R}^{n \times d_v}$ are matrices whose i-th rows hold the query, key, and value of the i-th token. This illustration represents the effect of doing the operation listed above:

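This matrix form is the exact same operation we listed above! To convince yourself of why, take a look at the matrix we drew in the illustration. Let’s call $A = \operatorname{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$, where the softmax is applied to each row separately. Every position $(i, j)$ in $QK^T$ is the result of taking the i-th query vector and dotting it with the j-th key vector. After scaling and the row-wise softmax, position $(i, j)$ in $A$ is token i’s attention weight on token j, and each row of $A$ sums to 1.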

Our goal is for the output to be a matrix $Z \in \mathbb{R}^{n \times d_v}$ whose i-th row is the new representation of the i-th token. Multiplying the $n \times n$ matrix $A$ by $V$ produces exactly that. To see why, recall that matrix multiplication is defined entry by entry as:

$$Z[i][j] = \sum_k A[i][k] \cdot V[k][j] \implies Z[i][:] = \sum_k A[i][k] \cdot V[k][:]$$

In words, row i of $Z$ is the sum, over all tokens k, of token i’s attention weight on token k times token k’s value vector. This is exactly what we wanted.
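The equivalence can also be checked numerically. The sketch below computes $Z$ once through the matrix form and once by rebuilding each row as the weighted sum over value vectors, then compares the two. It uses plain Python lists so it runs without dependencies; the helpers `matmul` and `transpose` and the toy matrices are illustrative assumptions, and a real implementation would use an array library instead.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    cols = list(zip(*B))  # columns of B
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def attention(Q, K, V):
    """Matrix form: softmax(Q K^T / sqrt(d_k)) V, with a row-wise softmax."""
    d_k = len(Q[0])
    scores = matmul(Q, transpose(K))  # n x n matrix of dot products
    A = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return A, matmul(A, V)

# Toy inputs: n = 3 tokens, d_k = 2, d_v = 3 (all numbers are made up).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

A, Z = attention(Q, K, V)

# Row i of Z rebuilt as the sum over k of A[i][k] * V[k][:].
Z_by_rows = [
    [sum(A[i][k] * V[k][j] for k in range(len(V))) for j in range(len(V[0]))]
    for i in range(len(Q))
]

max_diff = max(
    abs(Z[i][j] - Z_by_rows[i][j])
    for i in range(len(Z))
    for j in range(len(Z[0]))
)
print(max_diff)
```

Each row of `Z` depends only on the matching row of `Q`, which is why all rows can be computed in parallel.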

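An important detail we glossed over earlier is the $\sqrt{d_k}$ factor. It keeps the variance of the attention scores from growing with the vector size. If the components of a query and a key are independent with mean 0 and variance 1, their dot product has variance $d_k$, and dividing by $\sqrt{d_k}$ brings that variance back to 1. Without the scaling, large $d_k$ yields scores of large magnitude, softmax saturates toward a one-hot vector, and the gradients flowing through it during backpropagation become vanishingly small.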
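A small simulation can illustrate the effect of the scaling. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is an idealization rather than a property of any trained model; `d_k = 64`, the sample count, and the choice to treat the first ten scores as one query's scores are arbitrary.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
d_k = 64          # illustrative query/key size
n_samples = 2000  # number of random query/key pairs

def random_vector(d):
    # Assumption: independent components with mean 0 and variance 1.
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

raw = []
for _ in range(n_samples):
    q, k = random_vector(d_k), random_vector(d_k)
    raw.append(sum(a * b for a, b in zip(q, k)))
scaled = [s / math.sqrt(d_k) for s in raw]

raw_var = variance(raw)        # expected to be close to d_k
scaled_var = variance(scaled)  # expected to be close to 1

# Treat the first 10 scores as one query's scores against 10 keys.
peak_raw = max(softmax(raw[:10]))
peak_scaled = max(softmax(scaled[:10]))

print(raw_var, scaled_var)
print(peak_raw, peak_scaled)
```

The largest softmax weight is higher for the raw scores than for the scaled ones, meaning the unscaled distribution sits closer to one-hot.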