To really see what a transformer does, compute one self-attention output by hand. The procedure is fixed: build queries, keys, and values from the tokens, score each query against the keys, scale and softmax those scores, then mix the values by the resulting weights. Run that once and the rest of the architecture becomes context around a step you already understand.
The transformer is a neural network architecture built around self-attention. Instead of relying mainly on step-by-step processing, it lets each token gather information from other relevant tokens in the sequence. If one word depends on another far away, attention gives the model a direct path between them.
When Self-Attention Applies
Use self-attention whenever a token's meaning depends on other tokens in the same sequence. Each token produces three vectors:
- a query, what this token is looking for
- a key, what this token offers for matching
- a value, the information that can be passed along
With token representations in a matrix $X$, one attention head forms

$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V,$$

where $W_Q$, $W_K$, $W_V$ are learned matrices.
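As a minimal sketch of these three projections, the snippet below multiplies a toy token matrix by three small matrices using plain Python lists. The token values, the matrix entries, and the `matmul` helper are all assumptions made for illustration; in a trained model the projection matrices are learned, not hand-picked.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Hypothetical toy input: 3 tokens, each a 2-dimensional vector.
X = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

# Hypothetical projection matrices (learned in a real model).
W_Q = [[1.0, 0.0], [0.0, 1.0]]  # identity: queries equal the tokens
W_K = [[0.0, 1.0], [1.0, 0.0]]  # swaps the two coordinates
W_V = [[2.0, 0.0], [0.0, 2.0]]  # doubles every coordinate

Q = matmul(X, W_Q)
K = matmul(X, W_K)
V = matmul(X, W_V)

print("Q =", Q)
print("K =", K)
print("V =", V)
```

All three outputs have one row per token, which is what lets each token act as a query, a key, and a value at once.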
The Procedure, Step by Step
- Start with token vectors and add positional information, because attention alone does not tell the model which token came first.
- Build queries, keys, and values with learned linear maps; in self-attention they come from the same sequence.
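- Compute attention weights: form scores with $QK^\top$, scale by $1/\sqrt{d_k}$, apply softmax row by row, and combine the value vectors. The standard formula is $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$.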
- Finish the block by sending the attention output through the rest of the transformer block — usually multi-head attention, a feedforward network, residual connections, and normalization.
- Stack blocks for the task: encoder-decoder stacks for sequence-to-sequence work, decoder-only stacks for next-token prediction.
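The scoring, scaling, softmax, and mixing step can be sketched in a few lines of plain Python. This is an illustrative single-head version under simplifying assumptions: no positional information, no masking, no learned projections, and toy `Q`, `K`, `V` values chosen by hand. The function names `softmax` and `attention` are hypothetical helpers, not a library API.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head; matrices are lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Score this query against every key, then scale by 1/sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # one row of attention weights
        # Weighted average of the value vectors, coordinate by coordinate.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Hypothetical toy inputs: 2 tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

output = attention(Q, K, V)
print(output)
```

Each output row is a mixture of the rows of `V`, leaning toward the value whose key matched that row's query best.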
In plain language, self-attention does not pick one token and ignore the rest; it builds a weighted average of value vectors. A strong query-key match earns a larger weight; a weak one earns less. Each token's output is a context-aware mixture, which is what helps with subject-verb agreement, pronoun reference, and other long-range relationships. The scaling by $1/\sqrt{d_k}$ keeps raw dot-product scores from growing too large as the dimension $d_k$ grows. The softmax is applied row by row, with each row answering one token's question about how much attention to give the others.
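To make the effect of the scaling concrete, here is a small sketch comparing softmax weights with and without it. The dimension 64 and the hand-built vectors are assumptions chosen so the dot products come out to round numbers. Large unscaled scores push the softmax close to a one-hot choice, while the scaled scores leave a softer mixture.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical vectors with d_k = 64 and entries of +1 or -1.
d_k = 64
q = [1.0] * d_k
k1 = [1.0] * 40 + [-1.0] * 24  # dot product with q is 16
k2 = [1.0] * 32 + [-1.0] * 32  # dot product with q is 0

raw = [dot(q, k1), dot(q, k2)]
scaled = [s / math.sqrt(d_k) for s in raw]  # divide by sqrt(64) = 8

raw_weights = softmax(raw)
scaled_weights = softmax(scaled)

print("unscaled weights:", raw_weights)
print("scaled weights:  ", scaled_weights)
```

Without scaling, nearly all the weight lands on the first key; with scaling, the second key still contributes a noticeable share.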
A Full Worked Attention Output
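One query, two candidate tokens, single head, to keep the arithmetic clean. Let the query be $q$, the keys $k_1$ and $k_2$, and the values $v_1$ and $v_2$.
Raw scores are the dot products:

$$s_1 = q \cdot k_1, \qquad s_2 = q \cdot k_2.$$

With $d_k = 1$ the scaling factor is $1/\sqrt{d_k} = 1$, so the scaled scores stay $s_1$ and $s_2$. Apply softmax:

$$\alpha_1 = \frac{e^{s_1}}{e^{s_1} + e^{s_2}}, \qquad \alpha_2 = \frac{e^{s_2}}{e^{s_1} + e^{s_2}}.$$

The output is the weighted combination

$$o = \alpha_1 v_1 + \alpha_2 v_2.$$

The output lands closer to whichever value has the larger weight, because the query matched its key more strongly. Real transformers do this in higher dimensions across many tokens at once, but the arithmetic is the same.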
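The same one-query, two-token calculation can be run with concrete numbers. The values below are illustrative assumptions, not the article's original figures: a one-dimensional query and keys (so the scaling factor is 1) and two-dimensional values, so it is easy to see which value the output leans toward.

```python
import math

# Illustrative numbers only; chosen so the scores are 1 and 0.
q = [1.0]
keys = [[1.0], [0.0]]
values = [[1.0, 0.0], [0.0, 1.0]]

d_k = len(q)  # 1, so dividing by sqrt(d_k) changes nothing

# Scaled dot-product scores of the query against each key.
scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]

# Softmax over the two scores.
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]

# Weighted combination of the two value vectors.
output = [sum(w * v[j] for w, v in zip(weights, values))
          for j in range(len(values[0]))]

print("scores :", scores)
print("weights:", weights)
print("output :", output)
```

With these numbers the first weight is e / (1 + e), roughly 0.73, so the output sits about three quarters of the way toward the first value.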
Self-Check at Each Step
- After scoring: are the scores dot products of this query with each key? Forgetting the scale factor is a frequent slip.
- After softmax: do the weights sum to $1$? If not, the row-wise softmax was applied wrong.
- After the mix: does the output sit between the values, leaning toward the best-matched one? If it lands outside that range, the weighting is off.
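The last two checks are easy to automate. The helper below is a hypothetical sketch (the name `check_attention_row` and its messages are invented for illustration): it verifies that one row of weights is non-negative and sums to 1, and that each output coordinate lies within the range spanned by the values. It does not cover the first check, since it never sees the raw scores.

```python
import math

def check_attention_row(weights, values, output, tol=1e-9):
    """Return a list of failed checks for one query's attention result."""
    problems = []
    if any(w < 0 for w in weights):
        problems.append("negative weight")
    if not math.isclose(sum(weights), 1.0, abs_tol=tol):
        problems.append("weights do not sum to 1")
    for j, o in enumerate(output):
        column = [v[j] for v in values]
        # A weighted average cannot leave the range of the values it mixes.
        if not (min(column) - tol <= o <= max(column) + tol):
            problems.append(f"output[{j}] outside the range of the values")
    return problems

values = [[1.0, 0.0], [0.0, 1.0]]

# A consistent result: weights sum to 1 and the output is their mixture.
good = check_attention_row([0.75, 0.25], values, [0.75, 0.25])

# A broken result: the weights were never normalized.
bad = check_attention_row([1.5, 0.5], values, [1.5, 0.5])

print("good:", good)
print("bad: ", bad)
```

An empty list means the row passed; any message points at the step to revisit.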
What Else Lives in the Block
A standard block also has multi-head attention (several relationship types at once), a position-wise feedforward network, residual connections, and layer normalization. The original sequence-to-sequence design had an encoder stack (self-attention over the input) and a decoder stack (masked self-attention so a position cannot look ahead, plus cross-attention where queries come from the decoder and keys/values from the encoder). Many modern language models use only the decoder side, specialized for next-token prediction. Because attention alone is permutation-equivariant, transformers must add positional information so order-sensitive tasks work.
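Permutation equivariance can be checked directly: shuffle the input tokens and the outputs come back shuffled the same way, with nothing else changed. The sketch below assumes a stripped-down self-attention with identity projections (so queries, keys, and values all equal the tokens) and a hand-picked toy sequence; `self_attention` is a hypothetical helper.

```python
import math

def self_attention(X):
    """Single-head self-attention with identity projections (Q = K = V = X)."""
    d_k = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * x[j] for e, x in zip(exps, X))
                    for j in range(d_k)])
    return out

# Hypothetical toy sequence of 3 tokens, with no positional information added.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
X_shuffled = [X[i] for i in perm]

out = self_attention(X)
out_shuffled = self_attention(X_shuffled)

# Shuffling the original outputs should reproduce out_shuffled.
out_then_shuffled = [out[i] for i in perm]
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_a, row_b in zip(out_shuffled, out_then_shuffled)
    for a, b in zip(row_a, row_b)
)
print("equivariant:", same)
```

Because each token's output ignores where it sits in the sequence, order has to be supplied separately through positional information.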
Where the Procedure Goes Wrong
- Thinking attention is the whole transformer. It is central, but feedforward layers, residual paths, normalization, and positional information all matter.
- Mixing up self-attention and cross-attention. In self-attention $Q$, $K$, $V$ share one sequence; in cross-attention they do not.
- Forgetting masking. Decoder-only models need causal masking so a token cannot attend to future tokens.
- Treating attention weights as a full explanation. Behavior also depends on value vectors, later layers, and nonlinear transforms.
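Causal masking is simple to sketch: before the softmax, set the score of every future position to negative infinity so its weight becomes exactly zero. The function name and the toy sequence below are assumptions for illustration, and the projections are again taken to be the identity.

```python
import math

def causal_attention_weights(Q, K):
    """Row-wise softmax weights where position i only sees positions 0..i."""
    d_k = len(K[0])
    weights = []
    for i, q in enumerate(Q):
        # Future positions (j > i) get -inf, which softmax maps to weight 0.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) if j <= i else -math.inf
            for j, k in enumerate(K)
        ]
        m = max(scores)  # finite, because position i can always see itself
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical toy sequence of 3 tokens.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = causal_attention_weights(X, X)

for row in W:
    print([round(w, 3) for w in row])
```

The printed matrix is lower triangular: the first token attends only to itself, and no row places weight on a later position.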
Transformers fit language modeling, translation, summarization, code generation, speech, and many vision tasks — anywhere relationships across a sequence matter more than local patterns. Compute one small attention output by hand, follow the five steps once, and the leap from "I know the formula" to "I understand what the architecture is doing" usually happens right there.
Frequently Asked Questions
- What is a transformer in machine learning?
- A transformer is a neural network architecture built around self-attention. Instead of relying mainly on step-by-step sequential processing, it lets each token gather information directly from other relevant tokens in the sequence. That direct path between distant tokens is why transformers work well for language and other sequence tasks with long-range dependencies.
- How does self-attention work in plain language?
- Self-attention builds a weighted average of value vectors rather than picking one token and ignoring the rest. If one token strongly matches another token's key, that token gets a larger weight; weak matches get smaller weights. Each token's output is therefore a context-aware mixture of information from the sequence, helping with things like pronoun reference and subject-verb agreement.
- What are queries, keys, and values in attention?
- Each token produces three vectors from learned matrices. The query represents what the token is looking for, the key represents what the token offers for matching, and the value is the information that can be passed along. Attention scores come from comparing queries with keys, and the output mixes the values according to those scores.
- Why is attention scaled by the square root of the key dimension?
- The raw attention scores are dot products between queries and keys, and those dot products tend to become larger as the vector dimension grows. Dividing by the square root of the key dimension keeps the scores from growing too large before the softmax is applied, which keeps the attention weights in a useful range.