
The basic building blocks of the Transformer are scaled dot-product attention units. When a sentence is passed into a Transformer model, attention weights are calculated between every pair of tokens simultaneously. The attention layer produces, for every token, an embedding in context that contains information not only about the token itself, but also about other relevant tokens, each weighted by its attention weight.

Concretely, for one attention unit the Transformer model learns three weight matrices: the query weights $$W_Q$$, the key weights $$W_K$$, and the value weights $$W_V$$. For each token $$i$$, the input word embedding $$x_i$$ is multiplied by each of the three weight matrices to produce a query vector $$q_i = x_iW_Q$$, a key vector $$k_i = x_iW_K$$, and a value vector $$v_i = x_iW_V$$. Attention weights are calculated using the query and key vectors: the attention weight $$a_{ij}$$ from token $$i$$ to token $$j$$ is the dot product between $$q_i$$ and $$k_j$$. The attention weights are divided by the square root of the dimension of the key vectors, $$\sqrt{d_k}$$, which stabilizes gradients during training, and then passed through a softmax, which normalizes the weights to sum to $$1$$. The fact that $$W_Q$$ and $$W_K$$ are different matrices allows attention to be non-symmetric: if token $$i$$ attends to token $$j$$ (i.e. $$q_i \cdot k_j$$ is large), this does not necessarily mean that token $$j$$ will attend to token $$i$$ (i.e. $$q_j \cdot k_i$$ is large). The output of the attention unit for token $$i$$, $$z_i$$, is the sum of the value vectors of all tokens, weighted by $$a_{ij}$$, the attention from token $$i$$ to each token.

$$\begin{align} z_i = \sum_{j=1}^n a_{ij}v_j \end{align}$$
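As a minimal sketch, the per-token computation above can be written in a few lines of NumPy. The sequence length, dimensions, and random inputs below are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8            # assumed sequence length, embedding dim, key/query dim

X = rng.normal(size=(n, d_model))    # input word embeddings x_i, one per row
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V  # rows are q_i = x_i W_Q, k_i = x_i W_K, v_i = x_i W_V

# Attention weight a_ij: scaled dot product q_i . k_j, softmax-normalized over j.
scores = Q @ K.T / np.sqrt(d_k)
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)   # each row of A now sums to 1

Z = A @ V                            # row i is z_i = sum_j a_ij v_j
```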

The attention calculation for all tokens can be expressed as one large matrix computation, which is useful for training because optimized matrix operations execute quickly on modern hardware. The matrices $$Q$$, $$K$$ and $$V$$ are defined as the matrices whose $$i$$th rows are the vectors $$q_i$$, $$k_i$$, and $$v_i$$ respectively.

$$\begin{align} \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\mathrm{T}}{\sqrt{d_k}}\right)V \end{align}$$
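A sketch of the matrix form, wrapping the equation above in a single function. It follows the conventions of the previous sketch (rows of $$Q$$, $$K$$, $$V$$ are $$q_i$$, $$k_i$$, $$v_i$$).

```python
import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, with the softmax applied row-wise
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# With Q, K, V, and Z from the previous sketch, this reproduces the
# per-token outputs z_i in one call:
# np.allclose(attention(Q, K, V), Z)  ->  True
```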

Multi-head Attention
One set of $$\left( W_Q, W_K, W_V \right)$$ matrices is called an attention head, and each layer in a Transformer model has multiple attention heads. While one attention head attends to the tokens that are relevant to each token, with multiple attention heads the model can learn to do this for different definitions of "relevance". Research has shown that many attention heads in Transformers encode relevance relations that are transparent to humans. For example, there are attention heads that, for every token, attend mostly to the next word, or attention heads that mainly attend from verbs to their direct objects. Since Transformer models have multiple attention heads, they have the possibility of capturing many levels and types of relevance relations, from surface-level to semantic. The multiple outputs of the multi-head attention layer are concatenated and passed into the feed-forward neural network layers, as illustrated in the sketch below.
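A hedged sketch of multi-head attention, reusing the `attention` function from the previous sketch: each head learns its own $$\left( W_Q, W_K, W_V \right)$$ triple, and the per-head outputs are concatenated. The head count and dimensions are illustrative assumptions, and the learned output projection that a full Transformer applies to the concatenation is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_model, d_k, n_heads = 4, 8, 4, 2    # assumed sizes for illustration
X = rng.normal(size=(n, d_model))        # input embeddings, one token per row

head_outputs = []
for _ in range(n_heads):
    # Each attention head has its own (W_Q, W_K, W_V) weight matrices,
    # so each head can learn a different notion of "relevance".
    W_Q = rng.normal(size=(d_model, d_k))
    W_K = rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_k))
    head_outputs.append(attention(X @ W_Q, X @ W_K, X @ W_V))

# Concatenate the per-head outputs along the feature dimension; in a full
# Transformer this concatenation is then linearly projected before being
# passed to the feed-forward layers.
multi_head = np.concatenate(head_outputs, axis=-1)   # shape (n, n_heads * d_k)
```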