Attention is all you need.

Since the release of BERT and GPT-2, I created a set of slides that are more clear than these notes with better examples. See them here.

Attention is all you need

This work focuses on the task of natural language translation (e.g. translating english to german or vice versa.) This notebook focuses on the unique modules the authors present, and how the system fits together. The Transformer Network (TN) is composed of attention modules, linear mappings, regularization features and uses an Encoder-Decoder structure. Since this publication, Transformer Networks Encoders have been used to great success in a wide range of applications

I modify and present implementations from the tensor2tensor

The transformer uses a encoder-decoder

According to Cho et. al

The input sentence

Next, we decode

- Its sequential and cannot be easily parallelized.
- Often
\mathbf{z} is input into each instance of the decoding function. Because from\mathbf{z} there is O(n) distance to each input symbol, it becomes difficult to learn long range dependencies. - The path between an output symbol and its corresponding source symbol depends on the length of
x .

TN's stateless auto-regressive strategy decodes encoded (but not summarized) source words and the current output words ouputting probability distributions for new symbols. This allows the model to be parallelized.

The authors describe attention as follows:

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

As noted by the authors, attention maps a query to a combination of given outputs, as determined by the query's corresponding compatibility with the input keys. As the autological "Scaled Dot-Product Attention" method implies, the authors use dot product for their compatibility function. One could use any metric, learned or otherwise, for example cosine distance or a feedforward neural network layer.

For their formulation of attention to work, there are a few requirements for the inputs. There must be mapping between the keys and values, and the compatibility function must be valid (be defined for) the queries and the keys. In the paper, there is 1:1 mapping between the keys and values (by index), and the dot-product compatibility function requires that the queries and the keys have the same dimensionality.

- Each key
K_i maps to a valueV_i . - Each query
Q_j will operate on all the keys with a compatibility function (dot product). As shown in (b), the closer the vectors are in high-dimensional space, the more compatible. These scores will be transformed into a probability distribution by a softmax. - Then, each query will be mapped to a linear combination of the values as determined by the probability distribution (c).

As shown in the example above, the query

The scaled dot product attention is straight forward.

The authors note that the variance of a dot product scales with the size of the input vectors. Increased variance will result in increased magnitude, "pushing the softmax function into regions where it has extremely small gradients." This motivates the scaling of the dot-product based on the dimensionality of the input vectors.

Below is an implementation for scaled dot product attention. Each line corresponds to a box in the figure above.

With a single query, self attention will have no effect. This is because the attention mechanism will be a linear combination of the values, and it can only reproduce itself so it serves as an identity function.

When there are multiple queries, the vectors that are most *compatible* will become more similar because they are mapped to combinations consisting mostly of the already-compatible vectors. The remaining vectors will also be normalized.

Note that, especially with values greater than 1, a vector can have a greater dot product with other vectors rather than itself. So, similarity is aptly not the correct word to describe this interaction (at least when using a dot product). Thus, the first vector is mapped to a construction consisting mostly of itself and the second vector follows the same trend but more extreme. Lastly, the third vector, less compatible than the others - becomes pseduo-normalized.

The transformer uses "Multi-Head Attention" as its primary module for representational power. It is built up using scaled dot product attention. But, rather than attend raw queries a single time, this method attends *h* linear projections of the input. For each of the *h* heads, the inputs (K,Q,V) are linearily projected with a learned mapping.

The compatiblity function and the projections are linear. Does including a non-linearity effect the performance of this method? How well would the transformer perform using a feed forward layer?

Thus, the multi-headed attention is a function from

This two linear transforms with a nonlinear (RELU) operation. The denotation of position-wise remarks on the fact that it is not a convolution, nor does it have any directly spatial functionality.

The remaining features used by the network is residual layers, layer normalization and positional encoding. The structure and features of the model all work to make short paths between inputs and outputs, while also being highly regularized. Layer normalization and residual layers are topics on-to-themselves.

The positional encoding is used to represent the position of the queries in their embeddings. This is important because the attention mechanisms have no notion of order among the queries, and order determines the semantics of a sentence. The authors use a positional encoding that uses:

As the authors describe:

That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from2\pi to10000 \cdot 2\pi . We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offsetk ,PE_{pos+k} can be represented as a linear function ofPE_{pos} .

Each instance of the transformer will output a probability for the next symbol. As you can see, the encoder and decoder stacks are repeated N times each. In the paper the default was N = 6. The input and ouput of each stack is the of the same dimensionality. In addition to attention modules, they use a few techniques to regularize their network: layer normalization, residual connections, and dropout.

First, an input embedding for each word is retrieved. TN uses Byte-Encoding Representation with a shared embedding matrix

The two sublayers are Multi-Head Attention (self-attending) and a feed forward layer. This process manipulates the inputs and captures their interactions, outputting a sequence of the same dimensionality. As the authors note, this is used to make using the residual connections more natural. (It is possible to use residual connections with varying dimensions, but it is less clean.)

The residual connections maintain a direct path to the inputs, and the normalization stabilizes the embeddings. This encoder architecture mirrors Highway Networks

The decoder resembles the encoder. All symbols already generated (beginning with a start symbol) are embedded and combined with the positional encoding.

Next, masked self-attention is computed. A mask is applied so that only the right-most output can see previous outputs, preventing any contamination. After this, multi-headed attenion is applied, where the output sequences are the queries, and the encoded symbols are the keys and the queries. This maps the dimensionality of the vectors to be the same as those outputted by the encoder. The previous outputs served to mark what information has already been regressed; the linear maps intrinsic to the multi-head attention could make the compatibility *negative* in some sense, ignoring already generated content. After a feed forward layer, this process is repeated N times.

After the first "execution" of the decoder, the inputs to the module are derived from the encoded symbols rather than the previous output symbols. Note that while the encoder and decoder modules are repeated, they do not share weights. They are separate instances. Thus, I expect the first and second decoder layers learn different things. Finally, after decoding the encoded inputs, a linear map is applied to the vectors and a softmax generates an output probability distribution.

The linear layer takes an input of

When decoding an output sequence, the network is run repeatedly. During the first run

A greedy approach looks something like this:

Using beam search (as the authors did do), a path is selected by maintaining