Introduction
What’s Wrong with Seq2Seq Models?
- The Seq2seq model originates from language modeling.
- Normally has an encoder-decoder architecture:
- Encoder processes the input sequence and compresses it into a fixed-length context vector, where this representation is expected to be a good summary of the meaning of the whole source sequence.
- Decoder is initialized with the context vector to emit the transformed output.
- Both encoder and decoder are typically RNNs (LSTMs/GRUs).
- Transforms an input sequence (source) to a new one (target), where both sequences can be of arbitrary lengths:
- Limitation: incapable of remembering long sentences because of the fixed-length context vector.
- The context vector struggles to retain information from long sentences, often causing early parts of the sequence to be forgotten by the time processing completes. ⇒ The attention mechanism was born to resolve this problem.
Attention Cues in Biology
- When inspecting a visual scene, our optic nerve receives information on the order of $10^8$ bits per second, far exceeding what our brain can fully process. Fortunately, our ancestors had learned from experience (or, data in our case) that not all sensory inputs are created equal. ⇒ Throughout human history, the capability of directing attention to only a fraction of information of interest has enabled our brain to allocate resources more smartly to survive, to grow, and to socialize, such as detecting predators, prey, and mates.
- To explain how our attention is deployed in the visual world, a two-component framework has emerged. Idea dates back to William James, who is considered the “father of American psychology” [James, 2007].
- Nonvolitional cue
- Volitional cue
- While all the paper products are printed in black and white, the coffee cup is red. ⇒ The coffee cup is intrinsically salient and conspicuous in this visual environment.⇒ It automatically and involuntarily draws attention.⇒ You bring the fovea (the center of the macula where visual acuity is highest) onto the coffee.
⇒ In this framework, subjects selectively direct the spotlight of attention using both the nonvolitional cue and the volitional cue.
⇒ The nonvolitional cue is based on the saliency and conspicuity of objects in the environment: using the nonvolitional cue based on saliency (red cup, non-paper), attention is involuntarily directed to the coffee.
⇒ After drinking coffee, you become caffeinated and want to read a book. You turn your head, refocus your eyes, and look at the book. In this task-dependent case you select the book under cognitive and volitional control: using the volitional cue (wanting to read a book) that is task-dependent, attention is directed to the book under volitional control.
⇒ Using the volitional cue based on variable selection criteria, this form of attention is more deliberate. It is also more powerful with the subject's voluntary effort.
Queries, Keys, and Values
- Inspired by the nonvolitional and volitional attention cues that explain the attentional deployment.
- Consider the case where only nonvolitional cues are available. To bias selection over sensory inputs, we can simply use:
- Parameterized fully-connected layer.
- Non-parameterized max or average pooling.
- In the context of attention mechanisms, we refer to volitional cues as queries.
- Given any query, attention mechanisms bias selection over sensory inputs via attention pooling.
- These sensory inputs are called values in the context of attention mechanisms.
- Every value is paired with a key, which can be thought of as the nonvolitional cue of that sensory input.
⇒ What sets attention mechanisms apart from FC or pooling layers is the inclusion of the volitional cues (queries).⇒ Design attention pooling such that the given query (volitional cue) can interact with keys (nonvolitional cues), which guides bias selection over values (sensory inputs):
Nadaraya-Watson Kernel Regression
- Example of ML algorithm with attention mechanisms.
- We generate a dataset according to the following non-linear function with a noise term $\epsilon$ (zero mean, standard deviation 0.5):
$$y_i = 2\sin(x_i) + x_i^{0.8} + \epsilon.$$
Given this dataset, how do we learn $f$ to predict the output $\hat{y} = f(x)$ for any new input $x$?
Average Pooling
- Begin with the “dumbest” estimator for this regression problem. ⇒ Use average pooling to average over all the training outputs:
$$f(x) = \frac{1}{n}\sum_{i=1}^n y_i.$$
⇒ As we can see, this estimator is not so smart, as average pooling omits the inputs $x_i$.
Nonparametric Attention Pooling
- A better idea was proposed by Nadaraya and Watson to weight the outputs $y_i$ according to their input locations:
$$f(x) = \sum_{i=1}^n \frac{K(x - x_i)}{\sum_{j=1}^n K(x - x_j)}\, y_i,$$
where $K$ is a kernel.⇒ We can rewrite it in a more generalized form of attention pooling:
$$f(x) = \sum_{i=1}^n \alpha(x, x_i)\, y_i,$$
where $x$ is the query, $(x_i, y_i)$ is the key-value pair, and $\alpha(x, x_i)$ is an attention weight that is assigned to the corresponding value $y_i$.
- Consider a Gaussian kernel defined as
$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2}\right).$$
⇒ Plugging the Gaussian kernel into the last two equations gives us
$$f(x) = \sum_{i=1}^n \mathrm{softmax}\!\left(-\frac{1}{2}(x - x_i)^2\right) y_i.$$
- A key $x_i$ that is closer to the given query $x$ will get more attention via a larger attention weight assigned to the key's corresponding value $y_i$.
⇒ The predicted line is smooth and closer to the ground truth than that produced by average pooling.
- Now let us take a look at the attention weights. Here testing inputs are queries while training inputs are keys. Since both inputs are sorted, we can see that the closer the query-key pair is, the higher the attention weight is in the attention pooling.
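To make the nonparametric pooling concrete, here is a minimal NumPy sketch (the function name and toy data are illustrative, not from the original text):

```python
import numpy as np

def nadaraya_watson(x_query, x_train, y_train):
    """Nonparametric attention pooling with a Gaussian kernel.

    x_query: (m,) queries, x_train: (n,) keys, y_train: (n,) values.
    Returns (m,) predictions: softmax(-(x - x_i)^2 / 2) weighted sum of y_i.
    """
    # (m, n) matrix of differences between each query and each key
    diffs = x_query[:, None] - x_train[None, :]
    scores = -0.5 * diffs ** 2
    # Softmax over the key dimension gives the attention weights
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ y_train

# Toy data following the noisy target y = 2 sin(x) + x^0.8 + noise
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 5, 50))
y_train = 2 * np.sin(x_train) + x_train ** 0.8 + rng.normal(0, 0.5, 50)
x_test = np.arange(0, 5, 0.1)
y_hat = nadaraya_watson(x_test, x_train, y_train)
```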
Parametric Attention Pooling
- We can easily integrate learnable parameters into attention pooling.
⇒ The distance between the query $x$ and the key $x_i$ can be multiplied by a learnable parameter $w$:
$$f(x) = \sum_{i=1}^n \mathrm{softmax}\!\left(-\frac{1}{2}\big((x - x_i)w\big)^2\right) y_i.$$
⇒ After training the parametric attention model, we can plot its prediction.
- Compared with nonparametric attention pooling, the region with large attention weights becomes sharper in the learnable, parametric setting.
Attention Pooling and Attention Scoring Functions
- In the previous section, we obtained a probability distribution over values that are paired with keys: the output of the attention pooling is simply a weighted sum of the values based on the attention weights.
- At a high level, we can use the above algorithm to instantiate the framework of attention mechanisms.
- NB: attention weights are a probability distribution, weighted sum is a weighted average.
- More formally, suppose that we have:
- Query $\mathbf{q} \in \mathbb{R}^q$.
- $m$ key-value pairs $(\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)$, where any $\mathbf{k}_i \in \mathbb{R}^k$ and any $\mathbf{v}_i \in \mathbb{R}^v$.
⇒ Attention pooling ($f$ before) is instantiated as a weighted sum of the values:
$$f(\mathbf{q}, (\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)) = \sum_{i=1}^m \alpha(\mathbf{q}, \mathbf{k}_i)\, \mathbf{v}_i \in \mathbb{R}^v,$$
where the attention weight (scalar) for the query $\mathbf{q}$ and key $\mathbf{k}_i$ is computed by the softmax operation of an attention scoring function $a$ that maps two vectors to a scalar:
$$\alpha(\mathbf{q}, \mathbf{k}_i) = \mathrm{softmax}(a(\mathbf{q}, \mathbf{k}_i)) = \frac{\exp(a(\mathbf{q}, \mathbf{k}_i))}{\sum_{j=1}^m \exp(a(\mathbf{q}, \mathbf{k}_j))} \in \mathbb{R}.$$
⇒ Denoting an attention scoring function by $a$, the following figure illustrates how the output of attention pooling can be computed as a weighted sum of values:
Computing the output of attention pooling as a weighted average of values.
- Different choices of the attention scoring function lead to different behaviors of attention pooling.
Additive Attention
- When queries and keys are vectors of different lengths, we can use additive attention as the scoring function. Given a query $\mathbf{q} \in \mathbb{R}^q$ and a key $\mathbf{k} \in \mathbb{R}^k$, the additive attention scoring function is
$$a(\mathbf{q}, \mathbf{k}) = \mathbf{w}_v^\top \tanh(\mathbf{W}_q \mathbf{q} + \mathbf{W}_k \mathbf{k}) \in \mathbb{R},$$
where $\mathbf{W}_q \in \mathbb{R}^{h \times q}$, $\mathbf{W}_k \in \mathbb{R}^{h \times k}$, and $\mathbf{w}_v \in \mathbb{R}^h$ are learnable parameters.
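A minimal PyTorch sketch of this scoring function (class and dimension names are illustrative assumptions):

```python
import torch
from torch import nn

class AdditiveAttention(nn.Module):
    """a(q, k) = w_v^T tanh(W_q q + W_k k), followed by softmax over the keys."""

    def __init__(self, query_dim, key_dim, hidden_dim):
        super().__init__()
        self.W_q = nn.Linear(query_dim, hidden_dim, bias=False)
        self.W_k = nn.Linear(key_dim, hidden_dim, bias=False)
        self.w_v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, queries, keys, values):
        # queries: (batch, n_q, query_dim); keys: (batch, n_kv, key_dim); values: (batch, n_kv, value_dim)
        features = self.W_q(queries).unsqueeze(2) + self.W_k(keys).unsqueeze(1)
        scores = self.w_v(torch.tanh(features)).squeeze(-1)   # (batch, n_q, n_kv)
        weights = torch.softmax(scores, dim=-1)               # attention weights
        return torch.bmm(weights, values)                      # (batch, n_q, value_dim)
```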
Scaled Dot-Product Attention
- More computationally efficient but requires both the query and the key to have the same vector length $d$.
- Assume that all elements of the query and the key are i.i.d. variables with zero mean and unit variance.⇒ Their dot product then has mean zero and variance $d$.⇒ To ensure that the variance of the dot product still remains one regardless of vector length, the scaled dot-product attention scoring function is
$$a(\mathbf{q}, \mathbf{k}) = \frac{\mathbf{q}^\top \mathbf{k}}{\sqrt{d}}.$$
- In practice, we often think in mini-batches for efficiency.⇒ The scaled dot-product attention of queries $\mathbf{Q} \in \mathbb{R}^{n \times d}$, keys $\mathbf{K} \in \mathbb{R}^{m \times d}$, and values $\mathbf{V} \in \mathbb{R}^{m \times v}$ is
$$\mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\right)\mathbf{V} \in \mathbb{R}^{n \times v}.$$
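A minimal batched PyTorch sketch of scaled dot-product attention (function and argument names are assumptions for illustration):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d)) V for Q: (b, n, d), K: (b, m, d), V: (b, m, v)."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)     # (b, n, m)
    if mask is not None:
        # Positions where mask == False are excluded from the softmax
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ V                                   # (b, n, v)
```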
Masked Softmax Operation
- In some cases, not all the values should be fed into attention pooling.
- E.g.: for efficient minibatch processing, some text sequences are padded with special tokens that do not carry meaning. To get an attention pooling over only meaningful tokens as values, we can specify a valid sequence length (in number of tokens) to filter out those beyond this specified range when computing softmax. In this way, we can implement such a masked softmax operation, where any value beyond the valid length is masked as zero.
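A hedged sketch of such a masked softmax over valid lengths (names and shapes are illustrative):

```python
import torch

def masked_softmax(scores, valid_lens):
    """Softmax over the last axis, masking out positions beyond each valid length.

    scores: (batch, n_queries, n_keys); valid_lens: (batch,) number of real (non-pad) keys.
    """
    batch, n_q, n_k = scores.shape
    # True for real tokens, False for padding positions beyond the valid length
    mask = torch.arange(n_k, device=scores.device)[None, None, :] < valid_lens[:, None, None]
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1)

scores = torch.randn(2, 1, 4)
print(masked_softmax(scores, torch.tensor([2, 3])))  # padded keys receive zero attention weight
```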
Basic Attention Mechanisms
Bahdanau Attention
- As we saw in the intro, the attention mechanism was born to help memorize long source sentences in neural machine translation (NMT).
⇒ Rather than building a single context vector out of the encoder's last hidden state, the secret sauce invented by attention is to create shortcuts between the context vector and the entire source input. The weights of these shortcut connections are customizable for each output element.⇒ When predicting a token, not all the input tokens are relevant.⇒ The model aligns only to parts of the input sequence that are relevant to the current prediction.⇒ This is achieved by treating the context variable $\mathbf{c}_{t'}$ at decoding time step $t'$ as an output of attention pooling:
$$\mathbf{c}_{t'} = \sum_{t=1}^{T} \alpha(\mathbf{s}_{t'-1}, \mathbf{h}_t)\, \mathbf{h}_t,$$
where
- $T$ is the number of tokens in the input sequence.
- The decoder hidden state $\mathbf{s}_{t'-1}$ at time step $t'-1$ is the query.
- The encoder hidden states $\mathbf{h}_t$ are both the keys and values.
- The attention weight $\alpha$ is computed using the additive attention scoring function.
- Matrix of alignment scores is a nice byproduct to explicitly show the correlation between source and target words:
Alignment matrix of "L'accord sur l'Espace économique européen a été signé en août 1992" (French) and its English translation "The agreement on the European Economic Area was signed in August 1992".
Multi-Head Attention
- Given the same set of queries, keys, and values, we may want our model to combine knowledge from different behaviors of the same attention mechanism, such as capturing dependencies of various ranges (e.g., shorter-range vs. longer-range) within a sequence.
- Usually, understanding the role of a word in a sentence requires understanding how it is related to different parts of the sentence. For example, in some languages, subjects define verb inflection (e.g., gender agreement), verbs define the case of their objects, and many more. In other words, each word is part of many relations.
⇒ It may be beneficial to allow our attention mechanism to jointly use different representation subspaces of queries, keys, and values.⇒ Let the model focus on different things from different representation subspaces at different positions.⇒ Instead of performing a single attention pooling, queries, keys, and values can be transformed with $h$ independently learned linear projections. Then these $h$ projected queries, keys, and values are fed into attention pooling in parallel. In the end, the $h$ attention pooling outputs (heads) are concatenated and transformed with another learned linear projection to produce the final output.⇒ Given a query $\mathbf{q} \in \mathbb{R}^{d_q}$, a key $\mathbf{k} \in \mathbb{R}^{d_k}$, and a value $\mathbf{v} \in \mathbb{R}^{d_v}$, each attention head $\mathbf{h}_i$ ($i = 1, \ldots, h$) is computed as
$$\mathbf{h}_i = f(\mathbf{W}_i^{(q)} \mathbf{q}, \mathbf{W}_i^{(k)} \mathbf{k}, \mathbf{W}_i^{(v)} \mathbf{v}) \in \mathbb{R}^{p_v},$$
where
- $\mathbf{W}_i^{(q)} \in \mathbb{R}^{p_q \times d_q}$, $\mathbf{W}_i^{(k)} \in \mathbb{R}^{p_k \times d_k}$, and $\mathbf{W}_i^{(v)} \in \mathbb{R}^{p_v \times d_v}$ are learnable parameters.
- $f$ is attention pooling (e.g., additive attention, scaled dot-product attention, etc.).
⇒ The multi-head attention output is a linear transformation via $\mathbf{W}_o \in \mathbb{R}^{p_o \times h p_v}$ of the concatenation of the $h$ heads:
$$\mathbf{W}_o \begin{bmatrix} \mathbf{h}_1 \\ \vdots \\ \mathbf{h}_h \end{bmatrix} \in \mathbb{R}^{p_o}.$$
Multi-head attention, where multiple heads are concatenated then linearly transformed.
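A compact PyTorch sketch of multi-head scaled dot-product attention along these lines (the class name and the choice of equal head dimensions are assumptions, not from the source):

```python
import torch
from torch import nn

class MultiHeadAttention(nn.Module):
    """Project Q, K, V into num_heads subspaces, attend in parallel, concatenate, project."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

    def _split(self, x):
        # (batch, seq, d_model) -> (batch, heads, seq, d_model // heads)
        b, n, d = x.shape
        return x.view(b, n, self.num_heads, d // self.num_heads).transpose(1, 2)

    def forward(self, queries, keys, values):
        q = self._split(self.W_q(queries))
        k = self._split(self.W_k(keys))
        v = self._split(self.W_v(values))
        d_head = q.shape[-1]
        weights = torch.softmax(q @ k.transpose(-2, -1) / d_head ** 0.5, dim=-1)
        heads = weights @ v                                    # (batch, heads, n_q, d_head)
        b, h, n, dh = heads.shape
        concat = heads.transpose(1, 2).reshape(b, n, h * dh)   # concatenate the heads
        return self.W_o(concat)                                # final linear projection
```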
Self-Attention
- We often use CNNs or RNNs to encode a sequence.
⇒ Idea: with attention mechanisms in mind, try to feed a sequence of tokens into attention pooling so that the same set of tokens acts as queries, keys, and values.⇒ Since queries, keys, and values come from the same place, this performs self-attention.⇒ Given a sequence of input tokens $\mathbf{x}_1, \ldots, \mathbf{x}_n$, where any $\mathbf{x}_i \in \mathbb{R}^d$, its self-attention outputs a sequence of the same length $\mathbf{y}_1, \ldots, \mathbf{y}_n$, where
$$\mathbf{y}_i = f(\mathbf{x}_i, (\mathbf{x}_1, \mathbf{x}_1), \ldots, (\mathbf{x}_n, \mathbf{x}_n)) \in \mathbb{R}^d,$$
according to the definition of attention pooling $f$.
- E.g., $f$ can be multi-head attention.
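Using the hypothetical MultiHeadAttention sketch from the previous subsection, self-attention is simply the case where the same sequence supplies queries, keys, and values:

```python
import torch

# Reuses the illustrative MultiHeadAttention class sketched above
X = torch.randn(2, 10, 64)               # (batch, num_tokens, d_model)
self_attn = MultiHeadAttention(d_model=64, num_heads=8)
Y = self_attn(X, X, X)                   # queries = keys = values = X
print(Y.shape)                           # torch.Size([2, 10, 64]) -- same length as the input
```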
CNNs/RNNs vs. Self-Attention

- Let’s compare architectures for mapping a sequence of $n$ tokens to another sequence of equal length, where each input or output token is represented by a $d$-dimensional vector.
- CNN (consider a convolutional layer whose kernel size is $k$):
- Computational complexity of the convolutional layer is $\mathcal{O}(knd^2)$.
- Since CNNs are hierarchical, there are $\mathcal{O}(1)$ sequential operations and the maximum path length is $\mathcal{O}(n/k)$.
- RNN:
- When updating the hidden state, multiplication of the $d \times d$ weight matrix and the $d$-dimensional hidden state has a computational complexity of $\mathcal{O}(d^2)$. Since the sequence length is $n$, the computational complexity of the recurrent layer is $\mathcal{O}(nd^2)$.
- According to the figure above, there are $\mathcal{O}(n)$ sequential operations that cannot be parallelized and the maximum path length is also $\mathcal{O}(n)$.
- Self-attention:
- Queries, keys, and values are all $n \times d$ matrices. Consider the scaled dot-product attention, where an $n \times d$ matrix is multiplied by a $d \times n$ matrix, then the output $n \times n$ matrix is multiplied by an $n \times d$ matrix. As a result, self-attention has an $\mathcal{O}(n^2 d)$ computational complexity.
- As we can see in the figure, each token is directly connected to any other token via self-attention. Therefore, computation can be parallel with $\mathcal{O}(1)$ sequential operations and the maximum path length is also $\mathcal{O}(1)$.
Self-attention enjoys parallel computation and has the shortest maximum path length.
- However, the quadratic computational complexity with respect to the sequence length $n$ makes self-attention prohibitively slow for very long sequences.
Positional Encoding
- Self-attention ditches sequential operations in favor of parallel computation. ⇒ To use the sequence order information, we can inject absolute or relative positional information.⇒ Add positional encoding to input representations where these encodings can be learned or fixed.
- Fixed positional encoding can be based on sine and cosine functions [Vaswani et al., 2017].
⇒ Therefore it is called sinusoidal positional encoding.
⇒ Suppose the input representation $\mathbf{X} \in \mathbb{R}^{n \times d}$ contains the $d$-dimensional embeddings for the $n$ tokens of a sequence.⇒ The positional encoding outputs $\mathbf{X} + \mathbf{P}$ using a positional embedding matrix $\mathbf{P} \in \mathbb{R}^{n \times d}$ of the same shape, whose element on the $i$-th row and the $(2j)$-th or the $(2j+1)$-th column is
$$p_{i,2j} = \sin\!\left(\frac{i}{10000^{2j/d}}\right), \qquad p_{i,2j+1} = \cos\!\left(\frac{i}{10000^{2j/d}}\right).$$
- In the positional embedding matrix $\mathbf{P}$, rows correspond to positions within a sequence and columns represent different positional encoding dimensions.
- E.g.: from the graph below, we can see that the 6th and the 7th columns of the positional embedding matrix have a higher frequency than the 8th and the 9th columns. The offset between the 6th and the 7th (same for the 8th and the 9th) columns is due to the alternation of sine and cosine functions.
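A minimal sketch of this sinusoidal encoding (the function name and the even d_model assumption are illustrative):

```python
import torch

def sinusoidal_positional_encoding(num_positions, d_model):
    """P[i, 2j] = sin(i / 10000^(2j/d)), P[i, 2j+1] = cos(i / 10000^(2j/d)).

    Assumes d_model is even.
    """
    position = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)              # (n, 1)
    div = torch.pow(10000.0, torch.arange(0, d_model, 2, dtype=torch.float32) / d_model)  # 10000^(2j/d)
    P = torch.zeros(num_positions, d_model)
    P[:, 0::2] = torch.sin(position / div)
    P[:, 1::2] = torch.cos(position / div)
    return P

X = torch.randn(1, 60, 32)                        # token embeddings
X = X + sinusoidal_positional_encoding(60, 32)    # inject absolute position info
```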
Absolute Positional Information
- Let’s see how the monotonically decreasing frequency along the encoding dimension relates to absolute positional information.
- Since the outputs are float numbers, such continuous representations are more space-efficient than binary representations.
⇒ Let’s print out the binary representations of $0, 1, \ldots, 7$:⇒ The lowest bit, 2nd-lowest bit, and 3rd-lowest bit alternate on every number, every 2 numbers, and every 4 numbers, respectively.⇒ In binary representations, a higher bit has a lower frequency than a lower bit.⇒ Similarly, the positional encoding decreases frequencies along the encoding dimension by using trigonometric functions:
Relative Positional Information
- Besides capturing absolute positional information, the above positional encoding also allows a model to easily learn to attend by relative positions.
- This is because for any fixed position offset $\delta$, the positional encoding at position $i + \delta$ can be represented by a linear projection of that at position $i$:
$$\begin{bmatrix} \cos(\delta\omega_j) & \sin(\delta\omega_j) \\ -\sin(\delta\omega_j) & \cos(\delta\omega_j) \end{bmatrix} \begin{bmatrix} p_{i,2j} \\ p_{i,2j+1} \end{bmatrix} = \begin{bmatrix} p_{i+\delta,2j} \\ p_{i+\delta,2j+1} \end{bmatrix},$$
where $\omega_j = 1/10000^{2j/d}$ and the $2 \times 2$ projection matrix does not depend on any position index $i$.⇒ Any pair of $(p_{i,2j}, p_{i,2j+1})$ can be linearly projected to $(p_{i+\delta,2j}, p_{i+\delta,2j+1})$ for any fixed offset $\delta$.
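This follows from the angle-addition identities; a quick check using the encoding defined above, with $\omega_j = 1/10000^{2j/d}$:
$$\begin{aligned}
\cos(\delta\omega_j)\,p_{i,2j} + \sin(\delta\omega_j)\,p_{i,2j+1}
  &= \cos(\delta\omega_j)\sin(i\omega_j) + \sin(\delta\omega_j)\cos(i\omega_j)
   = \sin\big((i+\delta)\omega_j\big) = p_{i+\delta,\,2j},\\
-\sin(\delta\omega_j)\,p_{i,2j} + \cos(\delta\omega_j)\,p_{i,2j+1}
  &= -\sin(\delta\omega_j)\sin(i\omega_j) + \cos(\delta\omega_j)\cos(i\omega_j)
   = \cos\big((i+\delta)\omega_j\big) = p_{i+\delta,\,2j+1}.
\end{aligned}$$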
“Vanilla” Transformer
- We have compared CNNs, RNNs, and self-attention.⇒ Self-attention enjoys both parallel computation and the shortest maximum path length.⇒ It’s appealing to design deep architectures by using self-attention.⇒ Transformer model is solely based on attention mechanisms.
Model
- Transformer is composed of an encoder and a decoder:
- Input (source) and output (target) sequence embeddings are added with sinusoidal positional encoding before being fed into the encoder and the decoder that stack modules based on self-attention.
- Encoder is a stack of multiple identical layers, where each layer has 2 sublayers.
- The first is a multi-head self-attention pooling.
- The second is a position-wise feed-forward network.
- By position-wise (or point-wise), it means that it applies the same linear transformation (with the same weights) to each element in the sequence.
- This can also be viewed as a convolutional layer with filter size of 1.
- Inspired by the ResNet design, a residual connection is employed around both sublayers.
- The addition from the residual connection is followed by layer normalization [Ba et al., 2016].
⇒ The encoder outputs a $d$-dimensional vector representation for each position of the input sequence.⇒ Generates an attention-based representation with the capability to locate a specific piece of information from a large context.
- The decoder is similar to the encoder, except that the decoder contains two multi-head attention sublayers instead of one in each of its identical repeating layers.
- The first multi-head attention sublayer is masked to prevent positions from attending to the future. Thus, each position in the decoder is allowed to attend only to positions in the decoder up to and including that position.
- This masked attention preserves the auto-regressive property, ensuring that the prediction only depends on those output tokens that have been generated.
- The second multi-head attention sublayer is called encoder-decoder attention, where queries are from outputs of previous decoder layer, and keys and values are from transformer encoder outputs.
⇒ Retrieves information from the encoded representation.
The architecture of the vanilla Transformer model.
Positionwise Feed-Forward Network
- Position-wise FFNN transforms representation at all the sequence positions using the same MLP.
- Since the same MLP transforms at all the positions, when the inputs at all these positions are the same, their outputs are also identical.⇒ This is why it is called position-wise.
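A minimal sketch illustrating the position-wise property (class and dimension names are illustrative):

```python
import torch
from torch import nn

class PositionWiseFFN(nn.Module):
    """The same two-layer MLP applied independently at every sequence position."""

    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        # x: (batch, seq_len, d_model); nn.Linear acts on the last dim,
        # so every position is transformed by the same weights.
        return self.net(x)

ffn = PositionWiseFFN(d_model=8, d_hidden=16)
x = torch.ones(1, 3, 8)                       # identical inputs at all 3 positions...
print(ffn(x)[0, 0].allclose(ffn(x)[0, 1]))    # ...give identical outputs: True
```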
Transformers for Vision
- Transformer architecture was initially proposed for Seq2Seq learning, with a focus on machine translation. ⇒ Later on, it emerged as the model of choice in various natural language processing tasks.⇒ In the field of computer vision the dominant architecture has remained the CNN.⇒ Researchers started to wonder if it’s possible to do better by adapting Transformer to image data.⇒ Transformers have also become a game-changer in computer vision.
- Vision Transformers (ViTs) extract patches from images and feed them into a Transformer encoder to obtain a global representation, which will finally be transformed into the output label:
- Consider an input image with height $h$, width $w$, $c$ channels, and patch height and width both equal to $p$. ⇒ The image is split into a sequence of $m = hw/p^2$ patches (each patch is flattened to a vector of length $cp^2$).⇒ Image patches can be treated similarly to tokens in text sequences by Transformer encoders.
The vision Transformer architecture. In this example, an image is split into nine patches. A special “<cls>” token and the nine flattened image patches are transformed via patch embedding and $n$ Transformer encoder blocks into ten representations, respectively. The “<cls>” representation is further transformed into the output label.
Patch Embedding
- Splitting an image into patches and linearly projecting these flattened patches can be simplified as a single convolution operation, where both the kernel size and the stride size are set to the patch size:
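A hedged sketch of such a patch embedding as a strided convolution (the default sizes follow the common ViT-Base setup and are assumptions here):

```python
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and linearly project them, implemented as a
    single Conv2d whose kernel size and stride both equal the patch size."""

    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (batch, channels, height, width)
        x = self.proj(x)                     # (batch, embed_dim, h/p, w/p)
        return x.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)

patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(patches.shape)   # torch.Size([1, 196, 768])
```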
ViT’s MLP
- The MLP of the vision Transformer encoder is slightly different from the position-wise FFN of the original Transformer encoder.
- The activation function uses GELU, which is a smoother version of ReLU.
- Dropout is applied to the output of each fully connected layer in the MLP for regularization.
ViT’s “Add & Norm” Layer
- Normalization is applied right before the multi-head attention or MLP (not after as in vanilla Transformer).
Summary
- The input image is fed into `PatchEmbedding`, whose output is concatenated with the `<cls>` token embedding.⇒ It is summed with learnable positional embeddings before dropout.⇒ Then the output is fed into the Transformer encoder that stacks `num_blks` instances of the `ViTBlock` class.⇒ Finally, the representation of the `<cls>` token is projected by the network head.
- For smaller datasets like ImageNet (1.2M images), the vision Transformer does not outperform ResNet.
- Transformers lack useful principles in convolution, such as translation invariance and locality.
⇒ However, the picture changes when training larger models on larger datasets (e.g., 300M images).⇒ Vision Transformers outperform ResNets by a large margin in image classification.⇒ Superiority of Transformers in scalability.
- It was shown to be effective with data-efficient training strategies of DeiT (Touvron et al., 2021).
- Quadratic complexity of self-attention makes the architecture less suitable for higher-resolution images. ⇒ Swin Transformers addressed the quadratic computational complexity with respect to image size.
Transformer Optimizations
- As we have seen, the Transformer architecture is heavily bottlenecked by the self-attention mechanism, which has quadratic time and memory complexity in the sequence length.⇒ Various works try to improve the Transformer architecture and the attention operation.
Improved Attention Span
- Goal: make the context that can be used in self-attention longer, more efficient, and more flexible.
Longer Attention Span: Transformer-XL
“XL” means “extra long”
- The vanilla Transformer has a fixed and limited attention span.⇒ The model can only attend to other elements in the same segment during each update step and no information can flow across separated fixed-length segments.⇒ This context segmentation causes several issues:
- The model cannot capture very long-term dependencies.
- It is hard to predict the first few tokens in each segment given no or thin context.
- The evaluation is expensive: whenever the segment is shifted to the right by one, the new segment is re-processed from scratch, although there are a lot of overlapping tokens.
⇒ Transformer-XL solves the context segmentation problem with two main modifications:
- Reusing hidden states between segments.
- Adopting a new positional encoding that is suitable for reused states.
Hidden State Reuse
- The recurrent connection between segments is introduced into the model by continuously using the hidden states from the previous segments:
- Let's label the hidden state of the $n$-th layer for the $(\tau+1)$-th segment in the model as $\mathbf{h}_{\tau+1}^{(n)} \in \mathbb{R}^{L \times d}$.⇒ In addition to the hidden state of the last layer for the same segment, $\mathbf{h}_{\tau+1}^{(n-1)}$, it also depends on the hidden state of the same layer for the previous segment, $\mathbf{h}_{\tau}^{(n)}$.⇒ By incorporating information from the previous hidden states, the model extends the attention span much longer in the past, over multiple segments:
$$\begin{aligned}
\tilde{\mathbf{h}}_{\tau+1}^{(n-1)} &= \left[\text{stop-gradient}\big(\mathbf{h}_{\tau}^{(n-1)}\big) \circ \mathbf{h}_{\tau+1}^{(n-1)}\right] \\
\mathbf{Q}_{\tau+1}^{(n)} &= \mathbf{h}_{\tau+1}^{(n-1)} \mathbf{W}^q \\
\mathbf{K}_{\tau+1}^{(n)} &= \tilde{\mathbf{h}}_{\tau+1}^{(n-1)} \mathbf{W}^k \\
\mathbf{V}_{\tau+1}^{(n)} &= \tilde{\mathbf{h}}_{\tau+1}^{(n-1)} \mathbf{W}^v \\
\mathbf{h}_{\tau+1}^{(n)} &= \text{transformer-layer}\big(\mathbf{Q}_{\tau+1}^{(n)}, \mathbf{K}_{\tau+1}^{(n)}, \mathbf{V}_{\tau+1}^{(n)}\big)
\end{aligned}$$
- Note the difference between $\mathbf{h}_{\tau+1}^{(n-1)}$ and the extended hidden state $\tilde{\mathbf{h}}_{\tau+1}^{(n-1)}$.
- NB: both key and value rely on the extended hidden state, while the query only consumes the hidden state at the current step.
- The concatenation operation $[\cdot \circ \cdot]$ is along the sequence length dimension.
A comparison between the training phase of the vanilla Transformer & Transformer-XL with a segment length of 4. (Image source: left part of Figure 2 in Dai et al., 2019)
Relative Positional Encoding
- In order to work with this new form of attention span, Transformer-XL proposed a new type of positional encoding based on reparametrization of dot-product of keys and queries.
- Q: Why?
A: If using the same approach as the vanilla Transformer and encoding the absolute position, the previous and current segments would be assigned the same encoding, which is undesired.
⇒ To keep the positional information flowing coherently across segments, Transformer-XL encodes the relative position instead, as it can be sufficient to know the position offset $i - j$ between one key vector $\mathbf{k}_{\tau,j}$ and its query $\mathbf{q}_{\tau,i}$ for making good predictions.
⇒ If omitting the scalar $1/\sqrt{d_k}$ and the normalizing term in softmax but including positional encodings, we can write the attention score between query $\mathbf{q}_i$ at position $i$ and key $\mathbf{k}_j$ at position $j$ as:
$$a_{ij} = \mathbf{q}_i \mathbf{k}_j^\top = (\mathbf{x}_i + \mathbf{p}_i)\mathbf{W}^q \big((\mathbf{x}_j + \mathbf{p}_j)\mathbf{W}^k\big)^\top = \mathbf{x}_i\mathbf{W}^q {\mathbf{W}^k}^\top\mathbf{x}_j^\top + \mathbf{x}_i\mathbf{W}^q {\mathbf{W}^k}^\top\mathbf{p}_j^\top + \mathbf{p}_i\mathbf{W}^q {\mathbf{W}^k}^\top\mathbf{x}_j^\top + \mathbf{p}_i\mathbf{W}^q {\mathbf{W}^k}^\top\mathbf{p}_j^\top$$
⇒ Transformer-XL reparameterizes the above four terms as follows:
$$a^{\text{rel}}_{ij} = \underbrace{\mathbf{x}_i\mathbf{W}^q {\mathbf{W}_E^k}^\top\mathbf{x}_j^\top}_{\text{content-based addressing}} + \underbrace{\mathbf{x}_i\mathbf{W}^q {\mathbf{W}_R^k}^\top\mathbf{r}_{i-j}^\top}_{\text{content-dependent positional bias}} + \underbrace{\mathbf{u}\,{\mathbf{W}_E^k}^\top\mathbf{x}_j^\top}_{\text{global content bias}} + \underbrace{\mathbf{v}\,{\mathbf{W}_R^k}^\top\mathbf{r}_{i-j}^\top}_{\text{global positional bias}}$$
- Replaces $\mathbf{p}_j$ with relative positional encoding $\mathbf{r}_{i-j} \in \mathbb{R}^d$.
- Replaces $\mathbf{p}_i\mathbf{W}^q$ with 2 trainable parameters $\mathbf{u}$ (for content) and $\mathbf{v}$ (for location) in 2 different terms.
- Splits $\mathbf{W}^k$ into two matrices, $\mathbf{W}^k_E$ for content information and $\mathbf{W}^k_R$ for location information.
Adaptive Attention Span
- One key advantage of the Transformer is the capability of capturing long-term dependencies. ⇒ Depending on the context, the model may prefer to attend further back at some times than at others; or one attention head may have a different attention pattern from the others.⇒ If the attention span could adapt its length flexibly and only attend further back when needed, it would help reduce both computation and memory cost to support a longer maximum context size in the model.⇒ This is the motivation for Adaptive Attention Span: a self-attention mechanism that seeks an optimal attention span.
- The authors hypothesized that different attention heads might assign scores differently within the same context window:
Two attention heads in the same model, A & B, assign attention differently within the same context window. Head A attends more to the recent tokens, while head B looks further back into the past uniformly. (Image source: Sukhbaatar, et al. 2019) ⇒ The optimal span should be trained separately per head.
- Given the $i$-th token, we need to compute the attention weights between this token and other keys at positions $j \in S_i$, where $S_i$ defines the $i$-th token's context window:
$$e_{ij} = \mathbf{q}_i \mathbf{k}_j^\top, \qquad a_{ij} = \mathrm{softmax}(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{r \in S_i} \exp(e_{ir})}, \qquad \mathbf{y}_i = \sum_{r \in S_i} a_{ir}\, \mathbf{v}_r.$$
- A soft mask function $m_z$ is added to control for an effective adjustable attention span, which maps the distance between query and key into a $[0, 1]$ value. $m_z$ is parameterized by $z$ and $z$ is to be learned:
$$m_z(x) = \mathrm{clip}\!\left(\frac{1}{R}(R + z - x),\ 0,\ 1\right),$$
where $R$ is a hyper-parameter which defines the softness of $m_z$.
The soft masking function used in the adaptive attention span. (Image source: Sukhbaatar, et al. 2019.)
⇒ The soft mask function is applied to the softmax elements in the attention weights:
$$a_{ij} = \frac{m_z(i - j)\exp(e_{ij})}{\sum_{r \in S_i} m_z(i - r)\exp(e_{ir})}.$$
- $m_z$ is differentiable so it is trained jointly with other parts of the model (a small sketch of $m_z$ appears at the end of this subsection).
- Parameters $z^{(i)}$, $i = 1, \ldots, h$, are learned separately per head.
- Moreover, the loss function has an extra L1 penalty on $\sum_{i=1}^h z^{(i)}$.
- In the experiments of Transformer with adaptive attention span, Sukhbaatar, et al. (2019) found a general tendency that lower layers do not require very long attention spans, while a few attention heads in higher layers may use exceptionally long spans.
- Adaptive attention span also helps greatly reduce the number of FLOPS, especially in a big model with many attention layers and a large context length.
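For illustration, a tiny sketch of the soft mask $m_z$ described above (the value of R here is an arbitrary choice, not from the paper):

```python
import torch

def soft_mask(distances, z, R=32.0):
    """m_z(x) = clip((R + z - x) / R, 0, 1): equal to 1 for recent positions and
    decaying linearly to 0 for positions further back than the learned span z."""
    return torch.clamp((R + z - distances) / R, min=0.0, max=1.0)

distances = torch.arange(0, 200, dtype=torch.float32)   # query-key distance i - j
print(soft_mask(distances, z=torch.tensor(100.0))[::50]) # 1.0 ... fading to 0.0
```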
Sparse Attention Matrix Factorization: Sparse Transformers
- Sparse Transformer introduced factorized self-attention, through sparse matrix factorization.⇒ Makes it possible to train dense attention networks with hundreds of layers on sequences of length up to 16,384, which would be infeasible on modern hardware otherwise.
- Sparse Transformer introduced a set of attention connectivity patterns $\mathcal{S} = \{S_1, \ldots, S_n\}$, where each $S_i$ records a set of key positions that the $i$-th query vector attends to:
$$\operatorname{Attend}(\mathbf{X}, \mathcal{S}) = \Big(a(\mathbf{x}_i, S_i)\Big)_{i \in \{1, \ldots, L\}},$$
where
$$a(\mathbf{x}_i, S_i) = \mathrm{softmax}\!\left(\frac{(\mathbf{x}_i \mathbf{W}^q)\big(\mathbf{x}_j \mathbf{W}^k\big)^\top_{j \in S_i}}{\sqrt{d_k}}\right)\big(\mathbf{x}_j \mathbf{W}^v\big)_{j \in S_i}.$$
- NB: the size of $S_i$ is not fixed, while $a(\mathbf{x}_i, S_i)$ is always of size $d_v$, and thus $\operatorname{Attend}(\mathbf{X}, \mathcal{S}) \in \mathbb{R}^{L \times d_v}$.
- In auto-regressive models, one attention span is defined as $S_i = \{j : j < i\}$, as it allows each token to attend to all the positions in the past. In factorized self-attention, the set $S_i$ is decomposed into a tree of dependencies, such that for every pair of $(i, j)$ where $j \le i$, there is a path connecting $i$ back to $j$ and $i$ can attend to $j$ either directly or indirectly.⇒ Precisely, the set $S_i$ is divided into $p$ non-overlapping subsets, where the $m$-th subset is denoted as $A_i^{(m)} \subset S_i$, $m = 1, \ldots, p$.
- Sparse Transformer proposed two types of factorized attention:
- Strided attention with stride $\ell \sim \sqrt{n}$:
$$A_i^{(1)} = \{t, t+1, \ldots, i\} \text{ with } t = \max(0,\, i - \ell), \qquad A_i^{(2)} = \{j : (i - j) \bmod \ell = 0\}.$$
- Works well with image data as the structure is aligned with strides. In the image case, each pixel would attend to all the previous $\ell$ pixels in the raster scanning order (which naturally cover the entire width of the image) and then those pixels attend to others in the same column (defined by another attention connectivity subset).
- Fixed attention:
$$A_i^{(1)} = \{j : \lfloor j/\ell \rfloor = \lfloor i/\ell \rfloor\}, \qquad A_i^{(2)} = \{j : j \bmod \ell \in \{\ell - c, \ldots, \ell - 1\}\}.$$
- A small set of tokens summarize previous locations and propagate that information to all future locations.
- Here $c$ is a hyperparameter (if $c = 1$, it restricts the representation whereas many depend on a few positions; the paper chose $c \in \{8, 16, 32\}$ for $\ell \in \{128, 256\}$).
The top row illustrates the attention connectivity patterns in (a) Transformer, (b) Sparse Transformer with strided attention, and (c) Sparse Transformer with fixed attention. The bottom row contains corresponding self-attention connectivity matrices. Note that the top and bottom rows are not in the same scale. (Image source: Child et al., 2019 + a few extra annotations.)
- There are three ways to use sparse factorized attention patterns in Transformer architecture:
- One attention type per residual block, and then interleave them: $\operatorname{attention}(\mathbf{X}) = \operatorname{Attend}(\mathbf{X}, A^{(n \bmod p)})\, \mathbf{W}^o$, where $n$ is the index of the current residual block.
- Set up a single head which attends to locations that all the factorized heads attend to: $\operatorname{attention}(\mathbf{X}) = \operatorname{Attend}(\mathbf{X}, \cup_{m=1}^p A^{(m)})\, \mathbf{W}^o$.
- Use a multi-head attention mechanism, but different from the vanilla Transformer, each head might adopt a pattern presented above, 1 or 2.
Locality-Sensitive Hashing: Reformer
- Reformer proposed two main changes:
- Replace the dot-product attention with locality-sensitive hashing (LSH) attention, reducing the complexity from $\mathcal{O}(L^2)$ to $\mathcal{O}(L \log L)$.
- Replace the standard residual blocks with reversible residual layers, which allows storing activations only once during training instead of $N$ times (i.e., proportional to the number of layers).
The LSH attention consists of 4 steps: bucketing, sorting, chunking, and attention computation. (Image source: left part of Figure 1 in Kitaev, et al. 2020).
KV Caching
- Applicable only to auto-regressive inference, not training.
- In self-attention, for each token in the input sequence, the model computes key and value vectors.⇒ During auto-regressive generation, this computation would be repeated unnecessarily for previously seen tokens.⇒ KV caching solves this by storing the computed vectors for past tokens and reusing them in future steps, avoiding redundant computations:
- Keys (K) and values (V) for past tokens are computed once and cached.
- Only the new token's K and V are computed and appended to the cache.
- Attention operates over the cached K and V, avoiding recomputation.
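A minimal single-head sketch of a KV cache during decoding (function and variable names are illustrative, not a real library API):

```python
import torch

def generate_step(new_token_hidden, W_q, W_k, W_v, kv_cache):
    """One auto-regressive decoding step with a KV cache.

    new_token_hidden: (batch, 1, d) hidden state of the newly generated token.
    kv_cache: dict holding the K and V of all previously seen tokens.
    """
    # Compute K and V only for the new token and append them to the cache.
    k_new, v_new = new_token_hidden @ W_k, new_token_hidden @ W_v
    kv_cache["K"] = torch.cat([kv_cache["K"], k_new], dim=1) if "K" in kv_cache else k_new
    kv_cache["V"] = torch.cat([kv_cache["V"], v_new], dim=1) if "V" in kv_cache else v_new

    # Attention for the new query runs over the cached keys/values (no recomputation).
    q = new_token_hidden @ W_q
    scores = q @ kv_cache["K"].transpose(-2, -1) / kv_cache["K"].shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return weights @ kv_cache["V"]

d = 16
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
cache = {}
for _ in range(5):  # five decoding steps, each reusing the cached K/V
    out = generate_step(torch.randn(1, 1, d), W_q, W_k, W_v, cache)
```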
Flash Attention
- Attention algorithm used to scale transformer models more efficiently, enabling faster training and inference.
- Proposed to optimize the attention computations by writing custom CUDA kernels.⇒ Makes them much faster and more memory efficient.
- Standard attention mechanism relies on High Bandwidth Memory (HBM) to write/read keys/queries/values.
- NB: global memory of the GPU is confusingly called the "High Bandwidth Memory" here…
- HBM offers large capacity and high bandwidth but has higher latency compared to on-chip memory like SRAM / Shared Memory.
- Model loads this data from HBM into GPU on-chip Shared Memory.
- Performs a step of the attention mechanism.
- Writes the result back to HBM.
- Repeats this process for each attention step.
⇒ In the standard implementation, the cost of writing/reading keys/queries/values from HBM is high:
The basic implementation of the attention mechanism involves a lot of transfer between memory and workers. It requires materializing the S and P matrices in HBM, which means that the results need to be sent to HBM and then back to SRAM for the next computations.
⇒ Flash Attention makes efficient use of the various GPU memories to avoid relying too much on the slowest one (global memory).
- FlashAttention optimizes this workflow by loading keys, queries, and values into Shared Memory once, fusing the operations of the attention mechanism (such as softmax and matrix multiplication), and writing the result back to HBM:
- The key element is to compute the S matrix in small pieces which can fit in the smaller shared memory of the SM.→ But we can do even better and avoid materializing the very large S matrix altogether, in favor of keeping only the necessary statistics for computing the normalization factor of the softmax.→ We can compute part of O directly in one computation in SRAM rather than moving intermediate results back and forth.→ Not only do we make use of the shared memory, but we also release the memory bottleneck resulting from materializing one of the largest activation matrices in the model (at long context length): the attention matrix.
- By avoiding materializing the S matrix, we reduce the memory burden of attention.
- We also remove a large part of the naive impact of the $O(S^2)$ memory cost of attention.
Left: FlashAttention uses tiling to prevent materialization of the large attention matrix (dotted box) on (relatively) slow GPU HBM. In the outer loop (red arrows), FlashAttention loops through blocks of the K and V matrices and loads them to fast on-chip SRAM. In each block, FlashAttention loops over blocks of the Q matrix (blue arrows), loading them to SRAM, and writing the output of the attention computation back to HBM. Right: Speedup over the PyTorch implementation of attention on GPT-2. FlashAttention does not read and write the large attention matrix to HBM, resulting in a large speedup on the attention computation.
⇒ Significantly reduces memory access overhead and improves performance.
⇒ All variants of linear attention and sub-quadratic approaches to approximate attention, developed shortly after the invention of the Transformer architecture, have been mostly put aside in favor of this exact and fast Flash Attention implementation and mechanism.
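For intuition only, here is a toy NumPy sketch of the online-softmax/tiling idea behind FlashAttention for a single query; the real implementation is a fused CUDA kernel, and all names here are illustrative:

```python
import numpy as np

def streaming_attention(q, K, V, block=64):
    """Numerically-stable attention for one query, computed block by block without
    ever materializing the full score vector (the tiling idea behind FlashAttention,
    shown here in plain NumPy rather than as a fused GPU kernel)."""
    d = q.shape[-1]
    m = -np.inf          # running max of the scores (for a stable softmax)
    l = 0.0              # running softmax normalizer
    o = np.zeros(d)      # running un-normalized output
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = k_blk @ q / np.sqrt(d)          # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)           # rescale previously accumulated statistics
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        o = o * scale + p @ v_blk
        m = m_new
    return o / l

q = np.random.randn(32); K = np.random.randn(1000, 32); V = np.random.randn(1000, 32)
reference = (lambda s: np.exp(s - s.max()) / np.exp(s - s.max()).sum())(K @ q / np.sqrt(32)) @ V
assert np.allclose(streaming_attention(q, K, V), reference)  # matches full attention
```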
- Following Flash-attention 1, two successive improved versions have been released by the same lab: Flash-attention 2 and 3.
- In comparison to Flash-attention 1, the improvements in Flash-attention 2 and 3 are less about the general attention mechanism than about tailoring its low-level implementation more specifically to the GPU: (1) reducing the number of non-matmul operations as much as possible, (2) carefully partitioning the workload among warps and thread blocks (for Flash Attention 2), and (3) carefully optimizing for FP8 and Tensor Core support on the latest Hopper (H100) architecture (for Flash Attention 3).