Vision transformer

A vision transformer (ViT) is a transformer designed for computer vision. A ViT breaks down an input image into a series of patches (rather than breaking up text into tokens), serialises each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.

ViT has found applications in image recognition, image segmentation, and autonomous driving.

History
Transformers were introduced in 2017, in the paper "Attention Is All You Need", and have found widespread use in natural language processing. In 2020, they were adapted for computer vision, yielding ViT.

In 2021 a pure transformer model demonstrated better performance and greater efficiency than CNNs on image classification.

A study in June 2021 added a transformer backend to ResNet, which dramatically reduced costs and increased accuracy.

In the same year, several important variants of the Vision Transformer were proposed. These variants are mainly intended to be more efficient, more accurate, or better suited to a specific domain. Among the most relevant is the Swin Transformer, which, through modifications to the attention mechanism and a multi-stage approach, achieved state-of-the-art results on object detection datasets such as COCO. Another notable variant is the TimeSformer, which is designed for video understanding tasks and captures spatial and temporal information through divided space-time attention.

Overview
The basic architecture, used by the original 2020 paper, is as follows. In summary, it is a BERT-like encoder-only Transformer.

The input image is of type $$\mathbb{R}^{H\times W \times C}$$, where $$H, W, C$$ are the height, width, and number of channels (3 for RGB). It is then split into square patches of type $$\mathbb{R}^{P\times P \times C}$$.

Each patch is pushed through a linear operator to obtain a vector (the "patch embedding"). The position of the patch is also transformed into a vector by "position encoding". The two vectors are added, and the sum is pushed through several Transformer encoder layers.
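The patch-splitting and embedding steps above can be sketched in plain Python. This is a minimal illustration, not the original implementation: the toy sizes, the embedding width `D`, the helper names `patchify` and `linear`, and the random initialisation are all assumptions made for the example, and the position vectors are simply random stand-ins for learned ones.

```python
import random

def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened P x P x C patches."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "sketch assumes exact tiling"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [value
                    for row in image[top:top + patch]
                    for pixel in row[left:left + patch]
                    for value in pixel]
            patches.append(flat)  # each flattened patch has length P * P * C
    return patches

def linear(vector, weights):
    """Multiply a length-n vector by a matrix stored as n rows of d columns."""
    return [sum(v * row[j] for v, row in zip(vector, weights))
            for j in range(len(weights[0]))]

rng = random.Random(0)
H, W, C, P, D = 8, 8, 3, 4, 6  # toy sizes; D is a hypothetical embedding width
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]

patches = patchify(image, P)  # (H / P) * (W / P) patches
embed = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in range(P * P * C)]
pos = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in patches]  # one per position

# Patch embedding plus position vector gives the encoder's input tokens.
tokens = [[e + p for e, p in zip(linear(flat, embed), pos_vec)]
          for flat, pos_vec in zip(patches, pos)]
print(len(tokens), len(tokens[0]))  # 4 6
```

An 8x8 image with 4x4 patches yields four tokens, each of width `D`; a practical implementation would use an array library rather than nested lists.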

The attention mechanism in a ViT repeatedly transforms the representation vectors of image patches, incorporating more and more semantic relations between the patches of an image. This is analogous to natural language processing, where representation vectors flowing through a transformer incorporate more and more relations between words, from syntax to semantics.

The above architecture turns an image into a sequence of vector representations. To use these for downstream applications, an additional head needs to be trained to interpret them.

For example, to use it for classification, one can add a shallow MLP on top of it that outputs a probability distribution over classes. The original paper uses a linear-GeLU-linear-softmax network.
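As a rough sketch of such a head, the following plain-Python code chains the layers in the order stated above (linear, GeLU, linear, softmax). The layer widths, the random weights, and the names `classify`, `w1`, `b1`, `w2`, `b2` are assumptions made for illustration only; in practice the weights are learned.

```python
import math
import random

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(vector, weights, bias):
    """Affine map: weights is stored as len(vector) rows of len(bias) columns."""
    return [sum(v * row[j] for v, row in zip(vector, weights)) + bias[j]
            for j in range(len(bias))]

def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(vector, w1, b1, w2, b2):
    """linear -> GeLU -> linear -> softmax over classes."""
    hidden = [gelu(h) for h in linear(vector, w1, b1)]
    return softmax(linear(hidden, w2, b2))

rng = random.Random(0)
D, HIDDEN, CLASSES = 6, 8, 3  # hypothetical sizes
w1 = [[rng.gauss(0, 0.5) for _ in range(HIDDEN)] for _ in range(D)]
b1 = [0.0] * HIDDEN
w2 = [[rng.gauss(0, 0.5) for _ in range(CLASSES)] for _ in range(HIDDEN)]
b2 = [0.0] * CLASSES

representation = [rng.random() for _ in range(D)]  # stand-in for an encoder output
probs = classify(representation, w1, b1, w2, b2)
print(probs)
```

The output is a list of class probabilities that sums to one.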

Original ViT
Transformers found their initial applications in natural language processing tasks, as demonstrated by language models such as BERT and GPT-3. By contrast, the typical image processing system uses a convolutional neural network (CNN). Well-known CNN architectures include Xception, ResNet, EfficientNet, DenseNet, and Inception.

Transformers measure the relationships between pairs of input tokens (words in the case of text strings), termed attention. The cost is quadratic in the number of tokens. For images, the basic unit of analysis is the pixel. However, computing relationships for every pixel pair in a typical image is prohibitive in terms of memory and computation. Instead, ViT divides the image into small sections (e.g., 16x16 pixels) and computes relationships among these sections, at a drastically reduced cost. Each section is flattened into a vector and multiplied by the embedding matrix. A position embedding, which is a learnable vector, is added to the result, and the resulting sequence is fed to the transformer.
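The size of the saving can be checked with a short calculation. The 224x224 input resolution below is an assumption chosen for illustration, and `attention_pairs` is a hypothetical helper; the count is the number of ordered token pairs a single full self-attention layer compares.

```python
def attention_pairs(height, width, patch=1):
    """Return (token count, ordered token pairs) for a given patch size."""
    tokens = (height // patch) * (width // patch)
    return tokens, tokens * tokens

pixel_tokens, pixel_pairs = attention_pairs(224, 224)            # one token per pixel
patch_tokens, patch_pairs = attention_pairs(224, 224, patch=16)  # 16x16 patches

print(pixel_tokens, pixel_pairs)   # 50176 2517630976
print(patch_tokens, patch_pairs)   # 196 38416
print(pixel_pairs // patch_pairs)  # 65536
```

Because the cost is quadratic, cutting the token count by a factor of 256 cuts the number of compared pairs by a factor of 256 squared.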

As in the case of BERT, a fundamental role in classification tasks is played by the class token: a special token whose output is used as the only input of the final MLP head, as it has been influenced by all the others.
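A toy sketch can show how the class token ends up mixing information from every patch. The attention layer below is heavily simplified (a single head with identity query, key, and value maps, which is an assumption for brevity), and the token values are made up; a real class token is a learned vector.

```python
import math

def self_attention(tokens):
    """Single-head attention with identity query/key/value maps (a simplification)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j] for w, v in zip(weights, tokens))
                    for j in range(dim)])
    return out

cls_token = [0.0, 0.0, 0.0]  # learnable in practice; fixed here
patch_tokens = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]

sequence = [cls_token] + patch_tokens  # the class token is placed first
encoded = self_attention(sequence)
cls_output = encoded[0]                # the only vector handed to the MLP head
print(cls_output)  # [0.25, 0.5, 0.75]
```

With an all-zero class token the attention weights are uniform, so its output is the average of all four tokens, which makes the dependence on every patch easy to see.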

The architecture for image classification is the most common and uses only the Transformer encoder to transform the input tokens. However, there are other applications in which the decoder part of the traditional Transformer architecture is also used.

Masked Autoencoder
In the Masked Autoencoder, two ViTs are put end-to-end. The first one takes in image patches with positional encoding, and outputs vectors representing each patch. The second one takes in vectors with positional encoding and outputs image patches again. During training, both ViTs are used. An image is cut into patches, and only 25% of the patches are put into the first ViT. The second ViT takes the encoded vectors and outputs a reconstruction of the full image. After training, only the first ViT is used.
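The patch selection step can be sketched as follows. Uniform random sampling, the fixed seed, the patch count of 196, and the helper name `split_visible_masked` are assumptions for this example; only the 25% figure comes from the text above.

```python
import random

def split_visible_masked(num_patches, keep_ratio=0.25, seed=0):
    """Randomly choose which patch indices the first ViT gets to see."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_keep = int(num_patches * keep_ratio)
    visible = sorted(order[:num_keep])  # fed to the first ViT
    masked = sorted(order[num_keep:])   # to be reconstructed by the second ViT
    return visible, masked

visible, masked = split_visible_masked(196)
print(len(visible), len(masked))  # 49 147
```

Since the first ViT only processes the visible quarter of the patches, its input sequence during training is correspondingly shorter.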

Swin Transformer
The Swin Transformer ("Shifted windows") takes inspiration from standard convolutional neural networks:


 * Instead of performing self-attention over the entire sequence of tokens, one for each patch, it performs "shifted window based" self-attention, which means only performing attention over square-shaped blocks of patches. One block of patches is analogous to the receptive field of one convolution.
 * After every few attention blocks, there is a "merge layer", which merges neighboring 2x2 tokens into a single token. This is analogous to pooling (by 2x2 convolution kernels, with stride 2). Merging means concatenation followed by multiplication with a matrix.
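Both ideas can be sketched in plain Python on a toy 4x4 grid of tokens. This is an illustrative sketch under several assumptions: the shift is implemented as a cyclic roll of the grid, the order in which the four neighbours are concatenated is arbitrary, the merge matrix halves the concatenated width, and all names and values are made up for the example.

```python
def window_partition(rows, cols, window, shift=0):
    """Group token indices of a rows x cols grid into window x window blocks.

    A non-zero shift cyclically rolls the grid before partitioning.
    """
    windows = []
    for top in range(0, rows, window):
        for left in range(0, cols, window):
            block = [((r + shift) % rows) * cols + (c + shift) % cols
                     for r in range(top, top + window)
                     for c in range(left, left + window)]
            windows.append(block)
    return windows

def merge_2x2(grid):
    """Concatenate each 2x2 neighbourhood of token vectors into one longer vector."""
    return [[grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]
             for c in range(0, len(grid[0]), 2)]
            for r in range(0, len(grid), 2)]

def project(vector, weights):
    """Multiply the concatenated vector by the merge matrix."""
    return [sum(v * row[j] for v, row in zip(vector, weights))
            for j in range(len(weights[0]))]

print(window_partition(4, 4, 2)[0])           # [0, 1, 4, 5]
print(window_partition(4, 4, 2, shift=1)[0])  # [5, 6, 9, 10]

grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]  # token width 2
merged = merge_2x2(grid)                      # 2 x 2 grid, token width 8
weights = [[0.5] * 4 for _ in range(8)]       # hypothetical 8 -> 4 merge matrix
reduced = project(merged[0][0], weights)
print(len(merged), len(merged[0]), len(merged[0][0]), reduced)
```

Attention would then be computed separately inside each index block returned by `window_partition`, and the merge step quarters the number of tokens passed to the next stage.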

It is improved upon by Swin Transformer V2, which modifies the ViT with a different attention mechanism (Figure 1):


 * layernorm immediately after each attention and feedforward layer ("res-post-norm");
 * scaled cosine attention to replace the original dot product attention;
 * log-spaced continuous relative position bias, which allows transfer learning across different window resolutions.
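The second item can be illustrated with a small sketch: attention scores are cosine similarities divided by a temperature, rather than raw dot products. The temperature name `tau`, its value of 0.1, and the omission of the relative position bias are assumptions of this example, and the vectors are assumed to be non-zero.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_attention_weights(query, keys, tau=0.1):
    """Softmax over cosine similarities scaled by a temperature tau."""
    scores = [cosine(query, k) / tau for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
small = cosine_attention_weights([2.0, 1.0], keys)
large = cosine_attention_weights([200.0, 100.0], keys)  # same direction, 100x larger
print(small)
print(large)
```

The two printed weight lists agree up to rounding, because cosine similarity ignores vector magnitude; with dot-product attention the larger query would produce far more extreme scores.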

ViT-VQGAN
In ViT-VQGAN, there are two ViT encoders and a discriminator. One encodes 8x8 patches of an image into a list of vectors, one for each patch. The vectors can only come from a discrete set called the "codebook", as in vector quantization. The other encodes the quantized vectors back into image patches. The training objective attempts to make the reconstructed image (the output image) faithful to the input image. The discriminator (usually a convolutional network, but other networks are allowed) attempts to decide whether an image is an original real image or an image reconstructed by the ViT.

The idea is essentially the same as a vector quantized variational autoencoder (VQVAE) combined with a generative adversarial network (GAN).
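The quantization step, in which each encoded vector is replaced by a codebook entry, can be sketched as a nearest-neighbour lookup. Squared Euclidean distance, the tiny four-entry codebook, and the helper name `quantize` are assumptions chosen for illustration; the actual model may measure distance differently.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry nearest to vector (squared L2)."""
    def sq_dist(entry):
        return sum((v - e) ** 2 for v, e in zip(vector, entry))
    index = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return index, codebook[index]

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy codebook
encoded = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]              # one vector per patch

symbols = [quantize(v, codebook)[0] for v in encoded]
print(symbols)  # [0, 3, 2]
```

The resulting list of indices is the discrete representation of the image: each index stands for one codebook vector, from which the second ViT reconstructs the patches.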

After such a ViT-VQGAN is trained, it can be used to encode an arbitrary image into a list of symbols, and to decode an arbitrary list of symbols into an image. The list of symbols can be used to train a standard autoregressive transformer (like GPT) to generate images autoregressively. Further, one can take a list of caption-image pairs, convert the images into strings of symbols, and train a standard GPT-style transformer. Then at test time, one can just give an image caption, and have it autoregressively generate the image. This is the structure of Google Parti.

Comparison with Convolutional Neural Networks
Due to the commonly used (comparatively) large patch size, ViT performance depends more heavily than that of convolutional networks on choices such as the optimizer, dataset-specific hyperparameters, and network depth. Preprocessing with a layer of smaller, overlapping (stride < size) convolutional filters helps with performance and stability.

In a hybrid design, a CNN translates from the basic pixel level to a feature map. A tokenizer translates the feature map into a series of tokens that are then fed into the transformer, which applies the attention mechanism to produce a series of output tokens. Finally, a projector reconnects the output tokens to the feature map, which allows the analysis to exploit potentially significant pixel-level details. Working on tokens rather than pixels drastically reduces the number of tokens that need to be analyzed, reducing costs accordingly.

CNNs and Vision Transformers differ in many ways, most of which stem from their architectures.

In particular, CNNs achieve excellent results even when trained on data volumes that are not as large as those required by Vision Transformers.

This different behaviour seems to derive from the different inductive biases they possess. The filter-oriented architecture of CNNs helps these networks grasp the local particularities of the analysed images more quickly, but it also limits them, making it harder to capture global relations.

Vision Transformers, by contrast, possess a different kind of bias, toward exploring topological relationships between patches. This allows them to capture global and longer-range relations as well, but at the cost of training that is more demanding in terms of data.

Vision Transformers also proved to be much more robust to input image distortions such as adversarial patches or permutations.

However, committing to a single architecture is not always the best choice: excellent results have been obtained in several computer vision tasks through hybrid architectures that combine convolutional layers with Vision Transformers.

The Role of Self-Supervised Learning
The considerable need for data during the training phase has made it essential to find alternative methods to train these models, and a central role is now played by self-supervised methods. Using these approaches, it is possible to train a neural network in an almost autonomous way, allowing it to deduce the peculiarities of a specific problem without having to build a large dataset or provide it with accurately assigned labels. Being able to train a Vision Transformer without a huge vision dataset could be the key to the widespread dissemination of this promising new architecture.

Applications
Vision Transformers have been used in many computer vision tasks with excellent results, in some cases achieving the state of the art.

Among the most relevant areas of application are:

 * Image Classification
 * Object Detection
 * Video Deepfake Detection
 * Image Segmentation
 * Anomaly Detection
 * Image Synthesis
 * Cluster Analysis
 * Autonomous Driving

Vision Transformer-based algorithms such as DINO (self-distillation with no labels) also show promising properties on biological datasets such as images generated with the Cell Painting assay. DINO has been demonstrated to learn image representations which could be used to cluster images and explore morphological profiles in a feature space.