User:Fgpacini/Feature learning

Feature learning is typically either supervised, unsupervised or self-supervised.


 * In supervised feature learning, features are learned using labeled input data. Labeled data includes input-label pairs where the input is given to the model and it must produce the ground truth label as the correct answer. This can be leveraged to generate feature representations with the model which result in high label prediction accuracy. Examples include supervised neural networks, multilayer perceptron and (supervised) dictionary learning.
 * In unsupervised feature learning, features are learned with unlabeled input data by analyzing the relationship between points in the dataset. Examples include dictionary learning, independent component analysis, matrix factorization and various forms of clustering.
 * In self-supervised feature learning, features are learned using unlabeled data like unsupervised learning, however input-label pairs are constructed from each data point, which enables learning the structure of the data through supervised methods such as gradient descent. Classical examples include word embeddings and autoencoders. SSL has been applied more recently to many modalities through the use of deep neural network architectures such as CNNs and Transformers.

Self-supervised
Self-supervised representation learning is learning features by training on the structure of unlabeled data rather than relying on explicit labels for an information signal. This approach has enabled the combined use of deep neural networks and larger unlabeled datasets in order to produce deep feature representations. Training tasks typically fall under the classes of either contrastive, generative or both. Contrastive representation learning trains representations for associated data pairs, called positive samples, to be aligned, while pairs with no relation, called negative samples, are contrasted. A larger portion of negative samples is typically necessary in order to prevent catastrophic collapse, which is when all inputs are mapped to the same representation. Generative representation learning tasks the model with producing the correct data to either match a restricted input or reconstruct the full input from a lower dimensional representation.

A common setup for self-supervised representation learning of a certain data type (e.g. text, image, audio, video) is to pretrain the model using large datasets of general context, unlabeled data. Depending on the context, the result of this is either a set of representations for common data segments (e.g. words) which new data can be broken into, or a neural network able to convert new data into a set of lower dimensional features. In either case, the output representations can then be used as an initialization in many different problem settings where labeled data may be limited. Specialization of the model to specific tasks is typically done with supervised learning, either by fine-tuning the model / representations with the labels as the signal, or freezing the representations and training an additional model which takes them as an input.

Many self-supervised training schemes have been developed for use in representation learning of various modalities, often first showing successful application in text or image before being transferred to other data types.

Text
Word2vec is a word embedding technique which learns to represent words through self-supervision over each word and its neighboring words in a sliding window across a large corpus of text. The model has two possible training schemes to produce word vector representations, one generative and one contrastive. The first is word prediction given each of the neighboring words as an input. The second is training on the representation similarity for neighboring words and representation dissimilarity for random pairs of words. A limitation of word2vec is that only the pairwise co-occurrence structure of the data is used, and not the ordering or entire set of context words. More recent transformer-based representation learning approaches attempt to solve this with word prediction tasks. GPT pretrains on next word prediction using prior input words as context, whereas BERT masks random tokens in order to provide bidirectional context.

Other self-supervised techniques extend word embeddings by finding representations for larger text structures such as sentences or paragraphs in the input data. Doc2vec extends the generative training approach in word2vec by adding an additional input to the word prediction task based on the paragraph it is within, and is therefore intended to represent paragraph level context.

Graph
The goal of many graph representation learning techniques is to produce an embedded representation of each node based on the overall network topology. node2vec extends the word2vec training technique to nodes in a graph by using co-occurrence in random walks through the graph as the measure of association. Another approach is to maximize mutual information, a measure of similarity, between the representations of associated structures within the graph. An example is Deep Graph Infomax, which uses contrastive self-supervision based on mutual information between the representation of a “patch” around each node, and a summary representation of the entire graph. Negative samples are obtained by pairing the graph representation with either representations from another graph in a multigraph training setting, or corrupted patch representations in single graph training.

Image
The domain of image representation learning has employed many different self-supervised training techniques, including transformation, inpainting, patch discrimination and clustering.

Examples of generative approaches are Context Encoders, which trains an AlexNet CNN architecture to generate a removed image region given the masked image as input, and iGPT, which applies the GPT-2 language model architecture to images by training on pixel prediction after reducing the image resolution.

Many other self-supervised methods use siamese networks, which generate different views of the image through various augmentations that are then aligned to have similar representations. The challenge is avoiding collapsing solutions where the model encodes all images to the same representation. SimCLR is a contrastive approach which uses negative examples in order to generate image representations with a ResNet CNN. Bootstrap Your Own Latent (BYOL) removes the need for negative samples by encoding one of the views with a slow moving average of the model parameters as they are being modified during training.

Video
With analogous results in masked prediction and clustering, video representation learning approaches are often similar to image techniques but must utilize the temporal sequence of video frames as an additional learned structure. Examples include VCP, which masks video clips and trains to choose the correct one given a set of clip options, and Xu et al., who train a 3D-CNN to identify the original order given a shuffled set of video clips.

Audio
Self-supervised representation techniques have also been applied to many audio data formats, particularly for speech processing. Wav2vec 2.0 discretizes the audio waveform into timesteps via temporal convolutions, and then trains a transformer on masked prediction of random timesteps using a contrastive loss. This is similar to the BERT language model, except as in many SSL approaches to video, the model chooses among a set of options rather than over the entire word vocabulary.

Multimodal
Self-supervised learning has more recently been used to develop joint representations of multiple data types. Approaches usually rely on some natural or human-derived association between the modalities as an implicit label, for instance video clips of animals or objects with characteristic sounds, or captions written to describe images. CLIP produces a joint image-text representation space by training to align image and text encodings from a large dataset of image-caption pairs using a contrastive loss. MERLOT Reserve trains a transformer-based encoder to jointly represent audio, subtitles and video frames from a large dataset of videos through 3 joint pretraining tasks: contrastive masked prediction of either audio or text segments given the video frames and surrounding audio and text context, along with contrastive alignment of video frames with their corresponding captions.

Multimodal representation models are typically unable to assume direct correspondence of representations in the different modalities, since the precise alignment can often be noisy or ambiguous. For example, the text "dog" could be paired with many different pictures of dogs, and correspondingly a picture of a dog could be captioned with varying degrees of specificity. This limitation means that downstream tasks may require an additional generative mapping network between modalities to achieve optimal performance, such as in DALLE-2 for text to image generation.