Deep neural networks have shown outstanding performance on various machine learning tasks, especially supervised learning in computer vision (image classification [32, 54, 59], semantic segmentation [85, 45]), natural language processing (pre-trained language models [33, 74, 84, 149, 83], question answering [111, 150, 35, 5], etc.), and graph learning (node classification [70, 106, 138, 58], graph classification [155, 7, 123], etc.). Generally, a supervised model is trained on a specific task with a large manually labeled dataset that is randomly divided into training, validation, and test sets.
However, supervised learning is meeting its bottleneck. It not only relies heavily on expensive manual labeling but also suffers from generalization error, spurious correlations, and adversarial attacks. We expect the neural network to learn more with fewer labels, fewer samples, or fewer trials. As a promising candidate, self-supervised learning has drawn massive attention for its fantastic data efficiency and generalization ability, with many state-of-the-art models following this paradigm. In this survey, we will take a comprehensive look at the development of the recent self-supervised learning models and discuss their theoretical soundness, including frameworks such as Pre-trained Language Models (PTM), Generative Adversarial Networks (GAN), Autoencoder and its extensions, Deep Infomax, and Contrastive Coding.
The term “self-supervised learning” was first introduced in robotics, where the training data is automatically labeled by finding and exploiting the relations between different input sensor signals. It was then borrowed by the field of machine learning. In a speech at AAAI 2020, Yann LeCun described self-supervised learning as “the machine predicts any parts of its input for any observed part.” Following LeCun, we can summarize it in two classical definitions:
Obtain “labels” from the data itself by using a “semi-automatic” process.
Predict part of the data from other parts.
Specifically, the “other part” here could be incomplete, transformed, distorted, or corrupted. In other words, the machine learns to ’recover’ whole, or parts of, or merely some features of its original input.
Unsupervised learning and self-supervised learning are often confused. Self-supervised learning can be viewed as a branch of unsupervised learning since there is no manual label involved. However, narrowly speaking, unsupervised learning concentrates on detecting specific data patterns, such as clustering, community discovery, or anomaly detection, while self-supervised learning aims at recovering, which is still in the paradigm of supervised settings. Figure 1 illustrates the differences between them.
There exist several comprehensive reviews related to Pre-trained Language Models, Generative Adversarial Networks, Autoencoder, and contrastive learning for visual representation. However, none of them concentrates on the idea of self-supervised learning itself, which has inspired researchers and models in many fields. In this work, we collect studies from natural language processing, computer vision, and graph learning in recent years to present an up-to-date and comprehensive retrospective on the frontier of self-supervised learning. To sum up, our contributions are:
We provide a detailed and up-to-date review of self-supervised learning for representation. We introduce the background knowledge, models with variants, and important frameworks. One can easily grasp the frontier ideas of self-supervised learning.
We categorize self-supervised learning models into generative, contrastive, and generative-contrastive (adversarial), with particular genres within each one. We demonstrate the pros and cons of each category and discuss the recent shift from generative to contrastive.
We examine the theoretical soundness of self-supervised learning methods and show how it can benefit the downstream supervised learning tasks.
We identify several open problems in this field, analyze the limitations and boundaries, and discuss the future direction for self-supervised representation learning.
We organize the survey as follows. In Section 2, we introduce the preliminary knowledge for computer vision, natural language processing, and graph learning. From Section 3 to Section 5, we introduce the empirical self-supervised learning methods utilizing generative, contrastive and generative-contrastive objectives. In Section 6, we investigate the theoretical basis behind the success of self-supervised learning and its merits and drawbacks. In Section 7, we discuss the open problems and future directions in this field.
2.1 Representation Learning in NLP
Pre-trained word representations are key components in natural language processing tasks. Word embeddings represent words as low-dimensional real-valued vectors. There are two kinds of word embeddings: non-contextual and contextual embeddings.
Non-contextual embeddings do not consider the context information of the token; that is, these models only map the token into a distributed embedding space. Thus, for each word $x$ in the vocabulary $\mathcal{V}$, the embedding assigns it a specific vector $e_x \in \mathbb{R}^{d}$, where $d$ is the dimension of the embedding. These embeddings cannot model complex characteristics of word use and polysemy.
To model both complex characteristics of word use and polysemy, contextual embedding is proposed. For a text sequence $x_1, x_2, \ldots, x_N$, $x_t \in \mathcal{V}$, the contextual embedding of $x_t$ depends on the whole sequence:

$$[e_1, e_2, \ldots, e_N] = f(x_1, x_2, \ldots, x_N),$$

where $f(\cdot)$ is the embedding function. Since for a certain token $x_i$ the embedding $e_i$ can be different when $x_i$ appears in a different context, $e_i$ is called a contextual embedding. This kind of embedding can distinguish the semantics of words in different contexts.
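The difference between the two kinds of embedding can be illustrated with a toy sketch. The vocabulary, the 2-dimensional vectors, and the mixing rule in `contextual_embed` are all hypothetical choices made for illustration; real contextual encoders such as ELMo or BERT learn a far richer function.

```python
# Toy static (non-contextual) embedding table; values are made up.
STATIC = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}


def static_embed(tokens):
    """Non-contextual: every occurrence of a token gets the same vector."""
    return [STATIC[t] for t in tokens]


def contextual_embed(tokens):
    """Toy contextual function: mix each token vector with the sequence mean."""
    vecs = static_embed(tokens)
    dim = len(vecs[0])
    mean = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
    return [[0.5 * v[d] + 0.5 * mean[d] for d in range(dim)] for v in vecs]


bank_near_river = contextual_embed(["river", "bank"])[1]
bank_near_money = contextual_embed(["money", "bank"])[1]
```

Here the static vector of "bank" is identical in both sentences, while its contextual vector shifts toward whichever neighbor it appears with.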
Distributed word representation represents each word as a dense, real-valued, low-dimensional vector. First-generation word embeddings were introduced with the neural network language model (NNLM). For NNLM, most of the complexity is caused by the non-linear hidden layer in the model. Mikolov et al. proposed the Word2Vec model to learn word representations efficiently. It comes in two variants: the Continuous Bag-of-Words model (CBOW) and the Continuous Skip-gram model (SG). Word2Vec is a context prediction model, and it is one of the most popular implementations for generating non-contextual word embeddings in NLP. In first-generation word embeddings, the same word always has the same embedding. Since a word can have multiple senses, second-generation word embedding methods were proposed, in which each word token has its own embedding. These embeddings are also called contextualized word embeddings, since the embedding of a word token depends on its context. ELMo (Embeddings from Language Model) is one implementation that generates such contextual word embeddings. ELMo is an RNN-based bidirectional language model; it learns multiple embeddings for each word token, and how these embeddings are combined depends on the downstream task. ELMo is a feature-based approach: the model is used as a feature extractor whose word embeddings are sent to the downstream task model. The parameters of the extractor are fixed, and only the parameters of the backend model are trained.
Recently, BERT (Bidirectional Encoder Representations from Transformers) brought large improvements on 11 NLP tasks. Unlike feature-based approaches such as ELMo, BERT is a fine-tuning approach. The model is first pre-trained on large corpora through self-supervised learning and then fine-tuned with labeled data. As the name suggests, BERT is the encoder of the Transformer. In the training stage, BERT masks some tokens in the sentence and is trained to predict the masked words. To use BERT, one initializes the model with the pre-trained weights and fine-tunes it to solve downstream tasks.
2.2 Representation Learning in CV
Computer vision is one of the fields that has benefited most from deep learning. In the past few years, researchers have developed a range of efficient network architectures for supervised tasks, and many of them have also proved useful for self-supervised tasks. In this section, we introduce the ResNet architecture, which is the backbone of a large part of the self-supervised techniques for visual representation models.
Since AlexNet, CNN architectures have been getting deeper and deeper. While AlexNet had only five convolutional layers, the VGG network and GoogleNet (also codenamed Inception_v1) had 19 and 22 layers respectively.
Evidence [127, 120] reveals that network depth is of crucial importance. Driven by the significance of depth, a question arises: is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients, which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [79, 116] and intermediate normalization layers [77].
Once deeper networks are able to start converging, a degradation problem is exposed: as the network depth increases, accuracy becomes saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error.
Residual neural network (ResNet), proposed by He et al., effectively resolved this problem. Instead of asking every few stacked layers to directly learn a desired underlying mapping, the authors design a residual mapping architecture. The core idea of ResNet is the introduction of “shortcut connections” (Fig. 2), which skip over one or more layers.
A building block is defined as:

$$y = \mathcal{F}(x, \{W_i\}) + x.$$

Here $x$ and $y$ are the input and output vectors of the layers considered. The function $\mathcal{F}(x, \{W_i\})$ represents the residual mapping to be learned. For the example in Fig. 2 that has two layers, $\mathcal{F} = W_2 \sigma(W_1 x)$, in which $\sigma$ denotes ReLU and the biases are omitted for simplifying notations. The operation $\mathcal{F} + x$ is performed by a shortcut connection and element-wise addition.
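A two-layer residual block of this form can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions (fully connected layers on plain lists, no biases, no normalization, no activation after the addition); the helper names are hypothetical.

```python
def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in v]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def residual_block(x, W1, W2):
    """Return F(x) + x with F(x) = W2 * relu(W1 * x); biases omitted."""
    f = matvec(W2, relu(matvec(W1, x)))
    # Shortcut connection: element-wise addition of the unchanged input.
    return [fi + xi for fi, xi in zip(f, x)]


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
y_identity_weights = residual_block([1.0, -2.0], identity, identity)
y_zero_weights = residual_block([1.0, -2.0], zeros, zeros)
```

With all-zero weights the block reduces to the identity mapping, which is why adding such blocks should not make a network harder to fit than its shallower counterpart.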
Because of its compelling results, ResNet quickly became one of the most popular architectures in various computer vision tasks. Since then, the ResNet architecture has drawn extensive attention from researchers, and multiple variants based on ResNet have been proposed, including ResNeXt, Densely Connected CNN, and wide residual networks.
2.3 Representation Learning on Graphs
As a ubiquitous data structure, graphs are extensively employed in a multitude of fields and have become the backbone of many systems. The central problem in machine learning on graphs is to find a way to represent graph structure so that it can be easily utilized by machine learning models. To tackle this problem, researchers propose a series of methods for graph representation learning at node level and graph level, which have become a research spotlight recently.
We first define several basic terminologies. Generally, a graph is defined as $G = (V, E, X)$, where $V$ denotes a set of vertices, $|V|$ denotes the number of vertices in the graph, and $E \subseteq (V \times V)$ denotes a set of edges connecting the vertices. $X \in \mathbb{R}^{|V| \times d}$ is the optional original vertex feature matrix. When input features are unavailable, $X$ is set as an orthogonal matrix or initialized with a normal distribution, etc., in order to make the input node features less correlated. The problem of node representation learning is to learn latent node representations, which is also termed network representation learning, network embedding, etc. There are also graph-level representation learning problems, which aim to learn an embedding for the whole graph.
In general, existing network embedding approaches are broadly categorized as (1) factorization-based approaches such as NetMF [105, 104], GraRep, HOPE, (2) shallow embedding approaches such as DeepWalk, LINE, HARP, and (3) neural network approaches [82, 19]. Recently, graph convolutional network (GCN) and its multiple variants have become the dominant approaches in graph modeling, thanks to the utilization of graph convolution that effectively fuses graph topology and node features.
However, the majority of advanced graph representation learning methods require external guidance like annotated labels. Many researchers endeavor to propose unsupervised algorithms [52, 48, 139], which do not rely on any external labels. Self-supervised learning also opens up an opportunity for effective utilization of the abundant unlabeled data [125, 100].
| NLP | G | AR | Following words | Next word prediction | - | - | - |
| PixelCNN [134, 136] | CV | G | AR | Following pixels | Next pixel prediction | - | - | - |
| Whole image | Image reconstruction | - | - | - |
| word2vec [89, 88] | NLP | G | AE | Context words | CBOW & SkipGram | End-to-end |
| Graph | G | AE | Graph edges | Link prediction | End-to-end |
| SpanBERT | NLP | G | AE | Masked words | Masked language model | - | - | - |
| VQ-VAE 2 | CV | G | AE | Whole image | Image reconstruction | - | - | - |
| XLNet | NLP | G | AE+AR | Masked words | Permutation language model | - | - | - |
| Relative position prediction | - | - | - |
| PIRL | CV | C | - | Jigsaw | ✓ | Memory bank |
| RotNet | CV | C | - | Rotation Prediction | - | - | - |
| Deep InfoMax | CV | C | - |
| Pre-trained GNN | Graph | C | - |
| Local Aggregation | CV | C | - | - | - | - |
| Instance discrimination | Memory bank |
| MoCo v2 | CV | C | - | ✓ | Momentum |
| GAN | CV | G-C | AE | Whole image | Image reconstruction | - | - | - |
| Adversarial AE | CV | G-C | AE | - | - | - |
| BiGAN/ALI [39, 42] | CV | G-C | AE | - | - | - |
| Colorization | CV | G-C | AE | Image color | Colorization | - | - | - |
| Inpainting | CV | G-C | AE | Parts of images | Inpainting | - | - | - |
| [80] | CV | G-C | AE | Details of images | Super-resolution | - | - | - |
| ELECTRA | NLP | G-C | AE | Masked words | Replaced token detection | ✓ | End-to-end |
| WKLM | NLP | G-C | AE | Masked entities | Replaced entity detection | ✓ | End-to-end |
| ANE | Graph | G-C | AE | Graph edges | Link prediction | - | - | - |
| GraphSGAN | Graph | G-C | AE | Graph nodes | Node classification | - | - | - |
3 Generative Self-supervised Learning
3.1 Auto-regressive (AR) Model
Auto-regressive (AR) models can be viewed as a “Bayes net structure” (directed graph model). The joint distribution can be factorized as a product of conditionals

$$p_{\theta}(\mathbf{x}) = \prod_{t=1}^{T} p_{\theta}(x_t \mid \mathbf{x}_{1:t-1}),$$

where the probability of each variable is dependent on the previous variables.
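The factorization into conditionals can be made concrete with a toy sketch. The conditional table `COND` below is hypothetical, and for brevity each conditional looks only at the previous token rather than the full history; the point is only that the sequence log-likelihood is a sum of log-conditionals.

```python
import math

# Hypothetical conditional table P(next | previous); "<s>" marks the start.
COND = {
    "<s>": {"a": 0.5, "b": 0.5},
    "a": {"a": 0.25, "b": 0.75},
    "b": {"a": 0.5, "b": 0.5},
}


def sequence_log_prob(tokens):
    """Sum of log p(x_t | history), with the history truncated to one token."""
    prev, total = "<s>", 0.0
    for tok in tokens:
        total += math.log(COND[prev][tok])
        prev = tok
    return total


log_p_ab = sequence_log_prob(["a", "b"])
```

Maximizing this quantity over a corpus is the usual auto-regressive training objective.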
In NLP, the objective of auto-regressive language modeling is usually maximizing the likelihood under the forward autoregressive factorization. GPT and GPT-2 use the Transformer decoder architecture for language modeling. Different from GPT, GPT-2 removes the fine-tuning processes of different tasks. In order to learn unified representations that generalize across different tasks, GPT-2 models $p(\text{output} \mid \text{input}, \text{task})$, which means given different tasks, the same inputs can have different outputs.
The auto-regressive models have also been employed in computer vision, such as PixelRNN and PixelCNN. The general idea is to use auto-regressive methods to model images pixel by pixel. For example, the lower (right) pixels are generated by conditioning on the upper (left) pixels. The pixel distributions of PixelRNN and PixelCNN are modeled by RNN and CNN, respectively. For 2D images, auto-regressive models can only factorize probabilities according to specific directions (such as right and down). Therefore, masked filters are employed in the CNN architecture.
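A masked filter of this kind can be sketched as a binary matrix that zeroes out every kernel position at or after the current pixel in raster order. The function name and the `include_center` flag are illustrative; the flag mirrors the distinction usually described for PixelCNN between the first-layer mask, which hides the center pixel, and later-layer masks, which keep it.

```python
def causal_mask(k, include_center=False):
    """Return a k x k mask with 1 for positions above the center row,
    or to the left of the center within the center row."""
    c = k // 2
    mask = [[0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            if i < c or (i == c and j < c):
                mask[i][j] = 1
    if include_center:
        mask[c][c] = 1
    return mask


first_layer_mask = causal_mask(3)
later_layer_mask = causal_mask(3, include_center=True)
```

Multiplying a convolution kernel element-wise by such a mask ensures that the prediction for a pixel never depends on pixels that come later in the generation order.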
Furthermore, two convolutional networks are combined to remove the blind spot in images. Based on PixelCNN, WaveNet, a generative model for raw audio, was proposed. In order to deal with long-range temporal dependencies, dilated causal convolutions are developed to enlarge the receptive field. Moreover, gated residual blocks and skip connections are employed for better expressivity.
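The effect of dilation on a causal convolution can be sketched as follows. This is a minimal one-channel illustration with zero padding on the left; the function names and the doubling dilation schedule in the usage line are assumptions for the example rather than a description of any specific model configuration.

```python
def dilated_causal_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation]; inputs before t = 0 count as 0."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # causal: only the present and the past are visible
                acc += w * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size, dilations):
    """Number of input steps visible to one output after stacking the layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


y = dilated_causal_conv([1, 2, 3, 4], [1, 1], dilation=2)
rf = receptive_field(2, [1, 2, 4, 8])
```

With kernel size 2, four layers with dilations 1, 2, 4, 8 already cover 16 time steps, whereas four undilated layers would cover only 5.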
The auto-regressive models can also be applied to the graph domain, such as for the graph generation problem. You et al. propose GraphRNN to generate realistic graphs with deep auto-regressive models. They decompose the graph generation process into a sequence generation of nodes and edges, conditioned on the graph generated so far. The objective of GraphRNN is defined as the likelihood of the observed graph generation sequences. GraphRNN can be viewed as a hierarchical model, where a graph-level RNN maintains the state of the graph and generates new nodes, while an edge-level RNN generates new edges based on the current graph state. After that, MRNN and GCPN are proposed as auto-regressive approaches. MRNN and GCPN both use a reinforcement learning framework to generate molecule graphs through optimizing domain-specific rewards. However, MRNN mainly uses RNN-based networks for state representations, while GCPN employs GCN-based encoder networks.
The advantage of auto-regressive models is that they can model the context dependency well. However, one shortcoming of the AR model is that the token at each position can only access its context from one direction.
3.2 Flow-based Model
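The goal of flow-based models is to estimate complex high-dimensional densities $p(x)$ from data. However, directly formalizing the densities is difficult. Generally, flow-based models first define a latent variable $z$ which follows a known distribution $p_Z(z)$. Then define $z = f_{\theta}(x)$, where $f_{\theta}$ is an invertible and differentiable function. The goal is to learn the transformation between $x$ and $z$ so that the density of $x$ can be depicted. According to the integral rule, $p_{\theta}(x)\,dx = p_Z(z)\,dz$. Therefore, the densities of $x$ and $z$ satisfy:

$$p_{\theta}(x) = p_Z\big(f_{\theta}(x)\big)\left|\det\frac{\partial f_{\theta}(x)}{\partial x}\right|,$$

and the objective is maximum likelihood:

$$\max_{\theta} \sum_{i} \log p_{\theta}\big(x^{(i)}\big) = \max_{\theta} \sum_{i} \left[\log p_Z\big(f_{\theta}(x^{(i)})\big) + \log\left|\det\frac{\partial f_{\theta}}{\partial x}\big(x^{(i)}\big)\right|\right].$$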
The advantage of flow-based models is that the mapping between $x$ and $z$ is invertible. However, it also requires that $x$ and $z$ have the same dimension. $f_{\theta}$ needs to be carefully designed, since it should be invertible and the Jacobian determinant in Eq. (3) should also be easy to calculate. NICE and RealNVP design affine coupling layers to parameterize $f_{\theta}$. The core idea is to split $x$ into two blocks $(x_1, x_2)$ and apply a transformation from $(x_1, x_2)$ to $(z_1, z_2)$ in an auto-regressive manner, that is $z_1 = x_1$ and $z_2 = x_2 + m(x_1)$. More recently, Glow was proposed; it introduces invertible 1×1 convolutions and simplifies RealNVP.
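A coupling layer can be sketched to show why both the inverse and the Jacobian log-determinant are cheap. The sketch below uses the affine form with an element-wise scale and shift; `scale_fn` and `shift_fn` are hypothetical stand-ins for learned networks, and setting the scale to zero recovers the purely additive form.

```python
import math


def scale_fn(x1):
    """Hypothetical stand-in for a learned log-scale network s(x1)."""
    return [0.5 * v for v in x1]


def shift_fn(x1):
    """Hypothetical stand-in for a learned shift network t(x1)."""
    return [v + 1.0 for v in x1]


def coupling_forward(x1, x2):
    """z1 = x1, z2 = x2 * exp(s(x1)) + t(x1); also returns log|det J|."""
    s, t = scale_fn(x1), shift_fn(x1)
    z2 = [b * math.exp(si) + ti for b, si, ti in zip(x2, s, t)]
    # The Jacobian is triangular, so its log-determinant is just sum(s).
    return x1, z2, sum(s)


def coupling_inverse(z1, z2):
    """Exact inverse; s and t are re-evaluated on z1 = x1, never inverted."""
    s, t = scale_fn(z1), shift_fn(z1)
    x2 = [(b - ti) * math.exp(-si) for b, si, ti in zip(z2, s, t)]
    return z1, x2


z1, z2, log_det = coupling_forward([0.2, -0.4], [1.5, -0.3])
x1_rec, x2_rec = coupling_inverse(z1, z2)
```

Because the first block passes through unchanged, the scale and shift functions can be arbitrarily complex without affecting invertibility.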
3.3 Auto-encoding (AE) Model
The goal of the auto-encoding model is to reconstruct (part of) inputs from (corrupted) inputs.
3.3.1 Basic AE Model
Autoencoder (AE) was first introduced for pre-training in ANNs. Before the autoencoder, the Restricted Boltzmann Machine (RBM) can also be viewed as a special “autoencoder”. RBM is an undirected graphical model, and it only contains two layers: a visible layer and a hidden layer. The objective of RBM is to minimize the difference between the marginal distribution of the model and the data distribution. In contrast, the autoencoder can be regarded as a directed graphical model, and it can be trained more easily. The autoencoder is typically used for dimensionality reduction. Generally, an autoencoder is a feedforward neural network trained to produce its input at the output layer. AE is comprised of an encoder network $h = f_{enc}(x)$ and a decoder network $x' = f_{dec}(h)$. The objective of AE is to make $x$ and $x'$ as similar as possible (such as through mean-square error). It can be shown that a linear autoencoder corresponds to PCA. Sometimes the number of hidden units is greater than the number of input units, and some interesting structures can be discovered by imposing sparsity constraints on the hidden units.
3.3.2 Context Prediction Model (CPM)
The idea of the Context Prediction Model (CPM) is to predict contextual information based on inputs.
In NLP, when it comes to SSL on word embedding, CBOW and Skip-Gram are pioneering works. CBOW aims to predict the input tokens based on context tokens. In contrast, Skip-Gram aims to predict context tokens based on input tokens. Usually, negative sampling is employed to ensure computational efficiency and scalability. Following the CBOW architecture, FastText is proposed by utilizing subword information.
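The role of negative sampling can be sketched with the per-pair loss commonly used for Skip-Gram: the model is rewarded for a high score on the true (input, context) pair and a low score on a few sampled negatives, instead of normalizing over the whole vocabulary. The vectors below are made-up toy values and the function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgns_loss(center, context, negatives):
    """-log sigma(c . o) - sum_k log sigma(-c . n_k) for one training pair."""
    loss = -math.log(sigmoid(dot(center, context)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(center, neg)))
    return loss


aligned = sgns_loss([1.0, 0.0], [2.0, 0.0], [[-2.0, 0.0]])
misaligned = sgns_loss([1.0, 0.0], [-2.0, 0.0], [[2.0, 0.0]])
```

The cost of one update grows with the number of negatives rather than with the vocabulary size, which is where the efficiency comes from.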
Inspired by the progress of word embedding models in NLP, many network embedding models are proposed based on a similar context prediction objective. DeepWalk samples truncated random walks to learn latent node embedding based on the Skip-Gram model. It treats random walks as the equivalent of sentences. In contrast, LINE aims to generate neighbors based on current nodes:

$$O = -\sum_{(i,j) \in E} w_{ij} \log p(v_j \mid v_i),$$

where $E$ denotes the edge set, $v$ denotes a node, and $w_{ij}$ represents the weight of edge $(v_i, v_j)$. LINE also uses negative sampling to sample multiple negative edges to approximate the objective.
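The idea of treating random walks as sentences can be sketched as follows. The adjacency-list format, the function names, and the window size are assumptions made for illustration; the resulting (center, context) pairs are what a Skip-Gram style model would then be trained on.

```python
import random


def random_walk(adj, start, length, rng):
    """Truncated random walk: repeatedly hop to a uniformly chosen neighbor."""
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk


def skipgram_pairs(walk, window):
    """(center, context) pairs for nodes at most `window` steps apart in the walk."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs


adj = {0: [1], 1: [0, 2], 2: [1]}  # a path graph 0 - 1 - 2
walk = random_walk(adj, start=0, length=5, rng=random.Random(0))
pairs = skipgram_pairs([0, 1, 2], window=1)
```

Nodes that co-occur in many walks end up with similar embeddings, just as words that share contexts do.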
3.3.3 Denoising AE Model
The idea of denoising autoencoder models is that representation should be robust to the introduction of noise. The masked language model (MLM) can be regarded as a denoising AE model because the tokens to be predicted are masked in its input. To model a text sequence, MLM randomly masks some of the tokens from the input and then predicts them based on their context information, which is similar to the Cloze task. Specifically, in BERT, a unique token [MASK] is introduced in the training process to mask some tokens. However, one shortcoming of this method is that there are no input [MASK] tokens for downstream tasks. To mitigate this, the authors do not always replace the predicted tokens with [MASK] in training. Instead, they replace them with original words or random words with a small probability.
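This corruption scheme can be sketched in a few lines. The proportions used below (select 15% of tokens; of those, 80% become [MASK], 10% a random token, 10% stay unchanged) are the ones reported for BERT; the function name, the tokenization, and the tiny vocabulary are hypothetical.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, labels); a label is None where nothing is predicted."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            corrupted.append(tok)  # not selected: left as-is, no loss
            labels.append(None)
            continue
        labels.append(tok)  # selected: the model must recover the original
        r = rng.random()
        if r < 0.8:
            corrupted.append("[MASK]")
        elif r < 0.9:
            corrupted.append(rng.choice(vocab))  # random replacement
        else:
            corrupted.append(tok)  # kept unchanged but still predicted
    return corrupted, labels


vocab = ["the", "cat", "sat", "down"]
tokens = vocab * 50
corrupted, labels = mask_tokens(tokens, vocab, random.Random(0))
```

Because some selected tokens are left unchanged or replaced at random, the model cannot rely on seeing [MASK] to know which positions matter.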
Several extensions of MLM have emerged. SpanBERT chooses to mask continuous random spans rather than the random tokens adopted by BERT. Moreover, it trains the span boundary representations to predict the masked spans, which is inspired by ideas in coreference resolution. ERNIE (Baidu) masks entities or phrases to learn entity-level and phrase-level knowledge, which obtains good results in Chinese natural language processing tasks.
Compared with the AR model, in MLM, the predicted tokens have access to contextual information from both sides. However, MLM assumes that the predicted tokens are independent of each other if the unmasked tokens are given.
3.3.4 Variational AE Model
The variational auto-encoding model assumes that data are generated from an underlying latent (unobserved) representation. The posterior distribution over a set of unobserved variables $z$ given some data $x$ is approximated by a variational distribution, $q(z \mid x) \approx p(z \mid x)$.
In variational inference, the evidence lower bound (ELBO) on the log-likelihood of data is maximized during training:

$$\log p(x) \geq -D_{KL}\big(q(z \mid x) \,\|\, p(z)\big) + \mathbb{E}_{z \sim q(z \mid x)}\big[\log p(x \mid z)\big],$$

where $p(x)$ is the evidence probability, $p(z)$ is the prior and $p(x \mid z)$ is the likelihood probability. The right-hand side of the above equation is called the ELBO. From the auto-encoding perspective, the first term of the ELBO is a regularizer forcing the posterior to approximate the prior. The second term is the likelihood of reconstructing the original input data based on latent variables.
Variational Autoencoders (VAE) is one important example where variational inference is utilized. VAE assumes the prior $p(z)$ and the approximate posterior $q(z \mid x)$ both follow Gaussian distributions. Specifically, let $p(z) \sim \mathcal{N}(0, 1)$. Furthermore, the reparameterization trick is utilized for modeling the approximate posterior $q(z \mid x)$. Assume $z \sim \mathcal{N}(\mu, \sigma^2)$, where $z = \mu + \sigma\epsilon$ and $\epsilon \sim \mathcal{N}(0, 1)$. Both $\mu$ and $\sigma$ are parameterized by neural networks. Based on the calculated latent variable $z$, the decoder network $p(x \mid z)$ is utilized to reconstruct the input data.
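The reparameterization trick and the regularizer term can be sketched directly. The log-variance parameterization and the function names are assumptions for illustration; the closed-form KL divergence holds for a diagonal Gaussian posterior against a standard normal prior.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, 1); randomness is isolated in eps."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


z = reparameterize([1.0, -2.0], [0.0, 0.0], random.Random(0))
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
kl_shifted = kl_to_standard_normal([1.0], [0.0])
```

Since the sample is a deterministic function of the mean and standard deviation given the noise, gradients can flow through it to the encoder parameters.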
Recently, a novel and powerful variational AE model called VQ-VAE was proposed. VQ-VAE aims to learn discrete latent variables, which is motivated by the fact that many modalities are inherently discrete, such as language, speech, and images. VQ-VAE relies on vector quantization (VQ) to learn the posterior distribution of discrete latent variables. Specifically, the discrete latent variables are calculated by the nearest neighbor lookup using a shared, learnable embedding table. In training, the gradients are approximated through the straight-through estimator as

$$\mathcal{L}(x, D(e)) = \|x - D(e)\|_2^2 + \|sg[E(x)] - e\|_2^2 + \beta\,\|sg[e] - E(x)\|_2^2,$$

where $E$ and $D$ denote the encoder and decoder, $e$ refers to the codebook, the operator $sg$ refers to a stop-gradient operation that blocks gradients from flowing into its argument, and $\beta$ is a hyperparameter which controls the reluctance to change the code corresponding to the encoder output.
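The nearest-neighbor lookup and the three loss terms can be sketched on plain lists. This is a value-only illustration: the stop-gradient operator changes which parameters receive gradients, not the numbers computed here. The function names, the toy codebook, and the default `beta=0.25` are assumptions for the example.

```python
def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(z_e, codebook):
    """Nearest-neighbor lookup: index and vector of the closest codebook entry."""
    idx = min(range(len(codebook)), key=lambda k: sq_dist(z_e, codebook[k]))
    return idx, codebook[idx]


def vq_loss_terms(x, x_rec, z_e, e, beta=0.25):
    """Values of the reconstruction, codebook, and commitment terms."""
    recon = sq_dist(x, x_rec)
    codebook_term = sq_dist(z_e, e)      # in training, gradient reaches e only
    commitment = beta * sq_dist(z_e, e)  # in training, gradient reaches z_e only
    return recon, codebook_term, commitment


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
idx, e = quantize([0.9, 0.1], codebook)
recon, codebook_term, commitment = vq_loss_terms([1.0, 1.0], [1.0, 0.0], [0.9, 0.1], e)
```

The codebook term pulls the selected code toward the encoder output, while the commitment term keeps the encoder output from drifting away from its code.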
Its successor, VQ-VAE-2, scales this approach to large images on ImageNet. First, the authors enlarge the scale and enhance the autoregressive priors by a powerful PixelCNN prior. Additionally, they adopt a multi-scale hierarchical organization of VQ-VAE, which enables learning local information and global information of images separately. Nowadays, VAE and its variants have been widely used in the computer vision area, such as image representation learning, image generation, and video generation.
Variational auto-encoding models have also been employed in node representation learning on graphs. For example, Variational graph auto-encoder (VGAE) uses the same variational inference technique as VAE with graph convolutional networks (GCN) as the encoder. Due to the uniqueness of graph-structured data, the objective of VGAE is to reconstruct the adjacency matrix of the graph by measuring node proximity. Zhu et al. propose DVNE, a deep variational network embedding model in Wasserstein space. It learns Gaussian node embedding to model the uncertainty of nodes. The 2-Wasserstein distance is used to measure the similarity between the distributions for its effectiveness in preserving network transitivity. vGraph can perform node representation learning and community detection collaboratively through a generative variational inference framework. It assumes that each node can be generated from a mixture of communities, and each community is defined as a multinomial distribution over nodes.
3.4 Hybrid Generative Models
3.4.1 Combining AR and AE Model.
Some works propose models to combine the advantages of both AR and AE. MADE makes a simple modification to the autoencoder. It masks the autoencoder’s parameters to respect auto-regressive constraints. Specifically, for the original autoencoder, neurons between two adjacent layers are fully-connected through MLPs. However, in MADE, some connections between adjacent layers are masked to ensure that each input dimension is reconstructed solely from the dimensions preceding it. MADE can be easily parallelized on conditional computations, and it can get direct and cheap estimates of high-dimensional joint probabilities by combining AE and AR models.
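The masking rule can be sketched for a single hidden layer by giving every hidden unit a "degree" and only allowing connections that respect the ordering of input dimensions. The function names and the specific degree assignment are illustrative; `dependencies` simply traces which inputs each output can reach through the masked connections.

```python
def made_masks(n_inputs, hidden_degrees):
    """Masks for one hidden layer; each degree must lie in 1..n_inputs-1."""
    in_deg = range(1, n_inputs + 1)
    # Hidden unit with degree m may see input d only if m >= d.
    m_in = [[1 if mk >= d else 0 for d in in_deg] for mk in hidden_degrees]
    # Output d may see a hidden unit with degree m only if d > m.
    m_out = [[1 if d > mk else 0 for mk in hidden_degrees] for d in in_deg]
    return m_in, m_out


def dependencies(m_in, m_out):
    """dep[d][j] = 1 if output d is connected to input j via some hidden unit."""
    n_inputs = len(m_in[0])
    return [[int(any(o and h[j] for o, h in zip(row, m_in)))
             for j in range(n_inputs)] for row in m_out]


m_in, m_out = made_masks(3, hidden_degrees=[1, 2, 1, 2])
dep = dependencies(m_in, m_out)
```

The resulting dependency matrix is strictly lower triangular, so each output can be read as a conditional on the preceding inputs only.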
In NLP, the Permutation Language Model (PLM) is a representative model that combines the advantages of the auto-regressive model and the auto-encoding model. XLNet introduces PLM, and it is a generalized auto-regressive pretraining method. XLNet enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order. To formalize the idea, let $\mathcal{Z}_T$ denote the set of all possible permutations of the length-$T$ index sequence $[1, 2, \ldots, T]$; the objective of PLM can be expressed as follows:

$$\max_{\theta} \; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}\left[\sum_{t=1}^{T} \log p_{\theta}\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right)\right]$$
In practice, different factorization orders are sampled for each text sequence. Therefore, each token can see its contextual information from both sides. Based on the permuted order, XLNet also reparameterizes the prediction with position information, so that the model knows which position it needs to predict, and a special two-stream self-attention is introduced for target-aware prediction.
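How a sampled factorization order translates into what each position may condition on can be sketched as a visibility mask. The function name is hypothetical and positions are 0-indexed here; the tokens themselves stay in their original places, only the conditioning pattern changes.

```python
def permutation_mask(order):
    """mask[i][j] = 1 if position i may condition on position j under `order`,
    i.e. if j comes earlier than i in the factorization order."""
    rank = {pos: t for t, pos in enumerate(order)}
    n = len(order)
    return [[1 if rank[j] < rank[i] else 0 for j in range(n)] for i in range(n)]


left_to_right = permutation_mask([0, 1, 2])
permuted = permutation_mask([2, 0, 1])
```

Under the order `[2, 0, 1]`, position 0 is predicted from position 2, which lies to its right, so averaging over sampled orders exposes every token to context from both sides.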
3.4.2 Combining AR and Flow-based Models
In the graph domain, GraphAF is a flow-based auto-regressive model for molecule graph generation. It can generate molecules in an iterative process, and also calculate the exact likelihood in parallel. GraphAF formalizes the molecule generation as a sequential decision process. It incorporates detailed domain knowledge into the reward design, such as a valency check. Inspired by the recent progress of flow-based models, it defines an invertible transformation from a base distribution (e.g., multivariate Gaussian) to a molecular graph structure. Additionally, a dequantization technique is utilized to convert discrete data (including node types and edge types) into continuous data.
4 Contrastive Self-supervised Learning
From a statistical perspective, machine learning models are categorized into two classes: generative models and discriminative models. Given the joint distribution $P(X, Y)$ of the input $X$ and target $Y$, the generative model calculates $p(X \mid Y = y)$ by:

$$p(X \mid Y = y) = \frac{p(X, Y = y)}{p(Y = y)},$$

while the discriminative model tries to model $P(Y \mid X = x)$ by:

$$p(Y \mid X = x) = \frac{p(X = x, Y)}{p(X = x)}.$$

Notice that most representation learning tasks hope to model relationships between $x$. Thus, for a long time, people believed that the generative model was the only choice for representation learning.
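However, recent breakthroughs in contrastive learning, such as Deep InfoMax, MoCo and SimCLR, shed light on the potential of discriminative models for representation. Contrastive learning aims to “learn to compare” through a Noise Contrastive Estimation (NCE) objective formatted as:

$$\mathcal{L} = \mathbb{E}_{x, x^{+}, x^{-}}\left[-\log\frac{e^{f(x)^{T} f(x^{+})}}{e^{f(x)^{T} f(x^{+})} + e^{f(x)^{T} f(x^{-})}}\right],$$

where $x^{+}$ is similar to $x$, $x^{-}$ is dissimilar to $x$ and $f$ is an encoder (representation function). The similarity measure and encoder may vary from task to task, but the framework remains the same. With more dissimilar pairs involved, we have the InfoNCE formulated as:

$$\mathcal{L} = \mathbb{E}_{x, x^{+}, x^{k}}\left[-\log\frac{e^{f(x)^{T} f(x^{+})}}{e^{f(x)^{T} f(x^{+})} + \sum_{k=1}^{K} e^{f(x)^{T} f(x^{k})}}\right]$$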
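For a single anchor, the InfoNCE loss can be sketched as a softmax cross-entropy in which the positive pair plays the role of the correct class. The sketch assumes the encoder outputs are already given as plain vectors and uses the dot product as the similarity; the function names are illustrative. With exactly one negative it reduces to the two-term NCE form.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchor, positive, negatives):
    """-log( exp(a.p) / (exp(a.p) + sum_k exp(a.n_k)) ), via a stable log-sum-exp."""
    logits = [dot(anchor, positive)] + [dot(anchor, n) for n in negatives]
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]


loss_close = info_nce([1.0, 0.0], [2.0, 0.0], [[0.0, 1.0]])
loss_far = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
```

The loss shrinks as the anchor becomes more similar to its positive than to any negative, which is the "learn to compare" behavior described above.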
Here we divide recent contrastive learning frameworks into two types: context-instance contrast and instance-instance contrast. Both of them achieve strong performance in downstream tasks, especially on classification problems under the linear evaluation protocol.
4.1 Context-Instance Contrast
The context-instance contrast, or so-called global-local contrast, focuses on modeling the belonging relationship between the local feature of a sample and its global context representation. When we learn the representation for a local feature, we hope it is associated with the representation of the global content, such as stripes to tigers, sentences to their paragraph, and nodes to their neighbors.
There are two main types of Context-Instance Contrast: Predict Relative Position (PRP) and Maximize Mutual Information (MI). The differences between them are:
PRP focuses on learning relative positions between local components. The global context serves as an implicit requirement for predicting these relations (such as understanding what an elephant looks like is critical for predicting relative position between its head and tail).
MI focuses on learning the explicit belonging relationships between local parts and global context. The relative positions between local parts are ignored.
4.1.1 Predict Relative Position
Many data contain rich spatial or sequential relations between their parts. For example, in image data such as Fig. 6, the elephant’s head is on the right of its tail. In text data, a sentence like “Nice to meet you.” would probably come before “Nice to meet you, too.” Various models regard recognizing relative positions between parts of the data as the pretext task. It could be to predict the relative positions of two patches from a sample, to recover the positions of shuffled segments of an image (solve jigsaw) [93, 143, 67], or to infer the rotation angle of an image. A similar jigsaw technique is applied in PIRL to augment the positive sample, but PIRL does not regard solving jigsaw and recovering spatial relation as its objective.
In pre-trained language models, similar ideas such as Next Sentence Prediction (NSP) are also adopted. The NSP loss is initially introduced by BERT, where for a sentence, the model is asked to distinguish the true next sentence from a randomly sampled one. However, later work shows empirically that NSP helps little and may even harm performance. So in RoBERTa, the NSP loss is removed.
To replace NSP, ALBERT proposes the Sentence Order Prediction (SOP) task. The reason is that, in NSP, the negative next sentence is sampled from other passages that may have different topics from the current one, turning NSP into a far easier topic model problem. In SOP, two sentences that exchange their position are regarded as a negative sample, making the model concentrate on the coherence of the semantic meaning.
4.1.2 Maximize Mutual Information
This kind of method derives from mutual information (MI), a basic concept in statistics. Mutual information aims to model this association, and our objective is to maximize it. Generally, this kind of model optimizes

$$\max_{g_1 \in \mathcal{G}_1,\, g_2 \in \mathcal{G}_2} I\big(g_1(x_1), g_2(x_2)\big),$$

where $g_i$ is the representation encoder, $\mathcal{G}_i$ is a class of encoders with some constraints and $I(\cdot, \cdot)$ is a sample-based estimator for the real mutual information. In applications, MI is notoriously hard to compute. A common practice is to maximize a lower bound on $I$ with an NCE objective.
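Deep InfoMax is the first to explicitly model mutual information through a contrastive learning task, maximizing the MI between a local patch and its global context. In practice, taking image classification as an example, we can encode a dog image $x$ into $f(x) \in \mathbb{R}^{M \times M \times d}$ and take out a local feature vector $v \in \mathbb{R}^{d}$. To contrast instance and context, we need two other things:
a summary function $g: \mathbb{R}^{M \times M \times d} \to \mathbb{R}^{d}$ to generate the context vector $s = g(f(x)) \in \mathbb{R}^{d}$
another cat image $x^{-}$ and its context vector $s^{-} = g(f(x^{-}))$.
The contrastive objective is then formulated as
$$\mathcal{L} = \mathbb{E}_{v,x}\left[-\log \frac{e^{v^{T} s}}{e^{v^{T} s} + e^{v^{T} s^{-}}}\right]$$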
Deep InfoMax provides a new paradigm and boosts the development of self-supervised learning. The first influential follower is Contrastive Predictive Coding (CPC) for speech recognition. CPC maximizes the association between a segment of audio and its context audio. To improve data efficiency, it takes several negative context vectors at the same time. Later on, CPC was also applied to image classification.
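The local-versus-context contrast described above can be sketched as a softmax cross-entropy over dot-product scores, with one positive context and one or more negative contexts. This is a simplified illustration under the assumption that scores are plain inner products; the name `local_global_nce` is hypothetical.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def local_global_nce(v: Sequence[float], s_pos: Sequence[float],
                     s_negs: List[Sequence[float]]) -> float:
    """-log( e^{v.s} / (e^{v.s} + sum_j e^{v.s_j^-}) ) for one local feature v.

    With a single negative this is the two-term contrast between the true
    context and one negative context; passing several negatives gives the
    multi-negative variant.
    """
    scores = [dot(v, s_pos)] + [dot(v, s) for s in s_negs]
    top = max(scores)  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(x - top) for x in scores))
    return log_z - scores[0]
```

The loss shrinks when the local feature scores higher against its own context than against the negative contexts.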
AMDIM enhances the positive association between a local feature and its context. It randomly samples two different views of an image (cropped, color-distorted, and so forth) to generate the local feature vector and the context vector, respectively. CMC extends this to several different views of one image and samples another, irrelevant image as the negative. However, CMC is fundamentally different from Deep InfoMax and AMDIM because it proposes to measure context-context similarity rather than instance-context similarity. We will discuss it in the next subsection.
In language pre-training, InfoWord proposes to maximize the mutual information between a global representation of a sentence and the n-grams in it. The context is induced from the sentence with the selected n-grams masked, and the negative contexts are randomly picked from the corpus.
In graph learning, Deep Graph InfoMax (DGI) considers a node’s representation as the local feature and the average of randomly sampled 2-hop neighbors as the context. However, it is hard to generate negative contexts on a single graph. To solve this problem, DGI proposes to corrupt the original context by keeping the sub-graph structure and permuting the node features. DGI is followed by a number of works, such as InfoGraph, which targets learning graph-level rather than node-level representations, maximizing the mutual information between the graph-level representation and substructures at different levels.
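The corruption step, keeping the structure and permuting the node features, can be sketched in a few lines. The feature-matrix-as-list representation and the name `corrupt_features` are assumptions for illustration.

```python
import random
from typing import List


def corrupt_features(features: List[List[float]],
                     rng: random.Random) -> List[List[float]]:
    """Return a row-shuffled copy of the node feature matrix.

    Row i of the result is assigned to node i, so every node keeps its edges
    (the adjacency structure is untouched) but receives the features of some
    other, randomly chosen node.
    """
    perm = list(range(len(features)))
    rng.shuffle(perm)
    return [features[p] for p in perm]
```

Encoding the graph with these shuffled features yields the negative node representations that are contrasted against the uncorrupted context.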
A systematic analysis of pre-training strategies for graph neural networks considers two dimensions: attribute/structural and node-level/graph-level. For structural prediction at the node level, it also proposes to maximize the MI between the representations of a k-hop neighborhood and its context graph. SGRL further separates nodes in the context graph into k-hop context subgraphs and maximizes their MI with the target node respectively.
4.2 Context-Context Contrast
Prior work provides empirical evidence that the success of the models mentioned above is only loosely connected to MI, by showing that an upper-bound MI estimator leads to ill-conditioned representations with lower performance. Instead, more should be attributed to the encoder architecture and to a negative sampling strategy related to metric learning. A significant focus in metric learning is to perform hard positive sampling while increasing the negative sampling efficiency, and these factors probably play a more critical role in the success of MI-based models.
Recently, MoCo and SimCLR empirically support the above conclusion. They outperform the context-instance-based methods and achieve results competitive with supervised methods under the linear classification protocol, through a direct context-to-context comparison. We will start with cluster-based discrimination, proposed earlier, and then turn to instance discrimination, advocated by MoCo and SimCLR.
4.2.1 Cluster-based Discrimination
Context-context contrast was first studied in clustering-based methods [148, 81, 94, 20], especially DeepCluster, which was the first to achieve performance competitive with the supervised model AlexNet.
Image classification asks the model to categorize images correctly, and the representations of images in the same category should be similar. Therefore, the motivation is to draw similar images close together in the embedding space. In supervised learning, this drawing-near process is accomplished via label supervision; in self-supervised learning, however, we do not have such labels. To solve the label problem, DeepCluster proposes to leverage clustering to yield pseudo labels and asks a discriminator to predict the images’ labels. The training can be formulated in two steps. In the first step, DeepCluster uses K-means to cluster the encoded representations and produces a pseudo label for each sample. In the second step, the discriminator predicts whether two samples are from the same cluster, and the error is back-propagated to the encoder. These two steps are performed iteratively.
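The first of the two steps, turning encoded representations into pseudo labels, can be sketched with a small K-means routine. This is a toy illustration, not DeepCluster’s implementation: the deterministic initialization from the first `k` points, the fixed iteration count and the helper `same_cluster_targets` are all assumptions.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pseudo_labels(points: List[Point], k: int, iters: int = 10) -> List[int]:
    """Cluster encoded representations and return one pseudo label per sample."""
    centroids = [list(p) for p in points[:k]]  # assumed deterministic init
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid becomes the pseudo label.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def same_cluster_targets(labels: List[int]) -> List[Tuple[int, int, int]]:
    """Pairwise targets (i, j, 1 if same pseudo label else 0) for step two."""
    n = len(labels)
    return [(i, j, int(labels[i] == labels[j]))
            for i in range(n) for j in range(i + 1, n)]
```

In a full training loop the labels would be recomputed from the current encoder after each round, since the representations, and hence the clusters, keep changing.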
Recently, Local Aggregation (LA) has pushed forward the boundary of cluster-based methods. It points out several drawbacks of DeepCluster and makes corresponding improvements. First, in DeepCluster, samples are assigned to mutually exclusive clusters, whereas LA identifies neighbors separately for each example. Second, DeepCluster optimizes a cross-entropy discriminative loss, while LA employs an objective function that directly optimizes a local soft-clustering metric. These two changes substantially boost the performance of LA representations on downstream tasks.
A similar work that aggregates similar vectors together in the embedding space is VQ-VAE [135, 112], which we introduce in Section 3. To overcome the traditional difficulty VAEs have in generating high-fidelity images, VQ-VAE proposes to quantize vectors. For the feature matrix encoded from an image, VQ-VAE substitutes each 1-dimensional vector in the matrix with the nearest one in an embedding dictionary. This process is somewhat similar to what LA does.
Clustering-based discrimination may also help in the generalization of other pre-trained models, transferring models from pretext objectives to real tasks better. Traditional representation learning models have only two stages: one for pre-training, and the other for evaluation. ClusterFit  introduces a cluster prediction fine-tuning stage similar to DeepCluster between the above two stages, which improves the representation’s performance on downstream classification evaluation.
4.2.2 Instance Discrimination
The prototype for leveraging instance discrimination as a pretext task is InstDisc. On the basis of InstDisc, CMC proposes to adopt multiple different views of an image as positive samples and take another image as the negative. In the embedding space, CMC draws multiple views of an image close together and pulls them away from other samples. However, it is somewhat constrained by the idea of Deep InfoMax, sampling one negative sample for each positive one.
In MoCo, researchers further develop the idea of leveraging instance discrimination via momentum contrast, which substantially increases the number of negative samples. For example, given an input image $x$, our intuition is to learn a distinctive representation $q = f_q(x)$ by a query encoder $f_q(\cdot)$ that can distinguish $x$ from any other image. Therefore, for a set of other images $x_i$, we employ an asynchronously updated key encoder $f_k(\cdot)$ to yield $k_+ = f_k(x)$ and $k_i = f_k(x_i)$, and optimize the following objective
$$\mathcal{L} = -\log \frac{\exp(q \cdot k_+/\tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i/\tau)}$$
where $K$ is the number of negative samples and the sum runs over the positive key and the $K$ negatives. This formula is in the form of InfoNCE.
Besides, MoCo presents two other critical ideas in dealing with negative sampling efficiency.
First, it abandons the traditional end-to-end training framework and designs the momentum contrast learning with two encoders (query and key), which prevents the fluctuation of loss convergence in the beginning period.
Second, to enlarge the capacity of negative samples, MoCo employs a queue (with K as large as 65536) to save the recently encoded batches as negative samples. This significantly improves the negative sampling efficiency.
There are some other auxiliary techniques to ensure the training convergence, such as batch shuffling to avoid trivial solution and temperature hyper-parameter to adjust the scale.
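The two ideas above, a momentum-updated key encoder and a fixed-size queue of negatives, can be sketched as follows. Encoder parameters are reduced to flat lists of floats, and the defaults `m=0.999` and `tau=0.07` as well as the tiny queue size in the usage are illustrative assumptions.

```python
import math
from collections import deque
from typing import Deque, List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def momentum_update(key_params: List[float], query_params: List[float],
                    m: float = 0.999) -> List[float]:
    """Key encoder slowly follows the query encoder: k <- m*k + (1-m)*q."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def info_nce(q: Sequence[float], k_pos: Sequence[float],
             negatives: Sequence[Sequence[float]], tau: float = 0.07) -> float:
    """-log softmax of the positive key among the positive and queued negatives."""
    logits = [dot(q, k_pos) / tau] + [dot(q, k) / tau for k in negatives]
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_z - logits[0]


def enqueue(queue: Deque[List[float]], keys: List[List[float]]) -> None:
    """Append the newest keys; a deque with maxlen drops the oldest ones."""
    queue.extend(keys)
```

A usage sketch: create `queue = deque(maxlen=K)`, compute the loss of each query against `list(queue)`, then call `enqueue` with the current batch of keys and `momentum_update` on the key encoder’s parameters.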
However, MoCo adopts an overly simple positive sampling strategy: a pair of positive representations comes from the same sample without any transformation or augmentation, making the positive pair far too easy to distinguish. PIRL adds jigsaw augmentation as described in Section 4.1.1. To produce a pretext-invariant representation, PIRL asks the encoder to regard an image and its jigsawed version as a similar pair.
In SimCLR, the authors further illustrate the importance of a hard positive sample strategy by introducing data augmentation in 10 forms. This data augmentation is similar to CMC, which leverages several different views to augment the positive pairs. SimCLR follows the end-to-end training framework instead of the momentum contrast of MoCo, and to handle the need for large-scale negative samples, SimCLR chooses a batch size as large as 8192.
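The details are as follows. A minibatch of $N$ samples is augmented into $2N$ samples $\hat{x}_j$ ($j = 1, 2, \ldots, 2N$). For a positive pair $\hat{x}_i$ and $\hat{x}_j$ (derived from one original sample), the other $2(N-1)$ samples are treated as negatives. A pairwise contrastive loss, the NT-Xent loss, is defined as
$$l_{i,j} = -\log \frac{\exp(\mathrm{sim}(\hat{x}_i, \hat{x}_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(\hat{x}_i, \hat{x}_k)/\tau)}$$
Note that $l_{i,j}$ is asymmetric, and the $\mathrm{sim}(\cdot,\cdot)$ function here is a cosine similarity that normalizes the representations. The total loss is
$$\mathcal{L} = \frac{1}{2N} \sum_{k=1}^{N} \left[ l_{2k-1,2k} + l_{2k,2k-1} \right]$$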
SimCLR also provides some other useful techniques, including a learnable nonlinear transformation between the representation and the contrastive loss, more training steps, and deeper neural networks. Follow-up work conducts ablation studies to show that the techniques in SimCLR can also further improve MoCo’s performance.
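The NT-Xent loss described above can be sketched directly from its definition: each of the $2N$ augmented samples is contrasted against all others in the batch using temperature-scaled cosine similarity. The batch layout (views of sample $k$ stored at indices $2k$ and $2k+1$) and the default `tau=0.5` are assumptions for illustration.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nt_xent_pair(z: List[Sequence[float]], i: int, j: int, tau: float) -> float:
    """Loss for anchor i with positive j; every k != i appears in the denominator."""
    num = math.exp(cosine(z[i], z[j]) / tau)
    den = sum(math.exp(cosine(z[i], z[k]) / tau) for k in range(len(z)) if k != i)
    return -math.log(num / den)


def nt_xent(z: List[Sequence[float]], tau: float = 0.5) -> float:
    """Average the pairwise loss over both orderings of every positive pair."""
    n = len(z) // 2
    total = 0.0
    for k in range(n):
        total += nt_xent_pair(z, 2 * k, 2 * k + 1, tau)
        total += nt_xent_pair(z, 2 * k + 1, 2 * k, tau)
    return total / (2 * n)
```

Since all negatives come from the same batch, a larger batch directly means more negatives per anchor, which is why the batch size matters so much in this setup.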
In graph learning, Graph Contrastive Coding (GCC) pioneers the use of instance discrimination as the pretext task for structural information pre-training. For each node, we sample two subgraphs independently by random walk with restart and use the top eigenvectors of their normalized graph Laplacian matrices as the nodes’ initial representations. Then we use a GNN to encode them and calculate the InfoNCE loss as MoCo and SimCLR do, where the node embeddings from the same node (in different subgraphs) are viewed as similar. Results show that GCC learns better transferable structural knowledge than previous work such as struc2vec, GraphWave and ProNE.
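The subgraph sampling step can be illustrated with a plain random walk with restart over an adjacency list. This sketch covers only the sampling, not the eigenvector initialization or the GNN encoder, and the parameters `steps` and `restart_p` are hypothetical choices.

```python
import random
from typing import Dict, List, Set


def rwr_subgraph(adj: Dict[int, List[int]], start: int, steps: int,
                 restart_p: float, rng: random.Random) -> Set[int]:
    """Collect the nodes visited by a random walk that keeps returning to `start`."""
    visited = {start}
    node = start
    for _ in range(steps):
        if rng.random() < restart_p or not adj[node]:
            node = start                   # restart (also used at dead ends)
        else:
            node = rng.choice(adj[node])   # move to a random neighbor
        visited.add(node)
    return visited
```

Calling the function twice with the same `start` gives two node sets whose induced subgraphs play the role of the positive pair, while subgraphs sampled around other nodes act as negatives.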
5 Generative-Contrastive (Adversarial) Self-supervised Learning
5.1 Why Generative-Contrastive (Adversarial)?
A reason for the generative model’s success in self-supervised learning is its ability to fit the data distribution, based on which varied downstream tasks can be conducted. The objective of generative self-supervised learning is usually formulated as a maximum likelihood function
$$\mathcal{L}_{MLE} = -\sum_{x} \log p(x \mid c)$$
where $x$ ranges over all the samples we hope to model, and $c$ is a conditional constraint such as context information. This objective is then optimized by Maximum Likelihood Estimation (MLE). Nevertheless, MLE has two fatal problems:
Sensitive and Conservative Distribution. When $p(x \mid c) \to 0$, $\mathcal{L}_{MLE}$ becomes extremely large, making the generative model extremely sensitive to rare samples. This directly leads to a conservative distribution, which performs poorly.
Low-level Abstraction Objective. In MLE, the representation distribution is modeled at $x$’s level, such as pixels in images, words in texts, and nodes in graphs. However, most classification tasks target high-level abstraction, such as object detection, long paragraph understanding, and molecule classification.
These two problems severely restrict the development of generative self-supervised learning. Fortunately, discriminative and contrastive objectives can solve these problems because they are designed to serve human-level understanding. Take the autoencoder and GAN as examples. Autoencoders leverage a pointwise reconstruction loss, which may fit pixel-level patterns outside the sample distribution. GANs, however, utilize contrastive objectives that distinguish generated samples from real ones, which fit at the semantic level and avoid this problem.
In terms of the difference from contrastive learning, adversarial methods still preserve the generator structure consisting of an encoder and a decoder, while contrastive methods abandon the decoder component (as shown in Fig. 12). This is critical because, on the one hand, the generator endows adversarial learning with the strong expressiveness that is peculiar to generative models; on the other hand, it also makes the objective of adversarial methods far more challenging to learn than that of contrastive methods, leading to unstable convergence. In the adversarial setting, the existence of the decoder requires the representation to be “reconstructive”, in other words, to contain all the information necessary for constructing the inputs. In the contrastive setting, however, we only need to learn “distinguishable” information to discriminate different samples.
To sum up, the adversarial method absorbs merits from both generative and contrastive methods together with some drawbacks. In a situation where we need to fit on an implicit distribution, it is a better choice. In the next several subsections, we will discuss its various applications on representation learning.
5.2 Generate with Complete Input
In this section, we introduce GAN and its variants for representation learning, which focus on capturing the complete information of the sample.
The inception of adversarial representation learning should be attributed to Generative Adversarial Networks (GAN), which propose the adversarial training framework. Following GAN, many variants [80, 98, 60, 16, 66, 62] have emerged and reshaped people’s understanding of deep learning’s potential. The training process of GAN can be viewed as a game between two players: one generates fake samples while the other tries to distinguish them from real ones. To formulate this problem, we define $G$ as the generator, $D$ as the discriminator, $p_{data}(x)$ as the real sample distribution, and $p_z(z)$ as the learned latent sample distribution. We want to optimize this min-max game
$$\min_G \max_D \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
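To make the two-player game concrete, the sketch below estimates the standard GAN value $\mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$ from discriminator outputs on a batch of real and generated samples. The function name and the use of raw probability lists in place of actual networks are simplifying assumptions.

```python
import math
from typing import Sequence


def gan_value(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs in (0, 1) on real samples.
    d_fake: discriminator outputs in (0, 1) on generated samples.
    The discriminator tries to raise this value; the generator tries to lower it.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term
```

When the discriminator cannot tell the two sources apart and outputs 0.5 everywhere, the value equals $-\log 4$, the known equilibrium value of the original GAN game.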
Before VQ-VAE2, GAN maintained dominant performance on image generation tasks over purely generative models, such as the autoregressive PixelCNN and the autoencoder-based VAE. It is natural to ask how this framework could benefit representation learning.
However, there is a gap between generation and representation. Compared to the autoencoder’s explicit latent sample distribution $p_z(z)$, GAN’s latent distribution is implicitly modeled, and we need to extract this implicit distribution. To bridge this gap, AAE first proposes a solution that follows the natural idea of the autoencoder. The generator in GAN could be viewed as an implicit autoencoder. To extract the representation, we can replace the generator with an explicit variational autoencoder (VAE). Recall the objective of VAE
$$\mathcal{L}_{VAE} = -\mathbb{E}_{q(z|x)}[\log p(x|z)] + KL(q(z|x) \,\|\, p(z))$$
As we mentioned before, compared to the reconstruction loss of the autoencoder, the discriminative loss in GAN better models high-level abstraction. To alleviate the problem, AAE substitutes the KL divergence term with a discriminative loss
$$\mathcal{L}_{Disc} = \mathrm{CrossEntropy}(q(z), p(z))$$
that asks the discriminator to distinguish representations produced by the encoder from samples of a prior distribution.
However, AAE still preserves the reconstruction error, which contradicts GAN’s core idea. Based on AAE, BiGAN and ALI argue for embracing adversarial learning without reservation and put forward a new framework. Given a real sample $x$:
Generator $G$: the generator here virtually acts as the decoder and generates fake samples $x' = G(z)$ from $z$ drawn from a prior latent distribution (e.g. $[\mathrm{uniform}(-1,1)]^d$, where $d$ refers to the dimension).
Encoder $E$: a newly added component, mapping a real sample $x$ to a representation $z' = E(x)$. This is also exactly what we want to train.
Discriminator $D$: given two inputs $[z, G(z)]$ and $[E(x), x]$, decide which one is from the real sample distribution.
It is easy to see that their training goal is $E = G^{-1}$; in other words, the encoder $E$ should learn to invert the generator $G$. This goal could be rewritten as an $\ell_0$ loss for an autoencoder, but it is not the same as a traditional autoencoder because the distribution does not make any assumption about the data itself. The distribution is shaped by the discriminator, which captures semantic-level differences. Based on BiGAN and ALI, later studies [26, 40] discover that GANs with deeper and larger networks and modified architectures can produce even better results on downstream tasks.
5.3 Recover with Partial Input
As we mentioned above, GAN’s architecture was not designed for representation learning, and modification is needed to apply its framework. While BiGAN and ALI choose to extract the implicit distribution directly, some other methods such as colorization [156, 157, 76, 75], inpainting [60, 98] and super-resolution apply adversarial learning in a different way. Instead of asking models to reconstruct the whole input, they provide models with partial input and ask them to recover the remaining parts. This is similar to denoising autoencoders (DAE) such as BERT’s family in natural language processing, but notice that it is conducted in an adversarial manner.
Colorization was the earliest of these to be proposed. The problem can be described as follows: given one color channel $L$ of an image, predict the values of the two other channels $A$ and $B$. The encoder and decoder networks can be set to any form of convolutional neural network. What is interesting is that, to avoid the uncertainty brought by traditional generative methods such as VAE, the authors transform the generation task into a classification one. They first determine the common range of $(A, B)$ values and then split it into 313 categories. The classification is performed through a softmax layer with a hyper-parameter $T$ as adjustment. Building on this work, a range of colorization-based representation methods [157, 76, 75] have been proposed to benefit downstream tasks.
Inpainting [60, 98] is more straightforward. We ask the model to predict an arbitrary part of an image given the rest of it. Then a discriminator is employed to distinguish the inpainted image from the original one. The super-resolution method SRGAN follows the same idea to recover high-resolution images from blurred low-resolution ones in the adversarial setting.
5.4 Pre-trained Language Model
For a long time, pre-trained language models (PTMs) focused on maximum-likelihood-based pretext tasks, because discriminative objectives were thought to be unhelpful given the rich patterns of language. However, some recent work shows excellent performance and sheds light on the potential of contrastive objectives in PTMs.
The pioneering work is ELECTRA, which surpasses BERT given the same computation budget. ELECTRA proposes Replaced Token Detection (RTD) and leverages GAN’s structure to pre-train a language model. In this setting, the generator $G$ is a small Masked Language Model (MLM), which replaces masked tokens in a sentence with words. The discriminator $D$ is asked to predict which words are replaced. Notice that replaced means not the same as the original unmasked inputs. The training is conducted in two stages:
Warm up the generator: train $G$ with the MLM pretext task $\mathcal{L}_{MLM}(x, \theta_G)$ for some steps to warm up the parameters.
Train the discriminator: $D$’s parameters are initialized with $G$’s and then trained with the discriminative objective $\mathcal{L}_{Disc}(x, \theta_D)$ (a cross-entropy loss). During this period, $G$’s parameters are frozen.
The final objective could be written as
$$\min_{\theta_G, \theta_D} \sum_{x \in \mathcal{X}} \mathcal{L}_{MLM}(x, \theta_G) + \lambda \mathcal{L}_{Disc}(x, \theta_D)$$
Though ELECTRA is structured like a GAN, it is not trained in the GAN setting. This is because, unlike image data, which is continuous, word tokens are discrete, which blocks gradient backpropagation. A possible substitute is to leverage policy gradients, but experiments in ELECTRA show that the performance is slightly lower. Theoretically speaking, the discriminative objective $\mathcal{L}_{Disc}$ turns the conventional $k$-class softmax classification into a binary classification. This substantially saves computation, but may somewhat harm the representation quality due to the early degeneration of the embedding space. In summary, ELECTRA is still an inspiring pioneering work in leveraging discriminative objectives.
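How the binary targets for replaced token detection arise can be sketched without any model: the generator’s samples are written into the masked positions, and a token counts as replaced only if it differs from the original. The dictionary of generator samples below stands in for a real masked language model and is purely illustrative.

```python
from typing import Dict, List, Tuple


def replaced_token_detection(original: List[str],
                             generator_samples: Dict[int, str]
                             ) -> Tuple[List[str], List[int]]:
    """Build the discriminator input and its per-token binary labels.

    generator_samples maps each masked position to the token the generator
    proposed for it. A label is 1 only where the corrupted token differs from
    the original, so a correct guess by the generator is labeled 0.
    """
    corrupted = list(original)
    for pos, tok in generator_samples.items():
        corrupted[pos] = tok
    labels = [int(c != o) for c, o in zip(corrupted, original)]
    return corrupted, labels
```

Every position receives a label, so the discriminator gets a training signal from the whole sequence rather than only from the masked subset.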
At the same time, WKLM proposes to perform RTD at the entity level. For entities in Wikipedia paragraphs, WKLM replaces them with similar entities and trains the language model to distinguish them with a discriminative objective similar to ELECTRA’s, performing quite well in downstream tasks like question answering. A similar work is REALM, which applies higher, article-level retrieval augmentation to the language model. However, REALM does not use the discriminative objective.
5.5 Graph Learning
The most natural idea is to follow the practice of BiGAN and ALI, which asks a discriminator to distinguish representations drawn from the generated distribution and from a prior distribution. Adversarial Network Embedding (ANE) designs a generator $G$ that is updated in two stages: 1) $G$ encodes a sampled graph into a target embedding and computes a traditional NCE loss with a context encoder $F$, as in Skip-gram; 2) a discriminator $D$ is asked to distinguish embeddings from $G$ from ones sampled from a prior distribution. The optimized objective is the sum of the above two objectives, and the generator can yield better node representations for the classification task.
GraphGAN models the link prediction task and follows the original GAN-style discriminative objective, distinguishing directly at the node level rather than the representation level. The model first selects nodes from the subgraph of the target node $v_c$ according to the embedding encoded by the generator $G$. Then some neighbor nodes of $v_c$ selected from the subgraph, together with those selected by $G$, are put into a binary classifier $D$ that decides whether they are linked to $v_c$. Because this framework involves a discrete selection procedure, the discriminator $D$ can be updated by gradient descent, while the generator $G$ is updated via policy gradients.
GraphSGAN applies the adversarial method to semi-supervised graph learning, with the motivation that most classification errors in the graph are caused by marginal nodes. Samples in the same category are usually clustered in the embedding space, and between clusters there are density gaps where few samples exist. The authors provide a rigorous mathematical proof that if we generate enough fake samples in the density gaps, we are theoretically able to perform complete classification. During training, GraphSGAN leverages a generator to generate fake nodes in density gaps and asks the discriminator to classify nodes into their original categories plus a category for the fake ones. In the test period, fake samples are removed, and classification results on the original categories can be improved substantially.
5.6 Domain Adaptation and Multi-modality Representation
Essentially, the discriminator in adversarial learning serves to match the discrepancy between the latent representation distribution and the data distribution. This function naturally relates to domain adaptation and multi-modality representation problems, which aim at aligning different representation distributions. [1, 43, 117, 2] study how GAN can help with domain adaptation. [17, 141] leverage adversarial sampling to improve the quality of negative samples. For multi-modality representation, image-to-image translation, text style transfer, word-to-word translation and image-to-text translation show the great power of adversarial representation learning.
6 Theory behind Self-supervised Learning
In the last three sections, we introduced a number of empirical works on self-supervised learning. However, we are also curious about their theoretical foundations. In this part, we provide some theoretical insights into self-supervised learning’s success from different perspectives.
6.1.1 Divergence Matching
As generative models, GANs pay attention to the difference between the real data distribution and the generated data distribution:
f-GAN shows that the generative-adversarial approach is a special case of an existing, more general variational divergence estimation problem, and uses f-divergences to train generative models. An f-divergence reflects the difference between two distributions $P$ and $Q$:
$$D_f(P \,\|\, Q) = \int_{\mathcal{X}} q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx$$
Different divergence functions lead to different GAN variants. Prior work also discusses the effects of various choices of divergence functions.
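To see how one formula yields several divergences, the sketch below evaluates the discrete form $D_f(P\|Q) = \sum_x q(x)\, f(p(x)/q(x))$ for three standard choices of $f$. It assumes finite distributions with strictly positive probabilities; the helper names are illustrative.

```python
import math
from typing import Callable, Sequence


def f_divergence(p: Sequence[float], q: Sequence[float],
                 f: Callable[[float], float]) -> float:
    """Discrete f-divergence: sum over x of q(x) * f(p(x) / q(x))."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))


def f_kl(t: float) -> float:
    """f(t) = t log t gives KL(P || Q)."""
    return t * math.log(t)


def f_reverse_kl(t: float) -> float:
    """f(t) = -log t gives KL(Q || P)."""
    return -math.log(t)


def f_total_variation(t: float) -> float:
    """f(t) = |t - 1| / 2 gives the total variation distance."""
    return 0.5 * abs(t - 1.0)
```

Each `f` is convex with `f(1) = 0`, so the divergence vanishes when the two distributions coincide; swapping `f` changes which mismatches between $P$ and $Q$ are penalized most.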
6.1.2 Disentangled Representation
An important drawback of supervised learning is that it easily gets trapped by spurious information. A famous example is that supervised neural networks learn to distinguish dogs and wolves by whether they are in grass or snow, which means the supervised models do not learn disentangled representations of the grass and the dog, which should be mutually independent.
As an alternative, GAN shows superior potential in learning disentangled features, both empirically and theoretically. InfoGAN first proposes to learn disentangled representations with DCGAN. Conventionally, we sample white noise from a uniform or Gaussian distribution as input to the generator of a GAN. However, this white noise carries no meaning with respect to the characteristics of the generated image. We assume that there should be a latent code $c$ whose dimensions represent different characteristics of the image (such as rotation degree and width). We learn this $c$ jointly in the discrimination period by the discriminator $D$, and maximize $c$’s mutual information $I(c; x)$ with the image $x = G(z, c)$, where $G$ refers to the generator (actually the decoder).
Since mutual information is notoriously hard to compute, the authors leverage the variational inference approach to estimate its lower bound $L_I(c; x)$, and the final objective for InfoGAN is modified as:
$$\min_G \max_D V_I(D, G) = V(D, G) - \lambda L_I(c; x)$$
Experiments show that InfoGAN indeed learns a good disentangled representation on MNIST. This further encourages researchers to examine whether the modular structures for generation inside the GAN could be disentangled and independent of each other. GAN dissection is a pioneering work in applying causal analysis to understanding GANs. The authors identify the correlations between channels in the convolutional layers and objects in the generated images, and examine whether they are causally related to the output. Later work takes another step to examine these channels’ conditional independence via rigorous counterfactual interventions over them. Results indicate that in BigGAN researchers are able to disentangle backgrounds and objects, such as replacing the background of a cock from bare soil with grassland.
These works indicate the ability of GAN to learn disentangled features, and other self-supervised learning methods are likely to be capable of this too.
6.2 Maximizing Lower Bound
6.2.1 Evidence Lower Bound
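VAE (variational auto-encoder) learns the representation by learning a distribution $q(z|x)$ to approximate the posterior distribution $p(z|x)$,
$$KL(q(z|x) \,\|\, p(z|x)) = -ELBO + \log p(x)$$
where ELBO (Evidence Lower Bound Objective) is the lower bound of the optimization target $\log p(x)$. VAE maximizes the ELBO to minimize the difference between $q(z|x)$ and $p(z|x)$.
$$ELBO = \mathbb{E}_{z \sim q(z|x)}[\log p(x|z)] - KL(q(z|x) \,\|\, p(z))$$
where $KL(q(z|x) \,\|\, p(z))$ is the regularization loss that pulls $q(z|x)$ toward the Gaussian prior and $\mathbb{E}_{z \sim q(z|x)}[\log p(x|z)]$ corresponds to the reconstruction loss.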
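As a small worked example of the two ELBO terms, the sketch below uses the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior, a common modeling choice that is assumed here rather than stated in the text. The reconstruction term is passed in as a precomputed log-likelihood.

```python
import math
from typing import Sequence


def gaussian_kl(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def elbo(log_px_given_z: float, mu: Sequence[float],
         log_var: Sequence[float]) -> float:
    """ELBO = reconstruction log-likelihood minus the KL regularizer."""
    return log_px_given_z - gaussian_kl(mu, log_var)
```

The regularizer is zero exactly when the posterior equals the prior, and it grows as the posterior mean moves away from zero or its variance departs from one.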
6.2.2 Mutual Information
Most current contrastive learning methods aim to maximize the MI (mutual information) of the input and its representation, with joint density $p(x, y)$ and marginal densities $p(x)$ and $p(y)$:
$$I(X, Y) = \mathbb{E}_{p(x,y)}\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right] = KL(p(x, y) \,\|\, p(x)\,p(y))$$
Deep InfoMax maximizes the MI of local and global features and replaces the KL-divergence with the JS-divergence, which is similar to the GAN mentioned above. Therefore the optimization target of Deep InfoMax becomes:
The form of the objective function is similar to (26), except that the data distribution becomes the global and local feature distributions. From a probabilistic point of view, GAN and Deep InfoMax are derived from the same process but for different learning targets. The encoder in GAN, to an extent, works the same as the encoder in representation learning models. The idea of generative-adversarial learning deserves to be used in self-supervised learning.
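Therefore the MI satisfies $I(x_{t+k}, c_t) \geq \log(N) - \mathcal{L}_N$, where $\mathcal{L}_N$ is the InfoNCE loss computed with $N$ samples, $x_{t+k}$ is the target sample and $c_t$ is its context. The approximation becomes increasingly accurate, and $I(x_{t+k}, c_t)$ also increases as $N$ grows. This implies that it is useful to use a large number of negative samples (large values of $N$). However, it has been demonstrated that increasing the number of negative samples does not necessarily help. Negative sampling remains a key challenge to study.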
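The role of $N$ can be illustrated numerically. The sketch assumes the standard InfoNCE bound $I \geq \log N - \mathcal{L}_N$ and a square score matrix whose diagonal holds the positive pairs; both the layout and the function names are assumptions for illustration.

```python
import math
from typing import List


def infonce_loss(scores: List[List[float]]) -> float:
    """Mean of -log softmax(row)[i] over rows; entry [i][i] is the positive pair."""
    total = 0.0
    for i, row in enumerate(scores):
        top = max(row)
        log_z = top + math.log(sum(math.exp(s - top) for s in row))
        total += log_z - row[i]
    return total / len(scores)


def mi_lower_bound(scores: List[List[float]]) -> float:
    """log N - L_N; because L_N >= 0, the bound can never exceed log N."""
    return math.log(len(scores)) - infonce_loss(scores)
```

However good the critic is, the certified MI is capped at $\log N$, which is one way to see why the number of negatives matters for this estimator.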
Though maximizing ELBO and MI has achieved state-of-the-art results in self-supervised representation learning, it has been demonstrated that MI and ELBO are only loosely connected with downstream task performance. Maximizing the lower bound (MI and ELBO) is not sufficient to learn useful representations. On the one hand, looser bounds often yield better test accuracy in downstream tasks. On the other hand, achieving the same lower bound value can lead to vastly different representations and performance on downstream tasks, which indicates that the bound does not necessarily capture useful information about the data. There is a non-trivial interaction between the representation encoder, critic, and loss function.
MI maximization can also be analyzed from the metric learning view. Prior work provides some insight by connecting InfoNCE to the triplet (k-plet) loss in the deep learning community. The InfoNCE (33) can be rewritten as follows:
In particular is constrained to the form for a certain function . Then the InfoNCE corresponds to the expectation of the multi-class k-pair loss:
In metric learning, the encoder is shared across views ( and ) and the critic function is symmetric, while MI maximization methods, e.g. Deep InfoMax, CMC and MoCo, are not constrained by these conditions. (35) can be viewed as learning encoders with a parameter-less inner product.
6.3 Contrastive Self-supervised Representation Learning
It seems intuitive that minimizing the aforementioned loss functions should lead the representations to better capture the “similarity” between different entities, but it is unclear why the learned representations should also lead to better performance on downstream tasks, such as linear classification tasks. Intuitively, a self-supervised representation learning framework must capture the features in unlabelled data and their similarity to the semantic information that is implicitly present in downstream tasks. A conceptual framework has been proposed to analyze contrastive learning on average classification tasks.
Contrastive learning assumes that a similar data pair comes from a distribution and a negative sample from a distribution that is presumably unrelated to . Under the hypothesis that semantically similar points are sampled from the same latent class, the unsupervised loss can be expressed as:
Self-supervised learning is to find a function that minimizes the empirical unsupervised loss within the capacity of the encoder used. As negative points are sampled independently and identically from the dataset, can be decomposed into and according to the latent class the negative sample is drawn from. The intraclass deviation controls the and implies the unexpected loss contradictory to our optimization target, which is caused by the negative sampling strategies. In the setting of only 1 negative sample, it is proved that optimizing the unsupervised loss benefits the downstream classification tasks:
With probability at least , is the feature mapping function the encoder can capture, is the generalization error. When the sampled pair and the number of latent classes , and . If the encoder is powerful enough and trained using a sufficiently large number of samples, the learned function with low as well as low will have good performance on supervised tasks (low ).
Contrastive learning also has limitations. In fact, contrastive learning does not always pick the best supervised representation function . Minimizing the unsupervised loss to get low does not mean that because high and high does not imply high , resulting in the failure of the algorithm.
The relationship between and are further explored on the condition of mean classifier loss , where indicates that a label only corresponds to an embedding vector . If there exists a function that has intraclass concentration in a strong sense and can separate latent classes with high margin (on average) with the mean classifier, then will be low. If is in every direction for every class and has maximum norm , then can be controlled by .
For all and with the probability at least , . Under the assumption and context, optimizing the unsupervised loss indeed helps pick the best downstream task supervised loss.
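The mean classifier used in this analysis can be illustrated with a short sketch: each label is represented by a single vector, the mean embedding of its examples, and a point is assigned to the label whose mean has the largest inner product with the point's embedding. The toy embeddings, the labels, and the function names below are hypothetical and serve only as an illustration.

```python
def class_means(embeddings, labels):
    """Return {label: mean embedding} computed from labelled embeddings."""
    sums, counts = {}, {}
    for z, y in zip(embeddings, labels):
        if y not in sums:
            sums[y] = [0.0] * len(z)
            counts[y] = 0
        sums[y] = [s + a for s, a in zip(sums[y], z)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def mean_classifier_predict(z, means):
    """Pick the label whose mean embedding has the largest inner product with z."""
    return max(means, key=lambda y: sum(a * b for a, b in zip(z, means[y])))


# Hypothetical 2-D embeddings produced by some already-trained encoder.
embeddings = [(1.0, 0.1), (0.8, -0.1), (-0.1, 1.0), (0.1, 0.9)]
labels = ["cat", "cat", "dog", "dog"]

means = class_means(embeddings, labels)
pred = mean_classifier_predict((0.7, 0.2), means)
```

Because the classifier has no trainable parameters beyond the class means, its loss depends only on how well the representation concentrates each class around its mean and separates the means from one another.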
Besides, the general belief is that increasing the number of negative samples always helps, at the cost of increased computation. Noise Contrastive Estimation (NCE) explains that increasing the number of negative samples can provably improve the variance of the learned parameters. However, it has been argued that this does not hold for contrastive learning, and that performance can degrade once the number of negative samples exceeds a threshold.
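One possible intuition for why more negatives can hurt, which the text above does not spell out, is that negatives sampled at random sometimes share the latent class of the anchor. The sketch below computes the probability of at least one such collision. It assumes, purely for illustration, that latent classes are uniformly distributed and that negatives are drawn independently; the function name and the numbers are hypothetical.

```python
def collision_probability(num_classes, num_negatives):
    """P(at least one of k independent uniform negatives shares the anchor's class)."""
    return 1.0 - (1.0 - 1.0 / num_classes) ** num_negatives


# With 10 latent classes, collisions quickly become almost certain as k grows.
probs = {k: collision_probability(10, k) for k in (1, 5, 20, 100)}
```

Under these assumptions, a collision occurs with probability 0.1 for a single negative and becomes close to certain for large numbers of negatives. The model is then increasingly asked to push apart points that are in fact semantically similar.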
Under these assumptions, contrastive representation learning is theoretically proved to benefit downstream classification tasks. More detailed proofs can be found in the original work. This connects the “similarity” in unlabelled data with the semantic information in downstream tasks. Although the connection currently holds only in a restricted context, more general results deserve exploration.
7 Discussions and Future Directions
In this section, we will discuss several open problems and future directions in self-supervised learning for representation.
Theoretical Foundation Though self-supervised learning has achieved great success, few works investigate the mechanisms behind it. In Section 6, we list several recent works on this topic and show that theoretical analysis is significant to avoid misleading empirical conclusions.
Researchers have presented a conceptual framework to analyze how the contrastive objective contributes to generalization ability. Other work shows empirically that mutual information is only loosely related to the success of several MI-based methods, for which the sampling strategies and architecture design may matter more. This type of work is crucial for giving self-supervised learning a solid foundation, and more theoretical analysis is urgently needed.
Transferring to downstream tasks
There is an essential gap between pre-training and downstream tasks. Researchers design elaborate pretext tasks to help models learn critical features of the dataset that can transfer to other tasks, but this transfer sometimes fails to materialize. Besides, the process of selecting pretext tasks seems too heuristic, with no clear patterns to follow.
A typical example is the selection of pre-training tasks in BERT and ALBERT. BERT uses Next Sentence Prediction (NSP) to enhance its ability for sentence-level understanding. However, ALBERT shows that NSP amounts to a naive topic model, which is far too easy for language model pre-training and even decreases the performance of BERT.
For the pre-training task selection problem, a potentially exciting direction would be to design pre-training tasks for a specific downstream task automatically, just as Neural Architecture Search does for neural network architectures.
Transferring across datasets This problem is also known as how to learn inductive biases or inductive learning. Traditionally, we split a dataset into a training part used for learning the model parameters and a testing part used for evaluation. An essential prerequisite for this learning paradigm is that data in the real world conforms to the distribution of our dataset. Nevertheless, this assumption frequently fails in experiments.
Self-supervised representation learning solves part of this problem, especially in the field of natural language processing. Vast amounts of corpora used in the language model pre-training help to cover most patterns in language and therefore contribute to the success of PTMs in various language tasks. However, this is based on the fact that text in the same language shares the same embedding space. For other tasks like machine translation and fields like graph learning where embedding spaces are different for different datasets, how to learn the transferable inductive biases efficiently is still an open problem.
Exploring potential of sampling strategies One study attributes part of the success of mutual information-based methods to better sampling strategies. MoCo, SimCLR, and a series of other contrastive methods may also support this conclusion. They propose to leverage very large numbers of negative samples and augmented positive samples, whose effects are studied in deep metric learning. How to further unlock the power of sampling is still an unsolved and attractive problem.
Early Degeneration for Contrastive Learning Contrastive learning methods such as MoCo and SimCLR are rapidly approaching the performance of supervised learning for computer vision. However, their strong performance is generally limited to the classification problem. Meanwhile, the generative-contrastive method ELECTRA for language model pre-training also outperforms other generative methods on several standard NLP benchmarks with fewer model parameters. However, some remarks indicate that ELECTRA’s performance on language generation and neural entity extraction is not up to expectations.
The problems above probably arise because contrastive objectives often get trapped in the early degeneration of the embedding space, which means that the model over-fits to the discriminative pretext task too early and therefore loses the ability to generalize. We expect that techniques or new paradigms will emerge to solve the early degeneration problem while preserving contrastive learning’s advantages.
The work is supported by the National Key R&D Program of China (2018YFB1402600), NSFC for Distinguished Young Scholar (61825602), and NSFC (61836013).
-  (2014) Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446. Cited by: §5.6.
-  (2018) Domain adaptation with adversarial training and graph embeddings. arXiv preprint arXiv:1805.05151. Cited by: §5.6.
-  (2017) Fixing a broken elbo. arXiv preprint arXiv:1711.00464. Cited by: §6.2.2.
-  (2019) A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229. Cited by: §6.2.2, §6.3, §6.3, §6.3, §7.
-  (2019) Learning to retrieve reasoning paths over wikipedia graph for question answering. arXiv preprint arXiv:1911.10470. Cited by: §1.
-  (2019) Learning representations by maximizing mutual information across views. In NIPS, pp. 15509–15519. Cited by: TABLE I, Fig. 7, §4.1.2.
-  (2019) Simgnn: a neural network approach to fast graph similarity computation. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 384–392. Cited by: §1.
-  (1987) Modular learning in neural networks.. In AAAI, pp. 279–284. Cited by: §3.3.1.
-  (2018) Gan dissection: visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597. Cited by: §6.1.2.
-  (2003) A neural probabilistic language model. Journal of machine learning research 3 (Feb), pp. 1137–1155. Cited by: §2.1.
-  (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §3.3.4.
-  (1994) Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5 2, pp. 157–66. Cited by: §2.2.
-  (2018) Counterfactuals uncover the modular structure of deep generative models. arXiv preprint arXiv:1812.03253. Cited by: Fig. 17, §6.1.2.
-  (2019) Rethinking lossy compression: the rate-distortion-perception tradeoff. arXiv preprint arXiv:1901.07821. Cited by: §6.2.2.
-  (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. Cited by: TABLE I, §3.3.2.
-  (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §3.3.4, §5.2.
-  (2017) Kbgan: adversarial learning for knowledge graph embeddings. arXiv preprint arXiv:1711.04071. Cited by: §5.6.
-  (2015) GraRep: learning graph representations with global structural information. In CIKM ’15, Cited by: §2.3.
-  (2016) Deep neural networks for learning graph representations. In AAAI, Cited by: §2.3.
-  (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the ECCV (ECCV), pp. 132–149. Cited by: TABLE I, Fig. 9, §4.2.1, §4.2.1.
-  (2018) HARP: hierarchical representation learning for networks. In AAAI, Cited by: §2.3.
-  (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: TABLE I, Fig. 11, §4.2.2, §4.2, §4.2, §7, §7.
-  (2017) On sampling strategies for neural network-based collaborative filtering. In Proceedings of the 23rd ACM SIGKDD, pp. 767–776. Cited by: §4.2.2.
-  (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In NIPS, pp. 2172–2180. Cited by: §6.1.2.
-  (2020) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: TABLE I, §4.2.2, §4.2, §6.3.
-  (2017) Triple generative adversarial nets. In NIPS, pp. 4088–4098. Cited by: §5.2.
-  (2020) Electra: pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555. Cited by: TABLE I, Fig. 15, §5.4, §7.
-  (2017) Word translation without parallel data. arXiv preprint arXiv:1710.04087. Cited by: §5.6.
-  (2018) Adversarial network embedding. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: TABLE I, §5.5, §5.5.
-  (2019) Transformer-xl: attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988. Cited by: §3.4.1.
-  (1994) Learning classification with unlabeled data. In NIPS, pp. 112–119. Cited by: Fig. 1.
-  (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1, §3.3.4.
-  (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1, §2.1, TABLE I, §3.3.3, §4.1.1.
-  (2018) Semi-supervised learning on graphs with generative adversarial nets. In Proceedings of the 27th ACM CIKM, pp. 913–922. Cited by: TABLE I, Fig. 16, §5.5, §5.5.
-  (2019) Cognitive graph for multi-hop reading comprehension at scale. arXiv preprint arXiv:1905.05460. Cited by: §1.
-  (2014) Nice: non-linear independent components estimation. arXiv preprint arXiv:1410.8516. Cited by: TABLE I, §3.2.
-  (2016) Density estimation using real nvp. arXiv preprint arXiv:1605.08803. Cited by: TABLE I, §3.2.
-  (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE ICCV, pp. 1422–1430. Cited by: TABLE I, Fig. 6, §4.1.1.
-  (2016) Adversarial feature learning. arXiv preprint arXiv:1605.09782. Cited by: TABLE I, Fig. 13, §5.2, §5.2, §5.5.
-  (2019) Large scale adversarial representation learning. In NIPS, pp. 10541–10551. Cited by: TABLE I, §5.2.
-  (2018) Learning structural node embeddings via diffusion wavelets. In Proceedings of the 24th ACM SIGKDD, pp. 1320–1329. Cited by: §4.2.2.
-  (2016) Adversarially learned inference. arXiv preprint arXiv:1606.00704. Cited by: TABLE I, Fig. 13, §5.2, §5.5.
-  (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §5.6.
-  (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728. Cited by: TABLE I, Fig. 6, §4.1.1.
-  (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §1.
-  (2014) Generative adversarial nets. In NIPS, pp. 2672–2680. Cited by: TABLE I, §6.1.1.
-  (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD, pp. 855–864. Cited by: TABLE I.
-  (2018) Graphite: iterative generative modeling of graphs. In ICML, Cited by: §2.3.
-  (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304. Cited by: §4, §6.3.
-  (2020) Realm: retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909. Cited by: §5.4.
-  (2017) Representation learning on graphs: methods and applications. IEEE Data Eng. Bull. 40, pp. 52–74. Cited by: §2.3.
-  (2017) Inductive representation learning on large graphs. In NIPS, Cited by: §2.3.
-  (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: TABLE I, Fig. 10, §4.2.2, §4.2, §4.2, §6.3, §7, §7.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §2.2, §2.2.
-  (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: TABLE I, Fig. 7, §4.1.2, §6.2.2.
-  (2019) Flow++: improving flow-based generative models with variational dequantization and architecture design. In ICML, pp. 2722–2730. Cited by: §3.4.2.
-  (2019) Strategies for pre-training graph neural networks. In ICLR, Cited by: TABLE I, §4.1.2.
-  (2020) Heterogeneous graph transformer. arXiv preprint arXiv:2003.01332. Cited by: §1.
-  (2017) Densely connected convolutional networks. 2017 IEEE CVPR, pp. 2261–2269. Cited by: §1, §2.2.
-  (2017) Globally and locally consistent image completion. ACM Transactions on Graphics (ToG) 36 (4), pp. 1–14. Cited by: §5.2, §5.3, §5.3.
-  (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. ArXiv abs/1502.03167. Cited by: §2.2.
-  (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §5.2.
-  (2020) GCC: graph contrastive coding for graph neural network pre-training. Cited by: TABLE I, §4.2.2.
-  (2019) Self-supervised visual feature learning with deep neural networks: a survey. arXiv preprint arXiv:1902.06162. Cited by: §1, §4.1.1.
-  (2020) Spanbert: improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics 8, pp. 64–77. Cited by: TABLE I, §3.3.3.
-  (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §5.2.
-  (2018) Learning image representations by completing damaged jigsaw puzzles. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 793–802. Cited by: TABLE I, Fig. 6, §4.1.1.
-  (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §3.3.4, §6.2.2.
-  (2018) Glow: generative flow with invertible 1x1 convolutions. In NIPS, pp. 10215–10224. Cited by: TABLE I, §3.2.
-  (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §2.3, §3.3.4.
-  (2016) Variational graph auto-encoders. arXiv preprint arXiv:1611.07308. Cited by: TABLE I, §3.3.4.
-  (2019) A mutual information maximization perspective of language representation learning. arXiv preprint arXiv:1910.08350. Cited by: TABLE I, §4.1.2.
-  (2012) Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105. Cited by: §2.2, §4.2.1.
-  (2019) Albert: a lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §1, TABLE I, §4.1.1.
-  (2016) Learning representations for automatic colorization. In ECCV, pp. 577–593. Cited by: TABLE I, §5.3, §5.3.
-  (2017) Colorization as a proxy task for visual understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6874–6883. Cited by: §5.3, §5.3.
-  (1989-12) Backpropagation applied to handwritten zip code recognition. Neural Comput. 1 (4), pp. 541–551. Cited by: §2.2.
-  (2015) Deep learning. nature 521 (7553), pp. 436–444. Cited by: §1.
-  (1998) Efficient backprop. In Neural Networks: Tricks of the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop, Berlin, Heidelberg, pp. 9–50. Cited by: §2.2.
-  (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Cited by: TABLE I, §5.2, §5.3, §5.3.
-  (2016) Unsupervised visual representation learning by graph-based consistent constraints. In ECCV, pp. 678–694. Cited by: §4.2.1.
-  (2018) Adaptive graph convolutional neural networks. ArXiv abs/1801.03226. Cited by: §2.3.
-  (2012) Sentiment analysis and opinion mining. Synthesis lectures on human language technologies 5 (1), pp. 1–167. Cited by: §1.
-  (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1, §4.1.1.
-  (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §1.
-  (2015) Adversarial autoencoders. arXiv preprint arXiv:1511.05644. Cited by: TABLE I, §5.2.
-  (2015) Masked autoencoder for distribution estimation. Cited by: §3.4.1.
-  (2013) Efficient estimation of word representations in vector space. CoRR abs/1301.3781. Cited by: §2.1, TABLE I.
-  (2013) Distributed representations of words and phrases and their compositionality. In NIPS’13, pp. 3111–3119. Cited by: §2.1, TABLE I, §3.3.2.
-  (2019) Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991. Cited by: TABLE I, Fig. 6, §4.1.1, §4.2.2.
-  (2010) Rectified linear units improve restricted boltzmann machines. In ICML, Cited by: §2.2.
-  (2011) Sparse autoencoder. CS294A Lecture notes 72 (2011), pp. 1–19. Cited by: §3.3.1.
-  (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pp. 69–84. Cited by: Fig. 6, §4.1.1.
-  (2018) Boosting self-supervised learning via knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9359–9367. Cited by: §4.2.1.
-  (2016) F-gan: training generative neural samplers using variational divergence minimization. In NIPS, pp. 271–279. Cited by: §6.1.1.
-  (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: TABLE I, §4.1.2, §4, §6.2.2.
-  (2016) Asymmetric transitivity preserving graph embedding. In KDD ’16, Cited by: §2.3.
-  (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2536–2544. Cited by: TABLE I, §5.2, §5.3, §5.3.
-  (2020) Self-supervised graph representation learning via global context prediction. arXiv preprint arXiv:2003.01604. Cited by: §4.1.2.
-  (2020) Self-supervised graph representation learning via global context prediction. ArXiv abs/2003.01604. Cited by: §2.3, TABLE I.
-  (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD, pp. 701–710. Cited by: §2.3, TABLE I, §3.3.2.
-  (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §2.1.
-  (2019) MolecularRNN: generating realistic molecular graphs with optimized properties. arXiv preprint arXiv:1905.13372. Cited by: §3.1.
-  (2018) Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 459–467. Cited by: §2.3.