1 Introduction
Deep neural networks
[78] have shown outstanding performance on various machine learning tasks, especially supervised learning in computer vision (image classification [32, 54, 59], semantic segmentation [85, 45]), natural language processing (pre-trained language models [33, 74, 84, 149, 83], question answering [111, 150, 35, 5], etc.), and graph learning (node classification [70, 106, 138, 58], graph classification [155, 7, 123], etc.). Generally, supervised learning trains a model on a specific task with a large manually labeled dataset, which is randomly divided into training, validation, and test sets. However, supervised learning is meeting its bottleneck: it not only relies heavily on expensive manual labeling but also suffers from generalization error, spurious correlations, and adversarial attacks. We expect neural networks to learn more with fewer labels, fewer samples, and fewer trials. As a promising candidate, self-supervised learning has drawn massive attention for its data efficiency and generalization ability, and many state-of-the-art models follow this paradigm. In this survey, we take a comprehensive look at recent self-supervised learning models and discuss their theoretical soundness, including frameworks such as Pre-trained Language Models (PTM), Generative Adversarial Networks (GAN), autoencoders and their extensions, Deep InfoMax, and contrastive coding.
The term "self-supervised learning" was first introduced in robotics, where training data is automatically labeled by finding and exploiting the relations between different input sensor signals. It was then borrowed by the field of machine learning. In a keynote speech at AAAI 2020, Yann LeCun described self-supervised learning as "the machine predicts any parts of its input for any observed part." Following LeCun, we can summarize it into two classical definitions:

Obtain "labels" from the data itself by using a "semi-automatic" process.

Predict part of the data from other parts.
Specifically, the "other part" here could be incomplete, transformed, distorted, or corrupted. In other words, the machine learns to "recover" the whole, parts, or merely some features of its original input.
People are often confused by unsupervised learning and self-supervised learning. Self-supervised learning can be viewed as a branch of unsupervised learning, since no manual label is involved. However, narrowly speaking, unsupervised learning concentrates on detecting specific data patterns, such as clustering, community discovery, or anomaly detection, while self-supervised learning aims at recovering and is thus still in the paradigm of supervised settings. Figure 1 provides a vivid explanation of the differences between them. There exist several comprehensive reviews related to pre-trained language models [107], generative adversarial networks [142], autoencoders, and contrastive learning for visual representation [64]. However, none of them concentrates on the inspiring idea of self-supervised learning itself, which has motivated researchers and models in many fields. In this work, we collect studies from natural language processing, computer vision, and graph learning in recent years to present an up-to-date and comprehensive retrospective on the frontier of self-supervised learning. To sum up, our contributions are:

We provide a detailed and up-to-date review of self-supervised learning for representation. We introduce the background knowledge, models with their variants, and important frameworks, so that one can easily grasp the frontier ideas of self-supervised learning.

We categorize self-supervised learning models into generative, contrastive, and generative-contrastive (adversarial), with particular genres within each one. We demonstrate the pros and cons of each category and discuss the recent attempt to shift from generative to contrastive.

We examine the theoretical soundness of self-supervised learning methods and show how it can benefit downstream supervised learning tasks.

We identify several open problems in this field, analyze the limitations and boundaries, and discuss the future direction for selfsupervised representation learning.
We organize the survey as follows. In Section 2, we introduce the preliminary knowledge in computer vision, natural language processing, and graph learning. From Section 3 to Section 5, we introduce the empirical self-supervised learning methods utilizing generative, contrastive, and generative-contrastive objectives, respectively. In Section 6, we investigate the theoretical basis behind the success of self-supervised learning, as well as its merits and drawbacks. In Section 7, we discuss the open problems and future directions in this field.
2 Background
2.1 Representation Learning in NLP
Pre-trained word representations are key components in natural language processing tasks. Word embedding represents words as low-dimensional real-valued vectors. There are two kinds of word embeddings: non-contextual and contextual embeddings.

Non-contextual embeddings do not consider the context information of the token; that is, these models only map the token into a distributed embedding space. Thus, for each word x in the vocabulary V, a fixed embedding e_x ∈ ℝ^D is assigned, where D is the dimension of the embedding. These embeddings cannot model complex characteristics of word use or polysemy.

To model both complex characteristics of word use and polysemy, contextual embeddings are proposed. For a text sequence x_1, x_2, ..., x_T, the contextual embedding of x_t depends on the whole sequence:

[h_1, ..., h_T] = f(x_1, ..., x_T),

where f is the embedding function. Since the embedding h_t of a certain token x_t can differ across contexts, h_t is called a contextual embedding. This kind of embedding can distinguish the semantics of words in different contexts.
Distributed word representation represents each word as a dense, real-valued, low-dimensional vector. First-generation word embedding was introduced with the neural network language model (NNLM) [10]. For NNLM, most of the complexity is caused by the non-linear hidden layer in the model. Mikolov et al. proposed the Word2Vec model [88, 89] to learn word representations efficiently. There are two variants: the Continuous Bag-of-Words model (CBOW) and the Continuous Skip-Gram model (SG). Word2Vec is a context prediction model and one of the most popular implementations for generating non-contextual word embeddings in NLP. In first-generation word embeddings, the same word always has the same embedding. Since a word can have multiple senses, second-generation word embedding methods were proposed, in which each word token has its own embedding. These embeddings are called contextualized word embeddings, since the embedding of a word token depends on its context. ELMo (Embeddings from Language Models) [102] is one implementation that generates such contextual word embeddings. ELMo is an RNN-based bidirectional language model; it learns multiple embeddings for each word token and combines them depending on the downstream task. ELMo is a feature-based approach: the model is used as a feature extractor whose word embeddings are fed to the downstream task model. The parameters of the extractor are fixed; only the parameters of the backend model are trained.
Recently, BERT (Bidirectional Encoder Representations from Transformers) [33] brought large improvements on 11 NLP tasks. Different from feature-based approaches like ELMo, BERT is a fine-tuning approach: the model is first pre-trained on a large corpus through self-supervised learning and then fine-tuned with labeled data. As its name indicates, BERT is a Transformer encoder. In the training stage, BERT masks some tokens in the sentence and is trained to predict the masked words. To use BERT, one initializes the model with the pre-trained weights and fine-tunes it to solve downstream tasks.
2.2 Representation Learning in CV
Computer vision is one of the fields that has benefited most from deep learning. In the past few years, researchers have developed a range of efficient network architectures for supervised tasks, and many of them have also proved useful for self-supervised tasks. In this section, we introduce the ResNet architecture [54], which is the backbone of a large part of the self-supervised visual representation models.
Since AlexNet [73], CNN architectures have grown deeper and deeper. While AlexNet had only five convolutional layers, the VGG network [120] and GoogleNet (also codenamed Inception_v1) [127] had 19 and 22 layers, respectively.
Evidence [127, 120] reveals that network depth is of crucial importance. Driven by the significance of depth, a question arises: is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [12], which hampers convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [79, 116] and intermediate normalization layers [61], which enable networks with tens of layers to start converging under stochastic gradient descent (SGD) with back-propagation [77]. When deeper networks can start converging, a degradation problem is exposed: with the network depth increasing, accuracy becomes saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error.
The residual neural network (ResNet), proposed by He et al. [54], effectively resolved this problem. Instead of asking every few stacked layers to directly learn a desired underlying mapping, the authors of [54] let them learn a residual mapping. The core idea of ResNet is the introduction of "shortcut connections" (Fig. 2), which skip over one or more layers.
A building block is defined as:
y = F(x, {W_i}) + x.   (1)

Here x and y are the input and output vectors of the layers considered. The function F(x, {W_i}) represents the residual mapping to be learned. For the example in Fig. 2 that has two layers, F = W_2 σ(W_1 x), in which σ denotes ReLU [91] and the biases are omitted to simplify notation. The operation F + x is performed by a shortcut connection and element-wise addition. Because of its compelling results, ResNet quickly became one of the most popular architectures in various computer vision tasks. Since then, the ResNet architecture has drawn extensive attention from researchers, and multiple variants based on ResNet have been proposed, including ResNeXt [145], densely connected CNNs [59], and wide residual networks [153].
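To make Eq. (1) concrete, here is a minimal NumPy sketch of a two-layer residual block with ReLU and omitted biases; the weights and input are toy values, not a trained network:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """One two-layer residual block: y = relu(F(x) + x),
    where F(x) = W2 @ relu(W1 @ x) and biases are omitted."""
    fx = W2 @ relu(W1 @ x)       # residual mapping F(x, {W1, W2})
    return relu(fx + x)          # shortcut connection + element-wise addition

# With zero weights the block reduces to the identity (after ReLU),
# illustrating why gradients can always flow through the shortcut.
x = np.array([1.0, 2.0, 3.0])
W1 = np.zeros((3, 3))
W2 = np.zeros((3, 3))
y = residual_block(x, W1, W2)    # equals relu(x) = x here
```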
2.3 Representation Learning on Graphs
As a ubiquitous data structure, graphs are extensively employed in a multitude of fields and have become the backbone of many systems. The central problem in machine learning on graphs is finding a way to represent graph structure so that it can be easily utilized by machine learning models [51]. To tackle this problem, researchers have proposed a series of methods for graph representation learning at the node level and the graph level, which has become a research spotlight recently.
We first define several basic terminologies. Generally, a graph is defined as G = (V, E), where V denotes a set of vertices, |V| denotes the number of vertices in the graph, and E ⊆ V × V denotes a set of edges connecting the vertices. X ∈ ℝ^{|V|×d} is the optional original vertex feature matrix. When input features are unavailable, X is set to an orthogonal matrix or initialized from a normal distribution, etc., in order to make the input node features less correlated. The problem of node representation learning is to learn latent node representations H ∈ ℝ^{|V|×d'}; it is also termed network representation learning, network embedding, etc. There are also graph-level representation learning problems, which aim to learn an embedding for the whole graph. In general, existing network embedding approaches are broadly categorized into (1) factorization-based approaches such as NetMF [105, 104], GraRep [18], and HOPE [97]; (2) shallow embedding approaches such as DeepWalk [101], LINE [128], and HARP [21]; and (3) neural network approaches [82, 19]. Recently, the graph convolutional network (GCN) [70] and its multiple variants have become the dominant approaches in graph modeling, thanks to graph convolution, which effectively fuses graph topology and node features.
However, the majority of advanced graph representation learning methods require external guidance such as annotated labels. Many researchers endeavor to propose unsupervised algorithms [52, 48, 139] that do not rely on any external labels. Self-supervised learning also opens up an opportunity for effective utilization of the abundant unlabeled data [125, 100].
Model | FOS | Type | Generator | Self-supervision | Pretext Task | Hard NS | NS strategy
GPT/GPT-2 [109, 110] | NLP | G | AR | Following words | Next word prediction | — | —
PixelCNN [134, 136] | CV | G | AR | Following pixels | Next pixel prediction | — | —
NICE [36] | CV | G | Flow | Whole image | Image reconstruction | — | —
RealNVP [37] | CV | G | Flow | Whole image | Image reconstruction | — | —
Glow [69] | CV | G | Flow | Whole image | Image reconstruction | — | —
word2vec [89, 88] | NLP | G | AE | Context words | CBOW & Skip-Gram | — | End-to-end
FastText [15] | NLP | G | AE | Context words | CBOW | — | End-to-end
— | Graph | G | AE | Graph edges | Link prediction | — | End-to-end
VGAE [71] | Graph | G | AE | Graph edges | Link prediction | — | End-to-end
BERT [33] | NLP | G | AE | Masked words, next sentence | Masked language model & next sentence prediction | — | —
SpanBERT [65] | NLP | G | AE | Masked words | Masked language model | — | —
ALBERT [74] | NLP | G | AE | Masked words, sentence order | Masked language model & sentence order prediction | — | —
ERNIE [126] | NLP | G | AE | Masked entities/phrases | Masked language model | — | —
VQ-VAE 2 [112] | CV | G | AE | Whole image | Image reconstruction | — | —
XLNet [149] | NLP | G | AE+AR | Masked words | Permutation language model | — | —
RelativePosition [38] | CV | C | — | — | Relative position prediction | — | —
CDJP [67] | CV | C | — | — | — | — | End-to-end
PIRL [90] | CV | C | — | — | Jigsaw | ✓ | Memory bank
RotNet [44] | CV | C | — | — | Rotation prediction | — | —
Deep InfoMax [55] | CV | C | — | — | MI maximization | — | End-to-end
AMDIM [6] | CV | C | — | — | — | ✓ | End-to-end
CPC [96] | CV | C | — | — | — | — | End-to-end
InfoWord [72] | NLP | C | — | — | — | — | End-to-end
DGI [139] | Graph | C | — | — | — | ✓ | End-to-end
InfoGraph [123] | Graph | C | — | — | — | — | —
SGRL [100] | Graph | C | — | — | — | — | End-to-end
Pretrained GNN [57] | Graph | C | — | — | — | — | End-to-end
DeepCluster [20] | CV | C | — | — | Cluster discrimination | — | —
Local Aggregation [160] | CV | C | — | — | — | — | —
ClusterFit [147] | CV | C | — | — | — | — | —
InstDisc [144] | CV | C | — | — | Instance discrimination | — | Memory bank
CMC [130] | CV | C | — | — | — | ✓ | End-to-end
MoCo [53] | CV | C | — | — | — | — | Momentum
MoCo v2 [25] | CV | C | — | — | — | ✓ | Momentum
SimCLR [22] | CV | C | — | — | — | ✓ | End-to-end
GCC [63] | Graph | C | — | — | — | ✓ | Momentum
GAN [46] | CV | G-C | AE | Whole image | Image reconstruction | — | —
Adversarial AE [86] | CV | G-C | AE | Whole image | Image reconstruction | — | —
BiGAN/ALI [39, 42] | CV | G-C | AE | Whole image | Image reconstruction | — | —
BigBiGAN [40] | CV | G-C | AE | Whole image | Image reconstruction | — | —
Colorization [75] | CV | G-C | AE | Image color | Colorization | — | —
Inpainting [98] | CV | G-C | AE | Parts of images | Inpainting | — | —
[80] | CV | G-C | AE | Details of images | Super-resolution | — | —
ELECTRA [27] | NLP | G-C | AE | Masked words | Replaced token detection | ✓ | End-to-end
WKLM [146] | NLP | G-C | AE | Masked entities | Replaced entity detection | ✓ | End-to-end
ANE [29] | Graph | G-C | AE | Graph edges | Link prediction | — | —
GraphGAN [140] | Graph | G-C | AE | Graph edges | Link prediction | — | —
GraphSGAN [34] | Graph | G-C | AE | Graph nodes | Node classification | — | —
3 Generative Self-supervised Learning
3.1 Autoregressive (AR) Model
Autoregressive (AR) models can be viewed as "Bayes net structures" (directed graphical models). The joint distribution can be factorized as a product of conditionals:

p(x) = ∏_{t=1}^{T} p(x_t | x_{1:t−1}),   (2)

where the probability of each variable is dependent on the previous variables.
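As a toy illustration of the chain-rule factorization in Eq. (2), the following sketch builds a joint distribution over binary sequences of length 3 from hand-picked (purely illustrative) conditionals and checks that it normalizes:

```python
# Toy autoregressive factorization over binary sequences of length 3:
# p(x1, x2, x3) = p(x1) * p(x2 | x1) * p(x3 | x1, x2).
# The conditional tables below are made-up numbers for illustration only.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given = {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}}
p_x3_given = {(0, 0): {0: 0.5, 1: 0.5}, (0, 1): {0: 0.9, 1: 0.1},
              (1, 0): {0: 0.3, 1: 0.7}, (1, 1): {0: 0.6, 1: 0.4}}

def joint_prob(x):
    """Chain-rule product of conditionals, each depending on the previous variables."""
    x1, x2, x3 = x
    return p_x1[x1] * p_x2_given[(x1,)][x2] * p_x3_given[(x1, x2)][x3]

# The factorization defines a valid distribution: the probabilities sum to 1.
total = sum(joint_prob((a, b, c)) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```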
In NLP, the objective of autoregressive language modeling is usually maximizing the likelihood under the forward autoregressive factorization [149]. GPT [109] and GPT-2 [110] use the Transformer decoder architecture [137] as the language model. Different from GPT, GPT-2 removes the fine-tuning processes for different tasks. In order to learn unified representations that generalize across tasks, GPT-2 models p(output | input, task), which means that given different tasks, the same inputs can have different outputs.
The autoregressive models have also been employed in computer vision, such as PixelRNN [136] and PixelCNN [134]. The general idea is to use autoregressive methods to model images pixel by pixel. For example, the lower (right) pixels are generated by conditioning on the upper (left) pixels. The pixel distributions of PixelRNN and PixelCNN are modeled by RNN and CNN, respectively. For 2D images, autoregressive models can only factorize probabilities according to specific directions (such as right and down). Therefore, masked filters are employed in CNN architecture.
Furthermore, two convolutional networks are combined to remove the blind spot in images. Based on PixelCNN, WaveNet [133], a generative model for raw audio, was proposed. In order to deal with long-range temporal dependencies, dilated causal convolutions are developed to improve the receptive field. Moreover, gated residual blocks and skip connections are employed to empower better expressivity.
Autoregressive models can also be applied to the graph domain, e.g., for the graph generation problem. You et al. [152] propose GraphRNN to generate realistic graphs with deep autoregressive models. They decompose the graph generation process into a sequence generation of nodes and edges, conditioned on the graph generated so far. The objective of GraphRNN is defined as the likelihood of the observed graph generation sequences. GraphRNN can be viewed as a hierarchical model, where a graph-level RNN maintains the state of the graph and generates new nodes, while an edge-level RNN generates new edges based on the current graph state. After that, MRNN [103] and GCPN [151] were proposed as autoregressive approaches. MRNN and GCPN both use a reinforcement learning framework to generate molecular graphs by optimizing domain-specific rewards. However, MRNN mainly uses RNN-based networks for state representations, while GCPN employs a GCN-based encoder network.
The advantage of autoregressive models is that they can model the context dependency well. However, one shortcoming of the AR model is that the token at each position can only access its context from one direction.
3.2 Flow-based Model
The goal of flow-based models is to estimate complex high-dimensional densities p(x) from data. However, directly formalizing the density is difficult. Generally, a flow-based model first defines a latent variable z which follows a known distribution p_Z(z). It then defines x = f_θ(z), where f_θ is an invertible and differentiable function. The goal is to learn the transformation between x and z so that the density of x can be depicted. According to the change-of-variables rule, ∫ p_X(x) dx = ∫ p_Z(z) dz. Therefore, the densities of x and z satisfy:

p_X(x) = p_Z(f_θ^{−1}(x)) · |det( ∂f_θ^{−1}(x) / ∂x )|,   (3)
and the objective is maximum likelihood:

max_θ Σ_i log p_X(x^{(i)}) = max_θ Σ_i [ log p_Z(f_θ^{−1}(x^{(i)})) + log |det( ∂f_θ^{−1}(x^{(i)}) / ∂x )| ].   (4)
The advantage of flow-based models is that the mapping between x and z is invertible. However, this also requires that x and z have the same dimension. f_θ needs to be carefully designed, since it should be invertible and the Jacobian determinant in Eq. (3) should be easy to compute. NICE [36] and RealNVP [37] design an affine coupling layer to parameterize f_θ. The core idea is to split x into two blocks (x_1, x_2) and apply a transformation from (x_1, x_2) to (y_1, y_2) in an autoregressive manner, that is, y_1 = x_1 and y_2 = x_2 ⊙ exp(s(x_1)) + t(x_1). More recently, Glow [69] was proposed; it introduces invertible 1×1 convolutions and simplifies RealNVP.
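The affine coupling idea can be sketched in a few lines of NumPy; the scale function s and translation function t below are hypothetical stand-ins for the small neural networks used in practice:

```python
import numpy as np

def coupling_forward(x, s_fn, t_fn, d):
    """RealNVP-style affine coupling: y1 = x1, y2 = x2 * exp(s(x1)) + t(x1)."""
    x1, x2 = x[:d], x[d:]
    s, t = s_fn(x1), t_fn(x1)
    y = np.concatenate([x1, x2 * np.exp(s) + t])
    log_det = np.sum(s)          # Jacobian is triangular, so log|det| = sum(s)
    return y, log_det

def coupling_inverse(y, s_fn, t_fn, d):
    """Exact inverse: x1 = y1, x2 = (y2 - t(y1)) * exp(-s(y1))."""
    y1, y2 = y[:d], y[d:]
    s, t = s_fn(y1), t_fn(y1)
    return np.concatenate([y1, (y2 - t) * np.exp(-s)])

# Hypothetical scale/translation "networks" (fixed functions for illustration).
s_fn = lambda x1: 0.5 * x1
t_fn = lambda x1: x1 + 1.0

x = np.array([0.3, -0.7, 1.2, 0.5])
y, log_det = coupling_forward(x, s_fn, t_fn, d=2)
x_rec = coupling_inverse(y, s_fn, t_fn, d=2)   # recovers x exactly
```

Invertibility holds by construction, which is exactly what makes the likelihood in Eq. (4) tractable.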
3.3 Autoencoding (AE) Model
The goal of the autoencoding model is to reconstruct (part of) inputs from (corrupted) inputs.
3.3.1 Basic AE Model
The autoencoder (AE) was first introduced in [8] for pre-training artificial neural networks. Before the autoencoder, the Restricted Boltzmann Machine (RBM) [122] could also be viewed as a special "autoencoder". The RBM is an undirected graphical model containing only two layers: a visible layer and a hidden layer. Its objective is to minimize the difference between the model's marginal distribution and the data distribution. In contrast, the autoencoder can be regarded as a directed graphical model, and it can be trained more easily. The autoencoder is typically used for dimensionality reduction. Generally, it is a feed-forward neural network trained to reproduce its input at the output layer. The AE comprises an encoder network h = f_enc(x) and a decoder network x' = f_dec(h). Its objective is to make x and x' as similar as possible (e.g., through a mean-squared error loss). It can be shown that a linear autoencoder corresponds to PCA. Sometimes the number of hidden units is greater than the number of input units, and some interesting structures can be discovered by imposing sparsity constraints on the hidden units [92].

3.3.2 Context Prediction Model (CPM)
The idea of the Context Prediction Model (CPM) is predicting contextual information based on inputs.
In NLP, when it comes to self-supervised learning on word embeddings, CBOW and Skip-Gram [89] are pioneering works. CBOW aims to predict the input token based on its context tokens, while Skip-Gram aims to predict the context tokens based on the input token. Usually, negative sampling is employed to ensure computational efficiency and scalability. Following the CBOW architecture, FastText [15] is proposed, utilizing subword information.
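As a rough sketch of the Skip-Gram objective with negative sampling (toy vocabulary and randomly initialized embeddings; the variable names are ours, not from any particular implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center, context, negatives, W_in, W_out):
    """Skip-Gram with negative sampling: push the true context word's
    score up and the sampled negative words' scores down."""
    v = W_in[center]                                   # center word vector
    pos = -np.log(sigmoid(W_out[context] @ v))         # observed pair
    neg = -np.sum(np.log(sigmoid(-W_out[negatives] @ v)))  # sampled negatives
    return pos + neg

vocab_size, dim = 10, 4
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # input embeddings
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # output embeddings

loss = sgns_loss(center=2, context=5, negatives=np.array([1, 7, 9]),
                 W_in=W_in, W_out=W_out)
```

Minimizing this loss over a corpus (by gradient descent on W_in and W_out) yields the non-contextual word vectors described above.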
Inspired by the progress of word embedding models in NLP, many network embedding models have been proposed with a similar context prediction objective. DeepWalk [101] samples truncated random walks to learn latent node embeddings based on the Skip-Gram model, treating random walks as the equivalent of sentences. LINE [128], in contrast, aims to generate neighbors based on current nodes:
O = − Σ_{(i,j)∈E} w_{ij} log p(v_j | v_i),   (5)

where E denotes the edge set, v_i denotes a node, and w_{ij} represents the weight of edge (i, j). LINE also uses negative sampling to sample multiple negative edges to approximate the objective.
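The truncated random-walk sampling used by DeepWalk can be sketched as follows; the sampled walks would then be fed to a Skip-Gram model as if they were sentences (toy adjacency list, illustrative only):

```python
import random

def random_walk(adj, start, length, rng):
    """Sample one truncated random walk; DeepWalk later treats such walks
    as 'sentences' for Skip-Gram training."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = adj[walk[-1]]
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk

# Toy graph as an adjacency list.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rng = random.Random(42)
walks = [random_walk(adj, node, length=5, rng=rng)
         for node in adj for _ in range(3)]   # several walks per start node
```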
3.3.3 Denoising AE Model
The idea of the denoising autoencoder model is that the representation should be robust to the introduction of noise. The masked language model (MLM) can be regarded as a denoising AE model because its input masks the tokens to be predicted. To model text sequences, the masked language model randomly masks some of the tokens from the input and then predicts them based on their context information [33], which is similar to the Cloze task [129]. Specifically, in BERT [33], a special token [MASK] is introduced in the training process to mask some tokens. However, one shortcoming of this method is that there are no [MASK] tokens in the input for downstream tasks. To mitigate this, the authors do not always replace the predicted tokens with [MASK] in training; instead, with a small probability they keep the original words or substitute random words.

There emerge some extensions of MLM. SpanBERT [65] chooses to mask continuous random spans rather than the random tokens adopted by BERT. Moreover, it trains the span boundary representations to predict the masked spans, inspired by ideas in coreference resolution. ERNIE (Baidu) [126] masks entities or phrases to learn entity-level and phrase-level knowledge, which obtains good results in Chinese natural language processing tasks.
Compared with the AR model, in MLM the predicted tokens have access to contextual information from both sides. However, MLM assumes that the predicted tokens are independent of each other given the unmasked tokens.
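The BERT-style corruption procedure described above can be sketched as follows (the ~15% selection rate and the 80%/10%/10% replacement proportions follow the paper; the toy vocabulary is ours):

```python
import random

MASK = "[MASK]"
VOCAB = ["cat", "dog", "sat", "on", "mat", "the"]

def mask_tokens(tokens, rng, mask_prob=0.15):
    """BERT-style corruption: pick ~15% of positions as prediction targets;
    of those, 80% become [MASK], 10% a random word, 10% stay unchanged."""
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok                 # the model must predict this
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)
            # else: keep the original token
    return corrupted, targets

rng = random.Random(0)
corrupted, targets = mask_tokens(["the", "cat", "sat", "on", "the", "mat"], rng)
```

Keeping or randomly replacing some selected tokens is exactly the mitigation for the train/inference mismatch of [MASK] discussed above.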
3.3.4 Variational AE Model
The variational autoencoding model assumes that data are generated from underlying latent (unobserved) representations. The posterior distribution p(z|x) over a set of unobserved variables z given some data x is approximated by a variational distribution q(z|x):

q(z|x) ≈ p(z|x).   (6)
In variational inference, the evidence lower bound (ELBO) on the log-likelihood of the data is maximized during training:

log p(x) ≥ −D_KL( q(z|x) ‖ p(z) ) + E_{q(z|x)}[ log p(x|z) ],   (7)

where p(x) is the evidence probability, p(z) is the prior, and p(x|z) is the likelihood. The right-hand side of the above equation is called the ELBO. From the autoencoding perspective, the first term of the ELBO is a regularizer forcing the posterior to approximate the prior, and the second term is the likelihood of reconstructing the original input data based on the latent variables.
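When both the prior and the approximate posterior are Gaussian, as in the VAE discussed next, both ELBO terms of Eq. (7) have simple closed forms; a minimal sketch, assuming a unit-variance Gaussian likelihood with constants dropped:

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ): the regularizer term of the ELBO."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def elbo(x, x_recon, mu, logvar):
    """ELBO = reconstruction log-likelihood - KL, with a unit-variance
    Gaussian likelihood (additive constants dropped)."""
    recon = -0.5 * np.sum((x - x_recon) ** 2)
    return recon - gaussian_kl(mu, logvar)

# When the posterior equals the prior and reconstruction is perfect,
# both terms vanish and the ELBO is 0.
x = np.array([0.1, -0.2])
val = elbo(x, x, mu=np.zeros(2), logvar=np.zeros(2))
```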
Variational Autoencoders (VAE) [68] are one important example where variational inference is utilized. VAE assumes that both the prior p(z) and the approximate posterior q(z|x) follow Gaussian distributions; specifically, let p(z) = N(0, I). Furthermore, the reparameterization trick is utilized for modeling the approximate posterior q(z|x): assume z = μ + σ ⊙ ε, where ε ∼ N(0, I). Both μ and σ are parameterized by neural networks. Based on the computed latent variable z, the decoder network is utilized to reconstruct the input data.

Recently, a novel and powerful variational AE model called VQ-VAE [135] was proposed. VQ-VAE aims to learn discrete latent variables, motivated by the fact that many modalities are inherently discrete, such as language, speech, and images. VQ-VAE relies on vector quantization (VQ) to learn the posterior distribution of discrete latent variables. Specifically, the discrete latent variables are calculated by nearest-neighbor lookup in a shared, learnable embedding table. In training, the gradients are approximated through the straight-through estimator [11], and the loss is

L = log p(x | z_q(x)) + ‖ sg[z_e(x)] − e ‖² + β ‖ z_e(x) − sg[e] ‖²,   (8)

where e refers to the codebook, the operator sg refers to a stop-gradient operation that blocks gradients from flowing into its argument, and β is a hyperparameter which controls the reluctance to change the code corresponding to the encoder output.
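The vector-quantization step and the two auxiliary terms of Eq. (8) can be sketched as follows (toy codebook and encoder output; the stop-gradient only matters under automatic differentiation, so it is reduced to a comment here):

```python
import numpy as np

def quantize(z_e, codebook):
    """Nearest-neighbor lookup into a shared codebook, as in VQ-VAE."""
    dists = np.sum((codebook - z_e) ** 2, axis=1)
    k = int(np.argmin(dists))
    return k, codebook[k]

def vq_losses(z_e, e, beta=0.25):
    """Codebook loss ||sg[z_e] - e||^2 and commitment loss beta*||z_e - sg[e]||^2.
    Numerically both use the same squared distance; the stop-gradient sg[.]
    only changes which tensor receives gradients under autodiff."""
    sq = np.sum((z_e - e) ** 2)
    return sq, beta * sq

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z_e = np.array([0.9, 1.2])                    # hypothetical encoder output
k, z_q = quantize(z_e, codebook)              # picks the closest code
codebook_loss, commitment_loss = vq_losses(z_e, z_q)
```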
More recently, researchers proposed VQ-VAE-2 [112], which can generate versatile high-fidelity images that rival those of the state-of-the-art generative adversarial network BigGAN [16] on ImageNet [32]. First, the authors enlarge the scale and enhance the autoregressive priors with a powerful PixelCNN [134] prior. Additionally, they adopt a multi-scale hierarchical organization of VQ-VAE, which enables learning local and global information of images separately. Nowadays, VAE and its variants have been widely used in computer vision, including image representation learning, image generation, and video generation.

Variational autoencoding models have also been employed in node representation learning on graphs. For example, the variational graph autoencoder (VGAE) [71] uses the same variational inference technique as VAE, with a graph convolutional network (GCN) [70] as the encoder. Due to the uniqueness of graph-structured data, the objective of VGAE is to reconstruct the adjacency matrix of the graph by measuring node proximity. Zhu et al. [158] propose DVNE, a deep variational network embedding model in Wasserstein space. It learns Gaussian node embeddings to model the uncertainty of nodes, and the 2-Wasserstein distance is used to measure the similarity between distributions for its effectiveness in preserving network transitivity. vGraph [124] can perform node representation learning and community detection collaboratively through a generative variational inference framework. It assumes that each node can be generated from a mixture of communities, and each community is defined as a multinomial distribution over nodes.
3.4 Hybrid Generative Models
3.4.1 Combining AR and AE Model.
Some works propose models that combine the advantages of both AR and AE. MADE [87] makes a simple modification to the autoencoder: it masks the autoencoder's parameters to respect the autoregressive constraint. Specifically, in the original autoencoder, neurons between two adjacent layers are fully connected through MLPs; in MADE, however, some connections between adjacent layers are masked to ensure that each input dimension is reconstructed solely from the dimensions preceding it. MADE can be easily parallelized on conditional computations, and it can obtain direct and cheap estimates of high-dimensional joint probabilities by combining the AE and AR models.
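A sketch of how such autoregressive connectivity masks can be built from per-unit "degrees", following the MADE recipe (the dimensions and degree assignments here are toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 8                                  # input dimension, hidden units

m_in = np.arange(1, D + 1)                   # input degrees 1..D
m_hid = rng.integers(1, D, size=H)           # hidden degrees in [1, D-1]

# A hidden unit with degree k may see inputs with degree <= k;
# output d may only see hidden units with degree < d.
mask_in = (m_hid[:, None] >= m_in[None, :]).astype(float)    # shape (H, D)
mask_out = (m_in[:, None] > m_hid[None, :]).astype(float)    # shape (D, H)

# The composed connectivity is strictly lower-triangular: output d is
# reconstructed solely from inputs 1..d-1, respecting the AR constraint.
connectivity = (mask_out @ mask_in) > 0                      # shape (D, D)
assert not connectivity[np.triu_indices(D)].any()
```

Element-wise multiplying the weight matrices by these masks turns an ordinary autoencoder into an autoregressive one without changing the architecture.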
In NLP, the Permutation Language Model (PLM) [149] is a representative model that combines the advantages of the autoregressive and autoencoding models. XLNet [149], which introduces PLM, is a generalized autoregressive pretraining method. XLNet enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order. To formalize the idea, let Z_T denote the set of all possible permutations of the length-T index sequence [1, 2, ..., T]; the objective of PLM can be expressed as:

max_θ E_{z∼Z_T} [ Σ_{t=1}^{T} log p_θ( x_{z_t} | x_{z_{<t}} ) ].   (9)

In practice, for each text sequence, different factorization orders are sampled, so each token can see its contextual information from both sides. Based on the permuted order, XLNet also conducts reparameterization with positions to let the model know which position needs to be predicted, and a special two-stream self-attention is then introduced for target-aware prediction.
3.4.2 Combining AE and Flow-based Models
In the graph domain, GraphAF [119] is a flow-based autoregressive model for molecular graph generation. It can generate molecules in an iterative process and also calculate the exact likelihood in parallel. GraphAF formalizes molecule generation as a sequential decision process and incorporates detailed domain knowledge into the reward design, such as valency checks. Inspired by the recent progress of flow-based models, it defines an invertible transformation from a base distribution (e.g., a multivariate Gaussian) to a molecular graph structure. Additionally, the dequantization technique [56] is utilized to convert discrete data (including node types and edge types) into continuous data.
4 Contrastive Self-supervised Learning
From the statistical perspective, machine learning models fall into two classes: the generative model and the discriminative model. Given the joint distribution p(X, Y) of the input X and target Y, the generative model calculates p(X | Y = y) by:

p(X | Y = y) = p(X, Y = y) / p(Y = y),   (10)

while the discriminative model tries to model p(Y | X = x) by:

p(Y | X = x) = p(X = x, Y) / p(X = x).   (11)

Notice that most representation learning tasks hope to model the relationships within the input X; thus, for a long time, people believed that the generative model was the only choice for representation learning.
However, recent breakthroughs in contrastive learning, such as Deep InfoMax, MoCo, and SimCLR, shed light on the potential of discriminative models for representation learning. Contrastive learning aims at "learning to compare" through a Noise Contrastive Estimation (NCE) [49] objective formatted as:

L = E[ −log ( e^{f(x)ᵀf(x⁺)} / ( e^{f(x)ᵀf(x⁺)} + e^{f(x)ᵀf(x⁻)} ) ) ],   (12)

where x⁺ is similar to x, x⁻ is dissimilar to x, and f is an encoder (representation function). The similarity measure and encoder may vary from task to task, but the framework remains the same. With more dissimilar pairs involved, we have InfoNCE [96], formulated as:

L = E[ −log ( e^{f(x)ᵀf(x⁺)} / ( e^{f(x)ᵀf(x⁺)} + Σ_{j=1}^{K} e^{f(x)ᵀf(x_j⁻)} ) ) ].   (13)
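A minimal NumPy sketch of the InfoNCE objective in Eq. (13), using dot-product similarity on L2-normalized vectors (the temperature parameter is a common practical addition, not part of Eq. (13)):

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=1.0):
    """InfoNCE: -log( exp(sim(a,p)/T) / (exp(sim(a,p)/T) + sum_j exp(sim(a,n_j)/T)) ),
    with dot-product similarity on L2-normalized vectors."""
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    a, p, n = norm(anchor), norm(positive), norm(negatives)
    pos = np.exp(a @ p / temperature)
    neg = np.exp(n @ a / temperature).sum()
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(0)
anchor = rng.normal(size=8)
positive = anchor + 0.05 * rng.normal(size=8)   # a similar (augmented) sample
negatives = rng.normal(size=(16, 8))            # dissimilar samples

loss = info_nce(anchor, positive, negatives)
# The loss can only decrease as the positive aligns perfectly with the anchor.
better = info_nce(anchor, anchor, negatives)
```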
Here we divide recent contrastive learning frameworks into two types: context-instance contrast and instance-instance contrast. Both of them achieve amazing performance in downstream tasks, especially on classification problems under the linear protocol.
4.1 Context-Instance Contrast
The context-instance contrast, or so-called global-local contrast, focuses on modeling the belonging relationship between the local feature of a sample and its global context representation. When we learn the representation for a local feature, we hope it is associated with the representation of the global content, such as stripes to tigers, sentences to their paragraphs, and nodes to their neighbors.
There are two main types of context-instance contrast: Predict Relative Position (PRP) and Maximize Mutual Information (MI). The differences between them are:

PRP focuses on learning relative positions between local components. The global context serves as an implicit requirement for predicting these relations (for example, understanding what an elephant looks like is critical for predicting the relative position between its head and tail).

MI focuses on learning the explicit belonging relationships between local parts and global context. The relative positions between local parts are ignored.
4.1.1 Predict Relative Position
Many data contain rich spatial or sequential relations between their parts. For example, in image data such as Fig. 6, the elephant's head is to the right of its tail. In text data, a sentence like "Nice to meet you." would probably be ahead of "Nice to meet you, too.". Various models regard recognizing the relative positions between parts of a sample as the pretext task [64]. It could be predicting the relative position of two patches from a sample [38], recovering the positions of shuffled segments of an image (solving a jigsaw) [93, 143, 67], or inferring the rotation angle of an image [44]. A similar jigsaw technique is applied in PIRL [90] to augment the positive sample, but PIRL does not regard solving the jigsaw and recovering the spatial relation as its objective.
In pretrained language models, similar ideas such as Next Sentence Prediction (NSP) are also adopted. The NSP loss was initially introduced by BERT [33]: given a sentence, the model is asked to distinguish its true next sentence from a randomly sampled one. However, some later work empirically shows that NSP helps little and may even harm performance, so the NSP loss is removed in RoBERTa [84].
To replace NSP, ALBERT [74] proposes the Sentence Order Prediction (SOP) task. The reason is that, in NSP, the negative next sentence is sampled from other passages that may have different topics from the current one, turning NSP into a far easier topic-modeling problem. In SOP, two adjacent sentences with their positions swapped are regarded as a negative sample, making the model concentrate on the coherence of the semantic meaning.
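The difference between NSP and SOP negatives can be illustrated with a minimal pair-construction sketch (function names are ours; real pipelines operate on tokenized corpora):

```python
import random

def make_nsp_pair(doc, all_docs, rng):
    """NSP (BERT-style): positive = adjacent sentences; negative = a sentence
    from a random document, which often differs in topic (the easy case)."""
    i = rng.randrange(len(doc) - 1)
    if rng.random() < 0.5:
        return (doc[i], doc[i + 1]), 1            # true next sentence
    other = rng.choice(all_docs)
    return (doc[i], rng.choice(other)), 0         # randomly sampled sentence

def make_sop_pair(doc, rng):
    """SOP (ALBERT-style): negative = the SAME two adjacent sentences with
    their order swapped, so the model must judge coherence, not topic."""
    i = rng.randrange(len(doc) - 1)
    if rng.random() < 0.5:
        return (doc[i], doc[i + 1]), 1            # correct order
    return (doc[i + 1], doc[i]), 0                # swapped order

rng = random.Random(0)
pair, y = make_sop_pair(["s1", "s2", "s3"], rng)
assert y in (0, 1)   # SOP negatives reuse the document's own sentences
```

Note that an SOP negative contains exactly the same sentences as a positive, removing the topic shortcut that makes NSP easy.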
4.1.2 Maximize Mutual Information
This kind of method derives from mutual information (MI), a basic concept in statistics that measures the association between two variables. Here, the association is the belonging relationship between a local part and its global context, and our objective is to maximize it. Generally, this kind of model optimizes
(14) $\max_{g \in \mathcal{G}}\ \hat{I}\big(g(x);\, c\big)$
where $g$ is the representation encoder, $\mathcal{G}$ is a class of encoders with some constraints, and $\hat{I}$ is a sample-based estimator of the real mutual information. In applications, MI is notorious for its hard computation. A common practice is to maximize a tractable lower bound on MI with an NCE objective.
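The sample-based estimation can be illustrated with a minimal numpy sketch of the InfoNCE bound, where the estimate log N − L_N can never exceed log N (the critic and data here are illustrative assumptions, not any specific model):

```python
import numpy as np

def infonce_lower_bound(x, c, critic):
    """Sample-based InfoNCE estimate: log N - L_N is a lower bound on I(x; c).

    x, c : (N, d) arrays of paired samples (x_i, c_i).
    critic: returns the (N, N) score matrix with positives on the diagonal.
    """
    n = len(x)
    scores = critic(x, c)
    scores = scores - scores.max(axis=1, keepdims=True)       # stabilize
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    l_n = -log_softmax[np.arange(n), np.arange(n)].mean()     # InfoNCE loss L_N
    return np.log(n) - l_n

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))
x /= np.linalg.norm(x, axis=1, keepdims=True)
c = x.copy()                                  # c fully determined by x
bound = infonce_lower_bound(x, c, lambda a, b: a @ b.T / 0.1)
# even with perfectly dependent variables, the estimate saturates at log N
assert bound <= np.log(256) + 1e-9
```

The saturation at log N is exactly why large numbers of negatives are needed to estimate large MI values with this bound.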
Deep InfoMax [55] is the first to explicitly model mutual information through a contrastive learning task, maximizing the MI between a local patch and its global context. In practice, taking image classification as an example, we can encode a dog image $x$ into $f(x)$ and take out a local feature vector $v$. To conduct contrast between instance and context, we need two other things:

a summary function $g$ to generate the context vector $s = g(f(x))$;

another (negative) cat image $x^{-}$ and its context vector $s^{-}$.
The contrastive objective is then formulated as
(15) $\mathcal{L} = -\log \frac{\exp(v^{\top} s)}{\exp(v^{\top} s) + \exp(v^{\top} s^{-})}$
Deep InfoMax provides us with a new paradigm and boosts the development of self-supervised learning. The first influential follower is Contrastive Predictive Coding (CPC) [96] for speech recognition. CPC maximizes the association between a segment of audio and its context audio. To improve data efficiency, it contrasts against several negative context vectors at the same time. Later on, CPC has also been applied to image classification.
AMDIM [6] enhances the positive association between a local feature and its context. It randomly samples two different views of an image (truncated, discolored, and so forth) to generate the local feature vector and context vector, respectively. CMC [130] extends this to several different views of one image and samples another irrelevant image as the negative. However, CMC is fundamentally different from Deep InfoMax and AMDIM because it proposes to measure context-context similarity rather than instance-context similarity. We will discuss it in the next subsection.
In language pretraining, InfoWord [72] proposes to maximize the mutual information between a global representation of a sentence and n-grams in it. The context is induced from the sentence with the selected n-grams masked, and the negative contexts are randomly picked from the corpus.
In graph learning, Deep Graph InfoMax (DGI) [139] considers a node's representation as the local feature and the average of randomly sampled 2-hop neighbors as the context. However, it is hard to generate negative contexts on a single graph. To solve this problem, DGI proposes to corrupt the original context by keeping the subgraph structure and permuting the node features. DGI is followed by a number of works, such as InfoGraph [123], which targets learning graph-level rather than node-level representations by maximizing the mutual information between the graph-level representation and substructures at different levels.
In [57], the authors systematically analyze the pretraining strategies for graph neural networks along two dimensions: attribute/structural and node-level/graph-level. For structural prediction at the node level, they also propose to maximize the MI between the representation of a k-hop neighborhood and its context graph. SGRL [99] further separates nodes in the context graph into k-hop context subgraphs and maximizes their MI with the target node respectively.
4.2 Context-Context Contrast
Though MI-based contrastive learning achieves great success, some recent studies [132, 53, 25, 22] cast doubt on the actual improvement brought by MI.
[132] provides empirical evidence that the success of the models mentioned above is only loosely connected to MI, by showing that an upper-bound MI estimator leads to ill-conditioned representations with lower performance. Instead, more of the success should be attributed to the encoder architecture and a negative sampling strategy related to metric learning. A significant focus in metric learning is to perform hard positive sampling while increasing negative sampling efficiency, and these probably play a more critical role in MI-based models' success.
Recently, MoCo [53] and SimCLR [22] empirically support the above conclusion. They outperform context-instance-based methods and achieve results competitive with supervised methods under the linear classification protocol through direct context-to-context comparison. We will start with cluster-based discrimination, which was proposed earlier, and then turn to instance discrimination advocated by MoCo and SimCLR.
4.2.1 Cluster-based Discrimination
Context-context contrast is first studied in clustering-based methods [148, 81, 94, 20], especially DeepCluster [20], which is the first to achieve performance competitive with the supervised model AlexNet [73].
Image classification asks the model to categorize images correctly, and the representations of images in the same category should be similar. Therefore, the motivation is to draw similar images near in the embedding space. In supervised learning, this drawing-near process is accomplished via label supervision; in self-supervised learning, however, we do not have such labels. To solve the label problem, DeepCluster [20] proposes to leverage clustering to yield pseudo labels and asks a discriminator to predict images' labels. The training is formulated in two steps. In the first step, DeepCluster uses K-means to cluster the encoded representations and produces pseudo labels for each sample. Then, in the second step, the discriminator predicts whether two samples are from the same cluster and backpropagates to the encoder. These two steps are performed iteratively.
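The first step of this loop can be sketched with a minimal K-means pseudo-labeler over toy 2-D "representations" (illustrative only; DeepCluster clusters high-dimensional ConvNet features and alternates with classifier training):

```python
import numpy as np

def kmeans(z, k, iters=10, seed=0):
    """Minimal Lloyd's K-means: returns a pseudo label per row of z."""
    rng = np.random.default_rng(seed)
    centers = z[rng.choice(len(z), size=k, replace=False)]
    for _ in range(iters):
        d = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, k)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = z[labels == j].mean(0)
    return labels

# Step 1: cluster the encoded representations into pseudo labels.
rng = np.random.default_rng(1)
z = np.concatenate([rng.normal(-3, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
pseudo = kmeans(z, k=2)
# Step 2 (not shown): train a classifier on (z, pseudo), backpropagate to the
# encoder, then re-cluster with the updated encoder, and repeat.
assert len(np.unique(pseudo)) == 2
```

With well-separated toy clusters, the pseudo labels recover the two groups, which is what makes them usable as classification targets.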
Recently, Local Aggregation (LA) [160] has pushed forward the boundary of the cluster-based method. It points out several drawbacks of DeepCluster and makes the corresponding optimizations. First, in DeepCluster, samples are assigned to mutually exclusive clusters, while LA identifies neighbors separately for each example. Second, DeepCluster optimizes a cross-entropy discriminative loss, while LA employs an objective function that directly optimizes a local soft-clustering metric. These two changes substantially boost the performance of LA representations on downstream tasks.
A similar work that aggregates similar vectors together in the embedding space is VQ-VAE [135, 112], which we introduce in Section 3. To conquer VAE's traditional deficiency in generating high-fidelity images, VQ-VAE proposes to quantize vectors. For the feature matrix encoded from an image, VQ-VAE substitutes each 1-dimensional vector in the matrix with the nearest one in an embedding dictionary. This process is much the same as what LA is doing.
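The nearest-neighbor substitution step can be sketched in a few lines of numpy (a minimal sketch of the quantization step only; the straight-through gradient and codebook updates of VQ-VAE are omitted):

```python
import numpy as np

def vector_quantize(features, codebook):
    """Replace each feature vector with its nearest codebook entry.

    features: (N, d) encoder outputs; codebook: (K, d) embedding dictionary.
    Returns (quantized vectors, codebook indices).
    """
    # squared distance between every feature and every codebook vector
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = d2.argmin(1)
    return codebook[idx], idx

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))
features = codebook[3] + 0.01 * rng.normal(size=(5, 4))  # near entry 3
quantized, idx = vector_quantize(features, codebook)
assert (idx == 3).all()          # every vector snaps to the nearest code
```

Each encoded vector is thus pulled onto a shared dictionary entry, aggregating nearby vectors exactly as the text describes.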
Clustering-based discrimination may also help the generalization of other pretrained models, transferring models from pretext objectives to real tasks better. Traditional representation learning models have only two stages: one for pretraining and the other for evaluation. ClusterFit [147] introduces a cluster-prediction fine-tuning stage, similar to DeepCluster, between the above two stages, which improves the representation's performance on downstream classification evaluation.
4.2.2 Instance Discrimination
The prototype of leveraging instance discrimination as a pretext task is InstDisc [144]. On this basis, CMC [130] proposes to adopt multiple different views of an image as positive samples and take another image as the negative. In the embedding space, CMC draws the multiple views of an image near and pulls them away from other samples. However, it is somewhat constrained by the idea of Deep InfoMax, sampling only one negative for each positive.
In MoCo [53], researchers further develop the idea of leveraging instance discrimination via momentum contrast, which substantially increases the number of negative samples. For example, given an input image $x$, our intuition is to learn an instance-level representation $q = f_q(x)$ via a query encoder $f_q$ that can distinguish $x$ from any other image. Therefore, for a set of other images $x_i$, we employ an asynchronously updated key encoder $f_k$ to yield $k_{+} = f_k(x)$ and $k_i = f_k(x_i)$, and optimize the following objective
(16) $\mathcal{L}_{q} = -\log \frac{\exp(q \cdot k_{+} / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_{i} / \tau)}$
where $K$ is the number of negative samples. This formula takes the form of InfoNCE.
Besides, MoCo presents two other critical ideas in dealing with negative sampling efficiency.

First, it abandons the traditional end-to-end training framework and designs momentum contrastive learning with two encoders (query and key), which prevents fluctuating loss convergence in the early training period.

Second, to enlarge the capacity of negative samples, MoCo employs a queue (with K as large as 65536) to save the recently encoded batches as negative samples. This significantly improves the negative sampling efficiency.
There are some other auxiliary techniques to ensure training convergence, such as batch shuffling to avoid trivial solutions and a temperature hyper-parameter $\tau$ to adjust the scale.
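MoCo's two core mechanics, the momentum-updated key encoder and the queue of recent key batches, can be sketched as follows (class and attribute names are ours; real MoCo applies this to ConvNet parameters and L2-normalized keys):

```python
import numpy as np

class MoCoState:
    """Toy holder for MoCo's momentum update and negative-sample queue."""
    def __init__(self, dim, queue_size, m=0.999, seed=0):
        rng = np.random.default_rng(seed)
        self.w_q = rng.normal(size=(dim, dim))    # query encoder "weights"
        self.w_k = self.w_q.copy()                # key encoder starts as a copy
        self.m = m
        self.queue = np.zeros((queue_size, dim))  # recent keys = negatives
        self.ptr = 0

    def momentum_update(self):
        # key encoder trails the query encoder: w_k <- m*w_k + (1-m)*w_q
        self.w_k = self.m * self.w_k + (1 - self.m) * self.w_q

    def enqueue(self, keys):
        # overwrite the oldest batch in the circular queue with new keys
        n = len(keys)
        idx = (self.ptr + np.arange(n)) % len(self.queue)
        self.queue[idx] = keys
        self.ptr = (self.ptr + n) % len(self.queue)

state = MoCoState(dim=8, queue_size=32)
state.w_q += 1.0            # pretend a gradient step moved the query encoder
state.momentum_update()
# the key encoder moved only a fraction (1 - m) of the way toward w_q
assert np.allclose(state.w_q - state.w_k, 0.999)
state.enqueue(np.ones((8, 8)))
assert state.ptr == 8 and state.queue[:8].sum() == 64.0
```

The slow key encoder keeps the queued keys roughly consistent with each other, which is what makes a very large (e.g. K = 65536) negative pool usable.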
However, MoCo adopts an overly simple positive sampling strategy: a pair of positive representations comes from the same sample without any transformation or augmentation, making the positive pair far too easy to distinguish. PIRL [90] adds jigsaw augmentation as described in Section 4.1.1. To produce a pretext-invariant representation, PIRL asks the encoder to regard an image and its jigsawed counterpart as a similar pair.
In SimCLR [22], the authors further illustrate the importance of a hard positive sampling strategy by introducing data augmentation in 10 forms. This data augmentation is similar to CMC [130], which leverages several different views to augment the positive pairs. SimCLR follows the end-to-end training framework instead of MoCo's momentum contrast, and to handle the large-scale negative sample problem, SimCLR chooses a batch size as large as 8192.
The details are as follows. A minibatch of $N$ samples is augmented into $2N$ samples. For a pair of positive samples $z_i$ and $z_j$ (derived from the same original sample), the other $2(N-1)$ samples are treated as negatives. A pairwise contrastive loss, the NT-Xent loss [23], is defined as
(17) $\ell(i, j) = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j) / \tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\big(\mathrm{sim}(z_i, z_k) / \tau\big)}$
Note that $\ell(i, j)$ is asymmetric, and the $\mathrm{sim}(\cdot, \cdot)$ function here is a cosine similarity that normalizes the representations. The summed-up loss is
(18) $\mathcal{L} = \frac{1}{2N} \sum_{k=1}^{N} \big[\ell(2k-1, 2k) + \ell(2k, 2k-1)\big]$
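A minimal numpy sketch of this loss over a batch of view pairs (the pairing convention and toy data are illustrative):

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent over 2N views; rows (2k, 2k+1) of z form a positive pair."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    log_den = np.log(np.exp(sim).sum(axis=1))
    pos = np.arange(len(z)) ^ 1                        # partner: 0<->1, 2<->3, ...
    losses = -(sim[np.arange(len(z)), pos] - log_den)  # l(i, j), asymmetric
    return losses.mean()                               # averages both directions

rng = np.random.default_rng(0)
views = rng.normal(size=(4, 16))
# two near-identical augmented views per sample -> aligned positive pairs
paired = np.repeat(views, 2, axis=0) + 0.01 * rng.normal(size=(8, 16))
loss_aligned = nt_xent(paired)
loss_random = nt_xent(rng.normal(size=(8, 16)))        # unrelated "pairs"
assert loss_aligned < loss_random
```

As expected, genuinely matched view pairs yield a much lower loss than unrelated vectors, which is the signal the encoder is trained on.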
SimCLR also provides some other useful techniques, including a learnable nonlinear transformation between the representation and the contrastive loss, more training steps, and deeper neural networks. [25] conducts ablation studies to show that techniques in SimCLR can also further improve MoCo’s performance.
In graph learning, Graph Contrastive Coding (GCC) [63] is a pioneer in leveraging instance discrimination as the pretext task for structural information pretraining. For each node, we sample two subgraphs independently by random walk with restart and use the top eigenvectors of their normalized graph Laplacian matrices as the nodes' initial representations. Then we use a GNN to encode them and calculate the InfoNCE loss as MoCo and SimCLR do, where node embeddings from the same node (in different subgraphs) are viewed as similar. Results show that GCC learns better transferable structural knowledge than previous works such as struc2vec
[113], GraphWave [41] and ProNE [154].

5 Generative-Contrastive (Adversarial) Self-supervised Learning
5.1 Why Generative-Contrastive (Adversarial)?
A reason for the generative model’s success in selfsupervised learning is its ability to fit the data distribution, based on which varied downstream tasks can be conducted. The objective of generative selfsupervised learning is usually formulated as a maximum likelihood function
(19) $\max_{\theta}\ \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_{\theta}(x \mid c)\big]$
where $x$ ranges over the samples we hope to model and $c$ is a conditional constraint such as context information. This objective is then optimized by Maximum Likelihood Estimation (MLE). Nevertheless, MLE has two fatal problems:

Sensitive and Conservative Distribution. When $p_{\theta}(x \mid c) \to 0$, the term $-\log p_{\theta}(x \mid c)$ becomes extremely large, making the generative model highly sensitive to rare samples. This directly leads to a conservative distribution, which has low performance.

Low-level Abstraction Objective. In MLE, the representation distribution is modeled at $x$'s level, such as pixels in images, words in texts, and nodes in graphs. However, most classification tasks target high-level abstraction, such as object detection, long paragraph understanding, and molecule classification.
These two problems severely restrict the development of generative self-supervised learning. Fortunately, discriminative and contrastive objectives can solve this problem because they are designed to serve human-level understanding. Take autoencoders and GAN [108] as examples: autoencoders leverage a pointwise reconstruction loss, which may fit pixel-level patterns outside the sample distribution. GANs, however, utilize contrastive objectives that distinguish generated samples from real ones, fitting at the semantic level and avoiding this problem.
In terms of the difference from contrastive learning, adversarial methods still preserve the generator structure consisting of an encoder and a decoder, while contrastive methods abandon the decoder component (as shown in Fig. 12). This is critical because, on the one hand, the generator endows adversarial learning with the strong expressiveness peculiar to generative models; on the other hand, it also makes the objective of adversarial methods far more challenging to learn than that of contrastive methods, leading to unstable convergence. In the adversarial setting, the existence of the decoder asks the representation to be "reconstructive," in other words, to contain all the information necessary for reconstructing the input. In the contrastive setting, however, we only need to learn "distinguishable" information to discriminate different samples.
To sum up, the adversarial method absorbs merits from both generative and contrastive methods, together with some of their drawbacks. In situations where we need to fit an implicit distribution, it is a better choice. In the next several subsections, we discuss its various applications in representation learning.
5.2 Generate with Complete Input
In this section, we introduce GAN and its variants for representation learning, which focus on capturing the complete information of the sample.
The inception of adversarial representation learning should be attributed to Generative Adversarial Networks (GAN) [108], which proposes the adversarial training framework. Following GAN, many variants [80, 98, 60, 16, 66, 62] emerge and reshape people's understanding of deep learning's potential. The training process of GAN can be viewed as two players playing a game: one generates fake samples while the other tries to distinguish them from real ones. To formulate this problem, we define $G$ as the generator, $D$ as the discriminator, $p_{\text{data}}(x)$ as the real sample distribution and $p_z(z)$ as the latent distribution; we want to optimize the min-max game
(20) $\min_{G} \max_{D}\ \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$
Before VQ-VAE-2, GAN maintained dominant performance on image generation tasks over purely generative models, such as the autoregressive PixelCNN and the autoencoder-based VAE. It is natural to think about how this framework could benefit representation learning.
However, there is a gap between generation and representation. Compared to the autoencoder's explicit latent distribution $q(z \mid x)$, GAN's latent distribution is modeled implicitly, and we need to extract this implicit distribution. To bridge this gap, AAE [86] first proposes a solution following the natural idea of the autoencoder. The generator in GAN can be viewed as an implicit autoencoder; in order to extract the representation, we can replace the generator with an explicit variational autoencoder (VAE). Recall the objective of VAE
(21) $\mathcal{L}_{\text{VAE}} = \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] - D_{\mathrm{KL}}\big(q(z \mid x) \,\|\, p(z)\big)$
As we mentioned before, compared to the reconstruction loss of the autoencoder, the discriminative loss in GAN better models high-level abstraction. To alleviate the problem, AAE substitutes the KL divergence with a discriminative loss
(22) $\mathcal{L}_{\text{adv}} = \min_{E} \max_{D}\ \mathbb{E}_{z \sim p(z)}\big[\log D(z)\big] + \mathbb{E}_{x \sim p_{\text{data}}}\big[\log\big(1 - D(E(x))\big)\big]$
that asks the discriminator to distinguish representations produced by the encoder from samples of a prior distribution.
However, AAE still preserves the reconstruction error, which contradicts GAN's core idea. Based on AAE, BiGAN [39] and ALI [42] argue for embracing adversarial learning without reservation and put forward a new framework. Given a real sample $x$:

Generator $G$: the generator here virtually acts as the decoder, generating fake samples $G(z)$ from a prior latent distribution $z$ (e.g. $[\mathrm{uniform}(-1, 1)]^{d}$, where $d$ refers to the dimension).

Encoder $E$: a newly added component, mapping a real sample $x$ to a representation $E(x)$. This is also exactly what we want to train.

Discriminator $D$: given two inputs $[z, G(z)]$ and $[E(x), x]$, decide which one is from the real sample distribution.
It is easy to see that their training goal is $E = G^{-1}$; in other words, the encoder $E$ should learn to "invert" the generator $G$. This goal can be rewritten as an autoencoder loss [39], but it is not the same as a traditional autoencoder because the distribution makes no assumption about the data itself; it is shaped by the discriminator, which captures the semantic-level difference. Based on BiGAN and ALI, later studies [26, 40] discover that GANs with deeper and larger networks and modified architectures can produce even better results on downstream tasks.
5.3 Recover with Partial Input
As we mentioned above, GAN's architecture is not born for representation learning, and modification is needed to apply its framework. While BiGAN and ALI choose to extract the implicit distribution directly, some other methods such as colorization [156, 157, 76, 75], inpainting [60, 98] and super-resolution [80] apply adversarial learning in a different way. Instead of asking models to reconstruct the whole input, they provide models with partial input and ask them to recover the rest. This is similar to denoising autoencoders (DAE) such as the BERT family in natural language processing, but notice that here it is conducted in an adversarial manner.
Colorization is first proposed by [156]. The problem can be described as: given one color channel in an image, predict the values of the two other channels. The encoder and decoder networks can be set to any form of convolutional neural network. What is interesting is that, to avoid the uncertainty brought by traditional generative methods such as VAE, the authors transform the generation task into a classification one. They first figure out the common locating area of the two predicted channels' values and then split it into 313 categories. The classification is performed through a softmax layer with a temperature hyper-parameter as adjustment. Based on [156], a range of colorization-based representation methods [157, 76, 75] are proposed to benefit downstream tasks.

Inpainting [60, 98] is more straightforward. We ask the model to predict an arbitrary part of an image given the rest of it. Then a discriminator is employed to distinguish the inpainted image from the original one. The super-resolution method SRGAN [80] follows the same idea, recovering high-resolution images from blurred low-resolution ones in the adversarial setting.
5.4 Pretrained Language Model
For a long time, pretrained language models (PTM) focused on maximum-likelihood-estimation-based pretext tasks, because discriminative objectives were thought to be helpless given language's vibrant patterns. However, some recent work shows excellent performance and sheds light on contrastive objectives' potential in PTM.
The pioneering work is ELECTRA [27], which surpasses BERT given the same computation budget. ELECTRA proposes Replaced Token Detection (RTD) and leverages GAN's structure to pretrain a language model. In this setting, the generator $G$ is a small Masked Language Model (MLM), which replaces masked tokens in a sentence with sampled words. The discriminator $D$ is asked to predict which words have been replaced. Note that "replaced" means not the same as the original unmasked input. The training is conducted in two stages:

Warming up the generator: train $G$ with the MLM pretext task for some steps to warm up the parameters.

Training the discriminator: $D$'s parameters are initialized with $G$'s and then trained with the discriminative objective (a binary cross-entropy loss). During this period, $G$'s parameters are frozen.
The final objective could be written as
(23) $\min_{\theta_G, \theta_D}\ \sum_{x \in \mathcal{X}} \mathcal{L}_{\text{MLM}}(x, \theta_G) + \lambda\, \mathcal{L}_{\text{Disc}}(x, \theta_D)$
Though ELECTRA is structured as a GAN, it is not trained in the GAN setting. That is because, compared to continuous image data, word tokens are discrete, which stops gradient backpropagation. A possible substitution is to leverage policy gradients, but experiments in ELECTRA show that the performance is slightly lower. Theoretically speaking, RTD turns the conventional vocabulary-size softmax classification into a binary classification. This substantially saves computation, but may somewhat harm the representation quality due to early degeneration of the embedding space. In summary, ELECTRA is still an inspiring pioneering work in leveraging discriminative objectives.
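The RTD labeling convention can be sketched with toy data (function names are ours, and the dummy sampler stands in for the small MLM generator):

```python
import random

def rtd_labels(original, masked_positions, generator_sample, rng):
    """Build Replaced Token Detection targets, ELECTRA-style.

    The generator fills each masked position with a sampled token; the
    discriminator's per-token binary label is 1 iff the filled token differs
    from the original. Tokens the generator happens to guess correctly are
    labeled 'original' (0), matching the 'replaced means not the same as the
    original' convention."""
    corrupted = list(original)
    for pos in masked_positions:
        corrupted[pos] = generator_sample(original, pos, rng)
    labels = [int(c != o) for c, o in zip(corrupted, original)]
    return corrupted, labels

rng = random.Random(0)
vocab = ["the", "chef", "cooked", "ate", "meal"]
sentence = ["the", "chef", "cooked", "the", "meal"]
sample = lambda sent, pos, rng: rng.choice(vocab)   # dummy generator
corrupted, labels = rtd_labels(sentence, [1, 3], sample, rng)
# unmasked positions are always labeled 'original'
assert labels[0] == labels[2] == labels[4] == 0
```

The discriminator is then trained with binary cross-entropy over every position, which is the computation saving the text describes.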
At the same time, WKLM [146] proposes to perform RTD at the entity level. For entities in Wikipedia paragraphs, WKLM replaces them with similar entities and trains the language model to distinguish them with a discriminative objective similar to ELECTRA's, performing quite well in downstream tasks like question answering. Similar work is REALM [50], which conducts higher, article-level retrieval augmentation for the language model. However, REALM does not use the discriminative objective.
5.5 Graph Learning
In graph learning, there are also attempts to utilize adversarial learning [29, 140, 34]. Interestingly, their ideas are quite different from each other.
The most natural idea is to follow BiGAN [39] and ALI [42]'s practice of asking the discriminator to distinguish representations from the encoded and prior distributions. Adversarial Network Embedding (ANE) [29] designs a generator that is updated in two stages: 1) it encodes a sampled graph into the target embedding and computes a traditional NCE loss with a context encoder like Skip-gram; 2) the discriminator is asked to distinguish the embedding produced by the generator from one sampled from a prior distribution. The optimized objective is the sum of the above two objectives, and the generator yields better node representations for the classification task.
GraphGAN [140] considers modeling the link prediction task and follows the original GAN-style discriminative objective to distinguish directly at the node level rather than the representation level. The model first selects nodes from the subgraph of the target node according to embeddings encoded by the generator $G$. Then some neighbor nodes selected from the subgraph, together with those selected by $G$, are put into a binary classifier $D$ to decide whether they are linked to the target node. Because this framework involves a discrete selection procedure, while the discriminator can be updated by gradient descent, the generator is updated via policy gradients.

GraphSGAN [34] applies the adversarial method to semi-supervised graph learning with the motivation that most classification errors in graphs are caused by marginal nodes. Consider samples in the same category: they are usually clustered in the embedding space. Between clusters, there are density gaps where few samples exist. The authors provide a rigorous mathematical proof that if we generate enough fake samples in density gaps, we are able to perform complete classification theoretically. During training, GraphSGAN leverages a generator to generate fake nodes in density gaps and asks the discriminator to classify nodes into their original categories plus an extra category for the fake ones. At test time, fake samples are removed, and classification results on the original categories improve substantially.
5.6 Domain Adaptation and Multimodality Representation
Essentially, the discriminator in adversarial learning serves to match the discrepancy between the latent representation distribution and the data distribution. This function naturally relates to domain adaptation and multimodality representation problems, which aim at aligning different representation distributions. [1, 43, 117, 2] study how GAN can help with domain adaptation. [17, 141] leverage adversarial sampling to improve the quality of negative samples. For multimodality representation, [159]'s image-to-image translation, [118]'s text style transfer, [28]'s word-to-word translation and [115]'s image-to-text translation show the great power of adversarial representation learning.
6 Theory behind Self-supervised Learning
In the last three sections, we introduced a number of empirical works on self-supervised learning. However, we are also curious about their theoretical foundations. In this part, we provide some theoretical insights into self-supervised learning's success from different perspectives.
6.1 GAN
6.1.1 Divergence Matching
As generative models, GANs [46] pay attention to the difference between the real data distribution $p_{\text{data}}(x)$ and the generated data distribution $p_G(x)$:
(24) $\min_{G}\ D_{\mathrm{KL}}\big(p_{\text{data}}(x) \,\|\, p_G(x)\big)$
f-GAN [95] shows that the generative-adversarial approach is a special case of an existing, more general variational divergence estimation problem, and uses the f-divergence to train generative models. The f-divergence reflects the difference between two distributions $P$ and $Q$ with densities $p$ and $q$:
(25) $D_f(P \,\|\, Q) = \int_{\mathcal{X}} q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx$
Replacing the KL-divergence in (24) with the Jensen-Shannon (JS) divergence and evaluating it via (25), the optimization target of the min-max GAN is obtained.
(26) $\min_{G} \max_{D}\ \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{x \sim p_G}\big[\log\big(1 - D(x)\big)\big]$
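The divergence-matching view can be checked numerically on discrete distributions: the f-divergence formula recovers KL for the generator f(t) = t log t, and with the optimal discriminator D*(x) = p(x)/(p(x)+q(x)) the inner value of the GAN game equals 2·JS(p‖q) − log 4 (the distributions below are arbitrary examples):

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) f(p(x)/q(x)) for discrete distributions."""
    return float((q * f(p / q)).sum())

def kl(p, q):
    return float((p * np.log(p / q)).sum())

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
m = (p + q) / 2
jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)          # Jensen-Shannon divergence

# KL is an f-divergence with generator f(t) = t * log t
assert np.isclose(f_divergence(p, q, lambda t: t * np.log(t)), kl(p, q))

# With the optimal discriminator, the GAN value reduces to 2*JS(p||q) - log 4,
# so minimizing over the generator minimizes the JS divergence.
d_star = p / (p + q)
gan_value = float((p * np.log(d_star)).sum() + (q * np.log(1 - d_star)).sum())
assert np.isclose(gan_value, 2 * jsd - np.log(4))
```

This makes concrete the claim that different divergence choices induce different GAN variants: swapping the generator function f swaps the divergence being minimized.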
Different divergence functions lead to different GAN variants. [95] also discusses the effects of various choices of divergence functions.
6.1.2 Disentangled Representation
An important drawback of supervised learning is that it easily gets trapped by spurious information. A famous example is that supervised neural networks learn to distinguish dogs and wolves by whether they are on grass or snow [114], which means the supervised models do not learn disentangled representations of the grass and the animal, which should be mutually independent.
As an alternative, GAN shows its superior potential in learning disentangled features, both empirically and theoretically. InfoGAN [24] first proposes to learn disentangled representations with DCGAN. Conventionally, we sample white noise $z$ from a uniform or Gaussian distribution as the input to GAN's generator. However, this white noise does not correspond to any characteristic of the generated image. We assume that there should be a latent code $c$ whose dimensions represent different characteristics of the image respectively (such as rotation degree and width). We learn this $c$ jointly in the discrimination period via the discriminator, maximizing $c$'s mutual information with the image $G(z, c)$, where $G$ refers to the generator (actually the decoder). Since mutual information is notoriously hard to compute, the authors leverage the variational inference approach to estimate its lower bound $L_I(G, Q)$, and the final objective of InfoGAN is modified as:
(27) $\min_{G, Q} \max_{D}\ V_{\text{GAN}}(D, G) - \lambda\, L_I(G, Q)$
Experiments show that InfoGAN indeed learns a good disentangled representation on MNIST. This further encourages researchers to identify whether the modular structures for generation inside the GAN can be disentangled and independent of each other. GAN Dissection [9] is a pioneering work applying causal analysis to understanding GANs. They identify correlations between channels in the convolutional layers and objects in the generated images, and examine whether they are causally related to the output. [13] takes another step to examine these channels' conditional independence via rigorous counterfactual interventions over them. Results indicate that in BigGAN researchers are able to disentangle backgrounds and objects, such as replacing the background of a cock from bare soil with grassland.
These works indicate the ability of GAN to learn disentangled features, and other self-supervised learning methods are likely to be capable too.
6.2 Maximizing Lower Bound
6.2.1 Evidence Lower Bound
VAE (variational autoencoder) learns the representation by learning a distribution $q_\phi(z \mid x)$ to approximate the posterior distribution $p(z \mid x)$:
(28) $\log p(x) = \mathrm{ELBO} + D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z \mid x)\big)$
(29) $\mathrm{ELBO} = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p(x, z)}{q_\phi(z \mid x)}\right]$
where ELBO (Evidence Lower Bound Objective) is a lower bound of the optimization target $\log p(x)$. VAE maximizes the ELBO to minimize the difference between $q_\phi(z \mid x)$ and $p(z \mid x)$.
(30) $\mathrm{ELBO} = -D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big) + \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]$
where the first term is the regularization loss, forcing the approximate posterior toward the Gaussian prior $p(z)$, and the second term is the reconstruction loss.
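For a diagonal-Gaussian encoder and a standard normal prior, both ELBO terms have simple computable forms (a minimal sketch; the squared-error reconstruction here is a stand-in for the likelihood term):

```python
import numpy as np

def gaussian_elbo_terms(x, x_recon, mu, logvar):
    """The two (negated) ELBO terms for q(z|x) = N(mu, diag(exp(logvar))).

    KL( N(mu, exp(logvar)) || N(0, I) ) has the closed form
    -0.5 * sum(1 + logvar - mu^2 - exp(logvar)); the reconstruction term
    is approximated by a squared error between x and its reconstruction."""
    kl_term = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))
    recon_term = np.sum((x - x_recon) ** 2)
    return kl_term, recon_term

mu, logvar = np.zeros(4), np.zeros(4)          # q(z|x) = N(0, I)
kl0, _ = gaussian_elbo_terms(np.ones(4), np.ones(4), mu, logvar)
assert np.isclose(kl0, 0.0)                    # matching the prior costs nothing
kl_shifted, _ = gaussian_elbo_terms(np.ones(4), np.ones(4), np.ones(4), logvar)
assert kl_shifted > 0                          # any mismatch incurs KL > 0
```

Training minimizes the sum of the two terms, trading reconstruction fidelity against staying close to the prior.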
6.2.2 Mutual Information
Most current contrastive learning methods aim to maximize the mutual information (MI) between the input $x$ and its representation $z$, with joint density $p(x, z)$ and marginal densities $p(x)$ and $p(z)$:
(31) $I(x; z) = \iint p(x, z) \log \frac{p(x, z)}{p(x)\, p(z)}\, dx\, dz$
Deep InfoMax [55] maximizes the MI between local and global features and replaces the KL-divergence with the JS-divergence, similarly to the GANs mentioned above. Therefore the optimization target of Deep InfoMax becomes:
(32) $\max\ \mathbb{E}_{p(v, s)}\big[-\mathrm{sp}\big(-T(v, s)\big)\big] - \mathbb{E}_{p(v)\, p(s)}\big[\mathrm{sp}\big(T(v, s)\big)\big]$

where $v$ and $s$ are the local and global features, $T$ is a learned critic, and $\mathrm{sp}(a) = \log(1 + e^{a})$ is the softplus function.
The form of the objective function is similar to (26), except that the data distributions become the global and local feature distributions. From a probabilistic point of view, GAN and Deep InfoMax are derived from the same process but for different learning targets. The encoder in GAN, to an extent, works the same as the encoder in representation learning models, and the idea of generative-adversarial learning deserves use in self-supervised learning areas.
Instance discrimination [144][96] directly optimizes the gap between positive pairs and negative pairs. One of the most commonly used estimators is InfoNCE [96]:
(33) $\mathcal{L}_N = -\mathbb{E}_{X}\left[\log \frac{f(x, c)}{\sum_{x' \in X} f(x', c)}\right]$

where $X$ contains one positive sample of the context $c$ and $N - 1$ negative samples, and $f$ is a positive critic function.
Therefore the MI $I(x; c) \geq \log(N) - \mathcal{L}_N$. The approximation becomes increasingly accurate, and the bound also increases, as $N$ grows. This implies that it is useful to use large numbers of negative samples (large values of $N$). But [4] demonstrates that increasing the number of negative samples does not necessarily help. Negative sampling remains a key challenge to study.
Though maximizing ELBO and MI has achieved state-of-the-art results in self-supervised representation learning, it has been demonstrated that MI and ELBO are loosely connected with downstream task performance [68][132]. Maximizing the lower bounds (MI and ELBO) is not sufficient to learn useful representations. On the one hand, looser bounds often yield better test accuracy in downstream tasks. On the other hand, achieving the same lower-bound value can lead to vastly different representations and downstream performance, which indicates that the bound does not necessarily capture the useful information in the data [3][131][14]. There is a non-trivial interaction between the representation encoder, the critic, and the loss function [132].

MI maximization can also be analyzed from the metric learning view. [132] provides some insight by connecting InfoNCE to the triplet (k-plet) loss in the deep metric learning community. The InfoNCE (33) can be rewritten as follows:
(34) $\mathcal{L}_N = -\mathbb{E}\left[\frac{1}{K} \sum_{i=1}^{K} \log \frac{e^{f(x_i, y_i)}}{\frac{1}{K} \sum_{j=1}^{K} e^{f(x_i, y_j)}}\right]$
In particular, the critic $f(x, y)$ is constrained to the form $\phi(x)^{\top} \phi(y)$ for a certain function $\phi$. Then the InfoNCE corresponds to the expectation of the multi-class $k$-pair loss:
(35) $\mathcal{L} = \mathbb{E}\left[\log\left(1 + \sum_{j \neq i} e^{\phi(x_i)^{\top} \phi(y_j) - \phi(x_i)^{\top} \phi(y_i)}\right)\right]$
In metric learning, the encoder is shared across views ($x$ and $y$) and the critic function is symmetric, while MI maximization methods, e.g. Deep InfoMax, CMC and MoCo, are not constrained by these conditions. (35) can be viewed as learning encoders with a parameterless inner-product critic.
6.3 Contrastive Self-supervised Representation Learning
It seems intuitive that minimizing the aforementioned loss functions should lead representations to better capture the "similarity" between different entities, but it is unclear why the learned representations should also lead to better performance on downstream tasks, such as linear classification. Intuitively, a self-supervised representation learning framework must capture features in unlabelled data whose similarity structure matches the semantic information implicitly present in downstream tasks. [4] proposes a conceptual framework to analyze contrastive learning on average classification tasks.
Contrastive learning assumes that a similar data pair $(x, x^{+})$ comes from a distribution $\mathcal{D}_{sim}$ and a negative sample $x^{-}$ from a distribution $\mathcal{D}_{neg}$ that is presumably unrelated to $x$. Under the hypothesis that semantically similar points are sampled from the same latent class, the unsupervised loss can be expressed as:
(36) $L_{un}(f) = \mathbb{E}_{(x,\,x^{+})\sim\mathcal{D}_{sim},\; x^{-}\sim\mathcal{D}_{neg}}\Big[\ell\Big(f(x)^{\top}\big(f(x^{+}) - f(x^{-})\big)\Big)\Big]$
Self-supervised learning then seeks a function $\hat{f}$ that minimizes the empirical unsupervised loss within the capacity of the encoder. As negative points are sampled i.i.d. from the data, $L_{un}(f)$ can be decomposed into $L^{=}_{un}(f)$ and $L^{\neq}_{un}(f)$ according to whether the negative sample is drawn from the same latent class as the positive pair. The intraclass deviation $s(f)$ controls $L^{=}_{un}(f)$, a loss term contrary to our optimization target that is caused by the negative sampling strategy. In the setting with only one negative sample, it is proved that optimizing the unsupervised loss benefits the downstream classification task:
(37) $L_{sup}(\hat{f}) \le \dfrac{1}{1-\tau}\big(L_{un}(f) - \tau\big) + \dfrac{1}{1-\tau}\,\mathrm{Gen}_{M}$
This holds with probability at least $1-\delta$ for all $f$ in $\mathcal{F}$, the class of feature mapping functions the encoder can express; $\mathrm{Gen}_{M}$ is the generalization error over $M$ samples, and $\tau$ is the probability that the sampled positive pair and the negative collide in the same latent class. As the number of samples grows $\mathrm{Gen}_{M} \to 0$, and as the number of latent classes grows $\tau \to 0$. If the encoder is powerful enough and trained with a sufficiently large number of samples, a learned function $\hat{f}$ with low unsupervised loss $L_{un}(\hat{f})$ as well as low intraclass deviation $s(\hat{f})$ will have good performance on supervised tasks (low $L_{sup}(\hat{f})$).
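The one-negative unsupervised loss in (36) is easy to state concretely. The sketch below uses the logistic surrogate $\ell(v) = \log(1 + e^{-v})$, one of the losses considered in this line of analysis; the identity encoder and the two-dimensional data are placeholders for illustration only:

```python
import numpy as np

def logistic_loss(v):
    """Surrogate l(v) = log(1 + exp(-v)): small when the positive pair
    scores well above the negative, log(2) when they tie."""
    return np.log1p(np.exp(-v))

def unsupervised_loss(encode, x, x_pos, x_neg):
    """Empirical L_un with one negative per similar pair:
    average of l( f(x)^T (f(x+) - f(x-)) ) over the batch."""
    fx, fp, fn = encode(x), encode(x_pos), encode(x_neg)
    return logistic_loss(np.sum(fx * (fp - fn), axis=1)).mean()

identity = lambda z: z
x = np.array([[1.0, 0.0]])
easy = unsupervised_loss(identity, x, x, np.array([[-1.0, 0.0]]))  # separated negative
collision = unsupervised_loss(identity, x, x, x)  # negative from the same class
```

The collision case (negative drawn from the positive's own class) yields the tie value $\log 2$ no matter how good the encoder is, which is exactly the $L^{=}_{un}$ term that the sampling strategy, not the representation, is responsible for.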
Contrastive learning also has limitations. In fact, contrastive learning does not always pick the best supervised representation function $f^{\ast} \in \arg\min_{f\in\mathcal{F}} L_{sup}(f)$. Minimizing the unsupervised loss to get a low $L_{un}(\hat{f})$ does not guarantee a low $L_{sup}(\hat{f})$, because a high $L^{=}_{un}(f)$ and a high $s(f)$ do not imply a high $L_{sup}(f)$; the unsupervised objective may therefore reject $f^{\ast}$, resulting in the failure of the algorithm.
The relationship between $L_{sup}$ and $L_{un}$ is further explored under the mean classifier loss $L^{\mu}_{sup}$, where each label $c$ corresponds to a single embedding vector $\mu_{c} = \mathbb{E}_{x\sim c}[f(x)]$. If there exists a function $f$ that has intraclass concentration in a strong sense and can separate latent classes with high margin (on average) under the mean classifier, then $L_{un}(f)$ will be low. Concretely, if $f$ is $\sigma^{2}$-subgaussian in every direction for every class and has maximum norm $R = \max_{x}\lVert f(x)\rVert$, then $L^{\neq}_{un}(f)$ can be controlled by:
(38) $L^{\neq}_{un}(f) \le \gamma\, L^{\mu}_{\gamma,\,sup}(f) + \epsilon, \qquad \gamma = 1 + c' R\sigma\sqrt{\log\tfrac{R}{\epsilon}}$
This holds for all $\epsilon > 0$ and $f \in \mathcal{F}$ with probability at least $1-\delta$, where $c'$ is a constant. Under these assumptions, optimizing the unsupervised loss indeed helps pick a representation with the best downstream supervised loss.
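The mean classifier used in this analysis is simple to state in code: represent each class $c$ by the mean embedding $\mu_{c}$ of its samples and predict by the largest inner product. A minimal sketch, where the pre-computed embeddings and labels are hypothetical placeholders:

```python
import numpy as np

def mean_classifier_predict(f_train, y_train, f_test):
    """Mean classifier: the weight vector for class c is
    mu_c = mean of f(x) over training points of class c;
    prediction is argmax_c <mu_c, f(x)>."""
    classes = np.unique(y_train)
    mus = np.stack([f_train[y_train == c].mean(axis=0) for c in classes])
    return classes[np.argmax(f_test @ mus.T, axis=1)]

# toy embeddings: class 0 clusters near (5, 0), class 1 near (0, 5)
f_train = np.array([[5.0, 0.0], [4.0, 0.0], [0.0, 5.0], [0.0, 4.0]])
y_train = np.array([0, 0, 1, 1])
preds = mean_classifier_predict(f_train, y_train,
                                np.array([[6.0, 0.0], [0.0, 6.0]]))
```

Because the classifier has no trainable parameters beyond the representation, a low $L^{\mu}_{sup}$ directly certifies the quality of the learned embedding.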
As in the aforementioned models [53][25], the analysis can also be extended to more than one negative sample for every similar pair. The averaged loss is then:
(39) $L_{un}(f) = \mathbb{E}\Big[\ell\Big(\big\{f(x)^{\top}\big(f(x^{+}) - f(x^{-}_{i})\big)\big\}_{i=1}^{k}\Big)\Big]$
Besides, the general belief is that increasing the number of negative samples always helps, at the cost of increased computation. Noise Contrastive Estimation (NCE) [49] explains that increasing the number of negative samples provably reduces the variance of the learned parameters. However, [4] argues that this does not hold for contrastive learning and shows that additional negatives can hurt performance once their number exceeds a threshold. Under the stated assumptions, contrastive representation learning is theoretically proved to benefit downstream classification tasks; more detailed proofs can be found in [4]. This connects the "similarity" in unlabeled data with the semantic information in downstream tasks. Though the connection currently holds only in a restricted context, more general results deserve exploration.
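One toy calculation illustrates a mechanism by which many negatives can hurt. Assuming, for simplicity, uniformly distributed latent classes (an assumption made here for illustration, not in [4]), the chance that at least one of $k$ i.i.d. negatives collides with the positive pair's class grows quickly with $k$, and colliding negatives contribute the undesired intraclass term to the loss:

```python
def collision_probability(num_classes, k):
    """P(at least one of k i.i.d. negatives falls in the positive's latent
    class), under the illustrative assumption of uniform classes:
    tau = 1/num_classes per draw, so P = 1 - (1 - tau)^k."""
    tau = 1.0 / num_classes
    return 1.0 - (1.0 - tau) ** k
```

With 100 uniform classes, a single negative collides only 1% of the time, but with hundreds of negatives a collision becomes near certain, so ever-larger negative batches increasingly pay the intraclass penalty.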
7 Discussions and Future Directions
In this section, we will discuss several open problems and future directions in selfsupervised learning for representation.
Theoretical Foundation Though self-supervised learning has achieved great success, few works investigate the mechanisms behind it. In Section 6, we list several recent works on this topic and show that theoretical analysis is important for avoiding misleading empirical conclusions.
In [4], researchers present a conceptual framework to analyze the generalization ability of the contrastive objective. [132] empirically shows that mutual information is only loosely related to the success of several MI-based methods, in which the sampling strategies and architecture design may count for more. This type of work is crucial for self-supervised learning to form a solid foundation, and more theoretical analysis is urgently needed.
Transferring to downstream tasks
There is an essential gap between pre-training and downstream tasks. Researchers design elaborate pretext tasks to help models learn critical features of the dataset that should transfer to other tasks, but sometimes this fails to materialize. Besides, the process of selecting pretext tasks remains heuristic and tricky, without clear patterns to follow.
A typical example is the selection of pre-training tasks in BERT and ALBERT. BERT uses Next Sentence Prediction (NSP) to enhance its sentence-level understanding. However, ALBERT shows that NSP is equivalent to a naive topic model, which is far too easy for language model pre-training and even decreases BERT's performance.
For the pre-training task selection problem, an exciting direction would be to automatically design pre-training tasks for a specific downstream task, just as Neural Architecture Search [161] does for neural network architecture.
Transferring across datasets This problem is also known as learning inductive biases or inductive learning. Traditionally, we split a dataset into a training part for learning model parameters and a test part for evaluation. An essential prerequisite of this paradigm is that data in the real world conforms to the distribution of our dataset. Nevertheless, this assumption frequently fails in practice.
Selfsupervised representation learning solves part of this problem, especially in the field of natural language processing. Vast amounts of corpora used in the language model pretraining help to cover most patterns in language and therefore contribute to the success of PTMs in various language tasks. However, this is based on the fact that text in the same language shares the same embedding space. For other tasks like machine translation and fields like graph learning where embedding spaces are different for different datasets, how to learn the transferable inductive biases efficiently is still an open problem.
Exploring potential of sampling strategies In [132], the authors attribute part of the success of mutual information-based methods to better sampling strategies. MoCo [53], SimCLR [22], and a series of other contrastive methods may also support this conclusion. They propose to leverage very large numbers of negative samples and augmented positive samples, whose effects have been studied in deep metric learning. How to further release the power of sampling is still an unsolved and attractive problem.
Early Degeneration for Contrastive Learning Contrastive learning methods such as MoCo [53] and SimCLR [22] are rapidly approaching the performance of supervised learning in computer vision. However, their incredible performance is generally limited to classification problems. Meanwhile, the generative-contrastive method ELECTRA [27] for language model pre-training also outperforms other generative methods on several standard NLP benchmarks with fewer model parameters. However, some remarks indicate that ELECTRA's performance on language generation and neural entity extraction does not meet expectations.
These problems probably arise because contrastive objectives often get trapped in the early degeneration of the embedding space: the model overfits to the discriminative pretext task too early and therefore loses the ability to generalize. We expect new techniques or paradigms to solve the early degeneration problem while preserving the advantages of contrastive learning.
Acknowledgments
The work is supported by the National Key R&D Program of China (2018YFB1402600), NSFC for Distinguished Young Scholar (61825602), and NSFC (61836013).
References
 [1] (2014) Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446. Cited by: §5.6.
 [2] (2018) Domain adaptation with adversarial training and graph embeddings. arXiv preprint arXiv:1805.05151. Cited by: §5.6.
 [3] (2017) Fixing a broken elbo. arXiv preprint arXiv:1711.00464. Cited by: §6.2.2.
 [4] (2019) A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229. Cited by: §6.2.2, §6.3, §6.3, §6.3, §7.
 [5] (2019) Learning to retrieve reasoning paths over wikipedia graph for question answering. arXiv preprint arXiv:1911.10470. Cited by: §1.
 [6] (2019) Learning representations by maximizing mutual information across views. In NIPS, pp. 15509–15519. Cited by: TABLE I, Fig. 7, §4.1.2.
 [7] (2019) Simgnn: a neural network approach to fast graph similarity computation. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 384–392. Cited by: §1.
 [8] (1987) Modular learning in neural networks.. In AAAI, pp. 279–284. Cited by: §3.3.1.
 [9] (2018) Gan dissection: visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597. Cited by: §6.1.2.
 [10] (2003) A neural probabilistic language model. Journal of machine learning research 3 (Feb), pp. 1137–1155. Cited by: §2.1.
 [11] (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §3.3.4.
 [12] (1994) Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5 (2), pp. 157–66. Cited by: §2.2.
 [13] (2018) Counterfactuals uncover the modular structure of deep generative models. arXiv preprint arXiv:1812.03253. Cited by: Fig. 17, §6.1.2.
 [14] (2019) Rethinking lossy compression: the ratedistortionperception tradeoff. arXiv preprint arXiv:1901.07821. Cited by: §6.2.2.
 [15] (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. Cited by: TABLE I, §3.3.2.
 [16] (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §3.3.4, §5.2.

 [17] (2017) Kbgan: adversarial learning for knowledge graph embeddings. arXiv preprint arXiv:1711.04071. Cited by: §5.6.
 [18] (2015) GraRep: learning graph representations with global structural information. In CIKM ’15, Cited by: §2.3.
 [19] (2016) Deep neural networks for learning graph representations. In AAAI, Cited by: §2.3.
 [20] (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149. Cited by: TABLE I, Fig. 9, §4.2.1, §4.2.1.
 [21] (2018) HARP: hierarchical representation learning for networks. In AAAI, Cited by: §2.3.
 [22] (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: TABLE I, Fig. 11, §4.2.2, §4.2, §4.2, §7, §7.
 [23] (2017) On sampling strategies for neural network-based collaborative filtering. In Proceedings of the 23rd ACM SIGKDD, pp. 767–776. Cited by: §4.2.2.
 [24] (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In NIPS, pp. 2172–2180. Cited by: §6.1.2.
 [25] (2020) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: TABLE I, §4.2.2, §4.2, §6.3.
 [26] (2017) Triple generative adversarial nets. In NIPS, pp. 4088–4098. Cited by: §5.2.
 [27] (2020) Electra: pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555. Cited by: TABLE I, Fig. 15, §5.4, §7.
 [28] (2017) Word translation without parallel data. arXiv preprint arXiv:1710.04087. Cited by: §5.6.

 [29] (2018) Adversarial network embedding. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: TABLE I, §5.5, §5.5.
 [30] (2019) Transformer-XL: attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988. Cited by: §3.4.1.
 [31] (1994) Learning classification with unlabeled data. In NIPS, pp. 112–119. Cited by: Fig. 1.

 [32] (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1, §3.3.4.
 [33] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1, §2.1, TABLE I, §3.3.3, §4.1.1.
 [34] (2018) Semi-supervised learning on graphs with generative adversarial nets. In Proceedings of the 27th ACM CIKM, pp. 913–922. Cited by: TABLE I, Fig. 16, §5.5, §5.5.
 [35] (2019) Cognitive graph for multihop reading comprehension at scale. arXiv preprint arXiv:1905.05460. Cited by: §1.
 [36] (2014) NICE: non-linear independent components estimation. arXiv preprint arXiv:1410.8516. Cited by: TABLE I, §3.2.
 [37] (2016) Density estimation using real nvp. arXiv preprint arXiv:1605.08803. Cited by: TABLE I, §3.2.
 [38] (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE ICCV, pp. 1422–1430. Cited by: TABLE I, Fig. 6, §4.1.1.
 [39] (2016) Adversarial feature learning. arXiv preprint arXiv:1605.09782. Cited by: TABLE I, Fig. 13, §5.2, §5.2, §5.5.
 [40] (2019) Large scale adversarial representation learning. In NIPS, pp. 10541–10551. Cited by: TABLE I, §5.2.
 [41] (2018) Learning structural node embeddings via diffusion wavelets. In Proceedings of the 24th ACM SIGKDD, pp. 1320–1329. Cited by: §4.2.2.
 [42] (2016) Adversarially learned inference. arXiv preprint arXiv:1606.00704. Cited by: TABLE I, Fig. 13, §5.2, §5.5.
 [43] (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §5.6.
 [44] (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728. Cited by: TABLE I, Fig. 6, §4.1.1.
 [45] (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §1.
 [46] (2014) Generative adversarial nets. In NIPS, pp. 2672–2680. Cited by: TABLE I, §6.1.1.
 [47] (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD, pp. 855–864. Cited by: TABLE I.
 [48] (2018) Graphite: iterative generative modeling of graphs. In ICML, Cited by: §2.3.
 [49] (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304. Cited by: §4, §6.3.
 [50] (2020) Realm: retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909. Cited by: §5.4.
 [51] (2017) Representation learning on graphs: methods and applications. IEEE Data Eng. Bull. 40, pp. 52–74. Cited by: §2.3.
 [52] (2017) Inductive representation learning on large graphs. In NIPS, Cited by: §2.3.
 [53] (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: TABLE I, Fig. 10, §4.2.2, §4.2, §4.2, §6.3, §7, §7.
 [54] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §2.2, §2.2.
 [55] (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: TABLE I, Fig. 7, §4.1.2, §6.2.2.
 [56] (2019) Flow++: improving flow-based generative models with variational dequantization and architecture design. In ICML, pp. 2722–2730. Cited by: §3.4.2.
 [57] (2019) Strategies for pre-training graph neural networks. In ICLR, Cited by: TABLE I, §4.1.2.
 [58] (2020) Heterogeneous graph transformer. arXiv preprint arXiv:2003.01332. Cited by: §1.
 [59] (2017) Densely connected convolutional networks. 2017 IEEE CVPR, pp. 2261–2269. Cited by: §1, §2.2.
 [60] (2017) Globally and locally consistent image completion. ACM Transactions on Graphics (ToG) 36 (4), pp. 1–14. Cited by: §5.2, §5.3, §5.3.
 [61] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. ArXiv abs/1502.03167. Cited by: §2.2.
 [62] (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §5.2.
 [63] (2020) GCC: graph contrastive coding for graph neural network pre-training. Cited by: TABLE I, §4.2.2.
 [64] (2019) Self-supervised visual feature learning with deep neural networks: a survey. arXiv preprint arXiv:1902.06162. Cited by: §1, §4.1.1.
 [65] (2020) Spanbert: improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics 8, pp. 64–77. Cited by: TABLE I, §3.3.3.
 [66] (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §5.2.
 [67] (2018) Learning image representations by completing damaged jigsaw puzzles. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 793–802. Cited by: TABLE I, Fig. 6, §4.1.1.
 [68] (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §3.3.4, §6.2.2.
 [69] (2018) Glow: generative flow with invertible 1x1 convolutions. In NIPS, pp. 10215–10224. Cited by: TABLE I, §3.2.
 [70] (2016) Semisupervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §2.3, §3.3.4.
 [71] (2016) Variational graph autoencoders. arXiv preprint arXiv:1611.07308. Cited by: TABLE I, §3.3.4.
 [72] (2019) A mutual information maximization perspective of language representation learning. arXiv preprint arXiv:1910.08350. Cited by: TABLE I, §4.1.2.
 [73] (2012) Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105. Cited by: §2.2, §4.2.1.
 [74] (2019) Albert: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §1, TABLE I, §4.1.1.
 [75] (2016) Learning representations for automatic colorization. In ECCV, pp. 577–593. Cited by: TABLE I, §5.3, §5.3.
 [76] (2017) Colorization as a proxy task for visual understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6874–6883. Cited by: §5.3, §5.3.
 [77] (1989) Backpropagation applied to handwritten zip code recognition. Neural Comput. 1 (4), pp. 541–551. Cited by: §2.2.
 [78] (2015) Deep learning. nature 521 (7553), pp. 436–444. Cited by: §1.
 [79] (1998) Efficient backprop. In Neural Networks: Tricks of the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop, Berlin, Heidelberg, pp. 9–50. External Links: ISBN 3540653112 Cited by: §2.2.
 [80] (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Cited by: TABLE I, §5.2, §5.3, §5.3.
 [81] (2016) Unsupervised visual representation learning by graph-based consistent constraints. In ECCV, pp. 678–694. Cited by: §4.2.1.
 [82] (2018) Adaptive graph convolutional neural networks. ArXiv abs/1801.03226. Cited by: §2.3.
 [83] (2012) Sentiment analysis and opinion mining. Synthesis lectures on human language technologies 5 (1), pp. 1–167. Cited by: §1.
 [84] (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1, §4.1.1.
 [85] (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §1.
 [86] (2015) Adversarial autoencoders. arXiv preprint arXiv:1511.05644. Cited by: TABLE I, §5.2.
 [87] (2015) Masked autoencoder for distribution estimation. Cited by: §3.4.1.
 [88] (2013) Efficient estimation of word representations in vector space. CoRR abs/1301.3781. Cited by: §2.1, TABLE I.
 [89] (2013) Distributed representations of words and phrases and their compositionality. In NIPS’13, pp. 3111–3119. Cited by: §2.1, TABLE I, §3.3.2.
 [90] (2019) Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991. Cited by: TABLE I, Fig. 6, §4.1.1, §4.2.2.
 [91] (2010) Rectified linear units improve restricted boltzmann machines. In ICML, Cited by: §2.2.
 [92] (2011) Sparse autoencoder. CS294A Lecture notes 72 (2011), pp. 1–19. Cited by: §3.3.1.
 [93] (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pp. 69–84. Cited by: Fig. 6, §4.1.1.
 [94] (2018) Boosting selfsupervised learning via knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9359–9367. Cited by: §4.2.1.
 [95] (2016) f-GAN: training generative neural samplers using variational divergence minimization. In NIPS, pp. 271–279. Cited by: §6.1.1.
 [96] (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: TABLE I, §4.1.2, §4, §6.2.2.
 [97] (2016) Asymmetric transitivity preserving graph embedding. In KDD ’16, Cited by: §2.3.
 [98] (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2536–2544. Cited by: TABLE I, §5.2, §5.3, §5.3.
 [99] (2020) Self-supervised graph representation learning via global context prediction. arXiv preprint arXiv:2003.01604. Cited by: §4.1.2.
 [100] (2020) Self-supervised graph representation learning via global context prediction. ArXiv abs/2003.01604. Cited by: §2.3, TABLE I.
 [101] (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD, pp. 701–710. Cited by: §2.3, TABLE I, §3.3.2.
 [102] (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §2.1.
 [103] (2019) MolecularRNN: generating realistic molecular graphs with optimized properties. arXiv preprint arXiv:1905.13372. Cited by: §3.1.
 [104] (2018) Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 459–467. Cited by: §2.3.
 [105] (2018) Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. In WSDM ’18, Cited by: §2.3.
 [106] (2018) Deepinf: social influence prediction with deep learning. In KDD’18, pp. 2110–2119. Cited by: §1.
 [107] (2020) Pretrained models for natural language processing: a survey. arXiv preprint arXiv:2003.08271. Cited by: §1.
 [108] (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §5.1, §5.2.
 [109] Improving language understanding by generative pre-training.