# Demystifying Self-Supervised Learning: An Information-Theoretical Framework

Self-supervised representation learning adopts self-defined signals as supervision and uses the learned representation for downstream tasks, such as masked language modeling (e.g., BERT) for natural language processing and contrastive visual representation learning (e.g., SimCLR) for computer vision applications. In this paper, we present a theoretical framework explaining that self-supervised learning is likely to work under the assumption that only the shared information (e.g., contextual information or content) between the input (e.g., non-masked words or original images) and self-supervised signals (e.g., masked-words or augmented images) contributes to downstream tasks. Under this assumption, we demonstrate that self-supervisedly learned representation can extract task-relevant and discard task-irrelevant information. We further connect our theoretical analysis to popular contrastive and predictive (self-supervised) learning objectives. In the experimental section, we provide controlled experiments on two popular tasks: 1) visual representation learning with various self-supervised learning objectives to empirically support our analysis; and 2) visual-textual representation learning to challenge that input and self-supervised signal lie in different modalities.

## 1 Introduction

Self-supervised learning (SSL) zhang2016colorful ; srivastava2015unsupervised ; devlin2018bert ; oord2018representation ; bachman2019learning ; chen2020simple ; tian2019contrastive ; hjelm2018learning ; he2019momentum ; kong2019mutual ; arora2019theoretical learns representations using a proxy objective (i.e., an SSL objective) between inputs and self-defined signals. Empirical evidence suggests that the learned representations can generalize well to a wide range of downstream tasks, even when there is no clear connection between the SSL objective and the downstream tasks. For example, BERT devlin2018bert defines a prediction loss (i.e., an SSL objective) from non-masked words (i.e., inputs) to masked words (i.e., self-supervised signals). One then takes BERT as a word-feature extractor and applies the word features to various natural language processing applications, spanning sentiment analysis, question answering, dialogue systems, and named-entity recognition young2018recent . Despite this practical success, only a few works arora2019theoretical provide theoretical insights into SSL. In particular, Arora et al. arora2019theoretical presented provable guarantees on downstream classification performance when using contrastive learning objectives in SSL. Our work shares the goal of demystifying SSL but approaches it from an information-theoretic cover2012elements perspective to understand when and why self-supervised learning is likely to work.

Our starting point is the assumption that only the information shared between the input and the self-supervised signals contributes to downstream tasks. Based on this assumption, we develop an unsupervised compressed representation learning strategy. In particular, we extract task-relevant information by maximizing the mutual information between the learned representations and the self-supervised signals. Then, we discard task-irrelevant information by minimizing the conditional entropy of the learned representations given the self-supervised signals. We show this strategy 1) subsumes prior art for SSL on contrastive agrawal2015learning ; arandjelovic2017look ; jayaraman2015learning ; oord2018representation ; bachman2019learning ; chen2020simple ; tian2019contrastive ; hjelm2018learning ; he2019momentum ; kong2019mutual ; ozair2019wasserstein ; arora2019theoretical ; henaff2019data and predictive learning zhang2016colorful ; pathak2016context ; vondrick2016generating ; tulyakov2018mocogan ; srivastava2015unsupervised ; peters2018deep ; devlin2018bert ; dai2019transformer ; bai2018empirical approaches; 2) paves the way to a larger space of composed SSL objectives; and 3) leads to a discussion of the limitations and challenges of using these objectives. For instance, we can combine both contrastive and predictive learning approaches as our SSL objective, being aware that the contrastive objective requires a larger batch size and the predictive objective is hard to optimize if the self-supervised signals are high-dimensional.

We first conduct controlled experiments on visual representation learning to 1) verify that the self-supervisedly learned representation can extract task-relevant and discard task-irrelevant information; and 2) compare different compositions of SSL objectives. Then, we perform self-supervised visual-textual representation learning in a challenging setting in which the input and self-supervised signals lie in very different modalities. We make our experiments publicly available at https://github.com/yaohungt/Demystifying_Self_Supervised_Learning.

## 2 An Information-Theoretical Framework for Self-supervised Learning

In this section we aim to show that self-supervised learning (SSL) can learn a representation that is beneficial for downstream tasks. For the input, we denote its random variable as $X$, sample space as $\mathcal{X}$, and outcome as $x$. Similarly, for the self-supervised signal, we denote its random variable/sample space/outcome as $S$/$\mathcal{S}$/$s$. The two sample spaces can be different: $\mathcal{X} \neq \mathcal{S}$. We learn a representation ($Z_X$/$\mathcal{Z}$/$z_x$) from the input through a deterministic mapping $F_X$: $Z_X = F_X(X)$. The information required for downstream tasks is referred to as “task-relevant information”: $T$/$\mathcal{T}$/$t$. Note that SSL has no access to the task-relevant information. Lastly, we use $I(A;B)$ to represent mutual information, $I(A;B|C)$ to represent conditional mutual information, and $H(A|B)$ to represent conditional entropy for random variables $A$/$B$/$C$. We provide high-level takeaways for our main results in Figure 1.

### 2.1 Redundancy Assumption and Determinism

The derivations throughout the paper rely on the following redundancy assumption and determinism lemma. First, we assume redundancy between the input $X$ and the self-supervised signal $S$:

###### Assumption 1 (Redundancy).

The input $X$ is redundant to the self-supervised signal $S$ for the task-relevant information $T$. In other words, we assume the following conditional independence: $T \perp X \mid S$ or equivalently $I(T;X|S) = 0$. We assume the redundancy also holds when we swap $X$ and $S$, and hence $T \perp S \mid X$ or equivalently $I(T;S|X) = 0$. By mutual redundancy, $I(X;T) = I(S;T)$.

Assumption 1 states that the information required for the downstream tasks lies only in the shared information between the input and self-supervised signals. We provide an intuition by relating the assumption to Multiview learning xu2013survey ; sridharan2008information . Multiview learning extracts representations from data across different views, and it assumes each view provides the same task-relevant information. In SSL, we can regard the input and self-supervised signals as different views of the data. For instance, in contrastive visual representation learning hjelm2018learning ; chen2020simple , the input and the corresponding self-supervised signal are the same image with different image augmentations (images with different views).
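To make Assumption 1 concrete, the following is a minimal sketch that checks the redundancy conditions on a toy discrete distribution. The toy model is an assumption made purely for illustration and does not come from the paper: $T$ is a binary label, and $X$ and $S$ each combine the label with a shared nuisance bit and a private "style" bit. The helper names (`marginal`, `cond_mutual_info`) are hypothetical.

```python
import math

def marginal(joint, idx):
    """Marginalize a joint pmf {outcome tuple: prob} onto the coordinates in idx."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def cond_mutual_info(joint, a, b, c=()):
    """I(A;B|C) in nats; a, b, c are tuples of coordinate indices (c may be empty)."""
    p_abc = marginal(joint, a + b + c)
    p_ac = marginal(joint, a + c)
    p_bc = marginal(joint, b + c)
    p_c = marginal(joint, c)
    total = 0.0
    for key, p in p_abc.items():
        if p == 0.0:
            continue
        ka = key[:len(a)]
        kb = key[len(a):len(a) + len(b)]
        kc = key[len(a) + len(b):]
        total += p * math.log(p * p_c[kc] / (p_ac[ka + kc] * p_bc[kb + kc]))
    return total

# Toy joint over (T, X, S): all four bits are independent and uniform.
joint = {}
for t in (0, 1):                  # task label
    for shared in (0, 1):         # nuisance shared by X and S, irrelevant to T
        for style_x in (0, 1):    # private to X
            for style_s in (0, 1):  # private to S
                x = (t, shared, style_x)
                s = (t, shared, style_s)
                joint[(t, x, s)] = 1.0 / 16.0

T, X, S = (0,), (1,), (2,)
i_tx_given_s = cond_mutual_info(joint, T, X, S)  # redundancy: should be 0
i_ts_given_x = cond_mutual_info(joint, T, S, X)  # mutual redundancy: should be 0
i_xt = cond_mutual_info(joint, X, T)
i_st = cond_mutual_info(joint, S, T)
i_xs = cond_mutual_info(joint, X, S)
i_xs_given_t = cond_mutual_info(joint, X, S, T)  # shared but task-irrelevant information
print(i_tx_given_s, i_ts_given_x, i_xt, i_st, i_xs, i_xs_given_t)
```

In this toy model both conditional mutual informations vanish, $I(X;T) = I(S;T)$, and the shared nuisance bit shows up as $I(X;S|T) > 0$, which is the kind of shared but task-irrelevant information that the later compression-gap discussion is about.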

Next, we provide a useful lemma using the fact that $F_X$ is a deterministic mapping:

###### Lemma 1 (Determinism).

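If $P(Z_X|X)$ is Dirac, then the following conditional independence holds: $T \perp Z_X \mid X$ and $S \perp Z_X \mid X$, given by a Markov chain in which $S$ and $T$ influence $Z_X$ only through $X$. (Footnote 1: The Markov chain is naturally satisfied when $F_X$ is a deterministic mapping. If $F_X$ is random, the Markov chain needs to be further assumed to satisfy the conditional independence: $T \perp Z_X \mid X$ and $S \perp Z_X \mid X$.)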

This lemma simply states that $Z_X$ contains no more information than $X$.
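The statement that a deterministic representation cannot carry more information than its input can be checked numerically. The sketch below is illustrative only: the joint distribution over $(X, T)$ is random, and the two mappings (`f_lossy`, `f_bijective`) are hypothetical choices, not anything used in the paper.

```python
import numpy as np

def mutual_information(p_ab):
    """I(A;B) in nats for a joint pmf given as a 2-D array (rows: A, columns: B)."""
    p_ab = np.asarray(p_ab, dtype=float)
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])))

def push_forward(p_xt, f, num_z):
    """Joint pmf of (Z, T) when Z = f(X) is a deterministic map given as a list of indices."""
    p_zt = np.zeros((num_z, p_xt.shape[1]))
    for x, z in enumerate(f):
        p_zt[z] += p_xt[x]
    return p_zt

rng = np.random.default_rng(0)
p_xt = rng.random((6, 3))
p_xt /= p_xt.sum()                 # random joint pmf over 6 inputs and 3 task outcomes

f_lossy = [0, 0, 1, 1, 2, 2]       # merges pairs of inputs
f_bijective = [3, 4, 5, 0, 1, 2]   # only relabels inputs

i_xt = mutual_information(p_xt)
i_zt_lossy = mutual_information(push_forward(p_xt, f_lossy, 3))
i_zt_bijective = mutual_information(push_forward(p_xt, f_bijective, 6))
print(i_xt, i_zt_lossy, i_zt_bijective)
```

The lossy map can only lower (or keep) the information about $T$, while the relabeling preserves it, which is the data processing inequality that the proofs in this section rely on.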

### 2.2 Supervised Representation Learning

Under a supervised setting, to learn representations which contain only and no more than the information required for the downstream tasks, we consider the following objectives:

###### Definition 1 (Supervised Representation Learning).

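Uncompressed and compressed supervised representations are defined as

$$Z_X^{\sup} = \arg\max_{Z_X} I(Z_X;T) \quad \text{and} \quad Z_X^{\sup_{\rm com}} = \arg\min_{Z_X} H(Z_X|T)\ \ \text{s.t.}\ \ I(Z_X;T)\ \text{is maximized}.$$

Then, $Z_X^{\sup}$ and $Z_X^{\sup_{\rm com}}$ contain all task-relevant information in the input, $I(X;T)$.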

###### Proof.

Adopting the Data Processing Inequality cover2012elements in the Markov chain of Lemma 1, $I(Z_X;T)$ is maximized at $I(X;T)$. $I(X;T) = I(S;T)$ by Assumption 1. ∎

The definition shows that the supervisedly learned representation $Z_X^{\sup}$/$Z_X^{\sup_{\rm com}}$ can extract relevant information for the downstream tasks. Next, we provide a justification that minimizing $H(Z_X|T)$ leads to compressed representations. (Footnote 2: To discard task-irrelevant information, an alternative objective is minimizing $I(Z_X;X|T)$, which represents the information between $Z_X$ and $X$ that is irrelevant to $T$. However, minimizing the conditional mutual information (i.e., $I(Z_X;X|T)$) requires a min-max optimization, which may cause instability in practice. Hence, we consider minimizing $H(Z_X|T)$, which does not contain a min-max optimization.) Minimizing $H(Z_X|T)$ reduces the randomness from $T$ to $Z_X$, and the randomness is regarded as the incompressibility calude2013information . Hence, when satisfying the constraint “$I(Z_X;T)$ is maximized”, minimizing $H(Z_X|T)$ leads to a more compressed representation (discarding superfluous information). Note that our analysis does not constrain the type of $T$, which can be classification, regression, or clustering.

### 2.3 A Self-supervised Representation Learning Strategy

In Definition 1, we discuss uncompressed and compressed supervised representation learning objectives. To bridge the gap between supervised and self-supervised learning, we perform the following supervision decomposition (from the downstream tasks to the self-supervised signals):

###### Lemma 2 (Supervision Decomposition).

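We consider the supervision decomposition from $T$ to $S$:

$$I(Z_X;S) = I(Z_X;T) + I(Z_X;S|T) \quad \text{and} \quad H(Z_X|S) = H(Z_X|T) - I(Z_X;S|T).$$

Also, $I(X;S) = I(X;T) + I(X;S|T)$ and $H(X|S) = H(X|T) - I(X;S|T)$.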

The decomposition allows us to 1) perform supervision on $S$ (i.e., self-supervised learning) instead of $T$ (i.e., supervised learning); 2) associate supervisedly- and self-supervisedly-learned representations; and 3) characterize the compression gap from supervised to self-supervised learning. Formally,

###### Definition 2 (Self-supervised Representation Learning).

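Uncompressed and compressed self-supervised representations are defined as

$$Z_X^{\rm ssl} = \arg\max_{Z_X} I(Z_X;S) \quad \text{and} \quad Z_X^{{\rm ssl}_{\rm com}} = \arg\min_{Z_X} H(Z_X|S)\ \ \text{s.t.}\ \ I(Z_X;S)\ \text{is maximized}.$$

Then, $Z_X^{\rm ssl}$ and $Z_X^{{\rm ssl}_{\rm com}}$ contain all the shared information between $X$ and $S$.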

###### Proof.

Adopting the Data Processing Inequality cover2012elements in the Markov chain of Lemma 1, $I(Z_X;S)$ is maximized at $I(X;S)$. ∎

###### Theorem 1 (Inclusion).

Uncompressed and compressed self-supervised representations extract all task-relevant information:

$$\text{If}\ I(Z_X;S)\ \text{is maximized, then}\ I(Z_X;T)\ \text{is maximized and}\ I(Z_X;S|T)\ \text{is maximized}.$$

In other words, the compressed self-supervised representation is a subset of the uncompressed self-supervised representation, and the latter is a subset of the supervised representation: $Z_X^{{\rm ssl}_{\rm com}} \subseteq Z_X^{\rm ssl} \subseteq Z_X^{\sup}$.

###### Proof.

Adopting the Data Processing Inequality cover2012elements in the Markov chain of Lemma 1, $I(Z_X;S|T)$ is maximized at $I(X;S|T)$. Then, bringing the results in Definitions 1 and 2 into Lemma 2, we conclude that $I(Z_X;S)$ is maximized if and only if $I(Z_X;T)$ and $I(Z_X;S|T)$ are both maximized. ∎

###### Theorem 2 (Compression Gap).

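The compressed self-supervised representation cannot discard all task-irrelevant information; a compression gap exists:

$$\begin{aligned} Z_X^{{\rm ssl}_{\rm com}} &= \arg\min_{Z_X} H(Z_X|S)\ \ \text{s.t.}\ \ I(Z_X;S)\ \text{is maximized} \\ &= \arg\min_{Z_X} H(Z_X|T)\ \ \text{s.t.}\ \ I(Z_X;T)\ \text{is maximized and}\ I(Z_X;S|T)\ \text{is maximized}, \end{aligned}$$

where $I(X;S|T)$ is the information that cannot be discarded in SSL.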

###### Proof.

In Theorem 1, we show that $I(Z_X;S)$ is maximized if and only if $I(Z_X;T)$ and $I(Z_X;S|T)$ are both maximized, where $I(Z_X;S|T)$ is maximized at $I(X;S|T)$. Following Lemma 2, $H(Z_X|S) = H(Z_X|T) - I(X;S|T)$, where $I(X;S|T)$ is constant w.r.t. $Z_X$. We conclude the proof by plugging the result into Definition 2. ∎

As a summary, Definition 2 defines our compressed SSL strategy. Theorem 1 indicates that this strategy can extract as much task-relevant information as the supervised learned one. For how much task-irrelevant information can be discarded, Theorem 2 indicates a compression gap between the supervised and the self-supervised learning.

### 2.4 Relations with Contrastive and Predictive Representation Learning

#### Contrastive Learning

We define the contrastive learning objective as maximizing the mutual information between the learned representation $Z_X$ and the self-supervised signal $S$, which maximizes dependency/contrastiveness between $Z_X$ and $S$. Given Theorem 1, we have:

###### Corollary 1 (Contrastive learning optimally extracting task-relevant info).

If $Z_X^{*} = \arg\max_{Z_X} I(Z_X;S)$, then $Z_X^{*}$ contains all task-relevant information.

The corollary suggests that, even with no access to the downstream tasks, maximizing $I(Z_X;S)$ results in $Z_X$ containing all the information required for the downstream tasks from $X$/$S$. To deploy the contrastive learning objective, recent methods propose to maximize lower bounds of mutual information belghazi2018mine ; oord2018representation ; poole2019variational ; song2019understanding or its variants such as JS-divergence poole2019variational ; hjelm2018learning between the joint density and the product of the marginal densities. We denote these methods as $\hat{I}_\theta(Z_X;S)$, with $\theta$ representing the parameters when computing $\hat{I}_\theta(Z_X;S)$. In this work, we suggest contrastive predictive coding (CPC) oord2018representation ; tian2019contrastive , which is a mutual information lower bound with lower variance poole2019variational ; song2019understanding :

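$$L_{CL} := \max_{Z_S = F_S(S),\, Z_X = F_X(X),\, G}\ \mathbb{E}_{(z_{s_1}, z_{x_1}), \cdots, (z_{s_n}, z_{x_n}) \sim P^n(Z_S, Z_X)} \left[ \frac{1}{n} \sum_{i=1}^{n} \log \frac{e^{\langle G(z_{x_i}), G(z_{s_i}) \rangle}}{\frac{1}{n} \sum_{j=1}^{n} e^{\langle G(z_{x_i}), G(z_{s_j}) \rangle}} \right], \tag{1}$$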

where $F_S$ is a deterministic mapping and $G$ is a projection head that projects a representation in $\mathcal{Z}$ into a lower-dimensional vector. If the input and self-supervised signals share the same sample space, i.e., $\mathcal{X} = \mathcal{S}$, we can impose $F_X = F_S$ (e.g., self-supervised visual representation learning chen2020simple ). The projection head, $G$, can be an identity, a linear, or a non-linear mapping. Last, we note that modeling eq. (1) or other contrastive learning objectives belghazi2018mine ; poole2019variational often requires a large batch size (e.g., large $n$ in eq. (1)) hjelm2018learning ; he2019momentum ; chen2020simple to ensure both low variance and low bias (w.r.t. the true $I(Z_X;S)$). Empirical work tschannen2019mutual has suggested that large variance in contrastive learning objectives may lead to worse performance for the downstream tasks.
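As an illustration of eq. (1), the following is a minimal NumPy sketch of the batch estimate inside the expectation, evaluated for fixed encoders and projection head. It is a sketch under stated assumptions, not the released implementation: the function name `cpc_objective` is hypothetical, and its inputs are assumed to be the already-projected vectors $G(z_{x_i})$ and $G(z_{s_i})$ with row $i$ of each array forming a positive pair.

```python
import numpy as np

def cpc_objective(gx, gs):
    """Batch estimate of eq. (1): mean_i [ <gx_i, gs_i> - log( (1/n) sum_j exp(<gx_i, gs_j>) ) ]."""
    gx = np.asarray(gx, dtype=float)
    gs = np.asarray(gs, dtype=float)
    scores = gx @ gs.T                      # scores[i, j] = <G(z_x_i), G(z_s_j)>
    n = scores.shape[0]
    # Numerically stable log of the row-wise mean of exp(scores).
    row_max = scores.max(axis=1, keepdims=True)
    log_mean_exp = row_max[:, 0] + np.log(np.exp(scores - row_max).sum(axis=1)) - np.log(n)
    return float(np.mean(np.diag(scores) - log_mean_exp))

rng = np.random.default_rng(0)
n, d = 8, 4
random_value = cpc_objective(rng.normal(size=(n, d)), rng.normal(size=(n, d)))
aligned = 10.0 * np.eye(n)                  # each pair matches strongly, mismatches score 0
aligned_value = cpc_objective(aligned, aligned)
uninformative_value = cpc_objective(np.zeros((n, d)), np.zeros((n, d)))
print(random_value, aligned_value, uninformative_value, np.log(n))
```

Each summand is at most $\log n$, so the estimate saturates at $\log n$ for a batch of size $n$; this is one way to see why a large batch size is needed when the true mutual information is large.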

#### Forward Predictive Learning

We define forward predictive learning as maximizing the log conditional likelihood from the learned representation $Z_X$ to the self-supervised signal $S$, which encourages $Z_X$ to reconstruct $S$. By the chain rule, $I(Z_X;S) = H(S) - H(S|Z_X)$, where $H(S)$ is irrelevant to $Z_X$. Hence, maximizing $I(Z_X;S)$ is equivalent to maximizing $-H(S|Z_X) = \mathbb{E}_{P_{S,Z_X}}[\log P(S|Z_X)]$. Given Theorem 1, we have:

###### Corollary 2 (Forward Predictive learning optimally extracting task-relevant info).

If $Z_X^{*} = \arg\max_{Z_X} \mathbb{E}_{P_{S,Z_X}}[\log P(S|Z_X)]$, then $Z_X^{*}$ contains all task-relevant information.

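The corollary suggests that, if $Z_X$ can perfectly reconstruct $S$ for any $(s, z_x) \sim P_{S,Z_X}$, then $Z_X$ contains all the information required for the downstream tasks from $X$/$S$. A common approach to avoid the intractability of the objective in Corollary 2 is to assume a variational distribution $Q_\phi(S|Z_X)$, with $\phi$ representing the parameters used to compute $Q_\phi(S|Z_X)$. Now, we re-arrange $\mathbb{E}_{P_{S,Z_X}}[\log Q_\phi(S|Z_X)] = \mathbb{E}_{P_{S,Z_X}}[\log P(S|Z_X)] - \mathbb{E}_{P_{Z_X}}\big[D_{\rm KL}\big(P(S|Z_X)\,\|\,Q_\phi(S|Z_X)\big)\big]$. Hence, $\mathbb{E}_{P_{S,Z_X}}[\log Q_\phi(S|Z_X)]$ is a lower bound of $\mathbb{E}_{P_{S,Z_X}}[\log P(S|Z_X)]$. The bound is tight when $Q_\phi(S|Z_X) = P(S|Z_X)$. $Q_\phi$ can be any distribution such as Gaussian or Laplacian, and $\phi$ can be a linear model, a kernel method, or a neural network. For example, MocoGAN tulyakov2018mocogan assumes $Q_\phi$ is Laplacian (i.e., $\ell_1$ reconstruction loss) and $\phi$ is a deconvolutional network long2015fully . Transformer-XL dai2019transformer assumes $Q_\phi$ is a categorical distribution (i.e., cross entropy loss) and $\phi$ is a Transformer network vaswani2017attention . If we let $Q_\phi$ be Gaussian with mean $R(Z_X)$ and covariance $\mathbf{I}$, with $\mathbf{I}$ as an identity matrix, the objective becomes: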

$$L_{FP} := \max_{Z_X = F_X(X),\, R}\ \mathbb{E}_{s, z_x \sim P_{S, Z_X}} \left[ -\| s - R(z_x) \|_2^2 \right], \tag{2}$$

where $R$ is a deterministic mapping to reconstruct $S$ from $Z_X$. Note that we ignore the constants derived from the Gaussian distribution. Last, in most real-world applications, the self-supervised signal $S$ has a much higher dimension than the representation $Z_X$. Hence, modeling a conditional generative model $P(S|Z_X)$ will be challenging. For example, consider $S$ as an image and $Z_X$ as a vector of much lower dimension.
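The step from the variational bound to the squared-error objective in eq. (2) can be written out as a short worked derivation. This is a sketch in notation that is partly assumed here: $Q_\phi(S|Z_X)$ denotes the variational distribution, $R$ its mean function, and $d$ (not used elsewhere in the text) the dimension of $S$.

```latex
\begin{aligned}
\mathbb{E}_{P_{S,Z_X}}\big[\log Q_\phi(S|Z_X)\big]
  &= \mathbb{E}_{P_{S,Z_X}}\big[\log P(S|Z_X)\big]
   - \mathbb{E}_{P_{Z_X}}\Big[ D_{\mathrm{KL}}\big(P(S|Z_X)\,\|\,Q_\phi(S|Z_X)\big) \Big] \\
  &\le \mathbb{E}_{P_{S,Z_X}}\big[\log P(S|Z_X)\big]
   = -H(S|Z_X) = I(Z_X;S) - H(S), \\[4pt]
\log \mathcal{N}\big(s \mid R(z_x), \mathbf{I}\big)
  &= -\tfrac{1}{2}\,\| s - R(z_x) \|_2^2 - \tfrac{d}{2}\log(2\pi).
\end{aligned}
```

Since the KL term is non-negative, maximizing the left-hand side over $Z_X$ and $\phi$ maximizes a lower bound of $I(Z_X;S)$ up to the constant $H(S)$. With the Gaussian choice, dropping the factor $\tfrac{1}{2}$ and the additive constant leaves exactly the $-\|s - R(z_x)\|_2^2$ term of eq. (2).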

#### Inverse Predictive Learning

We define inverse predictive learning as maximizing the log conditional likelihood from the self-supervised signal $S$ to the learned representation $Z_X$, which encourages $S$ to reconstruct $Z_X$. Given Theorem 2 together with $-H(Z_X|S) = \mathbb{E}_{P_{S,Z_X}}[\log P(Z_X|S)]$, we have:

###### Corollary 3 (Inverse Predictive learning sub-optimally discarding task-irrelevant info).

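Suppose $Z_X^{*} = \arg\max_{Z_X} \mathbb{E}_{P_{S,Z_X}}[\log P(Z_X|S)]$ subject to $I(Z_X;S)$ being maximized. Then, $Z_X^{*}$ discards all the information, excluding $I(X;S|T)$, irrelevant for the downstream tasks.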

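The corollary suggests that, if $S$ can perfectly reconstruct $Z_X$ for any $(s, z_x) \sim P_{S,Z_X}$ under the constraint that $I(Z_X;S)$ is maximized, then $Z_X$ discards the information, excluding $I(X;S|T)$, irrelevant for the downstream tasks. Similar to forward predictive learning, we use $\mathbb{E}_{P_{S,Z_X}}[\log Q_\phi(Z_X|S)]$ as a lower bound of $\mathbb{E}_{P_{S,Z_X}}[\log P(Z_X|S)]$. In our deployment, we take advantage of the design in eq. (1) and let $Q_\phi(Z_X|S)$ be Gaussian with mean $F_S(S)$ and covariance $\mathbf{I}$, with $\mathbf{I}$ being an identity matrix: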

$$L_{IP} := \max_{Z_S = F_S(S),\, Z_X = F_X(X)}\ \mathbb{E}_{z_s, z_x \sim P_{Z_S, Z_X}} \left[ -\| z_x - z_s \|_2^2 \right]. \tag{3}$$

Note that optimizing eq. (3) alone results in a degenerate solution, e.g., learning $Z_X$ and $Z_S$ to be the same constant. As suggested in Corollary 3, we consider a constrained optimization instead of an unconstrained one.
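The degenerate solution mentioned above is easy to reproduce. The following minimal sketch evaluates a Monte Carlo estimate of the term inside eq. (3) for fixed encoders; the function name `inverse_predictive_objective` and the synthetic feature arrays are assumptions made for illustration.

```python
import numpy as np

def inverse_predictive_objective(zx, zs):
    """Monte Carlo estimate of the term in eq. (3): mean over pairs of -||z_x - z_s||_2^2."""
    zx = np.asarray(zx, dtype=float)
    zs = np.asarray(zs, dtype=float)
    return float(-np.mean(np.sum((zx - zs) ** 2, axis=1)))

rng = np.random.default_rng(0)
n, d = 16, 8

# Informative but imperfectly aligned representations (synthetic).
zx = rng.normal(size=(n, d))
zs = zx + 0.1 * rng.normal(size=(n, d))
informative_value = inverse_predictive_objective(zx, zs)

# Collapsed encoders: every input maps to the same constant vector.
collapsed = np.ones((n, d))
collapsed_value = inverse_predictive_objective(collapsed, collapsed)
collapsed_variance = float(np.var(collapsed))

print(informative_value, collapsed_value, collapsed_variance)
```

The collapsed encoders attain the maximum value of zero while carrying no information about the input, which is why eq. (3) is only meaningful together with the constraint that $I(Z_X;S)$ is maximized (e.g., by adding the contrastive term).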

#### Composing Self-supervised Learning Objectives

We have connected the SSL strategy presented in Definition 2 to contrastive learning objective in Corollary 1 and predictive learning objectives in Corollaries 2 and 3. Bringing their practical aspects together (eq. (1), (2), and (3)), we can pave the way to a larger space of composing SSL objectives:

$$L_{SSL} = \lambda_{CL} L_{CL} + \lambda_{FP} L_{FP} + \lambda_{IP} L_{IP}, \tag{4}$$

where $\lambda_{CL}$, $\lambda_{FP}$, and $\lambda_{IP}$ are hyper-parameters.
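A minimal sketch of how the three terms of eq. (4) could be combined on one batch is given below. It is not the released implementation: the function names are hypothetical, the projection head is taken to be the identity, the inputs are precomputed feature arrays, and the reconstruction `s_hat` stands for $R(z_x)$.

```python
import numpy as np

def contrastive_term(zx, zs):
    """Batch estimate of eq. (1) with an identity projection head."""
    scores = zx @ zs.T
    row_max = scores.max(axis=1, keepdims=True)
    log_mean_exp = row_max[:, 0] + np.log(np.mean(np.exp(scores - row_max), axis=1))
    return float(np.mean(np.diag(scores) - log_mean_exp))

def forward_predictive_term(s, s_hat):
    """Batch estimate of eq. (2): negative squared error between s and its reconstruction."""
    return float(-np.mean(np.sum((s - s_hat) ** 2, axis=1)))

def inverse_predictive_term(zx, zs):
    """Batch estimate of eq. (3): negative squared distance between paired representations."""
    return float(-np.mean(np.sum((zx - zs) ** 2, axis=1)))

def ssl_objective(zx, zs, s, s_hat, lam_cl=1.0, lam_fp=0.0, lam_ip=0.0):
    """Weighted sum of eq. (4); the objective is to be maximized."""
    return (lam_cl * contrastive_term(zx, zs)
            + lam_fp * forward_predictive_term(s, s_hat)
            + lam_ip * inverse_predictive_term(zx, zs))

rng = np.random.default_rng(0)
n, d, p = 8, 4, 32
zx = rng.normal(size=(n, d))
zs = zx + 0.1 * rng.normal(size=(n, d))
s = rng.normal(size=(n, p))                  # flattened self-supervised signals (synthetic)
s_hat = s + 0.5 * rng.normal(size=(n, p))    # stand-in for the reconstruction R(z_x)

contrastive_only = ssl_objective(zx, zs, s, s_hat, lam_cl=1.0)
combined = ssl_objective(zx, zs, s, s_hat, lam_cl=1.0, lam_fp=0.5, lam_ip=2.0)
print(contrastive_only, combined)
```

In a training loop the same weighted sum would be computed with a differentiable framework and maximized with respect to the encoder, projection, and reconstruction parameters; the weights play the role of $\lambda_{CL}$, $\lambda_{FP}$, and $\lambda_{IP}$.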

## 3 Controlled Experiments

#### Visual Representation Learning

Our goal is to construct a set of controlled experiments that satisfy Assumption 1 and can empirically support Theorems 1 and 2.

Experimental Setup. We use the Omniglot dataset lake2015human in the experiments. The training set contains images from characters, and the test set contains characters. There is no character overlap between the training and test sets. Each character contains twenty examples drawn from twenty different people. We regard an image as the input ($X$) and generate the self-supervised signal ($S$) by first sampling an image from the same character as the input image and then applying translation/rotation to it. Furthermore, we represent task-relevant information ($T$) by one-hot label encoding. Under this self-supervised signal construction, the exclusive information in $X$ or $S$ consists of drawing styles (i.e., by different people) and image augmentations, and only their shared information contributes to $T$. To formally show the latter, if $T$ represents the label for $X$/$S$, then $P(T|X)$ and $P(T|S)$ are Dirac. Hence, $T \perp S \mid X$ and $T \perp X \mid S$, satisfying Assumption 1.

We train the feature mapping $F_X$ with SSL objectives (see eq. (4)), set $F_S = F_X$, let $R$ be symmetrical to $F_X$, and have $G$ be an identity mapping. On the test set, we fix the mapping and randomly select examples per character as the labeled examples. Then, we classify the rest of the examples using the 1-nearest neighbor classifier based on feature (i.e., $Z_X = F_X(X)$) cosine similarity. The random performance on this task stands at

. One may refer to Supplementary for more details.
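The evaluation protocol described above (frozen features, a few labeled examples per character, and a 1-nearest-neighbor classifier under cosine similarity) can be sketched as follows. The function name `one_nn_cosine` and the tiny feature arrays are hypothetical and only illustrate the classification rule.

```python
import numpy as np

def one_nn_cosine(support_feats, support_labels, query_feats):
    """Assign each query the label of the support feature with the highest cosine similarity."""
    support = np.asarray(support_feats, dtype=float)
    query = np.asarray(query_feats, dtype=float)
    support = support / np.linalg.norm(support, axis=1, keepdims=True)
    query = query / np.linalg.norm(query, axis=1, keepdims=True)
    similarity = query @ support.T               # similarity[i, j] = cos(query_i, support_j)
    return np.asarray(support_labels)[np.argmax(similarity, axis=1)]

# Two labeled examples (one per class) and three unlabeled queries.
support_feats = [[1.0, 0.0], [0.0, 1.0]]
support_labels = [7, 9]
query_feats = [[2.0, 0.1], [0.1, 3.0], [5.0, 1.0]]

predictions = one_nn_cosine(support_feats, support_labels, query_feats)
accuracy = float(np.mean(predictions == np.array([7, 9, 7])))
print(predictions, accuracy)
```

Because cosine similarity ignores feature norms, only the direction of $Z_X$ matters for this evaluation.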

Results & Discussions. In Figure 3, we provide empirical analysis to support Theorem 1 and 2. We report / / / for during training and report / as the upper bound of / . For the objectives, we consider (contrastive learning only) for Theorem 1/ Corollary 1 and (contrastive and inverse predictive learning) for Theorem 2/ Corollary 3. In Figure 3 (a) and (b), we observe a positive correlation between and . Hence, it implies the SSL objectives can extract task-relevant information. Moreover, comparing to only, has larger

values given the same epoch or the same

. This result indicates can facilitate the representation to extract information from the self-supervised signal (). Figure 3 (c) suggests positive correlation between and . Figure 3 (d) suggests tends to converge after epochs of training. Note that can be regarded as the incompressibility calude2013information of given . Comparing to only, has smaller values given the same number of epochs or the same . This result implies can facilitate the representation to be more compressed.

In Figure 4, we evaluate the generalization ability on the test set for different SSL objectives. Figure 4 (a)/(b) suggest that, comparing to , 1) reaches better test accuracy; 2) requires shorter training epochs to reach the best performance; and 3) suffers from overfitting with long-epoch training. Combining both of them () brings their advantages together. We also find that adding in the objective can boost model performance. According to Theorem 2 and Corollary 3, the improved performance suggests a more compressed representation results in better performance for the downstream tasks. Nonetheless, in Figure 4 (c), we find the performance is sensitive to the hyper-parameter for combining . We would also like to examine whether combining and together can lead to improved performance in SOTA SSL framework. In Figure 4 (d), we provide experiment with SimCLR chen2020simple on CIFAR10 krizhevsky2009learning , where refers to the exact same setup as in SimCLR (which considers only ). By considering in SimCLR, when changing , we observe a similar trend with our Omniglot experiment.

#### Visual-Textual Representation Learning

So far, we have provided empirical support for Theorems 1 and 2 and compared different SSL objectives on the visual representation learning task. Under this task, the input and self-supervised signals lie in the same domain and have the same content (i.e., images of the same character) but different styles and image augmentations. We now consider having the input and self-supervised signals lie in very different modalities: vision and text.

Experimental Setup. We provide experiments using Microsoft COCO (MS COCO) dataset coco that contains k multi-labeled images with million labeled instances from objects. Each image has annotated captions describing the relationships between objects in the scenes.

We regard the image as the input ($X$) and its textual descriptions as the self-supervised signal ($S$), and we use $L_{CL}$ (+ $\lambda_{IP} L_{IP}$) as our SSL objective. We use a ResNet50 he2016deep image encoder for $F_X$ (trained from scratch or fine-tuned on ImageNet deng2009imagenet pre-trained weights), a BERT-uncased devlin2018bert text encoder for $F_S$ (trained from scratch or from BookCorpus zhu2015aligning /Wikipedia pre-trained weights), and a linear layer for $G$. After performing self-supervised visual-textual representation learning, we consider the downstream multi-label classification task across categories. We evaluate the learned visual representation ($Z_X$) using the downstream linear evaluation protocol of oord2018representation ; henaff2019data ; tian2019contrastive ; hjelm2018learning ; bachman2019learning ; tschannen2019mutual . Specifically, a linear classifier is trained from the self-supervisedly learned (fixed) representation to the labels on the training set. Commonly used metrics for multi-label classification are reported on the MS COCO validation set: Micro ROC-AUC, Hamming Loss, and Subset Accuracy. One may refer to Supplementary for more details on these metrics.
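For readers unfamiliar with the multi-label metrics named above, the following sketch computes two of them, Hamming Loss and Subset Accuracy, under their standard definitions. The function names and the tiny label matrices are illustrative assumptions; Micro ROC-AUC is omitted because it requires ranking scores rather than hard predictions.

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    """Fraction of individual label decisions that are wrong (lower is better)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    return float(np.mean(y_true != y_pred))

def subset_accuracy(y_true, y_pred):
    """Fraction of examples whose entire label vector is predicted exactly (higher is better)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    return float(np.mean(np.all(y_true == y_pred, axis=1)))

# Two images, three candidate categories each (rows: images, columns: categories).
y_true = [[1, 0, 1],
          [0, 1, 0]]
y_pred = [[1, 0, 0],
          [0, 1, 0]]

print(hamming_loss(y_true, y_pred))     # one wrong decision out of six
print(subset_accuracy(y_true, y_pred))  # one of two images matched exactly
```

Subset Accuracy is the stricter of the two: a single wrong category makes the whole example count as incorrect, which is why chance-level values for it are very low when there are many categories.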

Results & Discussions. First, Figure 5 (a) suggests that the SSL strategy can work when the input and self-supervised signals lie in different modalities. For example, a random guess for the subset accuracy would be , and the setting under Raw BERT + Raw ResNet achieves . We also see that using pre-trained ResNet can further improve the self-supervisedly learned representation, while using pre-trained BERT does not give us obvious benefits. Next, Figure 5 (b) suggests that the self-supervisedly learned representations can be further improved by combining and : . In Figure 5 (c)/(d), we have a similar observation as the self-supervised visual representation learning experiment: the hyper-parameter is sensitive to the performance.

## 4 Related Work

Our work aims at providing theoretical insights for the empirical success of self-supervised learning. The most related work is Unsupervised Contrastive Learning Theory arora2019theoretical , which assumes that two similar data points (i.e., one stands for the input and the other stands for the corresponding self-supervised signal) have the same latent class, and that a downstream classification task is comprised of a subset of the latent classes. That work then presented 1) provable guarantees for downstream classification using contrastively learned representations; and 2) a generalization bound showing that the learned representations can reduce (labeled) sample complexity on downstream tasks. Our work differs in two ways: 1) we present a different assumption, namely that only the shared information between the input and self-supervised signals contributes to the downstream tasks; and 2) we do not constrain the downstream tasks to be classification; they could also be regression, clustering, etc.

Multi-view learning xu2013survey also closely relates to our work. Specifically, we can regard the input and self-supervised signals as two different views of data, and self-supervised learning aims at learning useful representations across views. Sridharan et al. sridharan2008information pose the underlying assumption for multi-view learning: either view alone is sufficient for the downstream tasks (see Assumption 1 in sridharan2008information ). Their assumption is equivalent to our Assumption 1. Note that they focus on the semi-supervised setting while we focus on the unsupervised setting. Another recent work federici2020learning combines multi-view learning and the information bottleneck tishby2000information method to balance the trade-off between extracting joint multi-view information and discarding non-joint multi-view information.

## 5 Conclusion

In this paper, we studied self-supervised learning via an information-theoretical perspective. We designed a self-supervised learning framework to extract task-relevant information and discard task-irrelevant information. We also connected this framework with prior self-supervised learning methods, specifically for contrastive and predictive learning objectives. To support our theoretical analysis empirically, we designed controlled experiments on visual representation learning and visual-textual representation learning. We believe this work sheds light on the advantages of self-supervised learning and may help better understand when and why self-supervised learning is likely to work. In the future, we plan to investigate, compare, and combine different deployments of contrastive learning, forward predictive learning, and inverse predictive learning objectives. Another area of interest for future exploration is multi-modality self-supervised learning.

## Acknowledgement

This work was supported in part by the DARPA grants FA875018C0150 HR00111990016, NSF IIS1763562, NSF Awards #1750439 #1722822, National Institutes of Health, and Apple. We would also like to acknowledge NVIDIA’s GPU support.

## 6 Proofs for Lemmas

###### Lemma 3 (Determinism, restating Lemma 1).

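If $P(Z_X|X)$ is Dirac, then the following conditional independence holds: $T \perp Z_X \mid X$ and $S \perp Z_X \mid X$, given by a Markov chain in which $S$ and $T$ influence $Z_X$ only through $X$.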

###### Proof.

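When $Z_X$ is a deterministic function of $X$, for any $A$ in the sigma-algebra induced by $Z_X$ we have $\mathbb{E}\big[\mathbb{1}_{[Z_X \in A]} \mid X, \{T, S\}\big] = \mathbb{E}\big[\mathbb{1}_{[Z_X \in A]} \mid X\big]$, which implies $T \perp Z_X \mid X$ and $S \perp Z_X \mid X$. ∎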

Bringing the redundancy assumption and determinism lemma together, we get:

###### Lemma 4 (Representation Redundancy).

The representation $Z_X$ is redundant to the self-supervised signal $S$ for the task-relevant information $T$, meaning $I(Z_X;T|S) = 0$.

###### Proof.

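By the redundancy assumption, $I(X;T|S) = 0$. Also, since $Z_X$ is a deterministic function of $X$ (Lemma 3), $0 \le I(Z_X;T|S) \le I(X;T|S) = 0$. ∎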

###### Lemma 5 (Supervision Decomposition, restating Lemma 2).

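We consider the supervision decomposition from $T$ to $S$:

$$I(Z_X;S) = I(Z_X;T) + I(Z_X;S|T) \quad \text{and} \quad H(Z_X|S) = H(Z_X|T) - I(Z_X;S|T).$$

Also, $I(X;S) = I(X;T) + I(X;S|T)$ and $H(X|S) = H(X|T) - I(X;S|T)$.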

###### Proof.

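Plug $I(Z_X;T|S) = 0$ (see Lemma 4) into the chain rules of mutual information: $I(Z_X;S) = I(Z_X;T) + I(Z_X;S|T) - I(Z_X;T|S)$ and $H(Z_X|S) = H(Z_X|T) - I(Z_X;S|T) + I(Z_X;T|S)$. Likewise, plug $I(X;T|S) = 0$ (see the redundancy assumption) into the chain rules for $I(X;S)$ and $H(X|S)$. ∎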
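For completeness, the two chain-rule identities used in this proof can be obtained by expanding $I(Z_X;S,T)$ and $H(Z_X|S,T)$ in two different orders. The following is a worked sketch of that standard argument, using only the quantities already defined in the paper:

```latex
\begin{aligned}
I(Z_X;S,T) &= I(Z_X;S) + I(Z_X;T|S) = I(Z_X;T) + I(Z_X;S|T) \\
\Longrightarrow\quad I(Z_X;S) &= I(Z_X;T) + I(Z_X;S|T) - I(Z_X;T|S), \\[4pt]
H(Z_X|S) &= H(Z_X|S,T) + I(Z_X;T|S), \qquad
H(Z_X|T) = H(Z_X|S,T) + I(Z_X;S|T) \\
\Longrightarrow\quad H(Z_X|S) &= H(Z_X|T) - I(Z_X;S|T) + I(Z_X;T|S).
\end{aligned}
```

Setting $I(Z_X;T|S) = 0$ (Lemma 4) removes the last term in both identities and gives the supervision decomposition. Replacing $Z_X$ by $X$ and using $I(X;T|S) = 0$ gives the corresponding statements for the input.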

## 7 Information Diagram Road Map

To ease the understanding of the paper, we provide an information-diagram version of our road map for our derivations. Note that information diagram provides easy-to-understand relationships between information measurements. We encourage the readers to refer to the main text for formal proofs and statements of the results.

Our derivations are based on the following assumption and lemmas. The core assumption is that the input and the self-supervised signal are mutually redundant for downstream tasks. The assumption suggests that the exclusive information in the input and the self-supervised signal is what we can discard. Next, using the fact that $Z_X$ is deterministic from $X$, we characterize conditional independence by a Markov chain in which $S$ and $T$ influence $Z_X$ only through $X$. This lemma simply states that post-processing (i.e., $X$ to $Z_X$) cannot introduce additional information. Last, based on the redundancy assumption and determinism lemma, we present the supervision decomposition that is used for transferring supervision from the downstream task to the self-supervised signal.

After depicting our theories and their derivations, we connect our SSL framework and prior work zhang2016colorful ; devlin2018bert ; oord2018representation ; bachman2019learning ; chen2020simple ; tian2019contrastive ; hjelm2018learning ; he2019momentum , discussing practical implementation for different SSL objectives.

## 8 More on Visual Representation Learning Experiments

In the main text, we design controlled experiments on self-supervised visual representation learning to empirically support our theorems and examine different compositions of SSL objectives. In this section, we discuss 1) the architecture design; 2) different deployments of contrastive/forward predictive learning; and 3) different self-supervised signal construction strategies. We argue that these three additional sets of experiments may be interesting future work.

### 8.1 Architecture Design

The input image has size . For image augmentations, we adopt 1) rotation with degrees from to ; 2) translation from pixels to pixels; 3) scaling both width and height from to ; 4) scaling width from to while fixing the height; and 5) resizing the image to . Then, a deep network takes a image and outputs a dim. feature vector. The deep network has the structure: . has 3x3 kernel size with output channels, has 2x2 kernel size, and is a to weight matrix. is symmetric to , which has . has the exact same number of parameters as . Note that we use the same network designs in and estimations. To reproduce the results in our experimental section, please refer to our released code (https://github.com/yaohungt/Demystifying_Self_Supervised_Learning).

### 8.2 Different Deployments for Contrastive and Predictive Learning Objectives

In the main text, for practical deployments, we suggest Contrastive Predictive Coding (CPC) oord2018representation for and assume Gaussian distribution for the variational distributions in / . The practical deployments can be abundant by using different mutual information approximations for and having different distribution assumptions for / . In the following, we discuss a few examples.

Contrastive Learning. Other than CPC oord2018representation , another popular contrastive learning objective is JS bachman2019learning , which is the lower bound of Jensen-Shannon divergence between and (a variational bound of mutual information). Its objective can be written as

where we use to denote .
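As a rough illustration, the sketch below evaluates one common form of a JS-based contrastive objective from critic scores. It assumes the softplus form, with $\mathrm{softplus}(x) = \log(1 + e^{x})$, applied to scores of positive pairs (drawn from the joint distribution) and negative pairs (drawn from the product of marginals). This form, and the names `softplus` and `js_objective`, are assumptions made for illustration and may differ from the exact objective used in our experiments.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def js_objective(pos_scores, neg_scores):
    """Assumed JS-style objective (to be maximized).

    pos_scores: critic scores on pairs sampled from the joint distribution.
    neg_scores: critic scores on pairs sampled from the product of marginals.
    """
    pos_term = sum(-softplus(-g) for g in pos_scores) / len(pos_scores)
    neg_term = sum(softplus(g) for g in neg_scores) / len(neg_scores)
    return pos_term - neg_term


# An uninformative critic (all scores 0) gives -2*log(2); a critic that
# separates positives from negatives pushes the value towards 0.
uninformative = js_objective([0.0, 0.0], [0.0, 0.0])
separating = js_objective([10.0, 10.0], [-10.0, -10.0])
```

Under this assumed form the objective is bounded above by zero, so a larger value indicates a critic that better separates the two kinds of pairs.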

Predictive Learning. Gaussian distribution may be the simplest distribution form that we can imagine, which leads to Mean Square Error (MSE) reconstruction loss. Here, we use forward predictive learning as an example, and we discuss the case when lies in discrete sample space. Specifically, we let be factorized multivariate Bernoulli:

$$\max_{Z_X = F_X(X),\, R}\ \mathbb{E}_{P_{S, Z_X}}\left[\sum_{i=1}^{p} s_i \cdot \log [R(z_x)]_i + (1 - s_i) \cdot \log \big[1 - R(z_x)\big]_i\right]. \tag{5}$$

This objective leads to Binary Cross Entropy (BCE) reconstruction loss.

If we assume each reconstruction loss corresponds to a particular distribution form, then, setting aside which variational distribution we choose, we are free to choose an arbitrary reconstruction loss. For instance, by switching $s$ and $R(z_x)$ in eq. (5), the objective can be regarded as a Reverse Binary Cross Entropy (RevBCE) reconstruction loss. In our experiments, we find that RevBCE works the best among {MSE, BCE, RevBCE}. Therefore, in the main text, we choose RevBCE as the example reconstruction loss for the predictive objective.
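To make the three reconstruction losses concrete, the following is a minimal sketch for a single sample, where `s` is the target and `r` is the reconstruction, both with entries in [0, 1]. The function names and the clipping constant `EPS` are assumptions introduced here; clipping is needed in particular for RevBCE, since the logarithm is then applied to binary targets.

```python
import math

EPS = 1e-7  # assumed clipping constant that keeps log() finite


def _clip(v):
    return min(max(v, EPS), 1.0 - EPS)


def mse(s, r):
    # Mean squared error, the loss implied by a Gaussian assumption.
    return sum((si - ri) ** 2 for si, ri in zip(s, r)) / len(s)


def bce(s, r):
    # Negative of the bracketed sum in eq. (5) for one sample.
    return -sum(
        si * math.log(_clip(ri)) + (1.0 - si) * math.log(1.0 - _clip(ri))
        for si, ri in zip(s, r)
    )


def rev_bce(s, r):
    # Same expression with the roles of s and r exchanged.
    return bce(r, s)


s = [1.0, 0.0]  # binary target
r = [0.9, 0.1]  # reconstruction
losses = {"MSE": mse(s, r), "BCE": bce(s, r), "RevBCE": rev_bce(s, r)}
```

Maximizing the objective in eq. (5) corresponds to minimizing `bce` averaged over samples; the expectation over the data is omitted here for brevity.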

More Experiments. We provide an additional set of experiments in Figure 6, using {CPC, JS} for the contrastive objective and {MSE, BCE, RevBCE} reconstruction losses for the predictive objective. The results show that different formulations of the objectives lead to very different test generalization performance. We argue that, given a particular task, finding the best deployments of the contrastive and predictive learning objectives is challenging but important.

### 8.3 Different Self-supervised Signal Construction Strategy

In the main text, we design a self-supervised signal construction strategy in which the input ($X$) and the self-supervised signal ($S$) differ in {drawing styles, image augmentations}. This strategy is different from the one commonly adopted in most self-supervised visual representation learning work tian2019contrastive ; bachman2019learning ; chen2020simple . Specifically, prior work considers the difference between the input and the self-supervised signal only in image augmentations. We provide additional experiments in Fig. 7 to compare these two self-supervised signal construction strategies.

We see that, compared to the common self-supervised signal construction strategy tian2019contrastive ; bachman2019learning ; chen2020simple , the strategy introduced in our controlled experiments generalizes much better to the test set. It is worth noting that, although our construction strategy has access to the label information (i.e., we sample the self-supervised signal image from the same character as the input image), our SSL objectives do not train with the labels. Nonetheless, since we implicitly utilize the label information in our construction strategy, it would be unfair to directly compare our strategy with the prior one. An interesting future research direction is to examine different self-supervised signal construction strategies and even to incorporate all or part of the label information into self-supervised learning.
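The difference between the two construction strategies can be sketched at the level of index pairing. The helper names below are hypothetical, and the image augmentations that both strategies apply afterwards are left out; this is an illustration of the idea rather than the released implementation.

```python
import random


def pair_augmentation_only(index):
    # Common strategy: the signal comes from the same image as the input,
    # so the two views differ only through the augmentations applied later.
    return index, index


def pair_same_character(index, labels, rng):
    # Strategy in our controlled experiments: the signal is another image
    # of the same character, so drawing style can also differ.
    candidates = [j for j, lab in enumerate(labels)
                  if lab == labels[index] and j != index]
    if not candidates:
        return index, index  # fall back when no other image is available
    return index, rng.choice(candidates)


labels = ["a", "b", "a", "b"]  # toy character labels, one per image
rng = random.Random(0)
pairs = [pair_same_character(i, labels, rng) for i in range(len(labels))]
```

Note that `pair_same_character` reads the labels only to select the signal image; the labels themselves are never passed to the training objective.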

## 9 Metrics in Visual-Textual Representation Learning

• Subset Accuracy multilbl , also known as the Exact Match Ratio (MR), treats all partially correct outputs as incorrect and extends accuracy from the single-label case to the multi-label setting.

 $$\mathrm{MR} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[Y_i = H_i]$$
• Micro AUC ROC score aucroc computes the AUC (Area under the curve) of a receiver operating characteristic (ROC) curve.

• Hamming Loss multilbl is the fraction of wrong labels to the total number of labels.

 $$\mathrm{HL} = \frac{1}{kn} \sum_{i=1}^{n} \sum_{c=1}^{k} \mathbb{1}[Y_i^c \neq H_i^c]$$
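The Exact Match Ratio and the Hamming Loss defined above can be computed directly from binary label matrices. The sketch below assumes `Y` holds the ground-truth labels and `H` the predictions, each as a list of `n` rows with `k` binary entries; the function names are chosen here for illustration.

```python
def exact_match_ratio(Y, H):
    # Fraction of samples whose full label vector is predicted exactly.
    return sum(1 for y, h in zip(Y, H) if list(y) == list(h)) / len(Y)


def hamming_loss(Y, H):
    # Fraction of wrong labels among all k * n label slots.
    n, k = len(Y), len(Y[0])
    wrong = sum(1 for y, h in zip(Y, H) for yc, hc in zip(y, h) if yc != hc)
    return wrong / (k * n)


Y = [[1, 0, 1], [0, 1, 0]]  # ground truth: n = 2 samples, k = 3 labels
H = [[1, 0, 1], [0, 0, 0]]  # predictions: second sample has one wrong label
mr = exact_match_ratio(Y, H)  # 0.5
hl = hamming_loss(Y, H)       # 1/6
```

The second sample shows how the two metrics diverge: a single wrong label makes the whole sample count as a miss for MR, while it contributes only one of six label slots to HL.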