Demystifying Self-Supervised Learning
Self-supervised representation learning adopts self-defined signals as supervision and uses the learned representation for downstream tasks, such as masked language modeling (e.g., BERT) for natural language processing and contrastive visual representation learning (e.g., SimCLR) for computer vision applications. In this paper, we present a theoretical framework explaining that self-supervised learning is likely to work under the assumption that only the shared information (e.g., contextual information or content) between the input (e.g., non-masked words or original images) and self-supervised signals (e.g., masked words or augmented images) contributes to downstream tasks. Under this assumption, we demonstrate that the self-supervisedly learned representation can extract task-relevant and discard task-irrelevant information. We further connect our theoretical analysis to popular contrastive and predictive (self-supervised) learning objectives. In the experimental section, we provide controlled experiments on two popular tasks: 1) visual representation learning with various self-supervised learning objectives to empirically support our analysis; and 2) visual-textual representation learning, a challenging setting in which the input and self-supervised signal lie in different modalities.
Self-supervised learning (SSL) zhang2016colorful ; srivastava2015unsupervised ; devlin2018bert ; oord2018representation ; bachman2019learning ; chen2020simple ; tian2019contrastive ; hjelm2018learning ; he2019momentum ; kong2019mutual ; arora2019theoretical learns representations using a proxy objective (i.e., an SSL objective) between inputs and self-defined signals. Empirical evidence suggests that the learned representations can generalize well to a wide range of downstream tasks, even when there is no clear connection between the SSL objective and the downstream tasks. For example, BERT devlin2018bert
defines a prediction loss (i.e., an SSL objective) from non-masked words (i.e., inputs) to masked words (i.e., self-supervised signals). Then, one takes BERT as a word feature extractor and adopts the word features for various natural language processing applications, spanning sentiment analysis, question answering, dialogue systems, and named-entity recognition
young2018recent . Despite showing success in practice, only a few works arora2019theoretical provide theoretical insights into SSL. In particular, Arora et al. arora2019theoretical presented provable guarantees on the performance of downstream classification tasks when using contrastive learning objectives in SSL. Our work shares a similar goal of demystifying SSL, but approaches it from an Information Theory cover2012elements perspective to understand when and why self-supervised learning is likely to work.
In this paper, we argue that a good representation learning procedure is one that learns representations that are maximally compressed and include only the information required for the downstream tasks. In other words, the representations should maximally extract task-relevant and discard task-irrelevant information. To connect this compressed representation learning procedure with SSL (which has no access to downstream tasks), we rely on a core assumption: only the shared information between the input and self-supervised signals contributes to the downstream tasks. To see that this assumption is likely to hold in practice, we again take BERT devlin2018bert as an example. In BERT devlin2018bert , the information shared across masked and non-masked words is referred to as contextual information. Our assumption states that the contextual information contributes to the downstream tasks, and not the information exclusive to the masked or non-masked words. Another example is visual representation learning in SimCLR chen2020simple , where the authors apply different image augmentations to a given image, treating one augmented view as the input and the other as the corresponding self-supervised signal. Our assumption states that only the shared information (i.e., the content of the image) between the augmented images contributes to the downstream tasks, which is in accord with the intuition that image augmentations (e.g., changing the style of an image) should not affect the labels of images.
Based on this assumption, we develop an unsupervised compressed representation learning strategy. In particular, we extract task-relevant information by maximizing the mutual information between the learned representations and the self-supervised signals. Then, we discard task-irrelevant information by minimizing the conditional entropy of the learned representations given the self-supervised signals. We show this strategy 1) includes prior arts for SSL on contrastive agrawal2015learning ; arandjelovic2017look ; jayaraman2015learning ; oord2018representation ; bachman2019learning ; chen2020simple ; tian2019contrastive ; hjelm2018learning ; he2019momentum ; kong2019mutual ; ozair2019wasserstein ; arora2019theoretical ; henaff2019data and predictive learning zhang2016colorful ; pathak2016context ; vondrick2016generating ; tulyakov2018mocogan ; srivastava2015unsupervised ; peters2018deep ; devlin2018bert ; dai2019transformer ; bai2018empirical approaches; 2) paves the way to a larger space of composing SSL objectives; and 3) leads us to a discussion of the limitations and challenges of using these objectives. For instance, we can combine both contrastive and predictive learning approaches in our SSL objective, being aware that the contrastive objective requires a larger batch size and the predictive objective is hard to optimize if the self-supervised signals are high-dimensional.
We first conduct controlled experiments on visual representation learning to 1) verify that the self-supervisedly learned representation can extract task-relevant and discard task-irrelevant information; and 2) compare different compositions of SSL objectives. Then, we perform self-supervised visual-textual representation learning in a challenging setting in which the input and self-supervised signals lie in very different modalities. We make our experiments publicly available at https://github.com/yaohungt/Demystifying_Self_Supervised_Learning.
In this section we aim to show that self-supervised learning (SSL) can learn a representation that is beneficial for downstream tasks. For the input, we denote its random variable as X, its sample space as 𝒳, and its outcome as x. Similarly, for the self-supervised signal, we denote its random variable/sample space/outcome as S/𝒮/s. The two sample spaces can differ: 𝒮 ≠ 𝒳. We learn a representation (Z/𝒵/z) from the input through a deterministic mapping F_X: Z = F_X(X). The information required for downstream tasks is referred to as "task-relevant information": T/𝒯/t. Note that SSL has no access to the task-relevant information. Lastly, we use I(·;·) to represent mutual information, I(·;·|·) to represent conditional mutual information, and H(·|·) to represent conditional entropy for the random variables X/S/Z/T. We provide high-level takeaways for our main results in Figure 1. The derivations throughout the paper rely on the following redundancy assumption and determinism lemma. First, we assume redundancy between the input X and the self-supervised signal S:
The input X is redundant to the self-supervised signal S for the task-relevant information T. In other words, we assume the conditional independence T ⊥ X | S, or equivalently I(X;T|S) = 0. We assume the redundancy also holds when we swap X and S, and hence T ⊥ S | X, or equivalently I(S;T|X) = 0. By mutual redundancy, I(X;T) = I(S;T).
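For completeness, a short worked derivation (in our notation) of the last equality, obtained by applying the chain rule of mutual information to I(X,S;T) in two ways:

```latex
% Chain rule of mutual information applied twice:
%   I(X,S;T) = I(S;T) + I(X;T|S) = I(X;T) + I(S;T|X).
% Under the mutual redundancy assumption, I(X;T|S) = I(S;T|X) = 0, hence
\begin{equation}
I(X;T) \;=\; I(X,S;T) \;=\; I(S;T).
\end{equation}
```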
Assumption 1 states that the information required for the downstream tasks lies only in the shared information between the input and self-supervised signals. We provide an intuition by relating the assumption to Multiview learning xu2013survey ; sridharan2008information . Multiview learning extracts representations from data across different views, and it assumes each view provides the same task-relevant information. In SSL, we can regard the input and self-supervised signals as different views of the data. For instance, in contrastive visual representation learning hjelm2018learning ; chen2020simple , the input and the corresponding self-supervised signal are the same image with different image augmentations (images with different views).
Next, we provide a useful lemma using the fact that F_X is a deterministic mapping:
If P(Z|X) is Dirac, then the following conditional independences hold: Z ⊥ S | X and Z ⊥ T | X, given by the Markov chain S/T → X → Z.
[Footnote 1: The Markov chain is naturally satisfied when F_X is a deterministic mapping. If F_X is random, the Markov chain needs to be further assumed so that the conditional independences Z ⊥ S | X and Z ⊥ T | X hold.] This lemma simply states that Z contains no more information than X.
Under a supervised setting, to learn representations which contain only and no more than the information required for the downstream tasks, we consider the following objectives:
Uncompressed and compressed supervised representations are defined as
$$Z_{\text{sup}} = \arg\max_{Z = F_X(X)} I(Z;T) \quad \text{and} \quad Z_{\text{sup}}^{c} = \arg\min_{Z\,:\, I(Z;T)\text{ is maximized}} H(Z|T).$$
Then, Z_sup (and likewise Z_sup^c) contains all task-relevant information: I(Z_sup;T) = I(X;T).
Adopting the Data Processing Inequality cover2012elements in the Markov chain T → X → Z (Lemma 1), I(Z;T) is maximized at I(Z;T) = I(X;T), and I(X;T) = I(S;T) by Assumption 1. ∎
The definition shows that the supervisedly learned representation Z_sup/Z_sup^c can extract relevant information for the downstream tasks. Next, we provide a justification that minimizing H(Z|T) [Footnote 2: To discard task-irrelevant information, an alternative objective is minimizing I(X;Z|T), which represents the information between X and Z that is irrelevant to T. However, minimizing the conditional mutual information I(X;Z|T) requires a min-max optimization, which may cause instability in practice. Hence, we consider minimizing H(Z|T), which does not contain a min-max optimization.] leads to compressed representations. Minimizing H(Z|T) reduces the randomness in Z given T, and this randomness is regarded as the incompressibility calude2013information . Hence, under the constraint that "I(Z;T) is maximized", minimizing H(Z|T) leads to a more compressed representation (discarding superfluous information). Note that our analysis does not constrain the type of T, which can correspond to classification, regression, or clustering.
In Definition 1, we discuss uncompressed and compressed supervised representation learning objectives. To bridge the gap between supervised and self-supervised learning, we perform the following supervision decomposition (from the downstream tasks to the self-supervised signals):
We consider the supervision decomposition from T to S:
$$I(Z;T) = I(Z;S) - I(Z;S|T).$$
Also,
$$I(X;T) = I(X;S) - I(X;S|T).$$
The decomposition allows us to 1) perform supervision on S (i.e., self-supervised learning) instead of T (i.e., supervised learning); 2) associate supervisedly- and self-supervisedly-learned representations; and 3) characterize the compression gap from supervised to self-supervised learning. Formally,
Uncompressed and compressed self-supervised representations are defined as
$$Z_{\text{ssl}} = \arg\max_{Z = F_X(X)} I(Z;S) \quad \text{and} \quad Z_{\text{ssl}}^{c} = \arg\min_{Z\,:\, I(Z;S)\text{ is maximized}} H(Z|S).$$
Then, Z_ssl (and likewise Z_ssl^c) contains all the shared information between X and S: I(Z_ssl;S) = I(X;S).
Adopting the Data Processing Inequality cover2012elements in the Markov chain S → X → Z (Lemma 1), I(Z;S) is maximized at I(Z;S) = I(X;S). ∎
Uncompressed and compressed self-supervised representations extract all task-relevant information, suggesting I(Z_ssl;T) = I(Z_ssl^c;T) = I(X;T) = I(S;T).
In other words, the compressed self-supervised representation is a subset of the uncompressed self-supervised representation, and the latter is a subset of the supervised representation.
Adopting the Data Processing Inequality cover2012elements in the Markov chain T → X → Z (Lemma 1), I(Z;T) is maximized at I(Z;T) = I(X;T). Then, bringing the results in Definitions 1 and 2 into Lemma 2, we conclude that I(Z;T) is maximized when I(Z;S) is maximized, and hence I(Z_ssl;T) and I(Z_ssl^c;T) are both maximized. ∎
Compressed self-supervised representations cannot discard all task-irrelevant information; a compression gap exists:
$$H(Z_{\text{ssl}}^{c}\,|\,T) \;=\; H(Z_{\text{ssl}}^{c}\,|\,S) + I(X;S|T) \;\geq\; I(X;S|T),$$
with I(X;S|T) being the information that cannot be discarded in SSL.
As a summary, Definition 2 defines our compressed SSL strategy. Theorem 1 indicates that this strategy can extract as much task-relevant information as the supervised one. Regarding how much task-irrelevant information can be discarded, Theorem 2 indicates a compression gap between supervised and self-supervised learning.
We now associate our self-supervised representation learning strategy (Definition 2) with prior SSL objectives, especially for contrastive agrawal2015learning ; arandjelovic2017look ; jayaraman2015learning ; oord2018representation ; bachman2019learning ; chen2020simple ; tian2019contrastive ; hjelm2018learning ; he2019momentum ; kong2019mutual ; ozair2019wasserstein ; arora2019theoretical ; henaff2019data and predictive zhang2016colorful ; pathak2016context ; vondrick2016generating ; tulyakov2018mocogan ; srivastava2015unsupervised ; peters2018deep ; devlin2018bert ; dai2019transformer ; bai2018empirical learning objectives. We illustrate important remarks in Figure 2.
We define the contrastive learning objective as maximizing the mutual information I(Z;S) between the learned representation Z and the self-supervised signal S, which maximizes the dependency/contrastiveness between Z and S. Given Theorem 1, we have:
If I(Z;S) is maximized (i.e., I(Z;S) = I(X;S)), then Z contains all task-relevant information.
The corollary suggests that, even having no access to the downstream tasks, maximizing I(Z;S) results in Z containing all the information required for the downstream tasks from X/S. To deploy the contrastive learning objective, recent methods propose to maximize lower bounds of mutual information belghazi2018mine ; oord2018representation ; poole2019variational ; song2019understanding or its variants such as the JS-divergence poole2019variational ; hjelm2018learning between the joint density and the product of the marginal densities. We denote these methods as I_φ(Z;S), with φ representing the parameters introduced when computing the bound. In this work, we suggest contrastive predictive coding (CPC) oord2018representation ; tian2019contrastive , which is a mutual information lower bound with lower variance poole2019variational ; song2019understanding :
$$\max_{F_X,\,F_S,\,g}\;\mathbb{E}_{\{(x_i,s_i)\}_{i=1}^{n}\sim P^{n}(X,S)}\left[\frac{1}{n}\sum_{i=1}^{n}\log\frac{e^{\,\langle g(F_X(x_i)),\,g(F_S(s_i))\rangle}}{\frac{1}{n}\sum_{j=1}^{n}e^{\,\langle g(F_X(x_i)),\,g(F_S(s_j))\rangle}}\right] \qquad (1)$$
where F_S is a deterministic mapping for the self-supervised signal and g is a projection head that projects a representation into a lower-dimensional vector. If the input and self-supervised signals share the same sample space, i.e., 𝒳 = 𝒮, we can impose F_X = F_S (e.g., self-supervised visual representation learning chen2020simple ). The projection head g can be an identity, a linear, or a non-linear mapping. Last, we note that modeling eq. (1) or other contrastive learning objectives belghazi2018mine ; poole2019variational often requires a large batch size (e.g., n in eq. (1)) hjelm2018learning ; he2019momentum ; chen2020simple to ensure both low variance and low bias (w.r.t. the true mutual information). Empirical work tschannen2019mutual has suggested that large variance in contrastive learning objectives may lead to worse performance on the downstream tasks.
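Below is a minimal PyTorch-style sketch of the CPC objective in eq. (1). The names enc_x (for F_X), enc_s (for F_S), and proj (for g) are illustrative placeholders, not the exact modules in our released code.

```python
import math
import torch

def cpc_loss(x, s, enc_x, enc_s, proj):
    """Contrastive predictive coding (eq. (1)) over a batch of n (x, s) pairs.

    enc_x, enc_s: deterministic encoders (F_X, F_S); proj: projection head g.
    Returns the negative lower bound, to be minimized.
    """
    zx = proj(enc_x(x))                       # (n, d) projected input representations
    zs = proj(enc_s(s))                       # (n, d) projected signal representations
    scores = zx @ zs.t()                      # (n, n) pairwise similarity scores
    n = scores.size(0)
    # log [ exp(score_ii) / ((1/n) sum_j exp(score_ij)) ], averaged over i
    bound = (scores.diag() - torch.logsumexp(scores, dim=1) + math.log(n)).mean()
    return -bound
```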
We define the forward predictive learning objective as maximizing the log conditional likelihood E_{P(S,Z)}[log P(S|Z)] from the learned representation Z to the self-supervised signal S, which encourages Z to reconstruct S. By the chain rule, E_{P(S,Z)}[log P(S|Z)] = -H(S|Z) = I(Z;S) - H(S), where H(S) is irrelevant to Z. Hence, maximizing E_{P(S,Z)}[log P(S|Z)] is equivalent to maximizing I(Z;S). Given Theorem 1, we have: If E_{P(S,Z)}[log P(S|Z)] is maximized, then Z contains all task-relevant information.
The corollary suggests that, if Z can perfectly reconstruct S for any (s, z) pair, then Z contains all the information required for the downstream tasks from X/S. A common approach to avoid the intractability in Corollary 2 is to assume a variational distribution Q_φ(S|Z), with φ representing the parameters introduced when computing Q_φ(S|Z). Now, we re-arrange E_{P(S,Z)}[log P(S|Z)] = E_{P(S,Z)}[log Q_φ(S|Z)] + E_{P(Z)}[D_KL(P(S|Z) || Q_φ(S|Z))] ≥ E_{P(S,Z)}[log Q_φ(S|Z)]. Hence, E_{P(S,Z)}[log Q_φ(S|Z)] is a lower bound of E_{P(S,Z)}[log P(S|Z)]. The bound is tight when Q_φ(S|Z) = P(S|Z). Q_φ(S|Z) can be any distribution such as a Gaussian or a Laplacian, and its mean function can be a linear model, a kernel method, or a neural network. For example, MocoGAN tulyakov2018mocogan assumes Q_φ(S|Z) is Laplacian (i.e., an ℓ1 reconstruction loss) with a deconvolutional network long2015fully as the mean function. Transformer-XL dai2019transformer assumes Q_φ(S|Z) is a categorical distribution (i.e., a cross-entropy loss) computed by a Transformer network vaswani2017attention . If we let Q_φ(S|Z) be Gaussian with an identity covariance matrix, the objective becomes:
$$\max_{F_X,\,R}\;\mathbb{E}_{(x,s)\sim P(X,S)}\Big[-\big\|\,s - R\big(F_X(x)\big)\big\|_2^2\Big] \qquad (2)$$
where R is a deterministic mapping that reconstructs S from Z. Note that we ignore the constants derived from the Gaussian distribution. Last, in most real-world applications, the self-supervised signal S has a much higher dimension than the representation Z. Hence, modeling the conditional generative model Q_φ(S|Z) can be challenging: consider, for example, S being an image and Z being a low-dimensional vector.
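A hedged sketch of the forward predictive objective in eq. (2) under the Gaussian assumption (i.e., a mean-squared-error reconstruction loss); dec_r stands for the reconstruction mapping R and is a placeholder name.

```python
import torch.nn.functional as F

def forward_predictive_loss(x, s, enc_x, dec_r):
    """Forward predictive learning (eq. (2)): reconstruct the self-supervised
    signal s from z = enc_x(x), assuming Q(S|Z) is Gaussian with identity
    covariance, which reduces to a mean-squared-error reconstruction loss."""
    z = enc_x(x)                 # (n, d) representations
    s_hat = dec_r(z)             # reconstruction of s, same shape as s
    return F.mse_loss(s_hat, s)  # minimizing MSE == maximizing the Gaussian log-likelihood (up to constants)
```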
We define the inverse predictive learning objective as maximizing the log conditional likelihood E_{P(S,Z)}[log P(Z|S)] from the self-supervised signal S to the learned representation Z, which encourages S to reconstruct Z. Given Theorem 2 together with Lemma 2, we have:
Suppose I(Z;S) is maximized and E_{P(S,Z)}[log P(Z|S)] is maximized. Then, Z discards all the information, excluding I(X;S|T), that is irrelevant for the downstream tasks.
The corollary suggests that, if S can perfectly reconstruct Z for any (s, z) pair under the constraint that I(Z;S) is maximized, then Z discards the task-irrelevant information, excluding I(X;S|T). Similar to the forward predictive learning, we use E_{P(S,Z)}[log Q_φ(Z|S)] as a lower bound of E_{P(S,Z)}[log P(Z|S)]. In our deployment, we take advantage of the design in eq. (1) and let Q_φ(Z|S) be Gaussian with an identity covariance matrix:
$$\max_{F_X,\,F_S,\,g}\;\mathbb{E}_{(x,s)\sim P(X,S)}\Big[-\big\|\,g\big(F_X(x)\big) - g\big(F_S(s)\big)\big\|_2^2\Big] \qquad (3)$$
Note that optimizing eq. (3) alone results in a degenerate solution, e.g., learning g∘F_X and g∘F_S to collapse to the same constant. As suggested in Corollary 3, we consider a constrained optimization instead of an unconstrained one.
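A minimal sketch of the inverse predictive objective in eq. (3), reusing the projection-head design from the CPC sketch above; as noted, it should only be used together with a constraint such as the contrastive objective.

```python
import torch.nn.functional as F

def inverse_predictive_loss(x, s, enc_x, enc_s, proj):
    """Inverse predictive learning (eq. (3)): encourage the projected signal
    representation to reconstruct the projected input representation,
    assuming a Gaussian with identity covariance (MSE in projection space)."""
    zx = proj(enc_x(x))
    zs = proj(enc_s(s))
    return F.mse_loss(zx, zs)    # degenerate alone; combine with a contrastive term
```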
We have connected the SSL strategy presented in Definition 2 to the contrastive learning objective in Corollary 1 and the predictive learning objectives in Corollaries 2 and 3. Bringing their practical aspects together (eq. (1), (2), and (3)), we can pave the way to a larger space of composed SSL objectives:
$$\mathcal{L}_{\text{SSL}} \;=\; \lambda_{\text{CL}}\,\mathcal{L}_{\text{CL}} \;+\; \lambda_{\text{FP}}\,\mathcal{L}_{\text{FP}} \;+\; \lambda_{\text{IP}}\,\mathcal{L}_{\text{IP}}, \qquad (4)$$
where L_CL, L_FP, and L_IP denote the objectives in eq. (1), (2), and (3), and λ_CL, λ_FP, and λ_IP are hyper-parameters.
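Putting the pieces together, a sketch of the composite objective in eq. (4); the helper functions are the hedged sketches defined earlier, and the weight names are illustrative.

```python
def ssl_objective(x, s, enc_x, enc_s, proj, dec_r,
                  lambda_cl=1.0, lambda_fp=0.0, lambda_ip=0.0):
    """Composite SSL loss (eq. (4)): weighted sum of contrastive,
    forward predictive, and inverse predictive terms (all minimized)."""
    loss = 0.0
    if lambda_cl:
        loss = loss + lambda_cl * cpc_loss(x, s, enc_x, enc_s, proj)
    if lambda_fp:
        loss = loss + lambda_fp * forward_predictive_loss(x, s, enc_x, dec_r)
    if lambda_ip:
        loss = loss + lambda_ip * inverse_predictive_loss(x, s, enc_x, enc_s, proj)
    return loss
```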
Our goal is to construct a set of controlled experiments that satisfy Assumption 1 and can empirically support Theorems 1 and 2.
Experimental Setup. We use the Omniglot dataset lake2015human in the experiments. The training and test sets contain images from disjoint sets of characters, so there is no character overlap between them. Each character contains twenty examples drawn by twenty different people. We regard an image as the input (X) and generate the self-supervised signal (S) by first sampling an image from the same character as the input image and then applying translation/rotation to it. Furthermore, we represent the task-relevant information (T) by a one-hot label encoding. Under this self-supervised signal construction, the exclusive information in X or S consists of the drawing styles (i.e., by different people) and the image augmentations, and only their shared information contributes to T. To show the latter formally, since T is the label of both X and S, P(T|X) and P(T|S) are Dirac. Hence, I(X;T|S) = 0 and I(S;T|X) = 0, satisfying Assumption 1.
We train the feature mapping F_X with the SSL objectives (see eq. (4)), let F_S be symmetric to F_X, and let the projection head g be an identity mapping. On the test set, we fix the mapping F_X and randomly select examples per character as the labeled examples. Then, we classify the rest of the examples using a 1-nearest-neighbor classifier based on cosine similarity of the features (i.e., Z). Random performance on this task is at chance level, i.e., roughly one over the number of test characters.
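A sketch of this evaluation protocol (1-nearest-neighbor classification by cosine similarity of the frozen features); variable names are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def one_nn_accuracy(enc_x, labeled_x, labeled_y, query_x, query_y):
    """Classify query images by the label of their nearest labeled example
    under cosine similarity of frozen features z = enc_x(x)."""
    z_lab = F.normalize(enc_x(labeled_x), dim=1)   # (m, d) labeled features
    z_qry = F.normalize(enc_x(query_x), dim=1)     # (q, d) query features
    sims = z_qry @ z_lab.t()                       # (q, m) cosine similarities
    pred = labeled_y[sims.argmax(dim=1)]           # label of the nearest neighbor
    return (pred == query_y).float().mean().item()
```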
Results & Discussions.
In Figure 3, we provide an empirical analysis to support Theorems 1 and 2. We report estimates of I(Z;S), I(Z;T), H(Z|S), and H(Z|T) for Z during training, and report I(X;S)/I(X;T) as the upper bounds of I(Z;S)/I(Z;T). For the objectives, we consider L_CL (contrastive learning only) for Theorem 1/Corollary 1 and L_CL + λ_IP L_IP (contrastive and inverse predictive learning) for Theorem 2/Corollary 3. In Figure 3 (a) and (b), we observe a positive correlation between I(Z;S) and I(Z;T), which implies that the SSL objectives can extract task-relevant information. Moreover, comparing to L_CL only, L_CL + λ_IP L_IP has larger values given the same epoch or the same I(Z;S).
Figure 4: Comparisons for different compositions of SSL objectives on self-supervised visual representation training. We report the mean and its standard error from random trials.
In Figure 4, we evaluate the generalization ability on the test set for different SSL objectives. Figure 4 (a)/(b) compare L_CL and L_FP: one objective 1) reaches better test accuracy; 2) requires fewer training epochs to reach its best performance; and 3) suffers from overfitting under long-epoch training. Combining both of them (L_CL + L_FP) brings their advantages together. We also find that adding L_IP to the objective can boost model performance. According to Theorem 2 and Corollary 3, the improved performance suggests that a more compressed representation results in better performance on the downstream tasks. Nonetheless, in Figure 4 (c), we find the performance is sensitive to the hyper-parameter λ_IP for combining L_IP. We would also like to examine whether combining L_CL and L_IP can lead to improved performance in a SOTA SSL framework. In Figure 4 (d), we provide an experiment with SimCLR chen2020simple on CIFAR10 krizhevsky2009learning , where λ_IP = 0 refers to the exact same setup as in SimCLR (which considers only L_CL). By considering L_IP in SimCLR and varying λ_IP, we observe a similar trend to our Omniglot experiment.
So far, we have provided empirical support for Theorems 1/2 and compared different SSL objectives on the visual representation learning task. Under this task, the input and self-supervised signals lie in the same domain and have the same content (i.e., images of the same character) but different styles and image augmentations. We now consider having the input and self-supervised signals lie in very different modalities: vision and text.
Experimental Setup. We provide experiments using the Microsoft COCO (MS COCO) dataset coco , which contains multi-labeled images with labeled instances from 80 object categories. Each image has five annotated captions describing the relationships between the objects in the scene.
We regard the image as the input (X) and its textual descriptions as the self-supervised signal (S), and we use a combination of the objectives in eq. (4) as our SSL objective. We use a ResNet50 he2016deep image encoder for F_X (trained from scratch or fine-tuned from ImageNet pre-training) and a BERT devlin2018bert text encoder for F_S (trained from scratch or initialized from pre-trained weights).
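A hedged sketch of how the visual-textual setting plugs into the same objectives: a ResNet50 encodes the image (input) and a BERT encoder pools the caption (self-supervised signal). Module and field names (e.g., pooler_output) follow common torchvision/transformers usage and are illustrative, not the exact modules in our released code.

```python
import torch.nn as nn
import torchvision.models as models
from transformers import BertModel

class ImageEncoder(nn.Module):
    """F_X: ResNet50 backbone (from scratch or ImageNet-initialized) -> feature vector."""
    def __init__(self, pretrained=False, out_dim=512):
        super().__init__()
        backbone = models.resnet50(pretrained=pretrained)
        backbone.fc = nn.Identity()                 # keep the 2048-d pooled feature
        self.backbone, self.head = backbone, nn.Linear(2048, out_dim)
    def forward(self, images):
        return self.head(self.backbone(images))

class TextEncoder(nn.Module):
    """F_S: BERT encoder (from scratch config or pre-trained) -> feature vector."""
    def __init__(self, out_dim=512):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(self.bert.config.hidden_size, out_dim)
    def forward(self, input_ids, attention_mask):
        pooled = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        return self.head(pooled)

# The outputs of the two encoders can be fed to the contrastive and
# inverse predictive loss sketches shown earlier.
```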
Figure 5: Comparisons for different settings of self-supervised visual-textual representation training. We report metrics on the MS COCO validation set with the mean and standard deviation from random trials. Micro ROC-AUC and Subset Accuracy are higher-is-better; Hamming Loss is lower-is-better.
Results & Discussions. First, Figure 5 (a) suggests that the SSL strategy can work when the input and self-supervised signals lie in different modalities: the Raw BERT + Raw ResNet setting achieves a subset accuracy far above that of a random guess. We also see that using a pre-trained ResNet can further improve the self-supervisedly learned representation, while using pre-trained BERT does not give an obvious benefit. Next, Figure 5 (b) suggests that the self-supervisedly learned representations can be further improved by combining L_CL and L_IP. In Figure 5 (c)/(d), we have a similar observation as in the self-supervised visual representation learning experiment: the performance is sensitive to the hyper-parameter λ_IP.
Our work aims at providing theoretical insights for the empirical success of self-supervised learning. The most related work is Unsupervised Contrastive Learning Theory arora2019theoretical , which assumes two similar data points (i.e., one stands for the input and the other stands for the corresponding self-supervised signal) have the same latent class, and a downstream classification task is comprised of a subset of the latent classes. The work then presented 1) provable guarantees for the downstream classification using contrastively learned representations; and 2) a generalization bound such that the learned representations can reduce (labeled) sample complexity on downstream tasks. Our work differs in two ways: 1) we present a different assumption that only the shared information between the input and self-supervised signals contributes to the downstream tasks; and 2) we do not constrain the type of the downstream tasks to classification; they could also be regression, clustering, etc.
Multi-view learning xu2013survey also closely relates to our work. Specifically, we can regard the input and self-supervised signals as two different views of the data, and self-supervised learning aims at learning useful representations across views. Sridharan et al. sridharan2008information pose the underlying assumption for multi-view learning: either view alone is sufficient for the downstream tasks (see Assumption 1 in sridharan2008information ). Their assumption is synonymous with our Assumption 1. Note that they focus on the semi-supervised setting while we focus on the unsupervised setting. Another recent work federici2020learning combines multi-view learning and the information bottleneck tishby2000information method to balance the trade-off between extracting joint multi-view information and discarding non-joint multi-view information.
On the empirical side, we explain why contrastive agrawal2015learning ; arandjelovic2017look ; jayaraman2015learning ; oord2018representation ; bachman2019learning ; chen2020simple ; tian2019contrastive ; hjelm2018learning ; he2019momentum ; kong2019mutual ; ozair2019wasserstein ; arora2019theoretical ; henaff2019data and predictive learning zhang2016colorful ; pathak2016context ; vondrick2016generating ; tulyakov2018mocogan ; srivastava2015unsupervised ; peters2018deep ; devlin2018bert ; dai2019transformer ; bai2018empirical approaches represent good self-supervised learning objectives, showing that these objectives can (unsupervisedly) extract task-relevant information.
In this paper, we studied self-supervised learning via an information-theoretical perspective. We designed a self-supervised learning framework to extract task-relevant information and discard task-irrelevant information. We also connected this framework with prior self-supervised learning methods, specifically for contrastive and predictive learning objectives. To support our theoretical analysis empirically, we designed controlled experiments on visual representation learning and visual-textual representation learning. We believe this work sheds light on the advantages of self-supervised learning and may help better understand when and why self-supervised learning is likely to work. In the future, we plan to investigate, compare, and combine different deployments of contrastive learning, forward predictive learning, and inverse predictive learning objectives. Another area of interest for future exploration is multi-modality self-supervised learning.
This work was supported in part by DARPA grants FA875018C0150 and HR00111990016, NSF IIS1763562, NSF Awards #1750439 and #1722822, the National Institutes of Health, and Apple. We would also like to acknowledge NVIDIA's GPU support.
2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3):55–75, 2018.
Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
If P(Z|X) is Dirac, then the following conditional independences hold: Z ⊥ S | X and Z ⊥ T | X, given by the Markov chain S/T → X → Z.
When Z is a deterministic function of X, for any set A in the sigma-algebra induced by Z we have P(Z ∈ A | X) = P(Z ∈ A | X, S) = P(Z ∈ A | X, T), which implies Z ⊥ S | X and Z ⊥ T | X. ∎
Bringing the redundancy assumption and determinism lemma together, we get:
The representation Z is redundant to the self-supervised signal S for the task-relevant information T, meaning I(Z;T|S) = 0.
By the redundancy Assumption, I(X;T|S) = 0. Also, since Z is a deterministic function of X, 0 ≤ I(Z;T|S) ≤ I(X;T|S) = 0. ∎
We consider the supervision decomposition from T to S:
$$I(Z;T) = I(Z;S) - I(Z;S|T).$$
Also,
$$I(X;T) = I(X;S) - I(X;S|T).$$
Plug I(Z;T|S) = 0 (see Lemma 4) into the chain rules of mutual information: I(Z;S,T) = I(Z;S) + I(Z;T|S) and I(Z;S,T) = I(Z;T) + I(Z;S|T). Likewise, plug I(X;T|S) = 0 (see the redundancy Assumption) into the chain rules for I(X;S,T), yielding I(X;S) + I(X;T|S) = I(X;T) + I(X;S|T). ∎
To ease the understanding of the paper, we provide an information-diagram version of our road map for our derivations. Note that information diagram provides easy-to-understand relationships between information measurements. We encourage the readers to refer to the main text for formal proofs and statements of the results.
At first, we introduce compressed supervised representation learning by minimizing H(Z|T) and maximizing I(Z;T). This supervisedly learned representation contains only and no more than the task-relevant information, and hence is believed to be optimally compressed (for the downstream tasks). Then, to connect with self-supervised learning, we perform a supervision transition from the downstream task T to the self-supervised signal S. Under some derivations, we show that minimizing H(Z|S) discards task-irrelevant information and maximizing I(Z;S) extracts task-relevant information, even when these two objectives have no access to downstream tasks. The resulting optimally learned representation contains only and no more than the shared information between X/S. Last, we demonstrate that this representation extracts all task-relevant information from X and that I(X;S|T) is the information that cannot be discarded.
Our derivations are based on the following assumption and lemmas. The core assumption is that the input and self-supervised signal are mutually redundant for the downstream tasks. The assumption suggests that the exclusive information in the input and the self-supervised signal is what we can discard. Next, using the fact that Z is deterministic given X, we characterize conditional independence by the Markov chain S/T → X → Z. This lemma simply states that post-processing (i.e., X to Z) cannot introduce additional information. Last, based on the redundancy assumption and the determinism lemma, we present the supervision decomposition that is used for transiting supervision from the downstream task to the self-supervised signal.
After depicting our theories and their derivations, we connect our SSL framework with prior work zhang2016colorful ; devlin2018bert ; oord2018representation ; bachman2019learning ; chen2020simple ; tian2019contrastive ; hjelm2018learning ; he2019momentum , discussing practical implementations of the different SSL objectives.
In the main text, we design controlled experiments on self-supervised visual representation learning to empirically support our theorems and examine different compositions of SSL objectives. In this section, we discuss 1) the architecture design; 2) different deployments of contrastive/forward predictive learning; and 3) different self-supervised signal construction strategies. We argue that these three additional sets of experiments may point to interesting future work.
The input image is a grayscale Omniglot character. For image augmentations, we adopt 1) rotation; 2) translation; 3) scaling both width and height; 4) scaling the width while fixing the height; and 5) resizing the image. Then, a deep network takes an image and outputs a feature vector. The network F_X stacks convolutional layers with 3x3 kernels and max-pooling layers with 2x2 kernels, followed by a fully-connected layer. F_S is symmetric to F_X and has the exact same number of parameters. Note that we use the same network designs in the estimations of the contrastive and predictive objectives. To reproduce the results in our experimental section, please refer to our released code (https://github.com/yaohungt/Demystifying_Self_Supervised_Learning).
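Since the exact layer sizes are not recoverable from the text above, the following is a hypothetical sketch of the encoder pattern described (3x3 convolutions, 2x2 max-pooling, a final linear layer); the channel widths and feature dimension are illustrative placeholders only.

```python
import torch.nn as nn

def make_omniglot_encoder(out_dim=128):
    """Conv(3x3) + MaxPool(2x2) blocks followed by a linear layer, as described
    above; channel widths and out_dim are illustrative, not the paper's values."""
    return nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(128, out_dim),
    )
```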
In the main text, for the practical deployments, we suggest Contrastive Predictive Coding (CPC) oord2018representation for L_CL and assume Gaussian distributions for the variational distributions in L_FP/L_IP. The practical deployments can be varied by using different mutual information approximations for L_CL and by making different distributional assumptions for L_FP/L_IP. In the following, we discuss a few examples.
Contrastive Learning. Other than CPC oord2018representation , another popular contrastive learning objective is JS bachman2019learning , which is a lower bound of the Jensen-Shannon divergence between P(X,S) and P(X)P(S) (a variational bound of mutual information). Its objective can be written as
$$\max_{F_X,\,F_S,\,g}\;\mathbb{E}_{(x,s)\sim P(X,S)}\Big[-\mathrm{sp}\big(-\langle g(F_X(x)),\,g(F_S(s))\rangle\big)\Big]\;-\;\mathbb{E}_{x\sim P(X),\,s\sim P(S)}\Big[\mathrm{sp}\big(\langle g(F_X(x)),\,g(F_S(s))\rangle\big)\Big],$$
where we use sp(a) to denote the softplus function log(1 + e^a).
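A sketch of the JS objective above, applying the softplus sp(a) = log(1 + e^a) to paired (joint) and unpaired (product-of-marginals) scores within a batch; names mirror the CPC sketch and are illustrative.

```python
import torch
import torch.nn.functional as F

def js_loss(x, s, enc_x, enc_s, proj):
    """JS-based contrastive objective: maximize
    E_joint[-sp(-score)] - E_product[sp(score)] (returned negated, to minimize)."""
    zx, zs = proj(enc_x(x)), proj(enc_s(s))
    scores = zx @ zs.t()                              # (n, n) pairwise scores
    joint = scores.diag()                             # positive (paired) scores
    n = scores.size(0)
    mask = ~torch.eye(n, dtype=torch.bool, device=scores.device)
    product = scores[mask]                            # negative (unpaired) scores
    bound = (-F.softplus(-joint)).mean() - F.softplus(product).mean()
    return -bound
```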
Predictive Learning. The Gaussian distribution may be the simplest distributional form we can imagine, and it leads to a mean squared error (MSE) reconstruction loss. Here, we use forward predictive learning as an example and discuss the case when S lies in a discrete sample space. Specifically, we let Q_φ(S|Z) be a factorized multivariate Bernoulli:
$$\max_{F_X,\,R}\;\mathbb{E}_{(x,s)\sim P(X,S)}\Big[\sum_{k} s_k \log R_k\big(F_X(x)\big) + (1 - s_k)\log\big(1 - R_k\big(F_X(x)\big)\big)\Big]. \qquad (5)$$
This objective leads to a Binary Cross Entropy (BCE) reconstruction loss.
If we assume each reconstruction loss corresponds to a particular distributional form, then, setting aside which variational distribution we choose, we are free to choose an arbitrary reconstruction loss. For instance, by switching s and R(F_X(x)) in eq. (5), the objective can be regarded as a Reverse Binary Cross Entropy (RevBCE) reconstruction loss. In our experiments, we find RevBCE works the best among {MSE, BCE, RevBCE}. Therefore, in the main text, we choose RevBCE as the example reconstruction loss for L_FP.
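A sketch contrasting the BCE loss in eq. (5) with the reverse variant obtained by switching the two arguments; here s is assumed to be binarized to [0, 1] and s_hat to be the sigmoid output of the reconstruction network (illustrative assumptions).

```python
import torch

def bce_reconstruction(s_hat, s, eps=1e-6):
    """Eq. (5): factorized Bernoulli likelihood -> binary cross entropy,
    treating the target s as the Bernoulli outcome and s_hat as its mean."""
    s_hat = s_hat.clamp(eps, 1 - eps)
    return -(s * torch.log(s_hat) + (1 - s) * torch.log(1 - s_hat)).mean()

def rev_bce_reconstruction(s_hat, s, eps=1e-6):
    """RevBCE: the same expression with s and s_hat switched."""
    s = s.clamp(eps, 1 - eps)
    return -(s_hat * torch.log(s) + (1 - s_hat) * torch.log(1 - s)).mean()
```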
More Experiments. We provide an additional set of experiments using {CPC, JS} for L_CL and {MSE, BCE, RevBCE} reconstruction losses for L_FP in Figure 6. From the results, we find that different formulations of the objectives bring very different test generalization performance. We argue that, given a particular task, it is challenging but important to find the best deployments of the contrastive and predictive learning objectives.
In the main text, we design a self-supervised signal construction strategy in which the input (X) and the self-supervised signal (S) differ in {drawing styles, image augmentations}. This strategy differs from the one commonly adopted in most self-supervised visual representation learning work tian2019contrastive ; bachman2019learning ; chen2020simple . Specifically, prior work considers the difference between the input and the self-supervised signal only in image augmentations. We provide additional experiments in Figure 7 to compare these two self-supervised signal construction strategies.
We see that, compared to the common self-supervised signal construction strategy tian2019contrastive ; bachman2019learning ; chen2020simple , the strategy introduced in our controlled experiments generalizes much better to the test set. It is worth noting that, although our construction strategy has access to the label information (i.e., we sample the self-supervised signal image from the same character as the input image), our SSL objectives do not train with the labels. Nonetheless, since we implicitly utilize the label information in our self-supervised signal construction strategy, it would be unfair to directly compare our strategy with the prior one. An interesting future research direction is examining different self-supervised signal construction strategies and even combining full or partial label information into self-supervised learning.
Subset Accuracy multilbl , also known as the Exact Match Ratio (MR), ignores all partially correct outputs (considering them incorrect) and extends accuracy from the single-label case to the multi-label setting.
Micro ROC-AUC aucroc computes the area under the receiver operating characteristic (ROC) curve, micro-averaged over all labels.
Hamming Loss multilbl is the fraction of wrong labels to the total number of labels.
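A sketch of the three metrics using scikit-learn; y_true and y_score are assumed to be a binary label matrix and a predicted-score matrix of shape (n_samples, n_labels), and the 0.5 threshold is an illustrative choice.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, hamming_loss

def multilabel_metrics(y_true, y_score, threshold=0.5):
    """Micro ROC-AUC, subset accuracy (exact match ratio), and Hamming loss."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return {
        "micro_roc_auc": roc_auc_score(y_true, y_score, average="micro"),
        "subset_accuracy": accuracy_score(y_true, y_pred),   # exact match over all labels
        "hamming_loss": hamming_loss(y_true, y_pred),
    }
```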