Do Self-Supervised and Supervised Methods Learn Similar Visual Representations?

10/01/2021 ∙ by Tom George Grigg, et al. ∙ Apple Inc.

Despite the success of a number of recent techniques for visual self-supervised deep learning, there remains limited investigation into the representations that are ultimately learned. By using recent advances in comparing neural representations, we explore this direction by comparing a contrastive self-supervised algorithm (SimCLR) to supervision for simple image data in a common architecture. We find that the methods learn similar intermediate representations through dissimilar means, and that the representations diverge rapidly in the final few layers. We investigate this divergence, finding that it is caused by these layers strongly fitting to the distinct learning objectives. We also find that SimCLR's objective implicitly fits the supervised objective in intermediate layers, but that the reverse is not true. Our work particularly highlights the importance of the learned intermediate representations, and raises important questions for auxiliary task design.







1 Introduction

In the last two decades, progress in solving visual tasks has primarily been driven by convolutional neural networks (CNNs)

[Deng et al., 2009, Dosovitskiy et al., 2021, He et al., 2016, Krizhevsky et al., 2012, LeCun et al., 1989, Russakovsky et al., 2014, Zeiler and Fergus, 2014] trained on large labelled datasets via supervised learning (SL). More recently, self-supervised learning (SSL) algorithms have started to close the performance gap [Alayrac et al., 2020, Caron et al., 2020, 2021, Chen et al., 2020a, b, Grill et al., 2020, Zbontar et al., 2021].

The success of these visual SSL algorithms raises important questions from a representation learning perspective: how are SSL methods building competitive representations without access to class labels? Do learned representations differ between SL and SSL? If so, can/should we encourage them to be similar? Do different SSL objectives learn qualitatively different representations? In this work, we begin to shed light on these questions by comparing the representations of CIFAR-10 (C10) induced in a ResNet-50 (R50) architecture by SL against those induced by SimCLR, a prominent contrastive SSL algorithm. We find that:

  • Post-residual representations are similar across methods; however, residual (block-interior) representations are dissimilar: similar structure is recovered by solving different problems.

  • Initial residual layer representations are similar, indicating a shared set of primitives.

  • The methods strongly fit to their distinct objectives in the final few layers, where SimCLR learns augmentation invariance and SL fits the class structure.

  • SL does not implicitly learn augmentation invariance, but augmentation invariance does implicitly fit the class structure and induces linear separability.

  • The representational structures rapidly diverge in the final layers, suggesting that SimCLR's performance stems from class-informative intermediate representations rather than implicit structural agreement between the learned solutions to the SL and SimCLR objectives.

Figure 1:

CKA between all layers of R50 networks trained via SimCLR. We show all, odd and even layers in the left, middle and right plots respectively. In contrast to prior work, we compare across different initializations as a sanity check for solution stability.

2 Background

Multi-view visual SSL

Many recent SSL algorithms for visual data focus on a multi-view/augmentation-invariance auxiliary objective, where the model learns to identify different views of the same input image (positive pairs), and distinguish views of different images (negative pairs). Here, we focus on SimCLR [Chen et al., 2020a] as a step towards understanding self-supervised representation learning more broadly. We leave the analysis of alternative visual SSL methods for future work.


SimCLR learns representations by contrasting different views of a single image to views of other images, where views are constructed by applying augmentations $t \sim \mathcal{T}$ sampled from a family $\mathcal{T}$: $z = g(f(t(x)))$, where $f$ is the parametric backbone, typically a CNN, and $g$ is the NCE head, typically an MLP. SimCLR's objective is then to minimize InfoNCE [Chen et al., 2020a, van den Oord et al., 2018]:

$$\ell_{i,j} = -\log \frac{\exp(\operatorname{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\operatorname{sim}(z_i, z_k)/\tau)}$$

where $z_i$, $z_j$ are different views of the same image, $z_k$ are views of different images, $\tau$ is the temperature, and $\operatorname{sim}(u, v) = u^\top v / (\lVert u \rVert\,\lVert v \rVert)$ is cosine similarity.
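As a concrete illustration, the NT-Xent form of InfoNCE for a batch of paired views can be written as follows; this is a minimal NumPy sketch under our own naming (`nt_xent_loss`), not the authors' implementation:

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """NT-Xent (InfoNCE) loss for a batch of 2N projected views.

    `z` has shape (2N, d); rows i and i + N are the two views of image i.
    A sketch of SimCLR's objective, not the reference implementation.
    """
    n2 = z.shape[0]
    n = n2 // 2
    # Cosine similarity between all pairs of views, scaled by temperature.
    z_norm = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z_norm @ z_norm.T) / tau
    # Exclude self-similarity from the softmax denominator.
    np.fill_diagonal(sim, -np.inf)
    # Index of each view's positive partner.
    pos = np.concatenate([np.arange(n, n2), np.arange(0, n)])
    log_prob = sim[np.arange(n2), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()
```

Minimizing this loss pulls positive pairs together and pushes all other views apart in the projected space.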

Comparing neural representation spaces

Comparing neural representations is challenging due to their distributed nature, potential misalignment, and high dimensionality. Prior work has demonstrated the utility of CKA as a similarity index which elegantly addresses these challenges [Kornblith et al., 2019], enabling the analysis of a variety of neural architectures [Kornblith et al., 2019, Nguyen et al., 2021, Raghu et al., 2021].

Let $X \in \mathbb{R}^{n \times p_1}$ and $Y \in \mathbb{R}^{n \times p_2}$ be $p_1$- and $p_2$-dimensional representation matrices whose rows are aligned (the $i$th rows of $X$ and $Y$ correspond to the $i$th sample for all $i$). Let $K = XX^\top$ and $L = YY^\top$ be the corresponding Gram matrices. The CKA value is the normalized HSIC [Gretton et al., 2008] of these Gram matrices:

$$\operatorname{CKA}(K, L) = \frac{\operatorname{HSIC}(K, L)}{\sqrt{\operatorname{HSIC}(K, K)\,\operatorname{HSIC}(L, L)}}$$

We use the linear kernel due to its strong empirical performance and computational efficiency, simplifying the calculation of HSIC to $\operatorname{HSIC}(K, L) = \frac{1}{(n-1)^2}\operatorname{tr}(KHLH)$, where $K = XX^\top$, $L = YY^\top$, and $H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is the centering matrix.
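Linear-kernel CKA is straightforward to compute. The sketch below is our own illustrative helper (not the authors' code), following the normalized-HSIC definition; the constant $\frac{1}{(n-1)^2}$ factors cancel in the ratio:

```python
import numpy as np

def linear_cka(x, y):
    """Linear-kernel CKA between representations X (n, p1) and Y (n, p2).

    Rows must be aligned: row i of X and row i of Y come from the same sample.
    """
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n      # centering matrix H
    k = h @ (x @ x.T) @ h                    # centered Gram matrix of X
    l = h @ (y @ y.T) @ h                    # centered Gram matrix of Y
    hsic_kl = np.sum(k * l)                  # proportional to tr(K H L H)
    hsic_kk = np.sum(k * k)
    hsic_ll = np.sum(l * l)
    return hsic_kl / np.sqrt(hsic_kk * hsic_ll)
```

By construction the index lies in [0, 1], equals 1 for identical representations, and is invariant to orthogonal transformations and isotropic scaling of either representation.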

Experimental setup

We use an R50 [He et al., 2016] backbone for each model. For SimCLR, we follow the training in Chen et al. [2020a]. We group representations into residual (odd) and post-residual (even) layers, in line with the analysis of Kornblith et al. [2019]. Further details are outlined in Appendix A.

3 Results

Figure 2: (Left/Middle) CKA between the odd/even layers of networks trained via SimCLR and SL. For the evens, we mark the most similar supervised layer for each SimCLR layer with a white dot. (Right) For each layer in SimCLR, the similarity to its corresponding supervised layer (diag), and to the most similar supervised layer (max). We also denote the block groups (BG; see Section A.1).

3.1 Internal representational structure of SimCLR

We begin by using CKA to study the internal representational similarity of SimCLR in Figure 1. This result mirrors the supervised analysis of Kornblith et al. [2019], indicating that SL and SimCLR utilize the residual architecture in a similar way, with residual blocks decoupling from each other. For completeness, we replicate the SL result for an R50 in Appendix B.

3.2 Comparing early and intermediate SimCLR and supervised representations

Next, we compare the representational structures induced by SimCLR and SL. Plotting the odd- and even-layer CKA matrices across the learning methods in Figure 2, we observe:

Common primitives

Residual representations are similar in the very early layers, perhaps due to both objectives inducing common primitives like Gabor filters [Vincent et al., 2010].

Dissimilar residual (Odd)

Beyond these initial layers, similarity between the residual representations substantially reduces, indicating that each method learns residuals that operate on the input in qualitatively different ways – likely a reflection of their distinct learning objectives.

Similar post-residual (Even)

Despite the dissimilarity of residuals, there is high similarity across the diagonal in the post-residual layers, indicating that the representations accumulated remain similar across learning methods; similar representations are learned in a dissimilar way.

Stalling behaviour

Finally, SimCLR appears to “stall” upon entering a new BG, remaining more similar to previous supervised layers, before “catching up” to the diagonal. This may be induced by SimCLR’s strong augmentations, requiring a broader distribution to be compressed after each BG.

Figure 3: (Left) Linear probe accuracies for learned representations in the SimCLR and sl models. (Middle) CKA between representations of differently augmented datasets at corresponding layers. (Right) CKA of learned representations with the class representations. We plot post-residual (even) layers only, denoting the block groups (BG) and NCE head (Head) where appropriate.

3.3 Late layer representational dissimilarity of SimCLR and supervised learning

Figure 2 (right) indicates that the representational structures learned by SimCLR and SL rapidly diverge in the final BG. Here, we analyze this behaviour.

Linear separability of classes

We first inspect whether this divergence leads to performance differences by computing the performance of linear probes trained at each layer in the SimCLR and supervised networks (Figure 3 (left)). We find a monotonic increase in linear separability of the classes for both methods. This suggests that despite the divergence in later layers, both representational structures continue to become more linearly separable. This raises the question: if the structures are diverging, but both are becoming more separable, what exactly is being learned?
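A per-layer linear probe of this kind can be sketched as follows. We use a closed-form ridge regression onto one-hot labels as a simple stand-in (the paper's probes may differ, e.g. logistic regression), and all names here are our own:

```python
import numpy as np

def linear_probe_accuracy(feats_train, y_train, feats_test, y_test, reg=1e-3):
    """Fit a linear probe on frozen features and report test accuracy.

    Regresses one-hot labels on the (frozen) layer representations via
    ridge regression, then predicts by argmax over class scores.
    """
    n, d = feats_train.shape
    classes = np.unique(y_train)
    onehot = (y_train[:, None] == classes[None, :]).astype(float)
    # Closed-form ridge solution: W = (F^T F + reg I)^{-1} F^T Y.
    w = np.linalg.solve(feats_train.T @ feats_train + reg * np.eye(d),
                        feats_train.T @ onehot)
    preds = classes[np.argmax(feats_test @ w, axis=1)]
    return float(np.mean(preds == y_test))
```

Running such a probe at every layer, with the backbone weights frozen, traces how linearly separable the class structure is as depth increases.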

Augmentation invariance

In Figure 3 (middle), we investigate what happens in the layers of both networks with respect to SimCLR's augmentation invariance objective. Here, we augment each sample in the C10 test dataset with two augmentations sampled from the augmentation distribution used during training (ImageNet augmentations for SL, SimCLR augmentations for SimCLR), creating pairs of augmented test datasets. We measure the degree of invariance at each layer by plotting the CKA value between the representations of these augmented datasets. We observe that SimCLR's representations become more augmentation invariant with depth, increasingly so in the final few layers of the network. This contrasts with SL, where the representations start out similar under (weaker) augmentations, then diverge until the final block group, where we see a small increase in CKA – presumably due to classification. This result tells us that (1) SimCLR does learn substantial augmentation invariance and (2) SL does not implicitly learn augmentation-invariant representations. This is perhaps surprising from the perspective of classification as a form of augmentation invariance where the augmentation distribution is the class-conditioned data distribution. CKA heatmaps are presented in Appendix C.

Mapping to the classes

Next we look at the SL objective which, from a representation learning perspective, maps inputs to their assigned vertices on the simplex in the class representation space. In Figure 3 (right), we plot the CKA similarity between the class representations and the learned representations in the layers of the SimCLR and supervised networks. We observe a monotonic increase in CKA with the class structure for both methods throughout the backbone, offering insight into the increasing linear separability. It is however clear that SL accelerates much more rapidly towards the class structure in the final block group due to explicit optimization – likely explaining the divergence of SL and SimCLR in Figure 2. We also observe a decrease in similarity to the classes after the first layer of the NCE head, perhaps revealing the NCE head's role as a buffer which allows the backbone to learn richer class-informative features rather than immediately fit to InfoNCE.
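Similarity to the class structure can be measured by treating each sample's one-hot label (its simplex vertex) as a target representation and computing linear CKA against it. The helper below is our own illustrative sketch, not the authors' code:

```python
import numpy as np

def cka_with_classes(feats, labels):
    """Linear CKA between learned features and the one-hot class representation.

    Each label is mapped to its simplex vertex (one-hot vector), and the
    resulting (n, C) matrix is compared to the features via linear CKA.
    """
    classes = np.unique(labels)
    y = (labels[:, None] == classes[None, :]).astype(float)  # (n, C) one-hot
    n = feats.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n                      # centering matrix
    k = h @ (feats @ feats.T) @ h
    l = h @ (y @ y.T) @ h
    return np.sum(k * l) / np.sqrt(np.sum(k * k) * np.sum(l * l))
```

A value near 1 at a given layer indicates that the representation's Gram structure is already close to the block structure induced by the class partition.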

4 Conclusion

We have shown the utility of CKA for comparing across learning methods, rather than architectures. Using this tool, we have demonstrated that SimCLR representations are similar to those of supervised learning in their intermediate layers. Interestingly, we see divergence in the final few layers, where each method fits to its own objective. Here, SimCLR learns augmentation invariance, contrasting with supervised learning, which instead strongly projects to the class simplex. This suggests that SimCLR's strong empirical performance is facilitated not by structural similarity between the final representations learned by the two objectives, but rather by the similarity of the intermediate representations, i.e. the class-informative features that happen to be learned along the way.

These findings raise important questions for auxiliary task design: Can we build label-free tasks that share more intermediate features with supervised learning? Should we include inductive biases that look like “mapping to the simplex”, e.g. orthogonality? Is mapping to the simplex desirable? Or are self-supervised representations more robust in a multi-task/multi-distribution setting? We leave these questions for future work.


Appendix A Experimental Setup


We choose the same ResNet-50 architecture [He et al., 2016] for SimCLR and the supervised model. We follow the training procedure described in Chen et al. [2020a]: all models use the LARS optimizer [Huo et al., 2021] with linear warmup [Goyal et al., 2017] and a single-cycle cosine-annealed learning rate schedule [Goyal et al., 2017, Smith and Topin, 2017]. SimCLR models are trained for 1300 epochs with a batch size of 4096 under SimCLR augmentations [Chen et al., 2020a], whereas our supervised models are trained for 300 epochs using a batch size of 8192 under standard ImageNet augmentations (RandomResizedCrop(224), RandomHorizontalFlip and channel-wise standardisation). For SimCLR, we implement the original version of Chen et al. [2020a], where in particular the NCE head is a 3-layer MLP.

For each learning method, we train from 3 different random initializations, resulting in 6 models. Models are trained on the training set of CIFAR-10 (50,000 samples) [Krizhevsky and Hinton, 2009]. Representations are produced on the test set of CIFAR-10 (10,000 samples) under each model's corresponding test augmentation family. Representations are flattened, producing a single vector for each sample.

A.1 Even and Odd representations

For each bottleneck layer, we extract the following representations:

    class Bottleneck(nn.Module):
        # Other definitions ...

        def forward(self, x: Tensor) -> Tensor:
            identity = x

            out = self.conv1(x)
            out = self.bn1(out)
            out = self.relu(out)

            out = self.conv2(out)
            out = self.bn2(out)
            out = self.relu(out)

            out = self.conv3(out)
            ODD_REPRESENTATIONS[i] = out = self.bn3(out)

            if self.downsample is not None:
                identity = self.downsample(x)

            out += identity
            EVEN_REPRESENTATIONS[i] = out = self.relu(out)

            return out

i.e. two representations per bottleneck.

A ResNet-50 is built out of 4 block groups, each subsequent group increasing dimensionality (see Table 1). The total number of bottleneck layers across all groups is 3 + 4 + 6 + 3 = 16, resulting in 16 odd representations and 16 even representations that we use in our analysis.

Group Name | Number of Bottlenecks | Filters in each Bottleneck
BG1 | 3 | 64, 64, 256
BG2 | 4 | 128, 128, 512
BG3 | 6 | 256, 256, 1024
BG4 | 3 | 512, 512, 2048
Table 1: The filter properties and bottleneck multiplicities in each BG of a ResNet-50.

Appendix B Internal representational structure for supervised learning

Figure 4: CKA between all layers of ResNet-50 networks trained via supervision. We plot all layers in the left column, and even/odd layers in the middle/right.

Here, we replicate the results of Kornblith et al. [2019] in our experimental setup. In particular, in Figure 4 we use CKA to compare the learned representations of ResNet-50 architectures trained via supervised learning, as specified in Appendix A. We note that in contrast to their work, we compare across different initializations in order to check for solution stability.

Corroborating their findings, we observe high similarity across neighbouring post-residual (even) layers in the network, and greater dissimilarity between residual (odd) layers, which largely appear similar only to themselves. The similarity of even layers is explained by the residual connections propagating representations through the network. The dissimilarity of odd layers suggests that each sequential block performs a distinct modification to this propagated residual representation. The similarity within block groups (i.e. at the same dimensionality) is higher than across block groups for all layers. We further note that the results for SimCLR (Figure 1) mirror those of supervised learning, except there appears to be even greater disagreement across block groups.

Appendix C Full Augmentation Invariance CKA Heatmaps

In Figure 5 we provide the all-layers CKA comparisons of the representations of differently augmented test datasets in the same model. The diagonals of Figure 5 correspond to Figure 3 (middle). The augmentation invariance of the supervised model gradually fades starting from the bottom left corner, suggesting that it is initially due to the residual connections and weak augmentation strategy. The SimCLR plot is striking: substantial (but not total) invariance is learned in the NCE head, and this backpropagates into the final few layers of the backbone. However, the representations show limited robustness to augmentation right up until these last few layers of the network.

Figure 5: We apply the method-specific training augmentations to the CIFAR-10 test dataset and plot the CKA of the representations as they propagate through the supervised and SimCLR models.