Leveraging Hidden Structure in Self-Supervised Learning

by Emanuele Sansone, et al.

This work considers the problem of learning structured representations from raw images using self-supervised learning. We propose a principled framework based on a mutual information objective, which integrates self-supervised and structure learning. Furthermore, we devise a post-hoc procedure to interpret the meaning of the learnt representations. Preliminary experiments on CIFAR-10 show that the proposed framework achieves higher generalization performance in downstream classification tasks and provides more interpretable representations compared to the ones learnt through traditional self-supervised learning.





1 Introduction

Self-supervised learning has gained popularity in recent years thanks to the remarkable successes achieved in many areas, including natural language processing and computer vision. In this work, we consider extending self-supervised learning to learn structured representations directly from still images.

The contributions consist of (i) a new self-supervised framework, which leverages the hidden structure contained in the data and (ii) a learning procedure to interpret the meaning of the learnt representations.

The paper starts with the description of the proposed framework in Section 2, continues with the discussion of the related work in Section 3 and concludes with the experiments in Section 4.

2 Proposed Framework

Figure 1: Visualization of the two critic functions in Eq. 3. Note that the critic functions $f_1$ and $f_2$ are applied to pairs of positive samples, i.e. an input and its augmented version, and to pairs of negative samples. Positive and negative pairs are used to estimate the first and the last two addends of Eq. 3, respectively. Regarding the network architectures, we used a Conv-4 and a Resnet-8 for the encoder $e$ and a multi-layer perceptron network for the critics, using the settings of (Patacchiola & Storkey, 2020). For the structure function $g$, we used a two-layer message passing neural network (MPNN) with a complete graph, following (Kipf et al., 2018) (more details are provided in the Appendix).

Figure 2: Strategy used to analyze the representations learnt by the encoder.

2.1 Objective

Let us consider an input tensor $X$, a latent tensor $Z$, obtained by transforming $X$ through an encoding function $e$ and a reshaping operator, and a latent structure tensor $R$, obtained by transforming $Z$ through a function $g$. For example, $X$ can be an observed image, $Z$ contains the representations of entities in the image and $R$ represents the relations among these entities.¹

¹The relation type includes the case of no relation between two entities. We experimented with a single choice of the number of relation types.

If we regard $X$, $Z$ and $R$ as random tensors, associated to the input, the latent representation and the latent structure, respectively, we can define the generalized mutual information over these random quantities (equivalently, the total correlation of $(X, Z, R)$):

$$I(X;Z;R) = KL\big(p(x,z,r)\,\|\,p(x)\,p(z)\,p(r)\big) \tag{1}$$

where $KL$ is the Kullback-Leibler divergence and $p(x,z,r)$, $p(x)$, $p(z)$, $p(r)$ are the joint probability density and the marginal densities of the random quantities, respectively.

Now, let us assume that $X$ and $R$ are conditionally independent given $Z$. Note that this is quite reasonable in practice. For example, if we consider an image containing a set of objects and a set of relations (e.g. right/left of, above/below, smaller/bigger), these two sets exist independently from each other and are instantiated through $Z$ and $R$ by the observed scene. Thanks to this assumption, Eq. 1 simplifies into the following form (see the Appendix for the derivation):

$$I(X;Z;R) = I(X;Z) + I(Z;R) \tag{2}$$
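As a quick sanity check of this decomposition, the identity in Eq. 2 can be verified numerically on a small discrete distribution satisfying the conditional independence assumption. The following sketch (not part of the original experiments) builds a random joint $p(x,z,r) = p(z)\,p(x\mid z)\,p(r\mid z)$ and compares the total correlation of Eq. 1 with the sum of the two mutual information terms:

```python
import numpy as np

def mutual_info(pab):
    """I(A;B) in nats from a joint probability table p(a,b)."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float((pab[mask] * np.log(pab[mask] / (pa * pb)[mask])).sum())

rng = np.random.default_rng(0)
nx, nz, nr = 3, 4, 2

# Build p(x,z,r) = p(z) p(x|z) p(r|z): X and R are independent given Z.
pz = rng.dirichlet(np.ones(nz))
px_z = rng.dirichlet(np.ones(nx), size=nz)      # rows: p(x|z)
pr_z = rng.dirichlet(np.ones(nr), size=nz)      # rows: p(r|z)
pxzr = np.einsum('z,zx,zr->xzr', pz, px_z, pr_z)

# Total correlation: KL(p(x,z,r) || p(x) p(z) p(r))  (Eq. 1)
px, pzm, pr = pxzr.sum((1, 2)), pxzr.sum((0, 2)), pxzr.sum((0, 1))
prod = np.einsum('x,z,r->xzr', px, pzm, pr)
tc = float((pxzr * np.log(pxzr / prod)).sum())

# Decomposition I(X;Z) + I(Z;R)  (Eq. 2)
ixz = mutual_info(pxzr.sum(axis=2))             # joint over (x, z)
izr = mutual_info(pxzr.sum(axis=0))             # joint over (z, r)

assert abs(tc - (ixz + izr)) < 1e-10
```

The check holds for any distribution factorizing as above; it fails as soon as a direct dependence between $X$ and $R$ given $Z$ is introduced.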
Now, we are ready to provide an estimator for the quantity in Eq. 2, by using existing variational bounds (Barber & Agakov, 2004). Specifically, we focus on the bound proposed in (Nguyen et al., 2010) for its tractability and unbiasedness, thus obtaining the following criterion (see the Appendix for the derivation):

$$I(X;Z) + I(Z;R) \geq \mathbb{E}_{p(x,z)}[f_1(x,z)] - e^{-1}\,\mathbb{E}_{p(x)p(z)}\big[e^{f_1(x,z)}\big] + \mathbb{E}_{p(z,r)}[f_2(z,r)] - e^{-1}\,\mathbb{E}_{p(z)p(r)}\big[e^{f_2(z,r)}\big] \tag{3}$$

where $f_1$ and $f_2$ are two critic functions. The parameters of $f_1$ include those of the encoder $e$, while the parameters of $f_2$ include those of both $e$ and the structure function $g$. For simplicity, we visualize and describe all these terms, including the two critic functions, in Figure 1.

Therefore, the learning problem consists of maximizing the lower bound in Eq. 3 with respect to all network parameters. During training, the encoder learns to structure the representation, guided by the two critic functions, so as to increase not only the mutual information between $X$ and $Z$ but also the mutual information between $Z$ and $R$. In effect, this acts as an inductive bias towards learning more structured representations.
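To make the bound of (Nguyen et al., 2010) concrete, here is a minimal sketch of the estimator for a single term, $I(X;Z)$, on a discrete toy distribution where all expectations can be computed in closed form. With the optimal critic $f^*(x,z) = 1 + \log\frac{p(x,z)}{p(x)p(z)}$ the bound is tight, which the snippet verifies; the critic networks, encoder and training loop of the actual method are omitted here:

```python
import numpy as np

rng = np.random.default_rng(1)

# A small joint distribution p(x, z) over discrete variables.
pxz = rng.dirichlet(np.ones(12)).reshape(3, 4)
px = pxz.sum(axis=1, keepdims=True)
pz = pxz.sum(axis=0, keepdims=True)

true_mi = float((pxz * np.log(pxz / (px * pz))).sum())

# Optimal critic: f*(x, z) = 1 + log p(x,z) / (p(x) p(z)).
f_star = 1.0 + np.log(pxz / (px * pz))

# Lower bound of Eq. 3 (one term): E_{p(x,z)}[f] - e^{-1} E_{p(x)p(z)}[exp(f)],
# computed exactly by summing over the probability tables.
nwj = float((pxz * f_star).sum()
            - np.exp(-1.0) * ((px * pz) * np.exp(f_star)).sum())
assert abs(nwj - true_mi) < 1e-10      # the bound is tight at f*

# Any other critic still yields a valid lower bound.
f_bad = f_star + 0.5 * rng.standard_normal(f_star.shape)
nwj_bad = float((pxz * f_bad).sum()
                - np.exp(-1.0) * ((px * pz) * np.exp(f_bad)).sum())
assert nwj_bad <= true_mi + 1e-10
```

In the actual framework, the critic is a neural network and the two expectations are replaced by Monte Carlo averages over positive and negative pairs, respectively.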

2.2 Interpreting the Learnt Representations

From Figure 1, we can see that the encoder provides a global representation of the data, which is subsequently split/reshaped to form the latent tensor $Z$. It is instructive to understand the role of the rows of $Z$ once the network is trained, in order to test (i) whether the network learns to bind each row of $Z$ to a particular portion of the input and (ii) whether such a portion corresponds to a particular entity of the scene.

To do so, we need to identify which entries of the input tensor (i.e. which pixels) most affect the different rows of $Z$. Specifically, we introduce a soft-binary mask $\sigma(M)$ for each input sample and each row of $Z$, where $\sigma$ is an element-wise standard logistic function and $M$ is the parameter tensor of the mask associated to a specific image and representation, and we then multiply the mask element-wise with the associated input. The final problem consists of minimizing the mutual information between the masked input and the corresponding latent row with respect to the parameters of the mask $M$.

Similarly to the previous case, we need to consider a tractable estimator for the mutual information objective. Differently from the previous case, we devise an upper bound, in order to be consistent with the minimization problem (see the Appendix for the whole derivation), namely:

$$I(\tilde X; Z) \leq \mathbb{E}_{p(\tilde x, z)}\left[\log \frac{p(z \mid \tilde x)}{q(z)}\right] + c \tag{4}$$

where $\tilde X$ denotes the masked input, $q$ is an auxiliary distribution defined over the latent tensor representation, chosen as a multivariate Gaussian density with mean $\mu$ and identity covariance matrix, and $c$ is a constant with respect to the mask parameters. The bound in Eq. 4 becomes tight when $q$ matches the true marginal $p(z)$. It is also interesting to mention that, in theory, the bound in Eq. 4 admits a trivial solution with all masks set to zero. However, in practice, we did not observe convergence to such trivial solutions, probably due to the high nonlinearity of the minimization problem.

Figure 2 shows the building block of the minimization procedure. Once the procedure concludes, we can inspect the latent representations of the encoder by plotting the masks for each image and each latent vector, as discussed in the experimental section.
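The interpretation procedure can be sketched in code. The snippet below is an illustrative reimplementation under simplifying assumptions: a fixed linear map stands in for the trained encoder, a single latent row is considered, and the Gaussian log-density term of the upper bound in Eq. 4 is reduced to a squared error; names such as `W` and `M` are ours, not from the paper:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(2)
d_in, d_lat = 64, 8               # flattened image size, size of one latent row

W = rng.standard_normal((d_lat, d_in)) / np.sqrt(d_in)  # frozen toy "encoder"
x = rng.standard_normal(d_in)                           # one flattened image
z = W @ x                                               # target latent row

# Soft-binary mask: sigmoid of a free parameter tensor M, multiplied
# element-wise with the input. We minimize ||enc(x * mask) - z||^2 (the
# Gaussian term of the upper bound, up to constants) over M by gradient descent.
M = np.zeros(d_in)
init_loss = float(np.sum((W @ (x * sigmoid(M)) - z) ** 2))
lr = 0.3
for _ in range(2000):
    m = sigmoid(M)
    err = W @ (x * m) - z
    grad = (W.T @ err) * x * m * (1.0 - m)   # chain rule through the mask
    M -= lr * grad

final_loss = float(np.sum((W @ (x * sigmoid(M)) - z) ** 2))
assert final_loss < 0.1 * init_loss          # mask keeps the pixels that matter
```

After optimization, the entries of `sigmoid(M)` can be reshaped to image size and plotted, which is the per-row visualization shown in Figure 4.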

3 Related Work

We organize the existing literature in terms of objectives, network architectures and data augmentations/pretext tasks.

Objectives. Mutual information is one of the most common criteria used in self-supervised learning (Linsker, 1988; Becker & Hinton, 1992). Estimating and optimizing mutual information from samples is notoriously difficult (McAllester & Stratos, 2020), especially when dealing with high-dimensional data. Most recent approaches focus on devising variational lower bounds on mutual information (Barber & Agakov, 2004), thus casting the problem of representation learning as a maximization of these lower bounds. Indeed, several popular objectives, like mutual information neural estimation (MINE) (Belghazi et al., 2018), Deep InfoMax (Hjelm et al., 2018) and noise contrastive estimation (InfoNCE) (O. Henaff, 2020), to name a few, all belong to the family of variational bounds (Poole et al., 2019). All of these estimators have different properties in terms of the bias-variance trade-off (Tschannen et al., 2019; Song & Ermon, 2020). Our work generalizes over previous objectives based on mutual information by considering structure as an additional latent variable. Indeed, we focus on learning a structured representation, which aims at better supporting downstream reasoning tasks and at improving interpretability. Therefore, our work is orthogonal to these previous studies, as different estimators can potentially be employed for this extended definition of mutual information.

It is important to mention that new self-supervised objectives have emerged recently (Zbontar et al., 2021; Grill et al., 2020) as a way to avoid using negative samples, which are typically required by variational bound estimators. However, these objectives learn distributed representations and disregard the notion of hidden structure, which is the focus of the present work.

Architectures. Graph neural networks are among the most common architectures used in self-supervised learning on relational data (Kipf & Welling, 2016; Kipf et al., 2018; Davidson et al., 2018; Velickovic et al., 2019). While these works provide representations of the underlying graph data, thus supporting downstream tasks like graph/node classification and link prediction/graph completion, they typically assume that nodes/entities/symbols are known a priori. This work relaxes this assumption and devises a criterion able to integrate traditional representation learning with graph representation learning in a principled way.

Autoencoding architectures are also widespread in representation learning. However, traditional VAE-based architectures (Kingma & Welling, 2014; Rezende et al., 2014) are typically not enough to learn disentangled representations (Locatello et al., 2019). Recent work on autoencoders focuses on introducing structured priors by learning both object representations (Greff et al., 2017) and their relations (van Steenkiste et al., 2018; Goyal et al., 2021; Stanić et al., 2021) directly from raw perceptual data in dynamic environments. Autoencoders have also been applied to synthetic settings to perform unsupervised scene decomposition from still images (Engelcke et al., 2019, 2021; Locatello et al., 2020), thus providing representations that better disentangle objects. In common with these works, we aim at learning object representations together with their relational information. Differently from these works, we optimize an objective that does not require a reconstruction term at the pixel level (thus avoiding learning noisy regularities from the raw data) and that avoids using a decoder, thus increasing computational efficiency.

Data augmentations/Pretext tasks. Data augmentation strategies are typically used to produce transformed versions of the data, which are then used to define positive and negative instances (Hjelm et al., 2018; O. Henaff, 2020; Patacchiola & Storkey, 2020). Different views (Bachman et al., 2019; Tian et al., 2019) or different data modalities (Gu et al., 2020) can also be exploited to augment the learning problem. Furthermore, a large body of related work is devoted to the design of pretext tasks, namely predicting augmentations at the patch level (Dosovitskiy et al., 2014), predicting the relative location of patches (Doersch et al., 2015), generating missing patches (Pathak et al., 2016), solving jigsaw puzzles (Noroozi & Favaro, 2016), learning to count (Noroozi et al., 2017), learning to spot artifacts (Jenni & Favaro, 2018), predicting image rotations (Komodakis & Gidaris, 2018), predicting information across different channels (Zhang et al., 2017) and predicting color information from gray images (Zhang et al., 2016; Larsson et al., 2016). All of these strategies are complementary to our work. However, we believe that our generalized objective can enable the development of new pretext tasks that also leverage relational information.

4 Experiments

Figure 3: Training accuracy obtained by a downstream linear classifier on CIFAR-10 during the unsupervised training of the backbones (panel (a): Conv-4, panel (b): Resnet-8). We trained only the linear classifier, for one epoch at regular intervals of unsupervised iterations. Blue, green and red curves correspond to the performance obtained by the method of (Patacchiola & Storkey, 2020), i.e. maximizing $I(X;Z)$, our method with only relational information, i.e. maximizing $I(Z;R)$, and our complete method, i.e. maximizing $I(X;Z)+I(Z;R)$, respectively.

Method                        CIFAR-10              CIFAR-100
                              Conv-4     Resnet-8   Conv-4     Resnet-8
Supervised                    78.7±0.1   86.0±0.4   31.3±0.2   36.8±0.3
Random                        33.2±1.9   36.8±0.2   11.0±0.4   13.6±0.3
I(X;Z)*                       54.2±0.1   56.9±0.4   27.3±0.3   28.2±0.5
I(Z;R) (ours)                 55.9±0.8   60.0±0.1   26.2±0.8   28.4±0.1
I(X;Z)+I(Z;R) (ours)          56.3±0.8   59.5±0.2   27.8±0.4   29.6±0.5
*(Patacchiola & Storkey, 2020)
Table 1: Test accuracies for linear evaluation with different backbones, viz. Conv-4 and Resnet-8. Self-supervised training on CIFAR-10, followed by supervised linear training on CIFAR-10 and CIFAR-100.

The experiments are organized in a way to answer the following three research questions:

  • Q1 (Surrogate): Is the proposed objective a good unsupervised surrogate for traditional supervised learning?

  • Q2 (Generalization): How well do the learnt representations generalize to downstream supervised tasks? And how well do they transfer?

  • Q3 (Interpretation): Can we interpret what the network has learnt to represent?

Given the limited amount of available GPU resources (all experiments are run on 4 NVIDIA GeForce GTX 1080 Ti GPUs), we run experiments on small backbones, namely Conv-4 and Resnet-8, and train them in an unsupervised way on CIFAR-10 (Krizhevsky et al., 2009) for a fixed number of epochs, based on the code and the settings of (Patacchiola & Storkey, 2020) (additional details are available in the Appendix). All experiments are averaged over three repeated trials. As a baseline, we consider the recent self-supervised framework of (Patacchiola & Storkey, 2020), which does not leverage structure and thus maximizes $I(X;Z)$ only. For our framework, we consider two cases, namely (i) the one using only structure, i.e. maximizing $I(Z;R)$, and (ii) the overall one, i.e. maximizing $I(X;Z) + I(Z;R)$.

Surrogate. We evaluate the downstream training performance of a linear classifier during the unsupervised training of the backbone. Specifically, we train a classifier on the training set of CIFAR-10 for one epoch at regular intervals of unsupervised iterations and report its training accuracy. Figure 3 shows the evolution of the training accuracies for the two backbones. From the figures, we see that the baseline performs well during the early stages of training, but it is significantly outperformed by our purely structure-based strategy in the subsequent stages. Furthermore, the overall strategy is able to combine the positive aspects of the other two strategies, thus providing a more principled approach.
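The linear-evaluation protocol used here can be sketched as follows. Synthetic class-conditional Gaussian features stand in for the frozen backbone's outputs (an assumption for illustration only), and a softmax-regression head is trained on top of them with plain gradient descent:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d_feat, n_cls = 600, 32, 10

# Stand-in for frozen backbone features: class-dependent Gaussian blobs.
y = rng.integers(0, n_cls, size=n)
centers = 2.0 * rng.standard_normal((n_cls, d_feat))
feats = centers[y] + rng.standard_normal((n, d_feat))

# Linear probe: softmax regression trained on the frozen features only.
W = np.zeros((d_feat, n_cls))
b = np.zeros(n_cls)
onehot = np.eye(n_cls)[y]
for _ in range(500):
    logits = feats @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    gW = feats.T @ (p - onehot) / n               # cross-entropy gradient
    gb = (p - onehot).mean(axis=0)
    W -= 0.05 * gW
    b -= 0.05 * gb

acc = float((np.argmax(feats @ W + b, axis=1) == y).mean())
assert acc > 0.9     # well-separated blobs are linearly separable
```

In the real protocol, `feats` would come from the frozen Conv-4 or Resnet-8 backbone evaluated on CIFAR images, and test accuracy would be measured on held-out data.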

Generalization. We evaluate generalization (CIFAR-10) and transfer (CIFAR-100) performance using a linear classifier. In this set of experiments, we also include two additional baselines, namely (i) a random one, i.e. an encoder initialized with random weights, and (ii) a supervised one, i.e. an encoder trained in a supervised fashion. Table 1 summarizes the performance of all strategies. From these results, we see that our proposed framework allows for significant improvement both in terms of generalization and transfer over the baselines.

Figure 4: Interpretation of latent representations for different images. Rows tend to specialize on specific portions of the input and in some cases they carry a semantically meaningful interpretation (as highlighted by the orange boxes).

Interpretation. Last but not least, we run the proposed interpretation procedure on the latent representations of the Conv-4 backbone for a fixed number of iterations on a batch of samples. Figure 4 shows the different masks associated to each row and each training instance. We observe that some rows learn to specialize on particular portions of the input, thus carrying semantically meaningful interpretations.

All these experiments provide initial evidence of the benefits of learning structured representations in an unsupervised way. Indeed, we believe that performance, both in terms of generalization and interpretation, can be improved by introducing other inductive biases. For example, one could replace the global average pooling in the last layer of a backbone with an attentional layer, like slot attention (Locatello et al., 2020), in order to better cluster the input data and increase the level of specialization of the latent representations. Other possibilities consist of using mutual information bounds different from the one considered in this work. We leave all these analyses for future work.


This research was partially supported by TAILOR, a project funded by EU Horizon 2020 research and innovation programme under GA No 952215.


  • Bachman et al. (2019) Bachman, P., Hjelm, R. D., and Buchwalter, W. Learning Representations by Maximizing Mutual Information Across Views. In arXiv, 2019.
  • Barber & Agakov (2004) Barber, D. and Agakov, F. The IM Algorithm: A Variational Approach to Information Maximization. In NeurIPS, 2004.
  • Becker & Hinton (1992) Becker, S. and Hinton, G. Self-Organizing Neural Network that Discovers Surfaces in Random-Dot Stereograms. Nature, 355(6356):161–163, 1992.
  • Belghazi et al. (2018) Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual Information Neural Estimation. In ICML, 2018.
  • Davidson et al. (2018) Davidson, T. R., Falorsi, L., Cao, N. D., Kipf, T., and Tomczak, J. M. Hyperspherical Variational Auto-Encoders. In UAI, 2018.
  • Doersch et al. (2015) Doersch, C., Gupta, A., and Efros, A. A. Unsupervised Visual Representation Learning by Context Prediction. In ICCV, 2015.
  • Dosovitskiy et al. (2014) Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., and Brox, T. Discriminative Unsupervised Feature Learning with Convolutional Neural Networks. In NeurIPS, 2014.
  • Engelcke et al. (2019) Engelcke, M., Kosiorek, A., Parker, J. O., and Posner, I. GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations. In ICLR, 2019.
  • Engelcke et al. (2021) Engelcke, M., Parker, J. O., and Posner, I. GENESIS-V2: Inferring Unordered Object Representations without Iterative Refinement. In arXiv, 2021.
  • Goyal et al. (2021) Goyal, A., Lamb, A., Hoffmann, J., Sodhani, S., Levine, S., Bengio, Y., and Schölkopf, B. Recurrent Independent Mechanisms. In ICLR, 2021.
  • Grill et al. (2020) Grill, J. B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Pires, B. A., Guo, Z. D., Azar, M. G., et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. In NeurIPS, 2020.
  • Gu et al. (2020) Gu, J., Kuen, J., Joty, S., Cai, J., Morariu, V., Zhao, H., and Sun, T. Self-Supervised Relationship Probing. NeurIPS, 2020.
  • Hjelm et al. (2018) Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning Deep Representations by Mutual Information Estimation and Maximization. In ICLR, 2018.
  • Jang et al. (2017) Jang, E., Gu, S., and Poole, B. Categorical Reparameterization with Gumbel-Softmax. In ICLR, 2017.
  • Jenni & Favaro (2018) Jenni, S. and Favaro, P. Self-Supervised Feature Learning by Learning to Spot Artifacts. In CVPR, 2018.
  • Greff et al. (2017) Greff, K., van Steenkiste, S., and Schmidhuber, J. Neural Expectation Maximization. In NeurIPS, 2017.
  • Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. In ICLR, 2014.
  • Kipf & Welling (2016) Kipf, T. and Welling, M. Variational Graph Auto-Encoders. In arXiv, 2016.
  • Kipf et al. (2018) Kipf, T., Fetaya, E., Wang, K. C., Welling, M., and Zemel, R. Neural Relational Inference for Interacting Systems. In ICML, 2018.
  • Komodakis & Gidaris (2018) Komodakis, N. and Gidaris, S. Unsupervised Representation Learning by Predicting Image Rotations. In ICLR, 2018.
  • Krizhevsky et al. (2009) Krizhevsky, A., Hinton, G., et al. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, 2009.
  • Larsson et al. (2016) Larsson, G., Maire, M., and Shakhnarovich, G. Learning Representations for Automatic Colorization. In ECCV, 2016.
  • Linsker (1988) Linsker, R. Self-Organization in a Perceptual Network. Computer, 21(3):105–117, 1988.
  • Locatello et al. (2019) Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., and Bachem, O. Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. In ICML, 2019.
  • Locatello et al. (2020) Locatello, F., Weissenborn, D., Unterthiner, T., Mahendran, A., Heigold, G., Uszkoreit, J., Dosovitskiy, A., and Kipf, T. Object-Centric Learning with Slot Attention. In NeurIPS, 2020.
  • Maddison et al. (2017) Maddison, C. J., Mnih, A., and Teh, Y. W. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In ICLR, 2017.
  • McAllester & Stratos (2020) McAllester, D. and Stratos, K. Formal Limitations on the Measurement of Mutual Information. In AISTATS, 2020.
  • Nguyen et al. (2010) Nguyen, X., Wainwright, M. J., and Jordan, M. I. Estimating Divergence Functionals and the Likelihood Ratio by Convex Risk Minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
  • Noroozi & Favaro (2016) Noroozi, M. and Favaro, P. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. In ECCV, 2016.
  • Noroozi et al. (2017) Noroozi, M., Pirsiavash, H., and Favaro, P. Representation Learning by Learning to Count. In ICCV, 2017.
  • O. Henaff (2020) O. Henaff, O. Data-Efficient Image Recognition with Contrastive Predictive Coding. In ICML, 2020.
  • Patacchiola & Storkey (2020) Patacchiola, M. and Storkey, A. Self-Supervised Relational Reasoning for Representation Learning. NeurIPS, 2020.
  • Pathak et al. (2016) Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., and Efros, A. A. Context Encoders: Feature Learning by Inpainting. In CVPR, 2016.
  • Poole et al. (2019) Poole, B., Ozair, S., Oord, A. V. D., Alemi, A., and Tucker, G. On Variational Bounds of Mutual Information. In ICML, 2019.
  • Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In ICML, 2014.
  • Song & Ermon (2020) Song, J. and Ermon, S. Understanding the Limitations of Variational Mutual Information Estimators. ICLR, 2020.
  • Stanić et al. (2021) Stanić, A., van Steenkiste, S., and Schmidhuber, J. Hierarchical Relational Inference. In AAAI, 2021.
  • Tian et al. (2019) Tian, Y., Krishnan, D., and Isola, P. Contrastive Multiview Coding. In arXiv, 2019.
  • Tschannen et al. (2019) Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S., and Lucic, M. On Mutual Information Maximization for Representation Learning. In ICLR, 2019.
  • van Steenkiste et al. (2018) van Steenkiste, S., Chang, M., Greff, K., and Schmidhuber, J. Relational Neural Expectation Maximization: Unsupervised Discovery of Objects and their Interactions. In ICLR, 2018.
  • Velickovic et al. (2019) Velickovic, P., Fedus, W., Hamilton, W., Liò, P., Bengio, Y., and Hjelm, R. D. Deep Graph Infomax. In ICLR, 2019.
  • Zbontar et al. (2021) Zbontar, J., Jing, L., Misra, I., LeCun, Y., and Deny, S. Barlow Twins: Self-Supervised Learning via Redundancy Reduction. In ICML, 2021.
  • Zhang et al. (2016) Zhang, R., Isola, P., and Efros, A. A. Colorful Image Colorization. In ECCV, 2016.
  • Zhang et al. (2017) Zhang, R., Isola, P., and Efros, A. A. Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction. In CVPR, 2017.

Appendix A Simplification of Generalized Mutual Information
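A reconstruction of the derivation, consistent with the definitions in Section 2: by the conditional independence assumption, the joint density factorizes as $p(x,z,r) = p(z)\,p(x \mid z)\,p(r \mid z)$, so that

```latex
\begin{align*}
I(X;Z;R) &= \mathbb{E}_{p(x,z,r)}\left[\log \frac{p(x,z,r)}{p(x)\,p(z)\,p(r)}\right] \\
&= \mathbb{E}_{p(x,z,r)}\left[\log \frac{p(z)\,p(x \mid z)\,p(r \mid z)}{p(x)\,p(z)\,p(r)}\right] \\
&= \mathbb{E}_{p(x,z)}\left[\log \frac{p(x,z)}{p(x)\,p(z)}\right]
 + \mathbb{E}_{p(z,r)}\left[\log \frac{p(z,r)}{p(z)\,p(r)}\right] \\
&= I(X;Z) + I(Z;R)
\end{align*}
```

which is exactly Eq. 2.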

Appendix B Derivation of Lower Bound

We recall the derivation of the lower bound only for the term $I(X;Z)$, as the other one can be obtained by applying the same methodology.

$$I(X;Z) = \mathbb{E}_{p(x,z)}\left[\log \frac{p(x \mid z)}{p(x)}\right] = \mathbb{E}_{p(x,z)}\left[\log \frac{q(x \mid z)}{p(x)}\right] + \mathbb{E}_{p(z)}\big[KL\big(p(x \mid z)\,\|\,q(x \mid z)\big)\big] \geq \mathbb{E}_{p(x,z)}\left[\log \frac{q(x \mid z)}{p(x)}\right]$$

where $q$ is an auxiliary density. Now, if we consider $q(x \mid z) \propto p(x)\,e^{f_1(x,z)}$, we obtain that:

$$I(X;Z) \geq \mathbb{E}_{p(x,z)}[f_1(x,z)] - \mathbb{E}_{p(z)}\Big[\log \mathbb{E}_{p(x)}\big[e^{f_1(x,z)}\big]\Big] \geq \mathbb{E}_{p(x,z)}[f_1(x,z)] - \log \mathbb{E}_{p(x)p(z)}\big[e^{f_1(x,z)}\big]$$

By exploiting the inequality $\log t \leq t/e$ for all $t > 0$, we obtain that:

$$I(X;Z) \geq \mathbb{E}_{p(x,z)}[f_1(x,z)] - e^{-1}\,\mathbb{E}_{p(x)p(z)}\big[e^{f_1(x,z)}\big]$$

thus concluding the derivation.

Appendix C Details of Network Architectures

We use the graph neural network encoder proposed in (Kipf et al., 2018) and report some details for completeness.

The message passing neural network, i.e. the structure function $g$, is used to predict the relational tensor:

$$h_{ij} = f_e\big([z_i, z_j]\big), \qquad r_{ij} = \mathrm{softmax}\big((f_v(h_{ij}) + \gamma)/\tau\big)$$

where $f_e$ and $f_v$ are MLPs with one hidden layer and $\gamma$ is a vector whose elements are generated i.i.d. from a $\mathrm{Gumbel}(0,1)$ distribution (Maddison et al., 2017; Jang et al., 2017).

In the experiments on CIFAR-10 (Krizhevsky et al., 2009), we use a fixed number of entities and relation types.
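The Gumbel-based relaxation cited above (Maddison et al., 2017; Jang et al., 2017) can be sketched as follows; the logits and temperature are placeholders for illustration, not values from the paper:

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Draw a relaxed one-hot sample over relation types.

    g_k ~ Gumbel(0, 1) i.i.d., sample = softmax((logits + g) / tau).
    """
    u = rng.uniform(1e-12, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))            # Gumbel(0, 1) noise via inverse CDF
    y = (logits + g) / tau
    y -= y.max()                       # numerical stability
    e = np.exp(y)
    return e / e.sum()

rng = np.random.default_rng(4)
logits = np.array([2.0, 0.5, -1.0])    # placeholder edge-type logits

sample = gumbel_softmax(logits, tau=0.5, rng=rng)
assert abs(sample.sum() - 1.0) < 1e-12   # sample lies on the simplex

# As tau -> 0, samples approach one-hot vectors (the Gumbel-max trick),
# while remaining differentiable with respect to the logits for tau > 0.
cold = gumbel_softmax(logits, tau=1e-3, rng=rng)
assert cold.max() > 0.999
```

The differentiability of the relaxed samples with respect to the logits is what allows the relational tensor to be learned end-to-end with the rest of the objective.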

Appendix D Derivation of Upper Bound

$$I(\tilde X; Z) = \mathbb{E}_{p(\tilde x, z)}\left[\log \frac{p(z \mid \tilde x)}{p(z)}\right] = \mathbb{E}_{p(\tilde x, z)}\left[\log \frac{p(z \mid \tilde x)}{q(z)}\right] - KL\big(p(z)\,\|\,q(z)\big) \leq \mathbb{E}_{p(\tilde x, z)}\left[\log \frac{p(z \mid \tilde x)}{q(z)}\right]$$

where the first inequality is obtained by noting that $KL(p(z)\,\|\,q(z)) \geq 0$, and the second one, which yields the constant $c$ in Eq. 4 when evaluating the Gaussian $q$, is based on Jensen's inequality.

Appendix E Hyperparameters in Experiments

  • Number of self-supervised training epochs.

  • Batch size.

  • Number of data augmentations.

  • Number of entities and number of relation types.

  • Number of supervised training epochs for linear evaluation.

  • Adam optimizer with fixed learning rate.

Appendix F Interpretations for the Whole Batch of Samples

See Figure 5.

Figure 5: Interpretation of latent representations for different images (best viewed zoomed in).