1 Introduction
Self-supervised learning has gained popularity in recent years thanks to the remarkable successes achieved in many areas, including natural language processing and computer vision. In this work, we consider extending self-supervised learning to learn structured representations directly from still images.
The contributions consist of (i) a new self-supervised framework, which leverages the hidden structure contained in the data, and (ii) a learning procedure to interpret the meaning of the learnt representations.
2 Proposed Framework
Positive and negative pairs are used to estimate the first two and the last two addends of Eq. 3, respectively. Regarding the network architectures, we used a Conv4 and a ResNet-8 backbone for the encoder and a multilayer perceptron network for the critic functions, using the settings of (Patacchiola & Storkey, 2020). For the structure function, we used a two-layer message passing neural network (MPNN) over a complete graph, following (Kipf et al., 2018) (more details are provided in the Appendix).

2.1 Objective
Let us consider an input tensor $X$, a latent tensor $Z$, obtained by transforming $X$ through an encoding function and a reshaping operator, and a latent structure tensor $R$, obtained by transforming $Z$ through a structure function. For example, $X$ can be an observed image, $Z$ contains the representations of entities in the image and $R$ represents the relations among these entities.$^1$ If we define the random tensors associated to $X$, $Z$ and $R$, we can define the generalized mutual information over these random quantities (equivalently, their total correlation):

$$I(X;Z;R) \;=\; \mathrm{KL}\big(p(x,z,r)\,\big\|\,p(x)\,p(z)\,p(r)\big) \qquad (1)$$

where $\mathrm{KL}$ is the Kullback-Leibler divergence and $p(x,z,r)$, $p(x)$, $p(z)$, $p(r)$ are the joint probability and the marginal densities of the random quantities, respectively.

$^1$Each entry of $R$ stores the relation type, including the case of no relation between two entities. We experimented with a fixed number of relation types.
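As a concrete illustration of Eq. 1, the total correlation can be computed in closed form for small discrete distributions. The sketch below (illustrative names, not part of the method) checks that it vanishes for independent variables and equals $2\log 2$ nats for three perfectly coupled binary variables:

```python
import numpy as np

def total_correlation(p):
    """KL(p(x,z,r) || p(x) p(z) p(r)) in nats, for a 3-D discrete joint p."""
    px = p.sum(axis=(1, 2), keepdims=True)
    pz = p.sum(axis=(0, 2), keepdims=True)
    pr = p.sum(axis=(0, 1), keepdims=True)
    mask = p > 0  # restrict to the support to avoid log(0)
    return float((p[mask] * np.log(p[mask] / (px * pz * pr)[mask])).sum())

# Independent binary variables: total correlation is zero.
p_ind = np.ones((2, 2, 2)) / 8
# Perfectly coupled variables (x = z = r): total correlation is 2 log 2.
p_dep = np.zeros((2, 2, 2))
p_dep[0, 0, 0] = p_dep[1, 1, 1] = 0.5
```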
Now, let us assume that $Z$ and $R$ are conditionally independent given $X$. Note that this is quite reasonable in practice. For example, if we consider an image containing a set of objects and a set of relations (e.g. right/left of, above/below, smaller/bigger), these two sets exist independently from each other and are instantiated through $Z$ and $R$ by the observed scene. Thanks to this assumption, Eq. 1 simplifies into the following form (see the Appendix for the derivation):
$$I(X;Z;R) \;=\; I(X;Z) + I(X;R) \qquad (2)$$
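The decomposition in Eq. 2 can be verified numerically for discrete distributions satisfying the conditional-independence assumption. The sketch below (illustrative, with hypothetical names) builds a random joint of the form $p(x)p(z\mid x)p(r\mid x)$ and compares both sides:

```python
import numpy as np

rng = np.random.default_rng(0)

def mi(pxy):
    """Mutual information (nats) of a 2-D discrete joint."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    m = pxy > 0
    return float((pxy[m] * np.log(pxy[m] / (px * py)[m])).sum())

def tc(p):
    """Total correlation (Eq. 1) of a 3-D discrete joint."""
    px = p.sum(axis=(1, 2), keepdims=True)
    pz = p.sum(axis=(0, 2), keepdims=True)
    pr = p.sum(axis=(0, 1), keepdims=True)
    m = p > 0
    return float((p[m] * np.log(p[m] / (px * pz * pr)[m])).sum())

# Random joint with Z and R conditionally independent given X:
# p(x, z, r) = p(x) p(z|x) p(r|x).
px = rng.dirichlet(np.ones(3))
pz_x = rng.dirichlet(np.ones(4), size=3)   # row i is p(z | x=i)
pr_x = rng.dirichlet(np.ones(5), size=3)   # row i is p(r | x=i)
p = px[:, None, None] * pz_x[:, :, None] * pr_x[:, None, :]

lhs = tc(p)                                   # I(X;Z;R)
rhs = mi(p.sum(axis=2)) + mi(p.sum(axis=1))   # I(X;Z) + I(X;R)
```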
Now, we are ready to provide an estimator for the quantity in Eq. 2, by using existing variational bounds (Barber & Agakov, 2004). Specifically, we focus on the bound proposed in (Nguyen et al., 2010) for its tractability and unbiasedness, thus obtaining the following criterion (see the Appendix for the derivation):
$$\mathcal{L} \;=\; \mathbb{E}_{p(x,z)}\!\left[f(x,z)\right] + \mathbb{E}_{p(x,r)}\!\left[g(x,r)\right] - e^{-1}\,\mathbb{E}_{p(x)p(z)}\!\left[e^{f(x,z)}\right] - e^{-1}\,\mathbb{E}_{p(x)p(r)}\!\left[e^{g(x,r)}\right] \qquad (3)$$
where $f$ and $g$ are two critic functions, whose concatenated parameters we denote by $\phi$ and $\psi$, respectively. For simplicity, we visualize and describe all these terms, including the two critic functions, in Figure 1.
Therefore, the learning problem consists of maximizing the lower bound in Eq. 3 with respect to the parameters $\phi$ and $\psi$. During training, the encoder learns to structure the representation, guided by the two critic functions, so as to improve not only the mutual information between $X$ and $Z$ but also the mutual information between $X$ and $R$. In effect, this is an inductive bias towards learning more structured representations.
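A minimal sketch of how an NWJ-style lower bound such as the one in Eq. 3 is estimated from positive and negative pairs, here with a fixed bilinear critic and synthetic data (all names, shapes and the critic itself are illustrative assumptions, not our actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4096, 8
x = rng.standard_normal((n, d))
z = x + 0.1 * rng.standard_normal((n, d))   # strongly dependent toy pair

def critic(a, b):
    # Simple fixed bilinear critic; a learned network would replace this.
    return (a * b).sum(axis=1) / d

def nwj_bound(scores_pos, scores_neg):
    # E_p(x,z)[f] - e^{-1} E_p(x)p(z)[e^f]  (Nguyen et al., 2010)
    return float(scores_pos.mean() - np.exp(scores_neg - 1.0).mean())

pos = critic(x, z)                       # positive (aligned) pairs
neg = critic(x, z[rng.permutation(n)])   # negative (shuffled) pairs
```

The estimate is positive for dependent pairs and non-positive when positive and negative scores come from the same distribution, since $u \le e^{u-1}$ pointwise.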
2.2 Interpreting the Learnt Representations
From Figure 1 we can see that the encoder provides a global representation of the data, which is subsequently split/reshaped to provide the latent tensor $Z$. It is interesting to understand the role of the rows of $Z$ once the network is trained, to test (i) whether the network learns to bind each row of $Z$ to a particular portion of the input and (ii) whether such portion corresponds to a particular entity of the scene.
To do so, we need to identify which entries of the input tensor (i.e. which pixels) mostly affect the different rows of $Z$. Specifically, we introduce a soft-binary mask $M = \sigma(W)$ for each input sample and each row of $Z$, where $\sigma$ is an elementwise standard logistic function and $W$ is the parameter tensor of the mask associated to a specific image and a specific row representation, and then multiply the mask with the associated input. The final problem consists of minimizing the mutual information between the masked input and the corresponding latent row with respect to the parameters $W$ of the mask.
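A minimal sketch of the mask construction and its application to an input (shapes and names are illustrative assumptions):

```python
import numpy as np

def sigmoid(w):
    # Elementwise standard logistic function.
    return 1.0 / (1.0 + np.exp(-w))

H, W_, C = 32, 32, 3                 # input image shape (height, width, channels)
w = np.zeros((H, W_))                # mask parameters for one (image, row) pair
mask = sigmoid(w)                    # soft-binary mask: entries in (0, 1)

x = np.ones((H, W_, C))              # placeholder input image
x_masked = x * mask[:, :, None]      # masked input fed to the encoder
```

In the actual procedure, `w` would be optimized by gradient descent on the mutual-information upper bound of Eq. 4; here it is simply initialized to zero, so every mask entry equals 0.5.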
Similarly to the previous case, we need a tractable estimator for the mutual information objective. Differently from the previous case, we devise an upper bound, to be consistent with the minimization problem (see the Appendix for the whole derivation), namely:
$$I(X \odot M;\, Z) \;\le\; -\,\mathbb{E}\big[\log q(z)\big] + c \qquad (4)$$
where $q$ is an auxiliary distribution defined over the latent tensor representation, the conditional density of the latent given the masked input is a multivariate Gaussian with mean given by the encoder output and identity covariance matrix, and $c$ is a constant with respect to the mask parameters. The bound in Eq. 4 becomes tight when $q$ matches the true marginal. It is also interesting to mention that, in theory, the bound in Eq. 4 admits a trivial solution with all masks set to zero. However, in practice, we did not observe convergence to such trivial solutions, probably due to the high nonlinearity of the minimization problem.
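The behaviour of a variational upper bound of this kind can be illustrated in a scalar Gaussian case, where both the true mutual information and the bound $\mathbb{E}_x[\mathrm{KL}(p(z\mid x)\,\|\,q(z))]$ are available in closed form (an illustrative sanity check, not part of the method; all names are hypothetical):

```python
import numpy as np

def upper_bound(sigma2_x, s2):
    # E_x[ KL( N(x, 1) || N(0, s2) ) ] for X ~ N(0, sigma2_x), Z = X + N(0, 1).
    return 0.5 * (np.log(s2) + (1.0 + sigma2_x) / s2 - 1.0)

sigma2_x = 4.0
true_mi = 0.5 * np.log(1.0 + sigma2_x)          # closed-form I(X; Z) in nats
tight = upper_bound(sigma2_x, 1.0 + sigma2_x)   # q equals the true marginal
loose = upper_bound(sigma2_x, 10.0)             # mismatched auxiliary q
```

The bound equals the true mutual information when `q` is the true marginal of `Z`, and strictly exceeds it otherwise, mirroring the tightness condition stated above.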
Figure 2 shows the building block of the minimization procedure. Once the procedure concludes, we can inspect the latent representations of the encoder by plotting the masks for each image and each latent row, as discussed in the experimental section.
3 Related Work
We organize the existing literature in terms of objectives, network architectures and data augmentations/pretext tasks.
Objectives. Mutual information is one of the most common criteria used in self-supervised learning (Linsker, 1988; Becker & Hinton, 1992). Estimating and optimizing mutual information from samples is notoriously difficult (McAllester & Stratos, 2020), especially when dealing with high-dimensional data. Most recent approaches focus on devising variational lower bounds on mutual information (Barber & Agakov, 2004), thus casting the problem of representation learning as a maximization of these lower bounds. Indeed, several popular objectives, like mutual information neural estimation (MINE) (Belghazi et al., 2018), Deep InfoMax (Hjelm et al., 2018) and noise contrastive estimation (InfoNCE) (O. Henaff, 2020), to name a few, all belong to the family of variational bounds (Poole et al., 2019). These estimators have different properties in terms of bias-variance trade-off (Tschannen et al., 2019; Song & Ermon, 2020). Our work generalizes over previous objectives based on mutual information by considering structure as an additional latent variable. Indeed, we focus on learning a structured representation, which aims at better supporting downstream reasoning tasks and improving interpretability. Therefore, our work is orthogonal to these previous studies, as different estimators can potentially be employed for this extended definition of mutual information.

It is important to mention that new self-supervised objectives have emerged recently (Zbontar et al., 2021; Grill et al., 2020) as a way to avoid using negative samples, which are typically required by variational bound estimators. However, these objectives learn distributed representations and disregard the notion of hidden structure, which is the focus of this work.
Architectures. Graph neural networks are among the most common architectures used in self-supervised learning on relational data (Kipf & Welling, 2016; Kipf et al., 2018; Davidson et al., 2018; Velickovic et al., 2019). While these works provide representations of the underlying graph data, thus supporting downstream tasks like graph/node classification and link prediction/graph completion, they typically assume that nodes/entities/symbols are known a priori. This work relaxes this assumption and devises a criterion able to integrate traditional representation learning with graph representation learning in a principled way.
Autoencoding architectures are also widespread in representation learning. However, traditional VAE-based architectures (Kingma & Welling, 2014; Rezende et al., 2014) are typically not enough to learn disentangled representations (Locatello et al., 2019). Recent work on autoencoders focuses on introducing structured priors by learning both object representations (K.Greff et al., 2017) and their relations (van Steenkiste et al., 2018; Goyal et al., 2021; Stanić et al., 2021) directly from raw perceptual data in dynamic environments. Autoencoders have also been applied to synthetic settings to perform unsupervised scene decomposition from still images (Engelcke et al., 2019, 2021; Locatello et al., 2020), thus providing representations that better disentangle objects. In common with these works, we aim at learning object representations together with their relational information. Differently from these works, we optimize an objective criterion which does not require a reconstruction term at the pixel level (thus avoiding learning noisy regularities from the raw data) and avoids using a decoder, thus increasing computational efficiency.
Data augmentations/Pretext tasks. Data augmentation strategies are typically used to produce transformed versions of the data, which are then used to define positive and negative instances (Hjelm et al., 2018; O. Henaff, 2020; Patacchiola & Storkey, 2020). Different views (Bachman et al., 2019; Tian et al., 2019) or different data modalities (Gu et al., 2020) can also be exploited to augment the learning problem. Furthermore, there is a large amount of related work devoted to the design of pretext tasks, namely predicting augmentations at the patch level (Dosovitskiy et al., 2014), predicting the relative location of patches (Doersch et al., 2015), generating missing patches (Pathak et al., 2016), solving jigsaw puzzles (Noroozi & Favaro, 2016), learning to count (Noroozi et al., 2017), learning to spot artifacts (Jenni & Favaro, 2018), predicting image rotations (Komodakis & Gidaris, 2018), predicting information among different channels (Zhang et al., 2017) and predicting color information from gray-scale images (Zhang et al., 2016; Larsson et al., 2016). All of these strategies are complementary to our work. However, we believe that our generalized objective can enable the development of new pretext tasks that also leverage relational information.
4 Experiments
Figure 3: Training accuracy obtained by a downstream linear classifier on CIFAR-10 during the unsupervised training of the backbones (Conv4 on the left and ResNet-8 on the right). We trained only the linear classifier for one epoch at fixed intervals of unsupervised iterations. The blue, green and red curves correspond to the performance obtained by the method in (Patacchiola & Storkey, 2020), our method with only relational information, and our complete method, respectively.

Table 1: Linear-evaluation accuracy (mean ± std over three trials) on CIFAR-10 (generalization) and CIFAR-100 (transfer).

Method | CIFAR-10 Conv4 | CIFAR-10 ResNet-8 | CIFAR-100 Conv4 | CIFAR-100 ResNet-8
Supervised | 78.7±0.1 | 86.0±0.4 | 31.3±0.2 | 36.8±0.3
Random | 33.2±1.9 | 36.8±0.2 | 11.0±0.4 | 13.6±0.3
Baseline* | 54.2±0.1 | 56.9±0.4 | 27.3±0.3 | 28.2±0.5
Ours (structure only) | 55.9±0.8 | 60.0±0.1 | 26.2±0.8 | 28.4±0.1
Ours (complete) | 56.3±0.8 | 59.5±0.2 | 27.8±0.4 | 29.6±0.5
* (Patacchiola & Storkey, 2020)
The experiments are organized in a way to answer the following three research questions:

Q1 (Surrogate): Is the proposed objective a good unsupervised surrogate for traditional supervised learning?

Q2 (Generalization): How well do the learnt representations generalize to downstream supervised tasks? And how well do they transfer?

Q3 (Interpretation): Can we interpret what the network has learnt to represent?
Given the limited amount of available GPU resources,$^2$ we run experiments on small backbones, namely Conv4 and ResNet-8, and train them in an unsupervised way on CIFAR-10 (Krizhevsky et al., 2009) based on the code and the settings of (Patacchiola & Storkey, 2020) (additional details are available in the Appendix). All experiments are averaged over three repeated trials. As a baseline, we consider the recent self-supervised framework in (Patacchiola & Storkey, 2020), which does not leverage structure and thus maximizes $I(X;Z)$. For our framework, we consider two cases, namely (i) the one using only structure, i.e. maximizing $I(X;R)$, and (ii) the overall one, i.e. maximizing $I(X;Z)+I(X;R)$.

$^2$All experiments are run on 4 NVIDIA GeForce GTX 1080 Ti GPUs.
Surrogate. We evaluate the downstream training performance of a linear classifier during the unsupervised training of the backbone. Specifically, we train a classifier on the training set of CIFAR-10 for one epoch at fixed intervals of unsupervised iterations and report its training accuracy. Figure 3 shows the evolution of the training accuracies for the two backbones. From the figures, we see that the baseline performs well during the early stages of training, but it is significantly outperformed by our purely structure-based strategy in the subsequent stages. Furthermore, the overall strategy is able to combine the positive aspects of the other strategies, thus providing a more principled approach.
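For reference, the linear-evaluation protocol can be sketched as follows, with synthetic features standing in for the frozen backbone and a closed-form ridge classifier standing in for the SGD-trained linear probe (all names and values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 600, 16, 10                            # samples, feature dim, classes

# Synthetic "frozen backbone" features: well-separated class clusters.
means = 3.0 * rng.standard_normal((k, d))
labels = rng.integers(0, k, size=n)
feats = means[labels] + 0.1 * rng.standard_normal((n, d))

# Linear probe: ridge regression to one-hot targets, argmax as prediction.
onehot = np.eye(k)[labels]
A = feats.T @ feats + 1e-2 * np.eye(d)
W = np.linalg.solve(A, feats.T @ onehot)
acc = float((np.argmax(feats @ W, axis=1) == labels).mean())
```

The backbone parameters never appear in the probe's optimization; only the linear map `W` is fitted, which is what makes the probe's accuracy a measure of the representation rather than of additional training.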
Generalization. We evaluate generalization (CIFAR10) and transfer (CIFAR100) performance using a linear classifier. In this set of experiments, we also include two additional baselines, namely (i) a random one, i.e. an encoder initialized with random weights, and (ii) a supervised one, i.e. an encoder trained in a supervised fashion. Table 1 summarizes the performance of all strategies. From these results, we see that our proposed framework allows for significant improvement both in terms of generalization and transfer over the baselines.
Interpretation. Last but not least, we run the proposed interpretation procedure on the latent representations of the Conv4 backbone for a fixed number of iterations on a batch of samples. Figure 4 shows the different masks associated to each latent row and each training instance. We observe that some rows learn to specialize on particular portions of the input, thus carrying semantically meaningful interpretations.
All these experiments provide initial evidence of the benefits of learning structured representations in an unsupervised way. Indeed, we believe that performance, both in terms of generalization and interpretation, can be improved by introducing other inductive biases. For example, one could replace the global average pooling in the last layer of a backbone with an attentional layer, like slot attention (Locatello et al., 2020), in order to better cluster the input data and increase the level of specialization of the latent representations. Other possibilities consist of using mutual information bounds different from the one considered in this work. We leave all these analyses to future work.
Acknowledgements
This research was partially supported by TAILOR, a project funded by EU Horizon 2020 research and innovation programme under GA No 952215.
References
 Bachman et al. (2019) Bachman, P., Hjelm, R. D., and Buchwalter, W. Learning Representations by Maximizing Mutual Information Across Views. In arXiv, 2019.
 Barber & Agakov (2004) Barber, D. and Agakov, F. The IM Algorithm: A Variational Approach to Information Maximization. In NeurIPS, 2004.
 Becker & Hinton (1992) Becker, S. and Hinton, G. Self-Organizing Neural Network that Discovers Surfaces in Random-Dot Stereograms. Nature, 355(6356):161–163, 1992.
 Belghazi et al. (2018) Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual Information Neural Estimation. In ICML, 2018.
 Davidson et al. (2018) Davidson, T. R., Falorsi, L., Cao, N. D., Kipf, T., and Tomczak, J. M. Hyperspherical Variational AutoEncoders. In UAI, 2018.
 Doersch et al. (2015) Doersch, C., Gupta, A., and Efros, A. A. Unsupervised Visual Representation Learning by Context Prediction. In ICCV, 2015.

 Dosovitskiy et al. (2014) Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., and Brox, T. Discriminative Unsupervised Feature Learning with Convolutional Neural Networks. In NeurIPS, 2014.
 Engelcke et al. (2019) Engelcke, M., Kosiorek, A., Parker, J. O., and Posner, I. GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations. In ICLR, 2019.
 Engelcke et al. (2021) Engelcke, M., Parker, J. O., and Posner, I. GENESIS-V2: Inferring Unordered Object Representations without Iterative Refinement. In arXiv, 2021.
 Goyal et al. (2021) Goyal, A., Lamb, A., Hoffmann, J., Sodhani, S., Levine, S., Bengio, Y., and Schölkopf, B. Recurrent Independent Mechanisms. In ICLR, 2021.
 Grill et al. (2020) Grill, J. B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Pires, B. A., Guo, Z. D., Azar, M. G., et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. In NeurIPS, 2020.
 Gu et al. (2020) Gu, J., Kuen, J., Joty, S., Cai, J., Morariu, V., Zhao, H., and Sun, T. Self-Supervised Relationship Probing. In NeurIPS, 2020.
 Hjelm et al. (2018) Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning Deep Representations by Mutual Information Estimation and Maximization. In ICLR, 2018.
 Jang et al. (2017) Jang, E., Gu, S., and Poole, B. Categorical Reparameterization with Gumbel-Softmax. In ICLR, 2017.
 Jenni & Favaro (2018) Jenni, S. and Favaro, P. Self-Supervised Feature Learning by Learning to Spot Artifacts. In CVPR, 2018.

 K.Greff et al. (2017) Greff, K., van Steenkiste, S., and Schmidhuber, J. Neural Expectation Maximization. In NeurIPS, 2017.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. In ICLR, 2014.
 Kipf & Welling (2016) Kipf, T. and Welling, M. Variational Graph Auto-Encoders. In arXiv, 2016.
 Kipf et al. (2018) Kipf, T., Fetaya, E., Wang, K. C., Welling, M., and Zemel, R. Neural Relational Inference for Interacting Systems. In ICML, 2018.
 Komodakis & Gidaris (2018) Komodakis, N. and Gidaris, S. Unsupervised Representation Learning by Predicting Image Rotations. In ICLR, 2018.
 Krizhevsky et al. (2009) Krizhevsky, A., Hinton, G., et al. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, 2009.

 Larsson et al. (2016) Larsson, G., Maire, M., and Shakhnarovich, G. Learning Representations for Automatic Colorization. In ECCV, 2016.
 Linsker (1988) Linsker, R. Self-Organization in a Perceptual Network. Computer, 21(3):105–117, 1988.

 Locatello et al. (2019) Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., and Bachem, O. Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. In ICML, 2019.
 Locatello et al. (2020) Locatello, F., Weissenborn, D., Unterthiner, T., Mahendran, A., Heigold, G., Uszkoreit, J., Dosovitskiy, A., and Kipf, T. Object-Centric Learning with Slot Attention. In NeurIPS, 2020.

 Maddison et al. (2017) Maddison, C. J., Mnih, A., and Teh, Y. W. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In ICLR, 2017.
 McAllester & Stratos (2020) McAllester, D. and Stratos, K. Formal Limitations on the Measurement of Mutual Information. In AISTATS, 2020.
 Nguyen et al. (2010) Nguyen, X., Wainwright, M. J., and Jordan, M. I. Estimating Divergence Functionals and the Likelihood Ratio by Convex Risk Minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
 Noroozi & Favaro (2016) Noroozi, M. and Favaro, P. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. In ECCV, 2016.
 Noroozi et al. (2017) Noroozi, M., Pirsiavash, H., and Favaro, P. Representation Learning by Learning to Count. In ICCV, 2017.
 O. Henaff (2020) Henaff, O. Data-Efficient Image Recognition with Contrastive Predictive Coding. In ICML, 2020.
 Patacchiola & Storkey (2020) Patacchiola, M. and Storkey, A. Self-Supervised Relational Reasoning for Representation Learning. In NeurIPS, 2020.
 Pathak et al. (2016) Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., and Efros, A. A. Context Encoders: Feature Learning by Inpainting. In CVPR, 2016.
 Poole et al. (2019) Poole, B., Ozair, S., Oord, A. V. D., Alemi, A., and Tucker, G. On Variational Bounds of Mutual Information. In ICML, 2019.

 Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In ICML, 2014.
 Song & Ermon (2020) Song, J. and Ermon, S. Understanding the Limitations of Variational Mutual Information Estimators. In ICLR, 2020.
 Stanić et al. (2021) Stanić, A., van Steenkiste, S., and Schmidhuber, J. Hierarchical Relational Inference. In AAAI, 2021.
 Tian et al. (2019) Tian, Y., Krishnan, D., and Isola, P. Contrastive Multiview Coding. In arXiv, 2019.
 Tschannen et al. (2019) Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S., and Lucic, M. On Mutual Information Maximization for Representation Learning. In ICLR, 2019.
 van Steenkiste et al. (2018) van Steenkiste, S., Chang, M., Greff, K., and Schmidhuber, J. Relational Neural Expectation Maximization: Unsupervised Discovery of Objects and their Interactions. In ICLR, 2018.
 Velickovic et al. (2019) Velickovic, P., Fedus, W., Hamilton, W., Liò, P., Bengio, Y., and Hjelm, R. D. Deep Graph Infomax. In ICLR, 2019.
 Zbontar et al. (2021) Zbontar, J., Jing, L., Misra, I., LeCun, Y., and Deny, S. Barlow Twins: Self-Supervised Learning via Redundancy Reduction. In ICML, 2021.
 Zhang et al. (2016) Zhang, R., Isola, P., and Efros, A. A. Colorful Image Colorization. In ECCV, 2016.
 Zhang et al. (2017) Zhang, R., Isola, P., and Efros, A. A. Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction. In CVPR, 2017.
Appendix A Simplification of Generalized Mutual Information
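A sketch of the derivation, under the conditional-independence assumption $p(x,z,r) = p(x)\,p(z\mid x)\,p(r\mid x)$ introduced in Section 2.1:

```latex
\begin{aligned}
I(X;Z;R)
&= \mathbb{E}_{p(x,z,r)}\!\left[\log \frac{p(x,z,r)}{p(x)\,p(z)\,p(r)}\right] \\
&= \mathbb{E}_{p(x,z,r)}\!\left[\log \frac{p(x)\,p(z\mid x)\,p(r\mid x)}{p(x)\,p(z)\,p(r)}\right] \\
&= \mathbb{E}_{p(x,z)}\!\left[\log \frac{p(z\mid x)}{p(z)}\right]
 + \mathbb{E}_{p(x,r)}\!\left[\log \frac{p(r\mid x)}{p(r)}\right] \\
&= I(X;Z) + I(X;R)
\end{aligned}
```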
Appendix B Derivation of Lower Bound
We recall the derivation of the lower bound only for the term $I(X;Z)$, as the other one can be obtained by applying the same methodology.
$$I(X;Z) \;=\; \mathbb{E}_{p(x,z)}\!\left[\log \frac{p(x\mid z)}{p(x)}\right] \;\ge\; \mathbb{E}_{p(x,z)}\!\left[\log \frac{q(x\mid z)}{p(x)}\right] \qquad (5)$$

where $q(x\mid z)$ is an auxiliary density. Now, if we consider $q(x\mid z) = \frac{p(x)\,e^{f(x,z)}}{\mathbb{E}_{p(x)}\left[e^{f(x,z)}\right]}$, we obtain that:

$$I(X;Z) \;\ge\; \mathbb{E}_{p(x,z)}\!\left[f(x,z)\right] - \mathbb{E}_{p(z)}\!\left[\log \mathbb{E}_{p(x)}\!\left[e^{f(x,z)}\right]\right] \qquad (6)$$

By exploiting the inequality $\log u \le u/e$ for all $u > 0$, we obtain that:

$$I(X;Z) \;\ge\; \mathbb{E}_{p(x,z)}\!\left[f(x,z)\right] - e^{-1}\,\mathbb{E}_{p(x)p(z)}\!\left[e^{f(x,z)}\right]$$

thus concluding the derivation.
Appendix C Details of Network Architectures
We use the graph neural network encoder proposed in (Kipf et al., 2018) and report some details for completeness.
The message passing neural network is used to predict the relational tensor from the entity representations. Following (Kipf et al., 2018), a round of node and edge updates over the complete graph produces per-edge logits, from which relation types are sampled with the Gumbel-softmax relaxation:

$$R_{ij} \;=\; \mathrm{softmax}\big((\mathrm{logits}_{ij} + g)/\tau\big)$$

where the node and edge update functions are MLPs with one hidden layer and $g$ is a vector whose elements are generated i.i.d. from a Gumbel(0, 1) distribution (Maddison et al., 2017; Jang et al., 2017).
In the experiments on CIFAR-10 (Krizhevsky et al., 2009), the corresponding hyperparameter values are listed in Appendix E.
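A minimal sketch of the Gumbel-softmax sampling step (shapes, temperature and names are illustrative assumptions; in practice the logits come from the MPNN rather than being zero):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Continuous relaxation of a categorical sample over relation types."""
    u = rng.uniform(1e-12, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))                      # i.i.d. Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))  # stable softmax
    return y / y.sum(axis=-1, keepdims=True)

logits = np.zeros((6, 6, 2))          # one logit vector per ordered entity pair
r = gumbel_softmax(logits, tau=0.5)   # soft one-hot relation samples
```

Lower temperatures `tau` push the samples closer to discrete one-hot vectors while keeping the sampling step differentiable with respect to the logits.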
Appendix D Derivation of Upper Bound
where the first inequality is obtained by noting that the Kullback-Leibler divergence between the true marginal of the latent and the auxiliary density $q$ is non-negative, and the second one is based on Jensen's inequality.
Appendix E Hyperparameters in Experiments

- Number of self-supervised training epochs
- Batch size
- Number of data augmentations
- Number of relation types
- Number of supervised training epochs for linear evaluation
- Adam optimizer with learning rate
Appendix F Interpretations for the Whole Batch of Samples
See Figure 5.