1 Introduction
A common challenge across machine learning is to extract useful, structured representations from unlabelled data (such as images). There are at present two broad approaches to this problem: unsupervised learning and self-supervised learning. The delineation between unsupervised and self-supervised learning is sometimes regarded as unclear (Hernandez-Garcia, 2020), but we nonetheless tentatively outline the broad characteristics of these approaches below.

Unsupervised learning can be traced back at least to the Boltzmann machine
(Ackley et al., 1985) and the Helmholtz machine (Dayan et al., 1995; Hinton et al., 1995). This work emphasises two key characteristics of most unsupervised learning models: first, they should model the probability density of the data, and second, they should use latent variables that are ideally interpretable. Modern work in unsupervised learning typically uses variational autoencoders (VAEs)
(Kingma & Welling, 2013; Rezende et al., 2014). VAEs (like the Helmholtz machine) learn a probabilistic encoder which maps from the data to a latent representation, and a decoder which maps from the latent representation back to the data domain. This highlights perhaps the key issue with VAEs: the need to reconstruct the data, which may be highly complex (e.g. images) (Dorta et al., 2018), and which may force the latent space to encode details of the image that are irrelevant for forming a good representation of e.g. object identity (Chen et al., 2016b).

Self-supervised learning is an alternative class of methods that learn good representations without needing to reconstruct the data. One common approach to self-supervised learning is to define a "pretext" classification task (Dosovitskiy et al., 2015; Noroozi & Favaro, 2016; Doersch et al., 2015; Gidaris et al., 2018). For instance, we might take a number of photos, rotate them, and then ask the model to determine the rotation applied (Gidaris et al., 2018). The rotation can be identified by looking at the objects in the image (e.g. grass is typically on the bottom, while aircraft are typically in the sky), and thus a representation useful for determining the orientation may also extract useful information for other high-level tasks. We are interested in an alternative class of objectives known as InfoNCE (NCE standing for noise contrastive estimation)
(Oord et al., 2018). These methods take two inputs (e.g. two different patches from the same underlying image), encode them to form two latent representations, and maximize the mutual information between them. As the shared information should concern high-level properties such as objects, but not low-level details of each patch, this should again extract a useful representation.

Here, we develop a new family of self-supervised variational autoencoders (SSVAEs). We show that the SSVAE objective (the ELBO) is exactly equal to the mutual information under one particular choice of prior, and exactly equal to the parametric mutual information estimator used in InfoNCE under an alternative prior (up to constants). This provides a principled justification for the use of simplified parametric estimators of mutual information in InfoNCE methods, and resolves conceptual issues around the use of the mutual information in self-supervised learning. In particular, recent work has shown that mutual information maximization alone is not sufficient to explain the good representations learned by InfoNCE (Tschannen et al., 2019): as the mutual information is invariant under arbitrary invertible transformations, maximizing mutual information could give highly entangled representations (Tschannen et al., 2019). In practice, InfoNCE is likely to learn good representations due to the choice of a highly simplified parametric model for estimating the mutual information (Oord et al., 2018), which has recently been shown to be closely related to kernel methods (Li et al., 2021). Our approach gives insight into how these simplified estimators can be understood as priors on the latent space.

2 Background
2.1 Variational Autoencoders
Usually in a variational autoencoder (Kingma & Welling, 2013; Rezende et al., 2014), we have observed data, x, and latents, z, and we specify a prior, P(z), a likelihood, P(x|z), and an approximate posterior, Q(z|x). We then jointly optimize parameters of the prior, likelihood, and approximate posterior using the ELBO,

  ELBO(x) = E_{Q(z|x)}[log P(x|z) + log P(z) - log Q(z|x)]    (1)

which bounds the model evidence or marginal likelihood, log P(x) (as can be shown using Jensen's inequality). The approximate posterior, Q(z|x), is often known as the encoder as it maps from data to latents, while the likelihood, P(x|z), is often known as the decoder, as it maps from latents back to the data domain.
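As a concrete illustration (not part of the formal development), the ELBO in Eq. (1) can be checked numerically. The following minimal sketch, entirely our own illustrative example, estimates the one-datapoint ELBO by Monte Carlo for a one-dimensional model with standard Gaussian prior, Gaussian encoder, and unit-variance Gaussian likelihood:

```python
import math, random

random.seed(0)

def elbo_gaussian(x, mu_q, sigma_q, decode, n_samples=1000):
    """Monte Carlo estimate of the single-datapoint ELBO (Eq. 1) for a model
    with prior P(z) = N(0, 1), Gaussian encoder Q(z|x) = N(mu_q, sigma_q^2),
    and unit-variance Gaussian likelihood P(x|z) = N(decode(z), 1).

    The reconstruction term is estimated by sampling z ~ Q(z|x); the
    KL(Q(z|x) || P(z)) term is available in closed form for Gaussians."""
    rec = 0.0
    for _ in range(n_samples):
        z = random.gauss(mu_q, sigma_q)
        rec += -0.5 * math.log(2 * math.pi) - 0.5 * (x - decode(z)) ** 2
    rec /= n_samples
    kl = 0.5 * (mu_q ** 2 + sigma_q ** 2 - 1.0 - 2.0 * math.log(sigma_q))
    return rec - kl
```

In the linear-Gaussian case decode(z) = z, the evidence is available in closed form (x is marginally N(0, 2)), so one can verify that the ELBO matches log P(x) when Q(z|x) is the true posterior N(x/2, 1/2), and is strictly lower otherwise.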
2.2 InfoNCE
In InfoNCE (Oord et al., 2018), there are two data items, x and c. Oord et al. (2018) initially describe a time-series setting where c was a context giving the recent history of past data and x was the data for the next time step. But Oord et al. (2018) also consider other settings, such as where x and c are different augmentations of the same underlying image. The InfoNCE objective was originally motivated as maximizing the mutual information between latent representations,

  I(z; z') = E_{Q(z,z')}[log (Q(z, z') / (Q(z) Q(z')))]    (2)

Note, we are using x and x' rather than x and c for consistency with the VAE derivations in the methods, and Q denotes the distribution induced by taking true data, (x, x') ~ P_data(x, x'), and encoding them with neural networks, Q(z|x) and Q(z'|x'). As the mutual information is difficult to estimate, they use a bound based on a classifier that distinguishes the positive sample (i.e. the z' paired with the corresponding z) from negative samples (i.e. z' drawn from the marginal distribution and unrelated to z or to the underlying data; see Poole et al., 2019 for further details),

  I(z; z') >= E[log (e^{f(z_1, z'_1)} / ((1/N) sum_{j=1}^N e^{f(z_j, z'_1)}))]    (3)
The original InfoNCE paper uses,

  f(z, z') = z^T W z'    (4)

where W is a weight matrix trained jointly with the parameters of Q(z|x) and Q(z'|x') using the above objective. Taking the limit as N goes to infinity, the bound becomes tight (Oord et al., 2018), and can be written as (Wang & Isola, 2020; Li et al., 2021),

  I(z; z') >= E_{Q(z,z')}[f(z, z')] - E_{Q(z')}[log E_{Q(z)}[e^{f(z,z')}]]    (5)
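The finite-N bound in Eq. (3) with the bilinear coupling of Eq. (4) amounts to an N-way softmax classification of the positive pair against negatives. A minimal, dependency-free sketch (the data and the specific W used below are our own illustrative choices):

```python
import math, random

def info_nce_bound(zs, zps, W):
    """Finite-N InfoNCE lower bound on the mutual information (Eq. 3),
    with bilinear coupling f(z, z') = z^T W z' (Eq. 4). `zs` and `zps` are
    paired lists of latent vectors: (zs[i], zps[i]) is the positive pair,
    and the remaining zs[j] act as negatives for zps[i]."""
    n = len(zs)

    def f(z, zp):
        return sum(z[a] * W[a][b] * zp[b]
                   for a in range(len(z)) for b in range(len(zp)))

    total = 0.0
    for i in range(n):
        logits = [f(zs[j], zps[i]) for j in range(n)]
        m = max(logits)  # stabilise the log-sum-exp
        log_mean_exp = m + math.log(sum(math.exp(l - m) for l in logits) / n)
        total += logits[i] - log_mean_exp
    return total / n
```

Each term in the average is at most log N, so the bound saturates at log N however strong the dependence; this is one reason large negative-sample counts are used in practice.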
3 Methods
We begin by looking at an unstructured SSVAE with a single latent and a single observed variable. This gives useful intuition but does not recover InfoNCE. We then go on to look at a structured SSVAE with two latent and two observed variables, which recovers InfoNCE.
3.1 Unstructured SSVAEs
In a standard variational autoencoder, we specify parametric forms (e.g. using neural networks) for the prior, P(z), the likelihood, P(x|z), and the approximate posterior, Q(z|x). However, in an SSVAE, we specify only the prior, P(z), and the approximate posterior, Q(z|x). The likelihood, P(x|z), is given implicitly. In a simple model with one latent variable, z, and one observation, x, the likelihood is given by Bayes theorem,

  P(x|z) = Q(z|x) P_data(x) / Q(z),  or equivalently  log P(x|z) = log Q(z|x) + log P_data(x) - log Q(z)    (6)

where P_data(x) is the true distribution over data, which is fixed, independent of parameters, and in general different from the model's distribution, P(x). The joint distribution here is specified by combining P_data(x) with the parametric encoder, Q(z|x). The marginal, Q(z), can thus be written in terms of those distributions,

  Q(z) = integral dx P_data(x) Q(z|x)    (7)
Substituting the likelihood (Eq. 6) into the ELBO (Eq. 1), we get,

  ELBO(x) = E_{Q(z|x)}[log P(x|z) + log P(z) - log Q(z|x)] = log P_data(x) + E_{Q(z|x)}[log P(z) - log Q(z)]    (8)

where P(z) is our parametric form for the prior and Q(z) is given implicitly by Eq. 7. Remember that log P_data(x) is constant with respect to the parameters, as P_data(x) is the true, fixed data distribution. This term can thus be treated as a constant for the purposes of optimizing the parameters of P(z) and Q(z|x). More importantly, we cannot directly evaluate the density ratio P(z)/Q(z), as we cannot evaluate Q(z) (Eq. 7). Instead, inspired by InfoNCE, it is possible to estimate this ratio using a classifier that distinguishes samples of P(z) from those of Q(z).

To understand how this objective behaves, consider taking the expectation of the ELBO (Eq. 8) over the true data distribution, P_data(x),

  E_{P_data(x)}[ELBO(x)] = E_{P_data(x)}[log P_data(x)] - D_KL(Q(z) || P(z))    (9)
Optimizing the ELBO thus matches the marginal distributions in latent space: it pushes Q(z) (Eq. 7) towards our parametric prior, P(z). In essence, all we are doing is finding a mapping, Q(z|x), from x to z such that, averaging over x drawn from the data, the resulting z's have a distribution close to P(z). However, it is not at all clear that this will give us a good representation. For instance, if P(z) is Gaussian, and if noise in the data is Gaussian, then it may be easier to obtain Gaussian z's by extracting noise rather than (as we would like) extracting high-level structure. That said, it may still be possible to do something useful by applying identifiability results inspired by ICA (e.g. Khemakhem et al., 2020).
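Since Eq. (8) only involves expectations over the latents, the identity in Eq. (9) can be checked exactly in a small discrete example. The tables below are arbitrary illustrative choices of our own:

```python
import math

# Discrete toy model: x in {0,1,2}, z in {0,1}.
P_data = [0.5, 0.3, 0.2]                     # true data distribution (fixed)
Q_zx = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # encoder Q(z|x)
P_z = [0.6, 0.4]                             # parametric prior P(z)

# Implicit marginal Q(z) = sum_x P_data(x) Q(z|x)   (Eq. 7)
Q_z = [sum(P_data[x] * Q_zx[x][z] for x in range(3)) for z in range(2)]

def elbo(x):
    """Per-datapoint ELBO (Eq. 8): log P_data(x) + E_{Q(z|x)}[log P(z) - log Q(z)]."""
    return math.log(P_data[x]) + sum(
        Q_zx[x][z] * (math.log(P_z[z]) - math.log(Q_z[z])) for z in range(2))

expected_elbo = sum(P_data[x] * elbo(x) for x in range(3))

# Eq. (9): the same quantity as a constant data term minus KL(Q(z) || P(z))
data_term = sum(P_data[x] * math.log(P_data[x]) for x in range(3))
kl = sum(Q_z[z] * math.log(Q_z[z] / P_z[z]) for z in range(2))
```

Here `expected_elbo` and `data_term - kl` agree to machine precision, and the parameter-dependent part of the objective is exactly the negative KL divergence of Eq. (9).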
3.2 Structured SSVAEs
The previous section argued that an SSVAE with just one latent and one observed variable is unlikely to give useful representations. Instead, consider a generative model with two observed variables, x and x', and two latent variables, z and z'. The approximate posterior is given in terms of neural network encoders for z and z' separately,

  Q(z, z'|x, x') = Q(z|x) Q(z'|x')    (10)

The generative model has factorised structure,

  P(x, x', z, z') = P(x|z) P(x'|z') P(z, z')    (11)

where P(z, z') may be a specific, parametric form (e.g. a Gaussian), and the decoders, P(x|z) and P(x'|z'), are given implicitly in terms of the encoders, Q(z|x) and Q(z'|x'), and the true marginal distributions of the data, P_data(x) and P_data(x'),

  P(x|z) = Q(z|x) P_data(x) / Q(z),  or equivalently  log P(x|z) = log Q(z|x) + log P_data(x) - log Q(z)    (12a)
  P(x'|z') = Q(z'|x') P_data(x') / Q(z'),  or equivalently  log P(x'|z') = log Q(z'|x') + log P_data(x') - log Q(z')    (12b)

where,

  Q(z) = integral dx P_data(x) Q(z|x),   Q(z') = integral dx' P_data(x') Q(z'|x')    (13)
Now, we compute the model evidence (note that we delay applying Jensen's inequality to get the ELBO),

  log P(x, x') = log E_{Q(z,z'|x,x')}[P(x, x', z, z') / Q(z, z'|x, x')]    (14)

Substituting for the approximate posterior (Eq. 10) and prior (Eq. 11),

  log P(x, x') = log E_{Q(z|x) Q(z'|x')}[P(x|z) P(x'|z') P(z, z') / (Q(z|x) Q(z'|x'))]    (15)

Substituting Eq. (12) and remembering that log P_data(x) and log P_data(x') are parameter-independent constants,

  log P(x, x') = log P_data(x) + log P_data(x') + log E_{Q(z|x) Q(z'|x')}[P(z, z') / (Q(z) Q(z'))]    (16)

Finally, applying Jensen's inequality, we get the ELBO,

  log P(x, x') >= ELBO(x, x') = log P_data(x) + log P_data(x') + E_{Q(z|x) Q(z'|x')}[log P(z, z') - log Q(z) - log Q(z')]    (17)
3.3 Deterministic encoders
Note that if we have a deterministic encoder,

  Q(z|x) = delta(z - g(x)),   Q(z'|x') = delta(z' - g'(x'))    (18)

where delta is the Kronecker delta, then the expectations in Eq. (17) collapse onto z = g(x) and z' = g'(x'), and the ELBO bound is tight,

  ELBO(x, x') = log P_data(x) + log P_data(x') + log [P(z, z') / (Q(z) Q(z'))]  with z = g(x) and z' = g'(x')    (19)

so the ELBO is equal to the model evidence (up to constant factors).
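This tightness claim can be verified exactly in a discrete analogue, where the Kronecker delta becomes a one-hot encoder distribution. All the tables below are arbitrary illustrative choices of our own:

```python
import math

# Discrete analogue of Sec. 3.3.
P_data = [0.6, 0.4]              # true marginal P_data(x), assumed shared for x'
P_zz = [[0.4, 0.1], [0.2, 0.3]]  # parametric prior P(z, z')

def jensen_gap(Q_zx, x, xp):
    """Gap between the log evidence (Eq. 16) and the ELBO (Eq. 17) for one
    (x, x') pair, dropping the shared constants log P_data(x) + log P_data(x'):
    log E_Q[P(z,z')/(Q(z)Q(z'))] - E_Q[log P(z,z')/(Q(z)Q(z'))]."""
    Q_z = [sum(P_data[xx] * Q_zx[xx][z] for xx in range(2)) for z in range(2)]

    def log_ratio(z, zp):
        return math.log(P_zz[z][zp]) - math.log(Q_z[z]) - math.log(Q_z[zp])

    terms = [(Q_zx[x][z] * Q_zx[xp][zp], z, zp)
             for z in range(2) for zp in range(2)]
    log_ev = math.log(sum(w * math.exp(log_ratio(z, zp))
                          for w, z, zp in terms if w > 0))
    elbo = sum(w * log_ratio(z, zp) for w, z, zp in terms if w > 0)
    return log_ev - elbo
```

With a one-hot encoder the gap is exactly zero for every (x, x') pair, while any genuinely stochastic encoder gives a strictly positive Jensen gap.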
3.4 Understanding the SSVAE objective
To understand how this objective behaves, consider its expectation under the data distribution, P_data(x, x'),

  E_{P_data(x,x')}[ELBO(x, x')] = const + E_{Q(z,z')}[log P(z, z') - log Q(z) - log Q(z')]    (20)

where const = E_{P_data(x,x')}[log P_data(x) + log P_data(x')] and,

  Q(z, z') = integral dx dx' P_data(x, x') Q(z|x) Q(z'|x')    (21)

Adding and subtracting log Q(z, z'),

  E_{P_data(x,x')}[ELBO(x, x')] = const + E_{Q(z,z')}[log Q(z, z') - log Q(z) - log Q(z')] + E_{Q(z,z')}[log P(z, z') - log Q(z, z')]    (22)

allows us to identify two KL-divergence terms,

  E_{P_data(x,x')}[ELBO(x, x')] = const + I_Q(z; z') - D_KL(Q(z, z') || P(z, z'))    (23)

The first term is a mutual information. The objective therefore maximizes the mutual information between z and z' under Q(z, z') (Eq. 21). At the same time, the second term is the negative KL divergence between Q(z, z') and the parametric prior, P(z, z'). Thus, the objective also pushes Q(z, z') towards the parametric prior, P(z, z').
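The decomposition in Eq. (23) is an exact identity, which can be confirmed numerically in a small discrete model. The tables below are arbitrary illustrative choices, with one encoder shared between the two observations for simplicity:

```python
import math

# Discrete toy for Sec. 3.4: x, x' in {0,1} (correlated), z, z' in {0,1}.
P_xx = [[0.4, 0.1], [0.1, 0.4]]      # true joint P_data(x, x')
Q_zx = [[0.8, 0.2], [0.3, 0.7]]      # encoder Q(z|x) = Q(z'|x')
P_zz = [[0.35, 0.15], [0.15, 0.35]]  # parametric prior P(z, z')

# Induced joint over latents (Eq. 21) and its marginals
Q_zz = [[sum(P_xx[x][xp] * Q_zx[x][z] * Q_zx[xp][zp]
             for x in range(2) for xp in range(2))
         for zp in range(2)] for z in range(2)]
Q_z = [sum(Q_zz[z][zp] for zp in range(2)) for z in range(2)]
Q_zp = [sum(Q_zz[z][zp] for z in range(2)) for zp in range(2)]

# Expected ELBO minus the constant data terms (Eq. 20)
obj = sum(Q_zz[z][zp] * (math.log(P_zz[z][zp])
                         - math.log(Q_z[z]) - math.log(Q_zp[zp]))
          for z in range(2) for zp in range(2))

# Decomposition (Eq. 23): mutual information minus KL to the prior
mi = sum(Q_zz[z][zp] * math.log(Q_zz[z][zp] / (Q_z[z] * Q_zp[zp]))
         for z in range(2) for zp in range(2))
kl = sum(Q_zz[z][zp] * math.log(Q_zz[z][zp] / P_zz[z][zp])
         for z in range(2) for zp in range(2))
```

Here `obj` equals `mi - kl` to machine precision, with both the mutual information and the KL term non-negative, as Eq. (23) requires.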
3.5 Recovering a maximum mutualinformation objective
We can recover a mutual-information objective by giving an implicit definition of the prior over latent variables,

  P(z, z') = Q(z, z')    (24)

in which case the KL divergence between Q(z, z') and P(z, z') is zero, and we are left with just the mutual information (Eq. 2),

  E_{P_data(x,x')}[ELBO(x, x')] = const + I_Q(z; z')    (25)
3.6 Recovering the exact form for the InfoNCE estimator
Recent work has argued that the good representations arising from InfoNCE cannot come from maximizing mutual information alone, because the mutual information is invariant under arbitrary invertible transformations (Tschannen et al., 2019; Li et al., 2021). Instead, the good properties must arise somehow out of the simple mutual information estimator in Eq. (4). Remarkably, this exact estimator (at least in the infinite-N limit) can be recovered by making a specific choice of prior on the latent space. We choose the prior on z' implicitly, as P(z') = Q(z'), and we choose the distribution over z conditioned on z' to be given by an energy-based model that depends on Q(z) and an arbitrary coupling function, f(z, z'), which could be given by Eq. (4) as in the original InfoNCE, or could be more general,

  P(z') = Q(z')    (26)
  P(z|z') = Q(z) e^{f(z,z')} / Z(z')    (27)

The normalizing constant, Z(z'), is

  Z(z') = E_{Q(z)}[e^{f(z,z')}] = integral dz Q(z) e^{f(z,z')}    (28)

To obtain the ELBO objective in Eq. (17), we need the ratio,

  log [P(z, z') / (Q(z) Q(z'))] = log [P(z|z') / Q(z)] = f(z, z') - log Z(z')    (29)

Taking the expectation of this ratio under Q(z, z') gives exactly the infinite-N limit of the InfoNCE objective in Eq. (5), so the full expected ELBO becomes,

  E_{P_data(x,x')}[ELBO(x, x')] = const + E_{Q(z,z')}[f(z, z')] - E_{Q(z')}[log Z(z')]    (30)

as derived by Wang & Isola (2020) and used by Li et al. (2021).
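The identity in Eq. (29), and hence Eq. (30), can again be checked exactly with discrete latents. The marginals, joint, and coupling values below are arbitrary illustrative choices, with f standing in for the bilinear form of Eq. (4):

```python
import math

# Discrete latents z, z' in {0,1}.
Q_z = [0.6, 0.4]
Q_zp = [0.7, 0.3]
Q_zz = [[0.5, 0.1], [0.2, 0.2]]  # a joint Q(z, z') with the marginals above
f = [[1.0, -0.5], [-1.0, 2.0]]   # coupling function f(z, z')

# Normalizing constant Z(z') = E_{Q(z)}[e^{f(z, z')}]   (Eq. 28)
Z = [sum(Q_z[z] * math.exp(f[z][zp]) for z in range(2)) for zp in range(2)]

# Prior: P(z') = Q(z') and P(z|z') = Q(z) e^{f(z,z')} / Z(z')   (Eqs. 26-27)
P_zz = [[Q_zp[zp] * Q_z[z] * math.exp(f[z][zp]) / Z[zp]
         for zp in range(2)] for z in range(2)]

# Expected ELBO minus constants, computed two ways:
# (a) directly, as E_{Q(z,z')}[log P(z,z') - log Q(z) - log Q(z')]   (Eq. 20)
obj = sum(Q_zz[z][zp] * (math.log(P_zz[z][zp])
                         - math.log(Q_z[z]) - math.log(Q_zp[zp]))
          for z in range(2) for zp in range(2))
# (b) as the infinite-N InfoNCE objective E[f(z,z')] - E[log Z(z')]  (Eqs. 29-30)
infonce = sum(Q_zz[z][zp] * (f[z][zp] - math.log(Z[zp]))
              for z in range(2) for zp in range(2))
```

The prior constructed this way is normalized by construction, and the two computations of the expected ELBO agree to machine precision, confirming that this choice of prior turns the ELBO into the infinite-N InfoNCE objective.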
4 Related work
Perhaps the closest prior work is Zimmermann et al. (2021), which also identifies an interpretation of InfoNCE as inference in a principled generative model. Unlike that work, we identify a connection between the InfoNCE objective and the ELBO or model evidence. Moreover, their proof requires complex geometric properties, whereas ours merely involves straightforward manipulations of probability distributions. In addition, their approach requires four restrictive assumptions. First, they assume deterministic encoders. In contrast, all our theory applies to stochastic encoders. While we do explicitly consider deterministic encoders in Sec. 3.3, this is only to show that with deterministic encoders the ELBO bound is tight; all the derivations outside of this very small section (which includes all our key derivations) use fully general encoders, Q(z|x) and Q(z'|x'). Second, they assume that the encoder is invertible, i.e. that there exists a deterministic decoder, which is not necessary in our framework. Third, they assume that the latent space is the unit hypersphere, while in our framework the latent space can be any set. Fourth, they assume that the ground-truth marginal over the latents of the generative process is uniform, whereas our framework accepts any choice of ground-truth marginal. As such, our framework has considerably more flexibility to include rich priors on complex, structured latent spaces.

Other work has looked at the specific case of isolating content from style (von Kügelgen et al., 2021). This work used a similar derivation to that in Zimmermann et al. (2021) with slightly different assumptions. While they still require deterministic, invertible encoders, they relax e.g. uniformity in the latent space. But because they work in the specific case of style and content variables, they make a number of additional assumptions on those variables. Importantly, they again do not connect the InfoNCE objective with the ELBO or model evidence.
Very different methods use noise-contrastive methods to update a VAE prior (Aneja et al., 2020). Importantly, they still use an explicit decoder.

There is a large class of work that seeks to use VAEs to extract useful, disentangled representations (e.g. Burgess et al., 2018; Chen et al., 2018; Kim & Mnih, 2018; Mathieu et al., 2019; Joy et al., 2020). Again, this work differs from ours in that it uses explicit decoders and thus does not identify an explicit link to self-supervised learning.

Likewise, there is work on using GANs to learn interpretable latent spaces (e.g. Chen et al., 2016a). Importantly, GANs learn a decoder (mapping from the random latent space to the data domain). Moreover, GANs use a classifier to estimate a density ratio. However, GANs estimate this density ratio for the data, between the model's distribution and the true data distribution, whereas InfoNCE, like the methods described here, uses a classifier to estimate a density ratio on the latent space (e.g. between P(z) and Q(z)).

There is work on reinterpreting classifiers as energy-based probabilistic generative models (e.g. Grathwohl et al., 2019), which is related if we view self-supervised learning methods as being analogous to a classifier. Our work is very different, if for no other reason than that it is not possible to sample data from an SSVAE (even using a method like MCMC), because the decoder is written in terms of the unknown true data distribution. It would be interesting to consider the relationship to these models, but this is out of scope for the present work.
5 Conclusions
In conclusion, we have seen that the ELBO in an SSVAE is equal to the mutual information under one choice of prior, and equal to the InfoNCE parametric mutual information estimator under an alternative prior (up to constants). As such, we not only unify self-supervised and unsupervised learning methods, we also provide a principled framework for using simple parametric models in the latent space to enforce disentangled representations.
References
 Ackley et al. (1985) David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for Boltzmann machines. Cognitive science, 9(1):147–169, 1985.
 Aneja et al. (2020) Jyoti Aneja, Alexander Schwing, Jan Kautz, and Arash Vahdat. NCP-VAE: Variational autoencoders with noise contrastive priors. arXiv preprint arXiv:2010.02917, 2020.
 Burgess et al. (2018) Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
 Chen et al. (2018) Ricky TQ Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.
 Chen et al. (2016a) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. arXiv preprint arXiv:1606.03657, 2016a.
 Chen et al. (2016b) Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016b.
 Dayan et al. (1995) Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The Helmholtz machine. Neural computation, 7(5):889–904, 1995.

 Doersch et al. (2015) Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE international conference on computer vision, pp. 1422–1430, 2015.
 Dorta et al. (2018) Garoe Dorta, Sara Vicente, Lourdes Agapito, Neill DF Campbell, and Ivor Simpson. Structured uncertainty prediction networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5477–5485, 2018.
 Dosovitskiy et al. (2015) Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence, 38(9):1734–1747, 2015.
 Gidaris et al. (2018) Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
 Grathwohl et al. (2019) Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. arXiv preprint arXiv:1912.03263, 2019.
 Hernandez-Garcia (2020) Alex Hernandez-Garcia. Rethinking supervised learning: insights from biological learning and from calling it by its name. arXiv preprint arXiv:2012.02526, 2020.
 Hinton et al. (1995) Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158–1161, 1995.
 Joy et al. (2020) Tom Joy, Sebastian Schmon, Philip Torr, N Siddharth, and Tom Rainforth. Capturing label characteristics in VAEs. In International Conference on Learning Representations, 2020.

 Khemakhem et al. (2020) Ilyes Khemakhem, Diederik Kingma, Ricardo Monti, and Aapo Hyvarinen. Variational autoencoders and nonlinear ICA: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pp. 2207–2217. PMLR, 2020.
 Kim & Mnih (2018) Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In International Conference on Machine Learning, pp. 2649–2658. PMLR, 2018.
 Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Li et al. (2021) Yazhe Li, Roman Pogodin, Danica J Sutherland, and Arthur Gretton. Self-supervised learning with kernel dependence maximization. arXiv preprint arXiv:2106.08320, 2021.
 Mathieu et al. (2019) Emile Mathieu, Tom Rainforth, Nana Siddharth, and Yee Whye Teh. Disentangling disentanglement in variational autoencoders. In International Conference on Machine Learning, pp. 4402–4412. PMLR, 2019.
 Noroozi & Favaro (2016) Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pp. 69–84. Springer, 2016.
 Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
 Poole et al. (2019) Ben Poole, Sherjil Ozair, Aaron Van Den Oord, Alex Alemi, and George Tucker. On variational bounds of mutual information. In International Conference on Machine Learning, pp. 5171–5180. PMLR, 2019.

 Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pp. 1278–1286. PMLR, 2014.
 Tschannen et al. (2019) Michael Tschannen, Josip Djolonga, Paul K Rubenstein, Sylvain Gelly, and Mario Lucic. On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625, 2019.
 von Kügelgen et al. (2021) Julius von Kügelgen, Yash Sharma, Luigi Gresele, Wieland Brendel, Bernhard Schölkopf, Michel Besserve, and Francesco Locatello. Self-supervised learning with data augmentations provably isolates content from style. arXiv preprint arXiv:2106.04619, 2021.
 Wang & Isola (2020) Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pp. 9929–9939. PMLR, 2020.
 Zimmermann et al. (2021) Roland S Zimmermann, Yash Sharma, Steffen Schneider, Matthias Bethge, and Wieland Brendel. Contrastive learning inverts the data generating process. arXiv preprint arXiv:2102.08850, 2021.