InfoNCE is a variational autoencoder

by Laurence Aitchison, et al.

We show that a popular self-supervised learning method, InfoNCE, is a special case of a new family of unsupervised learning methods, the self-supervised variational autoencoder (SSVAE). SSVAEs circumvent the usual VAE requirement to reconstruct the data by using a carefully chosen implicit decoder. The InfoNCE objective was motivated as a simplified parametric mutual information estimator. Under one choice of prior, the SSVAE objective (i.e. the ELBO) is exactly equal to the mutual information (up to constants). Under an alternative choice of prior, the SSVAE objective is exactly equal to the simplified parametric mutual information estimator used in InfoNCE (up to constants). Importantly, the use of simplified parametric mutual information estimators is believed to be critical to obtain good high-level representations, and the SSVAE framework naturally provides a principled justification for using prior information to choose these estimators.



1 Introduction

A common challenge occurring across machine learning is to extract useful, structured representations from unlabelled data (such as images). There are at present two broad approaches to this problem: unsupervised learning and self-supervised learning. The delineation between unsupervised and self-supervised learning is sometimes regarded as unclear (Hernandez-Garcia, 2020), but nonetheless we tentatively outline the broad characteristics of these approaches below.

Unsupervised learning can be traced back at least to the Boltzmann machine (Ackley et al., 1985) and the Helmholtz machine (Dayan et al., 1995; Hinton et al., 1995). This work emphasises two key characteristics of most unsupervised learning models: first, they model the probability density of the data, and second, they use latent variables that are ideally interpretable. Modern work in unsupervised learning typically uses variational autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014). VAEs (like the Helmholtz machine) learn a probabilistic encoder which maps from the data to a latent representation, and a decoder which maps from the latent representation back to the data domain. This highlights perhaps the key issue with VAEs: the need to reconstruct the data, which may be highly complex (e.g. images) (Dorta et al., 2018). This may force the latent space to encode details of the image that are irrelevant for forming a good representation of e.g. object identity (Chen et al., 2016b).

Self-supervised learning is an alternative class of methods that learn good representations without needing to reconstruct the data. One common approach to self-supervised learning is to define a “pretext” classification task (Dosovitskiy et al., 2015; Noroozi & Favaro, 2016; Doersch et al., 2015; Gidaris et al., 2018). For instance, we might take a number of photos, rotate them, and then ask the model to determine the rotation applied (Gidaris et al., 2018). The rotation can be identified by looking at the objects in the image (e.g. grass is typically on the bottom, while aircraft are typically in the sky), and thus a representation useful for determining the orientation may also extract useful information for other high-level tasks. We are interested in an alternative class of objectives known as InfoNCE (NCE standing for noise contrastive estimation) (Oord et al., 2018). These methods take two inputs (e.g. two different patches from the same underlying image), encode them to form two latent representations, and maximize the mutual information between them. As the shared information should concern high-level properties such as objects, but not low-level details of each patch, this should again extract a useful representation.

Here, we develop a new family of self-supervised variational autoencoders (SSVAEs). We show that the SSVAE objective (the ELBO) is exactly equal to the mutual information under one particular choice of prior, and exactly equal to the parametric mutual information estimator used in InfoNCE under an alternative prior (up to constants). This provides a principled justification for the use of simplified parametric estimators of mutual information in InfoNCE methods, which resolves conceptual issues around the use of the mutual information in self-supervised learning. In particular, recent work has shown that mutual information maximization alone is not sufficient to explain the good representations learned by InfoNCE (Tschannen et al., 2019): as the mutual information is invariant under arbitrary invertible transformations, maximizing mutual information could give highly entangled representations. In practice, InfoNCE is likely to learn good representations due to the choice of a highly simplified parametric model for estimating the mutual information (Oord et al., 2018), which has recently been shown to be closely related to kernel methods (Li et al., 2021). Our approach gives insights into how these simplified estimators could be understood as priors on the latent space.

2 Background

2.1 Variational Autoencoders

Usually in a variational autoencoder (Kingma & Welling, 2013; Rezende et al., 2014), we have observed data, $x$, and latents, $z$, and we specify a prior, $P(z)$, a likelihood, $P(x|z)$, and an approximate posterior, $Q(z|x)$. We then jointly optimize parameters of the prior, likelihood, and approximate posterior using the ELBO,

$$\log P(x) \geq \mathcal{L}(x) = \mathbb{E}_{Q(z|x)}\left[\log \frac{P(x|z)\, P(z)}{Q(z|x)}\right], \tag{1}$$

which bounds the log of the model evidence or marginal likelihood, $\log P(x)$ (as can be shown using Jensen’s inequality). The approximate posterior, $Q(z|x)$, is often known as the encoder as it maps from data to latents, while the likelihood, $P(x|z)$, is often known as the decoder, as it maps from latents back to the data domain.
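As a minimal numerical sketch of the ELBO bound, the snippet below evaluates both sides on a toy model with binary $x$ and $z$. The probability tables and function names are illustrative assumptions, not taken from the paper. It also computes the gap, which is the KL divergence from the approximate posterior to the exact posterior.

```python
import math

# Toy discrete VAE with x, z in {0, 1}; all tables are illustrative assumptions.
prior = [0.6, 0.4]                        # prior[z] = P(z)
likelihood = [[0.9, 0.1], [0.2, 0.8]]     # likelihood[z][x] = P(x | z)
encoder = [[0.7, 0.3], [0.25, 0.75]]      # encoder[x][z] = Q(z | x)


def log_evidence(x):
    """log P(x) = log sum_z P(x | z) P(z)."""
    return math.log(sum(likelihood[z][x] * prior[z] for z in range(2)))


def elbo(x):
    """E_{Q(z|x)}[log P(x|z) P(z) / Q(z|x)]."""
    return sum(
        encoder[x][z] * math.log(likelihood[z][x] * prior[z] / encoder[x][z])
        for z in range(2)
    )


def posterior_kl(x):
    """KL(Q(z|x) || P(z|x)), the gap between log evidence and ELBO."""
    evidence = math.exp(log_evidence(x))
    return sum(
        encoder[x][z]
        * math.log(encoder[x][z] / (likelihood[z][x] * prior[z] / evidence))
        for z in range(2)
    )


for x in range(2):
    print(x, elbo(x), log_evidence(x), posterior_kl(x))
```

For each $x$ the printed ELBO is no larger than the log evidence, and the difference equals the printed KL term.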

2.2 InfoNCE

In InfoNCE (Oord et al., 2018), there are two data items, $x$ and $x'$. Oord et al. (2018) initially describe a time-series setting where $x$ was a context giving the recent history of past data and $x'$ was data for the next time step. But Oord et al. (2018) also consider other contexts, such as where $x$ and $x'$ are different augmentations of the same underlying image. The InfoNCE objective was originally motivated as maximizing the mutual information between latent representations,

$$I(z; z') = \mathbb{E}_{Q(z, z')}\left[\log \frac{Q(z, z')}{Q(z)\, Q(z')}\right]. \tag{2}$$
Note that we use $Q$ rather than $P$ for consistency with the VAE derivations in the methods, and $Q(z, z')$ denotes the distribution induced by taking true data, $P_{\text{true}}(x, x')$, and encoding them with neural networks, $Q(z|x)$ and $Q(z'|x')$. As the mutual information is difficult to estimate, they use a bound based on a classifier that distinguishes the positive sample (i.e. the $z'$ paired with the corresponding $z$) from negative samples (i.e. $z'_j$ drawn from the marginal distribution $Q(z')$ and unrelated to $z$ or to the underlying data; see Poole et al., 2019 for further details),

$$I_N = \mathbb{E}\left[\log \frac{f(z, z')}{f(z, z') + \sum_{j=1}^{N} f(z, z'_j)}\right] + \log N. \tag{3}$$
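The original InfoNCE paper uses,

$$f(z, z') = \exp\left(z^T W z'\right), \tag{4}$$

where $W$ is a weight matrix trained jointly with the parameters of $Q(z|x)$ and $Q(z'|x')$ using the above objective. Taking the limit as $N$ goes to infinity, the bound becomes tight (Oord et al., 2018), and can be written as (Wang & Isola, 2020; Li et al., 2021),

$$I_\infty = \mathbb{E}_{Q(z, z')}\left[\log f(z, z')\right] - \mathbb{E}_{Q(z)}\left[\log \mathbb{E}_{Q(z')}\left[f(z, z')\right]\right]. \tag{5}$$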
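To make the infinite-sample objective concrete, the following sketch evaluates it exactly on a two-state latent space. The joint table, the numeric values attached to each state, and the scalar weight `w` (standing in for the matrix $W$) are all assumptions made for illustration. On this toy example the objective stays below the mutual information for the bilinear critic and matches it when the critic is the density ratio $Q(z,z')/(Q(z)Q(z'))$.

```python
import math

# Illustrative joint Q(z, z') over two latent states (an assumption for this sketch).
q_joint = [[0.40, 0.10], [0.15, 0.35]]
values = [-1.0, 1.0]   # hypothetical numeric value attached to each latent state
Z = range(2)
q_z = [sum(q_joint[z][zp] for zp in Z) for z in Z]     # Q(z)
q_zp = [sum(q_joint[z][zp] for z in Z) for zp in Z]    # Q(z')


def mutual_information():
    """Mutual information between z and z' under Q."""
    return sum(
        q_joint[z][zp] * math.log(q_joint[z][zp] / (q_z[z] * q_zp[zp]))
        for z in Z for zp in Z
    )


def infonce_limit(f):
    """E_{Q(z,z')}[log f] - E_{Q(z)}[log E_{Q(z')}[f]] for a positive critic f."""
    positive = sum(q_joint[z][zp] * math.log(f(z, zp)) for z in Z for zp in Z)
    negative = sum(
        q_z[z] * math.log(sum(q_zp[zp] * f(z, zp) for zp in Z)) for z in Z
    )
    return positive - negative


def bilinear(z, zp, w=0.8):
    """Scalar analogue of f(z, z') = exp(z^T W z')."""
    return math.exp(values[z] * w * values[zp])


def density_ratio(z, zp):
    """Critic equal to Q(z, z') / (Q(z) Q(z'))."""
    return q_joint[z][zp] / (q_z[z] * q_zp[zp])


print(infonce_limit(bilinear), infonce_limit(density_ratio), mutual_information())
```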


3 Methods

We begin by looking at the unstructured SSVAEs with a single latent and observed variable. This gives useful intuition but does not recover InfoNCE. We then go on to look at the structured SSVAE with two latent and two observed variables which recovers InfoNCE.

3.1 Unstructured SSVAEs

In a standard variational autoencoder, we specify parametric forms (e.g. using neural networks) for the prior, $P(z)$, the likelihood, $P(x|z)$, and the approximate posterior, $Q(z|x)$. However, in an SSVAE, we specify only the prior, $P(z)$, and the approximate posterior, $Q(z|x)$. The likelihood, $P(x|z)$, is given implicitly. In a simple model with one latent variable, $z$, and one observation, $x$, the likelihood is given by Bayes’ theorem,

$$P(x|z) = \frac{Q(z|x)\, P_{\text{true}}(x)}{Q(z)} \quad \text{or equivalently} \quad \frac{P(x|z)}{P_{\text{true}}(x)} = \frac{Q(z|x)}{Q(z)}, \tag{6}$$

where $P_{\text{true}}(x)$ is the true distribution over data, which is fixed, independent of parameters, and in general different from the model’s distribution, $P(x)$. The joint distribution here is specified by combining $P_{\text{true}}(x)$ with the parametric encoder, $Q(z|x)$. The marginal, $Q(z)$, can thus be written in terms of those distributions,

$$Q(z) = \int dx\, Q(z|x)\, P_{\text{true}}(x). \tag{7}$$
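Substituting the likelihood (Eq. 6) into the ELBO (Eq. 1), we get,

$$\mathcal{L}(x) = \log P_{\text{true}}(x) + \mathbb{E}_{Q(z|x)}\left[\log \frac{P(z)}{Q(z)}\right], \tag{8}$$

where $P(z)$ is our parametric form for the prior and $Q(z)$ is given implicitly by Eq. 7.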

Remember that $\log P_{\text{true}}(x)$ is constant with respect to the parameters as $P_{\text{true}}(x)$ is the true, fixed data distribution. This term can thus be treated as a constant for the purposes of optimizing the parameters of $P(z)$ and $Q(z|x)$. More importantly, we cannot directly evaluate the density ratio $P(z)/Q(z)$, as we cannot evaluate $Q(z)$ (Eq. 7). Instead, and inspired by InfoNCE, it is possible to estimate this ratio using a classifier that distinguishes samples of $P(z)$ from those of $Q(z)$.

To understand how this objective behaves, consider taking the expectation of the ELBO (Eq. 8) over the true data distribution $P_{\text{true}}(x)$,

$$\mathbb{E}_{P_{\text{true}}(x)}\left[\mathcal{L}(x)\right] = \mathbb{E}_{Q(z)}\left[\log \frac{P(z)}{Q(z)}\right] + \text{const} = -\mathrm{KL}\left(Q(z)\, \|\, P(z)\right) + \text{const}. \tag{9}$$

Optimizing the ELBO thus matches the marginal distributions in latent space between $Q(z)$ (Eq. 7) and our parametric prior, $P(z)$. In essence all we are doing is to find a mapping, $Q(z|x)$, from $x$ to $z$ such that, averaging over $x$ from the data, the resulting $z$’s have a distribution close to $P(z)$. However, it is not at all clear that this will give us a good representation. For instance, if $P(z)$ is Gaussian, and if noise in the data, $x$, is Gaussian, then it may be easier to get Gaussian $z$’s by extracting noise, rather than (as we would like) extracting high-level structure. That said, it may still be possible to do something useful by applying identifiability results inspired by ICA (e.g. Khemakhem et al., 2020).
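The unstructured SSVAE can be checked numerically on a small discrete example. The sketch below uses made-up tables for the data distribution, encoder and prior (all assumptions). It builds the implicit decoder from Bayes' theorem, then confirms that the standard ELBO with that decoder equals $\log P_{\text{true}}(x)$ plus the expected log ratio of prior to aggregate posterior. It also confirms that the data-averaged ELBO is the negative data entropy minus $\mathrm{KL}(Q(z)\,\|\,P(z))$.

```python
import math

# Illustrative toy tables (assumptions, not from the paper).
p_true = [0.5, 0.3, 0.2]                          # P_true(x), x in {0, 1, 2}
encoder = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]    # encoder[x][z] = Q(z | x)
prior = [0.5, 0.5]                                # parametric prior P(z)
X, Z = range(3), range(2)

# Aggregate posterior: Q(z) = sum_x Q(z | x) P_true(x)
q_z = [sum(p_true[x] * encoder[x][z] for x in X) for z in Z]


def implicit_decoder(x, z):
    """P(x | z) = Q(z | x) P_true(x) / Q(z), given by Bayes' theorem."""
    return encoder[x][z] * p_true[x] / q_z[z]


def elbo_explicit(x):
    """Standard ELBO evaluated with the implicit decoder plugged in."""
    return sum(
        encoder[x][z]
        * math.log(implicit_decoder(x, z) * prior[z] / encoder[x][z])
        for z in Z
    )


def elbo_ssvae(x):
    """log P_true(x) + E_{Q(z|x)}[log P(z) / Q(z)]."""
    return math.log(p_true[x]) + sum(
        encoder[x][z] * math.log(prior[z] / q_z[z]) for z in Z
    )


expected_elbo = sum(p_true[x] * elbo_ssvae(x) for x in X)
entropy = -sum(p * math.log(p) for p in p_true)
kl_marginal = sum(q_z[z] * math.log(q_z[z] / prior[z]) for z in Z)
print(expected_elbo, -entropy - kl_marginal)
```

Because the entropy term does not depend on the encoder or prior, maximizing the expected ELBO here only shrinks the KL term, which is the marginal-matching behaviour described above.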

3.2 Structured SSVAEs

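The previous section argued that an SSVAE with just one latent and observed variable is unlikely to give useful representations. Instead, consider a generative model with two observed variables, $x$ and $x'$, and two latent variables, $z$ and $z'$. The approximate posterior is given in terms of neural network encoders for $z$ and $z'$ separately,

$$Q(z, z'|x, x') = Q(z|x)\, Q(z'|x'). \tag{10}$$

The generative model has structure $x \leftarrow z \leftrightarrow z' \rightarrow x'$,

$$P(x, x', z, z') = P(x|z)\, P(x'|z')\, P(z, z'), \tag{11}$$

where $P(z, z')$ may be a specific, parametric form (e.g. a Gaussian), and the decoders, $P(x|z)$ and $P(x'|z')$, are given implicitly in terms of the encoders, $Q(z|x)$ and $Q(z'|x')$, and the true marginal distributions in the data, $P_{\text{true}}(x)$ and $P_{\text{true}}(x')$,

$$P(x|z) = \frac{Q(z|x)\, P_{\text{true}}(x)}{Q(z)} \quad \text{or equivalently} \quad \frac{P(x|z)}{P_{\text{true}}(x)} = \frac{Q(z|x)}{Q(z)}, \tag{12a}$$

$$P(x'|z') = \frac{Q(z'|x')\, P_{\text{true}}(x')}{Q(z')} \quad \text{or equivalently} \quad \frac{P(x'|z')}{P_{\text{true}}(x')} = \frac{Q(z'|x')}{Q(z')}. \tag{12b}$$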

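Now, we compute the model evidence (note we delay applying Jensen’s inequality to get the ELBO),

$$\log P(x, x') = \log \int dz\, dz'\, P(x, x', z, z') = \log \mathbb{E}_{Q(z, z'|x, x')}\left[\frac{P(x, x', z, z')}{Q(z, z'|x, x')}\right]. \tag{13}$$

Substituting for the approximate posterior (Eq. 10) and prior (Eq. 11),

$$\log P(x, x') = \log \mathbb{E}_{Q(z|x)\, Q(z'|x')}\left[\frac{P(x|z)\, P(x'|z')\, P(z, z')}{Q(z|x)\, Q(z'|x')}\right]. \tag{14}$$

Substituting Eq. (12) and remembering that $P_{\text{true}}(x)$ and $P_{\text{true}}(x')$ are parameter-independent constants,

$$\log P(x, x') = \log \mathbb{E}_{Q(z|x)\, Q(z'|x')}\left[\frac{P(z, z')}{Q(z)\, Q(z')}\right] + \text{const}. \tag{15}$$

Finally, applying Jensen’s inequality we get the ELBO,

$$\log P(x, x') \geq \mathcal{L}(x, x') = \mathbb{E}_{Q(z|x)\, Q(z'|x')}\left[\log \frac{P(z, z')}{Q(z)\, Q(z')}\right] + \text{const}. \tag{17}$$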
3.3 Deterministic encoders

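Note that if we have a deterministic encoder,

$$Q(z|x) = \delta\left(z - g(x)\right), \qquad Q(z'|x') = \delta\left(z' - g'(x')\right), \tag{18}$$

where $\delta$ is the Kronecker delta and $g$ and $g'$ denote the deterministic encoder functions, then we have $z = g(x)$ and $z' = g'(x')$ and the ELBO bound is tight,

$$\log P(x, x') = \log \frac{P\left(z = g(x),\, z' = g'(x')\right)}{Q\left(z = g(x)\right)\, Q\left(z' = g'(x')\right)} + \text{const} = \mathcal{L}(x, x'), \tag{19}$$

so the ELBO is equal to the model evidence (up to additive constants).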
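The relationship between the structured model evidence and its ELBO can be illustrated on a discrete toy problem. In the sketch below the data table, prior table and encoders are assumptions chosen for illustration, and the same encoder table is reused for $x$ and $x'$ purely for brevity. The additive constant $\log P_{\text{true}}(x) + \log P_{\text{true}}(x')$ is dropped from both quantities, so they remain directly comparable.

```python
import math

# Illustrative toy tables (assumptions, not from the paper).
p_true = [[0.30, 0.05, 0.05],     # p_true[x][xp] = P_true(x, x')
          [0.05, 0.20, 0.05],
          [0.05, 0.05, 0.20]]
prior = [[0.35, 0.15], [0.15, 0.35]]    # prior[z][zp] = P(z, z')
X, Z = range(3), range(2)


def marginals(enc, enc_p):
    """Q(z) and Q(z') induced by the data marginals and the two encoders."""
    px = [sum(p_true[x][xp] for xp in X) for x in X]
    pxp = [sum(p_true[x][xp] for x in X) for xp in X]
    qz = [sum(px[x] * enc[x][z] for x in X) for z in Z]
    qzp = [sum(pxp[xp] * enc_p[xp][zp] for xp in X) for zp in Z]
    return qz, qzp


def evidence_and_elbo(x, xp, enc, enc_p):
    """Return (log E[r], E[log r]) with r = P(z,z') / (Q(z) Q(z')), constants dropped."""
    qz, qzp = marginals(enc, enc_p)
    mean_r, mean_log_r = 0.0, 0.0
    for z in Z:
        for zp in Z:
            weight = enc[x][z] * enc_p[xp][zp]    # Q(z|x) Q(z'|x')
            if weight == 0.0:
                continue                          # zero-probability latents contribute nothing
            ratio = prior[z][zp] / (qz[z] * qzp[zp])
            mean_r += weight * ratio
            mean_log_r += weight * math.log(ratio)
    return math.log(mean_r), mean_log_r


stochastic = [[0.9, 0.1], [0.3, 0.7], [0.2, 0.8]]
deterministic = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # z = 0 if x == 0, else z = 1

print(evidence_and_elbo(0, 1, stochastic, stochastic))
print(evidence_and_elbo(0, 1, deterministic, deterministic))
```

With the stochastic encoder the second entry (the ELBO) is at most the first (the evidence). With the deterministic encoder only one latent pair has non-zero weight, so the two coincide.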

3.4 Understanding the SSVAE objective

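To understand how this objective behaves, consider its expectation under the data distribution, $P_{\text{true}}(x, x')$,

$$\mathbb{E}_{P_{\text{true}}(x, x')}\left[\mathcal{L}(x, x')\right] = \mathbb{E}_{Q(z, z')}\left[\log \frac{P(z, z')}{Q(z)\, Q(z')}\right] + \text{const}, \tag{20}$$

where

$$Q(z, z') = \int dx\, dx'\, P_{\text{true}}(x, x')\, Q(z|x)\, Q(z'|x'). \tag{21}$$

Adding and subtracting $\mathbb{E}_{Q(z, z')}\left[\log Q(z, z')\right]$,

$$\mathbb{E}_{P_{\text{true}}(x, x')}\left[\mathcal{L}(x, x')\right] = \mathbb{E}_{Q(z, z')}\left[\log \frac{Q(z, z')}{Q(z)\, Q(z')}\right] + \mathbb{E}_{Q(z, z')}\left[\log \frac{P(z, z')}{Q(z, z')}\right] + \text{const}, \tag{22}$$

allows us to identify two KL-divergence terms,

$$\mathbb{E}_{P_{\text{true}}(x, x')}\left[\mathcal{L}(x, x')\right] = \mathrm{KL}\left(Q(z, z')\, \|\, Q(z)\, Q(z')\right) - \mathrm{KL}\left(Q(z, z')\, \|\, P(z, z')\right) + \text{const}. \tag{23}$$

The first term is a mutual information. The objective therefore maximizes the mutual information between $z$ and $z'$ under $Q(z, z')$ (Eq. 21). At the same time, the second term is the (negative) KL-divergence between $Q(z, z')$ and the parametric prior, $P(z, z')$. Thus, the objective also pushes $Q(z, z')$ towards the parametric prior, $P(z, z')$.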
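The decomposition into a mutual-information term and a KL term can be verified exactly on a discrete example. The sketch below assumes illustrative tables for the paired data distribution, a shared encoder and the prior (none of them from the paper), and drops the parameter-independent constant. It also evaluates the objective with the prior set equal to $Q(z, z')$, where the KL term vanishes.

```python
import math

# Illustrative toy tables (assumptions, not from the paper).
p_true = [[0.30, 0.05, 0.05],     # p_true[x][xp] = P_true(x, x')
          [0.05, 0.20, 0.05],
          [0.05, 0.05, 0.20]]
enc = [[0.9, 0.1], [0.3, 0.7], [0.2, 0.8]]   # Q(z|x), reused for Q(z'|x') for brevity
prior = [[0.35, 0.15], [0.15, 0.35]]         # parametric prior P(z, z')
X, Z = range(3), range(2)

# Q(z, z') = sum_{x, x'} P_true(x, x') Q(z|x) Q(z'|x')
q_joint = [
    [sum(p_true[x][xp] * enc[x][z] * enc[xp][zp] for x in X for xp in X)
     for zp in Z]
    for z in Z
]
q_z = [sum(q_joint[z][zp] for zp in Z) for z in Z]
q_zp = [sum(q_joint[z][zp] for z in Z) for zp in Z]


def expected_elbo(p):
    """E_{Q(z,z')}[log P(z,z') / (Q(z) Q(z'))], with the additive constant dropped."""
    return sum(
        q_joint[z][zp] * math.log(p[z][zp] / (q_z[z] * q_zp[zp]))
        for z in Z for zp in Z
    )


mutual_info = sum(
    q_joint[z][zp] * math.log(q_joint[z][zp] / (q_z[z] * q_zp[zp]))
    for z in Z for zp in Z
)
kl_to_prior = sum(
    q_joint[z][zp] * math.log(q_joint[z][zp] / prior[z][zp])
    for z in Z for zp in Z
)

print(expected_elbo(prior), mutual_info - kl_to_prior)
print(expected_elbo(q_joint), mutual_info)
```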

3.5 Recovering a maximum mutual-information objective

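We can recover a mutual-information objective by giving an implicit definition of the prior over latent variables,

$$P(z, z') = Q(z, z'), \tag{24}$$

in which case the KL-divergence between $Q(z, z')$ and $P(z, z')$ is zero, and we are left with just the mutual information (Eq. 2),

$$\mathbb{E}_{P_{\text{true}}(x, x')}\left[\mathcal{L}(x, x')\right] = \mathbb{E}_{Q(z, z')}\left[\log \frac{Q(z, z')}{Q(z)\, Q(z')}\right] + \text{const} = I(z; z') + \text{const}. \tag{25}$$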
3.6 Recovering the exact form for the InfoNCE estimator

Recent work has argued that the good representation arising from InfoNCE cannot be from maximizing mutual information alone, because the mutual information is invariant under arbitrary invertible transformations (Tschannen et al., 2019; Li et al., 2021). Instead, the good properties must arise somehow out of the simple mutual information estimator in Eq. (4). Remarkably, this exact estimator (at least in the infinite limit) can be recovered by making a specific choice of prior in the latent space. We choose the prior on $z$ implicitly, as $P(z) = Q(z)$, and we choose the distribution over $z'$ conditioned on $z$ to be given by an energy based model that depends on $Q(z')$ and an arbitrary coupling function, $f(z, z')$, which could be given by Eq. (4) as in the original InfoNCE, or could be more general,

$$P(z'|z) = \frac{1}{Z(z)}\, Q(z')\, f(z, z'). \tag{26}$$
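The normalizing constant, $Z(z)$, is

$$Z(z) = \int dz'\, Q(z')\, f(z, z') = \mathbb{E}_{Q(z')}\left[f(z, z')\right]. \tag{27}$$

To obtain the ELBO objective in Eq. (17), we need the ratio,

$$\frac{P(z, z')}{Q(z)\, Q(z')} = \frac{Q(z)\, Q(z')\, f(z, z')}{Z(z)\, Q(z)\, Q(z')} = \frac{f(z, z')}{\mathbb{E}_{Q(z'')}\left[f(z, z'')\right]}. \tag{28}$$

Remarkably, the expected log of this ratio is exactly the infinite limit of the InfoNCE objective in Eq. (5), so the full expected ELBO becomes,

$$\mathbb{E}_{P_{\text{true}}(x, x')}\left[\mathcal{L}(x, x')\right] = \mathbb{E}_{Q(z, z')}\left[\log f(z, z')\right] - \mathbb{E}_{Q(z)}\left[\log \mathbb{E}_{Q(z')}\left[f(z, z')\right]\right] + \text{const}, \tag{29}$$

as derived in Wang & Isola (2020) and used in Li et al. (2021).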
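A small discrete sketch can confirm that the energy-based prior reproduces the infinite-sample InfoNCE objective. The joint table over latents, the numeric values attached to the latent states, and the scalar weight `w` in the coupling function are assumptions made for illustration. The code builds $P(z, z') = Q(z)\, Q(z')\, f(z, z') / Z(z)$, and then compares the expected ELBO (without its additive constant) against the InfoNCE limit.

```python
import math

# Illustrative joint Q(z, z') over two latent states (an assumption for this sketch).
q_joint = [[0.40, 0.10], [0.15, 0.35]]
values = [-1.0, 1.0]   # hypothetical numeric value attached to each latent state
Z = range(2)
q_z = [sum(q_joint[z][zp] for zp in Z) for z in Z]     # Q(z)
q_zp = [sum(q_joint[z][zp] for z in Z) for zp in Z]    # Q(z')


def f(z, zp, w=0.8):
    """Coupling function; scalar analogue of exp(z^T W z')."""
    return math.exp(values[z] * w * values[zp])


def normalizer(z):
    """Z(z) = sum_{z'} Q(z') f(z, z')."""
    return sum(q_zp[zp] * f(z, zp) for zp in Z)


# Energy-based prior: P(z, z') = Q(z) * Q(z') f(z, z') / Z(z)
prior = [
    [q_z[z] * q_zp[zp] * f(z, zp) / normalizer(z) for zp in Z]
    for z in Z
]

# Expected ELBO without its constant: E_{Q(z,z')}[log P(z,z') / (Q(z) Q(z'))]
expected_elbo = sum(
    q_joint[z][zp] * math.log(prior[z][zp] / (q_z[z] * q_zp[zp]))
    for z in Z for zp in Z
)

# Infinite-sample InfoNCE objective for the same coupling function
infonce_limit = sum(
    q_joint[z][zp] * math.log(f(z, zp)) for z in Z for zp in Z
) - sum(q_z[z] * math.log(normalizer(z)) for z in Z)

print(expected_elbo, infonce_limit)
```

The two printed values agree up to floating-point error, and the constructed prior sums to one, so it is a valid joint distribution over $(z, z')$.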

4 Related work

Perhaps the closest prior work is Zimmermann et al. (2021), which also identifies an interpretation of InfoNCE as inference in a principled generative model. Unlike that work, we identify a connection between the InfoNCE objective and the ELBO or model evidence. Moreover, their proof requires complex geometric properties, whereas ours merely involves straightforward manipulations of probability distributions. In addition, their approach requires four restrictive assumptions. First, they assume a deterministic encoder. In contrast, all our theory applies to stochastic encoders. While we do explicitly consider deterministic encoders in Sec. 3.3, this is only to show that with deterministic encoders, the ELBO bound is tight; all the derivations outside of this very small section (which includes all our key derivations) use fully general encoders, $Q(z|x)$ and $Q(z'|x')$. Second, they assume that the encoder is invertible, i.e. that there exists a deterministic decoder, which is not necessary in our framework. Third, they assume that the latent space is a unit hypersphere, while in our framework the latent space can be any set. Fourth, they assume the ground-truth marginal of the latents of the generative process is uniform, whereas our framework accepts any choice of ground-truth marginal. As such, our framework has considerably more flexibility to include rich priors on complex, structured latent spaces.

Other work looked at the specific case of isolating content from style (von Kügelgen et al., 2021). This work used a similar derivation to that in Zimmermann et al. (2021) with slightly different assumptions. While they still require deterministic, invertible encoders, they relax some assumptions, such as uniformity in the latent space. But because they are working in the specific case of style and content variables, they make a number of additional assumptions on those variables. Importantly, they again do not connect the InfoNCE objective with the ELBO or model evidence.

Very different methods use noise-contrastive methods to update a VAE prior (Aneja et al., 2020). Importantly, they still use an explicit decoder.

There is a large class of work that seeks to use VAEs to extract useful, disentangled representations (e.g. Burgess et al., 2018; Chen et al., 2018; Kim & Mnih, 2018; Mathieu et al., 2019; Joy et al., 2020). Again, this work differs from our work in that it uses explicit decoders and thus does not identify an explicit link to self-supervised learning.

Likewise, there is work on using GANs to learn interpretable latent spaces (e.g. Chen et al., 2016a). Importantly, GANs learn a decoder (mapping from the random latent space to the data domain). Moreover, GANs use a classifier to estimate a density ratio. However, GANs estimate this density ratio for the data, $x$ and $x'$, whereas InfoNCE, like the methods described here, uses a classifier to estimate a density ratio on the latent space, $z$ and $z'$.

There is work on reinterpreting classifiers as energy-based probabilistic generative models (e.g. Grathwohl et al., 2019), which is related if we view SSL methods as being analogous to a classifier. Our work is very different, if for no other reason than because it is not possible to sample data from an SSVAE (even using a method like MCMC), because the decoder is written in terms of the unknown true data distribution. It would be interesting to consider the relationship to these models, but this is out of scope for the present work.

5 Conclusions

In conclusion, we have seen that the ELBO in an SSVAE is equal to the mutual information with one choice of prior, and equal to the InfoNCE parametric mutual information estimator with an alternative prior (up to constants). As such, we not only unify self-supervised and unsupervised learning methods, but also provide a principled framework for using simple parametric models in the latent space to enforce disentangled representations.