On Mutual Information in Contrastive Learning for Visual Representations

05/27/2020 ∙ by Mike Wu, et al. ∙ Stanford University

In recent years, several unsupervised, "contrastive" learning algorithms in vision have been shown to learn representations that perform remarkably well on transfer tasks. We show that this family of algorithms maximizes a lower bound on the mutual information between two or more "views" of an image; typical views come from a composition of image augmentations. Our bound generalizes the InfoNCE objective to support negative sampling from a restricted region of "difficult" contrasts. We find that the choices of (1) negative samples and (2) "views" are critical to the success of contrastive learning, and that the former is largely unexplored. The mutual information reformulation also simplifies and stabilizes previous learning objectives. In practice, our new objectives yield representations that outperform those learned with previous approaches for transfer to classification, bounding box detection, instance segmentation, and keypoint detection. The mutual information framework provides a unifying and rigorous comparison of approaches to contrastive learning and uncovers the choices that impact representation learning.




1 Introduction

While supervised learning algorithms have given rise to human-level performance in several visual tasks Russakovsky et al. (2015); Redmon et al. (2016); He et al. (2017), they require exhaustively labelled data, posing a barrier to widespread adoption. In recent years, we have seen the growth of several approaches to unsupervised learning from the vision community Wu et al. (2018); Zhuang et al. (2019); He et al. (2019); Chen et al. (2020a, b), where the aim is to uncover vector representations that are “semantically” meaningful as measured by performance on a variety of downstream visual tasks, e.g., classification or object detection. In the last two years, this class of algorithms has already achieved remarkable results, quickly closing the gap to supervised methods He et al. (2016); Simonyan and Zisserman (2014).

The core machinery behind these unsupervised algorithms is a basic concept: treat every example as its own label and perform classification as in the usual setting, the intuition being that a good representation should be able to discriminate between different examples. Later algorithms build on this basic concept either through (1) technical innovations to circumvent numerical instability Wu et al. (2018), (2) storage innovations to hold a large number of examples in memory He et al. (2019), (3) choices of data augmentation Zhuang et al. (2019); Tian et al. (2019), or (4) improvements in compute or hyperparameter choices Chen et al. (2020a).

However, as a reader it is surprisingly difficult to compare these algorithms beyond high-level intuition. As we get into the details of each algorithm, it is hard to rigorously justify the design. For instance, why does additional clustering of a small neighborhood around every example Zhuang et al. (2019) improve performance? Why does introducing CIELAB image augmentations Tian et al. (2019) yield better representations? More generally, why is per-example classification a good idea? From a practitioner's point of view, the choices made by each new algorithm may appear arbitrary. Furthermore, it is unclear which directions are more promising for the next generation of algorithms. Ideally, we wish to have a theoretical framework that provides a systematic understanding of the full class of algorithms and can suggest new directions of research.

In this paper, we describe such a framework based on mutual information between “views.” In doing so, we find several insights regarding the individual algorithms. Specifically, our contributions are:

  • We present an information-theoretic description that can characterize IR Wu et al. (2018), LA Zhuang et al. (2019), CMC Tian et al. (2019), and more. To do so, we derive a new lower bound on mutual information that supports sampling examples from a restricted variational distribution.

  • We identify two fundamental choices in this class of algorithms: (1) how to choose data augmentations (or “views”) and (2) how to choose negative samples. Together these two are the crux of why instance-classification yields useful representations.

  • By formulating them as bounds on mutual information, we simplify and stabilize existing contrastive learning objectives.

  • By varying how we choose negative examples (a previously unexplored direction), we find consistent improvements in multiple transfer tasks, outperforming IR, LA, and CMC.

2 Background

We provide a review of four representation learning algorithms as described by their respective authors. We will revisit several of these with the lens of mutual information in Sec. 3.2.

Instance Discrimination (IR) IR, introduced by Wu et al. (2018), was the first algorithm to treat each example as its own class. Let $x_i$ for $i$ in $\{1, \ldots, N\}$ enumerate the images in a training dataset. The function $g_\theta$ is a neural network mapping images to vectors of reals. The IR objective is given by

$$\mathcal{L}^{\text{IR}}(x_i) = \log \frac{\exp\left(g_\theta(x_i)^\top g_\theta(x_i) / \tau\right)}{\sum_{j=1}^{N} \exp\left(g_\theta(x_i)^\top g_\theta(x_j) / \tau\right)} \qquad (1)$$

Here, $\tau$ is a hyperparameter used to smooth the softmax, which otherwise may have issues with gradient saturation. The denominator in Eq. 1 requires computing a forward pass with $g_\theta$ for $N$ data points, which can be prohibitively expensive. Wu et al. suggest two approaches to ameliorate the cost. First, they use a memory bank $M$ to store the representations for every image. The $i$-th representation is updated using a linear combination of the stored representation and a new representation every epoch to prevent them from growing stale:

$$M[i] \leftarrow \alpha \, M[i] + (1 - \alpha) \, g_\theta(x_i)$$

where $\alpha \in [0, 1)$. The notation $M[i]$ retrieves the representation for the $i$-th element from the memory bank. Second, the authors approximate the denominator with a Monte Carlo estimate computed from a random subset of examples and scaled by a user-specified constant that adjusts the approximation.
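As a rough illustration of the two ingredients described above (a per-example softmax with temperature, and a memory bank refreshed by a linear combination), here is a minimal NumPy sketch. The function names, the default temperature, and the convention that `alpha` weights the stored entry are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def ir_objective(z_i, bank, i, tau=0.07):
    """Log-probability of picking entry i among all N stored representations."""
    logits = bank @ z_i / tau                      # similarity to every stored entry, shape (N,)
    m = logits.max()                               # subtract the max for numerical stability
    log_normalizer = m + np.log(np.exp(logits - m).sum())
    return logits[i] - log_normalizer

def update_bank(bank, i, z_i, alpha=0.5):
    """Return a copy of the bank whose i-th row mixes the stored and the new representation.

    Assumption: alpha weights the stored entry, so alpha = 0 keeps only the newest encoding.
    """
    updated = bank.copy()
    updated[i] = alpha * bank[i] + (1.0 - alpha) * z_i
    return updated
```

When every stored representation is identical, the objective equals the negative log of the dataset size, which is the value a uniform guess would obtain.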

Local Aggregation (LA) The optimum for IR spreads representations uniformly across the surface of a sphere since then it is equally easy to discriminate any example from the others. However, such uniformity may be undesirable: images of the same class should intuitively be closer in representation than images of other classes. With this as motivation, LA Zhuang et al. (2019) learns a representation that pulls “nearby” images closer while “pushing” other images away. Its objective is

$$\mathcal{L}^{\text{LA}}(x_i) = \log \frac{\sum_{j \in C_i \cap B_i} \exp\left(M[j]^\top g_\theta(x_i) / \tau\right)}{\sum_{k \in B_i} \exp\left(M[k]^\top g_\theta(x_i) / \tau\right)} \qquad (2)$$

where $g_\theta$ is the encoder, $\tau$ is the softmax temperature, $M$ is the memory bank, and $C_i$ is a set of indices containing $i$. Given the $i$-th image $x_i$, the background neighbor set $B_i$ contains the $K$ closest examples to $x_i$ in embedding space. Second, the close neighbor set $C_i$ contains elements that belong to the same cluster as $x_i$, where clusters are defined by k-nearest neighbors. Intuitively, the elements of $C_i$ should be “closer” to $x_i$ than the elements of $B_i$, which acts as a baseline level of “closeness.” Throughout training, the elements of $B_i$ and $C_i$ change. In practice, LA outperforms IR by 6% on the transfer task of ImageNet classification.

Contrastive Multiview Coding (CMC) CMC Tian et al. (2019) adapts IR to decompose an input image into the luminance (L) and AB-color channels. Then, CMC is the sum of two IR objectives where the memory banks for each modality are swapped, encouraging the representation of the luminance of an image to be “close” to the representation of the AB-color of that image, and vice versa:


where represents the memory bank storing representations for a single modality. In practice, CMC outperforms IR by almost 10% in ImageNet classification.

A Simple Framework for Contrastive Learning (SimCLR)

SimCLR Chen et al. (2020a) performs an expansive experimental study of how data augmentation, architectures, and computational resources affect the IR objective. In particular, SimCLR finds better performance without a memory bank and by adding a nonlinear transformation on the representation before the dot product. That is,

$$\mathcal{L}^{\text{SimCLR}}(x_i) = \log \frac{\exp\left(h(g_\theta(x_i))^\top h(g_\theta(x_i)) / \tau\right)}{\sum_{j} \exp\left(h(g_\theta(x_i))^\top h(g_\theta(x_j)) / \tau\right)} \qquad (4)$$

where $g_\theta$ is the encoder, $\tau$ is the softmax temperature, the sum runs over the current minibatch, and $h$ is usually an MLP. Note no memory banks are used in Eq. 4. Instead, the other elements in the same minibatch are used as negative samples. Combining these insights with significantly more compute, SimCLR achieves the state-of-the-art on transfer to ImageNet classification. In this work, we find similar takeaways as Chen et al. (2020a) and offer some theoretical support.

Figure 1: Effect of view set choice on representation quality for IR and CMC on ImageNet and CIFAR10. Panels: (a) IR (ImageNet), (b) IR (CIFAR10), (c) CMC (ImageNet), (d) CMC (CIFAR10). Empty or trivial view sets (e.g. flipping) consistently lead to poor performance.

2.1 An Unexpected Reliance on Data Augmentation

As is standard practice, IR, LA, and CMC preprocess images with a composition of random cropping, noise jitter, horizontal flipping, and grayscale conversion. In their original papers, these algorithms do not emphasize the importance of data augmentation, and understandably so, as there is no a priori good reason to do otherwise. Indeed, in the objectives from Eqs. 1-3, data augmentation does not explicitly appear; the encoder is instead assumed to act on the image directly, although in reality it acts on a transformed image. With Thm. 2.1, we make the conjecture that without data augmentation, contrastive learning would not enjoy the success it has found in practice. In fact, without the inductive bias of neural networks, instance discrimination without augmentation is trivial.

For intuition, consider the IR objective, which, for an L2-normalized encoder $g_\theta$, pushes the points $g_\theta(x_i)$ towards a uniform distribution on the surface of a sphere. We note that IR without data augmentation does not specify where exactly to place the $i$-th data point. So, one can place the $i$-th point anywhere on the sphere as long as points are equidistant from one another and still maximize Eq. 1. That is, IR is “permutation-invariant”. Next, notice that the performance of a representation on a transfer task depends on the task itself. For example, in vision, features describing objects in an image are more useful to classification, detection, etc. There certainly exists a permutation such that examples of the same class are placed next to each other on the sphere. But if the optimal encoder (with respect to IR) is invariant under any permutation of the encoded points $g_\theta(x_i)$, then IR cannot possibly know which permutation to select, given that it does not know the transfer task. We show that in this sense, the data augmentations we choose inject our own prior knowledge into training, helping to break this invariance to permutations in a desirable manner. In the following theorem, we express this intuition more formally, assuming an idealized setting where there is no inductive bias introduced by the optimizer nor the neural network parameterizing the encoder:

Theorem 2.1.

Given a dataset of

realizations of a random variable

, denoted , we define a set of data augmentation functions where and is a set of indices, including an identity index 0 such that for all . In other words, . Fix any parameters which minimize the Instance Discrimination objective with respect to the data, , and fix any permutation of .

Then optimal solutions are invariant under the permutation if and only if augmentations applied to distinct tasks can’t collide, and in particular if no augmentations are used at all. That is, for the optimal , an alternative encoder , which maps to . is also optimal if and only if for all . In particular, this holds if .

Thm. 2.1 implies that in the idealized setting, data augmentation is the crux of learning a good representation. Moreover, not all augmentations are equally good: only those that collide for different examples contribute to reducing the permutation invariance. In practice, the augmentations we choose to use bias the objective towards a subset of the optimal minima, in particular the ones good for visual transfer tasks. In the non-idealized setting, optimizing IR with no data augmentations may already produce non-trivial representations since neural network architecture and implicit regularization from SGD bias towards certain optima. But, we would still expect to see much worse representations than if we had used data augmentation to begin with.

Proving Thm. 2.1 requires machinery that we develop in the remainder of the paper. Its proof and most others are in the supplement. But to begin, we can find experimental evidence to support our conjecture. We train IR and CMC on both the ImageNet and CIFAR datasets with different subsets of augmentations. For instance, we may optimize IR but only use horizontal flips as augmentation. Or, we may use no augmentations at all. For each image in the test set, we can measure the quality of the representation learned by predicting the label of its closest image in the training set by L2 distance in the representation space, as done in Zhuang et al. (2019). A better representation would properly place unseen images of a class nearby to ones seen in training, thereby classifying the unseen image correctly.
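The nearest neighbor evaluation described above can be sketched in a few lines of NumPy. This is an illustrative sketch that assumes the representations are already computed and stored as arrays; the function name and argument names are hypothetical.

```python
import numpy as np

def nearest_neighbor_accuracy(train_z, train_y, test_z, test_y):
    """Label each test point with the label of its closest training point (L2 distance)."""
    # Squared distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, shape (n_test, n_train).
    d2 = (
        (test_z ** 2).sum(axis=1)[:, None]
        - 2.0 * test_z @ train_z.T
        + (train_z ** 2).sum(axis=1)[None, :]
    )
    nearest = d2.argmin(axis=1)                    # index of the closest training example
    return float((train_y[nearest] == test_y).mean())
```

Using squared distances avoids a square root and does not change which neighbor is closest.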

Fig. 1 clearly shows that without augmentations (the blue line), the representations learned by IR and CMC are significantly worse (though not trivial) in terms of classification accuracy than with all augmentations (the brown line). Further, we confirm that not all augmentations lead to good representations. In particular, view sets defined by horizontal flipping and/or grayscale are disjoint for any two images. As we would expect from Thm. 2.1, the representations learned do not perform differently than using no augmentations at all. On the other hand, cropping, which certainly introduces collisions, accounts for most of the benefit amongst the basic image augmentations.

Having discovered the importance of image transformations, we restate IR to explicitly include data augmentation. Define a view function, that maps the realization of a random variable and an index to another realization. In standard contrastive learning, each index in is associated with a composition of augmentations (e.g. cropping with scale 0.2 plus jitter with level 0.4 plus rotation with degree 45, etc.), and a uniform distribution over . The IR objective becomes


where the representation of the -th image is updated at the -th epoch by the equation

We will show that Eq. 5 is equivalent to a bound on mutual information between views, explaining the importance of data augmentations. To do so, we first review a commonly used lower bound on MI.

3 Equivalence of Instance Discrimination and Mutual Information

Similar connections between mutual information and representation learning have been suggested for a family of masked language models Kong et al. (2019). However, the connection has not been deeply explored and a closer look in the visual domain uncovers several insights surrounding contrastive learning.

3.1 A Lower Bound on Mutual Information

Mutual information (MI) measures the statistical dependence between two random variables, $X$ and $Y$. Formally, we write $I(X; Y) = \mathbb{E}_{p(x, y)}\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right]$ where $p(x, y)$ is the joint distribution over $X$ and $Y$. In general, mutual information is intractable to compute. This is especially true in machine learning settings as $X$ and $Y$ are frequently high dimensional. In lieu of infeasible integrals, there have been several approaches to lower bound mutual information Poole et al. (2019), of which a popular one is noise-contrastive estimation Gutmann and Hyvärinen (2010), also called InfoNCE Oord et al. (2018):

$$I(X; Y) \geq \mathbb{E}_{p(x, y)}\,\mathbb{E}_{S}\left[ f(x, y) - \log \frac{1}{K + 1} \sum_{y' \in \{y\} \cup S} e^{f(x, y')} \right] \qquad (6)$$

where $f(x, y)$ is a witness function representing the “compatibility” of the two vectors produced by encoding $x$ and $y$. We use $g_\theta$ and $g_\phi$ to designate encoders that map realizations of $X$ and $Y$ to vectors. We call the output of each encoder a representation. The value of the witness function alone is an unnormalized quantity. Given $x$, the second term in Eq. 6 serves to normalize with respect to other plausible realizations of $Y$. We use $S = \{y_1, \ldots, y_K\}$ to denote a set of $K$ negative samples used to estimate the marginal probability. A common choice for the distribution over sets is $\prod_{j=1}^{K} p(y_j)$. That is, $K$ i.i.d. samples drawn from the marginal distribution $p(y)$.
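To make the estimator concrete, the following sketch computes an InfoNCE-style estimate from a batch of paired encodings. It assumes a dot-product witness function and uses the other rows of the batch as negatives; both are choices made for this illustration, and the function names are hypothetical.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(a - m).sum(axis=axis))

def infonce_estimate(zx, zy):
    """zx[i] and zy[i] encode a jointly drawn pair; the remaining rows of zy act as negatives."""
    scores = zx @ zy.T                             # witness values for every (x_i, y_j) pair
    k = scores.shape[0]
    positive = np.diag(scores)                     # witness value of each true pair
    normalizer = logsumexp(scores, axis=1) - np.log(k)   # log of the mean exponentiated score
    return float((positive - normalizer).mean())
```

With `k` pairs in the batch, the estimate can never exceed `log(k)`, so larger batches are needed to certify larger amounts of mutual information.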

3.2 Equivalence of IR, CMC, and SimCLR to InfoNCE

Consider lower bounding the mutual information between two weighted view sets with InfoNCE. Given the -th image and the memory bank with update rate , we estimate the -th entry as


The left expression enumerates over all elements of the index set while the right expression sums over , the -th sampled index from , over epochs of training. Because the contribution of any view to the memory bank entry exponentially decays, the first sum can be tightly approximated by the second where is a function of . If , then .

To generalize the above sum to views of index sets , define

for a set of weights . We can associate an index set with each example as the indices sampled from over epochs of training. Fixing weights and , any index , and an index set for each (with ), we use the above to define the witness function by . We thus express the InfoNCE bound


If we squint our eyes, Eq. 8 and Eq. 5 share similar structure. The next lemma formalizes this intuition to show an equivalence of InfoNCE with IR, CMC and SimCLR.

Lemma 3.1.

Let be the size of the dataset, a marginal distribution over images , and a memory bank. Then the following are equivalent: , , , .

The memory bank makes Eq. 8 quite involved, and the comparatively simple MI formulation above leads us to question its value: are weighted views more useful than the individual views? We consider the special case in which the memory bank update rate is zero. Here, the $i$-th entry stores only the encoding of the single view chosen in the last epoch. As such, we can simplify and remove the memory bank altogether:


We will show in our experiments that this simplified form of IR performs as well as the original.
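A minimal sketch of this bank-free variant follows, under stated assumptions: `encode` and `view` are hypothetical callables supplied by the caller, and negatives are taken from the same minibatch rather than the whole dataset.

```python
import numpy as np

def simplified_ir(encode, view, images, rng, tau=0.07):
    """Contrast two independently sampled views of each image, with no memory bank."""
    z_a = encode(view(images, rng))                # first random view of every image
    z_b = encode(view(images, rng))                # second, independently drawn view
    logits = z_a @ z_b.T / tau                     # row i scores image i against all images
    m = logits.max(axis=1, keepdims=True)
    log_norm = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float((np.diag(logits) - log_norm).mean())
```

For example, `view` could add random noise or crop, and `encode` could be any function returning one vector per image; the objective is at most zero and approaches zero when each image's two views are far more similar to each other than to any other image.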

3.3 Mutual Information and Data Augmentation

In Thm. 2.1, we discussed the notion that not all the optima of IR are good for transfer learning. Having drawn connections between IR and MI, we can revisit this statement from an information-theoretic angle. Lemma 3.1 shows that contrastive learning is the same as maximizing MI for the special case where the two random variables are the same. As the MI of a random variable with itself is trivial, optimizing it should not be fruitful. But we can understand “views” as a device for an information bottleneck that forces the encoders to estimate MI under missing information: the stronger the bottleneck, the more robust the representation as it learns to be invariant. Critically, if the data augmentations are lossy, then the mutual information between two views of an image is not obvious, making the objective nontrivial. At the same time, if we were to train IR with lossless data augmentations, we should not expect any learning beyond inductive biases introduced by the neural network encoder.

Figure 2: Nearest neighbor classification accuracy versus MI on CIFAR10 under different view sets. Panels: (a) Noise Level, (b) Crop Intervals, (c) Crop Min., (d) Crop Max., (e) Combined. View sets that hide too little (e.g. grayscale, flip) or too much information (e.g. crops that preserve too little) result in poor transfer performance. Subfigure (e) combines (a) through (d) for scale.

Fig. 2 depicts a more careful experiment studying the relationship between views and MI. Each point in Fig. 2 represents an IR model trained on CIFAR10 with different views: no augmentations (black point), grayscale images (gray point), flipped images (brown point), L-ab filters (red point), color jitter (green line), where a larger noise value means adding higher levels of noise, and cropping (blue and purple lines), where the bounds represent the minimum and maximum crop sizes with 0 being no image at all and 1 retaining the full image. By Lemma 3.1, we estimate MI (x-axis of Fig. 2) using the IR objective plus a constant. We see a parabolic relationship between MI and transfer performance: views that preserve more information lead to both a higher MI (trivially) and poorer representations. Similarly, views that hide too much information lead to very low MI and, again, poor representations. The sweet spot in between is where we find good classification performance.

Finally, we note that Lemma 3.1 provides a more critical comparison between IR and CMC: as the two are functionally identical, the only differences are in how each defines their views. We make three observations: First, the view set for CMC is partitioned into two disjoint sets with a one-to-one correspondence between elements of each set (since an image is decomposed into an L and AB filter) — further, as L and AB capture almost disjoint information, CMC imposes a strong information bottleneck between any two views. In fact, Fig. 2e shows the L-ab view set to be at the apex of the curve between MI and accuracy. Second, the notion of “view” as an L or AB filter of an image versus “view” as cropping or adding jitter are one and the same. Whereas the original paper Tian et al. (2019) focuses on the former exclusively, we find this more general interpretation to be useful — as Fig. 1c and d show, performance still dramatically varies without cropping. Third, Fig. 1c and d also show that without any augmentations, CMC still maintains a baseline performance much higher than IR does. The MI framework makes it easy to understand why this might be: even with no augmentations, the view set for CMC is nontrivial as it at least contains L and AB filters.

3.4 Simplifying Contrastive Learning

Showing an equivalence to mutual information can help us (1) pick hyperparameters and (2) simplify the contrastive learning pipeline altogether. We give two examples below.

Figure 3: NN classification accuracy comparing the IR objective with InfoNCE in stability (a,b), and different memory bank update rates for IR and LA (e-h), on ImageNet and CIFAR10. Panels: (a) Stability, (b) Stability, (c) +IR, (d) +LA, (e) +IR, (f) +CMC.

The memory bank is not critical to representation learning. The mutual information framework in Sec. 3.2 suggested a simpler version of IR with an update rate of zero, in which the memory bank can be replaced by sampling another random view. We can compare these formulations experimentally by measuring transfer accuracy on classification. Fig. 3e-g shows results varying the update rate from 0 (no bank) to near 1 (very slow updates). We find that performance with an update rate of 0 and with the standard nonzero update rate is equal across algorithms and datasets. This suggests that we do not require a memory bank but can merely choose two random views every iteration. SimCLR Chen et al. (2020a) has suggested a similar finding. Our contribution is to show that the MI framework made such a simplification obvious.

Softmax “hacks” are unnecessary. Because the original IR paper posed a million-category classification problem, its implementation required innovations to be tractable. For instance, IR approximates the denominator in Eq. 1 with a Monte Carlo estimate scaled by a user-specified constant. Such practices were propagated to LA and CMC in the official implementations online. While this setup works well for ImageNet, it is not clear how to set the constants for other datasets. For small datasets like CIFAR10, such large constants introduce numerical instability in themselves. However, once we draw the relationship between IR and MI, we immediately see that there is no need to actually compute the softmax (and no need for these constants); rather, InfoNCE only requires logsumexp, a much more stable operation. Fig. 3c and d show the effect of switching from the original IR code to InfoNCE. While we expected to find the largest impact in CIFAR10, we also find that even in ImageNet (for which those constants were chosen), InfoNCE improves performance.
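The stability argument can be illustrated with a small comparison between explicitly forming a softmax and working with logsumexp. This is a generic numerical sketch, not the original IR code; the function names and the extreme logit values are chosen only for illustration.

```python
import numpy as np

def naive_log_score(logits, i):
    """Form the softmax explicitly, then take the log (overflows for large logits)."""
    probs = np.exp(logits) / np.exp(logits).sum()
    return np.log(probs[i])

def stable_log_score(logits, i):
    """Same quantity via logsumexp: shift by the maximum before exponentiating."""
    m = logits.max()
    return logits[i] - (m + np.log(np.exp(logits - m).sum()))

logits = np.array([1000.0, 999.0, 998.0])
with np.errstate(all="ignore"):                    # silence the overflow warnings of the naive path
    naive_value = naive_log_score(logits, 0)       # exp(1000) overflows, so this is not finite
stable_value = stable_log_score(logits, 0)         # finite, since only differences are exponentiated
```

Both functions agree on moderate inputs; they differ only when the exponentials leave the floating point range.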

Figure 4: Nearest neighbor (NN) classification accuracy throughout training. Panels: (a) ImageNet (IR), (b) CIFAR10 (IR), (c) ImageNet (CMC), (d) CIFAR10 (CMC). Panels (a,b) show the IR family of models on ImageNet and CIFAR10; (c,d) show similar results for the CMC family.

4 Equivalence of Local Aggregation and Mutual Information

Having shown an equivalence between IR, SimCLR, CMC and InfoNCE, we might wish to do the same for LA. However, LA has two distinguishing features — close and background neighbor sets — that are not obviously related to MI. In the next few paragraphs, we show how to describe LA with MI, uncovering several insights and new algorithms along the way. Namely, we introduce a generalization of InfoNCE that supports sampling from a restricted variational distribution.

4.1 A Variational Lower Bound on Mutual Information

Recall that InfoNCE draws negative samples independently from the marginal distribution, but this choice may not be desirable. We may wish to choose negative samples from a different distribution over sets, or we may even wish to sample negatives conditionally on the current example. While previous literature presents InfoNCE with an arbitrary variational distribution Oord et al. (2018); Kong et al. (2019) that would justify either of these choices, we have not found a proof supporting this. One of the contributions of this paper is to formally define a class of variational distributions such that Eq. 6 remains a valid lower bound if we replace the marginal distribution with a member of this class. We begin with the following theorem:

Theorem 4.1.

Fix a distribution over . Fix any in and any . Define and suppose that is -integrable with mean . Picking two thresholds in , let Suppose that , and define for any Borel . Then .


It suffices to show the inequalities . The first holds since . The second holds since , and taking the expectation of both sides of this inequality gives the desired result upon observing that and that . ∎

Next, we apply Thm. 4.1 to InfoNCE to construct the variational InfoNCE, or VINCE, estimator and prove that for large enough , it is a lower bound on InfoNCE and thus, mutual information.

Corollary 4.1.1.

Let be independent from , the latter i.i.d. according to a distribution . For define with . For any Borel , define a variational distribution over by

Define . Then .


Let be any monotonic increasing function. Note that each for in satisfies the conditions on in Thm 4.1. Since , by Thm 4.1, . If , then . As , the bound is tight. ∎

A Toy Example

Thm. 4.1 and Coro. 4.1.1 imply an ordering to the MI bounds considered. We next explore the tightness of these bounds in a toy example with known MI. Consider two random variables and distributed such that we can pick independent samples and where and Then, let . That is, introduce a new random variable as the first dimension of the sum and as the second. The mutual information between and can be analytically computed as as is jointly Gaussian with covariance .
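For jointly Gaussian variables, mutual information has a closed form in terms of covariance determinants, $I(X;Y) = \tfrac{1}{2}\log\frac{\det\Sigma_X \det\Sigma_Y}{\det\Sigma}$, which is what makes a toy example like this one analytically checkable. The sketch below evaluates that general formula; the covariance used in the usage note is a hypothetical stand-in, not the one from the toy example.

```python
import numpy as np

def gaussian_mutual_information(cov, dim_x):
    """MI (in nats) between the first dim_x coordinates and the rest of a joint Gaussian."""
    cov = np.asarray(cov, dtype=float)
    _, logdet_x = np.linalg.slogdet(cov[:dim_x, :dim_x])   # marginal covariance of X
    _, logdet_y = np.linalg.slogdet(cov[dim_x:, dim_x:])   # marginal covariance of Y
    _, logdet_joint = np.linalg.slogdet(cov)               # joint covariance
    return 0.5 * (logdet_x + logdet_y - logdet_joint)
```

For two unit-variance scalars with correlation 0.5, `gaussian_mutual_information([[1.0, 0.5], [0.5, 1.0]], 1)` reduces to the familiar expression `-0.5 * log(1 - 0.5**2)`.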

Method Estimate of MI
VINCE (90%) e
VINCE (75%) e
VINCE (50%) ee
VINCE (25%) ee
VINCE (10%) ee
VINCE (5%) ee
Table 1: Looseness of VINCE

We fit InfoNCE and VINCE with varying values for the threshold and measure the estimated mutual information. Table 1 shows the results. For rows with VINCE, the percentage shown in the left column represents the percentage of training examples (sorted by L2 distance in representation space to the current example) that are “valid” negative samples. For instance, 50% would indicate that negative samples can only be drawn from the 50% of examples whose representation is closest to the current example in L2 distance. Therefore, a smaller percentage corresponds to a higher threshold. From Table 1, we find that a higher threshold results in a looser estimate (as expected from Coro. 4.1.1). It might strike the reader as peculiar to use VINCE for representation learning: it is a looser bound than InfoNCE, and in general we seek tightness with bounds. However, we will argue that learning a good representation and tightening the bound are two subtly related but fundamentally distinct problems.
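The percentage-based restriction on negatives can be sketched as follows. This is an illustrative reading of the description above, with a hypothetical function name; it ranks all other examples by L2 distance to the anchor and samples negatives only from the closest fraction.

```python
import numpy as np

def restricted_negative_indices(z, i, keep_fraction, num_negatives, rng):
    """Sample negatives for anchor i from the keep_fraction of examples closest to it."""
    d = np.linalg.norm(z - z[i], axis=1)           # L2 distance of every example to the anchor
    d[i] = np.inf                                  # never select the anchor as its own negative
    n_candidates = max(1, int(np.ceil(keep_fraction * (len(z) - 1))))
    candidates = np.argsort(d)[:n_candidates]      # indices of the closest examples
    return rng.choice(candidates, size=num_negatives, replace=True)
```

Setting `keep_fraction=1.0` recovers sampling from all other examples, while smaller fractions yield harder negatives.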

5 Equivalence of LA and VINCE

Focusing first on the background neighbor set in LA, we can connect it to MI through VINCE. Consider sampling negatives from the variational distribution defined in Thm. 4.1, conditioned on the $i$-th image. (For an infinite threshold we drop the parameter from the notation.) Assuming a larger threshold, we are forced to sample negatives that more closely resemble the representation of the $i$-th image. With Fig. 2, we have already seen that a tighter bound (i.e. lossless views) does not mean better representations. Similarly, by using VINCE, we trade a looser bound for a more challenging problem: encoders must now distinguish between more similar objects, forcing the representation to be more semantically meaningful. Replacing the marginal distribution with this variational distribution immediately suggests new contrastive learning algorithms that “interpolate” between IR and LA.

Lemma 5.1.

Let and refer to the background and close neighbor sets, respectively. We define the Ball Discrimination (BALL) objective as . Similarly, define the Annulus Discrimination (ANN) objective as . Then, BALL and ANN both lower bound MI between weighted view sets. That is, and are both equivalent to . In the former, we draw elements of from a variational distribution . In the latter, we sample elements of sampled from different with finite .

The primary difference between BALL and IR is that negative samples are drawn from a restricted domain of the marginal distribution “centered” at the current data point. Thus, we cannot equate the BALL estimator to InfoNCE; we must rely on VINCE, which provides the machinery to describe it using a conditional distribution with smaller support. While the BALL and ANN objectives are both equivalent to VINCE, only the latter has a finite threshold used to define the close neighbor set. Further, while ANN and LA both use a close neighbor set, they differ in that LA pulls the representations of elements in the close neighbor set closer whereas ANN does not. Yet, BALL and LA are quite similar as they sample negatives from the same distribution.
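One plausible reading of the two objectives, sketched under explicit assumptions: BALL contrasts the current example against its background neighbors, and ANN against the background neighbors that are not close neighbors (an annulus). The neighbor sets here are simple nearest-neighbor sets and all function names are hypothetical; the formal definitions are those of Lemma 5.1.

```python
import numpy as np

def neighbor_sets(z, i, background_size, close_size):
    """Nearest-neighbor index sets around example i (i itself comes first for distinct points)."""
    order = np.argsort(np.linalg.norm(z - z[i], axis=1))
    return order[:background_size], order[:close_size]

def contrastive_score(z, i, negatives, tau=0.07):
    """Log-probability of the positive among the positive plus the given negatives."""
    idx = np.concatenate(([i], negatives))
    logits = z[idx] @ z[i] / tau
    m = logits.max()
    return logits[0] - (m + np.log(np.exp(logits - m).sum()))

def ball_objective(z, i, background, tau=0.07):
    """Negatives: every background neighbor except the example itself."""
    return contrastive_score(z, i, background[background != i], tau)

def ann_objective(z, i, background, close, tau=0.07):
    """Negatives: background neighbors that are not close neighbors."""
    return contrastive_score(z, i, np.setdiff1d(background, close), tau)
```

Because the annulus removes the nearest candidates from the negative set, the ANN score for a fixed representation is never lower than the BALL score in this sketch.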

Figure 5: Comparison of five different contrastive learning algorithms. Panels: (a) IR, (b) BALL, (c) ANN, (d) LA, (e) BALL+, (f) ANN+. Black points represent the current image; gray points represent other views; red points represent negative samples; blue points represent elements of an extended view set for LA. The dark gray area represents a valid negative sampling distribution. BALL+ and ANN+ shrink the dark gray area throughout training.

Next, consider a variation of BALL where the background neighbor set depends on the training iteration. We initialize the set to the entire marginal distribution and anneal it throughout training to a smaller subset; this is equivalent to initializing the threshold so that no example is excluded and increasing it every gradient step. Doing so is well-defined as the variational distribution implied by the background neighbor set is static at any given iteration. We call this BALL+. Similarly, we define ANN+ where the close neighbor set is annealed as well.
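An annealing schedule of this kind might look like the sketch below. The linear shape, the function name, and the final fraction are all assumptions for illustration; the text only specifies that the set starts as the full dataset and shrinks during training.

```python
def background_fraction(step, total_steps, final_fraction=0.1):
    """Fraction of the dataset eligible as negatives, shrinking linearly from 1 to final_fraction."""
    progress = min(max(step / total_steps, 0.0), 1.0)   # clamp training progress to [0, 1]
    return 1.0 + progress * (final_fraction - 1.0)
```

At each gradient step the returned fraction would determine how many of the closest examples remain in the background neighbor set.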

It remains to show that LA lower bounds mutual information. First, consider a simplified version of LA where we assume that the close neighbor set contains only the current image; that is, the close neighbor set of the $i$-th image is $\{i\}$. We refer to this as simplified LA. It is straightforward to show that simplified LA is equivalent to BALL.

Lemma 5.2.

Fix . Then .

Now, we pose LA as a generalization of the simplified version of LA, in which elements of the close neighbor set can be thought of as views of the current image. To start, define the view set of an image as the set of all its views. Consider an enlarged view set, recalling that the image is a member of its own close neighbor set. The enlarged view set contains all views of the current image and all views of the images in its close neighbor set. Then, optimizing LA is akin to optimizing simplified LA where views for the current image are sampled from the enlarged view set. The proof of the next lemma follows a similar intuition to bound LA by simplified LA.

Lemma 5.3.

For any containing , we have .

We can now say something concrete about IR and LA. Namely, both are lower bounds on MI between weighted view sets. However, LA makes two different choices. First, LA expands the view set for each image to include neighboring images in representation space. For a good representation (which is why LA needs to be initialized from IR), images in the close neighborhood make for semantically meaningful views. A larger view set then contributes to a more abstract representation. Second, LA chooses more difficult negative samples, again requiring a stronger representation to properly differentiate examples. Fig. 5 shows a summary of the many algorithms presented in this section.

6 Experiments

The theory suggests that representation quality, in increasing order, should be IR, BALL, ANN, then LA. To verify this, we fit each of these algorithms on ImageNet and CIFAR10. Fig. 4 shows the nearest neighbor classification accuracy on a test set throughout training. Table 2 shows transfer classification performance: the accuracy of logistic regression trained on the frozen representations learned by each of the unsupervised algorithms. We follow the training paradigms in prior works Wu et al. (2018); Zhuang et al. (2019); Tian et al. (2019); Chen et al. (2020a) and standardize hyperparameters across models (see Sec. B.3).
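The nearest neighbor evaluation can be sketched as follows. This is a minimal illustration under assumptions: a 1-nearest-neighbor rule with dot-product similarity and hypothetical function names. The protocol actually used follows the prior works cited above.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def nn_predict(test_emb, train_embs, train_labels):
    """Label of the training embedding most similar to test_emb."""
    best = max(range(len(train_embs)),
               key=lambda j: dot(test_emb, train_embs[j]))
    return train_labels[best]

def nn_accuracy(test_embs, test_labels, train_embs, train_labels):
    """Fraction of test embeddings whose nearest neighbor has the right label."""
    correct = sum(
        nn_predict(e, train_embs, train_labels) == y
        for e, y in zip(test_embs, test_labels)
    )
    return correct / len(test_labels)

# Toy usage: frozen embeddings of two training and three test images.
train_embs = [[1.0, 0.0], [0.0, 1.0]]
train_labels = ["cat", "dog"]
test_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
test_labels = ["cat", "dog", "dog"]
acc = nn_accuracy(test_embs, test_labels, train_embs, train_labels)
```

Because the encoder is never updated during evaluation, this accuracy reflects only the quality of the learned representation.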

[Table 2 panels: (a, b) CIFAR10: Pred. Acc (Model, Top1, Top5); (c, d) ImageNet: Pred. Acc (Model, Top1, Top5); (e) COCO: Mask R-CNN, R-C4, 1x schedule; (f) COCO: Mask R-CNN, R-FPN, 1x schedule; (g) COCO: Keypt R-CNN, R-FPN; (h) VOC: Faster R-CNN, R-C4; (i) LVIS: Mask R-CNN, R-FPN.]
Table 2: Evaluation of the representations using six visual transfer tasks: object classification on ImageNet and CIFAR10 (a-d); object detection and instance segmentation on COCO (e,f); keypoint detection on COCO (g); object detection on Pascal VOC 2007 (h); and instance segmentation on LVIS (i). In all cases, the backbone network is a frozen pretrained ResNet-18 (R).

Takeaway #1 We confirm that the ordering of performance for IR, BALL, ANN, and LA is as expected for ImageNet and CIFAR10. In particular, the performance curves in Fig. 4 show that BALL and ANN account for half the performance gains of LA over IR, agreeing with our analysis that both of the choices of negative sampling and view set are important for building strong representations.

Takeaway #2 To show that the mutual information framework generalizes to other contrastive learning models, we compare CMC to ball and annulus extensions of CMC, denoted by CMC-BALL and CMC-ANN. Fig. 4(c,d) and Table 2 show patterns similar to Takeaway #1 for CMC-based models. In particular, we can surpass the performance of CMC (and IR) by choosing harder negative samples with VINCE on both ImageNet and CIFAR10.

Takeaway #3 BALL+ and ANN+ outperform all other algorithms. For example, on CIFAR10, ANN+ surpasses LA by 3% while CMC-ANN+ surpasses CMC by over 2%. We see similar gains on ImageNet, where CMC-ANN+ achieves an accuracy of 50.5%, 2% above both CMC and LA. These positive results exemplify the power of carefully choosing negative samples.

The representations learned through contrastive learning are posited to be general and useful for many transfer tasks. Inspired by He et al. (2019), we test our algorithms on several downstream visual tasks other than classification: object detection, instance segmentation, and keypoint detection. Like He et al. (2019), we use the Detectron2 Wu et al. (2019) pipeline. However, we use a frozen ResNet18 backbone to focus on representation quality (that is, parameters are not finetuned). As ResNet50 and larger are the commonly used backbones, our results will not be state-of-the-art; however, the ordering in performance between algorithms is meaningful and is our primary interest. Table 2 shows commonly reported metrics for the COCO’17, Pascal VOC ’07 Everingham et al. (2010), and LVIS Gupta et al. (2019) datasets. The results uncover the same pattern: ANN+ consistently performs better than other methods.

7 Conclusion

We have presented an interpretation of representation learning based on mutual information between image views. This formulation led to a more systematic understanding of a family of existing approaches. It further enabled simplifications of these approaches and new, better-performing techniques. In particular, we uncovered that the choices of views and negative sample distribution strongly influence the performance of contrastive learning. By choosing more difficult negative samples, we surpassed high-performing algorithms like LA and CMC across several popular visual tasks.

This framework suggests several new directions. Future research could investigate automatically generating views or learning the parameters of the variational distributions. Visual algorithms like IR and LA no longer look very different from masked language modeling, as both families are unified under mutual information. Future work could pursue generalizations to multimodal domains.

Broader Impact

Contrastive algorithms are becoming a popular method for unsupervised learning, as their reach spreads into multiple domains beyond vision, e.g. audio. A good theoretical framework from which to understand and compare variations in contrastive learning is useful to validate experimental results and guide future directions. As the field continues to move, we hope the framework presented in this paper serves as a useful perspective. There are several outstanding questions that this paper does not address: by exposing the importance of negative sampling and views, there are questions surrounding how to automatically make these choices. Being able to do so would help generalize contrastive learning to other domains like language and audio, where the term “augmentation” is not as well-defined. We hope this work serves as a motivator for new research in this direction.


References

  • [1] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: Appendix D, §1, §1, §2, §3.4, §6.
  • [2] X. Chen, H. Fan, R. Girshick, and K. He (2020) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: §1.
  • [3] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §6.
  • [4] A. Gupta, P. Dollar, and R. Girshick (2019) LVIS: a dataset for large vocabulary instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5356–5364. Cited by: §6.
  • [5] M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304. Cited by: §3.1.
  • [6] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §1, §1, §6.
  • [7] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §B.3, §1.
  • [9] J. Johnson, M. Douze, and H. Jégou (2019) Billion-scale similarity search with gpus. IEEE Transactions on Big Data. Cited by: §B.3.
  • [10] L. Kong, C. d. M. d’Autume, W. Ling, L. Yu, Z. Dai, and D. Yogatama (2019) A mutual information maximization perspective of language representation learning. arXiv preprint arXiv:1910.08350. Cited by: §3, §4.1.
  • [11] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §3.1, §4.1.
  • [12] B. Poole, S. Ozair, A. v. d. Oord, A. A. Alemi, and G. Tucker (2019) On variational bounds of mutual information. arXiv preprint arXiv:1905.06922. Cited by: §A.6, §3.1.
  • [13] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §1.
  • [14] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §1.
  • [15] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
  • [16] Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §B.3, 1st item, §1, §1, §2, §3.3, §6.
  • [17] Y. Wu, A. Kirillov, F. Massa, W. Lo, and R. Girshick (2019) Detectron2. Note: https://github.com/facebookresearch/detectron2 Cited by: §6.
  • [18] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: 1st item, §1, §1, §2, §6.
  • [19] C. Zhuang, A. L. Zhai, and D. Yamins (2019) Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6002–6012. Cited by: §B.3, 1st item, §1, §1, §1, §2.1, §2, §6.

Appendix A Proofs

A.1 Proof of Lemma 3.1


We first show an equivalence between IR and InfoNCE. As CMC and SimCLR are closely derived from IR, these equivalences follow in a straightforward manner. Define the index sets for each as in Sec. 3.2. Then for arbitrary

The second equality holds by construction, given definitions of InfoNCE for weighted sets of views.

Given this, the equivalence of InfoNCE to CMC follows from two facts: (1) CMC is a sum of IR terms, and (2) the bound holds independently of the choice of view set. Similarly, the equivalence of InfoNCE to SimCLR follows from the fact that the bound holds regardless of how is parameterized. For instance, let where . That is, the encoder is a composition of functions. Then SimCLR (with computational changes such as large batch sizes abstracted away) is equivalent to IR and thus to InfoNCE. We recognize that SimCLR, unlike IR, does not use a memory bank. But as noted in the main text, there is a simpler formulation of IR equivalent to InfoNCE where we replace the memory bank with the drawing of a second random view of . As SimCLR draws negative samples from the same minibatch as , which are chosen i.i.d., the equivalence holds. ∎
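The memory-bank-free formulation, in which a second random view is drawn and the other minibatch elements serve as negatives, can be sketched as an InfoNCE-style loss. This is a hedged sketch and not the authors' code: the function name `info_nce`, the default temperature, and the use of plain dot products on embeddings assumed to be already normalized are all illustrative choices.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(views_a, views_b, temperature=0.07):
    """Mean InfoNCE-style loss over a minibatch.

    views_a[i] and views_b[i] are embeddings of two views of image i;
    views_b[j] with j != i act as negatives for views_a[i].
    """
    n = len(views_a)
    total = 0.0
    for i in range(n):
        logits = [dot(views_a[i], views_b[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_norm - logits[i]  # negative log-softmax of the positive
    return total / n

# Toy usage: matched views give a small loss, mismatched views a large one.
a = [[1.0, 0.0], [0.0, 1.0]]
matched = info_nce(a, [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce(a, [[0.0, 1.0], [1.0, 0.0]])
uniform = info_nce([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], temperature=1.0)
```

When all embeddings are identical, the positive cannot be told apart from the negatives and the loss equals the log of the batch size, which is the uninformative baseline for this objective.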

A.2 Proof of Lemma 5.1


We begin with the BALL objective. Expand and cancel denominators.

where . Define index sets and a view function over sets as in Sec. 3.2. The third equality holds by rewriting a memory bank entry as a weighted sum. The proof for ANN follows identically but the variational distribution must be a function of , which is used to exclude the close neighbor set from the distribution of valid negative samples. ∎

A.3 Proof of Lemma 5.2


. ∎

A.4 Proof of Lemma 5.3


Recall that the images in the set are the most semantically close to the current image (by construction). As such, we view each as with some (semantic) noise added. For example, if is an image of a dog, may be another dog with similar visual features.

More formally, fix an index (in the training dataset). Then for all , there exists some noise such that some where , the current image with some noise added. Then,

with samples to approximate the expectation. That is, the elements of can be seen as a Monte Carlo approximation of an expectation with respect to the marginal distribution over noisy images of , denoted . Now,