Contrastive representation learning (CRL) is a family of methods that learn an encoding function so that, in the encoding space, augmented images produced from the same image (positive samples) attract each other, while augmented images of different origins (negative samples) repel each other Hjelm et al. (2019); Bachman et al. (2019); Hénaff et al. (2020); Tian et al. (2019); Chen et al. (2020). Oftentimes, the augmentations used in CRL are chosen to be those believed to maintain the "content" features of the inputs (one formalization in terms of a target signal is given in Hu et al. (2017)), while altering the "style" features that may be discarded in the encoding process von Kügelgen et al. (2021). However, how can we be sure that a heuristically chosen set of augmentations does not affect the features that are important in the downstream tasks? For example, consider applying a cropping augmentation to a dataset consisting of MNIST images placed at random positions in a blank ambient space (Figure 1).
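In this case, since crops taken from two different images can produce the same empty image, training an encoder that maps every crop of an image close to the others would also force the representations of the two different images together by transitivity. In a semi-supervised setting, such a problem of wrong clustering may be avoided by considering a stochastic with a distribution satisfying , as in Hu et al. (2017).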
In our study, we provide a partial solution to this problem in a self-supervised setting. In particular, we formalize the representation as the output of a stochastic function parametrized by an encoder function and a stochastic augmentation , and maximize in a tug-of-war between the data corruption introduced by and the information preserved by . Although InfoMax in the context of has been discussed in the prior literature (Tian et al., 2019; Bachman et al., 2019; Tschannen et al., 2020; Wu et al., 2020), it has not been investigated thoroughly while allowing freedom in the distribution of . We will empirically demonstrate that we can learn a competitive representation by training together with in this framework. Our formulation of also provides another way to interpret SimCLR Chen et al. (2020) as a special case in which is fixed to be the uniform distribution.
2 InfoMax problem with Augmentation Channel
Existing perspectives of CLR are based on (discussed in more depth in the related works, Section 4). In this work, we revisit the InfoMax problem from a different perspective in a framework of self-supervised learning that explicitly separates the augmentation channel in the encoding map . Consider the generation process illustrated in Figure 2.
In this process, is produced from by applying a random augmentation sampled from some distribution . is then encoded into through the distribution parametrized by some encoder . Thus, the distribution of can be written as
Using to denote the family of distributions that can be written in this form, we consider the InfoMax problem . In this definition of the map , the support of determines the maximum amount of information that can be preserved. For example, if all members of strongly corrupt , would be small for every choice of . Meanwhile, if the identity transformation is included in , then can be achieved by setting . However, as in training methods based on noise regularization Miyato et al. (2018); Rothfuss et al. (2019); Hu et al. (2017), the identity mapping is often not included in the augmentation set because it does not help regularize the model.
The InfoMax problem in our framework has a deep connection with modern self-supervised learning, as it provides another derivation of SimCLR that does not use a variational approximation.
Suppose that where is a similarity function on the range of and is a constant dependent only on . Then
Also, when is uniformly distributed on a compact set of view-transformations, the mean approximation of and Jensen’s inequality on the part of (2) recovers the simCLR loss.
For the proof of Prop 1, please see Appendix 5.1. We shall note that the condition of this statement is fulfilled in natural cases, such as when is Gaussian or Gaussian on the sphere. In the proof of Prop 1, the numerator and the denominator correspond directly to and . If takes its value on the sphere , enlarging would encourage to be uniformly distributed over the sphere. These observations support the theory proposed in Wang and Isola (2020). The table in Appendix 5.2 summarizes our algorithm for optimizing the objective (2) with respect to both and .
We show that, by training together with the encoder based on the objective (2), we can learn a better representation than the original SimCLR. We conducted an experiment on the dataset derived from MNIST mentioned in the introduction (Figure 1). To construct this dataset, we first prepared a blank image of size , which is times greater in both dimensions than the original MNIST images (). We then created our dataset by placing each MNIST image randomly at one of grid locations in the aforementioned blank image. We set to be a random augmentation that crops a image at one of locations ranging over the dimensional image with stride size. On this dataset, any crop that does not intersect with the digit produces the same empty image, which is useless for discriminating between image instances. For computational ease, we trained our encoder based on the Jensen lower bound of (2). We shall also note that, in our setup, our corresponds to the composition of the projection head and the encoder in the context of recent works on contrastive learning. We evaluated the representation of both and . Also, without any additional constraint, sometimes collapsed to the "most discriminating" crop on the training set, resulting in a representation that does not generalize to the downstream classification task. To resolve this problem, we adopted the maximum entropy principle Haarnoja et al. (2017) and optimized our objective (2) together with a small entropy regularization , seeking the highest entropy that maximizes the objective (2).
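The dataset construction and the strided crop family described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the canvas size (56), the grid of placements (3 per axis), the crop size (28) and the stride (14) are hypothetical values chosen only to make the example concrete, and the names `place_on_canvas` and `all_crops` are our own.

```python
import numpy as np

def place_on_canvas(digit, canvas_size, grid, rng):
    """Place `digit` at a random grid cell of an otherwise blank square canvas."""
    h, w = digit.shape
    canvas = np.zeros((canvas_size, canvas_size), dtype=digit.dtype)
    # Evenly spaced top-left corners that keep the digit inside the canvas.
    tops = np.linspace(0, canvas_size - h, grid).astype(int)
    lefts = np.linspace(0, canvas_size - w, grid).astype(int)
    top = tops[rng.integers(grid)]
    left = lefts[rng.integers(grid)]
    canvas[top:top + h, left:left + w] = digit
    return canvas

def all_crops(image, crop_size, stride):
    """Enumerate every square crop of side `crop_size` taken with the given stride."""
    size = image.shape[0]
    corners = [(i, j)
               for i in range(0, size - crop_size + 1, stride)
               for j in range(0, size - crop_size + 1, stride)]
    crops = np.stack([image[i:i + crop_size, j:j + crop_size] for i, j in corners])
    return corners, crops

rng = np.random.default_rng(0)
digit = rng.random((28, 28)) + 0.1   # stand-in for one 28x28 MNIST digit (hypothetical data)
canvas = place_on_canvas(digit, canvas_size=56, grid=3, rng=rng)   # sizes are assumptions
corners, crops = all_crops(canvas, crop_size=28, stride=14)

# Crops that miss the digit are all the same blank image, whatever the source image was.
empty_corners = [c for c, crop in zip(corners, crops) if not crop.any()]
```

Every crop listed in `empty_corners` is identical across images, which is why a uniform distribution over crops gives the encoder many positive pairs that carry no information about the instance.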
3.1 Performance of the trained representations in Linear Evaluation Protocol
To evaluate the learned representation, we followed the linear evaluation protocol as in Chen et al. (2020) and used scikit-learn (Pedregosa et al., 2011) to train the classifier. For SimCLR, it is often customary to use the "center crop" augmentation and report as the representation for . However, in this example, "center crop" would extract an empty image with high probability. Thus, we computed the representation of each by integrating the encoded variable with respect to , that is, ( for SimCLR is uniform). For the models with non-uniform we also evaluated , the representation obtained by averaging over the set of s having the eight highest densities. As an ablation, we also evaluated the SimCLR-trained encoder by integrating its output with respect to the oracle concentrated uniformly on the crop positions with maximal intersection with the embedded MNIST image. We conducted each experiment with seeds. Table 1 summarizes the results.
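The two ways of aggregating per-crop codes described above (integrating over the augmentation distribution, and averaging over the most probable crops) can be sketched as below. This is an illustrative sketch under assumptions: the function names are hypothetical, the support of the augmentation distribution is taken to be a finite set of crops, and the top-k variant is assumed to be an unweighted mean, which the text does not state explicitly.

```python
import numpy as np

def integrated_representation(codes, probs):
    """Expected code under the augmentation distribution: sum_k probs[k] * codes[k]."""
    probs = np.asarray(probs, dtype=float)
    return probs @ np.asarray(codes, dtype=float)

def top_k_representation(codes, probs, k=8):
    """Unweighted mean of the codes of the k most probable augmentations (an assumption)."""
    idx = np.argsort(probs)[::-1][:k]
    return np.asarray(codes, dtype=float)[idx].mean(axis=0)

# Toy example: 4 crops with 3-dimensional codes.
codes = np.arange(12, dtype=float).reshape(4, 3)
uniform = np.full(4, 0.25)                  # fixed, uniform distribution over crops
peaked = np.array([0.0, 0.0, 0.5, 0.5])     # distribution concentrated on two crops

z_uniform = integrated_representation(codes, uniform)
z_peaked = integrated_representation(codes, peaked)
z_top2 = top_k_representation(codes, peaked, k=2)
```

With a uniform distribution every crop contributes equally, including blank ones, whereas a concentrated distribution lets only the informative crops shape the final representation.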
We can see that, with our trained and , we can achieve a very high linear evaluation score, even better than the raw representation result on the ordinary MNIST dataset (). Interestingly, with our , the representation is competitive even at the projection head, and its performance even exceeds that of the SimCLR representation obtained by averaging over the oracle . This trend was also observed in the experiment on the original MNIST (see Appendix 5.4). This result may suggest that the poor quality of the SimCLR representation at the level of the projection head is partially due to the fact that a proper is not used in training the model. Also, in confirmation of our problem statement in Section 1, the representation learned without the trainable collapses around that of the empty image (see Appendix 5.5). In terms of the average pairwise Gaussian potential used in Wang and Isola (2020), which measures the uniformity of the representations on the sphere (lower is better), our representation achieves as opposed to of the baseline SimCLR.
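For reference, the average pairwise Gaussian potential of Wang and Isola (2020) can be computed as in the sketch below. It is written here as the logarithm of the mean of exp(-t * squared distance) over all distinct pairs; the temperature `t = 2` is an assumption, since the value used in these experiments is not stated in the text, and the function name is hypothetical.

```python
import numpy as np

def gaussian_potential_uniformity(z, t=2.0):
    """log of the mean of exp(-t * ||z_i - z_j||^2) over all pairs i < j (lower is better)."""
    z = np.asarray(z, dtype=float)
    sq_norms = (z ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * z @ z.T
    i, j = np.triu_indices(len(z), k=1)
    # Clip tiny negative values caused by floating point cancellation.
    return float(np.log(np.exp(-t * np.clip(sq_dists[i, j], 0.0, None)).mean()))

# Four unit vectors collapsed onto one point versus spread around the circle.
collapsed = np.tile([1.0, 0.0], (4, 1))
spread = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])

u_collapsed = gaussian_potential_uniformity(collapsed)
u_spread = gaussian_potential_uniformity(spread)
```

A fully collapsed set of representations attains the maximum value 0, and the value decreases as the points spread out over the sphere.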
3.2 The trained agrees with our intuition
Figure 3 visualizes the density of (second row) for various input images (first row). In each image of the second row, the intensity at th pixel is , where is the augmentation that crops the sub-image of size with the top left corner located at . As we can see in the figure, the learned is concentrated on the location of the digit, ignoring the crop locations that would return the empty image. Our learned in fact captures a non-trivial crop with probability on 10,000 test images.
4 Related Works and conclusion
In a way, can be considered an augmentation policy. Cubuk et al. (2019) and Hataya et al. (2020) also learn with supervision signals. Reed et al. (2021) extend these works to the self-supervised setting by applying a modified version of Hataya et al. (2020) to a set of self-supervised tasks that are empirically correlated with the target downstream tasks.
There are also several works that investigate the importance of non-uniform sampling in contrastive learning. For example, Tian et al. (2020) propose the infoMin principle, which claims that one should engineer the distribution of in such a way that it (1) shares as much information as possible with the target variable while (2) ensuring that, for any two realizations of , and have as little information in common as possible. In their work, however, they do not provide an algorithm to optimize the distribution of . In a way, the requirements (1) and (2) seem to be respectively related to and in the numerator-denominator decomposition of (2). Also, because they are practically conducting an empirical study on the joint distribution, their work might also be related to the optimization of in our context. Also, Tamkin et al. (2021) train adversarially with respect to the loss. However, in the setting we discuss in this paper, this strategy would encourage to crop only the empty image and collapse the representations.
Previously, the connection between CLR and Mutual information has also been described based on the perspective that interprets CLR as a variational approximation of the mutual information between two views , where each is a "view" of produced by some augmentation function (Tian et al., 2019; Bachman et al., 2019; Tschannen et al., 2020; Wu et al., 2020). This variational approximation is based on the inequality
that holds for any measurable . Based on this infoNCE perspective, Tschannen et al. (2020) consider a case in which is trained as with invertible , and present an empirical study suggesting that SimCLR can improve the representation even in this setting. Based on this argument, Tschannen et al. (2020) suggest that cannot be used to explain the success of SimCLR. However, as we point out in our study, the transformation usually involves information loss via augmentations like cropping, and CLR is often evaluated based on sampled from . In this study, we formalize the augmentation channel as a part of , and present a result suggesting that, at least for the learning of , the mutual information (MI) with regularization might be an empirically useful measure for learning a good representation, in particular at the level of the final output (projection head). Our result may suggest that it might still be too early to throw away the idea of MI in all aspects of CLR, because Tschannen et al. (2020) study a case in which only the part of is made invertible.
It might also be worthwhile to mention some theoretical advantages of our formulation. Because (3) is a variational bound that holds for any choice of , this inequality does not help in estimating how much the RHS derived from a specific choice of (i.e. RHS(f)) differs from . Also, when we optimize RHS() using a popular family of defined as Wu et al. (2020), there is no way to know in what proportion a given update of would affect and . Meanwhile, in our formulation, the difference between SimCLR and MI is described directly with Jensen and mean approximation, for which there are known mathematical tools like Gao et al. (2016). It might be interesting to further investigate the claims made by Tschannen et al. (2020) in this direction as well. We believe that our approach provides a new perspective on the study of contrastive learning as well as insights into the choice of augmentations.
- Learning representations by maximizing mutual information across views. Advances in Neural Information Processing Systems(NeurIPS). Cited by: §1, §1, §4.
- A simple framework for contrastive learning of visual representations. International Conference on Machine Learning(ICML). Cited by: §1, §1, §3.1.
- AutoAugment: learning augmentation policies from data. Cited by: §4.
- Probability: theory and examples. Brooks/Cole Thomson. Cited by: §5.1.
- Bounds on the jensen gap, and implications for mean concentrated distributions. The Australian Journal of Mathematical Analysis and Applications 16. Cited by: §4.
- Reinforcement learning with deep energy-based policies. International Conference on Machine Learning(ICML). Cited by: §3.
- Faster autoaugment: learning augmentation strategies using backpropagation. European Conference on Computer Vision(ECCV). Cited by: §4.
- Data-efficient image recognition with contrastive predictive coding. International Conference on Machine Learning(ICML). Cited by: §1.
- Learning deep representations by mutual information estimation and maximization. International Conference on Learning Representations(ICLR). Cited by: §1.
- Learning discrete representations via information maximizing self-augmented training. International Conference on Machine Learning(ICML). Cited by: §1, §2, footnote 1.
- Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence(TPAMI). Cited by: §2.
- Scikit-learn: machine learning in python. Journal of machine learning research(JMLR) 12 (Oct), pp. 2825–2830. Cited by: §3.1.
- SelfAugment: automatic augmentation policies for self-supervised learning. IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR). Cited by: §4.
- Noise regularization for conditional density estimation. arXiv preprint arXiv:1907.08982. Cited by: §2.
- VIEWMAKER networks: learning views for unsupervised representation learning. International Conference on Learning Representations(ICLR). Cited by: §4.
- Contrastive multiview coding. European Conference on Computer Vision(ECCV). Cited by: §1, §1, §4.
- What makes for good views for contrastive learning?. Advances in Neural Information Processing Systems(NeurIPS). Cited by: §4.
- On mutual information maximization for representation learning. International Conference on Learning Representations(ICLR). Cited by: §1, §4, §4.
- Self-supervised learning with data augmentations provably isolates content from style. arXiv preprint arXiv:2106.04619. Cited by: §1.
- Understanding contrastive representation learning through alignment and uniformity on the hypersphere. International Conference on Machine Learning(ICML). Cited by: §2, §3.1, §5.3, §5.5.
- On mutual information in contrastive learning for visual representations. arXiv preprint arXiv:2005.13149. Cited by: §1, §4, §4.
5.1 Formal statement and the proof of Proposition 1
Suppose that where is a similarity function and is a normalization constant dependent only on . Then
Also, when is uniformly distributed over a compact set of view-transformations, we recover the loss of SimCLR by (1) applying Jensen’s inequality on and (2) approximating with , the mean of .
We use an upper case letter to denote a random variable and a lower case letter to denote its corresponding realization ( is a realization of ). We also use the standard notation of measure theoretic probability that treats expressions like and as random variables that are measurable with respect to . Thus, in the equality , the integral inside the RHS is a random variable with respect to . To clarify, we sometimes use a subscript to indicate the variable with respect to which the integral is taken. For more details about this algebra, see Durrett, for example. Here, we show the proof of the version of the statement with the application of Jensen’s inequality. The proof without Jensen’s inequality can be derived easily from the intermediate results of this proof.
Altogether, we see that cancels out and
The equality emerges if we do not apply Jensen’s inequality on .
To show the connection of this result with simCLR, we approximate as , the mean of . With this approximation, the outermost integration with respect to will be replaced by the integration with respect to . Also, because is integrated away in the denominator of (16), the double prime superscript of the is superficial. Thus, we obtain
Choosing and , we get
which agrees with the simCLR loss when is set to be uniform.
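For comparison, the widely used form of the SimCLR (NT-Xent) loss of Chen et al. (2020) can be sketched as follows. This is the standard batch formulation with cosine similarity and a temperature, given only as a reference point; it is not a transcription of the expression derived above, and the temperature value `0.5` and the function name are assumptions.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent loss for n positive pairs (z_a[i], z_b[i]); all other samples act as negatives."""
    z = np.concatenate([z_a, z_b], axis=0).astype(float)
    z /= np.linalg.norm(z, axis=1, keepdims=True)      # project codes onto the unit sphere
    n = len(z_a)
    logits = z @ z.T / temperature                     # scaled cosine similarities
    np.fill_diagonal(logits, -np.inf)                  # a sample is never compared with itself
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Stable log-sum-exp over each row (the "denominator" term).
    row_max = logits.max(axis=1, keepdims=True)
    log_norm = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    # Mean over samples of: log-denominator minus positive-pair similarity.
    return float(np.mean(log_norm - logits[np.arange(2 * n), positives]))

views = np.eye(4)                                         # four mutually orthogonal codes
loss_aligned = nt_xent_loss(views, views)                 # each positive pair coincides
loss_mismatched = nt_xent_loss(views, np.roll(views, 1, axis=0))   # positives are orthogonal
```

The positive-pair term rewards agreement between views of the same image, while the log-sum-exp term pushes all codes apart, which mirrors the numerator and denominator roles discussed in the main text.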
The table shown below describes the algorithm based on Proposition 1 that trains and together. In this algorithm we assume that the support of is discrete. Instead of training and simultaneously, we train and in turn because this strategy produced more stable results. In this algorithm’s notation, the classic SimCLR emerges if we set (the number of samples) to be and set to be uniform. In our experiments we set to be , as it performed better than any smaller value for both fixed (SimCLR) and trainable .
5.3 Model Architecture and entropy regularization
In our experiment, we used a three layer CNN with dimensional output for the intermediate encoder and a two layer MLP with dimensional output for the projection head (Figure 4). We chose this architecture because it performed stably for SimCLR on the standard MNIST dataset (see Section 5.4). We trained with a three layer CNN (Figure 5).
As in Wang and Isola (2020), we normalized the final output of the encoder so that it is distributed on the sphere. As such, we used , and set since this choice yielded stable results for the learning of . At inference time, we normalized . To discourage from collapsing prematurely, we imposed a regularization of with coefficient . We used coefficient , as it achieved the lowest contrastive loss on the training set in the range .
This choice of also produced the best linear evaluation score on the training dataset. Setting seemed to collapse in many cases.
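An entropy regularizer of the kind described above can be sketched as follows for a discrete set of crops. The sketch assumes that the augmentation network outputs one logit per crop for each image and that the entropy is subtracted from a loss that is minimized; both the sign convention and the names `softmax`, `entropy` and `regularized_loss` are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each categorical distribution along the last axis."""
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def regularized_loss(contrastive_loss, logits, coeff):
    """Loss to minimize: contrastive loss minus coeff times the mean crop-distribution entropy."""
    return contrastive_loss - coeff * entropy(softmax(logits)).mean()

num_crops = 9                                   # hypothetical size of the crop set
uniform_logits = np.zeros((2, num_crops))       # batch of 2 images, uniform over crops
peaked_logits = np.zeros((2, num_crops))
peaked_logits[:, 0] = 50.0                      # almost all mass on a single crop

h_uniform = entropy(softmax(uniform_logits))
h_peaked = entropy(softmax(peaked_logits))
loss_uniform = regularized_loss(1.0, uniform_logits, coeff=0.1)
loss_peaked = regularized_loss(1.0, peaked_logits, coeff=0.1)
```

At equal contrastive loss, the regularized objective prefers the higher-entropy distribution, which is what discourages a premature collapse onto a single crop.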
5.4 Results on the original MNIST dataset
Table 2 shows the results on the original MNIST dataset. We used the same setting as for the main experiment in Section 3, except that we set . On this dataset, the raw representation achieves . When trained with uniform , the projection head representation is not much better than the raw representation. However, when trained together with , the projection head representation is comparable to the output. This result also suggests that, by training together with , we can improve the utility of the representation at the level on which the objective function is trained, instead of the heuristically chosen intermediate representation . This result also suggests that there is much room left for the study of the stochastic augmentation and intermediate representation.
5.5 Uniformity of the learned representation
Wang and Isola (2020) report that a good representation tends to be more uniformly distributed on the sphere. The graphs in Figure 6 are scatter plots of 2-dimensional representations trained with and without the trainable . The graphs in Figure 7 are superimposed plots of 50 dimensional representations with and without the trainable . On these graphs, we can visually see that what we feared in Section 1 and Figure 1 happens when we fix ; the majority of the representations become strongly concentrated around that of the empty image. This problem is successfully avoided with the trainable . In terms of the average pairwise Gaussian potential used in Wang and Isola (2020), which measures the uniformity of the representations on the sphere (lower is better), our dimensional representation achieves as opposed to of the baseline SimCLR with fixed . The graphs in Figure 8 are the sorted values of for a randomly sampled set of pairs. We see in these graphs that the representations with the trainable are trained to be as orthogonal to each other as possible ( is concentrated around ), while the representations trained with the fixed are collapsing into one direction ( is concentrated around ).
Figure 7 (caption, partially recovered): The horizontal axis represents the indices of the vectors, and each curve with a different color represents one instance of the vector. Right: The superimposed plot of 50 dimensional representations trained with uniform . We see that all instances of look very similar.