What Do We Maximize in Self-Supervised Learning?

by   Ravid Shwartz-Ziv, et al.

In this paper, we examine self-supervised learning methods, particularly VICReg, to provide an information-theoretical understanding of their construction. As a first step, we demonstrate how information-theoretic quantities can be obtained for a deterministic network, offering a possible alternative to prior work that relies on stochastic models. This enables us to demonstrate how VICReg can be (re)discovered from first principles and its assumptions about data distribution. Furthermore, we empirically demonstrate the validity of our assumptions, confirming our novel understanding of VICReg. Finally, we believe that the derivation and insights we obtain can be generalized to many other SSL methods, opening new avenues for theoretical and practical understanding of SSL and transfer learning.



1 Introduction

Self-Supervised Learning (SSL) algorithms (bromley1993signature) learn representations using a proxy objective (i.e., the SSL objective) between inputs and self-defined signals. Empirical results indicate that the learned representations generalize well to a wide range of downstream tasks (chen2020simple; misra2020self), even though the SSL objective does not use downstream supervision during training. In SimCLR (chen2020simple), for example, a contrastive loss is defined between images with different augmentations (i.e., one as input and the other as a self-supervised signal). The pre-trained model is then used as a feature extractor, and its features are applied to various tasks, including image classification, object detection, instance segmentation, and pose estimation (caron2021emerging). However, despite this practical success, only a few works (arora2019theoretical; lee2021predicting) provide theoretical insights into the learning efficacy of SSL.

In recent years, information-theoretic methods have played a key role in several notable deep learning achievements, from practical applications in representation learning such as the variational information bottleneck (alemi2016deep), to theoretical investigations (e.g., the generalization bounds induced by mutual information (xu2017information; steinke2020reasoning; shwartz2022information)). Moreover, different deep learning problems have been successfully approached by developing and applying novel estimators and learning principles derived from information-theoretic quantities, such as mutual information estimation. Many works have attempted to analyze SSL from an information theory perspective. An example is the use of the mutual information neural estimator (MINE) (belghazi2018mine) in representation learning (hjelm2018learning) in conjunction with the renowned information maximization (InfoMax) principle (linsker1988self). However, this body of work can be confusing: numerous objective functions are presented, some contradicting each other, along with many implicit assumptions. Moreover, these works rely on a crucial assumption: a stochastic (often Gaussian) deep network (DN) mapping, which is rarely the case nowadays.

This paper presents a unified framework for SSL methods from an information theory perspective, which can be applied to deterministic DN training. Our contributions are twofold: (i) First, in order to study deterministic DNs from an information theory perspective, we shift stochasticity to the DN input, which is a much more faithful assumption for current training techniques. (ii) Second, based on this formulation, we analyze how current SSL methods that use deterministic networks optimize information-theoretic quantities.

2 Background

Continuous Piecewise Affine (CPA) Mappings.  A rich class of functions emerges from piecewise polynomials: spline operators. In short, given a partition $\Omega$ of a domain $\mathbb{R}^D$, a spline of order $k$ is a mapping defined by a polynomial of order $k$ on each region $\omega \in \Omega$, with continuity constraints on the entire domain for the derivatives of order $0, \dots, k-1$. As we will focus on affine splines ($k = 1$), we define this case only for concreteness. A $K$-dimensional affine spline $f$ produces its output via

$$f(z) = \sum_{\omega \in \Omega} (A_\omega z + b_\omega)\, \mathbb{1}_{\{z \in \omega\}}, \tag{1}$$

with input $z \in \mathbb{R}^D$ and $A_\omega \in \mathbb{R}^{K \times D}$, $b_\omega \in \mathbb{R}^K$ the per-region slope and offset parameters respectively, with the key constraint that the entire mapping is continuous over the domain, $f \in \mathcal{C}^0(\mathbb{R}^D)$. Spline operators, and especially affine spline operators, have been widely used in function approximation theory (cheney2009course), optimal control (egerstedt2009control), statistics (fantuzzi2002identification), and related fields.
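
As a minimal sketch of the affine spline view, the snippet below builds a small two-layer ReLU network and recovers the slope and offset of the affine piece that contains a given input. The network sizes, the random weights, and the helper name `region_params` are illustrative assumptions, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, K = 3, 8, 4  # input, hidden, and output sizes (arbitrary choices)

# Hypothetical two-layer ReLU network: f(x) = W2 @ relu(W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(H, D)), rng.normal(size=H)
W2, b2 = rng.normal(size=(K, H)), rng.normal(size=K)

def f(x):
    """Forward pass for a single input vector x of shape (D,)."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def region_params(x):
    """Slope A and offset b of the affine piece whose region contains x."""
    active = (W1 @ x + b1 > 0).astype(float)  # activation pattern identifies the region
    A = W2 @ (active[:, None] * W1)           # (K, D) per-region slope
    b = W2 @ (active * b1) + b2               # (K,) per-region offset
    return A, b

x = rng.normal(size=D)
A, b = region_params(x)
# On the region containing x, the network coincides with the affine map A x + b
print(np.allclose(f(x), A @ x + b))
```

The activation pattern plays the role of the region indicator: inputs sharing the same pattern share the same slope and offset.
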

Deep Networks.  A deep network (DN) is a (non-linear) operator $f_\Theta$ with parameters $\Theta$ that maps an input $x \in \mathbb{R}^D$ to a prediction $y \in \mathbb{R}^K$. The precise definitions of DN operators can be found in goodfellow2016deep. We will omit the $\Theta$ notation for clarity unless needed. The only assumption we require for our study is that the non-linearities present in the DN are CPA, as is the case with (leaky-) ReLU, absolute value, and max-pooling. In that case, the entire input-output mapping becomes a CPA spline with an implicit partition $\Omega$, which is a function of the weights and architecture of the network (montufar2014number; balestriero2018spline). For smooth nonlinearities, our results hold from a first-order Taylor approximation argument.

Self-Supervised Learning. Joint embedding methods learn the DN parameters $\Theta$ without supervision or input reconstruction. Due to this formulation, the difficulty of SSL is to produce a good representation for downstream tasks whose labels are not available during training, while avoiding a trivially simple solution where the model maps all inputs to a constant output. Many methods have been proposed to solve this problem. Contrastive methods learn representations by contrasting positive and negative examples, e.g., SimCLR (chen2020simple) and its InfoNCE criterion (oord2018representation). Other recent work introduced non-contrastive methods that employ different regularization methods to prevent collapse of the representation. Several papers used stop-gradients and extra predictors to avoid collapse (chen2021exploring; grill2020bootstrap), while caron2020unsupervised uses an additional clustering step. As opposed to contrastive methods, non-contrastive methods do not explicitly rely on negative samples. Of particular interest to us is the VICReg method (bardes2021vicreg), which considers two embedding batches $Z = [f(x_1), \dots, f(x_N)]$ and $Z' = [f(x'_1), \dots, f(x'_N)]$, each of size $(N \times K)$. Denoting by $C$ the $(K \times K)$ covariance matrix obtained from $[Z, Z']$, we obtain the VICReg triplet loss

$$\mathcal{L} = \frac{1}{K} \sum_{k=1}^{K} \Big( \alpha \max\big(0, \gamma - \sqrt{C_{k,k} + \epsilon}\big) + \beta \sum_{k' \neq k} (C_{k,k'})^2 \Big) + \lambda \frac{\|Z - Z'\|_F^2}{N}.$$

Our goal will now be to formulate SSL as an information-theoretic problem from which we can precisely relate VICReg to known methods even with a deterministic network.
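
The three terms of the VICReg loss described above (variance, covariance, and invariance) can be sketched in NumPy as follows. This is a rough illustration rather than the reference implementation: the argument names `alpha`, `beta`, `lam`, `gamma`, and `eps` and their default values are assumptions, and the covariance is computed from the stacked batches as in the text, whereas the original VICReg code applies the variance and covariance terms to each branch separately.

```python
import numpy as np

def vicreg_loss(Z, Zp, alpha=25.0, beta=1.0, lam=25.0, gamma=1.0, eps=1e-4):
    """Sketch of a VICReg-style loss for two (N, K) embedding batches.

    alpha, beta, lam weight the variance, covariance, and invariance terms;
    gamma is the target standard deviation. All defaults are assumptions.
    """
    N, K = Z.shape
    # Covariance of the stacked embeddings [Z, Z'], shape (K, K)
    C = np.cov(np.concatenate([Z, Zp], axis=0), rowvar=False)
    # Variance term: hinge pushing each dimension's std above gamma
    std = np.sqrt(np.diag(C) + eps)
    variance = np.mean(np.maximum(0.0, gamma - std))
    # Covariance term: squared off-diagonal entries, averaged over dimensions
    off_diag = C - np.diag(np.diag(C))
    covariance = np.sum(off_diag ** 2) / K
    # Invariance term: squared distance between paired embeddings
    invariance = np.sum((Z - Zp) ** 2) / N
    return alpha * variance + beta * covariance + lam * invariance

rng = np.random.default_rng(0)
Z = rng.normal(size=(128, 16))
Zp = Z + 0.1 * rng.normal(size=Z.shape)   # a slightly perturbed second view
print(vicreg_loss(Z, Zp))
```

A collapsed representation (all embeddings equal) makes the invariance term vanish but is penalized by the variance hinge, which is the mechanism that prevents the trivial solution.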

Deep Networks and Information-Theory.  Recently, information-theoretic methods have played a key role in several remarkable deep learning achievements (alemi2016deep; xu2017information; steinke2020reasoning; shwartz2017opening). Moreover, different deep learning problems have been successfully approached by developing and applying information-theoretic estimators and learning principles (hjelm2018learning; belghazi2018mine; piran2020dual; shwartz2018representation). There is, however, a major problem when it comes to analyzing information-theoretic objectives in deterministic deep neural networks: the source of randomness. The mutual information between the input and the representation in such networks is either infinite, resulting in ill-posed optimization problems, or piecewise constant, making gradient-based optimization methods ineffective (amjad2019learning). To solve these problems, researchers have proposed several solutions. For SSL, stochastic deep networks with variational bounds could be used, where the output of the deterministic network is used as the parameters of the conditional distribution (lee2021compressive; shwartz2020information). dubois2021lossy suggested another option, which assumes that the randomness of data augmentation between the two views is the source of stochasticity in the network. For supervised learning, 2018Estimating introduced an auxiliary (noisy) DN framework by injecting additive noise into the model and demonstrated that it is a good proxy for the original (deterministic) DN in terms of both performance and representation. Finally, Achille and Soatto (achille2018information) found that minimizing a stochastic network with a regularizer is equivalent to minimizing cross-entropy over deterministic DNs with multiplicative noise. However, all of these methods assume that the noise comes from the model itself, which contradicts current training methods. In this work, we explicitly assume that the stochasticity comes from the data, which is a less restrictive assumption and does not require changing current algorithms.

3 Information Maximization of Deep Networks Outputs

This section first sets up notation and assumptions: the information-theoretic challenges in self-supervised learning (Section 3.1) and our assumptions regarding the data distribution (Section 3.2), so that any training sample $x$ can be seen as coming from a single Gaussian distribution as in $x \sim \mathcal{N}(\mu_x, \Sigma_x)$. From this we obtain that the output of any deep network corresponds to a mixture of truncated Gaussians (Section 3.3). In particular, it can fall back to a single Gaussian under a small-noise assumption. These results will enable information measures to be applied to deterministic DNs. We then recover known SSL methods such as VICReg (bardes2021vicreg) by making different assumptions about the data distribution and estimating their information.

3.1 SSL as an Information-Theoretic Problem

To better grasp the difference between key SSL methods, we first formulate the general SSL goal from an information-theoretical perspective.

We start with the MultiView InfoMax principle, i.e., maximizing the mutual information between the representations of the two views. To do so, as shown in federici2020learning, we need to maximize $I(Z; X')$ and $I(Z'; X)$. We can do so by a lower bound using

$$I(Z; X') = H(Z) - H(Z \mid X') \geq H(Z) + \mathbb{E}_{x'}\big[\log q(z \mid x')\big], \tag{2}$$

where $H(Z)$ is the entropy of $Z$. In supervised learning, where we need to maximize $I(Y; Z)$, the labels ($Y$) are fixed, the entropy term $H(Y)$ is constant, and one only needs to optimize the log-loss $\mathbb{E}[\log q(y \mid z)]$ (cross-entropy or square loss). However, in SSL, the entropies $H(Z)$ and $H(Z')$ are not constant and can be optimized throughout the learning process. Therefore, only maximizing $\mathbb{E}_{x'}[\log q(z \mid x')]$ will cause the representation to collapse to the trivial solution of making the representations constant (where the entropy goes to zero). To regularize these entropies, i.e., to prevent collapse, different methods use different approaches to implicitly regularize information. To recover them in Section 4, we must first introduce the notation and results around the data distribution (Section 3.2) and how a DN transforms that distribution (Section 3.3).
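
For completeness, the lower bound above follows from a standard variational argument, sketched here in LaTeX. The symbols $p$ (the true conditional density of the representation given the other view) and $q$ (any variational approximation of it) are notation assumed for this sketch.

```latex
\begin{align*}
I(Z; X') &= H(Z) - H(Z \mid X') \\
         &= H(Z) + \mathbb{E}_{x'}\,\mathbb{E}_{z \sim p(\cdot \mid x')}\big[\log p(z \mid x')\big] \\
         &= H(Z) + \mathbb{E}_{x'}\,\mathbb{E}_{z \sim p(\cdot \mid x')}\big[\log q(z \mid x')\big]
            + \mathbb{E}_{x'}\Big[\mathrm{KL}\big(p(\cdot \mid x') \,\|\, q(\cdot \mid x')\big)\Big] \\
         &\geq H(Z) + \mathbb{E}_{x'}\,\mathbb{E}_{z \sim p(\cdot \mid x')}\big[\log q(z \mid x')\big],
\end{align*}
```

The inequality holds because the KL divergence is non-negative, with equality when $q$ matches $p$.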

3.2 Data Distribution Hypothesis

Our first step is to assess how the output random variables of the network are represented, assuming a distribution on the data itself. Under the manifold hypothesis, any point can be seen as a Gaussian random variable with a low-rank covariance matrix in the direction of the manifold tangent space of the data. Therefore, we will consider throughout this study the conditioning of a latent representation with respect to the mean of the observation, i.e., $X \mid x^* \sim \mathcal{N}(x^*, \Sigma_{x^*})$, where the eigenvectors of $\Sigma_{x^*}$ are in the same linear subspace as the tangent space of the data manifold at $x^*$, which varies with the position of $x^*$ in space.

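Hence a dataset is considered to be a collection of prototypes $\{x^*_n, n = 1, \dots, N\}$ and the full data distribution to be a sum of low-rank covariance Gaussian densities as in

$$X \mid T = n \sim \mathcal{N}(x^*_n, \Sigma_{x^*_n}), \qquad T \sim \mathrm{Cat}(N), \tag{3}$$

with $T$ the uniform Categorical random variable. To keep things simple and without loss of generality, we consider that the effective supports of $\mathcal{N}(x^*_i, \Sigma_{x^*_i})$ and $\mathcal{N}(x^*_j, \Sigma_{x^*_j})$ do not overlap for $i \neq j$. This keeps things general, as it is enough to cover the domain of the data manifold overall, without overlap between different Gaussians. Hence, in general, we have that

$$p(x) \approx \frac{1}{N}\, \mathcal{N}\big(x; x^*_{n(x)}, \Sigma_{x^*_{n(x)}}\big), \tag{4}$$

where $\mathcal{N}(x; \cdot, \cdot)$ is the Gaussian density at $x$ and $n(x)$ is the index of the mixture component whose effective support contains $x$. This assumption that a dataset is a mixture of Gaussians with non-overlapping support will simplify our derivations below, which could be extended to the general case if needed. Note that this is not restrictive since, given a sufficiently large $N$, the above can represent any manifold with an arbitrarily good approximation.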

3.3 Data Distribution After Deep Network Transformation

Consider an affine spline operator $f$ (Eq. 1) that goes from a space of dimension $D$ to a space of dimension $K$ with $K \geq D$. The span of this mapping, which we denote as its image, is given by

$$\mathrm{Im}(f) \triangleq \{f(x) : x \in \mathbb{R}^D\} = \bigcup_{\omega \in \Omega} \mathrm{Aff}(\omega; A_\omega, b_\omega), \tag{5}$$

with $\mathrm{Aff}(\omega; A_\omega, b_\omega) = \{A_\omega x + b_\omega : x \in \omega\}$ the affine transformation of region $\omega$ by the per-region parameters $A_\omega, b_\omega$, and with $\Omega$ the partition of the input space in which $x$ lives. We also provide an analytical form of the per-region affine mappings in Section 2. Hence, the DN mapping consists of affine transformations on each input space partition region $\omega \in \Omega$, based on the coordinate change induced by $A_\omega$ and the shift induced by $b_\omega$.

When the input space is equipped with a density distribution, this density is transformed by the mapping $f$. In general, finding the density of $f(X)$ is an intractable task. However, given our disjoint support assumption provided in Section 3.2, we can arbitrarily increase the representation power of the density by increasing the number of prototypes $N$. In doing so, the support of each Gaussian is included within the region $\omega$ in which its mean lies, leading to the following result.

Theorem 1.

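Given the setting of Equation 4, the unconditional DN output density, denoted as $Z$, is a mixture of the affinely transformed distributions, e.g., for the Gaussian case

$$Z \mid T = n \sim \mathcal{N}\big(A_{\omega(x^*_n)} x^*_n + b_{\omega(x^*_n)},\; A_{\omega(x^*_n)} \Sigma_{x^*_n} A_{\omega(x^*_n)}^\top\big), \qquad T \sim \mathrm{Cat}(N),$$

where $\omega(x^*_n) \in \Omega$ is the partition region in which the prototype $x^*_n$ lives.

The proof of the above relies on the fact that if $\int_{\omega} p(x \mid x^*_{n(x)})\, dx \approx 1$ for a single region $\omega$, then $f$ is affine within the effective support of $p$. Therefore, any sample from $p$ will almost surely lie within a single region $\omega \in \Omega$, and the entire mapping can be considered affine with respect to $p$. Thus, the output distribution is an affine transformation of the input distribution based on the per-region affine mapping.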
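
The argument above can be checked numerically on a toy example. The sketch below (all sizes, weights, and the noise level `sigma` are illustrative assumptions) places a prototype well inside one region of a small ReLU network, samples low-rank Gaussian noise around it, and compares the output statistics with those predicted by the per-region affine map.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, K = 3, 6, 5  # input, hidden, and output sizes (arbitrary choices)

x_star = rng.normal(size=D)                          # prototype (mean of the input Gaussian)
W1 = rng.normal(size=(H, D))
signs = np.array([1.0, -1.0, 1.0, 1.0, -1.0, 1.0])  # chosen activation pattern at x_star
b1 = signs - W1 @ x_star                             # pre-activations at x_star equal +-1 exactly
W2, b2 = rng.normal(size=(K, H)), rng.normal(size=K)

def f(X):
    """Two-layer ReLU network applied row-wise to X of shape (n, D)."""
    return np.maximum(X @ W1.T + b1, 0.0) @ W2.T + b2

# Slope and offset of the affine piece containing x_star
q = (signs > 0).astype(float)
A = W2 @ (q[:, None] * W1)
b = W2 @ (q * b1) + b2

# Low-rank input noise: variation only along a 2-D "tangent" subspace
T = np.linalg.qr(rng.normal(size=(D, 2)))[0]
sigma = 1e-3
X = x_star + sigma * rng.normal(size=(10_000, 2)) @ T.T
Z = f(X)

# All samples share the activation pattern of x_star, so f is affine on them
same_region = bool(np.all((X @ W1.T + b1 > 0) == (signs > 0)))
mean_pred = A @ X.mean(axis=0) + b
cov_pred = A @ np.cov(X, rowvar=False) @ A.T
print(same_region,
      np.allclose(Z.mean(axis=0), mean_pred),
      np.allclose(np.cov(Z, rowvar=False), cov_pred, rtol=1e-6, atol=1e-12))
```

The comparison uses the sample mean and covariance of the inputs, so the match is exact up to floating-point error whenever every sample stays in the same region; with larger noise, samples cross region boundaries and the single-Gaussian description no longer holds.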

4 Information Optimization and Optimality

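Based on our analysis, we now show how specific SSL algorithms can be derived. According to Section 3.1, we want to maximize $I(Z; X')$ and $I(Z'; X)$. When our input noise is small, we can reduce the conditional output density to a single Gaussian,

$$(Z' \mid X' = x_n) \sim \mathcal{N}\big(\mu(x_n), \Sigma(x_n)\big),$$

where we abbreviated the parameters $\mu(x_n) = A_{\omega(x_n)} x_n + b_{\omega(x_n)}$ and $\Sigma(x_n) = A_{\omega(x_n)} \Sigma_{x_n} A_{\omega(x_n)}^\top$. Using that, and the result from Section 3.1, we see that one should optimize both $H(Z) + \mathbb{E}_{x'}[\log q(z \mid x')]$ and $H(Z') + \mathbb{E}_{x}[\log q(z' \mid x)]$. As in a standard regression task, we assume a Gaussian observation model, which means $q(z \mid z') = \mathcal{N}(z; z', \Sigma_r)$. Using the mean square error as a loss function in regression tasks is a particular application of this assumption, where $\Sigma_r = I$. To compute the expected loss, we need to marginalize out the stochasticity in $Z'$, which means that the conditional decoding map is a Gaussian:

$$(Z \mid X' = x_n) \sim \mathcal{N}\big(\mu(x_n), \Sigma_r + \Sigma(x_n)\big),$$

which gives the distribution $q(z \mid x')$, meaning that we can lower bound the mutual information with

$$I(Z; X') \geq H(Z) - \mathbb{E}_{x', z}\Big[\tfrac{K}{2}\log 2\pi + \tfrac{1}{2}\log\big|\Sigma_r + \Sigma(x')\big| + \tfrac{1}{2}\big(z - \mu(x')\big)^\top \big(\Sigma_r + \Sigma(x')\big)^{-1} \big(z - \mu(x')\big)\Big].$$

What happens if we attempt to optimize this objective? The only intractable component is the entropy of $Z$. We will begin by examining $H(Z)$ itself. It is natural to ask why the entropy of $Z$ will not increase to infinity. Intuitively, the answer is that the two terms of the objective are tied together, and one cannot increase the entropy without affecting the prediction term. Now, recalling that under our distribution assumption, $Z$ is a mixture of Gaussians (recall Thm. 1), we can see how existing upper and lower bounds could be used for this case; for example, the ones in boundsgmmentropy.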
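
The link between the Gaussian observation model and the mean square error mentioned above can be made explicit with a short worked equation. The notation ($K$ for the embedding dimension, $z$ and $z'$ for the two representations) is assumed for this sketch.

```latex
\begin{align*}
-\log \mathcal{N}(z;\, z',\, I_K)
  &= -\log\Big( (2\pi)^{-K/2} \exp\big(-\tfrac{1}{2}\|z - z'\|_2^2\big) \Big) \\
  &= \tfrac{1}{2}\,\|z - z'\|_2^2 + \tfrac{K}{2}\log(2\pi).
\end{align*}
```

Up to an additive constant, maximizing the Gaussian log-likelihood with identity covariance is therefore the same as minimizing the squared distance between the two representations.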

Figure 1: The network output with VICReg training is more Gaussian for small input noise. The p-value of the normality test for different SSL models trained on CIFAR-10 for different input noise levels. The x-axis is the coefficient that multiplies the data distribution standard deviation to obtain the standard deviation of the Gaussian from which samples are drawn around each image. The dashed line marks the point at which the null hypothesis (Gaussian distribution of the network output) can be rejected.


4.1 Deriving VICReg From First Principles

We now recover VICReg from first principles, following the information-theoretic formulation above.

Recall that our goal is to estimate the entropy $H(Z)$ in Section 4, where $Z$ is a Gaussian mixture. This quantity has no known closed form for a mixture of Gaussians due to the logarithm of a sum of exponential functions, except for the special case of a single Gaussian density. There are, however, several approximations in the literature that include both upper and lower bounds. Among these methods, some use the logarithmic sum of the probability (kolchinsky2017estimating), and some use entropy-adjusted logarithmic probabilities (entropyapprox2008).

An even simpler solution is to approximate the entire mixture as a single Gaussian by capturing only the first two moments of the mixture distribution. Since the Gaussian distribution maximizes the entropy for a given covariance matrix, this method provides an upper bound approximation of our entropy of interest $H(Z)$. In this case, denoting by $\Sigma_Z$ the covariance matrix of $Z$, we find that we should maximize the following objective:

$$\tfrac{1}{2}\log\big((2\pi e)^K\, |\Sigma_Z|\big) + \mathbb{E}_{x'}\big[\log q(z \mid x')\big],$$

where $(2\pi e)^K$ is constant with respect to our optimization process, and the second term is the prediction performance of one representation from the other. A key result from shi2009data connects the eigenvectors and eigenvalues of the mixture covariance $\Sigma_Z$ with those of each mixture component, and shows that under the assumption that the separation between the different components is large enough (which holds true in our case as per our data distribution model), the eigenfunctions of the mixture are approximately the eigenfunctions of the individual components. Therefore, in our case, this means that since all those eigenvalues are tied, we only need to find the most efficient way to maximize $\log|\Sigma_Z|$.
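
The moment-matching bound used above can be illustrated numerically. The sketch below compares a Monte Carlo estimate of the entropy of a two-component Gaussian mixture with the entropy of a single Gaussian that has the same covariance. The mixture parameters and sample size are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 2, 20_000
means = np.array([[-3.0, 0.0], [3.0, 0.0]])  # two well-separated components (assumed)

# Sample from an equal-weight mixture with identity covariance per component
comp = rng.integers(0, 2, size=n)
Z = means[comp] + rng.normal(size=(n, K))

def log_gmm_density(Z):
    """Log-density of the equal-weight mixture, via a stable log-sum-exp."""
    sq = ((Z[:, None, :] - means[None]) ** 2).sum(-1)             # (n, 2) squared distances
    log_comp = -0.5 * sq - 0.5 * K * np.log(2 * np.pi) + np.log(0.5)
    m = log_comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True)))[:, 0]

# Monte Carlo estimate of the mixture entropy H(Z) = -E[log p(Z)]
h_mc = -log_gmm_density(Z).mean()

# Entropy of the single Gaussian matching the mixture covariance
Sigma_Z = np.cov(Z, rowvar=False)
h_gauss = 0.5 * np.linalg.slogdet(2 * np.pi * np.e * Sigma_Z)[1]

print(h_mc, h_gauss)  # the Gaussian entropy should be the larger of the two
```

The gap between the two values shows how loose the single-Gaussian approximation can be when the components are far apart, which is the price paid for an objective that depends only on the covariance.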

We know that the determinant of the matrix is the product of its eigenvalues. For every positive matrix, the maximum eigenvalue is greater than or equal to each diagonal element. Therefore, under the constraint of the eigenvalues of the matrix, the most efficient way is to decrease the off-diagonal terms and increase the diagonal terms. By setting , we therefore fully recover the VICReg objective.
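
The effect of the off-diagonal terms on the determinant can be seen through Hadamard's inequality, which states that the determinant of a positive semidefinite matrix is at most the product of its diagonal entries. The following sketch, using an arbitrary synthetic covariance matrix as an assumption, compares the log-determinant of a correlated covariance with that of its diagonal part.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
# Synthetic correlated features: Gaussian samples mixed by a random matrix
features = rng.normal(size=(200, K)) @ rng.normal(size=(K, K))
C = np.cov(features, rowvar=False)

logdet_full = np.linalg.slogdet(C)[1]      # log|C| with correlations present
logdet_diag = np.sum(np.log(np.diag(C)))   # same variances, off-diagonals set to zero

# Hadamard's inequality: log|C| <= sum_k log C_kk
print(logdet_full, logdet_diag)

# Scaling every variance by s adds K * log(s) to the log-determinant
s = 2.0
print(np.isclose(np.linalg.slogdet(s * C)[1], logdet_full + K * np.log(s)))
```

For fixed variances, removing correlations can only raise the log-determinant, and raising the variances raises it further, which matches the roles of the covariance and variance terms in VICReg.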

4.2 SimCLR vs VICReg

lee2021compressive connected the SimCLR objective (chen2020simple) to the variational bound on the information between representations (poole2019variational) by using the von Mises-Fisher distribution as the conditional variational family. Based on our analysis in Section 4.1, we can identify two main differences between SimCLR and VICReg: (i) The conditional distribution $q(z \mid x')$: SimCLR assumes a von Mises-Fisher distribution for the encoder, while VICReg assumes a Gaussian distribution. (ii) Entropy estimation: the entropy term in SimCLR, $H(Z)$, is approximated based on a finite sum over the input samples. VICReg, however, estimates the entropy of $Z$ only from the first two moments. Creating self-supervised methods that combine these two differences would be an interesting future research direction. In theory, neither set of assumptions is more valid than the other; the better choice depends on the specific task and on computational constraints.

4.3 Empirical Evaluation

The next step is to verify the validity of our assumptions. Based on the theory presented in Section 3.3, the conditional output density reduces to a single Gaussian with decreasing input noise. We validated this using a ResNet-18 model trained with SimCLR or VICReg on the CIFAR-10 dataset (Krizhevsky09learningmultiple). From the test dataset, we draw Gaussian samples around each image and analyze whether these samples remain Gaussian (for each image) at the penultimate layer of the DN, that is, before the linear classification layer, independently for each output dimension. Then, we employ D'Agostino and Pearson's test (10.2307/2334522) to compute the p-value of the normality test under the null hypothesis that the sample comes from a normal distribution. The test measures the deviation from normality using a transformation of the sample skewness and kurtosis. The process is repeated for different noise standard deviations. Figure 1 shows the p-value as a function of the normalized standard deviation. We observe that the network's output is indeed Gaussian with high probability for small input noise. As we increase the input noise, the network's output becomes less Gaussian until the null hypothesis of normality can be rejected with confidence. Moreover, we can see that VICReg is, interestingly, more "Gaussian" than SimCLR, which may have to do with the fact that it optimizes only the second moments of the density distribution to regularize $H(Z)$.
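
The testing procedure can be sketched on a toy model with `scipy.stats.normaltest`, which implements D'Agostino and Pearson's test. The single ReLU layer below is only a stand-in for the penultimate features of a trained network, and its weights, the sample count `n`, and the noise levels are assumptions made for this illustration, not the experimental setup of the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
D = 4
W = rng.normal(size=(D, D))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # unit-norm rows
x_star = rng.normal(size=D)                    # stand-in for one test image
b = 1.0 - W @ x_star                           # every pre-activation equals 1 at x_star

def features(X):
    """Toy stand-in for penultimate-layer features: a single ReLU layer."""
    return np.maximum(X @ W.T + b, 0.0)

def normality_pvalues(sigma, n=2000):
    """Per-dimension p-values for Gaussian input noise of scale sigma."""
    X = x_star + sigma * rng.normal(size=(n, D))
    return stats.normaltest(features(X), axis=0).pvalue

p_small = normality_pvalues(sigma=0.01)  # samples stay in one affine region
p_large = normality_pvalues(sigma=5.0)   # many samples are clipped by the ReLU
print(p_small)
print(p_large)
```

With small noise, the features are an affine function of Gaussian inputs, so the null hypothesis is true and the p-values behave like draws from a uniform distribution. With large noise, clipping makes the features strongly skewed and the p-values become vanishingly small.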

5 Conclusions

In this study, we examine SSL's objective function from an information-theoretic perspective. Our analysis, based on transferring the required stochasticity to the input distribution, shows how to derive SSL objectives. Therefore, it is possible to obtain an information-theoretic analysis even when using deterministic DNs. In the second part, we rediscovered VICReg's loss function from first principles and showed its implicit assumptions. In short, VICReg performs a crude upper-bound estimate of the output density entropy by approximating this distribution with a Gaussian matching the first two moments. Finally, we empirically showed that our assumptions hold in practice, supporting our novel understanding of VICReg. Our work opens many new paths for future research. One is a better estimation of information-theoretic quantities that fits our assumptions. Another exciting research direction is to identify which SSL method is preferred based on data properties.