
Toward a Geometrical Understanding of Self-supervised Contrastive Learning

05/13/2022
by   Romain Cosentino, et al.

Self-supervised learning (SSL) is currently one of the premier techniques to create data representations that are actionable for transfer learning in the absence of human annotations. Despite their success, the underlying geometry of these representations remains elusive, which obfuscates the quest for more robust, trustworthy, and interpretable models. In particular, mainstream SSL techniques rely on a specific deep neural network architecture with two cascaded neural networks: the encoder and the projector. When used for transfer learning, the projector is discarded since empirical results show that its representation generalizes more poorly than the encoder's. In this paper, we investigate this curious phenomenon and analyze how the strength of the data augmentation policies affects the data embedding. We discover a non-trivial relation between the encoder, the projector, and the data augmentation strength: with increasingly larger augmentation policies, the projector, rather than the encoder, is more strongly driven to become invariant to the augmentations. It does so by eliminating crucial information about the data by learning to project it into a low-dimensional space, a noisy estimate of the data manifold tangent plane in the encoder representation. This analysis is substantiated through a geometrical perspective with theoretical and empirical results.



1 Introduction

Figure 1: Evolution throughout contrastive SSL training of the rank of a linear projector of dimension for different augmentation strengths, and the associated accuracy obtained on Cifar100 by using the representation extracted in the encoder space. Large, moderate, and small augmentations refer to the strength of the data augmentation applied to the input samples (see Table 2 for each configuration). The smaller the strength of the data augmentation policy, the less the projector suffers from dimensional collapse. However, when the projector is affected by a substantial dimensional collapse, the encoder representation becomes suitable for the downstream task. In this work, we demystify this intriguing relationship between augmentation strengths, encoder embedding, and projector geometry.

Training models that are capable of extracting meaningful data embeddings without relying on labels has recently reached new heights with the substantial development of self-supervised learning (SSL) methods. These approaches replace the labels required for supervised learning with augmentation policies, which define the desired invariances of the trained representation. For each dataset, the augmentation policies selected in practice result from vast cross-validations, as there is little understanding of their implications for the learned representation. From these augmentation policies, multiple transformed instances of the same sample are generated, and the network is trained so that their embeddings coincide. After training, the network can be used as a mapping for other datasets to obtain an efficient data representation for various downstream tasks.

To the surprise of many, SSL’s performance is competitive with supervised learning methods (DBLP:journals/corr/abs-2104-14294; chen2020big; DBLP:journals/corr/abs-2002-05709; DBLP:journals/corr/abs-2006-07733; DBLP:journals/corr/abs-1912-01991; DBLP:journals/corr/abs-2104-14548; DBLP:journals/corr/abs-2105-04906; DBLP:journals/corr/abs-2103-03230; DBLP:journals/corr/abs-1807-05520; DBLP:journals/corr/abs-2005-04966; DBLP:journals/corr/abs-2005-10243), and more importantly, SSL is more effective for transfer learning across most data distributions and tasks (DBLP:journals/corr/abs-2011-13377; DBLP:journals/corr/abs-2104-14294).

Interestingly, most SSL frameworks developed so far do not use the data embedding provided by the network’s output and instead use the representation extracted from an internal layer. In particular, the deep neural network (DNN) that is used for SSL is usually composed of two cascaded neural networks: the encoder (backbone) and the projector. Usually, the encoder is a residual network, or more recently, a vision transformer, and the projector is an MLP (sometimes linear). While the projector output is used for training, only the encoder representation is used for downstream tasks. It is known that the representation at the output of the projector, which is designed to be almost invariant to the selected augmentations, discards crucial features for downstream tasks, such as color, rotation, and shape, while the one at the output of the encoder still contains these features (DBLP:journals/corr/abs-2002-05709; appalaraju2020good). Recently, in (DBLP:journals/corr/abs-2110-09348), the authors shed light on this loss of information and showed that, despite contrastive SSL making use of negative pairs, the output of its DNN suffers from dimensional collapse. This loss of information is explained by the rank deficiency of the projector.

In this work, we observe that the dimensional collapse in contrastive SSL is tied to the strength of the augmentations as shown in Fig. 1, where we display the evolution of the rank of a linear projector during training. As in (DBLP:journals/corr/abs-1909-13719), the notion of strength refers to the amplitude of the data augmentation policies, that is, how much the transformed input samples differ from their original version. For large augmentations, we observe the dimensional collapse phenomenon described in (DBLP:journals/corr/abs-2110-09348). However, we observe that the rank of a linear projector is inversely related to the strength of the augmentations. In Fig. 1 we also display the accuracy obtained using the encoder representation associated with each regime and see that in the absence of dimensional collapse, the encoder representation has poor generalization capability.

In our study, we take a geometrical approach to understand from first principles the relationship between the strength of the augmentation policies and the dimensional collapse phenomenon, its effect on the capability of the encoder to represent the data manifold, and the benefits of using a non-linear projector.

We investigate this relationship between the strength of the augmentation policies, the loss of information of the projector, and the encoder embedding by considering a geometrical point of view on the InfoNCE loss function (oord2019representation). We first derive an interpretable upper bound of the InfoNCE loss function in Sec. 3. We then leverage this upper bound to derive the following contributions, guided by both theoretical and empirical evidence using the Cifar dataset:


  • Dimensional collapse of the projector and data augmentation strength. We show that the dimensional collapse of the projector depends on the number of augmentations, the per-sample augmentation strength, and the initial distribution of the data in the encoder space. To do so, we theoretically analyze the two forces driving the InfoNCE loss function: undoing the effect of the data augmentation policies and reducing the similarity between each datum and its negative samples (Sec. 4).

  • Implications of the augmentation policy strength for the estimation of the data manifold tangent space. We show that there exists an intricate relationship between the estimation of the data manifold tangent space performed by the encoder and the strength of the augmentations. In particular, in the case of large augmentations, the estimate of the data manifold tangent space is poor at initialization, but as training progresses the augmentations become beneficial for representing the data manifold directions more accurately. In the case of small augmentations, the estimation of the data manifold tangent plane remains poor throughout training, as the InfoNCE loss is then dominated by the goal of reducing the similarity between each datum and its associated negative samples (Sec. 5).

  • Impact of the projector on the estimation of the data manifold by the encoder. We show that the projector aims at projecting the augmented samples onto a subspace spanned by a noisy estimate of the data manifold tangent space. With a linear projector, the encoder is constrained to project the data manifold directions onto a linear subspace, while a non-linear projector enables the encoder to map the tangent space of the data manifold onto a continuous and piecewise affine subspace. The deeper and wider the projector, the more flexibility the encoder has regarding its mapping. Therefore, a non-linear projector allows the encoder to unlock its expressive power (Sec. 6).

Figure 2: The strength of the augmentations (depicted as “power bars”) acts as a weighting term balancing the attraction of augmented samples and the repulsion effect of the negative samples. Contrastive SSL introduces negative samples to avoid feature collapse, aiming to enforce that the learned representation fills the available dimensions. Under the InfoNCE loss, Eq. 1, there is a trade-off between the repulsion effect induced by the negative samples and the invariance property enforced by maximizing the similarity between two augmented versions of the same sample. We show that the strength of the data augmentation policies acts as a per-datum weight reinforcing the invariance term and reducing the effect of the negative samples. Conversely, reducing the strength of the augmentations accentuates the repulsion effect of the negative samples. (Notations detailed in Table 1).

2 Background and notations

input data
augmentation distribution
augmented pair
encoder mapping
projector mapping
encoder and projector mapping
linear projector’s weights
Table 1: Notation reference card

Notations. Throughout this paper, we denote by the original data in , and by the augmented pairs obtained by sampling a distribution of augmentations, , and applying the sampled transformation to each input datum. These augmented pairs are fed into the network , where is the encoder with output dimension and is the projector with output dimension . Notations are summarized in Table 1.

Architecture and hyperparameters.

For all the experiments performed in this work, we used the SimCLR framework (DBLP:journals/corr/abs-2002-05709), trained with the InfoNCE loss function. The encoder is a Resnet with output dimension . The projector is linear with output dimension . The optimization of the loss function is performed using LARS (you2017large) with learning rate , weight decay , and momentum . The dataset under consideration is Cifar (krizhevsky2009learning), a sensible tradeoff between the computational resources needed for training and the challenge it poses to the Resnet (i.e., avoiding a fit that is obtained with too much ease).

Data augmentation policies. The augmentation policies used to train SimCLR, as well as to perform our analysis, are based on: random horizontal flipping, random resized crop, random color jittering, and random grayscaling (see https://pytorch.org/vision/stable/transforms.html for details). Three settings are tested: small augmentations, moderate augmentations, and large augmentations. The hyperparameter details are shown in Table 2, where for the resized crop we display the scale parameter, and for color jittering the parameter governing the brightness, contrast, saturation, and hue factors. Note that the large augmentation setting corresponds to the optimal parameters for obtaining the best representation on Cifar.

Augmentation strength
Large Moderate Small
Horiz. flipping
Grayscaling ()
Resized crop
Color jittering ()
Table 2: Data Augmentation Policies
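To make these settings concrete, here is a minimal torchvision sketch of how such small/moderate/large policies could be instantiated. The specific magnitudes (crop scales, jitter factors, probabilities) are illustrative placeholders chosen by us, not the exact values of Table 2.

```python
# Hypothetical instantiation of the three augmentation-strength settings.
# The magnitudes below are illustrative placeholders, not the paper's exact values.
from torchvision import transforms

def make_policy(strength: str):
    cfg = {
        "small":    dict(scale=(0.9, 1.0),  jitter=0.1, gray_p=0.0),
        "moderate": dict(scale=(0.5, 1.0),  jitter=0.4, gray_p=0.1),
        "large":    dict(scale=(0.08, 1.0), jitter=0.8, gray_p=0.2),
    }[strength]
    j = cfg["jitter"]
    return transforms.Compose([
        transforms.RandomResizedCrop(32, scale=cfg["scale"]),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomApply([transforms.ColorJitter(j, j, j, 0.25 * j)], p=0.8),
        transforms.RandomGrayscale(p=cfg["gray_p"]),
        transforms.ToTensor(),
    ])

# Two independent draws from the same policy yield the augmented pair of a sample.
augment = make_policy("large")
```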

Loss function. The InfoNCE loss function, widely used in contrastive learning (chen2020simple; chen2020big; misra2019selfsupervised; he2020momentum; dwibedi2021little; yeh2021decoupled), is defined as

$$\mathcal{L}_{\mathrm{InfoNCE}} = \frac{1}{N}\sum_{n=1}^{N} -\log \frac{\exp\!\left(\langle z_n^{(1)}, z_n^{(2)}\rangle/\tau\right)}{\exp\!\left(\langle z_n^{(1)}, z_n^{(2)}\rangle/\tau\right) + \sum_{j \in \mathcal{N}_n} \exp\!\left(\langle z_n^{(1)}, z_j\rangle/\tau\right)} \qquad (1)$$

where $z_n^{(a)} = (g \circ f)(x_n^{(a)})$ denotes the projector output of the $a$-th augmentation of the $n$-th sample and $\tau$ is the temperature. Thus, for each $n$, $\mathcal{N}_n$ consists of the indices of the negative samples related to $x_n$, i.e., all the data points except the two augmentations $x_n^{(1)}$ and $x_n^{(2)}$. Note that all projector outputs are normalized, that is, $\|z_n^{(a)}\|_2 = 1$.

The numerator of the loss function in Eq. 1 favors a similar representation for two augmented versions of the same datum, while the denominator tries to increase the distance between each first augmented sample and the components of all other pairs. Note that, in practice, this loss is often symmetrized. For all the experiments in this work, the temperature parameter is set to .
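For concreteness, the following is a minimal NumPy sketch of the (non-symmetrized) loss described above, assuming unit-normalized projector outputs z1, z2 for the two augmented views and a temperature tau; the variable names and the exact handling of the positive in the denominator follow common SimCLR practice and may differ in detail from the paper's formulation.

```python
import numpy as np

def info_nce(z1, z2, tau=0.5):
    """Sketch of the non-symmetrized InfoNCE loss: the anchor is the first view,
    the positive is the second view of the same sample, and the negatives are
    both views of all other samples. z1, z2: (N, d) unit-normalized embeddings."""
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)               # (2N, d) all embeddings
    sim = (z1 @ z.T) / tau                             # (N, 2N) anchor-to-all similarities
    pos = np.sum(z1 * z2, axis=1) / tau                # (N,)  positive-pair similarities
    neg_mask = np.ones_like(sim, dtype=bool)           # drop both views of the anchor's own sample
    neg_mask[np.arange(n), np.arange(n)] = False
    neg_mask[np.arange(n), n + np.arange(n)] = False
    denom = np.exp(pos) + np.sum(np.exp(sim) * neg_mask, axis=1)
    return float(np.mean(np.log(denom) - pos))
```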

Rank estimation.

The numerical estimation of the rank is based on the total variance explained, as recently used to analyze the rank of DNN embeddings (NEURIPS2020_d5ade38a). For a given matrix $A$ with singular values $\sigma_1 \geq \sigma_2 \geq \dots$, its estimated rank w.r.t. a threshold $\delta > 0$ is defined as

$$\mathrm{rank}_{\delta}(A) = \#\{\, i : |\sigma_i| > \delta \,\}, \qquad (2)$$

that is, it is the number of singular values whose absolute values are greater than $\delta$.
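A small sketch of this estimator, assuming the thresholding-on-singular-values form stated above (the symbol names A and delta are ours):

```python
import numpy as np

def estimated_rank(A, delta=1e-3):
    """Estimated rank of Eq. 2: the number of singular values of A above delta."""
    s = np.linalg.svd(A, compute_uv=False)   # singular values, in descending order
    return int(np.sum(np.abs(s) > delta))
```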

3 An interpretable InfoNCE upper bound

In this section, we propose an upper bound on the InfoNCE loss that will allow us to derive insights into (i) the impact of the strength of the data augmentation policies on the projector and encoder embeddings, (ii) the nature of the information discarded by the projector, and (iii) how increasing the depth and width of the projector affects the encoder embedding. We provide an overview of InfoNCE in Appendix B. The following proposition states the InfoNCE upper bound in the case of a linear projector.

Proposition 3.1.

Considering a linear projector , InfoNCE is upper bounded by

(3)

where

(4)
(5)

and . (Proof in Appendix C.2).

While in the InfoNCE loss the negative samples taken into account for each augmented sample lie within a ball whose radius is governed by the temperature (see Appendix B for detailed explanations), in our upper bound in Eq. 3 we only consider the negative sample that most resembles its corresponding first augmented sample, denoted by .

The bound of Eq. 3 captures the essential trade-offs in SSL: corresponds to the maximization of the similarity between two versions of the same augmented data in the projector space, while approximates the repulsion term that aims at suppressing the collapse of the representation by using the closest negative sample. In Fig. 3, we show that InfoNCE and behave similarly during training. While tightening the bound seems possible, is intuitive and sufficient to support our analysis.

Figure 3: The upper bound in Eq. 3 tracks the InfoNCE loss (mean and standard deviation over runs) throughout training with a batch size of .

4 Dimensional collapse hinges on augmentation strength

In this section, we leverage the upper bound in Eq. 3 to show that the dimensional collapse of the projector is directly related to the strength of the augmentation policies, as observed in Fig. 1. In particular, the stronger the augmentations, the lower the rank of the projector. We first consider and from Eq. 3 separately, before addressing their interaction. We provide in Fig. 2 a description of the interactions between the invariance term and the repulsion term with respect to the strength of the augmentations.

(1) The invariance term in Eq. 4 promotes the augmented samples to coincide; we show here how it affects the rank of the projector and how stronger augmentations cause the projector to be invariant. We first develop some intuitions by considering the case where the transformations can be described by a linear action in the encoder space and then generalize this result to non-linear transformations.

(a) Linear transformations in encoder space. This assumption can be formally described by , where , that is, in the encoder space, the augmented samples are related by a linear translation. Under this assumption, the invariance term becomes

(6)

which can be minimized by maximizing the cosine similarity between and . This can be achieved by projector weights, i.e., , such that either is collinear to or the lie in the null space of . In both cases, the cosine similarity is maximized. Formally, these two cases correspond to:

  (i) , , s.t. ;

  (ii) , .

In Fig. 4, we show that, at any stage of the learning, the embedded data span all the available directions in the encoder space. Therefore, the first case will almost surely lead to the collapse of the projector. In the second case, the rank of the projector decreases as the dimension of the span of the augmentation directions increases. The rank of the projector can be maintained if the encoder is able to map these latter directions onto a low-dimensional subspace.

If one considers multiple augmentations that point in different directions in the encoder space, the dimension of the span of the ’s increases. Similarly, the larger the spread (i.e., the difference between the maximum and minimum strength) of the augmentation policy distribution , the higher the dimension of the span of the . We provide empirical evidence regarding this aspect in Appendix D.
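The null-space mechanism above can be illustrated numerically. In the sketch below (our toy construction, not the paper's experiment), the augmentation acts as a translation along a few fixed directions in the encoder space; placing those directions in the null space of a square linear projector makes the projector output augmentation-invariant while lowering its rank by the number of spanned directions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 3                                   # encoder/projector dimension, number of augmentation directions
h = rng.normal(size=(100, d))                 # first-view embeddings in the encoder space
G = rng.normal(size=(m, d))                   # augmentation directions (linear action in encoder space)
h_aug = h + rng.normal(size=(100, m)) @ G     # second views, displaced along span(G)

# Case (ii): place span(G) inside the null space of a square linear projector W.
complement = np.linalg.svd(G)[2][m:]          # orthonormal basis of the (d - m)-dim complement of span(G)
W = rng.normal(size=(d, d - m)) @ complement  # d x d projector whose null space contains span(G)

print(np.allclose(h @ W.T, h_aug @ W.T))      # True: the projector output is augmentation-invariant
print(np.linalg.matrix_rank(W))               # d - m = 5 < d: dimensional collapse of the projector
```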

Figure 4: Encoder space log-singular values for the Cifar dataset (). The log-spectrum is evaluated at different training times: initialization (dotted line), half-training time (dashed line), and after training (solid line). In these three settings, the estimated rank (see Eq. 2) leads to , i.e., the embedded transformed samples span all the available directions in the encoder space (albeit with a slight rank deficiency at initialization).
Figure 5: Visualization of the Euclidean distance between and for different training configurations: small, moderate, and large augmentations. Under large augmentations, the repulsion term is deemphasized. Under small augmentations, dominates the loss. The augmentation strength acts as a per-datum scalar weight that governs how much the representation needs to be invariant as opposed to how much each augmented sample and its nearest neighbor should be projected onto opposite directions.

(b) Non-linear transformations in encoder space. We formalize non-linear transformations as the result of the application of Lie groups onto the data embedded in the encoder space. Formally, , where is the generator of a Lie group that characterizes the type of transformation and is a scalar denoting the strength of the transformations induced by applied onto the datum. A primer on Lie group transformations is provided in Appendix A.

We can now express as a function of the strength parameter by linearizing the aforementioned exponential map (details in Appendix A)

(7)

This expression is similar to Eq. 6, but we now have access to the strength parameters and the generator of the transformation (an intuitive example of such a transformation is given in Appendix A). As in Eq. 6, to minimize this term, the projector has two possibilities:

  (i) , , s.t. ;

  (ii) , .

In the first case, we also refer to Fig. 4 to assert that this will lead to the collapse of the output of the projector. In the second case, we see that this condition is satisfied if the column space of is in the null space of . This shows that, in order to maximize the similarity, one solution for the projector is to align its kernel with the generators underlying the augmentation policies.

As every augmentation policy has its own generator, we observe that increasing the number of transformations will increase the dimension of the null space of , except if their generators span similar directions. We also note that each datum has its own strength parameter . The larger the strength, the more the similarity in Eq. 7 decreases, except if the augmentation is collinear with the data. The larger the augmentation strength, the more it penalizes the loss function for each datum. Therefore, the strength of the transformation corresponds to a per-datum scalar weight that governs how strongly the transformation drives the projector toward an invariant representation.
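As a concrete (and deliberately tiny) illustration of this Lie-group model, the sketch below applies the generator of 2-D rotations to a point and compares the exact group action with the linearization used above; the generator, point, and strength are our own example values.

```python
import numpy as np
from scipy.linalg import expm

G = np.array([[0.0, -1.0],
              [1.0,  0.0]])                   # generator of 2-D rotations
h = np.array([1.0, 0.0])                      # a point in (a 2-D slice of) the encoder space
theta = 0.1                                   # per-datum augmentation strength

h_aug_exact = expm(theta * G) @ h             # exact group action: e^{theta G} h
h_aug_linear = h + theta * (G @ h)            # linearization used in Eq. 7 / Eq. 13

print(np.abs(h_aug_exact - h_aug_linear).max())  # ~ theta^2 / 2: small for gentle augmentations
```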

Figure 6: Percentage of matching labels between and for the fine and coarse labels of the Cifar dataset when training with different augmentation strengths (panels, left to right: small, moderate, and large augmentations). Recall that the index of is determined in the projector space as defined in Proposition 3.1. We observe that during training, across all augmentation regimes, the amount of shared semantic information increases. As the augmentation strength increases, the time needed for this semantic sharing to converge increases as well: under small augmentations it converges after a couple hundred epochs, whereas under large augmentations it has not yet converged after epochs. The strength of the augmentation establishes the trade-off between the invariance and repulsion terms (Sec. 4): under small augmentations, the repulsion term dominates, which results in a poor representation of the data manifold.

(2) We now consider the repulsion term in Eq. 5. To minimize this term, the projector maps and its nearest neighbor onto diametrically opposed directions. If it were feasible for the projector to map each and onto diametrically opposed directions, the resulting representation would not encapsulate any information about the data manifold.

Let us consider the case where, at initialization, the encoder captures salient features of the data, i.e., semantically similar data are nearby. In that case, if dominates, data that are semantically similar will be pushed away from each other. Thus, for encoder architectures that already capture the salient features of the data at initialization, such as Resnets and vision transformers, strong augmentations are beneficial to prevent from dominating. On the contrary, for encoders that are less efficient at capturing the image manifold information, such as MLPs, semantically similar data might not be close to each other in the DNN’s embedding; repulsion is therefore necessary to achieve an efficient data embedding.

In Fig. 5 we provide the histogram of the distances between and . As we understood from the previous discussions, when the augmentations are small, this distance tends to be higher as the term has a greater impact than on the InfoNCE loss. Conversely, for stronger augmentations, the term dominates.

Therefore, the trade-off between having strong augmentations ( dominates) and having small augmentations ( dominates) is based on the following factors: the number of augmentations, the per-sample augmentation strength, and the initial distribution of the data in the encoder space.

Note that, in this section, we assume that the linearization capability of the Resnet with respect to the transformations enables us to express the transformations in terms of the generator of the transformation and the per-sample strength. In practice, this linearization capability might be limited, and therefore the projector discards more than just the generator induced by the augmentation policies. This loss of information has been empirically observed in (DBLP:journals/corr/abs-2002-05709; appalaraju2020good).

5 The effect of augmentation strength on the encoder embedding

In this section, we investigate how the augmentations affect the geometry of the representation. Given that the outputs of the projector have equal norm, the upper bound in Eq. 3 becomes

(8)

where , living in the encoder’s output space, will be important to describe the geometry of both the encoder and the projector. Fig. 7 presents a cartoon visualization of this geometry.

In Sec. 4 we showed how the augmentation strength affects the projector’s rank. We now provide a different geometrical perspective, where the projector attempts to map onto an estimate of a manifold tangent space induced by . We investigate how close to the data manifold this tangent space is and how the strength of the augmentation affects this estimation process. Specific details on manifold tangent plane estimation can be found in (tenenbaum2000global; bengio2005non; bengio2005non2; NIPS2017_86a1793f).

Data manifold tangent plane estimation: augmentation strengths. We describe here the span of the displacement vectors, , and how they evolve during training with respect to the augmentation strength. In order to qualitatively analyze , we plot in Fig. 6 the amount of label sharing between and as training progresses. Note that Cifar contains fine and coarse classes. We observe that semantic similarity between and increases with the augmentation strength.

In the small augmentations regime, at initialization, and are close to each other, but do not share the same label. During training, the dissimilarity between and increases as dominates the loss as discussed in Sec. 4 and observed in Fig. 6. The term of the loss does not favor specific directions. Therefore, it is hard to characterize how (the approximation of) the data manifold tangent plane in the encoder space evolves during training in the case of small augmentations.

In the large augmentations regime, at initialization, and are distant from each other (as is near ). Thus, the directions spanned by are a coarse estimation of the underlying data manifold. However, from Sec. 4, we know that during training, dominates, so that becomes closer to . Therefore, for large augmentations, the estimation of the data manifold tangent plane becomes finer as training progresses. This phenomenon is observed in Fig. 6, where the percentage of sharing similar semantic content with steadily increases during training.
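The label-sharing statistic of Fig. 6 can be approximated with a few lines of code. The sketch below (ours, simplified to a single view per sample) finds, for each embedding, its most similar other embedding in the normalized projector space and checks whether the two carry the same label.

```python
import numpy as np

def label_sharing(z, labels):
    """Fraction of samples whose most similar other sample (cosine similarity in
    the normalized projector space) shares their label. z: (N, k), labels: (N,)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)            # exclude the sample itself
    nearest = np.argmax(sim, axis=1)          # index of the closest candidate negative
    return float(np.mean(labels[nearest] == labels))
```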

Figure 7: Intuitive illustration of the projector aiming to align its column span with the noisy estimate of the data manifold tangent space performed by the encoder. We visualize, in the encoder output space, the noisy estimate of the data manifold: a continuous and piecewise affine surface representing (top). We shed light on the aim of the projector by zooming into a region of this estimated manifold: the projector aligns its column span (yellow) with the locally estimated data manifold tangent space. In the case of a linear projector, the columns of should align with the entire estimated data manifold (top), therefore forcing the encoder to provide a linear estimate of the data manifold. In the case of a non-linear projector, however, this alignment is performed locally. As the span of the projector is then also continuous and piecewise affine, using a non-linear projector lifts this constraint on the encoder. The deeper and wider the projector, the larger its number of regions, and therefore the less constrained the encoder is with respect to its estimation of the data manifold tangent space. Note that our analysis of the strength of the augmentations can be applied to the non-linear case by considering it locally, i.e., for each local affine map.
Figure 8: Fraction of variance unexplained, Eq. 11, as a measure of misalignment between the column space of the linear projector and , where is defined as (mean and standard deviation over runs). We observe that the amount of unexplained variance decreases drastically after initialization, showing that the projector attempts to approximate the subspace .

6 The projector’s approximate estimation of the data manifold

We now propose to understand the relationship between the noisy estimate of the data manifold tangent space, defined by , and the projector. We provide the interpretation of the projector in the encoder output space by considering the column space . Specifically, we find that, during training, the span of the projector aligns with the noisy . Under this geometrical picture, we understand the role of the depth and width of a non-linear projector. For the sake of clarity, we first develop the case of a linear projector and then address the case of a non-linear projector using its continuous and piecewise affine formulation.

Linear Projector. We focus our analysis by considering the following reformulation of Eq. 8 in order to describe the role of the projector in the encoder space

(9)

where in the case of a linear projector

(10)

i.e., the projection of onto the column space of .

To minimize Eq. 9, should enable the projection of onto . Therefore, the columns of need to include in their span .

In order to gain insight into the relation between the projector and this subspace, we provide in Fig. 8 the fraction of variance of unexplained by , as follows

(11)

We observe that the fraction of variance unexplained decreases during training, and we posit that the remaining unexplained variance is mainly due to the fact that the encoder embedding of these semantic directions is non-linear; the Resnet encoder is not capable of linearizing .
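A sketch of this misalignment measure (our implementation of the idea behind Eq. 11; the exact normalization used in the paper may differ): project the displacement vectors onto the subspace of the encoder space spanned by the projector's weights and report the residual energy.

```python
import numpy as np

def unexplained_variance(W, C):
    """Fraction of the energy of the displacement vectors C (one per row, in the
    encoder space) not captured by the subspace associated with the linear
    projector W (here taken as the span of W's rows, i.e. the column space of W^T)."""
    _, s, Vt = np.linalg.svd(W, full_matrices=False)
    basis = Vt[s > 1e-6 * s.max()]            # orthonormal basis of that subspace
    residual = C - (C @ basis.T) @ basis      # component of C outside the subspace
    return float(np.sum(residual ** 2) / np.sum(C ** 2))
```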

Our geometrical understanding of the projector can be summarized as follows: for each batch, one obtains in the encoder space an estimate of the data manifold directions by taking the difference between and ; the projector then attempts to fit the subspace spanned by these vectors. Given the nature of this subspace described in Sec. 5, we conclude that the projector attempts to align each datum, embedded in the encoder space, onto the directions of the estimated data manifold tangent space. Therefore, the projector discards the information that is not part of the subspace .

From this, we understand that in the case of small augmentations, due to the proximity of , , and at initialization, the error incurred by the projector in achieving the desired projections is small. This also highlights why, in Fig. 1, the rank of the projector is not modified when training under small augmentations: its objective is already met at initialization.

Non-linear Projector. We now discuss how the width and depth of a non-linear projector affect its objective of projecting each onto . In particular, we show that when using an MLP, the approximation of is performed locally, i.e., each part of the encoder output space is mapped by a different (local) affine transformation induced by the projector. Each affine transformation depends on both the type of non-linearity and the weights of the projector’s MLP. Therefore, the aforementioned discussion regarding the strength of the augmentations transfers to this case in a local manner.

We consider MLP projectors employing non-linearities such as (leaky-)ReLU, which are continuous piecewise affine operators living on a partition $\Omega$ of the encoder’s output space. In fact, the projector acts as an affine transformation on each non-overlapping region of the encoder space. For more details regarding this approach to partitioning the input space of deep neural networks, the reader should refer to (balestriero2018spline; balestriero2019geometry; DBLP:journals/corr/abs-2009-09525). These MLPs, denoted here by $g$, can be expressed using the following closed form (a zero-bias MLP projector is used for the sake of clarity)

$$g(h) = \sum_{\omega \in \Omega} \mathbb{1}_{\{h \in \omega\}} \, A_{\omega} h, \qquad (12)$$

where $\Omega$ defines a partition of the encoder’s output space and $A_{\omega}$ is the linear map associated with region $\omega$.

For each region $\omega$ in the encoder’s output space, the projector acts as a linear mapping, defined by $A_{\omega}$. It is now clear that whereas in the linear case the InfoNCE loss adjusts to fit , in the non-linear case this fitting is performed locally; for each region in the encoder space, the corresponding $A_{\omega}$ is trained to fit the samples belonging to that region onto the subspace . This is formally described as .

The deeper and wider the MLP, the larger its number of regions and the smaller their volumes. Therefore, the granularity of the projection and the fitting of the displacement vectors become more refined as the depth increases (montufar2021sharp). This interpretation helps us understand why in (DBLP:journals/corr/abs-2006-10029) it was empirically shown that deeper and larger MLPs can improve the accuracy on downstream tasks. In fact, a linear projector forces the encoder to linearize the subspace , while in the case of a non-linear projector, these directions can be mapped to a non-linear manifold (under the constraint that the projector’s MLP can fit it). The wider and deeper the MLP, the less the encoder is constrained.
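The continuous piecewise affine picture can be checked directly on a small zero-bias ReLU MLP: inside the region determined by a point's ReLU activation pattern, the network coincides with a single matrix acting on the input. The sketch below (our example; sizes and weights are arbitrary) recovers that local matrix from the masked weights.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(32, 16))                 # zero-bias two-layer ReLU projector (arbitrary sizes)
W2 = rng.normal(size=(8, 32))

def projector(h):
    return W2 @ np.maximum(W1 @ h, 0.0)

def local_matrix(h):
    """Local linear map of the region containing h: the weights masked by h's ReLU pattern."""
    mask = (W1 @ h > 0).astype(float)
    return W2 @ (mask[:, None] * W1)

h = rng.normal(size=16)
A = local_matrix(h)
print(np.allclose(projector(h), A @ h))                   # True: the MLP is linear inside the region
print(np.allclose(projector(1.01 * h), A @ (1.01 * h)))   # still True: 1.01*h shares h's activation pattern
```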

7 Related Work

Work | Encoder Analysis | Projector Analysis | Theoretical Analysis | Empirical Analysis | Theoretical Aug. Model | Empirical Aug. Model
Our work | Yes | Linear / Nonlinear | Encoder | Encoder | Lie generator | Conv. practice
(DBLP:journals/corr/abs-2110-09348) | No | Linear | Projector | Encoder | Additive noise | Gaussian noise
(huang2021towards) | No separate analysis | | Projector | Encoder | (-augmentation | Conv. practice
(DBLP:journals/corr/abs-2005-10242) | No separate analysis | | Projector | Encoder | - | Conv. practice
(haochen2021provable) | No separate analysis | | Projector | - | Gaussian noise | -
(DBLP:journals/corr/abs-2005-10243) | No separate analysis | | Projector | Encoder | - | Conv. practice
(wang2021understanding) | No separate analysis | | - | Encoder | - | Conv. practice
Table 3: Related Work Table. Conv. stands for conventional.

We summarize in Table 3 the main differences with related work. There are three key aspects differentiating this work. (a) Encoder Analysis: We provide both theoretical and empirical insights on the encoder embedding, i.e., the actual output used for downstream tasks. In contrast, the authors in (DBLP:journals/corr/abs-2110-09348) do not analyze the learning in the encoding network; they investigate an end-to-end linear network that directly connects the input to the overall output fed to the loss function. Similarly, (DBLP:journals/corr/abs-2005-10242) theoretically analyzes an end-to-end network mapping the input to an output living on the unit sphere and fed into the loss function. (b) Nonlinear Projector: Our analysis describes how a nonlinear projector impacts the encoder embedding and analyzes the effect of its depth and width on the geometry of the encoder. (c) Lie generator modeling of data augmentations: We consider Lie group transformations in the latent space as a way to model gentle augmentations (Sec. 4.1.b). This is motivated by the fact that the encoder output is a continuous piecewise affine low-dimensional manifold (DBLP:journals/corr/abs-1905-12784; balestriero2019geometry), in which the Lie group defines transformations between points that can model those observed in various natural datasets (connor2021variational; pmlr-v145-cosentino22a). While existing approaches (DBLP:journals/corr/abs-2110-09348; huang2021towards; haochen2021provable) are distance-based, ours provides a way to capture geometric information via the Lie generator.

8 Conclusions

In this work, we have investigated the intricate relationship between the strength of the augmentation policies, the encoder representation, and the projector’s effective dimensionality in the context of self-supervised learning. Our analysis justifies the use of large augmentations in practice. Under small augmentations, the data representations are haphazardly mutually repulsed, leading them to reflect the semantics poorly. Under large augmentations, however, SSL attempts to approximate the data manifold, extracting in the process useful representations for downstream transfer tasks. More generally, our analysis provides foundations for further understanding and improvement of SSL.

9 Acknowledgement

This material is supported by Defense Advanced Research Projects Agency (DARPA) under the Learning with Less Labels (LwLL) program.

References

Appendix A Lie Group

A Lie group is a group that is also a differentiable manifold, for instance, the group of rotations.

One of the main advantages of having a group with a differentiable manifold structure is that it can be defined by an exponential map: , where is the infinitesimal operator of the group. The infinitesimal operator thus encapsulates the group information.

The group action, defined as , corresponds to the mapping induced by the action of the group element onto the data .

One can exploit the Taylor series expansion of the exponential map to obtain its linearized version

$$\tilde{x} = e^{\theta G} x \approx (I + \theta G)\, x = x + \theta\, G x, \qquad (13)$$

where $x$ is the data, $\tilde{x}$ its transformed version with respect to the group induced by the generator $G$, and $\theta$ the strength of the transformation. For more details regarding Lie groups and the exponential map, refer to (hall2015lie).

Appendix B The InfoNCE Framework

We begin our analysis by considering the InfoNCE loss function. This loss function is commonly used in contrastive learning frameworks and has the benefit of admitting multiple equivalent formulations that ease the derivation of insights regarding self-supervised learning embeddings. In this section, we first propose a novel formulation of the InfoNCE which allows us to formalize common intuitions regarding contrastive learning losses. We then leverage this reformulation to provide an upper bound on the InfoNCE loss that will be central to our analysis of the role and properties of the projector. W.l.o.g., the proof will first be derived with so as not to introduce further notation.

The following proposition is a re-expression of the InfoNCE that allows us to view this loss as a regularized non-contrastive loss function, where the regularization terms provide insight into how this loss actually behaves.

Proposition B.1.

The InfoNCE can be reformulated as

(14)

where denotes the entropy, is a probability distribution such that , and recall that (Proof in Appendix C.1).

When minimizing Eq. 14, the first term forces the two augmented data in the projector space to be as similar as possible, which is the intuition behind most self-supervised learning losses. Note that relying only on such a similarity loss often leads to the collapse of the representation, an issue that is tackled in non-contrastive self-supervised learning using various hacks (DBLP:journals/corr/abs-2006-09882; DBLP:journals/corr/abs-2006-07733).

How does InfoNCE select its negative pairs?

From Proposition B.1, we observe that, when training under the InfoNCE, two additional terms act as a regularization to this non-contrastive similarity loss function, hindering the representation’s collapse. In particular, the second term, , pushes the first augmented data in the projector space to be as different as possible from a weighted average of the other data. This expectation depends on the probability distribution , which assigns high probability to data that are similar to . Thus, the more the resemble , the more they account for the loss, and hence the more they tend to be repelled. Now, it is clear that if the distribution is uniform, then all the data will tend to account for the same error. The third term avoids this by forcing the distribution to have local support, and therefore to only consider a few instances of , in particular the ones that are most similar to . This last term thus enforces the to be different from each other by being repelled with different strengths from .

How does the temperature parameter affect these regularizations?

From Proposition B.1, we develop some understanding of the temperature parameter ; in particular, this parameter influences the distribution , which in turn characterizes how local or global the set of data taken into account in the repulsion process is, allowing the differentiation between the data embeddings. We first observe that, when , , which implies that only the closest datum to that is not is taken into account in the error. Besides, when , ; therefore, this regularization term does not affect the loss function. This shows that increasing the temperature parameter reduces the support of the candidates for negative pairs and that the ones considered are those most similar to the considered datum. Therefore, the temperature parameter affects the support of the per-datum distribution underlying the regularization terms, and . From this, it is clear that having only one temperature parameter for the entire dataset is not adequate, as the locality of the distribution should be aligned with the volume of the data required to be repelled.
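The locality argument above can be visualized with a small sketch (ours): weight each candidate negative by a softmax over its similarity to the anchor and observe how the mass concentrates on the most similar negatives as the distribution sharpens. The concentration parameter beta below plays the role the temperature plays in the paper's parameterization.

```python
import numpy as np

def negative_weights(z_anchor, z_negatives, beta):
    """Weights over candidate negatives, proportional to exp(beta * similarity)."""
    sims = z_negatives @ z_anchor
    w = np.exp(beta * sims)
    return w / w.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=(5, 8))
z /= np.linalg.norm(z, axis=1, keepdims=True)
anchor, negatives = z[0], z[1:]

for beta in (0.1, 1.0, 10.0):
    p = negative_weights(anchor, negatives, beta)
    print(beta, np.round(p, 3))   # sharper distribution -> mass concentrates on the most similar negative
```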

Appendix C Proofs

C.1 Proposition B.1

Proof.
(15)

Let us denote by , and let us consider the loss function for each datum, denoted by .

Let us now define a probability distribution such that .

We then compute the entropy of such a distribution,

(16)

Thus,

From we obtain

Averaging over the concludes the proof. ∎

C.2 Proposition 3.1

Proof.

We consider the result of Proposition B.1 and we take the derivative of (described in Proof C.1) with respect to ,

(17)

Then,

Now, given that , we have

Therefore,

Now, defining , we have

(18)

Now, recall that for the sake of clarity we used . Adding the normalization and assuming a linear projector, i.e., with , we obtain

(19)

Averaging over the concludes the proof. ∎

Figure 9: Encoder space log-singular values for the transformed Cifar dataset, where the encoder space dimension is and the linear projector has output dimension . The log-spectrum is evaluated at different training times: initialization (dotted line), half-training time (dashed line), and after training (solid line). In these three settings, the calculation of the rank following Eq. 2 leads to , that is, the representation induced by the encoder is full rank (and close to full rank at initialization) until the end of the training, as in the case where the input data are not transformed (Fig. 4).

Appendix D Toy-example - Rank versus Augmentation Strength

Figure 10: Rank of the covariance matrix of rotated versions of a random one-hot image of dimension (mean and standard deviation over runs). We uniformly sample the angle between and the -axis angle value. The larger the transformation, the higher the rank of the collection of images, and therefore the lower the rank of a linear map required to provide a map invariant to this specific transformation.
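The toy experiment can be reproduced in spirit with the sketch below (ours; the image size, angle ranges, and interpolation settings are arbitrary choices, and the hot pixel is placed off-center so that rotations visibly move it).

```python
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(0)
size, n_rotations = 16, 200
image = np.zeros((size, size))
image[2, size - 3] = 1.0                               # one-hot image, hot pixel off-center

for max_angle in (5, 30, 90, 180):                     # increasing augmentation strength
    angles = rng.uniform(-max_angle, max_angle, size=n_rotations)
    batch = np.stack([rotate(image, a, reshape=False, order=1).ravel() for a in angles])
    cov_rank = np.linalg.matrix_rank(np.cov(batch.T), tol=1e-6)
    print(max_angle, cov_rank)                         # stronger rotations -> higher-rank collection
```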