
Joint Embedding Self-Supervised Learning in the Kernel Regime

09/29/2022, by Bobak T. Kiani, et al. (MIT, Facebook)

The fundamental goal of self-supervised learning (SSL) is to produce useful representations of data without access to any labels for classifying the data. Modern methods in SSL, which form representations based on known or constructed relationships between samples, have been particularly effective at this task. Here, we aim to extend this framework to incorporate algorithms based on kernel methods where embeddings are constructed by linear maps acting on the feature space of a kernel. In this kernel regime, we derive methods to find the optimal form of the output representations for contrastive and non-contrastive loss functions. This procedure produces a new representation space with an inner product denoted as the induced kernel which generally correlates points which are related by an augmentation in kernel space and de-correlates points otherwise. We analyze our kernel model on small datasets to identify common features of self-supervised learning algorithms and gain theoretical insights into their performance on downstream tasks.


1 Introduction

Self-supervised learning (SSL) algorithms are broadly tasked with learning from unlabeled data. In the joint embedding framework of SSL, mainstream contrastive methods build representations by reducing the distance between inputs related by an augmentation (positive pairs) and increasing the distance between inputs not known to be related (negative pairs) (Chen et al., 2020; He et al., 2020; Oord et al., 2018; Ye et al., 2019). Non-contrastive methods only enforce similarities between positive pairs but are designed carefully to avoid collapse of representations (Grill et al., 2020; Zbontar et al., 2021). Recent algorithms for SSL have performed remarkably well reaching similar performance to baseline supervised learning algorithms on many downstream tasks (Caron et al., 2020; Bardes et al., 2021; Chen and He, 2021).

In this work, we study SSL from a kernel perspective. In standard SSL tasks, inputs are fed into a neural network and mapped into a feature space which encodes the final representations used in downstream tasks (e.g., in classification tasks). In the kernel setting, inputs are embedded in a feature space corresponding to a kernel, and representations are constructed via an optimal mapping from this feature space to the vector space for the representations of the data. Since the feature space of a kernel can be infinite dimensional, one practically may only have access to the kernel function itself. Here, the task can be framed as one of finding an optimal “induced” kernel, which is a mapping from the original kernel in the input feature space to an updated kernel function acting on the vector space of the representations. Our results show that such an induced kernel can be constructed using only manipulations of kernel functions and data that encodes the relationships between inputs in an SSL algorithm (e.g., adjacency matrices between the input datapoints).

More broadly, we make the following contributions:

  • For a contrastive and non-contrastive loss, we provide closed form solutions when the algorithm is trained over a single batch of data. These solutions form a new “induced” kernel which can be used to perform downstream supervised learning tasks.

  • We show that a version of the representer theorem in kernel methods can be used to formulate kernelized SSL tasks as optimization problems. As an example, we show how to optimally find induced kernels when the loss is enforced over separate batches of data.

  • We empirically study the properties of our SSL kernel algorithms to gain insights about the training of SSL algorithms in practice. We study the generalization properties of SSL algorithms and show that the choice of augmentation and adjacency matrices encoding relationships between the datapoints are crucial to performance.

We proceed as follows. First, we provide a brief background of the goals of our work and the theoretical tools used in our study. Second, we show that kernelized SSL algorithms trained on a single batch admit a closed form solution for commonly used contrastive and non-contrastive loss functions. Third, we generalize our findings to provide a semi-definite programming formulation to solve for the optimal induced kernel in more general settings and provide heuristics to better understand the form and properties of the induced kernels. Finally, we empirically investigate our kernelized SSL algorithms when trained on various datasets.

1.1 Notation and setup

We denote vectors and matrices with lowercase (e.g., $a$) and uppercase (e.g., $A$) letters respectively. The vector 2-norm and matrix operator norm are denoted by $\|\cdot\|$, and the Frobenius norm of a matrix is denoted by $\|\cdot\|_F$. We denote the transpose and conjugate transpose of a matrix $A$ by $A^\top$ and $A^\dagger$ respectively. We denote the identity matrix as $I$ and the vector with each entry equal to one as $\mathbf{1}$. For a diagonalizable matrix $S$, its projection onto the eigenspace of its positive eigenvalues is denoted $(S)_+$.

For a dataset of size $N$, let $x_i$ for $i \in [N]$ denote the elements of the dataset. Given a kernel function $k(\cdot,\cdot)$, let $\phi(\cdot)$ be the map from inputs to the reproducing kernel Hilbert space (RKHS) denoted by $\mathcal{H}$ with corresponding inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ and RKHS norm $\|\cdot\|_{\mathcal{H}}$. Throughout, we denote $K$ to be the kernel matrix of the SSL dataset, where $K_{ij} = k(x_i, x_j)$. We consider linear models $W$ which map features $\phi(x)$ to representations $f(x) = W\phi(x) \in \mathbb{R}^d$. Let $Z$ be the representation matrix which contains the $f(x_i)$ as rows. This linear function space induces a corresponding RKHS norm which can be calculated as $\|W\|_{\mathcal{H}}^2 = \sum_{m=1}^{d} \|W_m\|_{\mathcal{H}}^2$, where $W_m$ denotes the $m$-th component of the output of the linear mapping $W$. This linear mapping constructs an “induced” kernel denoted $k_{\mathrm{ind}}(\cdot,\cdot)$ as discussed later.

The driving motive behind modern self-supervised algorithms is to maximize the information of given inputs in a dataset while enforcing similarity between inputs that are known to be related. The adjacency matrix $A \in \{0,1\}^{N \times N}$ (which can also be generalized to real-valued entries) connects related inputs, i.e., $A_{ij} = 1$ if inputs $x_i$ and $x_j$ are related by a transformation, and $D$ is a diagonal matrix whose $i$-th diagonal entry is equal to the number of nonzero elements of row $i$ of $A$.
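
As a minimal illustration of this setup, the sketch below builds the kernel matrix, a pairwise adjacency matrix, and its graph Laplacian for a toy SSL dataset in which each point is paired with a single augmented copy; the RBF kernel, the noise-based augmentation, and all variable names are our own illustrative choices rather than quantities taken from the paper.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel matrix with entries exp(-gamma * ||x - y||^2)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X_orig = rng.normal(size=(100, 2))                      # original samples
X_aug = X_orig + 0.05 * rng.normal(size=X_orig.shape)   # one augmented copy per sample
X = np.vstack([X_orig, X_aug])                          # SSL dataset of size N = 200
N = X.shape[0]

# Pairwise adjacency: sample i is related to its augmented copy i + 100.
A = np.zeros((N, N))
for i in range(100):
    A[i, i + 100] = A[i + 100, i] = 1.0

D = np.diag(A.sum(axis=1))   # degree matrix (nonzero entries per row of A)
L = D - A                    # graph Laplacian used by the non-contrastive loss

K = rbf_kernel(X, X)         # kernel matrix of the SSL dataset
```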

2 Related works

Any joint-embedding SSL algorithm requires a properly chosen loss function and access to a set of observations and known pairwise positive relations between those observations. Methods are denoted as non-contrastive if the loss function is only a function of pairs that are related (Grill et al., 2020; Chen and He, 2021; Zbontar et al., 2021). One common method using the VICReg loss (Bardes et al., 2021), for example, takes the form

(1)

We adapt the above for the non-contrastive loss we study in our work. Contrastive methods also penalize similarities of representations that are not related. Popular algorithms include SimCLR (Chen et al., 2020), SwAV (Caron et al., 2020), NNCLR (Dwibedi et al., 2021), contrastive predictive coding (Oord et al., 2018), and many others. We consider a variant of the spectral contrastive loss in our work (HaoChen et al., 2021).

Theoretical studies of SSL: In tandem with the success of SSL in deep learning, a host of theoretical tools have been developed to help understand how SSL algorithms learn (Arora et al., 2019b; Balestriero and LeCun, 2022; HaoChen et al., 2022; Lee et al., 2021). Findings are often connected to the underlying graph connecting the data distribution (Wei et al., 2020; HaoChen et al., 2021) or the choice of augmentation (Wen and Li, 2021). Building representations from known relationships between datapoints is also studied in spectral graph theory (Chung, 1997). We employ findings from this body of literature to provide intuition behind the properties of the algorithms discussed here.

Neural tangent kernels and Gaussian processes: Prior work has connected the outputs of infinite width neural networks to a corresponding Gaussian process (Williams and Rasmussen, 2006; Neal, 1996; Lee et al., 2017). When trained using continuous time gradient descent, these infinite width models evolve as linear models under the so-called neural tangent kernel (NTK) regime (Jacot et al., 2018; Arora et al., 2019a). The discovery of the NTK opened a flurry of exploration into the connections between so-called lazy training of wide networks and kernel methods (Yang, 2019; Chizat and Bach, 2018; Wang et al., 2022; Bietti and Mairal, 2019). Though the training dynamics of the NTK have previously been studied in the supervised setting, one can analyze an NTK in a self-supervised setting by using that kernel in the SSL algorithms that we study here. We perform some preliminary investigation into this direction in our experiments.

Kernel and metric learning: From an algorithmic perspective, perhaps the closest lines of work are related to kernel and metric learning (Bellet et al., 2013; Yang and Jin, 2006). Since our focus is on directly kernelizing SSL methods to eventually analyze and better understand SSL algorithms, we differ from these methods in that our end goal is not to improve the performance of kernel methods but instead to bridge algorithms in SSL with kernel methods. In kernel learning, prior works have proposed constructing a kernel via a learning procedure, e.g., via convex combinations of kernels (Cortes et al., 2010), kernel alignment (Cristianini et al., 2001), and unsupervised kernel learning to match local data geometry (Zhuang et al., 2011). Prior work in distance metric learning using kernel methods aims to produce representations of data in unsupervised or semi-supervised settings by taking advantage of links between data points. For example, Baghshah and Shouraki (2010); Hoi et al. (2007) learn to construct a kernel based on optimizing distances between points embedded in Hilbert space according to a similarity and dissimilarity matrix. Yeung and Chang (2007) perform kernel distance metric learning in a semi-supervised setting where pairwise relations between data and labels are provided. Xia et al. (2013) propose an online procedure to learn a kernel which maps similar points closer to each other than dissimilar points. Many of these works also use semi-definite programs to perform optimization to find the optimal kernel.

3 Contrastive and non-contrastive kernel methods

Figure 1: To translate the SSL setting into the kernel regime, we aim to find the optimal linear function $W$ which maps inputs from the RKHS into the $d$-dimensional feature space of the representations. This new feature space induces a new optimal kernel denoted the “induced” kernel. Relationships between data-points are encoded in an adjacency matrix $A$ (the example matrix shown here contains pairwise relationships between datapoints).

Stated informally, the goal of SSL in the kernel setting is to start with a given kernel function (e.g., RBF kernel or a neural tangent kernel) and map this kernel function to a new “induced” kernel which is a function of the SSL loss function and the SSL dataset. For two new inputs $x$ and $x'$, the induced kernel generally outputs correlated values if $x$ and $x'$ are correlated in the original kernel space to some datapoint in the SSL dataset or correlated to separate but related datapoints in the SSL dataset as encoded in the graph adjacency matrix. If no such relations are found in the SSL dataset between $x$ and $x'$, then the induced kernel will generally output an uncorrelated value.

To kernelize SSL methods, we consider a setting generalized from the prototypical SSL setting, where representations are obtained by minimizing distances between samples related by an augmentation and, for contrastive methods, maximizing distances between unrelated samples. Translating this to the kernel regime, as illustrated in Figure 1, our goal is to find a linear mapping which obtains the optimal representation of the data for a given SSL loss function and minimizes the RKHS norm. This optimal solution produces an “induced kernel” which is the inner product of the data in the output representation space. Once constructed, the induced kernel can be used in downstream tasks to perform supervised learning.

Due to a generalization of the representer theorem (Schölkopf et al., 2001), we can show that the optimal linear function must lie in the span of the data features. This implies that the induced kernel can be written as a function of the kernel between datapoints in the SSL dataset.

Proposition 3.1 (Form of optimal representation).

Given a dataset $\{x_i\}_{i=1}^N$, let $k$ be a kernel function with corresponding map $\phi$ into the RKHS $\mathcal{H}$. Let $W$ be a function drawn from the space of linear functions mapping inputs in the RKHS to the vector space of the representation. For a risk function $\mathcal{E}$ and any strictly increasing function $g$, consider the optimization problem

(2)

The optimal solutions of the above take the form

$f^\star(x) = M k_x^\top \qquad (3)$

where $M \in \mathbb{R}^{d \times N}$ is a matrix that must be solved for and $k_x$ is the row vector with entries $(k_x)_i = k(x, x_i)$.

Proposition 3.1, proved in Section A.1, provides a prescription for finding the optimal representations or induced kernels: i.e., one must search over the set of $d \times N$ matrices $M$ to find an optimal matrix. This search can be performed using standard optimization techniques as we will discuss later, but in certain cases the optimal solution can be calculated in closed form, as shown next for both a contrastive and a non-contrastive loss function.
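
To make the parameterization of Proposition 3.1 concrete, the following sketch evaluates representations of the form $f(x) = M k_x^\top$, where $k_x$ collects kernel values between a new input and the SSL training set. The matrix M below is random and merely stands in for whatever the chosen optimizer (a closed-form solution or the program of Section 3.1) would return; the helper names are hypothetical.

```python
import numpy as np

def representation(M, k_fn, X_train, X_new):
    """Evaluate f(x) = M k_x^T, with (k_x)_i = k(x, x_i) over the SSL training set."""
    K_new = k_fn(X_train, X_new)      # shape (N, n_new): column j holds k_{x_new_j}
    return (M @ K_new).T              # shape (n_new, d): one d-dim representation per row

# Example with placeholder quantities (M would normally come from the SSL solution).
rng = np.random.default_rng(1)
X_train = rng.normal(size=(50, 3))
X_new = rng.normal(size=(5, 3))
d = 8
M = rng.normal(size=(d, X_train.shape[0]))

rbf = lambda A, B: np.exp(-np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1))
Z_new = representation(M, rbf, X_train, X_new)   # (5, 8) array of representations
```

The induced kernel between two new points is then simply the inner product of their rows in the returned array.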

Non-contrastive loss

Consider a variant of the VICReg (Bardes et al., 2021) loss function below:

(4)

where $\lambda$ is a hyperparameter that controls the invariance term in the loss and $L = D - A$ is the graph Laplacian of the data. When the representation space has dimension $d \geq N$ and the kernel matrix of the data is full rank, the induced kernel of the above loss function is:

(5)

where $(\cdot)_+$ projects the matrix inside the parentheses onto the eigenspace of its positive eigenvalues, $k_x$ is the kernel row-vector with $i$-th entry equal to $k(x, x_i)$ ($k_x^\top$ equal to its transpose), and $K$ is the kernel matrix of the training data for the self-supervised dataset where entry $K_{ij}$ is equal to $k(x_i, x_j)$. When we restrict the output space of the self-supervised learning task to be of dimension $d < N$, then the induced kernel only incorporates the top $d$ eigenvectors of the matrix appearing above:

(6)

where $Q \Lambda Q^\top$ is the eigendecomposition including only positive eigenvalues sorted in descending order, $Q_{:d}$ denotes the matrix consisting of the first $d$ columns of $Q$, and $\Lambda_{:d}$ denotes the matrix consisting of entries in the first $d$ rows and columns of $\Lambda$. Proofs of the above are in Section A.2.
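
Both closed forms rely on projecting a symmetric matrix onto the eigenspace of its positive eigenvalues, optionally truncated to the top $d$ eigenvalues. A small numpy helper for that operation is sketched below; it is our own utility, not code from the paper.

```python
import numpy as np

def positive_eigenspace(S, d=None):
    """Project a symmetric matrix onto its positive eigenspace, (S)_+.

    If d is given, keep only the top-d positive eigenvalues (sorted descending),
    mirroring the rank-restricted induced kernels discussed in the text.
    """
    S = 0.5 * (S + S.T)                       # symmetrize against round-off
    evals, evecs = np.linalg.eigh(S)          # ascending eigenvalues
    order = np.argsort(evals)[::-1]           # reorder to descending
    evals, evecs = evals[order], evecs[:, order]
    keep = evals > 0
    if d is not None:
        keep &= np.arange(len(evals)) < d     # top-d among the sorted eigenvalues
    Q, lam = evecs[:, keep], evals[keep]
    return Q @ np.diag(lam) @ Q.T, (Q, lam)   # projected matrix and its factors
```

The returned factors correspond to the truncated eigendecomposition used in the rank-restricted induced kernels above.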

Contrastive loss

For contrastive SSL, we can also obtain a closed form solution to the induced kernel for a variant of the spectral contrastive loss (HaoChen et al., 2021):

(7)

where $A$ is the adjacency matrix encoding relations between datapoints. When the representation space has dimension $d \geq N$, this loss results in the optimal induced kernel:

(8)

where $(\cdot)_+$ denotes the projection onto the eigenspace of positive eigenvalues. In the standard SSL setting where relationships are pairwise (i.e., $A_{ij} = 1$ if $x_i$ and $x_j$ are related by an augmentation), the projected matrix has only positive or zero eigenvalues, so the projection can be ignored. If $d < N$, then we similarly project the matrix onto its top $d$ eigenvalues and obtain an induced kernel similar to the non-contrastive one:

(9)

where, as before, $Q \Lambda Q^\top$ is the eigendecomposition including only positive eigenvalues sorted in descending order, $Q_{:d}$ consists of the first $d$ columns of $Q$, and $\Lambda_{:d}$ is the matrix of the first $d$ rows and columns of $\Lambda$. Proofs of the above are in Section A.3.

3.1 General form as SDP

The closed form solutions for the induced kernel obtained above assumed the loss function was enforced across a single batch. Of course, in practice, data are split into several batches. This batched setting may not admit a closed-form solution, but by using Proposition 3.1, we know that any optimal induced kernel takes the general form:

$k_{\mathrm{ind}}(x, x') = k_x B\, k_{x'}^\top \qquad (10)$

where $B$ is a positive semi-definite matrix. With constraints properly chosen so that the solution for each batch is optimal (Balestriero and LeCun, 2022; HaoChen et al., 2022), one can find the optimal matrix $B$ by solving a semi-definite program (SDP). We perform this conversion to an SDP for the contrastive loss here and leave proofs and further details, including the non-contrastive case, to Section A.4.

Introducing some notation to deal with batches, assume we have $N$ datapoints split into $T$ batches of size $N_b$. We denote the $i$-th datapoint within batch $k$ as $x_i^{(k)}$. As before, $x_i$ denotes the $i$-th datapoint across the whole dataset. Let $K$ be the kernel matrix over the complete dataset where $K_{ij} = k(x_i, x_j)$, $K_k \in \mathbb{R}^{N \times N_b}$ be the kernel matrix between the complete dataset and batch $k$ where $(K_k)_{ij} = k(x_i, x_j^{(k)})$, and $A_k$ be the adjacency matrix for inputs in batch $k$. With this notation, we now aim to minimize the loss function adapted from Equation 7, including a regularizing term for the RKHS norm:

(11)

where $\beta$ is a weighting term for the regularizer. Taking the limit $\beta \to 0^+$, we can find the optimal induced kernel for a representation of dimension $d$ by enforcing that optimal representations are obtained in each batch:

(12)

where, as before, $(\cdot)_+$ denotes the projection onto the eigenspace of positive eigenvalues. Relaxing and removing this constraint results in an SDP which can be efficiently solved using existing optimizers. Further details and a generalization of this conversion to other representation dimensions are shown in Section A.4.
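
To illustrate how the batched search over the matrix $B$ can be posed as a convex program, the sketch below assumes, as our simplified reading of Equation 11, that each batch contributes a term comparing the batch's induced Gram matrix $K_k^\top B K_k$ to its adjacency matrix $A_k$, with $\beta\,\mathrm{tr}(K B)$ playing the role of the RKHS-norm regularizer under the representer-theorem parameterization. The cvxpy modeling library is used here purely for convenience; the appendix instead references dedicated SDP solvers such as MOSEK and SeDuMi.

```python
import numpy as np
import cvxpy as cp

def batched_contrastive_program(K, K_batches, A_batches, beta=1e-3):
    """Solve for the PSD matrix B defining the induced kernel k_ind(x, x') = k_x B k_x'^T.

    For each batch k, the induced Gram matrix K_k^T B K_k is pushed toward the batch
    adjacency A_k; beta * tr(K B) regularizes the RKHS norm of the underlying linear map.
    This objective is a simplified stand-in for the batched loss in the text.
    """
    N = K.shape[0]
    B = cp.Variable((N, N), PSD=True)
    loss = beta * cp.trace(K @ B)
    for K_k, A_k in zip(K_batches, A_batches):
        loss += cp.sum_squares(K_k.T @ B @ K_k - A_k)
    prob = cp.Problem(cp.Minimize(loss))
    prob.solve()                      # cvxpy dispatches to an SDP-capable solver
    return B.value
```

Since the objective is affine in $B$ inside a squared norm plus a linear trace term, the relaxed problem is convex and can be handed to any off-the-shelf conic solver.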

3.2 Interpreting the induced kernel

As a loose rule, the induced kernel will correlate points that are close in the kernel space or related by augmentations in the SSL dataset and decorrelate points otherwise. As an example, note that in the contrastive setting (Equation 8), if one calculates the induced kernel between two points in the SSL dataset indexed by $i$ and $j$ that are related by an augmentation (i.e., $A_{ij} = 1$), then the kernel between these two points is equal to one. More generally, if the two inputs to the induced kernel are close in kernel space to different points in the SSL dataset that are known to be related by $A$, then the kernel value will be close to one. We formalize this intuition below for the standard setting with pairwise augmentations.

Proposition 3.2.

Given a kernel function $k$ with corresponding map $\phi$ into the RKHS $\mathcal{H}$, let $\{x_i\}_{i=1}^N$ be an SSL dataset normalized such that $k(x_i, x_i) = 1$ and formed by pairwise augmentations (i.e., every element has exactly one neighbor in $A$) with kernel matrix $K$. Given two points $x$ and $x'$, if there exist two points in the SSL dataset indexed by $i$ and $j$ which are related by an augmentation ($A_{ij} = 1$) and $x$ and $x'$ are sufficiently close in kernel space to $x_i$ and $x_j$ respectively, then the induced kernel between $x$ and $x'$ for the contrastive loss is lower bounded by a quantity, depending on $N$ and the norm of $K$, that is given explicitly in Section A.5.

We prove the above statement in Section A.5. The bounds in the above statement, which depend on the number of datapoints and the kernel matrix norm, are not very tight and are solely meant to provide intuition for the properties of the induced kernel. In more realistic settings, stronger correlations will be observed under much weaker assumptions. In light of this, we visualize the induced kernel values and their relations to the original kernel function in Section 4.1 on a simple 2-dimensional spiral dataset. Here, it is readily observed that the induced kernel better connects points along the data manifold that are related by the adjacency matrix.

3.3 Downstream tasks

In downstream tasks, one can apply the induced kernels directly in supervised algorithms such as kernel regression or SVM. Alternatively, one can extract representations directly as $f^\star(x) = M k_x^\top$, as shown in Proposition 3.1, and employ any learning algorithm on these features. As an example, in kernel regression, we are given a dataset of input-output pairs and aim to train a linear model to minimize the mean squared error loss of the outputs (Williams and Rasmussen, 2006). The optimal solution using an induced kernel takes the form:

$\hat{f}(x) = k^{\mathrm{ind}}_x \left(K^{\mathrm{ind}}\right)^{-1} y \qquad (13)$

where $K^{\mathrm{ind}}$ is the kernel matrix of the supervised training dataset with entry $(i,j)$ equal to $k_{\mathrm{ind}}(x_i, x_j)$, $k^{\mathrm{ind}}_x$ is the row vector of induced kernel values between $x$ and the training points, and $y$ is the concatenation of the targets as a vector. Note that since kernel methods generally have complexity that scales quadratically with the number of datapoints, such algorithms may be infeasible in large-scale learning tasks unless modifications are made.
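
Concretely, a precomputed induced-kernel Gram matrix can be handed directly to a standard solver; the sketch below shows both an SVM (via scikit-learn's precomputed-kernel interface) and a minimum-norm kernel regression in the spirit of Equation 13. The function names and the assumption that the induced-kernel blocks have already been computed are ours.

```python
import numpy as np
from sklearn.svm import SVC

def downstream_svm(K_train, y_train, K_test_train, C=1.0):
    """Fit an SVM on a precomputed induced-kernel Gram matrix.

    K_train:      (n_train, n_train) induced kernel between training points.
    K_test_train: (n_test, n_train) induced kernel between test and training points.
    """
    clf = SVC(kernel="precomputed", C=C)
    clf.fit(K_train, y_train)
    return clf.predict(K_test_train)

def kernel_regression_predict(K_train, y_train, K_test_train):
    """Minimum-norm least-squares fit, mirroring the form of Equation 13."""
    alpha = np.linalg.pinv(K_train) @ y_train
    return K_test_train @ alpha
```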

A natural question is when and why should one prefer the induced kernel of SSL to a kernel used in the standard supervised setting, perhaps including data augmentation? Kernel methods generally fit a dataset perfectly, so an answer to the question more likely arises from studying generalization. In kernel methods, generalization error typically tracks well with the norm of the classifier, captured by the complexity quantity defined as (Mohri et al., 2018; Steinwart and Christmann, 2008)

$s_K = y^\top K^{-1} y \qquad (14)$

where $y$ is a vector of targets and $K$ is the kernel matrix of the supervised dataset. For example, the generalization gap of an SVM algorithm can be bounded with high probability by a term scaling as $\sqrt{s_K/N}$ (see example proof in Section B.1) (Meir and Zhang, 2003; Huang et al., 2021). For kernel functions bounded in output between $-1$ and $1$, the quantity $s_K$ is minimized in binary classification when $K_{ij} = 1$ for $x_i, x_j$ drawn from the same class and $K_{ij} = -1$ for $x_i, x_j$ drawn from distinct classes. If the induced kernel works ideally – in the sense that it better correlates points within a class and decorrelates points otherwise – then the entries of the kernel matrix approach these optimal values. This intuition is also supported by the hypothesis that self-supervised and semi-supervised algorithms perform well by connecting the representations of points on a common data manifold (HaoChen et al., 2021; Belkin and Niyogi, 2004). To formalize this somewhat, consider such an ideal, yet fabricated, setting where the SSL induced kernel has complexity $s_K$ that does not grow with the dataset size.

Proposition 3.3 (Ideal SSL outcome).

Given a supervised dataset of $N$ points for binary classification drawn from a distribution supported on two disjoint, connected manifolds for the classes with labels $+1$ and $-1$ respectively, if the induced kernel matrix $K^{\mathrm{ind}}$ of the dataset successfully separates the manifolds such that $K^{\mathrm{ind}}_{ij} = 1$ if $x_i, x_j$ are in the same manifold and $K^{\mathrm{ind}}_{ij} = 0$ otherwise, then the complexity $s_{K^{\mathrm{ind}}}$ is a constant independent of the dataset size $N$.

The simple proof of the above is in Appendix B. In short, we conjecture that SSL should be preferred in such settings where the relationships between datapoints are “strong” enough to connect similar points in a class on the same manifold. We analyze the quantity $s_K$ in Section C.3 to add further empirical evidence behind this hypothesis.
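
The complexity quantity can be evaluated directly from a labeled Gram matrix. The sketch below assumes the standard form $s_K = y^\top K^{-1} y$ (with a pseudo-inverse for singular matrices), a form used in Huang et al. (2021), and checks numerically that an idealized block-structured kernel (ones within a class, zeros across classes) keeps this quantity constant as the dataset grows.

```python
import numpy as np

def model_complexity(K, y):
    """Complexity quantity s_K = y^T K^{-1} y (pseudo-inverse for singular K)."""
    return float(y @ np.linalg.pinv(K) @ y)

# Idealized binary-classification kernel: 1 within a class, 0 across classes.
# Its complexity stays constant as the number of points per class grows.
for n_per_class in (10, 100, 500):
    y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
    K_ideal = (y[:, None] == y[None, :]).astype(float)
    print(n_per_class, model_complexity(K_ideal, y))   # prints roughly 2.0 each time
```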

Figure 2: Comparison of the RBF kernel space (first row) and induced kernel space (second row). The induced kernel is computed based on Equation 5, and the graph Laplacian matrix is derived from the inner product neighborhood in the RBF kernel space, i.e., using the neighborhoods as data augmentation. a) We plot three randomly chosen points’ kernel entries with respect to the other points on the manifolds. When the neighborhood augmentation range used to construct the Laplacian matrix is small enough, the SSL-induced kernel faithfully learns the topology of the entangled spiral manifolds. b) When the neighborhood augmentation range used to construct the Laplacian matrix is too large, it creates the “short-circuit” effect in the induced kernel space. Each subplot on the second row is normalized by its largest absolute value for better contrast.

4 Experiments

In this section, we empirically investigate the performance and properties of the SSL kernel methods on a toy spiral dataset and portions of the MNIST and EMNIST datasets of hand-drawn digits and characters (Cohen et al., 2017). As with other works, we focus on small-data tasks where kernel methods can be performed efficiently without modifications needed for handling large datasets (Arora et al., 2019a; Fernández-Delgado et al., 2014). For simplicity and ease of analysis, we perform experiments here with respect to the RBF kernel. Additional experiments reinforcing these findings and also including analysis with neural tangent kernels can be found in Appendix C.

(Figure 3 panels: contrastive-MNIST, contrastive-EMNIST, non-contrastive-MNIST, non-contrastive-EMNIST.)
Figure 3: Depiction of MNIST and EMNIST full test set performances using the contrastive and non-contrastive kernels (in red), benchmarked against the supervised case with labels on all samples (original + augmented) in black and with labels only on the original samples in blue, with the number of original samples given per column and the number of augmented samples (in log2) on the x-axis. The first row corresponds to Gaussian blur data augmentation (poorly aligned with the data distribution) and the second row corresponds to random rotations (-10,10), translations (-0.1,0.1), and scaling (0.9,1.1). We fix the SVM regularization and use the RBF kernel (NTK kernels in Section C.2). We restrict to settings where the total number of samples does not exceed our sample budget, and in all cases the kernel representation dimensions are unconstrained. Two key observations emerge. First, whenever the augmentation is not aligned with the data distribution, the SSL kernels fall below the supervised case, especially as the number of original samples increases. Second, when the augmentation is aligned with the data distribution, both SSL kernels are able to get close to and even outperform the supervised benchmark with augmented labels.

4.1 Visualizing the induced kernel on the spiral dataset

For an intuitive understanding, we provide a visualization in Figure 2 to show how the SSL-induced kernel helps to manipulate distances and disentangle manifolds in the representation space. In Figure 2, we study two entangled 1-D spiral manifolds in a 2D space with 200 training points uniformly distributed on the spiral manifolds. We use the non-contrastive SSL-induced kernel, following Equation 5, to demonstrate this result; the contrastive SSL-induced kernel is qualitatively similar and left to Section C.1. In the RBF kernel space shown in the first row of Figure 2, the value of the kernel is captured purely by distance. Next, to consider the SSL setting, we construct the graph Laplacian matrix by connecting vertices between any training points within Euclidean distance $\epsilon$ of each other, i.e., $A_{ij} = 1$ if $\|x_i - x_j\|_2 < \epsilon$ and $A_{ij} = 0$ otherwise. The diagonal entries of $D$ are equal to the degrees of the vertices. This construction can be viewed as using the Euclidean neighborhood of each point as the data augmentation. We choose a fixed $\epsilon$, where other choices within a reasonable range lead to similar qualitative results. In the second row of Figure 2, we show the induced kernel between selected points (marked with an x) and other training points in the SSL-induced kernel space. When $\epsilon$ is chosen properly, as observed in the second row of Figure 2(a), the SSL-induced kernel faithfully captures the topology of the manifolds and leads to a more disentangled representation. However, the augmentation has to be carefully chosen, as Figure 2(b) shows that when $\epsilon$ is too large, the two manifolds become mixed in the representation space.
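
The construction used in this experiment can be reproduced along the following lines; the spiral parameterization and the value of $\epsilon$ below are illustrative stand-ins rather than the exact settings behind Figure 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                                     # points per spiral (200 total)
t = rng.uniform(0.5, 3.0, size=n) * np.pi   # angle parameter along each spiral

# Two entangled 1-D spirals in 2-D (the second is the first rotated by pi).
spiral1 = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)
spiral2 = np.stack([t * np.cos(t + np.pi), t * np.sin(t + np.pi)], axis=1)
X = np.vstack([spiral1, spiral2])

# Epsilon-neighborhood adjacency: A_ij = 1 if ||x_i - x_j|| < eps (i != j).
eps = 1.0                                   # illustrative neighborhood size
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
A = ((dists < eps) & ~np.eye(len(X), dtype=bool)).astype(float)

D = np.diag(A.sum(axis=1))                  # vertex degrees on the diagonal
L = D - A                                   # graph Laplacian fed to the SSL loss

K = np.exp(-dists**2)                       # RBF kernel matrix on the same points
```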

4.2 Classification Experiments

Figure 4: MNIST classification task with a fixed number of original training samples and augmentations per sample, evaluated on the full test set, with baselines in the titles given by a supervised SVM using labels for all samples or for the original samples only. We provide on the left the contrastive kernel performances when ablating over the (inverse) SVM regularization on the y-axis and the representation dimension $d$ on the x-axis. Similarly, we provide on the right the non-contrastive kernel performances when ablating over $\lambda$ on the y-axis and over the representation dimension $d$ on the x-axis. We observe, as expected, that reducing $d$ prevents overfitting and should be preferred over the regularizer, and that $\lambda$ acts jointly with the representation dimension, i.e., one only needs to tune one of the two.

We explore in Figure 3 the supervised classification setting of MNIST and EMNIST, which consist of grayscale images. (E)MNIST provides a strong baseline to evaluate kernel methods due to the absence of background in the images, making kernels such as RBF better suited to measure input similarities. In this setting, we explore two different data-augmentation (DA) policies, one aligned with the data distribution (rotation+translation+scaling) and one largely misaligned with the data distribution (aggressive Gaussian blur). Because our goal is to understand how much DA impacts the SSL kernel compared to a fully supervised benchmark, we consider two (supervised) benchmarks: one that employs the labels of the sampled training set and all the augmented samples, and one that only employs the sampled training set and no augmented samples. We explore small training set sizes and, for each case, produce a number of augmented samples for each datapoint so that the total number of samples does not exceed a standard threshold for kernel methods. We note that our implementation is based on the common Python scientific libraries Numpy/Scipy (Harris et al., 2020) and runs on CPU. We observe in Figure 3 that the SSL kernel is able to match and even outperform the fully supervised case when employing the correct data augmentation, while with the incorrect data augmentation the performance does not even match the supervised case that did not see the augmented samples. To better understand the impact of different hyperparameters on the two kernels, we also study in Figure 4 the MNIST test set performances when varying the representation dimension $d$, the SVM's regularization parameter, and the non-contrastive kernel's parameter $\lambda$. We observe that although $\lambda$ is an additional hyperparameter to tune, its tuning plays a similar role to $d$, the representation dimension. Hence, in practice, the design of the non-contrastive model can be controlled with a single parameter, as in the contrastive setting. We also observe, interestingly, that $d$ is a preferable parameter to tune to prevent overfitting, as opposed to the SVM's regularizer.
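
One way to generate rotation/translation/scaling augmentations of digit images with the Numpy/Scipy stack mentioned above is sketched below. The parameter ranges mirror the Figure 3 caption, while the function itself (including the crop-and-pad handling of the rescaled image) is our own illustrative implementation.

```python
import numpy as np
from scipy import ndimage

def augment_digit(img, rng):
    """Random rotation in (-10, 10) degrees, translation in (-0.1, 0.1) of the
    image size, and scaling in (0.9, 1.1) applied to a 2-D grayscale image."""
    angle = rng.uniform(-10, 10)
    shift = rng.uniform(-0.1, 0.1, size=2) * np.array(img.shape)
    scale = rng.uniform(0.9, 1.1)

    out = ndimage.rotate(img, angle, reshape=False, order=1, mode="constant")
    out = ndimage.shift(out, shift, order=1, mode="constant")
    out = ndimage.zoom(out, scale, order=1, mode="constant")
    # zoom changes the array size; crop and pad back to the original shape.
    h, w = img.shape
    out = out[:h, :w]
    pad = [(0, max(0, h - out.shape[0])), (0, max(0, w - out.shape[1]))]
    return np.pad(out, pad, mode="constant")

rng = np.random.default_rng(0)
img = rng.random((28, 28))                    # stand-in for an MNIST digit
augmented = [augment_digit(img, rng) for _ in range(4)]
```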

5 Discussion

Our work explores the properties of SSL algorithms when trained via kernel methods. Connections between kernel methods and neural networks have gained significant interest in the supervised learning setting (Neal, 1996; Lee et al., 2017) for their potential insights into the training of deep networks. As we show in this study, such insights into the training properties of SSL algorithms can similarly be garnered from an analysis of SSL algorithms in the kernel regime. Our theoretical and empirical analysis, for example, highlights the importance of the choice of augmentations and encoded relationships between data points on downstream performance. Looking forward, we believe that interrelations between this kernel regime and the actual deep networks trained in practice can be strengthened particularly by analyzing the neural tangent kernel. In line with similar analysis in the supervised setting (Yang et al., 2022; Seleznova and Kutyniok, 2022; Lee et al., 2020), neural tangent kernels and their corresponding induced kernels in the SSL setting may shine light on some of the theoretical properties of the finite width networks used in practice.

References

  • M. ApS (2019) The MOSEK optimization toolbox for MATLAB manual, version 9.0. Cited by: §A.4.
  • S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang (2019a) On exact computation with an infinitely wide neural net. Advances in Neural Information Processing Systems 32. Cited by: §2, §4.
  • S. Arora, H. Khandeparkar, M. Khodak, O. Plevrakis, and N. Saunshi (2019b) A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229. Cited by: §2.
  • M. S. Baghshah and S. B. Shouraki (2010) Kernel-based metric learning for semi-supervised clustering. Neurocomputing 73 (7-9), pp. 1352–1361. Cited by: §2.
  • R. Balestriero and Y. LeCun (2022) Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods. arXiv preprint arXiv:2205.11508. Cited by: §2, §3.1.
  • A. Bardes, J. Ponce, and Y. LeCun (2021) Vicreg: variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906. Cited by: §1, §2, §3.
  • P. L. Bartlett and S. Mendelson (2002) Rademacher and gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3 (Nov), pp. 463–482. Cited by: §B.1.
  • M. Belkin and P. Niyogi (2004) Semi-supervised learning on riemannian manifolds. Machine learning 56 (1), pp. 209–239. Cited by: §3.3.
  • A. Bellet, A. Habrard, and M. Sebban (2013) A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709. Cited by: §2.
  • A. Bietti and J. Mairal (2019) On the inductive bias of neural tangent kernels. Advances in Neural Information Processing Systems 32. Cited by: §2.
  • S. Boyd, S. P. Boyd, and L. Vandenberghe (2004) Convex optimization. Cambridge university press. Cited by: §A.2, §A.3.
  • M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin (2020) Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems 33, pp. 9912–9924. Cited by: §1, §2.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. Cited by: §1, §2.
  • X. Chen and K. He (2021) Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758. Cited by: §1, §2.
  • L. Chizat and F. Bach (2018) A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956 8. Cited by: §2.
  • F. R. Chung (1997) Spectral graph theory. Vol. 92, American Mathematical Soc.. Cited by: §2.
  • G. Cohen, S. Afshar, J. Tapson, and A. Van Schaik (2017) EMNIST: extending mnist to handwritten letters. In 2017 international joint conference on neural networks (IJCNN), pp. 2921–2926. Cited by: §4.
  • C. Cortes, M. Mohri, and A. Rostamizadeh (2010) Two-stage learning kernel algorithms. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pp. 239–246. Cited by: §2.
  • N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola (2001) On kernel-target alignment. Advances in neural information processing systems 14. Cited by: §2.
  • D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman (2021) With a little help from my friends: nearest-neighbor contrastive learning of visual representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9588–9597. Cited by: §2.
  • M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim (2014) Do we need hundreds of classifiers to solve real world classification problems?. The journal of machine learning research 15 (1), pp. 3133–3181. Cited by: §4.
  • J. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, et al. (2020) Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems 33, pp. 21271–21284. Cited by: §1, §2.
  • J. Z. HaoChen, C. Wei, A. Gaidon, and T. Ma (2021) Provable guarantees for self-supervised deep learning with spectral contrastive loss. Advances in Neural Information Processing Systems 34, pp. 5000–5011. Cited by: §2, §2, §3, §3.3.
  • J. Z. HaoChen, C. Wei, A. Kumar, and T. Ma (2022) Beyond separability: analyzing the linear transferability of contrastive representations to related subpopulations. arXiv preprint arXiv:2204.02683. Cited by: §2, §3.1.
  • C. R. Harris, K. J. Millman, S. J. Van Der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, et al. (2020) Array programming with numpy. Nature 585 (7825), pp. 357–362. Cited by: §4.2.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9729–9738. Cited by: §1.
  • S. C. Hoi, R. Jin, and M. R. Lyu (2007) Learning nonparametric kernel matrices from pairwise constraints. In Proceedings of the 24th international conference on Machine learning, pp. 361–368. Cited by: §2.
  • H. Huang, M. Broughton, M. Mohseni, R. Babbush, S. Boixo, H. Neven, and J. R. McClean (2021) Power of data in quantum machine learning. Nature Communications 12 (1). Cited by: §B.1, Theorem B.1, §3.3.
  • A. Jacot, F. Gabriel, and C. Hongler (2018) Neural tangent kernel: convergence and generalization in neural networks. Advances in neural information processing systems 31. Cited by: §C.2, §2.
  • J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein (2017) Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165. Cited by: §2, §5.
  • J. Lee, S. Schoenholz, J. Pennington, B. Adlam, L. Xiao, R. Novak, and J. Sohl-Dickstein (2020) Finite versus infinite neural networks: an empirical study. Advances in Neural Information Processing Systems 33, pp. 15156–15172. Cited by: §5.
  • J. D. Lee, Q. Lei, N. Saunshi, and J. Zhuo (2021) Predicting what you already know helps: provable self-supervised learning. Advances in Neural Information Processing Systems 34, pp. 309–323. Cited by: §2.
  • R. Meir and T. Zhang (2003) Generalization error bounds for bayesian mixture algorithms. Journal of Machine Learning Research 4 (Oct), pp. 839–860. Cited by: §B.1, §3.3.
  • M. Mohri, A. Rostamizadeh, and A. Talwalkar (2018) Foundations of machine learning. MIT press. Cited by: §B.1, §B.1, Theorem B.2, Lemma B.3, §3.3.
  • R. M. Neal (1996) Priors for infinite networks. In Bayesian Learning for Neural Networks, pp. 29–53. Cited by: §2, §5.
  • R. Novak, L. Xiao, J. Hron, J. Lee, A. A. Alemi, J. Sohl-Dickstein, and S. S. Schoenholz (2020) Neural tangents: fast and easy infinite neural networks in python. In International Conference on Learning Representations. Cited by: §C.2.
  • A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §1, §2.
  • B. Schölkopf, R. Herbrich, and A. J. Smola (2001) A generalized representer theorem. In International conference on computational learning theory, pp. 416–426. Cited by: Theorem A.1, §3.
  • M. Seleznova and G. Kutyniok (2022) Analyzing finite neural networks: can we trust neural tangent kernel theory?. In Mathematical and Scientific Machine Learning, pp. 868–895. Cited by: §5.
  • I. Steinwart and A. Christmann (2008) Support vector machines. Springer Science & Business Media. Cited by: §3.3.
  • J. F. Sturm (1999) Using sedumi 1.02, a matlab toolbox for optimization over symmetric cones. Optimization methods and software 11 (1-4), pp. 625–653. Cited by: §A.4.
  • V. Vapnik (1999) The nature of statistical learning theory. Springer Science & Business Media. Cited by: §B.1.
  • S. Wang, X. Yu, and P. Perdikaris (2022) When and why pinns fail to train: a neural tangent kernel perspective. Journal of Computational Physics 449, pp. 110768. Cited by: §2.
  • C. Wei, K. Shen, Y. Chen, and T. Ma (2020) Theoretical analysis of self-training with deep networks on unlabeled data. arXiv preprint arXiv:2010.03622. Cited by: §2.
  • Z. Wen and Y. Li (2021) Toward understanding the feature learning process of self-supervised contrastive learning. In International Conference on Machine Learning, pp. 11112–11122. Cited by: §2.
  • C. K. Williams and C. E. Rasmussen (2006) Gaussian processes for machine learning. Vol. 2, MIT press Cambridge, MA. Cited by: §2, §3.3.
  • H. Xia, S. C. Hoi, R. Jin, and P. Zhao (2013) Online multiple kernel similarity learning for visual search. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (3), pp. 536–549. Cited by: §2.
  • G. Yang, E. J. Hu, I. Babuschkin, S. Sidor, X. Liu, D. Farhi, N. Ryder, J. Pachocki, W. Chen, and J. Gao (2022) Tensor programs v: tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466. Cited by: §5.
  • G. Yang (2019) Scaling limits of wide neural networks with weight sharing: gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760. Cited by: §2.
  • L. Yang and R. Jin (2006) Distance metric learning: a comprehensive survey. Michigan State University 2 (2), pp. 4. Cited by: §2.
  • M. Ye, X. Zhang, P. C. Yuen, and S. Chang (2019) Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6210–6219. Cited by: §1.
  • D. Yeung and H. Chang (2007) A kernel approach for semisupervised metric learning. IEEE Transactions on Neural Networks 18 (1), pp. 141–149. Cited by: §2.
  • J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny (2021) Barlow twins: self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pp. 12310–12320. Cited by: §1, §2.
  • J. Zhuang, J. Wang, S. C. Hoi, and X. Lan (2011) Unsupervised multiple kernel learning. In Asian Conference on Machine Learning, pp. 129–144. Cited by: §2.

Appendix A Deferred proofs

A.1 Representer theorem

Theorem A.1 (Representer theorem for self-supervised tasks; adapted from Schölkopf et al. (2001)).

Let $x_i$ for $i \in [N]$ be elements of a dataset of size $N$, let $k$ be a kernel function with corresponding map $\phi$ into the RKHS $\mathcal{H}$, and let $g$ be any strictly increasing function. Let $W$ be a linear function mapping inputs $\phi(x)$ to their corresponding representations. Given a regularized loss function of the form

(15)

where the first term is an error function that depends on the representations of the dataset, the minimizer of this loss function lies in the span of the training points $\{\phi(x_i)\}_{i=1}^N$, i.e., for any $x$:

$W^\star \phi(x) = \sum_{i=1}^{N} \alpha_i\, k(x_i, x) \qquad (16)$
Proof.

Decompose $W$ into a component in the span of the data and an orthogonal remainder, $W = \sum_{i=1}^N \alpha_i\, \phi(x_i) + V$, where $\langle V, \phi(x_i)\rangle_{\mathcal{H}} = 0$ for all $i$. For a loss function of the form listed above, we have

(17)

where, in the terms in the sum, we used the property that $V$ is orthogonal to the span of the data. For the regularizer term, we note that

(18)

Therefore, strictly enforcing $V = 0$ minimizes the regularizer while leaving the rest of the cost function unchanged. ∎

As a consequence of the above, all optimal solutions must have support over the span of the data. This directly results in the statement shown in Proposition 3.1.

A.2 Closed form non-contrastive loss

Throughout this section, for simplicity, we work with the finite-dimensional quantities defined below. With slight abuse of notation, we denote $\Phi$ as a matrix whose $i$-th row contains the features $\phi(x_i)$ of $x_i$:

(19)

Note that if the RKHS is infinite dimensional, one can apply the general form of the solution as shown in Proposition 3.1 to reframe the problem into the finite dimensional setting below. As a reminder, we aim to minimize the VICReg cost function:

(20)

By applying the definition of the Frobenius norm and some algebra, we obtain:

(21)

The above formulation has the same optimum as the following optimization problem:

(22)

Since the matrix involved is a projector and the all-ones vector lies in the kernel of the graph Laplacian $L$, we can solve the above by employing the Eckart–Young theorem and matching the $d$-dimensional eigenspace of the solution with the top eigenspace of the target matrix.

One must be careful in choosing this optimum, as the Gram matrix of the representations can only have nonnegative eigenvalues. Therefore, the optimum is achieved by choosing the optimal $W$ to project the data onto this eigenspace as

(23)

where we write the eigendecomposition of the target matrix as $Q \Lambda Q^\top$, $Q_{:d}$ is the matrix consisting of the eigenvectors associated with the first $d$ eigenvalues, and $\Lambda_{:d}$ denotes the top left $d \times d$ block of $\Lambda$. Also in the above, the pseudo-inverse of the kernel matrix appears. If the diagonal matrix $\Lambda$ contains negative entries, which can only happen when $\lambda$ is set to a large value, then the corresponding entries of the construction are undefined; here, the optimal choice is to set those entries to zero. In practice, this can be avoided by setting $\lambda$ to be no larger than two times the degree of the graph.

Note that since $W$ only appears in the cost function through the Gram matrix of the representations, the solution above is only unique up to an orthogonal transformation. Furthermore, the rank of the solution is at most $N$, so a better optimum cannot be achieved by increasing the output dimension of the linear transformation beyond $N$. To see that this produces the optimal induced kernel, we simply plug in the optimal $W$:

(24)

Now, it needs to be shown that this is the minimum-norm optimizer of the optimization problem. To show this, we analyze the following semi-definite program, which is equivalent since the cost function only depends on positive semi-definite matrices of the form $B = M^\top M$:

(25)

To lighten notation, we use the shorthand introduced above. The above can easily be derived from Equation 22. This has corresponding dual

(26)

The optimal primal can be obtained from the closed-form solution above, and the optimal dual can be similarly calculated. A straightforward calculation shows that the optimal values of the primal and dual formulations are equal for the given solutions. We now check whether the chosen solutions of the primal and dual satisfy the KKT optimality conditions (Boyd et al., 2004):

(27)

The primal feasibility and dual feasibility criteria are straightforward to check. For complementary slackness, we note that

(28)

In the above, we used the fact that the matrix involved is a projector and the optimal solution is unchanged by that projection. This completes the proof of optimality.

A.3 Contrastive loss

For the contrastive loss, we follow a similar approach as above to find the minimum-norm solution that obtains the optimal representation. Note that the loss function contains a term that is linear in the Gram matrix of the representations. Since the Gram matrix is positive semi-definite, this term is optimized when it matches the positive eigenspace of the target matrix, denoted by the projection $(\cdot)_+$. Enumerating the eigenvectors of the target matrix with their corresponding eigenvalues, the projection retains only those with positive eigenvalues. More generally, if the dimension of the representation is restricted such that $d < N$, then we abuse notation and define

(29)

where the eigenvalues are sorted in descending order.

To find the minimum RKHS norm solution, we have to solve an SDP similar to Equation 25:

(30)

This has corresponding dual

(31)

The optimal primal and dual solutions can be written in closed form, and directly plugging them in shows that the optimal values of the primal and dual formulations are equal. As before, we now check whether the chosen solutions of the primal and dual satisfy the KKT optimality conditions (Boyd et al., 2004):

(32)

The primal feasibility and dual feasibility criteria are straightforward to check. For complementary slackness, we note that

(33)

In the above, we used the fact that the matrix involved is a projection onto the row space of the solution. This completes the proof of optimality.

A.4 Optimization via semi-definite program

In general scenarios, Proposition 3.1 gives a prescription for calculating the optimal induced kernel for more complicated optimization tasks, since we note that the optimal induced kernel must be of the form below:

$k_{\mathrm{ind}}(x, x') = k_x B\, k_{x'}^\top \qquad (34)$

where $k_x$ is a row vector whose $i$-th entry equals the kernel between $x$ and the $i$-th datapoint and $B$ is a positive semi-definite matrix.

For example, such a scenario arises when one wants to apply the loss function across batches of data. To frame this as an optimization statement, assume we have $N$ datapoints split into $T$ batches of size $N_b$ each. We denote the $i$-th datapoint within batch $k$ as $x_i^{(k)}$. As before, $x_i$ denotes the $i$-th datapoint across the whole dataset. We define the following variables:

  • $K \in \mathbb{R}^{N \times N}$ (kernel matrix over the complete dataset including all batches) where $K_{ij} = k(x_i, x_j)$

  • $K_k \in \mathbb{R}^{N \times N_b}$ (kernel matrix between the complete dataset and the dataset of batch $k$) where $(K_k)_{ij} = k(x_i, x_j^{(k)})$; similarly, $K_k^\top$ is simply the transpose of $K_k$

  • $A_k$ is the adjacency matrix for inputs in batch $k$ with corresponding graph Laplacian $L_k$

In what follows, we denote the representation dimension as $d$.
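
For concreteness, these batch-level quantities can be assembled as follows; the RBF kernel, the toy pairwise adjacency within each batch, and all variable names are our own illustrative choices. The resulting arrays match the inputs expected by the convex-program sketch in Section 3.1.

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
T, N_b = 4, 16                               # T batches of size N_b
X = rng.normal(size=(T * N_b, 5))            # full SSL dataset, N = T * N_b points
batches = np.split(np.arange(T * N_b), T)    # index sets of each batch

K = rbf(X, X)                                # kernel matrix over the complete dataset
K_batch = [rbf(X, X[idx]) for idx in batches]   # N x N_b blocks, one per batch

A_batch = []
for idx in batches:                          # pairwise adjacency within each batch
    A_k = np.zeros((N_b, N_b))
    for i in range(0, N_b, 2):               # pair consecutive points (toy choice)
        A_k[i, i + 1] = A_k[i + 1, i] = 1.0
    A_batch.append(A_k)

L_batch = [np.diag(A_k.sum(1)) - A_k for A_k in A_batch]   # per-batch graph Laplacians
```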

Non-contrastive loss function

In the non-contrastive setting, we consider a regularized version of the batched loss of Equation 4. Applying the reduction of the loss function in Equation 22, we consider the following loss function where we want to find the minimizer $B$:

(35)

where $\beta$ is a hyperparameter. The $\beta$ term regularizes the RKHS norm of the resulting solution. For simplicity, we use the same shorthand as before. Taking the limit $\beta \to 0^+$ and enforcing a representation of dimension $d$, the loss function above is minimized when we obtain the optimal representation, i.e., we must have that