For self-supervised learning, Rationality implies generalization, provably

10/16/2020 ∙ by Yamini Bansal, et al. ∙ Boaz Barak Harvard University 0

We prove a new upper bound on the generalization gap of classifiers that are obtained by first using self-supervision to learn a representation r of the training data, and then fitting a simple (e.g., linear) classifier g to the labels. Specifically, we show that (under the assumptions described below) the generalization gap of such classifiers tends to zero if 𝖢(g) ≪ n, where 𝖢(g) is an appropriately-defined measure of the simple classifier g's complexity, and n is the number of training samples. We stress that our bound is independent of the complexity of the representation r. We do not make any structural or conditional-independence assumptions on the representation-learning task, which can use the same training dataset that is later used for classification. Rather, we assume that the training procedure satisfies certain natural noise-robustness (adding a small amount of label noise causes a small degradation in performance) and rationality (getting the wrong label is not better than getting no label at all) conditions that hold widely across many standard architectures. We show that our bound is non-vacuous for many popular representation-learning based classifiers on CIFAR-10 and ImageNet, including SimCLR, AMDIM and MoCo.




1 Introduction

The current standard approach for classification is “end-to-end supervised learning”, where one fits a complex classifier (e.g., a deep neural network) to the given training set (Tan & Le, 2019; He et al., 2016). However, modern classifiers are heavily over-parameterized and, as demonstrated by Zhang et al. (2017), can fit 100% of their training set even when given random labels as inputs (in which case test performance is no better than chance). Hence, the training performance of such methods is by itself no indication of their performance on new unseen test points.

In this work, we study a different class of supervised learning procedures that have recently attracted significant interest. These classifiers are obtained by: (i) performing pre-training with a self-supervised task (i.e., without labels) to obtain a complex representation of the data points, and then (ii) fitting a simple (e.g., linear) classifier on the representation and the labels. Such “Self-Supervised + Simple” (SSS for short) algorithms are commonly used in natural language processing tasks (Devlin et al., 2018; Brown et al., 2020), and have recently found uses in other domains as well (Baevski et al., 2020; Ravanelli et al., 2020; Liu et al., 2019).¹

¹ In this work we focus only on algorithms that learn a representation, “freeze” it, and then perform classification using a simple classifier. We do not consider algorithms that “fine tune” the entire representation.

Compared to standard “end-to-end supervised learning”, SSS algorithms have several practical advantages. In particular, SSS algorithms can incorporate additional unlabeled data, the representation obtained can be useful for multiple downstream tasks, and they can have improved out-of-distribution performance (Hendrycks et al., 2019). Moreover, recent works show that even without additional unlabeled data, SSS algorithms can get close to state-of-the-art accuracy in several classification tasks (Chen et al., 2020b; He et al., 2020; Misra & Maaten, 2020; Tian et al., 2019). For instance, SimCLRv2 (Chen et al., 2020b) achieves top-1 performance on ImageNet with a variant of ResNet-152 that is on par with the end-to-end supervised accuracy of this architecture.

We show that SSS algorithms have another advantage over standard supervised learning—they often have a small generalization gap between their train and test accuracy, and we prove non-vacuous bounds on this gap. We stress that SSS algorithms use over-parameterized models to extract the representation, and reuse the same training data to learn a simple classifier on this representation. Thus, the final classifier they produce has high complexity by most standard measures and the resulting representation could “memorize” the training set. Consequently, it is not a priori evident that their generalization gap will be small.

Our bound is obtained by first noting that the generalization gap of every training algorithm is bounded by the sum of three quantities, which we name the Robustness gap, Rationality gap, and Memorization gap (we call this the RRM bound, see Fact 2). We now describe these gaps at a high level, deferring the formal definitions to Section 2. All three gaps involve comparison with a setting where we inject label noise by replacing a small fraction of the labels with random values.

The robustness gap corresponds to the amount by which training performance degrades by noise injection. That is, it equals the difference between the standard expected training accuracy (with no label noise) and the expected training accuracy in the noisy setting; in both cases, we measure accuracy with respect to the original (uncorrupted) labels. The robustness gap is nearly always small, and sometimes provably so (see Section 4).

The rationality gap corresponds to the difference between performance on the noisy training samples (on which the training algorithm gets the wrong label) and test samples (on which it doesn’t get any label at all), again with respect to uncorrupted labels. An optimal Bayesian procedure would have zero rationality gap, and we show that this gap is typically zero or small in practice.


Figure 1: Empirical RRM bound. The components of the RRM bound, as well as the upper bound of Theorem 1, for a variety of SSS models on the CIFAR-10 dataset with label noise $\eta$. Each vertical line corresponds to a single model (architecture + self-supervised task + fitting algorithm) and plots the RRM bound for this model. The green component corresponds to robustness, yellow to rationality, and red to memorization. The horizontal axis is the generalization gap, so the RRM bound always lies on or above the dashed diagonal line. A negative generalization gap can occur in algorithms that use augmentation. The blue dots correspond to the bound on the generalization gap obtained by replacing the memorization gap with the bound of Theorem 1. See Sections 5 and B.3 for more information.

The memorization gap, which often accounts for the lion’s share of the generalization gap, corresponds to the difference in the noisy experiment between the training accuracy on the entire train set and the training accuracy on the samples that received the wrong label (both measured with respect to uncorrupted labels). The memorization gap can be thought of as quantifying the extent to which the classifier can “memorize” noisy labels, or act differently on the noisy points compared to the overall train set. The memorization gap is large in standard “end-to-end supervised training”. In contrast, our main theoretical result is that for SSS algorithms, the memorization gap is small if the simple classifier has small complexity, independently of the complexity of the representation. As long as the simple classifier is under-parameterized (i.e., its complexity is asymptotically smaller than the sample size), our bound on the memorization gap tends to zero. When combined with small rationality and robustness, we get concrete non-vacuous generalization bounds for various SSS algorithms on the CIFAR-10 and ImageNet datasets (see Figures 1 and 5).

Our results.

In a nutshell, our contributions are the following:

  1. Our main theoretical result (Theorem 1) is that the memorization gap of an SSS algorithm is bounded by $O(\sqrt{\mathsf{C}/n})$, where $\mathsf{C}$ is the complexity of the simple classifier produced in the “simple fit” stage and $n$ is the number of training samples. This bound is oblivious to the complexity of the representation produced in the pre-training and does not make any assumptions on the relationship between the representation learning method and the supervised learning task.

  2. We complement this result with an empirical study of the robustness, rationality, and memorization gaps. We show that the RRM bound is typically non-vacuous, and in fact, often close to tight, for a variety of SSS algorithms on the CIFAR-10 and ImageNet datasets, including SimCLR (which achieves test errors close to its supervised counterparts). Moreover, in our experimental study, we demonstrate that the generalization gap for SSS algorithms is substantially smaller than their fully-supervised counterparts. See Figures 1 and 5 for sample results and Section 5 for more details.

  3. We demonstrate that replacing the memorization gap with the upper bound of Theorem 1 yields a non-vacuous generalization bound for a variety of SSS algorithms on CIFAR-10 and ImageNet. Moreover, this bound gets tighter with more data augmentation.

  4. The robustness gap is often negligible in practice, and sometimes provably so (see Section 4). We show that the rationality gap is small in practice as well. We also prove that a positive rationality gap corresponds to “leaving performance on the table”, in the sense that we can transform a learning procedure with a large rationality gap into a procedure with better test performance (Theorem 4).

One way to interpret our results is that instead of obtaining generalization bounds under statistical assumptions on the distribution, we assume that the rationality and robustness gaps are at most some value (e.g., 5%). Readers might worry that we are “assuming away the difficulty”, but small rationality and robustness gaps do not by themselves imply a small generalization gap. Indeed, these conditions widely hold across many natural algorithms (including not just SSS but also end-to-end supervised algorithms) with both small and large generalization gaps. As discussed in Section 4, apart from the empirical evidence, there are also theoretical justifications for small robustness and rationality. See Remark 4 and Appendix C for examples showing the necessity of these conditions.

1.1 Related Work.

Our work analyses the generalization gap for supervised classifiers that first use self-supervision to learn a representation. We provide a brief exposition of the various types of self-supervised methods in Section 5, and a more detailed discussion in Appendix B.1.

A variety of prior works have provided generalization bounds for supervised deep learning (e.g., Neyshabur et al. (2017); Bartlett et al. (2017); Dziugaite & Roy (2017); Neyshabur et al. (2018); Golowich et al. (2018); Cao & Gu (2019), and references therein). However, many of these bounds provide vacuous guarantees for modern architectures (such as the ones considered in this paper) that have the capacity to memorize their entire training set (Zhang et al., 2017). While some non-vacuous bounds are known (e.g., Zhou et al. (2019) gave a 96.5% bound on the error of MobileNet on ImageNet), Belkin et al. (2019) and Nagarajan & Kolter (2019) have highlighted some general barriers for bounding the generalization gaps of over-parameterized networks that are trained end-to-end. For similar reasons, standard approaches such as Rademacher complexity cannot directly bound SSS algorithms’ generalization gap (see Remark 4).

Recently, Saunshi et al. (2019) and Lee et al. (2020) gave generalization bounds for classifiers based on self-supervision. The two works considered special cases of SSS algorithms, such as contrastive learning and pre-text tasks. Both works make strong statistical assumptions of (exact or approximate) conditional independence relating the pre-training and classification tasks. For example, if the pre-training task is obtained by splitting a given image $x$ into two pieces $(x_1, x_2)$ and predicting $x_2$ from $x_1$, then the results of Lee et al. (2020) require $x_1$ and $x_2$ to be approximately independent conditioned on their class $y$. However, in many realistic cases, the two parts of the same image will share a significant amount of information not explained by the label.

Our work applies to general SSS algorithms without such statistical assumptions, at the expense of assuming bounds on the robustness and rationality gaps. There have been works providing rigorous bounds on the robustness gap or related quantities (see Section 4). However, as far as we know, the rationality gap has not been explicitly defined or studied before. To bound the memorization gap, we use information-theoretic complexity measures. Various information-theoretic quantities have been proposed to bound the generalization gap in previous work (see Steinke & Zakynthinou (2020) and references therein). While these works bound generalization directly, we bound a different quantity: the memorization gap in the RRM decomposition.

1.2 Paper Organization

Section 2 contains formal definitions and statements of our results. Section 4 provides an overview of prior work and our new results on the three gaps of the RRM bound. In Section 5, we describe our experimental setup and detail our empirical results. Section 7 concludes the paper and discusses important open questions. Section 3 contains the proof of Theorem 1, while Section 6 contains the proof of Theorem 4. Appendix B fully details our experimental setup.²

² We provide our code and data in:


Notation. We use capital letters (e.g., $X$) for random variables, lower case letters (e.g., $x$) for a single value, and bold font (e.g., $\mathbf{x}$) for tuples (which will typically have dimension corresponding to the number of samples, denoted by $n$). We use $x_i$ for the $i$-th element of the tuple $\mathbf{x}$. We use calligraphic letters (e.g., $\mathcal{X}$, $\mathcal{D}$) for both sets and distributions.

2 Formal statement of results

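A training procedure is a (possibly randomized) algorithm $T$ that takes as input a train set $(\mathbf{x}, \mathbf{y}) = (x_i, y_i)_{i \in [n]} \in (\mathcal{X} \times \mathcal{Y})^n$ and outputs a classifier $f : \mathcal{X} \to \mathcal{Y}$. For our current discussion, we make no assumptions on the type of classifier output or the way that it is computed. We denote the distribution over training sets in $(\mathcal{X} \times \mathcal{Y})^n$ by $\mathcal{D}_{\text{train}}$ and the distribution over test samples in $\mathcal{X} \times \mathcal{Y}$ by $\mathcal{D}_{\text{test}}$.³ The generalization gap of a training algorithm $T$ with respect to a distribution pair $\mathcal{D} = (\mathcal{D}_{\text{train}}, \mathcal{D}_{\text{test}})$ is the expected difference between its train accuracy (which we denote by $\text{Train}_{\mathcal{D},T}$) and its test performance (which we denote by $\text{Test}_{\mathcal{D},T}$). We will often drop subscripts such as $\mathcal{D}, T$ when they can be inferred from the context. We will also consider the $\eta$-noisy experiment, which involves computing the classifier $\tilde{f} = T(\mathbf{x}, \tilde{\mathbf{y}})$ where $\tilde{y}_i = y_i$ with probability $1 - \eta$ and is uniform over $\mathcal{Y}$ otherwise.

³ The train and test data often stem from the same distribution (i.e., $\mathcal{D}_{\text{train}} = \mathcal{D}_{\text{test}}^n$), but not always (e.g., it does not hold if we use data augmentation). $\mathcal{D}_{\text{test}}$ enters the RRM bound only via the rationality gap, so the assumption of small rationality may be affected if $\mathcal{D}_{\text{train}} \neq \mathcal{D}_{\text{test}}^n$, but the RRM bound still holds.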
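The label corruption used in the noisy experiment can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `noisy_labels`, the integer encoding of labels as `0, ..., num_classes - 1`, and the value `eta=0.05` in the demo are all assumptions made for the example.

```python
import random


def noisy_labels(labels, eta, num_classes, rng):
    """Keep each label with probability 1 - eta; otherwise replace it by a
    label drawn uniformly from {0, ..., num_classes - 1}.

    The uniform draw may coincide with the original label, so the fraction
    of labels that actually change is slightly below eta.
    """
    noisy = []
    for y in labels:
        if rng.random() < eta:
            noisy.append(rng.randrange(num_classes))
        else:
            noisy.append(y)
    return noisy


# Illustrative usage with hypothetical data (10 classes, 10,000 samples).
rng = random.Random(0)
clean = [i % 10 for i in range(10_000)]
noisy = noisy_labels(clean, eta=0.05, num_classes=10, rng=rng)
changed = sum(a != b for a, b in zip(clean, noisy)) / len(clean)
print(f"fraction of labels actually changed: {changed:.3f}")
```

Accuracy in the noisy experiment is always measured against the clean labels, so `clean` must be kept alongside `noisy`.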

Our starting point is the following observation which we call the RRM bound (for Robustness, Rationality, and Memorization). The quantities appearing in it are defined in Table 1.

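[RRM bound] For every noise parameter $\eta > 0$, training procedure $T$ and distribution $\mathcal{D} = (\mathcal{D}_{\text{train}}, \mathcal{D}_{\text{test}})$ over training sets and test samples, the RRM bound with respect to $T$ and $\mathcal{D}$ is,

$$\text{Train} - \text{Test} \;\le\; \underbrace{\big[\text{Train} - \text{Train}(\eta)\big]_+}_{\text{Robustness gap}} + \underbrace{\big[\text{NTrain}(\eta) - \text{Test}\big]_+}_{\text{Rationality gap}} + \underbrace{\big[\text{Train}(\eta) - \text{NTrain}(\eta)\big]_+}_{\text{Memorization gap}}$$

where we denote $x_+ = \max(x, 0)$.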

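Quantity | Training | Measurement
$\text{Test}_{\mathcal{D},T}$ | $f = T(\mathbf{x}, \mathbf{y})$ for $(\mathbf{x}, \mathbf{y}) \sim \mathcal{D}_{\text{train}}$ | $\Pr[f(x) = y]$ for $(x, y) \sim \mathcal{D}_{\text{test}}$.
$\text{Train}_{\mathcal{D},T}$ | $f = T(\mathbf{x}, \mathbf{y})$ for $(\mathbf{x}, \mathbf{y}) \sim \mathcal{D}_{\text{train}}$ | $\Pr[f(x_i) = y_i]$ for train sample $(x_i, y_i)$.
$\text{Train}_{\mathcal{D},T}(\eta)$ | $\tilde{f} = T(\mathbf{x}, \tilde{\mathbf{y}})$ for $(\mathbf{x}, \mathbf{y}) \sim \mathcal{D}_{\text{train}}$, $\tilde{y}_i = y_i$ w.p. $1 - \eta$, uniform o/w | $\Pr[\tilde{f}(x_i) = y_i]$ for train sample $(x_i, \tilde{y}_i)$, where $y_i$ is the original label for $x_i$.
$\text{NTrain}_{\mathcal{D},T}(\eta)$ | $\tilde{f} = T(\mathbf{x}, \tilde{\mathbf{y}})$ for $(\mathbf{x}, \mathbf{y}) \sim \mathcal{D}_{\text{train}}$, $\tilde{y}_i = y_i$ w.p. $1 - \eta$, uniform o/w | $\Pr[\tilde{f}(x_i) = y_i \mid \tilde{y}_i \neq y_i]$ for a corrupted train sample $x_i$, where $y_i$ is the original label for $x_i$.

Table 1: The measurements of accuracy in the RRM bound, all with respect to a training algorithm $T$, distributions $(\mathcal{D}_{\text{train}}, \mathcal{D}_{\text{test}})$ and parameter $\eta > 0$. The robustness gap is $\max(\text{Train} - \text{Train}(\eta), 0)$, the rationality gap is $\max(\text{NTrain}(\eta) - \text{Test}, 0)$, and the memorization gap is $\max(\text{Train}(\eta) - \text{NTrain}(\eta), 0)$.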

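The RRM bound is but an observation: the generalization gap telescopes into the three differences (train minus noisy train, noisy train minus accuracy on corrupted samples, and accuracy on corrupted samples minus test), and the bound then follows directly from the fact that $\max(x, 0) \ge x$ for every $x$. However, it is a very useful one. As mentioned above, for natural algorithms, we expect both the robustness and rationality components of this gap to be small, and hence the most significant component is the memorization gap. In this work we show a rigorous upper bound on this gap for SSS models.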
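As a small worked illustration of this decomposition, the sketch below computes the three gaps from four accuracies: clean train accuracy, train accuracy in the noisy experiment, accuracy on the corrupted train samples in the noisy experiment, and test accuracy. The function name and the numbers in the demo are hypothetical and not taken from the paper's experiments.

```python
def rrm_decomposition(train, train_noisy, ntrain_noisy, test):
    """Return the three RRM gaps, their sum, and the generalization gap.

    All arguments are accuracies in [0, 1], measured against clean labels.
    """
    def pos(v):
        # Positive part: max(v, 0).
        return max(v, 0.0)

    gaps = {
        "robustness": pos(train - train_noisy),
        "rationality": pos(ntrain_noisy - test),
        "memorization": pos(train_noisy - ntrain_noisy),
    }
    gaps["rrm_bound"] = (
        gaps["robustness"] + gaps["rationality"] + gaps["memorization"]
    )
    gaps["generalization_gap"] = train - test
    return gaps


# Hypothetical accuracies, chosen only to illustrate the arithmetic.
result = rrm_decomposition(
    train=0.95, train_noisy=0.93, ntrain_noisy=0.80, test=0.78
)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

When all three differences are non-negative, the bound is tight: the sum of the gaps equals the generalization gap. Taking positive parts can only make the right-hand side larger.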

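We define formally an SSS Algorithm to be a training procedure $T = (T_{\text{pre}}, T_{\text{fit}})$ that is obtained by (1) first training $T_{\text{pre}}$ on $\mathbf{x} \in \mathcal{X}^n$ to get a representation $r : \mathcal{X} \to \mathcal{R}$ and then (2) training $T_{\text{fit}}$ on $(r(\mathbf{x}), \mathbf{y})$ for $\mathbf{y} \in \mathcal{Y}^n$ to obtain a classifier $g : \mathcal{R} \to \mathcal{Y}$. The classifier output by $T$ is $f : \mathcal{X} \to \mathcal{Y}$ defined as $f(x) = g(r(x))$. Our main theoretical result is the following.

[Memorization gap bound] For every SSS Algorithm $T = (T_{\text{pre}}, T_{\text{fit}})$, noise parameter $\eta > 0$ and distribution $\mathcal{D}$ over $\mathcal{X}^n \times \mathcal{Y}^n$:

$$\text{Memorization gap}(T) = \text{Train}_{T,\mathcal{D}}(\eta) - \text{NTrain}_{T,\mathcal{D}}(\eta) \;\le\; O\!\left(\sqrt{\frac{\mathsf{C}_\eta(T_{\text{fit}})}{n}} \cdot \frac{1}{\eta}\right)$$

where $\mathsf{C}_\eta(T_{\text{fit}})$ is a complexity measure of the second phase training procedure, which in particular is upper bounded by the number of bits required to describe the classifier $g$ (see Definition 2.1).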
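The two-phase structure of an SSS algorithm can be sketched in a few lines. This is only a toy under stated assumptions: per-feature standardization stands in for self-supervised pre-training, and a nearest-centroid rule stands in for the simple classifier. Neither is a method used in the paper, and the names `t_pre`, `t_fit` and `sss_train` are illustrative.

```python
def t_pre(xs):
    """Label-free phase: learn a representation r from the inputs only.

    Stand-in for self-supervised pre-training: per-feature standardization.
    """
    dim = len(xs[0])
    means = [sum(x[d] for x in xs) / len(xs) for d in range(dim)]
    scales = [
        max(1e-12, (sum((x[d] - means[d]) ** 2 for x in xs) / len(xs)) ** 0.5)
        for d in range(dim)
    ]

    def r(x):
        return [(x[d] - means[d]) / scales[d] for d in range(dim)]

    return r


def t_fit(reps, ys):
    """Simple phase: fit a nearest-centroid classifier g on frozen features."""
    sums, counts = {}, {}
    for rep, y in zip(reps, ys):
        acc = sums.setdefault(y, [0.0] * len(rep))
        for d, v in enumerate(rep):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def g(rep):
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(rep, centroids[y])),
        )

    return g


def sss_train(xs, ys):
    r = t_pre(xs)                       # phase 1 never sees the labels
    g = t_fit([r(x) for x in xs], ys)   # phase 2 sees labels, not raw inputs
    return lambda x: g(r(x))            # final classifier f(x) = g(r(x))


xs = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
ys = [0, 0, 1, 1]
f = sss_train(xs, ys)
print([f(x) for x in xs])
```

The point of the split is that label noise can only enter through `t_fit`, which is why the bound depends on the complexity of that phase alone.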

2.1 Complexity measures

We now define three complexity measures, all of which can be plugged in as the measure $\mathsf{C}$ in Theorem 1. The first one, $\mathsf{C}^{\text{mdl}}$, is the minimum description length of a classifier. The other two measures, $\mathsf{C}^{\text{pc}}$ and $\mathsf{C}^{\text{dc}}$, are superficially similar to Rademacher complexity (cf. Bartlett & Mendelson (2002)) in the sense that they capture the ability of the hypothesis to correlate with random noise.

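[Complexity of training procedures] Let $T$ be a training procedure taking as input a set $(\mathbf{r}, \mathbf{y}) = \{(r_i, y_i)\}_{i=1}^{n} \in (\mathcal{R} \times \mathcal{Y})^n$ and outputting a classifier $g : \mathcal{R} \to \mathcal{Y}$, and let $\eta > 0$. For every training set $(\mathbf{r}, \mathbf{y})$, we define the following three complexity measures with respect to $\mathbf{r}, \mathbf{y}, \eta$:

  • The minimum description length of $T$ is defined as $\mathsf{C}^{\text{mdl}}_{\mathbf{r},\mathbf{y},\eta}(T) := H(g)$, where we consider the model $g$ as a random variable arising in the $\eta$-noisy experiment.⁴

  • The prediction complexity of $T$ is defined as $\mathsf{C}^{\text{pc}}_{\mathbf{r},\mathbf{y},\eta}(T) := \sum_{i=1}^{n} I(g(r_i); \tilde{y}_i)$, where the $\tilde{y}_i$’s are the labels obtained in the $\eta$-noisy experiment.

  • The (unconditional) deviation complexity of $T$ is defined as $\mathsf{C}^{\text{dc}}_{\mathbf{r},\mathbf{y},\eta}(T) := n \cdot I(g(r_i) - y_i \,;\, \tilde{y}_i - y_i)$, where the random variables above are taken over $i \sim [n]$ and subtraction is done modulo $|\mathcal{Y}|$, identifying $\mathcal{Y}$ with the set $\{0, \ldots, |\mathcal{Y}| - 1\}$.

⁴ The name “minimum description length” is justified by the operational definition of entropy relating it to the minimum amortized length of a prefix-free encoding of a random variable.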
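To make the deviation complexity concrete, here is a rough sketch of a plug-in estimate from a single run of the noisy experiment, pooling over the index i. It is an illustration only: the paper's own estimation procedure is described in its appendix and may differ, plug-in mutual-information estimates are biased on small samples, and the choice of natural logarithms (nats) is an assumption that affects constants. The function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate, in nats, of I(A; B) from a list of (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    marg_a = Counter(a for a, _ in pairs)
    marg_b = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log(c * n / (marg_a[a] * marg_b[b]))
        for (a, b), c in joint.items()
    )


def deviation_complexity(preds, noisy, clean, num_classes):
    """Estimate n * I(prediction deviation; label deviation), modulo num_classes."""
    deltas = [(p - y) % num_classes for p, y in zip(preds, clean)]
    noise = [(t - y) % num_classes for t, y in zip(noisy, clean)]
    return len(clean) * mutual_information(list(zip(deltas, noise)))


# Hypothetical data: every 20th label is shifted by one class.
clean = [i % 10 for i in range(200)]
noisy = [(y + 1) % 10 if i % 20 == 0 else y for i, y in enumerate(clean)]

ignores_noise = deviation_complexity(clean, noisy, clean, 10)  # predicts clean labels
memorizes = deviation_complexity(noisy, noisy, clean, 10)      # predicts noisy labels
print(ignores_noise, memorizes)
```

A classifier whose deviations carry no information about where the noise landed gets an estimate of zero, while one that reproduces the noisy labels gets n times the entropy of the noise.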

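Conditioned on $\mathbf{y}$ and the choice of the index $i$, the deviations $g(r_i) - y_i$ and $\tilde{y}_i - y_i$ determine the predictions $g(r_i)$ and noisy labels $\tilde{y}_i$, and vice versa. Hence we can think of $\mathsf{C}^{\text{dc}}$ as an “averaged” variant of $\mathsf{C}^{\text{pc}}$, where we make the choice of the index $i$ part of the sample space for the random variables. While we expect the two measures to be approximately close, the fact that $\mathsf{C}^{\text{dc}}$ takes $i$ into the sample space makes it easier to estimate this quantity in practice without using a large number of experiment repetitions (see Figure B.2 for convergence rates). The measure $\mathsf{C}^{\text{mdl}}$ is harder to evaluate in practice, as it requires finding the optimal compression scheme for the classifier.

Section 3 contains the full proof of Theorem 1. It is obtained by showing that: (i) for every $\mathbf{r}, \mathbf{y}, \eta$, and $T$ it holds that $\mathsf{C}^{\text{dc}}_{\mathbf{r},\mathbf{y},\eta}(T) \le \mathsf{C}^{\text{pc}}_{\mathbf{r},\mathbf{y},\eta}(T) \le \mathsf{C}^{\text{mdl}}_{\mathbf{r},\mathbf{y},\eta}(T)$, and (ii) for every SSS algorithm $T = (T_{\text{pre}}, T_{\text{fit}})$ and distribution $\mathcal{D} = (\mathcal{D}_{\text{train}}, \mathcal{D}_{\text{test}})$, the memorization gap of $T$ is at most

$$\sqrt{\frac{\mathsf{C}^{\text{dc}}_{T_{\text{pre}}(\mathbf{x}),\mathbf{y},\eta}(T_{\text{fit}})}{2n}} \cdot \frac{1}{\eta}. \qquad (1)$$

It is the quantity (1) that we compute in our experiments.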
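For a sense of scale, the sketch below evaluates a bound of the form sqrt(C / (2n)) / eta. That this is the exact form of Equation (1), including the constant 2, is an assumption here, and the complexity value, sample size and noise rate in the demo are hypothetical rather than measured.

```python
import math


def memorization_bound(complexity, n, eta):
    """Evaluate sqrt(complexity / (2 * n)) / eta.

    `complexity` plays the role of the deviation complexity of the simple
    classifier, `n` is the number of training samples, `eta` the noise rate.
    """
    if n <= 0 or not 0 < eta <= 1:
        raise ValueError("need n > 0 and 0 < eta <= 1")
    return math.sqrt(complexity / (2 * n)) / eta


# Hypothetical values, for illustration only.
bound = memorization_bound(complexity=50.0, n=50_000, eta=0.05)
print(f"bound on the memorization gap: {bound:.4f}")
```

Two features are visible from the formula: the bound shrinks like one over the square root of n, and a smaller noise rate makes it looser, so the choice of noise rate trades off against the robustness gap.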

3 Proof of Theorem 1

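We now prove Theorem 1. We start by relating our three complexity measures. The following theorem shows that $\mathsf{C}^{\text{dc}}$ is upper bounded by $\mathsf{C}^{\text{pc}}$, which in turn is bounded by the entropy of $g$.

[Relation of complexity measures] For every $\mathbf{r}, \mathbf{y}, \eta > 0$, and $T$

$$\mathsf{C}^{\text{dc}}_{\mathbf{r},\mathbf{y},\eta}(T) \;\le\; \mathsf{C}^{\text{pc}}_{\mathbf{r},\mathbf{y},\eta}(T) \;\le\; H(g)$$

where $g$ is the classifier output by $T$ (considered as a random variable).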


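Proof. Fix $T, \mathbf{r}, \mathbf{y}, \eta$. We get $\tilde{\mathbf{y}}$ by choosing i.i.d. random variables $N_1, \ldots, N_n$, each equalling $0$ with probability $1 - \eta$ and uniform otherwise, and letting $\tilde{y}_i = y_i + N_i \pmod{|\mathcal{Y}|}$.

We start by proving the second inequality $\mathsf{C}^{\text{pc}}_{\mathbf{r},\mathbf{y},\eta}(T) \le H(g)$. Let $g = T(\mathbf{r}, \tilde{\mathbf{y}})$ and let $\mathbf{p} = (g(r_1), \ldots, g(r_n))$ be the vector of predictions. Then,

$$\mathsf{C}^{\text{pc}}_{\mathbf{r},\mathbf{y},\eta}(T) = \sum_{i} I(p_i; \tilde{y}_i) = \sum_{i} I(p_i; N_i) \qquad (2)$$

with the last equality holding since for fixed $y_i$, $N_i$ determines $\tilde{y}_i$ and vice versa. However, since the full vector $\mathbf{p}$ contains only more information than $p_i$, the right-hand side of (2) is at most $\sum_{i} I(\mathbf{p}; N_i) \le I(\mathbf{p}; N_1, \ldots, N_n)$, using the fact that the $N_i$ random variables are independent (see Lemma A). For a fixed $\mathbf{r}$, the value of $\mathbf{p}$ is completely determined by $g$ and hence the entropy of $\mathbf{p}$ is at most $H(g)$, establishing the second inequality of the theorem.

We now turn to the first inequality $\mathsf{C}^{\text{dc}}_{\mathbf{r},\mathbf{y},\eta}(T) \le \mathsf{C}^{\text{pc}}_{\mathbf{r},\mathbf{y},\eta}(T)$. Let $\Delta_i = p_i - y_i \pmod{|\mathcal{Y}|}$. Then,

$$\mathsf{C}^{\text{pc}}_{\mathbf{r},\mathbf{y},\eta}(T) = \sum_{i} I(p_i; N_i) = \sum_{i} I(\Delta_i; N_i) \qquad (3)$$

since $p_i$ determines $\Delta_i$ and vice versa (given $\mathbf{y}$). But, since $N_j = N \mid i = j$ and $\Delta_j = \Delta \mid i = j$ (where $N = \tilde{y}_i - y_i$ and $\Delta = g(r_i) - y_i$ are the random variables from the definition of deviation complexity, with $i \sim [n]$), the right-hand side of (3) equals

$$\sum_{j} I(\Delta; N \mid i = j) = n \cdot I(\Delta; N \mid i) = n \cdot \big(H(N \mid i) - H(N \mid \Delta, i)\big). \qquad (4)$$

Since $N_1, \ldots, N_n$ are identically distributed, $H(N \mid i) = H(N)$, which means that the right-hand side of (4) equals

$$n \cdot \big(H(N) - H(N \mid \Delta, i)\big) \;\ge\; n \cdot \big(H(N) - H(N \mid \Delta)\big) = n \cdot I(\Delta; N)$$

with the inequality holding since on average conditioning reduces entropy. By definition $n \cdot I(\Delta; N) = \mathsf{C}^{\text{dc}}_{\mathbf{r},\mathbf{y},\eta}(T)$, establishing what we wanted to prove. ∎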

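The complexity measures $\mathsf{C}^{\text{pc}}$ and $\mathsf{C}^{\text{dc}}$ are defined with respect to a fixed train set $(\mathbf{r}, \mathbf{y})$, rendering them applicable for single training sets such as CIFAR-10 and ImageNet that arise in practice. If $\mathcal{R}$ is a distribution over $(\mathbf{r}, \mathbf{y})$, then we define the complexity measures $\mathsf{C}^{\text{pc}}$ and $\mathsf{C}^{\text{dc}}$ with respect to $\mathcal{R}$ as the average of the corresponding measure with respect to $(\mathbf{r}, \mathbf{y}) \sim \mathcal{R}$. We now restate Theorem 1: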

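[Theorem 1, restated] Let $T = (T_{\text{pre}}, T_{\text{fit}})$ be a training procedure obtained by first training $T_{\text{pre}}$ on $\mathbf{x} \in \mathcal{X}^n$ to obtain a representation $r : \mathcal{X} \to \mathcal{R}$ and then training $T_{\text{fit}}$ on $(r(\mathbf{x}), \mathbf{y})$ where $\mathbf{y} \in \mathcal{Y}^n$ to obtain a classifier $g : \mathcal{R} \to \mathcal{Y}$. Then, for every noise parameter $\eta > 0$ and distribution $\mathcal{D}_{\text{train}}$ over $(\mathcal{X} \times \mathcal{Y})^n$,

$$\text{Memorization gap}(T) = \text{Train}_{\mathcal{D}_{\text{train}},T}(\eta) - \text{NTrain}_{\mathcal{D}_{\text{train}},T}(\eta) \;\le\; \sqrt{\frac{\mathsf{C}^{\text{dc}}_{\mathcal{R},\eta}(T_{\text{fit}})}{2n}} \cdot \frac{1}{\eta}$$

where $\mathcal{R}$ is the distribution over $(\mathcal{R} \times \mathcal{Y})^n$ induced by $T_{\text{pre}}$ on $\mathcal{D}_{\text{train}}$.

Note that the bound on the right-hand side is expressed only in terms of the complexity of the second stage $T_{\text{fit}}$ and is independent of the complexity of $T_{\text{pre}}$. The crux of the proof is showing (close to) independence between the corrupted indices and the prediction deviation of $g$ resulting from the noise.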


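Proof. Let $(\mathbf{r}, \mathbf{y})$ be sampled by first drawing $(\mathbf{x}, \mathbf{y}) \sim \mathcal{D}_{\text{train}}$ over $(\mathcal{X} \times \mathcal{Y})^n$ and then applying $\mathbf{r} = r(\mathbf{x})$ where $r = T_{\text{pre}}(\mathbf{x})$. Consider the sample space of sampling $\tilde{\mathbf{y}}$ according to the $\eta$-noisy distribution with respect to $\mathbf{y}$, computing $g = T_{\text{fit}}(\mathbf{r}, \tilde{\mathbf{y}})$, and sampling $i \sim [n]$. We define the following two Bernoulli random variables over this sample space:

$$Z = \mathbb{1}_{\Delta = 0} = \begin{cases} 1 & g(r_i) = y_i \\ 0 & \text{otherwise} \end{cases} \qquad B = \mathbb{1}_{N \neq 0} = \begin{cases} 1 & \tilde{y}_i \neq y_i \\ 0 & \text{otherwise} \end{cases}$$

For a given $\mathbf{r}, \mathbf{y}$, since $Z$ is determined by $\Delta$ and $B$ is determined by $N$, $I(Z; B) \le I(\Delta; N) = \mathsf{C}^{\text{dc}}_{\mathbf{r},\mathbf{y},\eta}(T_{\text{fit}}) / n$. By Lemma A, for every Bernoulli random variables $B, Z$,

$$\big|\mathbb{E}[Z] - \mathbb{E}[Z \mid B = 1]\big| \;\le\; \sqrt{\tfrac{1}{2} I(Z; B)} \,\Big/\, \mathbb{E}[B].$$

And hence in our case (since $\mathbb{E}[B] = \eta$),

$$\mathbb{E}[Z] - \mathbb{E}[Z \mid B = 1] \;\le\; \sqrt{\frac{\mathsf{C}^{\text{dc}}_{\mathbf{r},\mathbf{y},\eta}(T_{\text{fit}})}{2n}} \cdot \frac{1}{\eta}.$$

But $\mathbb{E}[Z]$ corresponds to the probability that $g(r) = y$ for $(r, y)$ in the train set, while $\mathbb{E}[Z \mid B = 1]$ corresponds to this probability over the noisy samples. Hence the memorization gap is bounded by

$$\mathbb{E}_{(\mathbf{r},\mathbf{y}) \sim \mathcal{R}}\left[\sqrt{\frac{\mathsf{C}^{\text{dc}}_{\mathbf{r},\mathbf{y},\eta}(T_{\text{fit}})}{2n}} \cdot \frac{1}{\eta}\right] \;\le\; \frac{1}{\eta} \sqrt{\mathbb{E}_{(\mathbf{r},\mathbf{y}) \sim \mathcal{R}}\left[\frac{\mathsf{C}^{\text{dc}}_{\mathbf{r},\mathbf{y},\eta}(T_{\text{fit}})}{2n}\right]} = \sqrt{\frac{\mathsf{C}^{\text{dc}}_{\mathcal{R},\eta}(T_{\text{fit}})}{2n}} \cdot \frac{1}{\eta}$$

using the Jensen inequality and the concavity of square root for the first inequality. ∎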

4 The three gaps

We now briefly describe what is known and what we prove about the three components of the RRM bound. We provide some additional discussions in Appendix C, including “counter-examples” of algorithms that exhibit large values for each one of these gaps.




Figure 2: Robustness, Rationality, and Memorization for CIFAR-10. Each blue point is a different combination of (architecture + self-supervised task + fitting algorithm). Each red point is a different architecture trained end-to-end with supervision. A special marker denotes the two best models of each type (SSS and supervised). No augmentations were added. Label noise is injected at rate $\eta$. Details in Section B.3.

The robustness gap.

The robustness gap measures the decrease in training accuracy from adding noisy labels, measured with respect to the clean labels. The robustness gap and related notions such as noise stability or tolerance have been studied in various works (cf. Frénay & Verleysen (2013); Manwani & Sastry (2013)). Interpolating classifiers (with zero train error) satisfy $\text{Train}(\eta) \ge 1 - \eta$ and hence their robustness gap is at most $\eta$ (see left panel of Figure 2). In SSS algorithms, since the representation is learned without using labels, the injection of label noise only affects the simple classifier, which is often linear. Robustness guarantees for linear classifiers have been given previously by Rudin (2005). While proving robustness bounds is not the focus of this paper, we note in the appendix some simple bounds for least-squares minimization of linear classifiers and the (potentially inefficient) Empirical Risk Minimization algorithm (see Appendices D.1 and D.2). Empirically, we observe that the robustness gap of SSS algorithms is often significantly smaller than $\eta$ (see left panels of Figure 2 and Figure 3).

The rationality gap.

To build intuition for the rationality gap, consider the case where the inputs $x$ are images, and the label $y$ is either “cat” or “dog”. A positive rationality gap means that giving the incorrect label “dog” for a cat image $x$ makes the output classifier more likely to classify $x$ as a cat compared to the case where it is not given any label for $x$ at all. Hence intuitively, a positive rationality gap corresponds to the training procedure being “irrational” or “inconsistent”: wrong information should be only worse than no information, and we would expect the rationality gap to be zero or close to it. Indeed, the rationality gap is always zero for interpolating classifiers that fit the training data perfectly. Moreover, empirically the rationality gap is often small for SSS algorithms, particularly for the better-performing ones (see middle panels of Figure 2 and Figure 3).

We also show that positive rationality gap corresponds to “leaving performance on the table” by proving the following theorem (see Section 6 for a formal statement and proof):

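[Performance on the table theorem, informal] For every training procedure $T$ and distribution $\mathcal{D}_{\text{test}}$, $\mathcal{D}_{\text{train}} = \mathcal{D}_{\text{test}}^n$, there exists a training procedure $S$ satisfying $\text{Test}_S \ge \text{Test}_T + \text{rationality gap}(T) - \text{robustness gap}(T) - o(1)$.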

One interpretation of Theorem 4 is that we can always reduce the generalization gap to $\text{robustness gap}(T) + \text{memorization gap}(T)$ if we are willing to move from the procedure $T$ to $S$. In essence, if the rationality gap is positive, we could include the test sample in the train set with a random label to increase the test performance. However, this transformation comes at a high computational cost; inference for the classifier produced by $S$ is as expensive as retraining from scratch. Hence, we view Theorem 4 more as a “proof of concept” than as a practical approach for improving performance.
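The informal idea of adding the test sample to the train set under a random label can be sketched as below. This follows only the informal description. The formal construction in Section 6 may differ in its details (for example, in how the sample is inserted), and the names `predict_by_retraining` and `majority_trainer` are hypothetical.

```python
import random


def predict_by_retraining(train_fn, xs, ys, x_test, num_classes, rng):
    """Classify x_test by retraining with x_test added under a random label.

    `train_fn(xs, ys)` is any training procedure returning a classifier.
    Every prediction triggers a full retraining run, which is what makes
    this transformation expensive at inference time.
    """
    random_label = rng.randrange(num_classes)  # chosen without seeing the truth
    f = train_fn(xs + [x_test], ys + [random_label])
    return f(x_test)


def majority_trainer(xs, ys):
    """Toy stand-in for a training procedure: always predict the majority label."""
    majority = max(set(ys), key=ys.count)
    return lambda x: majority


rng = random.Random(0)
prediction = predict_by_retraining(
    majority_trainer, [0, 1, 2, 3, 4], [1, 1, 1, 1, 0], x_test=5,
    num_classes=2, rng=rng,
)
print(prediction)
```

If the underlying procedure does better on wrongly labeled train points than on unseen points, the test point inherits that advantage by being treated as a (probably mislabeled) train point.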

[Why rationality?] Since SSS algorithms use a simple classifier (e.g., linear), the reader may wonder why we cannot directly prove bounds on the generalization gap. The issue is that the representation used by SSS algorithms is still sufficiently over-parameterized to allow memorizing the training set samples. As a pedagogical example, consider a representation-learning procedure that maps a label-free training set $\mathbf{x}$ to a representation $r : \mathcal{X} \to \mathcal{R}$ that has high quality, in the sense that the underlying classes become linearly separable in the representation space. Moreover, suppose that the representation space has dimension much smaller than $n$, and hence a linear classifier would not be able to fit noise, meaning the resulting procedure will have a small memorization gap and small empirical Rademacher complexity. Without access to the labels, we can transform $r$ to a representation $r'$ that on input $x$ will output $r(x)$ if $x$ is in the training set, and output the all-zero vector (or some other trivial value) otherwise. Given sufficiently many parameters, the representation $r'$ (or a close-enough approximation) can be implemented by a neural network. Since $r$ and $r'$ are identical on the training set, the procedure using $r'$ will have the same train accuracy, memorization gap, and empirical Rademacher complexity. However, using $r'$, one cannot achieve better than trivial accuracy on unseen test examples. This does not contradict the RRM bound since this algorithm will be highly irrational.

The memorization gap.

The memorization gap corresponds to the algorithm’s ability to fit the noise (i.e., the gap increases with the number of fit noisy labels). If, for example, the classifier output is interpolating, i.e., it satisfies $f(x_i) = \tilde{y}_i$ for every $i$, then accuracy over the noisy samples will be $0$ (since for them $f(x_i) = \tilde{y}_i \neq y_i$). In contrast, the overall accuracy will be in expectation at least $1 - \eta$, which means that the memorization gap will be $\approx 1$ for small $\eta$. However, we show empirically (see right panels of Figures 2 and 3) that the memorization gap is small for many SSS algorithms and prove a bound on it in Theorem 1. When combined with small rationality and robustness, this bound results in non-vacuous generalization bounds for various real settings (e.g., 48% for ResNet101 with SimCLRv2 on ImageNet, and as low as 4% for MoCo V2 with ResNet-18 on CIFAR-10). Moreover, unlike other generalization bounds, our bound decreases with data augmentation (see Figure 5).

[Memorization vs. Rademacher] The memorization gap, as well as the complexity measures defined in Section 2.1, have a superficial similarity to Rademacher complexity (Bartlett & Mendelson, 2002), in the sense that they quantify the ability of the output classifier to fit noise. One difference is that Rademacher complexity is defined with respect to 100% noise, while we consider the $\eta$-noisy experiment for small $\eta$. A more fundamental difference is that Rademacher complexity is defined via a supremum over all classifiers in some class. In contrast, our measures are defined with respect to a particular training algorithm. As mentioned, Zhang et al. (2017) showed that modern end-to-end supervised learning algorithms can fit 100% of their label noise. This is not the case for SSS algorithms, which can only fit 15%-25% of the CIFAR-10 training set when the labels are completely random (see Table B.1 in the appendix). However, by itself, the inability of an algorithm to fit random noise does not imply that the Rademacher complexity is small, and does not imply a small generalization gap. Indeed, the example of Remark 4 yields an SSS method that has both a small memorization gap and small empirical Rademacher complexity, and yet has a large generalization gap.

5 Empirical study of the RRM bound

In support of our theoretical results, we conduct an extensive empirical study of the three gaps and empirically evaluate the theoretical bound on the memorization gap (from Equation (1)) for a variety of SSS algorithms for the CIFAR-10 and ImageNet datasets. We provide a summary of our setup and findings below. For a full description of the algorithms and hyperparameters, see Appendix B.



Figure 3: Robustness, Rationality and Memorization for ImageNet. Each point represents a different combination of self-supervised learning algorithm (e.g., SimCLR), backbone architecture (e.g., ResNet-50) and simple classifier (e.g., linear classification). Star indicates experiments with 10 augmentations per training sample. Label noise is injected at rate $\eta$. Full experimental details in Section B.

SSS Algorithms ($T_{\text{pre}}, T_{\text{fit}}$). For the first phase of training $T_{\text{pre}}$, we consider various self-supervised training algorithms that learn a representation without explicit training labels. There are two main types of representation learning methods: (1) Contrastive Learning, which finds an embedding by pushing “similar” samples closer, and (2) Pre-text tasks, which hand-craft a supervised task that is independent of downstream tasks, such as predicting the rotation angle of a given image (Gidaris et al., 2018). Our analysis is independent of the type of representation learning method, and we focus on methods that achieve high test accuracy when combined with the simple test phase. The list of methods included in our study is Instance Discrimination (Wu et al., 2018), MoCoV2 (He et al., 2020), SimCLR (Chen et al., 2020a, b), AMDIM (Bachman et al., 2019), CMC (Tian et al., 2019), InfoMin (Tian et al., 2020) as well as adversarial methods such as BigBiGAN (Donahue & Simonyan, 2019).

For the second phase of training (also known as the evaluation phase (Goyal et al., 2019)), we consider simple models such as regularized linear regression, or small Multi-Layer Perceptrons (MLPs). For each evaluation method, we run two experiments: 1) the clean experiment, where we train $T_{\text{fit}}$ on the data and labels $(\mathbf{x}, \mathbf{y})$; 2) the $\eta$-noisy experiment, where we train $T_{\text{fit}}$ on $(\mathbf{x}, \tilde{\mathbf{y}})$ where $\tilde{\mathbf{y}}$ are the $\eta$-noised labels. Unless specified otherwise, we use the same default noise level $\eta$ throughout.
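The bookkeeping for the clean and noisy experiments can be sketched as follows: given predictions from the two runs, the four accuracies that enter the RRM bound are computed against the clean labels. The function names and the tiny hand-made example are hypothetical and are not part of the paper's code.

```python
def accuracy(preds, labels):
    """Fraction of positions where the prediction matches the label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def rrm_accuracies(clean_preds, noisy_preds, y, y_noisy, test_preds, y_test):
    """Compute the four accuracies of the RRM bound.

    clean_preds: train-set predictions of the classifier fit on clean labels.
    noisy_preds: train-set predictions of the classifier fit on noisy labels.
    y, y_noisy:  clean and noisy train labels (same order as the predictions).
    test_preds:  test-set predictions of the classifier fit on clean labels.
    """
    corrupted = [i for i, (t, c) in enumerate(zip(y_noisy, y)) if t != c]
    if not corrupted:
        raise ValueError("no corrupted samples; increase the noise rate")
    return {
        "train": accuracy(clean_preds, y),
        "train_noisy": accuracy(noisy_preds, y),
        "ntrain_noisy": accuracy(
            [noisy_preds[i] for i in corrupted], [y[i] for i in corrupted]
        ),
        "test": accuracy(test_preds, y_test),
    }


# Tiny hand-made example: samples 2 and 4 received a wrong label.
y = [0, 1, 2, 0, 1, 2]
y_noisy = [0, 1, 0, 0, 2, 2]
stats = rrm_accuracies(
    clean_preds=[0, 1, 2, 0, 1, 2],
    noisy_preds=[0, 1, 0, 0, 1, 2],   # fits the wrong label on sample 2 only
    y=y, y_noisy=y_noisy,
    test_preds=[0, 1, 1], y_test=[0, 1, 2],
)
print(stats)
```

Note that the accuracy on corrupted samples is conditioned on the label actually having changed, so samples whose uniform draw happened to match the original label are excluded.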

Adding augmentations. We investigate the effect of data augmentation on the three gaps and the theoretical bound. For each training point, we sample a fixed number of random augmentations and add them to the train set. Note that in the noisy experiment two augmented samples of the same original point might be assigned different labels. We use the same augmentation used in the corresponding self-supervised training phase.

Figure 4: The RRM bound of SSS methods on ImageNet, with models sorted by the generalization gap. We plot the robustness, rationality and memorization gaps. Similar to Figure 1, for most models, the bound is tight and is dominated by the memorization gap. The Theorem 1 bound is marked for the two leftmost models (we did not evaluate it for the others, for computational reasons).

Figure 5: Empirical RRM for the AMDIM SSS model on CIFAR-10 with increasing number of augmentations. While the robustness and memorization gaps decrease, and so does our generalization bound, the rationality gap increases since $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{test}}$ grow apart.

Results. Figures 1 and 2 provide a summary of our experimental results for CIFAR-10. The robustness and rationality gaps are close to zero for most SSS algorithms, while the memorization gap is usually the dominant term, especially for models with a larger generalization gap. Moreover, we see that our complexity measure often produces a reasonably tight bound for the memorization gap, leading to a generalization bound that can be as low as -. In Figures 3 and 4 we give a summary of our experimental results for SSS algorithms on ImageNet. Again, the rationality and robustness gaps are bounded by small constants. Notice that adding augmentations reduces memorization, but may lead to an increase in the rationality gap. This is also demonstrated in Figure 5, where we vary the number of data augmentations systematically for one SSS algorithm (AMDIM) on CIFAR-10. Since computing the Theorem II bound for ImageNet is computationally expensive, we compute it only for two algorithms, which achieve non-vacuous bounds between  -, with room for improvement (see Appendix B.5.1).
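To make the bookkeeping behind these plots concrete, here is a small sketch of how the three gaps and the empirical RRM bound could be computed from four measured accuracies. It assumes that each gap is the positive part of a difference of accuracies measured against the original labels, which is consistent with the Appendix E table, where the RRM bound column equals the sum of the robustness, memorization and rationality columns. The function name and the example numbers are hypothetical.

```python
def rrm_gaps(train_clean, train_noisy, ntrain_noisy, test):
    """Decompose the generalization gap into robustness, rationality, memorization.

    All arguments are accuracies in [0, 1], measured against the ORIGINAL labels:
      train_clean  : train accuracy in the clean experiment
      train_noisy  : accuracy on the whole train set in the noisy experiment
      ntrain_noisy : accuracy on the corrupted train points in the noisy experiment
      test         : test accuracy
    """
    def pos(v):
        return max(v, 0.0)

    robustness = pos(train_clean - train_noisy)
    rationality = pos(ntrain_noisy - test)
    memorization = pos(train_noisy - ntrain_noisy)
    return {
        "generalization_gap": train_clean - test,
        "robustness": robustness,
        "rationality": rationality,
        "memorization": memorization,
        "rrm_bound": robustness + rationality + memorization,
    }

# Hypothetical accuracies, for illustration only.
gaps = rrm_gaps(train_clean=0.95, train_noisy=0.94, ntrain_noisy=0.90, test=0.92)
```

Without the positive parts the three differences telescope exactly to the generalization gap, so with them the sum can only be larger; this is why the RRM bound is an upper bound even when the measured generalization gap is negative.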

6 Positive rationality gap leaves room for improvement

We now prove the “performance on the table theorem” that states that we can always transform a training procedure with a positive rationality gap into a training procedure with better performance:

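[Performance on the table theorem, restated] For every training procedure $T$ and parameters $\mathcal{D}_{\text{test}}$, $n$, $\eta$, if $\mathcal{D}_{\text{train}} = \mathcal{D}_{\text{test}}$ and $T$ has a positive rationality gap with respect to these parameters, then there exists a training procedure $S$ such that

$$\mathrm{Test}_{S,\mathcal{D},n} \ge \mathrm{NTrain}_{T,\mathcal{D},n}(\eta) - o(1),$$

where $o(1)$ is a term that vanishes with $n$, and assuming that $\mathrm{Train}_{T,\mathcal{D},n}(\eta) \ge \mathrm{NTrain}_{T,\mathcal{D},n}(\eta)$.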

The assumption, stated differently, implies that the memorization gap will be positive. We expect this assumption to be true for any reasonable training procedure (see right panel of Figure 2), since performance on noisy train samples will not be better than the overall train accuracy. Indeed, it holds in all the experiments described in Section 5. In particular (since we can always add noise to our data), the above means that if the rationality gap is positive, we can use the above to improve the test performance of “irrational” networks. We now provide a proof for the theorem.


Let $T$ be a procedure with positive rationality gap that we are trying to transform. Our new algorithm $S$ would be the following:

  • Training: On input a training set $D = (\boldsymbol{x}, \boldsymbol{y})$, algorithm $S$ does not perform any computation, but merely stores the dataset $D$. Thus the “representation” of a point $x$ is simply $(x, D)$.

  • Inference: On input a data point $x$ and the original training dataset $D$, algorithm $S$ chooses an index $i \in [n]$ and lets $D'$ be the training set obtained by replacing $(x_i, y_i)$ with $(x, \tilde{y})$, where $\tilde{y}$ is chosen uniformly at random. We then compute $f = T(D')$, and output $f(x)$.

First note that while the number of noisy samples could change by one by replacing $(x_i, y_i)$ with $(x, \tilde{y})$, since this number is distributed according to the Binomial distribution with mean $\eta n$ and standard deviation $\sqrt{(1-\eta)\eta n} \gg 1$, this change can affect probabilities by at most an additive $o(1)$ factor (since the statistical distance between the distributions $\mathrm{Binom}(\eta, n)$ and $\mathrm{Binom}(\eta, n) + 1$ is $o(1)$). If $\mathcal{Y}$ has $k$ classes, then with probability $1 - 1/k$ we will make $(x, \tilde{y})$ noisy ($\tilde{y} \neq y$), in which case the expected performance on it will be $\mathrm{NTrain}_T(\eta)$. With probability $1/k$, we choose the correct label $y$, in which case performance on this sample will be equal to the expected performance on clean samples, which by our assumptions is at least $\mathrm{NTrain}_T(\eta)$ as well. Hence, the accuracy on the new test point is at least $\mathrm{NTrain}_T(\eta) - o(1)$. ∎

We stress that the procedure described above, while running in “polynomial time”, is not particularly practical, since it makes inference as computationally expensive as training. However, it is a proof of concept that irrational networks are, to some extent, “leaving performance on the table”.
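The inference step of the transformed procedure can be sketched in a few lines. This is only an illustration of the construction in the proof: `train_fn` stands for the original training procedure, and `majority_class_trainer` is a deliberately trivial, hypothetical stand-in so that the example is self-contained.

```python
import random
from collections import Counter

def majority_class_trainer(dataset):
    """Toy stand-in for the original procedure: always predict the majority label."""
    majority = Counter(label for _, label in dataset).most_common(1)[0][0]
    return lambda x: majority

def transformed_predict(train_fn, dataset, x_new, num_classes, rng):
    """Inference step of the transformed procedure described in the proof."""
    i = rng.randrange(len(dataset))       # index of the train sample to overwrite
    guess = rng.randrange(num_classes)    # uniformly random label for x_new
    modified = list(dataset)
    modified[i] = (x_new, guess)          # x_new becomes a (likely noisy) train point
    classifier = train_fn(modified)       # full retraining at inference time
    return classifier(x_new)

rng = random.Random(0)
data = [((float(j),), j % 2) for j in range(20)] + [((100.0,), 0)] * 5
prediction = transformed_predict(
    majority_class_trainer, data, (3.0,), num_classes=2, rng=rng
)
```

The call to `train_fn` inside `transformed_predict` is what makes inference as expensive as training, which is the practicality caveat noted above.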

7 Conclusions and open questions

This work demonstrates that SSS algorithms have small generalization gaps. While our focus is on the memorization gap, our work motivates more investigation of both the robustness and rationality gaps. In particular, we are not aware of any rigorous bounds for the rationality gap of SSS algorithms, but we view our “performance on the table” theorem (Theorem 4) as a strong indication that it is close to zero for natural algorithms. Given our empirical studies, we believe the assumptions of small robustness and rationality gaps conform well to practice.

Our numerical bounds are still far from tight, especially for ImageNet, where evaluating the bound (more so with augmentations) is computationally expensive. Nevertheless, we find it striking that already in this initial work, we get non-vacuous (and sometimes quite good) bounds. Furthermore, the fact that the empirical RRM bound is often close to the generalization gap shows that there is significant room for improvement.

Overall, this work can be viewed as additional evidence for the advantages of SSS algorithms over end-to-end supervised learning. Moreover, some (very preliminary) evidence shows that end-to-end supervised learning implicitly separates into representation learning and classification phases (Morcos et al., 2018). Understanding the extent to which supervised learning algorithms implicitly perform SSS learning is an important research direction in its own right. To the extent this holds, our work might shed light on such algorithms’ generalization performance as well.

8 Acknowledgements

We thank Dimitris Kalimeris, Preetum Nakkiran, and Eran Malach for comments on early drafts of this work. This work was supported in part by NSF award CCF 1565264, IIS 1409097, DARPA grant W911NF2010021, and a Simons Investigator Fellowship. We also thank Oracle and Microsoft for grants used for computational resources. Y.B. is partially supported by MIT-IBM Watson AI Lab. Work partially performed while G.K. was an intern at Google Research.


Appendix A Mutual information facts

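If $A, B$ are two Bernoulli random variables with nonzero expectation then,

$$\left| \mathbb{E}[A \mid B = 1] - \mathbb{E}[A] \right| \le \frac{\sqrt{\tfrac{1}{2} I(A;B)}}{\mathbb{E}[B]}.$$

A standard relation between mutual information and KL-divergence gives,

$$I(A;B) = D_{\mathrm{KL}}\left( p_{A,B} \,\|\, p_A p_B \right).$$

On the other hand, by the Pinsker inequality,

$$\sup_{S \subseteq \{0,1\} \times \{0,1\}} \left| p_{A,B}(S) - p_A p_B(S) \right| \le \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}\left( p_{A,B} \,\|\, p_A p_B \right)} = \sqrt{\tfrac{1}{2} I(A;B)}.$$

Thus (letting $S = \{(1,1)\}$),

$$\left| \Pr[A = 1, B = 1] - \Pr[A = 1] \Pr[B = 1] \right| \le \sqrt{\tfrac{1}{2} I(A;B)},$$

and dividing both sides by $\Pr[B = 1] = \mathbb{E}[B]$ gives the claim.
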
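For three random variables $W, X, Y$, s.t. $X$ and $Y$ are independent,

$$I(W; X, Y) \ge I(W; X) + I(W; Y).$$

Using the chain rule for mutual information we have:

$$I(W; X, Y) = I(W; X) + I(W; Y \mid X).$$

Since $X, Y$ are independent, $H(Y \mid X) = H(Y)$, and since conditioning only reduces entropy, we have $H(Y \mid W, X) \le H(Y \mid W)$. Combining the two we get,

$$I(W; Y \mid X) = H(Y \mid X) - H(Y \mid W, X) \ge H(Y) - H(Y \mid W) = I(W; Y).$$

Thus we have that $I(W; X, Y) \ge I(W; X) + I(W; Y)$. ∎

Note that by induction we can extend this argument to show that $I(W; X_1, \ldots, X_n) \ge \sum_{i} I(W; X_i)$ where $X_1, \ldots, X_n$ are mutually independent.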
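Both facts in this appendix can be sanity-checked numerically on small discrete distributions. The sketch below reads them as (i) a Pinsker-type bound on the covariance of two Bernoulli variables in terms of their mutual information, and (ii) superadditivity of mutual information over independent variables; the variable names `W`, `X`, `Y` and all helper functions are illustrative choices, and mutual information is measured in nats.

```python
import itertools
import math
import random

def mutual_information(joint):
    """I(U;V) in nats for a table joint[(u, v)] = P[U = u, V = v]."""
    pu, pv = {}, {}
    for (u, v), p in joint.items():
        pu[u] = pu.get(u, 0.0) + p
        pv[v] = pv.get(v, 0.0) + p
    return sum(p * math.log(p / (pu[u] * pv[v]))
               for (u, v), p in joint.items() if p > 0)

def random_joint(rng):
    """Random law of (W, X, Y): X, Y independent bits, W a noisy function of them."""
    px, py = rng.random(), rng.random()
    joint = {}
    for x, y in itertools.product((0, 1), repeat=2):
        pxy = (px if x else 1 - px) * (py if y else 1 - py)
        pw1 = rng.random()  # P[W = 1 | X = x, Y = y]
        joint[(1, x, y)] = pxy * pw1
        joint[(0, x, y)] = pxy * (1 - pw1)
    return joint

def pair_with(joint, keep):
    """Table of (W, selected coordinates of (X, Y)), marginalising out the rest."""
    out = {}
    for (w, x, y), p in joint.items():
        key = (w, tuple((x, y)[k] for k in keep))
        out[key] = out.get(key, 0.0) + p
    return out

rng = random.Random(0)
violations = 0
for _ in range(1000):
    joint = random_joint(rng)
    wx = pair_with(joint, (0,))
    i_wx = mutual_information(wx)
    i_wy = mutual_information(pair_with(joint, (1,)))
    i_wxy = mutual_information(pair_with(joint, (0, 1)))
    # (ii) Superadditivity for independent X, Y.
    if i_wxy < i_wx + i_wy - 1e-9:
        violations += 1
    # (i) Pinsker-type bound for the Bernoulli pair (A, B) = (W, X).
    p11 = wx[(1, (1,))]
    p_w1 = p11 + wx[(1, (0,))]
    p_x1 = p11 + wx[(0, (1,))]
    if abs(p11 - p_w1 * p_x1) > math.sqrt(max(i_wx, 0.0) / 2) + 1e-6:
        violations += 1
```

An extreme case of (ii) is `W = X xor Y` with fair independent bits: `W` carries no information about either bit alone, yet one full bit about the pair.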

Appendix B Experimental details

We perform an empirical study of the RRM bound for a wide variety of self-supervised training methods on the ImageNet (Deng et al., 2009) and CIFAR-10 (Krizhevsky et al., 2009) training datasets. We provide a brief description of all the self-supervised training methods that appear in our results below. For each method, we use the official pre-trained models on ImageNet wherever available. Since very few methods provide pre-trained models for CIFAR-10, we train models from scratch. The architectures and other training hyper-parameters are summarized in Table E.4 and Table E.3. Since our primary aim is to study the RRM bound, we do not optimize for reaching the state-of-the-art performance in our re-implementations. For the second phase of training, we use L2-regularized linear regression, or small non-interpolating Multi-layer perceptrons (MLPs).

B.1 Self-supervised training methods

There is a variety of self-supervised training methods for learning representations without explicit labels. The two main branches of self-supervised learning methods are:

  1. Contrastive learning: These methods seek to find an embedding of the dataset that pushes a positive pair of images close together and a pair of negative images far from each other. For example, two different augmented versions of the same image may be considered a positive pair, while two different images may be considered a negative pair. Different methods, such as Instance Discrimination, MoCo, SimCLR, and AMDIM, differ in the way they select the positive/negative pairs, as well as in other details like the use of a memory bank or the encoder architecture. (See Falcon & Cho (2020) for a detailed comparison of these methods.)

  2. Handcrafted pretext tasks: These methods learn a representation by designing a fairly general supervised task, and utilizing the penultimate or other intermediate layers of this network as the representation. Pretext tasks include a diverse range of methods such as predicting the rotation angle of an input image (Gidaris et al., 2018), solving jigsaw puzzles (Noroozi & Favaro, 2016), colorization (Zhang et al., 2016), denoising images (Vincent et al., 2008), or image inpainting (Pathak et al., 2016).

Additionally, adversarial image generation can be used for representation learning by augmenting the image generator with an encoder (Donahue & Simonyan, 2019). We focus primarily on contrastive learning methods since they achieve state-of-the-art performance. We now describe these methods briefly.

Instance Discrimination: (Wu et al., 2018) In essence, Instance Discrimination performs supervised learning with each training sample as a separate class. They minimize the non-parametric softmax loss given below for each training sample $x_i$:

$$-\log \frac{\exp(v_i^\top v_i / \tau)}{\sum_{j=1}^{n} \exp(v_j^\top v_i / \tau)},$$

where $v_i$ is the feature vector for the $i$-th example and $\tau$ is a temperature hyperparameter. They use memory banks and a contrastive loss (also known as Noise Contrastive Estimation or NCE (Gutmann & Hyvärinen, 2010)) for computing this loss efficiently for large datasets. So in this case, a positive pair is an image and itself, while a negative pair is two different training images.
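As a rough illustration of the non-parametric softmax, the sketch below evaluates the loss exactly on a toy set of features. It assumes the form used by Wu et al. (2018), in which the "class weights" are the stored features of the training samples themselves; the function name and the toy vectors are hypothetical, and the memory bank and NCE approximation used in practice are omitted.

```python
import math

def instance_softmax_loss(features, i, temperature):
    """Negative log-probability that sample i is recognised as its own class.

    features: list of L2-normalised feature vectors, one per training sample.
    """
    v = features[i]
    logits = [sum(a * b for a, b in zip(vj, v)) / temperature for vj in features]
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_partition = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_partition - logits[i]

feats = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
loss = instance_softmax_loss(feats, i=0, temperature=0.5)
```

The sum in the denominator runs over every training sample, which is why an exact evaluation does not scale and an approximation is needed for large datasets.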

Momentum Contrast (MoCo): (He et al., 2020) MoCo replaces the memory bank in Instance Discrimination with a queue of keys produced by a momentum-updated encoder. MoCoV2 (Chen et al., 2020c) borrows various modifications from SimCLR, like a projection head, and combines them with the MoCo framework for improved performance.
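The momentum mechanism amounts to an exponential moving average of encoder parameters. A minimal sketch, assuming parameters are stored as flat lists of floats (real implementations operate on network tensors) and using illustrative names and values:

```python
def momentum_update(key_params, query_params, m=0.999):
    """Exponential moving average: key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

key = [0.0, 1.0]
query = [1.0, 1.0]
key = momentum_update(key, query, m=0.9)
```

With `m` close to 1 the momentum-updated encoder changes slowly, which keeps the stored keys approximately consistent with one another across training steps.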

AMDIM: (Bachman et al., 2019) AMDIM uses two augmented versions of the same image as positive pairs. For these augmentations, they use random resized crops, random jitters in color space, random horizontal flips and random conversions to grayscale. They apply the NCE loss across multiple scales, by using features from multiple layers. They use a modified ResNet by changing the receptive fields to decrease overlap between positive pairs.

CMC: (Tian et al., 2019) CMC creates two views for contrastive learning by converting each image into the Lab color space. L and ab channels from the same image are considered to be a positive pair, while those from two different images are considered to be a negative pair.

PiRL: (Misra & Maaten, 2020) PiRL first creates a jigsaw transformation of an image (it divides an image into 9 patches and shuffles these patches). It treats an image and its jigsaw as a positive pair, and that of a different image as a negative pair.

SimCLRv1 and SimCLRv2: (Chen et al., 2020a, b) SimCLR also uses strong augmentations to create positive and negative pairs. They use random resized crops, random Gaussian blurring and random jitters in color space. Crucially, they use a projection head that maps the representations to a 128-dimensional space where they apply the contrastive loss. They do not use a memory bank, but use a large batch size.

InfoMin: (Tian et al., 2020) InfoMin uses random resized crops, random color jitters and random Gaussian blurring, as well as jigsaw shuffling from PiRL.

B.2 Simple Classifier

After training the representation learning method, we extract representations for the training and test images. We do not add random augmentations to the training images (unless stated otherwise). Then, we train a simple classifier on the extracted training representations and their labels. We use a linear classifier in most cases, but we also try a small multi-layer perceptron (as long as it has few parameters and does not interpolate the training data). We add weight decay in some methods to achieve good test accuracy (see Table E.4 and Table E.3 for values for each method). For the noisy experiment, we set the noise level to . To compute the complexity bound we run 20 trials (the same experiment with different random seeds) of the noisy experiment for CIFAR-10 and 50 trials for ImageNet.

B.3 Experimental details for each plot

Figure 1. This figure shows the robustness, rationality and memorization gaps for various SSS algorithms trained on CIFAR-10. The type of self-supervised method, the encoder architecture, as well as the training hyperparameters are described in Table E.3. For the second phase, we use L2-regularized linear regression for all the methods. For each algorithm listed in Table E.3, the figure contains 2 points, one without augmentations, and one with augmentations. Further, we compute the complexity measure for all the methods. All the values (along with the test accuracy) are listed in Table E.1.

Figure 2. This figure shows the robustness, rationality and memorization for CIFAR-10 for all the same methods as in Figure 1. We only include the points without augmentation to show how rationality behaves when the train and test distributions are identical. All the values (along with the test accuracy) are listed in Table E.1. In addition, we add three end-to-end fully supervised methods (red circles) to compare and contrast the behavior of each of the gaps for SSS and supervised methods. For the supervised architectures, we train a Myrtle-5 (Page, 2018) convolutional network, a ResNet-18 (He et al., 2016) and a WideResNet-28-10 (Zagoruyko & Komodakis, 2016) with standard hyperparameters.

Figure 3 and Figure 4. These figures show the robustness, rationality and memorization for the ImageNet dataset. The type of self-supervised method, the encoder architecture, as well as the training hyperparameters are described in Table E.4. For the second phase, we use L2-regularized linear regression for all the methods. The figures also contain some points with 10 augmentations per training image. Further, we compute the complexity measure for two methods: SimCLRv2 with architectures ResNet-50-1x and ResNet-101-2x. All the values (along with the test accuracy) are listed in Table E.2.

Figure 5. This figure shows the effect of increasing augmentations. We add augmentations and re-train the simple classifier. We do this for the CIFAR-10 dataset, using AMDIM self-supervised training with the AMDIM encoder and linear regression (see Table E.3 for the hyperparameters).

B.4 Additional Results

B.4.1 Generalization error of SSS algorithms

To show that SSS algorithms have qualitatively different generalization behavior compared to standard end-to-end supervised methods, we repeat the experiment from Zhang et al. (2017). We randomize all the training labels in the CIFAR-10 dataset and train 3 high-performing SSS methods on these noisy labels. For results see Table B.1. Unlike fully supervised methods, SSS algorithms do not achieve 100% training accuracy on the dataset with noisy labels. In fact, their training accuracies are fairly low (-). This suggests that the empirical Rademacher complexity is bounded. The algorithms were trained without any augmentations during the simple fitting phase for both SSS and supervised algorithms. The SSS methods were trained using parameters described in Table E.3.

Group | Training method | Train Acc | Test Acc
Supervised (Zhang et al., 2017) | Inception (no aug) | 100% | 86%
 | (fitting random labels) | 100% | 10%
SSS | SimCLR (ResNet-50) + Linear | 94% | 92%
 | (fitting random labels) | 22% | 10%
 | AMDIM (AMDIM Encoder) + Linear | 94% | 87.4%
 | (fitting random labels) | 18% | 10%
 | MoCoV2 (ResNet-18) + Linear | 69% | 67.6%
 | (fitting random labels) | 15% | 10%

Table B.1: Train and Test performance on 100% label noise for fully supervised vs. SSS algorithms on CIFAR-10. The first row is from Zhang et al. (2017), while the second one is our results for SSS methods averaged over 5 runs without augmentations.

B.5 RRM bound with varying noise parameter

We now investigate the effect of varying noise levels on the three gaps as well as on the complexity. We see that the robustness gap increases as we add more noise; this is expected, as noise should affect the clean training accuracy. We also observe that the memorization gap decreases, suggesting that the overall train accuracy in the noisy experiment, as a function of $\eta$, goes down faster than the accuracy on the noisy train samples (see Section 2.1). The Theorem 1 bound on the memorization gap also decays strongly with $\eta$, becoming tighter as the noise increases.


Figure B.1: RRM + bound with changing $\eta$

B.5.1 Convergence of complexity measures

We now plot (see Figure B.2) the complexity measures $\mathsf{C}^{\mathrm{dc}}$ and $\mathsf{C}^{\mathrm{pc}}$ with increasing number of trials for one of the SSS algorithms. As expected, $\mathsf{C}^{\mathrm{dc}} < \mathsf{C}^{\mathrm{pc}}$, and $\mathsf{C}^{\mathrm{dc}}$ converges in about 20 trials for CIFAR-10. On the other hand, the complexity computations for ImageNet need many more trials for convergence, since it contains about 10 augmentations × 1.2 million training samples, making it cost-prohibitive to compute for all the methods. For CIFAR-10, we use AMDIM with the AMDIM encoder architecture without augmentations. For ImageNet, we use SimCLRv2 with the ResNet-101 architecture with 10 augmentations per training sample.


(a) Theorem II bound with increasing trials. The bound based on $\mathsf{C}^{\mathrm{dc}}$ is lower than the one based on $\mathsf{C}^{\mathrm{pc}}$, as expected, and converges within 20 trials.


(b) Theorem II bound with increasing trials. The bound is slow to converge due to the large dataset size (10 augmentations × 1.2 million training samples).
Figure B.2: Convergence of Theorem II bounds for CIFAR-10 and ImageNet

Appendix C Examples of algorithms with large gaps

While we argued that SSS algorithms will tend to have small robustness, rationality, and memorization gaps, this does not hold in the worst case and there are examples of such algorithms that exhibit large gaps in each of those cases.

C.1 Large robustness gap

A large robustness gap can only arise via computational (as opposed to statistical) considerations. That is, if a training procedure outputs a classifier that achieves on average accuracy $\alpha$ on a clean train set $(\boldsymbol{x}, \boldsymbol{y})$, then with high probability, if $(\boldsymbol{x}, \tilde{\boldsymbol{y}})$ is an $\eta$-noisy train set then there exists a classifier that achieves $\alpha(1-\eta)$ accuracy on this train set (by fitting only the “clean” points).

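However, the training algorithm might not always be able to find such a classifier. For example, if the distribution has the form $(x, y = \langle w, x \rangle \bmod 2)$ where $x \sim \mathrm{GF}(2)^{\ell}$ and $w \in \mathrm{GF}(2)^{\ell}$ is some hidden vector, then there is an efficient algorithm (namely Gaussian elimination) to find $w$ given the samples $(x, y)$ and hence get accuracy $1$. However, for every $\varepsilon > 0$ and $\eta > 0$, there is no known efficient algorithm that, given perturbed equations of the form $\{\langle w, x_i \rangle = \tilde{y}_i\}_{i \in [n]}$, where $\tilde{\boldsymbol{y}}$ is an $\eta$-noisy version of $\boldsymbol{y}$, finds $w'$ such that $\langle w', x \rangle = \langle w, x \rangle \bmod 2$ on a $1/2 + \varepsilon$ fraction of the $x$’s. This is known as the learning parity with noise (LPN) problem (Blum et al., 1993).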
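For the noiseless case, the Gaussian-elimination step can be sketched as follows. Bit vectors are represented as Python integers, and all function and variable names are illustrative assumptions; the sketch returns one solution consistent with the samples, which coincides with the hidden vector whenever the samples span the whole space.

```python
import random

def parity(v):
    """Parity of the number of set bits in the integer v."""
    return bin(v).count("1") % 2

def solve_parity(samples, n_bits):
    """Find s with <s, x> = y (mod 2) for all noiseless samples (x, y).

    Uses elimination over GF(2). Free variables are set to 0. Returns None
    if the system is inconsistent (as happens once labels are noisy).
    """
    pivots = {}  # leading bit -> (row, label)
    for x, y in samples:
        for bit in reversed(range(n_bits)):
            if not (x >> bit) & 1:
                continue
            if bit in pivots:
                px, py = pivots[bit]
                x ^= px  # cancel the leading bit with the stored pivot row
                y ^= py
            else:
                pivots[bit] = (x, y)
                break
        else:
            if y:  # row reduced to the contradiction 0 = 1
                return None
    s = 0
    for bit in sorted(pivots):  # back-substitute from the low bits upwards
        x, y = pivots[bit]
        lower = x & ((1 << bit) - 1)
        if parity(lower & s):
            y ^= 1
        if y:
            s |= 1 << bit
    return s

rng = random.Random(0)
n_bits = 16
secret = rng.getrandbits(n_bits)
xs = [rng.getrandbits(n_bits) for _ in range(64)]
samples = [(x, parity(x & secret)) for x in xs]
recovered = solve_parity(samples, n_bits)
```

Once even a small fraction of the labels is flipped, the same routine would typically hit an inconsistent row and return `None`, which is one way to see why the noisy problem calls for fundamentally different techniques.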

The assumption of robustness is necessary for a small generalization gap, in the sense that we can come up with (contrived) examples of algorithms that have small rationality and memorization gaps while still having a large generalization gap. For example, consider an algorithm $T$ that has a large generalization gap (high train accuracy and small test accuracy), and suppose we augment it to the following algorithm:

$$T'(\boldsymbol{x}, \boldsymbol{y}) = \begin{cases} T(\boldsymbol{x}, \boldsymbol{y}) & \text{if } \boldsymbol{y} \text{ is estimated to be noise-free,} \\ \mathbf{0} & \text{if } \boldsymbol{y} \text{ is estimated to be noisy,} \end{cases}$$

where $\mathbf{0}$ denotes the constant zero function (e.g., some trivial classifier) and we use some algorithm to estimate whether or not the labels are noisy. (Such estimates can often be achieved in many natural cases.) The algorithm $T'$ will inherit the generalization gap of $T$, since that depends only on the experiment without noise. Since performance on noisy and clean training samples will be the same (close to random), $T'$ will have zero memorization gap. Since we have assumed small test accuracy, it will have zero rationality gap also.

C.2 Large rationality gap

As discussed in Section 6, in the case that $\mathcal{D}_{\text{train}} = \mathcal{D}_{\text{test}}$, a robust algorithm with large rationality gap leaves “performance on the table”. We can obtain such algorithms by artificially dropping performance on the test data. For example, in the SSS framework, since the representation is over-parameterized and can memorize the entire train set, we can consider the trivial representation

$$r(x) = \begin{cases} x & \text{if } x \text{ is in the train set,} \\ 0 & \text{otherwise.} \end{cases}$$

If we now train some simple classifier on $r(x)$ then it can have non-trivial performance on the noisy train samples, while getting trivial accuracy on all samples outside the train set.

In cases where $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{test}}$ are different (for example, when $\mathcal{D}_{\text{train}}$ is an augmented version of $\mathcal{D}_{\text{test}}$), we can no longer claim that a large rationality gap corresponds to “leaving performance on the table”. For example, we do observe (mild) growth in the rationality gap as we add more augmented points to the training set.

C.3 Large memorization gap

It is not hard to find examples of networks with large memorization gap. Indeed, as mentioned before, any standard interpolating supervised learning algorithm will get a memorization gap close to .

Appendix D Simple robustness bounds

While robustness is not the focus of this work, we collect here two observations on the robustness of the least-square and minimum risk classifiers. These bounds are arguably folklore, but we state them here for completeness.

D.1 Robustness of least squares classifiers

One can prove robustness for classes of algorithms under varying assumptions. As a simple example, we record here a self-contained observation of how margin leads to robustness in least squares minimization. This is a very simple but also pessimistic bound, and much better ones often hold.

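Let $x_1, \ldots, x_n \in \mathbb{R}^d$ and $y_1, \ldots, y_n \in [k]$, and consider a linear function $f : \mathbb{R}^d \to \mathbb{R}^k$ that minimizes the quantity $\sum_{i \in [n], j \in [k]} \left| f(x_i)_j - 1_{y_i = j} \right|^2$, and suppose that for $p$ fraction of the $i$’s, the maximum over $j \in [k]$ of $f(x_i)_j$ is $\gamma$ larger than the second-largest value.

Then in expectation, if we let $\tilde{\boldsymbol{y}}$ be the $\eta$-noisy version of $\boldsymbol{y}$ and $\tilde{f}$ minimizes $\sum_{i \in [n], j \in [k]} \left| \tilde{f}(x_i)_j - 1_{\tilde{y}_i = j} \right|^2$, we get that $\arg\max_j \tilde{f}(x_i)_j = \arg\max_j f(x_i)_j$ for at least $p - 4\eta/\gamma^2$ fraction of the $i$’s.

We identify $\boldsymbol{y}$ with its “one hot” encoding as a vector in $\mathbb{R}^{nk}$. Let $V \subseteq \mathbb{R}^{nk}$ be the subspace of all vectors of the form $(g(x_1), \ldots, g(x_n))$ for linear $g : \mathbb{R}^d \to \mathbb{R}^k$. If $f$ is the minimizer in the theorem statement, and $\boldsymbol{p} = (f(x_1), \ldots, f(x_n))$, then $\boldsymbol{p} = \Pi_V \boldsymbol{y}$ where $\Pi_V$ is the orthogonal projection to the subspace $V$. If $\tilde{f}$ is the minimizer for the noisy labels and $\tilde{\boldsymbol{p}} = (\tilde{f}(x_1), \ldots, \tilde{f}(x_n))$, then $\tilde{\boldsymbol{p}} = \Pi_V \tilde{\boldsymbol{y}} = \Pi_V (\boldsymbol{y} + \boldsymbol{e})$ where $\boldsymbol{e}$ is the noise vector $\tilde{\boldsymbol{y}} - \boldsymbol{y}$.

Hence $\|\boldsymbol{p} - \tilde{\boldsymbol{p}}\| = \|\Pi_V \boldsymbol{e}\| \le \|\boldsymbol{e}\|$. But in expectation $\|\boldsymbol{e}\|^2 \le 2\eta n$ (since we flip a label with probability at most $\eta$). For every point $i$ for which the margin was at least $\gamma$ in $\boldsymbol{p}$, if $\tilde{\boldsymbol{p}}$’s prediction is different in $i$, then the contribution of the $i$-th block to their square norm difference is at least $\gamma^2/2$ (by shifting the maximum coordinate by $-\gamma/2$ and the second largest one by $\gamma/2$). Hence at most $4\eta n/\gamma^2$ of these points could have different predictions in $\boldsymbol{p}$ and $\tilde{\boldsymbol{p}}$.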

D.2 Robustness of empirical risk minimizer

The (potentially inefficient) algorithm that minimizes the classification errors is always robust.

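Let $T(\boldsymbol{x}, \boldsymbol{y}) = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{n} 1_{f(x_i) \neq y_i}$. Then for every $\eta > 0$,

$$\text{Robustness gap}(T) \le 2\eta.$$

Let $(\boldsymbol{x}, \boldsymbol{y})$ be any train set, and let $\alpha = \min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} 1_{f(x_i) \neq y_i}$ and $f$ be the minimizer of this quantity. Let $\tilde{\boldsymbol{y}}$ be the $\eta$-noisy version of $\boldsymbol{y}$ and let $\tilde{\eta}$ be the fraction of $i$ on which $y_i \neq \tilde{y}_i$. Then,

$$\frac{1}{n} \sum_{i=1}^{n} 1_{f(x_i) \neq \tilde{y}_i} \le \alpha + \tilde{\eta}. \tag{7}$$

Hence if $\tilde{f}$ is the minimizer of the left-hand side of (7) then we know that $\tilde{f}(x_i) \neq \tilde{y}_i$ for at most $\alpha + \tilde{\eta}$ fraction of the $i$’s, and so $\tilde{f}(x_i) \neq y_i$ for at most $\alpha + 2\tilde{\eta}$ fraction of the $i$’s. Since the train accuracy of $T$ is $1 - \alpha$ and the expectation of $\tilde{\eta}$ is at most $\eta$, we get that in expectation

$$\mathrm{Train}_T(\eta) \ge \mathrm{Train}_T - 2\eta.$$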

Appendix E Large Tables

Method Backbone Data Aug Generalization Gap Robustness Memorization Rationality Theorem II bound RRM bound Test Acc
mocov2 resnet18 True -7.35 0.07 0.21 0.00 3.47 0.28 67.19
mocov2 wide_resnet50_2 True -6.37 0.18 1.03 0.00 7.63 1.21 70.99
mocov2 resnet101 True -6.01 0.15 0.71 0.00 6.38 0.86 68.58
mocov2 resnet50 True -5.38 0.19 0.84 0.00 6.99 1.03 69.68
simclr resnet50 True -2.89 0.30 0.55 0.00 6.63 0.85 91.96
amdim resnet101 True -0.91 0.64 3.70 0.00 25.99 4.34 63.56
amdim resnet18 True 0.33 0.23 1.15 0.00 8.66 1.38 62.84
mocov2 resnet18 False 1.43 0.15 1.24 0.03 14.14 1.43 67.60
simclr resnet18 False 1.43 0.28 0.79 0.36 13.35 1.43 82.50
amdim wide_resnet50_2 True 1.60 0.69 2.46 0.00 19.20 3.15 64.38
simclr resnet50 False 1.97 0.22 0.78 0.97 15.75 1.97 92.00
simclr resnet50 False 2.24 0.52 1.71 0.01 19.53 2.24 84.94
mocov2 resnet50 False 2.72 0.30 2.96 0.00 24.18 3.26 70.09
mocov2 resnet101 False 2.82 0.33 3.03 0.00 22.78 3.36 69.08
mocov2 wide_resnet50_2 False 3.11 0.38 2.79 0.00 22.39 3.18 70.84
amdim resnet50_bn True 3.69 0.84 4.22 0.00 31.12 5.06 66.44
amdim resnet18 False 4.34 0.42 4.58 0.00