Convolutional neural networks (CNNs) have achieved remarkable performance in various computer vision tasks (Krizhevsky et al., 2012; Xu et al., 2015; Taigman et al., 2014). In practice, these networks typically have more parameters than training points (i.e., are overparameterized), and are trained with gradient-based methods. Despite non-convexity and the potential problem of overfitting, these algorithms find solutions with low test error. It is still largely unknown why such simple optimization algorithms have outstanding test performance for learning overparameterized convolutional networks.
Recently, there have been major efforts to provide theoretical guarantees for overparameterized neural networks. However, these results do not provide guarantees in practical settings even for very simple learning tasks. Current results either hold for the Neural Tangent Kernel (NTK) regime, where neural network dynamics are approximately linear, or do not provide good generalization guarantees for datasets of practical size. Yet the NTK is not an accurate model of neural networks in practice (Yehudai and Shamir, 2019; Bai and Lee, 2019; Woodworth et al., 2019), and empirically, small datasets suffice for good generalization. The difficulty is that even for very simple tasks the optimization problem is non-convex, and obtaining practical generalization guarantees is a major challenge.
Therefore, to fully understand overparameterized convolutional neural networks it is necessary to first understand simple settings which are amenable to theoretical and empirical analysis. Towards this goal, we analyze a simplified pattern recognition task where all patterns in the images are orthogonal and the classification is binary. We consider learning a 3-layer overparameterized convolutional neural network with stochastic gradient descent (SGD). We take a unique approach that combines both a novel empirical insight and theoretical guarantees which pinpoint the inductive bias of overparameterized convolutional neural networks in our setting and show why SGD generalizes well.
Empirically, we identify a novel property of the solutions found by SGD, in which the statistics of patterns in the training data govern the magnitude of the dot-product between learned pattern detectors and their detected patterns. Specifically, patterns that appear almost exclusively in one of the classes will have a large dot-product with the channels that detect them. On the other hand, patterns that appear roughly equally in both classes will have a low dot-product with their detecting channels. We formally define this “Pattern Statistics Inductive Bias” condition (PSI) and provide empirical evidence that PSI holds across a large number of settings. We also prove that SGD indeed satisfies PSI in a simple setup of two points in the training set.
Under the assumption that PSI holds, we analyze the sample complexity of SGD and prove that it is at most , where is the filter dimension. Importantly, the sample complexity is independent of the number of hidden units in the network. In contrast, we show that the VC dimension of the class of functions we consider is exponential in , and thus there exist other learning algorithms (not SGD) that will have exponential sample complexity. Together, these results provide firm evidence that even though SGD can in principle overfit, it is nonetheless biased towards solutions which are determined by the statistics of the patterns in the training set and consequently it has very good generalization performance. Our results suggest that PSI is a fundamental property of gradient descent and we believe that it can pave the way for analyzing and understanding other settings of overparameterized CNNs.
2 Related Work
Several recent works have studied the inductive bias of gradient-based methods for learning CNNs. Numerous works show that under simplified assumptions, e.g., linearly separable data or linear networks, gradient methods are biased towards low norm or low rank solutions (Ji and Telgarsky, 2019, 2018; Soudry et al., 2018; Brutzkus et al., 2018; Arora et al., 2019a; Nacson et al., 2019; Lyu and Li, 2019; Wei et al., 2019). Other works study the inductive bias via the NTK approximation (Du et al., 2019, 2018c; Arora et al., 2019b; Fiat et al., 2019). We present a novel view of the inductive bias of SGD that is complementary to these approaches.
Yu et al. (2019) study a pattern classification problem similar to ours. However, their analysis holds for an unbounded hinge loss which is not used in practice. Furthermore, their sample complexity depends on the network size, and thus does not explain why large networks do not overfit. Other works have studied learning under certain ground truth distributions. For example, Brutzkus and Globerson (2019) study a simple extension of the XOR problem, showing that overparameterized CNNs generalize better than smaller CNNs. Single-channel CNNs are analyzed in (Du et al., 2018b, a; Brutzkus and Globerson, 2017).
3 The Orthogonal Patterns Problem
Data Generating Distribution:
We consider a learning problem that captures a key property of visual classification. Many visual classes are characterized by the existence of certain patterns in the data. For example, an 8 will typically contain an x-like pattern somewhere in the image. Here we consider an abstraction of this behavior where images consist of a set of patterns. Furthermore, each class is characterized by a pattern that appears exclusively in it (see Fig. 1). We define this formally below.
Let be a set of orthogonal vectors in, where . For simplicity, we assume that for all . We consider input vectors with patterns of dimension . Formally, where is the th pattern of and . For a pattern we denote if contains the pattern . (Footnote 1: We say that contains if there exists such that .) Let . We define , where the union is disjoint, is the set of positive patterns, the set of negative patterns and is the set of mutual patterns.
For simplicity, in this work we consider the case where . We denote, , and . For convenience, we also refer to a set of patterns as a set of the indices of the patterns. For example, we denote if .
We consider distributions over with the following properties:
Given , a vector is sampled as follows. Choose the positive pattern and randomly choose a set of patterns from . Denote this set of chosen patterns by . Set be some such that , where the location of each pattern in is chosen arbitrarily. (Footnote 2: We will see that the order of the patterns does not matter, because the convolutional network is invariant to this order.) For example, if and the patterns are this can result in samples such as or .
Similarly given , choose the negative pattern and continue as for the positive pattern above.
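The sampling procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: patterns are one-hot vectors (as in the experiments of Appendix A), index 0 is taken to be the positive pattern, index 1 the negative pattern, all remaining indices are mutual patterns, and the two labels are drawn with equal probability. The names `sample_point`, `num_patterns`, and `num_extra` are hypothetical.

```python
import random

def one_hot(index, dim):
    """Return the index-th standard basis vector of length dim."""
    return [1.0 if k == index else 0.0 for k in range(dim)]

def sample_point(num_patterns, num_extra, rng):
    """Draw (x, y, indices): x is a list of one-hot patterns, y is +1 or -1.

    Assumed index convention (hypothetical): 0 = positive pattern,
    1 = negative pattern, 2..num_patterns-1 = mutual patterns.
    """
    y = rng.choice([1, -1])  # assumption: balanced labels
    class_pattern = 0 if y == 1 else 1
    # Choose a random set of mutual patterns, without replacement.
    mutual = rng.sample(range(2, num_patterns), num_extra)
    indices = [class_pattern] + mutual
    rng.shuffle(indices)  # the location of each pattern is arbitrary
    x = [one_hot(i, num_patterns) for i in indices]
    return x, y, indices

rng = random.Random(0)
x, y, indices = sample_point(num_patterns=6, num_extra=3, rng=rng)
```

Each sample contains exactly one class-defining pattern, so the label can be read off from whether index 0 or index 1 is present.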
We are interested in neural architectures that can learn pattern detection problems such as the one above. A natural model in this context is a 3-layer network with a convolutional layer, followed by ReLU, max pooling and a fully-connected layer. Each channel in the first layer can be thought of as a detector for a given pattern. We say that a detector detects pattern, if has the largest dot product with the detector among all patterns in and this dot product is positive. For simplicity we fix the weights on the last linear layer to values .
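The notion of a detector detecting a pattern can be made concrete with a short Python sketch. It follows the verbal definition above (largest dot product among all patterns, and that dot product is positive); the function names are hypothetical, and breaking ties by the lowest index is an assumption of this sketch.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def detected_pattern(detector, patterns):
    """Return the index of the pattern detected by `detector`, or None.

    A pattern is detected if it has the largest dot product with the
    detector among all patterns and that dot product is positive.
    Ties are broken by the lowest index (an assumption of this sketch).
    """
    scores = [dot(detector, p) for p in patterns]
    best = max(range(len(patterns)), key=lambda i: scores[i])
    return best if scores[best] > 0 else None

patterns = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
hit = detected_pattern([0.2, 0.9, -0.1], patterns)     # index 1
miss = detected_pattern([-1.0, -2.0, -3.0], patterns)  # no positive score
```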
Let denote the number of channels. We partition the channels into two sets: and . These will have weights of and in the output respectively. Finally, let be the weight matrix whose rows are followed by .
For an input where , the output of the network, denoted by is given by:
where is the ReLU activation applied element-wise. Let denote the class of all networks in Eq. 1, with any value of . Finally, we note that can perfectly fit the distribution above, by setting , and . Therefore, for the network is overparameterized.
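The architecture described in this section (convolution over patterns, ReLU, max pooling, and a last layer with fixed weights of opposite sign for the two channel groups) can be sketched as below. This is a hedged reading of the verbal description, not a transcription of Eq. 1: the names `pos_filters` and `neg_filters` are hypothetical, and last-layer weights of +1 and -1 are assumed.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def relu(z):
    """ReLU activation."""
    return max(z, 0.0)

def network_output(pos_filters, neg_filters, x):
    """Forward pass: x is a list of patterns (one per image location).

    Each channel applies its filter to every pattern, then ReLU, then
    max pooling over locations. Positive-group channels enter the output
    with weight +1 and negative-group channels with weight -1 (assumed).
    """
    def pooled(w):
        return max(relu(dot(w, patch)) for patch in x)
    return (sum(pooled(w) for w in pos_filters)
            - sum(pooled(u) for u in neg_filters))

# One-hot patterns in dimension 4: index 0 positive, 1 negative, 2-3 mutual.
e = [[1.0 if k == i else 0.0 for k in range(4)] for i in range(4)]
out_pos = network_output([e[0]], [e[1]], [e[2], e[0], e[3]])
out_neg = network_output([e[0]], [e[1]], [e[3], e[1], e[2]])
```

Because max pooling discards location, permuting the patterns inside `x` leaves the output unchanged, which is the order invariance used later in the VC upper bound.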
Let be a training set with IID samples from . We consider minimizing the hinge loss:
For optimization, we use SGD with constant learning rate . The parameters are initialized as IID Gaussians with zero mean and standard deviation . Let be the weight matrix at iteration of SGD. Similarly let be the corresponding vectors at iteration .
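For concreteness, the initialization and objective can be sketched as follows. This is an illustrative sketch only: the hinge loss is written in its common form with margin 1 and averaged over the training set, which is an assumption here, and the names `init_filters`, `hinge_loss`, and `sigma` are hypothetical.

```python
import random

def init_filters(num_channels, dim, sigma, rng):
    """IID zero-mean Gaussian initialization with standard deviation sigma."""
    return [[rng.gauss(0.0, sigma) for _ in range(dim)]
            for _ in range(num_channels)]

def hinge_loss(outputs, labels, margin=1.0):
    """Average hinge loss max(0, margin - y * f(x)) over the training set.

    margin=1.0 is the common convention and an assumption of this sketch.
    """
    return sum(max(0.0, margin - y * f)
               for f, y in zip(outputs, labels)) / len(labels)

filters = init_filters(num_channels=3, dim=4, sigma=0.01, rng=random.Random(0))
loss = hinge_loss(outputs=[2.0, -0.5], labels=[1, -1])
```

In the usage line, the first point is classified with margin above 1 and contributes zero loss, while the second is correctly classified but inside the margin and contributes 0.5.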
Correlation Between Patterns and their Detectors:
We are interested in how different patterns affect the output of the network. Towards this end, we simplify the output of the network as follows. For any positive point :
Similarly, for any negative point :
Empirical Pattern Bias:
In a given training set, patterns will appear in both positive and negative labels. The following measure captures how well-balanced are the patterns between the labels. For any pattern , define the following statistic of the training set:
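To illustrate what a balance measure of this kind computes, the sketch below uses one hypothetical choice: the absolute value of the label-signed frequency of a pattern in the training set. This particular form is an assumption made for illustration and should not be read as the paper's definition. It is zero when a pattern appears equally often under both labels and grows as the pattern concentrates in one class.

```python
def pattern_balance(samples, pattern):
    """Hypothetical balance statistic for `pattern`.

    samples: list of (set_of_pattern_indices, label) with label in {+1, -1}.
    Returns |sum of labels of samples containing the pattern| / n.
    This form is an assumption of the sketch, not the paper's equation.
    """
    n = len(samples)
    signed_count = sum(y for pats, y in samples if pattern in pats)
    return abs(signed_count) / n

# Index 0 appears only in positive points; indices 2 and 3 are balanced.
samples = [({0, 2}, 1), ({1, 2}, -1), ({0, 3}, 1), ({1, 3}, -1)]
balance_class = pattern_balance(samples, 0)
balance_mutual = pattern_balance(samples, 2)
```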
4 Pattern Statistics Inductive Bias
The inductive bias of a learning algorithm refers to how the algorithm chooses among all models that fit the data equally well. For example, an SVM algorithm has an inductive bias towards low norm. Indeed, as noted in Zhang et al. (2017), understanding the success of deep learning requires understanding the inductive bias of the learning algorithms used to learn networks, and in particular SGD.
In what follows, we define a certain inductive bias of an algorithm, which we refer to as the Patterns Statistics Inductive Bias (PSI) property. The PSI property states a simple relation between the relative frequency of patterns (see Eq. 9) and the dot product between detectors and their detected pattern, namely the quantities in Eq. 4 and Eq. 7. We begin by providing the formal definition of PSI, and then provide further intuition. For the definition, we let be any learning algorithm which given a training set returns a network as in Eq. 1.
We say that a learning algorithm satisfies the Patterns Statistics Inductive Bias condition with constants ((,,)-PSI) if the following holds. For any , with probability at least over the randomization of and training set of size , satisfies the following (Footnote 3: We chose for simplicity; it can be replaced by any constant.):
For all :
For all :
We next provide some intuition as to why SGD updates may lead to PSI (in Sec. 8 we provide a proof of this for a restricted setting).
We will consider updates made by gradient descent (full batch SGD). Let be all points such that . Define similarly. Define to be the set with weight vectors instead of . Similarly, define , and . Throughout the discussion below, we assume that these sets have roughly the same size. This holds with high probability at initialization for a sufficiently large network. Furthermore, in Section 8, we show that it holds during training in the case of two training points.
Given two pattern indices and any the gradient update is given by:
Similarly, for any :
First assume that in Eq. 10. Since appears in all positive points and does not appear in negative points, at iteration , increases. At iterations it will still hold that , and therefore will continue to increase. Thus, we should expect that should be large. Note also that is a large positive number.
Now assume that in Eq. 11, and appears by chance only in negative points in . In this case, . Therefore, by Eq. 11 we should expect that now is large and has approximately the same value as (under our assumption that the sets in Eq. 3 and Eq. 3 have roughly the same size). Specifically, we see that a large value of predicts a large value of .
On the other hand, if appears in roughly an equal number of positive and negative points, i.e., , then we should expect to be low. To see this, consider the gradient update in Eq. 11 and consider the corresponding detector for the pattern . Negative points with the pattern increase its norm at iteration , while positive points decrease it. Since there is roughly an equal number of positive points and negative points with pattern in , we should expect to be low. This intuition is not exact because there are the terms , which may be zero for some of the positive or negative points. We note that this is what hinders a theoretical analysis, because is a random variable which is very hard to analyze. Nevertheless, we still see here that negative points will increase the norm and positive points will decrease the norm. Thus, should be smaller compared to , which is always increasing. Importantly, we see here that a low value of predicts a low value of .
Given the intuitions above, one possible conjecture is that the ratio is bounded by an affine function of , which leads to our PSI measure in Definition 4.1. The bias term in the affine function takes into account that our intuition above is not exact. It is reasonable to assume that the bias in the affine function decreases with , because the gradient updates are scaled by . Therefore, any additive error in the weights trajectory is scaled by .
5 VC Dimension Bounds and Relation to PSI
In Section 3 we defined a classification problem and a neural architecture. In this section we show that this neural architecture is highly expressive, and can thus potentially overfit and generalize poorly. However, as we show later in Sec. 6, if one restricts the class to models with the PSI property, overfitting is avoided.
In Section 5.1, we prove upper and lower bounds on the VC dimension. One particularly interesting consequence of the lower bound proof is that the networks constructed there for shattering a set do not satisfy the PSI condition. We show this in Sec. 5.2.
5.1 VC Bounds
First, we give a simple upper bound on the VC dimension. Without considering the order of the patterns in the images, there are at most input points in . Since the network in Eq. 1 is invariant to the order of the patterns in an image, this implies:
Standard VC bounds thus imply that sample complexity for learning in is upper bounded by . Note that this is true regardless of the number of channels .
The lower bound below is more challenging, and reveals interesting connections to the PSI property.
Assume that and , then .
We will construct a set of size that can be shattered. For a given let be its th entry. For any such , define a point such that for any ,
Furthermore, arbitrarily choose or . Define . Note that is a valid set of points according to our distribution definition in Section 3.
Now, assume that each point has label . We will show that there is a network such that for all . For each , define:
and , where is the unique solution of the following linear system with equations. For each the system has the following equation:
where for any , is defined such that for all . There is a unique solution because the corresponding matrix of the linear system is the difference between an all 1’s matrix and the identity matrix. By the Sherman-Morrison formula (Sherman and Morrison, 1950), this matrix is invertible.
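The invertibility claim can be checked directly. Writing J for the m-by-m all-ones matrix and applying the Sherman-Morrison formula to the rank-one update of -I gives the closed form (J - I)^{-1} = J/(m - 1) - I for every m >= 2. The sketch below verifies this with exact rational arithmetic; the symbol m and the function names are notation introduced only for this illustration.

```python
from fractions import Fraction

def ones_minus_identity(m):
    """The matrix J - I: zeros on the diagonal, ones elsewhere."""
    return [[Fraction(0) if i == j else Fraction(1) for j in range(m)]
            for i in range(m)]

def closed_form_inverse(m):
    """Candidate inverse J/(m-1) - I from Sherman-Morrison (needs m >= 2)."""
    off = Fraction(1, m - 1)
    return [[off - 1 if i == j else off for j in range(m)]
            for i in range(m)]

def matmul(a, b):
    """Plain matrix product of two square matrices of equal size."""
    m = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]

def identity(m):
    return [[Fraction(1) if i == j else Fraction(0) for j in range(m)]
            for i in range(m)]

# The product equals the identity for every size checked.
checks = [matmul(ones_minus_identity(m), closed_form_inverse(m)) == identity(m)
          for m in range(2, 7)]
```

The formula breaks down only at m = 1, where J - I is the zero matrix.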
Now, recall that the network output is (see Eq. 1)
Then, for any we have that:
by the definition of , the orthogonality of the patterns, and Eq. 13. We have shown that any labeling can be achieved, and hence the set is shattered, completing the proof. ∎
5.2 Relation to PSI
The proof of Theorem 5.2 shows that there are exponentially large training sets that can be exactly fit with . This fact can be used to show a lower bound on sample complexity that is exponential in for general ERM algorithms (Anthony and Bartlett, 2009). The networks that fit these datasets are those defined by . It is easy to see that these networks do not satisfy the PSI property. To see this, note that , which implies that the left-hand sides of parts 1 and 2 in Definition 4.1 are infinite. Therefore, PSI is not satisfied for these networks.
The networks in Theorem 5.2 classify points based on the patterns , and not on the patterns which determine the class. Networks that satisfy PSI are essentially the opposite. Namely, they classify a point mostly based on detectors for the patterns and and thus generalize well, as we show in the next section.
6 PSI Implies Good Generalization
In the previous section we showed that a general ERM algorithm for the class may need exponentially many training samples to get low test error. Here we show that any algorithm satisfying the PSI condition (see Definition 4.1) will have polynomial sample complexity, when patterns in are unbiased (i.e., for ). Specifically, we show that such an algorithm will have zero test error w.h.p., given only training samples.
Assume that satisfies the conditions in Section 3 and for all . Let be a learning algorithm which satisfies (,,)-PSI with . Then, if , with probability at least , has test error with respect to . (Footnote 4: We note that the may be improved to an arbitrary if we scale by .)
Note that is an average of IID binary variables, and for these have expected value zero because . Thus, by Hoeffding’s inequality we have for all that:
Therefore, by a union bound over all patterns (recall ), with probability at least , for all :
Next we consider (the positive pattern), for which (because it only appears in the positive examples, and the prior over is ). Hoeffding’s bound and the definition of imply that with probability at least . (Footnote 5: In fact we can have exponential dependence here, but we use to simplify later expressions.)
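The concentration step used here follows a standard recipe: Hoeffding's inequality for one pattern, then a union bound over all patterns. The sketch below computes the generic two-sided Hoeffding tail for an average of n independent variables taking values in an interval of a given width, and the smallest n for which the union bound over a number of events stays below a target failure probability. The constants in the paper's own displayed bounds may differ; `width=2.0` (values in [-1, 1]) and all function names are assumptions of this sketch.

```python
import math

def hoeffding_tail(n, t, width=2.0):
    """Two-sided Hoeffding bound on P(|mean - E[mean]| >= t).

    n independent variables, each in an interval of length `width`:
    the bound is 2 * exp(-2 * n * t**2 / width**2).
    """
    return 2.0 * math.exp(-2.0 * n * t * t / (width * width))

def samples_for_union_bound(num_events, t, delta, width=2.0):
    """Smallest n with num_events * hoeffding_tail(n, t, width) <= delta."""
    return math.ceil(width * width * math.log(2.0 * num_events / delta)
                     / (2.0 * t * t))

# Example: 10 patterns, deviation 0.1, total failure probability 0.05.
n_needed = samples_for_union_bound(num_events=10, t=0.1, delta=0.05)
```

The required sample size grows only logarithmically in the number of patterns, which is why the union bound over all patterns is cheap.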
We can now apply a union bound over all patterns and the PSI condition to obtain that with probability at least we have by the PSI property and Eq. 14, for all :
Furthermore, by PSI we have .
Therefore, for any positive point we have:
7 SGD satisfies PSI - Empirical Analysis
Thus far we have established that the PSI property implies good generalization. Here we provide empirical evidence that SGD indeed learns such models with overparameterized CNNs. In Section 7.1, we empirically validate the PSI condition. In Section 7.2, we provide a qualitative analysis that confirms that the statistics of the patterns in the training set correlate with the dot product between a detector and its detected pattern. Details of the experiments are provided in the supplementary.
7.1 Empirical Validation of PSI
Given or , the distribution denoted by selects patterns uniformly at random without replacement from .
Assume that . Given or , for each the distribution selects one pattern from uniformly at random.
Both and satisfy for all . Thus, given Theorem 6.1 if we can show the PSI condition holds, good generalization will be implied.
The support of is the shattered set in the proof of Theorem 5.2. The proof implies that for any sampled training and test sets which are subsets of , there exists a network with 0 training error and arbitrarily high test error. Therefore, by optimizing the training error, SGD can converge to these solutions. However, as we show next, SGD does not converge to these solutions, but rather it satisfies PSI and converges to solutions with good generalization performance.
To empirically validate PSI and show that it implies good generalization, we could in principle show that the conditions of Theorem 6.1 hold empirically, i.e., there exist , and such that and PSI holds with constants , and high probability . However, as with most generalization results, the bound is not exact up to constants and using its numerical value results in large which cannot be empirically tested.
Instead, we show that for empirically large , PSI holds with small constants and which do not change the order of magnitude of the bound, namely, . Indeed, we will show empirically that across a large number of experiments.
We trained a neural network in our setting with SGD as described in Section 3. We performed more than 1000 experiments with parameter values , , for and for . For each distribution or , we performed 10 experiments for each set of values for , , and . For each experiment, we set and empirically calculated the lowest constant which satisfies the PSI definition, which we denote by . Formally,
Figure 1(a) shows that across all experiments the value of is less than 1.
To further validate the PSI condition, we tested whether the conditions in the proof of Theorem 6.1 empirically hold. Specifically, in the proof we showed that for all (in Eq. 15). We checked this for all settings of and largest possible and , and . In all of our experiments, SGD converged to a solution with 0 test error such that Eq. 15 holds for all .
7.2 Qualitative Analysis of Inductive Bias
The intuition we described in Section 4 suggests that there is a positive correlation between and . To test this, we experimented with a distribution which can vary the probability of a mutual pattern to be selected and thus can control . Given it selects with probability or with probability . Then it selects the remaining patterns from uniformly at random without replacement. Similarly, given it selects with probability or with probability . The remaining patterns are selected uniformly without replacement from . We experimented with various and plotted for each solution of SGD, and for all . Figure 3 clearly shows a positive correlation between these quantities, strongly suggesting that SGD results in large dot products between the detectors and detected patterns that are biased towards a certain class.
8 SGD satisfies PSI - Optimization Analysis
In this section we show that PSI holds for a simple setup of two points in the training set. We assume that the training set consists of two points and denote . We further assume that and have exactly the same patterns in . Note that in this case we have and for all . We analyze gradient descent and assume that it runs with a constant learning rate . We denote if is sufficiently small compared to , e.g., .
The following theorem shows that PSI holds with constants and . The proof analyzes the trajectory of gradient descent and is provided in the supplementary.
For a sufficiently small , , such that and , with probability at least , gradient descent converges to a global minimum with parameters after iterations and the following holds:
For all , we have:
For all :
Therefore, the PSI condition is satisfied with and .
Notice that the theorem holds for overparameterized networks (sufficiently large ), which coincides with our empirical findings in the previous section. Finally, we note that the theorem holds for sufficiently small initialization, and thus it is not in the same regime of NTK analysis where initialization is relatively large (Woodworth et al., 2019; Chizat et al., 2019).
Understanding the inductive bias of gradient methods for deep learning is an important challenge. In this paper, we identify a new form of inductive bias for CNNs and provide theoretical and empirical support that SGD exhibits this bias and consequently has good generalization performance.
We use a unique approach of combining novel empirical observations with theoretical guarantees to make headway in a challenging setting of non-linear overparameterized CNNs. We believe that this can pave the way for studying inductive bias in other difficult settings. Extending the PSI notion to other neural architectures and distributions is an interesting direction for future work.
This research is supported by the Blavatnik Computer Science Research Fund and by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant ERC HOLI 819080). We thank Roi Livni for helpful discussions.
- Anthony and Bartlett (2009). Neural network learning: theoretical foundations. Cambridge University Press.
- Arora et al. (2019a). Implicit regularization in deep matrix factorization. In Advances in Neural Information Processing Systems, pp. 7411–7422.
- Arora et al. (2019b). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pp. 322–332.
- Bai and Lee (2019). Beyond linearization: on quadratic and higher-order approximation of wide neural networks. arXiv preprint arXiv:1910.01619.
- Brutzkus et al. (2018). SGD learns over-parameterized networks that provably generalize on linearly separable data. International Conference on Learning Representations.
- Brutzkus and Globerson (2017). Globally optimal gradient descent for a ConvNet with Gaussian inputs. In International Conference on Machine Learning, pp. 605–614.
- Brutzkus and Globerson (2019). Why do larger models generalize better? A theoretical perspective via the XOR problem. In International Conference on Machine Learning, pp. 822–830.
- Chizat et al. (2019). On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, pp. 2933–2943.
- Du et al. (2019). Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pp. 1675–1685.
- Du et al. (2018a). Gradient descent learns one-hidden-layer CNN: don’t be afraid of spurious local minima. In International Conference on Machine Learning, pp. 1339–1348.
- Du et al. (2018b). When is a convolutional filter easy to learn? International Conference on Learning Representations.
- Du et al. (2018c). Gradient descent provably optimizes over-parameterized neural networks. International Conference on Learning Representations.
- Fiat et al. (2019). Decoupling gating from linearity. arXiv preprint arXiv:1906.05032.
- Ji and Telgarsky (2018). Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300.
- Ji and Telgarsky (2019). Gradient descent aligns the layers of deep linear networks. International Conference on Learning Representations.
- Krizhevsky et al. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- Lyu and Li (2019). Gradient descent maximizes the margin of homogeneous neural networks. arXiv preprint arXiv:1906.05890.
- Nacson et al. (2019). Lexicographic and depth-sensitive margins in homogeneous and non-homogeneous deep models. In International Conference on Machine Learning, pp. 4683–4692.
- Sherman and Morrison (1950). Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics 21 (1), pp. 124–127.
- Soudry et al. (2018). The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research 19 (1), pp. 2822–2878.
- Taigman et al. (2014). DeepFace: closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708.
- Wei et al. (2019). Regularization matters: generalization and optimization of neural nets vs their induced kernel. In Advances in Neural Information Processing Systems, pp. 9709–9721.
- Woodworth et al. (2019). Kernel and deep regimes in overparametrized models. arXiv preprint arXiv:1906.05827.
- Xu et al. (2015). Show, attend and tell: neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048–2057.
- Yehudai and Shamir (2019). On the power and limitations of random features for understanding neural networks. In Advances in Neural Information Processing Systems, pp. 6594–6604.
- Yu et al. (2019). On the learning dynamics of two-layer nonlinear convolutional neural networks. arXiv preprint arXiv:1905.10157.
- Zhang et al. (2017). Understanding deep learning requires rethinking generalization. International Conference on Learning Representations.
Appendix A Experimental Details in Section 7
Here we provide details of the experiments performed in Section 7. All experiments were run on NVIDIA Titan Xp GPUs with 12GB of memory. Training algorithms were implemented in TensorFlow. All of the empirical results can be replicated in approximately 150 hours on a single NVIDIA Titan Xp GPU.
A.1 Figure 1(a) Experiment
We performed more than 1000 experiments with the network in Eq. 1 and SGD. We experimented with parameter values , , for and for . For each distribution or , we performed 10 experiments for each set of values for , , and . For each set of values we plot the mean of the 10 experiments and standard deviation error bars in shaded regions. In each one of the 10 experiments we randomly sampled the training and test sets according to the given distribution or and randomly sampled the initialization of the network. We used a test set of size 1000. All orthogonal patterns were one-hot vectors. We trained only the weights of the first convolutional layer. We used a batch size of if and batch size of for . The learning rate was set to and to . SGD returned its solution either after 50000 epochs or earlier, once there was an epoch where the training loss was less than . For each experiment, we set and empirically calculated .
A.2 Figure 1(b) Experiment
In the same setup of Section A.1 (i.e., batch size, stopping criteria, learning rate etc.), we performed experiments with distribution , , , and .
A.3 Figure 1(c) Experiment
In the same setup of Section A.1, we performed experiments with distribution , , , and .
A.4 Figure 3 Experiment
In the setup of Section A.1 we experimented with distributions for values in
We experimented with values , , and . The solution SGD returned was either after 2000 epochs or if there was an epoch where the training loss was less than .
Appendix B Proof of Theorem 8.1
Here we define additional notations that will be useful for the proof of the theorem.
Finally we define to be any polynomial function of .
B.2 Auxiliary Lemmas
We now prove several technical lemmas. In Section B.3 we use the lemmas to prove the theorem. In the next 3 lemmas we provide high probability bounds on sizes of certain sets that are functions of the sets in Eq. B.1 and Eq. B.1.
For any , with probability at least for any :
It suffices to show that for any :
For each it holds that with probability . Therefore by Hoeffding’s inequality, with probability at least ,
The same argument applies for , a union bound and setting concludes the proof. ∎