1 Introduction
Central to any form of learning is an inductive bias that induces some sort of capacity control (i.e. restricts or encourages predictors to be “simple” in some way), which in turn allows for generalization. The success of learning then depends on how well the inductive bias captures reality (i.e. how expressive is the hypothesis class of “simple” predictors) relative to the capacity induced, as well as on the computational complexity of fitting a “simple” predictor to the training data.
Let us consider learning with feedforward networks from this perspective. If we search for the weights minimizing the training error, we are essentially considering the hypothesis class of predictors representable with different weight vectors, typically for some fixed architecture. Capacity is then controlled by the size (number of weights) of the network (footnote 1). Our justification for using such networks is then that many interesting and realistic functions can be represented by not-too-large (and hence bounded capacity) feedforward networks. Indeed, in many cases we can show how specific architectures can capture desired behaviors. More broadly, any time $T$ computable function can be captured by an $O(T^2)$ sized network, and so the expressive power of such networks is indeed great (Sipser, 2006, Theorem 9.25).
Footnote 1: The exact correspondence depends on the activation function: for hard thresholding activation the pseudo-dimension, and hence sample complexity, scales as $O(S \log S)$, where $S$ is the number of weights in the network. With sigmoidal activation it is between $\Omega(S^2)$ and $O(S^4)$ (Anthony & Bartlett, 1999).
At the same time, we also know that learning even moderately sized networks is computationally intractable: not only is it NP-hard to minimize the empirical error, even with only three hidden units, but it is hard to learn small feedforward networks using any learning method (subject to cryptographic assumptions). That is, even for binary classification using a network with a single hidden layer and a logarithmic (in the input size) number of hidden units, and even if we know the true targets are exactly captured by such a small network, there is likely no efficient algorithm that can ensure error better than 1/2 (Sherstov, 2006; Daniely et al., 2014). This holds if the algorithm tries to fit such a network, if it tries to fit a much larger network, and in fact no matter how the algorithm represents predictors (see the Appendix). And so, merely knowing that some not-too-large architecture is excellent in expressing reality does not explain why we are able to learn using it, nor using an even larger network. Why is it then that we succeed in learning using multilayer feedforward networks? Can we identify a property that makes them possible to learn? An alternative inductive bias?
Here, we make our first steps at shedding light on this question by going back to our understanding of network size as the capacity control at play.
Our main observation, based on empirical experimentation with single-hidden-layer networks of increasing size (increasing number of hidden units), is that size does not behave as a capacity control parameter, and in fact there must be some other, implicit, capacity control at play. We suggest that this hidden capacity control might be the real inductive bias when learning with deep networks.
In order to gain an understanding of the possible inductive bias, we draw an analogy to matrix factorization and examine dimensionality versus norm control there. Based on this analogy we suggest that implicit norm regularization might be central also for deep learning, and that there, too, we should think of infinite-sized bounded-norm models. We then also demonstrate how (implicit) $\ell_2$ weight decay in an infinite two-layer network gives rise to a “convex neural net”, with an infinite hidden layer and $\ell_1$ (not $\ell_2$) regularization in the top layer.
2 Network Size and Generalization
Consider training a feedforward network by finding the weights minimizing the training error. Specifically, we will consider a network with $d$ real-valued inputs $x = (x[1], \dots, x[d])$, a single hidden layer with $H$ rectified linear units, and $k$ outputs $y[1], \dots, y[k]$,
(1)  $y[j] = \sum_{h=1}^{H} v_{hj} \, [\langle u_h, x \rangle]_+$
where $[z]_+ := \max(z, 0)$ is the rectified linear activation function and $u_h \in \mathbb{R}^d$, $v_{hj} \in \mathbb{R}$ are the weights learned by minimizing a (truncated) softmax cross-entropy loss (footnote 2) on $n$ labeled training examples. The total number of weights is then $H(d + k)$.
Footnote 2: When using softmax cross-entropy, the loss is never exactly zero for correct predictions with finite margins/confidences. Instead, if the data is separable, in order to minimize the loss the weights need to be scaled up toward infinity and the cross-entropy loss goes to zero, and a global minimum is never attained. In order to be able to say that we are actually reaching a zero loss solution, and hence a global minimum, we use a slightly modified softmax which does not noticeably change the results in practice. This truncated loss returns exactly the same value for wrong predictions or correct predictions with confidence less than a threshold, but returns zero for correct predictions with large enough margins. Let $\{s_i\}_{i=1}^{k}$ be the scores for the $k$ possible labels and $c$ be the correct label. Then the softmax cross-entropy loss can be written as $\ell(s, c) = \ln \sum_i \exp(s_i - s_c)$, but we instead use the differentiable loss function $\hat{\ell}(s, c) = \ln \sum_i f(s_i - s_c)$, where $f(x) = \exp(x)$ for $x \ge -11$ and $f(x) = \exp(-11)\,[x + 13]_+^2 / 4$ otherwise. Therefore, we only deviate from the softmax cross-entropy when the margin is more than $11$, at which point the effect of this deviation is negligible (we always have $|\ell(s, c) - \hat{\ell}(s, c)| \le 0.000003\,k$): if there are any actual errors the behavior on them would completely dominate correct examples with margin over $11$, and if there are no errors we are just capping the amount by which we need to scale up the weights.
What happens to the training and test errors when we increase the network size $H$? The training error will necessarily decrease. The test error might initially decrease as the approximation error is reduced and the network is better able to capture the targets. However, as the size increases further, we lose our capacity control and generalization ability, and should start overfitting. This is the classic approximation-estimation tradeoff behavior.
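The truncated loss described in the footnote can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: the threshold of 11 and the quadratic tail `exp(-11) * [x + 13]_+^2 / 4` are assumptions of this sketch, chosen so that `f` is continuously differentiable and vanishes for margins above 13, and the function names are hypothetical.

```python
import numpy as np

THRESHOLD = 11.0  # assumed margin threshold beyond which the loss is truncated


def softmax_xent(scores, correct):
    """Standard softmax cross-entropy: ln sum_i exp(s_i - s_c)."""
    diffs = scores - scores[correct]
    return float(np.log(np.sum(np.exp(diffs))))


def f_truncated(x):
    """exp(x) for x >= -THRESHOLD; a quadratic tail that hits zero at -(THRESHOLD + 2)."""
    x = np.asarray(x, dtype=float)
    tail = np.exp(-THRESHOLD) * np.maximum(x + THRESHOLD + 2.0, 0.0) ** 2 / 4.0
    return np.where(x >= -THRESHOLD, np.exp(x), tail)


def truncated_xent(scores, correct):
    """Truncated loss: ln sum_i f(s_i - s_c); exactly zero for large enough margins."""
    diffs = scores - scores[correct]
    return float(np.log(np.sum(f_truncated(diffs))))


# A confidently correct prediction: the usual loss is tiny but positive,
# while the truncated loss is exactly zero.
confident = np.array([20.0, 0.0, 1.0])
print(softmax_xent(confident, 0), truncated_xent(confident, 0))
```

Because the term for the correct label always contributes `f(0) = 1`, the sum inside the logarithm is at least one, which is why the two losses can differ only by a tiny amount per label.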
Consider, however, the results shown in Figure 1, where we trained networks of increasing size on the MNIST and CIFAR-10 datasets. Training was done using stochastic gradient descent with momentum and diminishing step sizes, on the training error and without any explicit regularization. As expected, both training and test error initially decrease. More surprising is that if we increase the size of the network past the size required to achieve zero training error, the test error continues decreasing! This behavior is not at all predicted by, and even contrary to, viewing learning as fitting a hypothesis class controlled by network size. For example for MNIST, 32 units are enough to attain zero training error. When we allow more units, the network is not fitting the training data any better, but the estimation error, and hence the generalization error, should increase with the increase in capacity. However, the test error goes down. In fact, as we add more and more parameters, even beyond the number of training examples, the generalization error does not go up.
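To make the size comparison concrete, the following sketch implements the forward pass of a single-hidden-layer rectified linear network and counts its weights as the number of hidden units grows. The input dimension, number of classes and training-set size are illustrative assumptions for an MNIST-like problem and are not taken from the experiments above; the function names are hypothetical.

```python
import numpy as np


def relu_network(x, U, V):
    """Outputs y[j] = sum_h V[h, j] * max(<U[:, h], x>, 0).

    U has shape (d, H): column h holds the incoming weights of hidden unit h.
    V has shape (H, k): row h holds the outgoing weights of hidden unit h.
    """
    hidden = np.maximum(U.T @ x, 0.0)
    return V.T @ hidden


def num_weights(d, H, k):
    """Total number of weights of the network: H * (d + k)."""
    return H * (d + k)


# Assumed sizes for illustration only (MNIST-like problem).
d, k, n_train = 784, 10, 60000
for H in (4, 32, 128, 1024):
    total = num_weights(d, H, k)
    print(f"H={H:5d}  weights={total:8d}  more weights than examples: {total > n_train}")
```

Under these assumed sizes the weight count overtakes the number of training examples somewhere between 32 and 128 hidden units, which is the regime where a size-based capacity argument would predict overfitting.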
We further tested this phenomenon under some artificial mutilations of the data set. First, we wanted to artificially ensure that the approximation error was indeed zero and does not decrease as we add more units. To this end, we first trained a network with a small number $H_0$ of hidden units (one value for MNIST and one for CIFAR) on the entire dataset (train+test+validation). This network did have some disagreements with the correct labels, but we then switched all labels to agree with the network, creating a “censored” data set. We can think of this censored data as representing an artificial source distribution which can be exactly captured by a network with $H_0$ hidden units. That is, the approximation error is zero for networks with at least $H_0$ hidden units, and so does not decrease further. Still, as can be seen in the middle row of Figure 2, the test error continues decreasing even after reaching zero training error.
Next, we tried to force overfitting by adding random label noise to the data. We wanted to see whether now the network will use its higher capacity to try to fit the noise, thus hurting generalization. However, as can be seen in the bottom row of Figure 2, even with five percent random labels, there is no significant overfitting and test error continues decreasing as network size increases past the size required for achieving zero training error.
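The two data manipulations described above (censoring the labels with a small network, and injecting random label noise) can be sketched as follows. This is a minimal illustration under stated assumptions: `score_fn` stands for any trained small network returning per-class scores, noisy labels are drawn uniformly over all classes (so a few may coincide with the original label), and both function names are hypothetical.

```python
import numpy as np


def censor_labels(score_fn, X):
    """Relabel every example with the prediction of a fixed (small) network."""
    return np.argmax(score_fn(X), axis=1)


def add_label_noise(y, num_classes, fraction, rng):
    """Replace the labels of a random `fraction` of examples by uniformly random labels."""
    y_noisy = y.copy()
    n_noisy = int(round(fraction * len(y)))
    idx = rng.choice(len(y), size=n_noisy, replace=False)
    y_noisy[idx] = rng.integers(0, num_classes, size=n_noisy)
    return y_noisy, idx


rng = np.random.default_rng(0)
X_demo = rng.standard_normal((1000, 5))
W_demo = rng.standard_normal((5, 3))          # stand-in for a trained small network
y_censored = censor_labels(lambda X: X @ W_demo, X_demo)
y_noisy, noisy_idx = add_label_noise(y_censored, num_classes=3, fraction=0.05, rng=rng)
print(len(noisy_idx), int(np.sum(y_noisy != y_censored)))
```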
What is happening here? A possible explanation is that the optimization is introducing some implicit regularization. That is, we are implicitly trying to find a solution with small “complexity”, for some notion of complexity, perhaps norm. This can explain why we do not overfit even when the number of parameters is huge. Furthermore, increasing the number of units might allow for solutions that actually have lower “complexity”, and thus generalize better. Perhaps an ideal then would be an infinite network controlled only through this hidden complexity.
We want to emphasize that we are not including any explicit regularization, neither as an explicit penalty term nor by modifying optimization through, e.g., dropout, weight decay, or one-pass stochastic methods. We are using a stochastic method, but we are running it to convergence: we achieve zero surrogate loss and zero training error. In fact, we also tried training using batch conjugate gradient descent and observed almost identical behavior. But it seems that even so, we are not getting to some random global minimum; indeed, for large networks the vast majority of the many global minima of the training error would horribly overfit. Instead, the optimization is directing us toward a “low complexity” global minimum.
Although we do not know what this hidden notion of complexity is, as a final experiment we tried to see the effect of adding explicit regularization in the form of weight decay. The results are shown in the top row of Figure 2. There is a slight improvement in generalization but we still see that increasing the network size helps generalization.
3 A Matrix Factorization Analogy
To gain some understanding of what might be going on, let us consider a slightly simpler model which we do understand much better. Instead of rectified linear activations, consider a feedforward network with a single hidden layer, and linear activations, i.e.:
(2)  $y[j] = \sum_{h=1}^{H} v_{hj} \, \langle u_h, x \rangle$
This is of course simply a matrix-factorization model, where $y = Wx$ and $W = VU^\top$. Controlling capacity by limiting the number of hidden units exactly corresponds to constraining the rank of $W$, i.e. biasing toward low dimensional factorizations. Such a low-rank inductive bias is indeed sensible, though computationally intractable to handle with most loss functions.
However, in the last decade we have seen much success for learning with low norm factorizations. In such models, we do not constrain the inner dimensionality $H$ of $U, V$, and instead only constrain, or regularize, their norm. For example, constraining the Frobenius norm of $U$ and $V$ corresponds to using the trace-norm as an inductive bias (Srebro et al., 2004):
(3)  $\|W\|_{\mathrm{tr}} = \min_{W = VU^\top} \frac{1}{2}\left( \|U\|_F^2 + \|V\|_F^2 \right)$
Other norms of the factorization lead to different regularizers.
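The link between Frobenius-norm control of the factors and the trace-norm can be checked numerically. The sketch below builds the balanced factorization from a singular value decomposition and verifies that half the sum of squared Frobenius norms equals the sum of singular values, while an unbalanced rescaling of the same factorization costs more. It only verifies that the balanced factorization attains the trace-norm; it does not prove minimality. The matrix sizes and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 7))

# Singular value decomposition W = A diag(s) B^T.
A, s, Bt = np.linalg.svd(W, full_matrices=False)
trace_norm = s.sum()

# Balanced factorization W = V U^T with V = A sqrt(S) and U = B sqrt(S).
V = A * np.sqrt(s)
U = Bt.T * np.sqrt(s)
balanced_cost = 0.5 * (np.linalg.norm(U, "fro") ** 2 + np.linalg.norm(V, "fro") ** 2)

# Rescaling the factors keeps the product fixed but increases the cost.
c = 3.0
rescaled_cost = 0.5 * (np.linalg.norm(c * U, "fro") ** 2 + np.linalg.norm(V / c, "fro") ** 2)

print(trace_norm, balanced_cost, rescaled_cost)
```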
Unlike the rank, the trace-norm (as well as other factorization norms) is convex, and leads to tractable learning problems (Fazel et al., 2001; Srebro et al., 2004). In fact, even if learning is done by a local search over the factor matrices $U$ and $V$ (i.e. by a local search over the weights of the network), if the dimensionality is high enough and the norm is regularized, we can ensure convergence to a global minimum (Burer & Choi, 2006). This is in stark contrast to the dimensionality-constrained low-rank situation, where the limiting factor is the number of hidden units, and local minima are abundant (Srebro & Jaakkola, 2003).
Furthermore, the trace-norm and other factorization norms are well-justified as sensible inductive biases. We can ensure generalization based on having low trace-norm, and a low trace-norm model corresponds to a realistic factor model with many factors of limited overall influence. In fact, empirical evidence suggests that in many cases low-norm factorizations are a more appropriate inductive bias compared to low-rank models.
We see, then, that in the case of linear activations (i.e. matrix factorization), the norm of the factorization is in a sense a better inductive bias than the number of weights: it ensures generalization, it is grounded in reality, and it explains why the models can be learned tractably.
Let us interpret the experimental results of Section 2 in this light. Perhaps learning is succeeding not because there is a good representation of the targets with a small number of units, but rather because there is a good representation with small overall norm, and the optimization is implicitly biasing us toward low-norm models. Such an inductive bias might potentially explain both the generalization ability and the computational tractability of learning, even using local search.
Under this interpretation, we really should be using infinite-sized networks, with an infinite number of hidden units. Fitting a finite network (with implicit regularization) can be viewed as an approximation to fitting the “true” infinite network. This situation is also common in matrix factorization: e.g., a very successful approach for training low trace-norm models, and other infinite-dimensional bounded-norm factorization models, is to approximate them using a finite dimensional representation (Rennie & Srebro, 2005; Srebro & Salakhutdinov, 2010). The finite dimensionality is then not used at all for capacity (statistical complexity) control, but purely for computational reasons. Indeed, increasing the allowed dimensionality generally improves generalization performance, as it allows us to better approximate the true infinite model.
4 Infinite Size, Bounded Norm Networks
In this final section, we consider a possible model for infinite-sized norm-regularized networks. Our starting point is that of global $\ell_2$ weight decay, i.e. adding a regularization term that penalizes the sum of squares of all weights in the network, as might be approximately introduced by some implicit regularization. Our result in this section is that this global $\ell_2$ regularization is equivalent to a Convex Neural Network (Convex NN; Bengio et al., 2005), an infinite network with $\ell_1$ regularization on the top layer. Note that such models are rather different from infinite networks with $\ell_2$ regularization on the top layer, which reduce to linear methods with some specific kernel (Cho & Saul, 2009; Bach, 2014). Note also that our aim here is to explain what neural networks are doing instead of trying to match the performance of deep models with a known shallow model as done by, e.g., Lu et al. (2014).
For simplicity, we will focus on single output networks ($k = 1$), i.e. networks which compute a function $f: \mathbb{R}^d \to \mathbb{R}$. We first consider finite two-layer networks (with a single hidden layer) and show that $\ell_2$ regularization on both layers is equivalent to an $\ell_2$ constraint on each unit in the hidden layer, and $\ell_1$ regularization on the top unit:
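Theorem 1.
Let $L$ be a loss function and $D = (x_t, y_t)_{t=1}^{n}$ be training examples. Then
(4)  $\operatorname*{argmin}_{v \in \mathbb{R}^H,\ (u_h)_{h=1}^{H}} \left( \sum_{t=1}^{n} L\!\left( y_t, \sum_{h=1}^{H} v_h [\langle u_h, x_t \rangle]_+ \right) + \frac{\lambda}{2} \sum_{h=1}^{H} \left( \|u_h\|^2 + |v_h|^2 \right) \right)$
is the same as
(5)  $\operatorname*{argmin}_{v \in \mathbb{R}^H,\ (u_h)_{h=1}^{H}} \left( \sum_{t=1}^{n} L\!\left( y_t, \sum_{h=1}^{H} v_h [\langle u_h, x_t \rangle]_+ \right) + \lambda \sum_{h=1}^{H} |v_h| \right)$ subject to $\|u_h\| \le 1$ for $h = 1, \dots, H$.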
Proof.
By the inequality between the arithmetic and geometric means, we have
$\frac{1}{2} \sum_{h=1}^{H} \left( \|u_h\|^2 + |v_h|^2 \right) \ge \sum_{h=1}^{H} \|u_h\| \cdot |v_h|$.
The right-hand side can always be attained without changing the input-output mapping by the rescaling $\tilde{u}_h = \sqrt{|v_h| / \|u_h\|} \cdot u_h$ and $\tilde{v}_h = \sqrt{\|u_h\| / |v_h|} \cdot v_h$. The reason we can rescale the weights without changing the input-output mapping is that the rectified linear unit is piecewise linear and the piece a hidden unit is on is invariant to positive rescaling of the weights. Finally, since the right-hand side of the above inequality is invariant to rescaling, we can always choose the norm $\|u_h\|$ to be bounded by one. ∎
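The rescaling argument in the proof can be checked numerically for a single hidden unit. The sketch below (hypothetical helper names, assuming a nonzero incoming weight vector and a nonzero outgoing weight) verifies that balancing the two layers leaves the unit's output unchanged and makes the weight-decay penalty equal to the product of the norms, and that normalizing the incoming weights to unit norm moves that product onto the absolute value of the outgoing weight.

```python
import numpy as np


def unit_output(u, v, x):
    """Contribution of one hidden unit: v * max(<u, x>, 0)."""
    return v * max(float(u @ x), 0.0)


def balance(u, v):
    """Rescale so that ||u|| = |v| while keeping the product u * v fixed (u, v nonzero)."""
    scale = np.sqrt(abs(v) / np.linalg.norm(u))
    return scale * u, v / scale


def normalize(u, v):
    """Rescale so that ||u|| = 1 while keeping the product u * v fixed (u nonzero)."""
    norm_u = np.linalg.norm(u)
    return u / norm_u, v * norm_u


rng = np.random.default_rng(1)
u, v, x = rng.standard_normal(4), -2.5, rng.standard_normal(4)

u_bal, v_bal = balance(u, v)
u_one, v_one = normalize(u, v)

penalty_original = 0.5 * (np.linalg.norm(u) ** 2 + v ** 2)
penalty_balanced = 0.5 * (np.linalg.norm(u_bal) ** 2 + v_bal ** 2)
product_of_norms = np.linalg.norm(u) * abs(v)
print(penalty_original, penalty_balanced, product_of_norms, abs(v_one))
```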
Now we establish a connection between the $\ell_1$-regularized network (5), in which we learn the input-to-hidden connections, and the convex NN (Bengio et al., 2005).
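First we recall the definition of a convex NN. Let $\mathcal{U}$ be a fixed “library” of possible weight vectors $\{u \in \mathbb{R}^d : \|u\| \le 1\}$ and let $\mu_+$ and $\mu_-$ be positive (unnormalized) measures over $\mathcal{U}$ representing the positive and negative part of the weights of each unit. Then a “convex neural net” is given by predictions of the form
(6)  $y = \int_{\mathcal{U}} \left( d\mu_+(u) - d\mu_-(u) \right) [\langle u, x \rangle]_+$
with regularizer (i.e. complexity measure) $\mu_+(\mathcal{U}) + \mu_-(\mathcal{U})$. This is simply an infinite generalization of a network with hidden units $\mathcal{U}$ and $\ell_1$ regularization on the second layer: if $\mathcal{U}$ is finite, (6) is equivalent to
(7)  $y = \sum_{u \in \mathcal{U}} v(u) \, [\langle u, x \rangle]_+$
with $v(u) := \mu_+(u) - \mu_-(u)$ and regularizer $\|v\|_1 = \sum_{u \in \mathcal{U}} |v(u)|$. Training a convex NN is then given by:
(8)  $\operatorname*{argmin}_{\mu_+, \mu_-} \left( \sum_{t=1}^{n} L\!\left( y_t, \int_{\mathcal{U}} \left( d\mu_+(u) - d\mu_-(u) \right) [\langle u, x_t \rangle]_+ \right) + \lambda \left( \mu_+(\mathcal{U}) + \mu_-(\mathcal{U}) \right) \right)$
Moreover, even if $\mathcal{U}$ is infinite and even continuous, there will always be an optimum of (8) which is a discrete measure with support at most $n + 1$ (Rosset et al., 2007). That is, (8) can be equivalently written as:
(9)  $\operatorname*{argmin}_{u_1, \dots, u_{n+1} \in \mathcal{U},\ v \in \mathbb{R}^{n+1}} \left( \sum_{t=1}^{n} L\!\left( y_t, \sum_{h=1}^{n+1} v_h [\langle u_h, x_t \rangle]_+ \right) + \lambda \|v\|_1 \right)$
which is the same as (5), with $H = n + 1$.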
The difference between the network (5) and the infinite network (8) is in learning versus selecting the hidden units, and in that in (8) we have no limit on the number of units used. That is, in (8) we have all possible units in $\mathcal{U}$ available to us, and we merely need to select which ones we want to use, without any constraint on the number of units used, only on their overall $\ell_1$ norm. But the equivalence of (8) and (9) establishes that as long as the number of allowed units $H$ is large enough, the two are equivalent:
Corollary 1.
As long as $H \ge n + 1$, the solutions of (5) and (8) are equivalent.
In summary, learning and selecting are equivalent if we have sufficiently many hidden units. Theorem 1 also gives an alternative justification for employing $\ell_1$ regularization when the input-to-hidden weights are fixed and normalized to have unit norm: it is equivalent to $\ell_2$ regularization, which can be achieved by weight decay or by implicit regularization via stochastic gradient descent.
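The "selecting" side of this equivalence can be illustrated with a finite library of unit-norm hidden units whose top-layer weights are fit under an $\ell_1$ penalty. The sketch below is only an illustration under stated assumptions: it uses the squared loss in place of the generic loss $L$, a random finite library in place of the full unit ball, and proximal gradient descent (ISTA) as the solver; all names are hypothetical and none of this is taken from the paper's experiments.

```python
import numpy as np


def hidden_features(X, library):
    """Rows of `library` are fixed unit-norm hidden units u; returns [<u, x>]_+ per example."""
    return np.maximum(X @ library.T, 0.0)


def objective(Phi, y, v, lam):
    """Squared loss (an assumed choice of L) plus the l1 penalty on the top layer."""
    return 0.5 * np.sum((Phi @ v - y) ** 2) + lam * np.sum(np.abs(v))


def fit_top_layer_l1(Phi, y, lam, n_iter=500):
    """Proximal gradient descent (ISTA) with step 1 / ||Phi||_2^2."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2
    v = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        z = v - step * (Phi.T @ (Phi @ v - y))                     # gradient step
        v = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-thresholding
    return v


rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))           # n = 20 training inputs
y = rng.standard_normal(20)
library = rng.standard_normal((200, 3))    # 200 candidate hidden units
library /= np.linalg.norm(library, axis=1, keepdims=True)

Phi = hidden_features(X, library)
v = fit_top_layer_l1(Phi, y, lam=1.0)
print("units selected:", int(np.count_nonzero(v)), "of", len(v))
```

The number of hidden units in the library plays no role in the penalty here; only the $\ell_1$ norm of the top-layer weights is controlled, which is the sense in which units are selected rather than learned.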
The above equivalence holds also for networks with multiple output units, i.e. $k > 1$, where the $\ell_1$ regularization on $v$ is replaced with the group lasso regularizer $\sum_{h=1}^{H} \|v_h\|$. Indeed, for matrix factorizations (i.e. with linear activations), such a group-lasso regularized formulation is known to be equivalent to the trace norm (3) (see Argyriou et al., 2007).
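For linear activations, one direction of the group-lasso and trace-norm connection can be seen directly from a singular value decomposition: taking the right singular vectors as unit-norm hidden units and the scaled left singular vectors as outgoing weights gives a factorization whose group lasso value equals the trace norm. The sketch below checks only that this particular factorization attains the trace norm; it does not verify that no other factorization does better. Matrix sizes and names are arbitrary choices for illustration.

```python
import numpy as np


def group_lasso(V):
    """Sum over hidden units h of the Euclidean norm of the outgoing weights V[h, :]."""
    return float(np.sum(np.linalg.norm(V, axis=1)))


rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))            # linear map from d = 6 inputs to k = 4 outputs

A, s, Bt = np.linalg.svd(W, full_matrices=False)
U = Bt                                     # row h: unit-norm hidden unit u_h
V = (A * s).T                              # row h: outgoing weights sigma_h * a_h

print(group_lasso(V), s.sum())
```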
References
 Anthony & Bartlett (1999) Anthony, Martin and Bartlett, Peter L. Neural network learning: Theoretical foundations. Cambridge University Press, 1999.
 Argyriou et al. (2007) Argyriou, Andreas, Evgeniou, Theodoros, and Pontil, Massimiliano. Multi-task feature learning. Advances in neural information processing systems, pp. 41–48, 2007.
 Bach (2014) Bach, Francis. Breaking the curse of dimensionality with convex neural networks. http://www.di.ens.fr/~fbach/fbach_cifar_2014.pdf, 2014.
 Bengio et al. (2005) Bengio, Yoshua, Roux, Nicolas L., Vincent, Pascal, Delalleau, Olivier, and Marcotte, Patrice. Convex neural networks. Advances in neural information processing systems, pp. 123–130, 2005.
 Burer & Choi (2006) Burer, Samuel and Choi, Changhui. Computational enhancements in low-rank semidefinite programming. Optimization Methods and Software, 21(3):493–512, 2006.
 Cho & Saul (2009) Cho, Youngmin and Saul, Lawrence K. Kernel methods for deep learning. Advances in neural information processing systems, pp. 342–350, 2009.
 Daniely et al. (2014) Daniely, Amit, Linial, Nati, and Shalev-Shwartz, Shai. From average case complexity to improper learning complexity. STOC, 2014.
 Fazel et al. (2001) Fazel, Maryam, Hindi, Haitham, and Boyd, Stephen P. A rank minimization heuristic with application to minimum order system approximation. Proceedings of American Control Conference, pp. 4734–4739, 2001.
 Kearns & Valiant (1994) Kearns, Michael and Valiant, Leslie. Cryptographic limitations on learning boolean formulae and finite automata. Journal of the ACM (JACM), 41(1):67–95, 1994.
 Livni et al. (2014) Livni, Roi, Shalev-Shwartz, Shai, and Shamir, Ohad. On the computational efficiency of training neural networks. Advances in Neural Information Processing Systems, pp. 855–863, 2014.
 Lu et al. (2014) Lu, Zhiyun, May, Avner, Liu, Kuan, Garakani, Alireza Bagheri, Guo, Dong, Bellet, Aurélien, Fan, Linxi, Collins, Michael, Kingsbury, Brian, Picheny, Michael, and Sha, Fei. How to scale up kernel methods to be as good as deep neural nets. Technical Report, arXiv:1411.4000, 2014.
 Rennie & Srebro (2005) Rennie, Jason D. M. and Srebro, Nathan. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd international conference on Machine learning, pp. 713–719. ACM, 2005.
 Rosset et al. (2007) Rosset, Saharon, Swirszcz, Grzegorz, and Srebro, Nathan. $\ell_1$ regularization in infinite dimensional feature spaces. In COLT, pp. 544–558. Springer, 2007.
 Sherstov (2006) Klivans, Adam R. and Sherstov, Alexander A. Cryptographic hardness for learning intersections of halfspaces. In Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on, pp. 553–562. IEEE, 2006.
 Sipser (2006) Sipser, Michael. Introduction to the Theory of Computation. Thomson Course Technology, 2006.
 Srebro & Jaakkola (2003) Srebro, Nathan and Jaakkola, Tommi S. Weighted low-rank approximations. ICML, pp. 720–727, 2003.
 Srebro & Salakhutdinov (2010) Srebro, Nathan and Salakhutdinov, Ruslan. Collaborative filtering in a non-uniform world: Learning with the weighted trace norm. In Advances in Neural Information Processing Systems, pp. 2056–2064, 2010.
 Srebro et al. (2004) Srebro, Nathan, Rennie, Jason, and Jaakkola, Tommi S. Maximum-margin matrix factorization. Advances in neural information processing systems, pp. 1329–1336, 2004.
Appendix
For the convenience of the reader, we formalize here the hardness of learning feedforward neural networks mentioned in the Introduction. The results are presented in a way that is appropriate for feedforward networks with RELU activations, but they are really a direct implication of recent results about learning intersections of halfspaces. For historical completeness we note that hardness of learning logarithmic depth networks was already established by Kearns & Valiant (1994), and that the more recent results we discuss here (Sherstov, 2006; Daniely et al., 2014) establish also hardness of learning depth two networks, subject to perhaps simpler cryptographic assumptions. The presentation and construction here are similar to those of Livni et al. (2014).
Question
Is there a sample complexity function and an algorithm that takes as input and returns a description of a function such that the following is true:
For any , any and any distribution over , if:

There exists a feedforward neural network with RELU activations with inputs and hidden units implementing a function such that (i.e. for , the label can be perfectly predicted by a network of size ).

The input to is drawn i.i.d. from .

(i.e. is provided with enough training data).
then

With probability at least $1/2$, algorithm returns a function such that (that is, at least half the time the algorithm succeeds in learning a function with non-trivial error).
for some (i.e. the sample complexity required by the algorithm is polynomial in the network size—if we needed a superpolynomial number of samples, we would have no hope of learning in polynomial time).

The function that corresponds to the description returned by can be computed in time from its description (i.e. the representation used by the learner can be a feedforward network of any size polynomial in and , or of any other representation that can be efficiently computed).

runs in time
Theorem.
Subject to the cryptographic assumptions in Daniely et al. (2014), there is no algorithm that satisfies the conditions in the above question.
In fact, there is no algorithm satisfying the conditions even if we require that the labels can be perfectly predicted by a network with a single hidden layer with any super-constant, e.g. , number of hidden units.
Proof.
We show that every intersection of homogeneous halfspaces over with normals in can be realized with unit margin by a feedforward neural network with
hidden units in a single hidden layer. For each hyperplane
, where , we include two units in the hidden layer: and . We set all incoming weights of the output node to be . Therefore, this network is realizing the following function:
Since all inputs and all weights are integer, the outputs of the first layer will be integer, will be zero or one, and realizes the intersection of the halfspaces with unit margin. Hence, the hypothesis class of intersections of halfspaces is a subset of the hypothesis class of feedforward neural networks with hidden units in a single hidden layer. We complete the proof by applying Theorem 5.4 in Daniely et al. (2014) which states that for any , subject to the cryptographic assumptions in Daniely et al. (2014), the hypothesis class of intersections of homogeneous halfspaces over with normals in is not efficiently PAC learnable (even improperly) (footnote 3). ∎
Footnote 3: Their Theorem 5.4 talks about unrestricted halfspaces, but the construction in Section 7.2 uses only data in and halfspaces specified by with .
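The idea behind the construction, namely that two rectified linear units per halfspace yield a zero-or-one indicator on integer inputs, can be sketched as follows. The specific units used here, $[\langle w, x\rangle]_+$ with output weight $+1$ and $[\langle w, x\rangle - 1]_+$ with output weight $-1$, as well as the choice of inputs and normals in $\{\pm 1\}^D$, are assumptions of this sketch and may differ in detail from the construction in the proof; the function names are hypothetical.

```python
from itertools import product

import numpy as np


def relu(z):
    return np.maximum(z, 0)


def count_satisfied(x, normals):
    """Single-hidden-layer network with two RELU units per halfspace.

    For integer z = <w, x>, relu(z) - relu(z - 1) is 1 if z >= 1 and 0 if z <= 0,
    so the output counts how many halfspaces <w, x> > 0 contain x.
    """
    z = normals @ x
    return int(np.sum(relu(z) - relu(z - 1)))


def in_intersection(x, normals):
    """x lies in the intersection iff every halfspace indicator equals one."""
    return count_satisfied(x, normals) == len(normals)


# Two halfspaces over {-1, +1}^3 with normals in {-1, +1}^3 (illustrative values).
normals = np.array([[1, 1, 1], [1, -1, 1]])
for x in product([-1, 1], repeat=3):
    print(x, in_intersection(np.array(x), normals))
```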
We proved here that even for no algorithm can satisfy the condition in the question. A similar result can be shown for subject to weaker cryptographic assumptions in Sherstov (2006).
The Theorem tells us not only that we cannot expect to fit a small network to data even if the data is generated by the network (since doing so would give us an efficient learning algorithm, which contradicts the Theorem), but that we also can’t expect to learn by using a much larger network. That is, even if we know that labels can be perfectly predicted by a small network, we cannot expect to have a learning algorithm that learns a much larger (but poly sized) network that will have nontrivial error. In fact, being representable by a small network is not enough to ensure tractable learning no matter what representation the learning algorithm uses (e.g. a much larger network, a mixture of networks, a tree over networks, etc). This is a much stronger statement than just saying that fitting a network to data is hard. Also, precluding the possibility of tractable learning if the labels are exactly explained by some small unknown network of course also precludes the possibility of achieving low error when the labels are only approximately explained by some small unknown network (i.e. of noisy or “agnostic” learning).