Central to any form of learning is an inductive bias that induces some sort of capacity control (i.e. restricts or encourages predictors to be “simple” in some way), which in turn allows for generalization. The success of learning then depends on how well the inductive bias captures reality (i.e. how expressive the hypothesis class of “simple” predictors is) relative to the capacity induced, as well as on the computational complexity of fitting a “simple” predictor to the training data.
Let us consider learning with feed-forward networks from this perspective. If we search for the weights minimizing the training error, we are essentially considering the hypothesis class of predictors representable with different weight vectors, typically for some fixed architecture. Capacity is then controlled by the size (number of weights) of the network.¹

¹The exact correspondence depends on the activation function: for hard-threshold activations the pseudo-dimension, and hence the sample complexity, scales as $O(|E|\log|E|)$, where $|E|$ is the number of weights in the network; with sigmoidal activations it is between $\Omega(|E|^2)$ and $O(|E|^4)$ (Anthony & Bartlett, 1999).

Our justification for using such networks is then that many interesting and realistic functions can be represented by not-too-large (and hence bounded-capacity) feed-forward networks. Indeed, in many cases we can show how specific architectures can capture desired behaviors. More broadly, any function computable in time $T$ can be captured by a network of size polynomial in $T$, and so the expressive power of such networks is indeed great (Sipser, 2006, Theorem 9.25).
At the same time, we also know that learning even moderately sized networks is computationally intractable—not only is it NP-hard to minimize the empirical error, even with only three hidden units, but it is hard to learn small feed-forward networks using any learning method (subject to cryptographic assumptions). That is, even for binary classification using a network with a single hidden layer and a logarithmic (in the input size) number of hidden units, and even if we know the true targets are exactly captured by such a small network, there is likely no efficient algorithm that can ensure error better than 1/2 (Sherstov, 2006; Daniely et al., 2014)—not if the algorithm tries to fit such a network, not even if it tries to fit a much larger network, and in fact no matter how the algorithm represents predictors (see the Appendix). And so, merely knowing that some not-too-large architecture is excellent in expressing reality does not explain why we are able to learn using it, nor using an even larger network. Why is it then that we succeed in learning using multilayer feed-forward networks? Can we identify a property that makes them possible to learn? An alternative inductive bias?
Here, we take our first steps toward shedding light on this question by revisiting our understanding of network size as the capacity control at play.
Our main observation, based on empirical experimentation with single-hidden-layer networks of increasing size (increasing number of hidden units), is that size does not behave as a capacity control parameter, and in fact there must be some other, implicit, capacity control at play. We suggest that this hidden capacity control might be the real inductive bias when learning with deep networks.
To try to gain an understanding of the possible inductive bias, we draw an analogy to matrix factorization, where the roles of dimensionality versus norm control are well understood. Based on this analogy we suggest that implicit norm regularization might be central to deep learning as well, and that here too we should think of infinite-sized bounded-norm models. We then demonstrate how (implicit) weight decay in an infinite two-layer network gives rise to a “convex neural net”, with an infinite hidden layer and $\ell_1$ (not $\ell_2$) regularization in the top layer.
2 Network Size and Generalization
Consider training a feed-forward network by finding the weights minimizing the training error. Specifically, we will consider a network with real-valued inputs $x \in \mathbb{R}^D$, a single hidden layer with $H$ rectified linear units, and $k$ outputs $y \in \mathbb{R}^k$,

$$y[j] = \sum_{h=1}^{H} v_{jh}\,[\langle u_h, x\rangle]_+ ,$$

where $[z]_+ = \max(z, 0)$ is the rectified linear activation function and $u_h \in \mathbb{R}^D$, $v_{jh} \in \mathbb{R}$ are the weights, learned by minimizing a (truncated) soft-max cross-entropy loss² on $n$ labeled training examples. The total number of weights is then $H(D + k)$.

²When using soft-max cross-entropy, the loss is never exactly zero for correct predictions with finite margins/confidences. Instead, if the data is separable, minimizing the loss requires scaling the weights up toward infinity so that the cross-entropy goes to zero, and a global minimum is never attained. In order to be able to say that we are actually reaching a zero-loss solution, and hence a global minimum, we use a slightly modified, still differentiable, soft-max which does not noticeably change the results in practice. This truncated loss returns the exact same value as the soft-max cross-entropy for wrong predictions, or for correct predictions with confidence below a threshold, but returns zero for correct predictions with large enough margins: letting $f_j$ denote the scores for the $k$ possible labels and $c$ the correct label, the soft-max cross-entropy loss can be written as $\ell(f, c) = \log \sum_j \exp(f_j - f_c)$, and the truncated loss deviates from it only when every margin $f_c - f_j$ (for $j \neq c$) exceeds the threshold, at which point the effect of the deviation is negligible: if there are any actual errors, the behavior on them completely dominates correct examples with margins over the threshold, and if there are no errors we are just capping the amount by which we need to scale up the weights.
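To make the setup concrete, here is a minimal numpy sketch of the forward pass and one possible implementation of the truncation described above (the threshold value `B` and the exact form of the capping are our illustrative choices, not taken from the text):

```python
import numpy as np

def forward(x, U, V):
    """Single-hidden-layer RELU network: y[j] = sum_h V[j,h] * max(<U[h], x>, 0)."""
    h = np.maximum(U @ x, 0.0)  # H rectified linear hidden units
    return V @ h                # k linear output units

def truncated_softmax_loss(f, c, B=10.0):
    """Soft-max cross-entropy, except terms whose margin exceeds B are dropped,
    so the loss is exactly zero once every wrong label is beaten by more than B."""
    d = np.delete(f - f[c], c)                 # f_j - f_c for j != c
    terms = np.where(d >= -B, np.exp(d), 0.0)  # drop terms with margin > B
    return float(np.log1p(terms.sum()))

# Tiny example: D=2 inputs, H=2 hidden units, k=2 outputs
U = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[1.0, 1.0], [0.0, 1.0]])
x = np.array([2.0, -3.0])
y = forward(x, U, V)  # hidden activations are [2, 0]
```

Note that for a confidently correct prediction (all margins above `B`) the truncated loss is exactly zero, so “zero surrogate loss” is attainable at finite weights.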
What happens to the training and test errors when we increase the network size $H$? The training error will necessarily decrease. The test error might initially decrease as the approximation error is reduced and the network is better able to capture the targets. However, as the size increases further, we lose our capacity control and generalization ability, and should start overfitting. This is the classic approximation-estimation tradeoff behavior.
Consider, however, the results shown in Figure 1, where we trained networks of increasing size on the MNIST and CIFAR-10 datasets. Training was done using stochastic gradient descent with momentum and diminishing step sizes, on the training error and without any explicit regularization. As expected, both the training and test errors initially decrease. More surprisingly, if we increase the size of the network past the size required to achieve zero training error, the test error continues decreasing! This behavior is not at all predicted by, and is even contrary to, viewing learning as fitting a hypothesis class controlled by network size. For example, on MNIST, 32 units are enough to attain zero training error. When we allow more units, the network is not fitting the training data any better, and the estimation error, and hence the generalization error, should increase with the increase in capacity. Yet the test error goes down. In fact, as we add more and more parameters, even beyond the number of training examples, the generalization error does not go up.
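The experimental protocol can be sketched as follows. This is a hypothetical miniature, not the paper's actual setup: synthetic separable data and a squared-hinge surrogate stand in for MNIST/CIFAR and the truncated soft-max, and all hyperparameters are illustrative:

```python
import numpy as np

def train_relu_net(X, y, H, epochs=500, lr=0.05, momentum=0.9, seed=0):
    """Full-batch gradient descent with momentum on a one-hidden-layer RELU net
    (single output, squared-hinge surrogate), no explicit regularization."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    U = rng.normal(0, 1 / np.sqrt(D), (H, D))
    v = rng.normal(0, 1 / np.sqrt(H), H)
    dU, dv = np.zeros_like(U), np.zeros_like(v)
    for _ in range(epochs):
        Z = X @ U.T                             # pre-activations, n x H
        A = np.maximum(Z, 0.0)                  # hidden activations
        f = A @ v                               # predictions
        g = -y * np.maximum(1 - y * f, 0) / n   # d(squared hinge)/df, up to a constant
        gv = A.T @ g
        gU = ((g[:, None] * v[None, :]) * (Z > 0)).T @ X
        dv = momentum * dv - lr * gv
        dU = momentum * dU - lr * gU
        v, U = v + dv, U + dU
    return U, v

def err(X, y, U, v):
    return float(np.mean(np.sign(np.maximum(X @ U.T, 0.0) @ v) != y))

# Train networks of increasing size on the same (synthetic, separable) data
rng = np.random.default_rng(1)
Xtr, Xte = rng.normal(size=(200, 5)), rng.normal(size=(500, 5))
w = np.ones(5)
ytr, yte = np.sign(Xtr @ w), np.sign(Xte @ w)
results = []
for H in [1, 2, 4, 8, 16, 32]:
    U, v = train_relu_net(Xtr, ytr, H)
    results.append((H, err(Xtr, ytr, U, v), err(Xte, yte, U, v)))
```

The quantity of interest is how the test-error column behaves once the train-error column has hit (or nearly hit) zero.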
We also further tested this phenomenon under some artificial mutilations of the data set. First, we wanted to artificially ensure that the approximation error was indeed zero and did not decrease as we added more units. To this end, we first trained a network with a small number of hidden units on the entire dataset (train+test+validation). This network did have some disagreements with the correct labels, but we then switched all labels to agree with the network, creating a “censored” data set. We can think of this censored data as representing an artificial source distribution which can be exactly captured by the small network. That is, the approximation error is zero for networks at least this size, and so cannot decrease further. Still, as can be seen in the middle row of Figure 2, the test error continues decreasing even after reaching zero training error.
Next, we tried to force overfitting by adding random label noise to the data. We wanted to see whether the network would now use its higher capacity to try to fit the noise, thus hurting generalization. However, as can be seen in the bottom row of Figure 2, even with five percent random labels, there is no significant overfitting, and the test error continues decreasing as the network size increases past the size required for achieving zero training error.
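The label-noise mutilation amounts to something like the following (the function name and the choice to resample uniformly, possibly onto the original label, are our own):

```python
import numpy as np

def corrupt_labels(y, frac=0.05, num_classes=10, seed=0):
    """Return a copy of y in which a random `frac` of entries are replaced by
    uniformly random labels (which may coincide with the original label)."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    y[idx] = rng.integers(0, num_classes, size=len(idx))
    return y

labels = np.arange(1000) % 10
noisy = corrupt_labels(labels, frac=0.05)
```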
What is happening here? A possible explanation is that the optimization is introducing some implicit regularization. That is, we are implicitly trying to find a solution with small “complexity”, for some notion of complexity, perhaps a norm. This could explain why we do not overfit even when the number of parameters is huge. Furthermore, increasing the number of units might allow for solutions that actually have lower “complexity”, and thus generalize better. Perhaps the ideal, then, would be an infinite network controlled only through this hidden complexity measure.
We want to emphasize that we are not including any explicit regularization, neither as an explicit penalty term nor by modifying optimization through, e.g., dropout, weight decay, or one-pass stochastic methods. We are using a stochastic method, but we are running it to convergence: we achieve zero surrogate loss and zero training error. In fact, we also tried training using batch conjugate gradient descent and observed almost identical behavior. But it seems that even so, we are not converging to some arbitrary global minimum: for large networks, the vast majority of the many global minima of the training error would overfit horribly. Instead, the optimization is directing us toward a “low complexity” global minimum.
Although we do not know what this hidden notion of complexity is, as a final experiment we tried to see the effect of adding explicit regularization in the form of weight decay. The results are shown in the top row of Figure 2. There is a slight improvement in generalization, but we still see that increasing the network size helps generalization.
3 A Matrix Factorization Analogy
To gain some understanding of what might be going on, let us consider a slightly simpler model which we do understand much better. Instead of rectified linear activations, consider a feed-forward network with a single hidden layer and linear activations, i.e.:

$$y[j] = \sum_{h=1}^{H} v_{jh}\,\langle u_h, x\rangle, \qquad \text{i.e. } y = VUx.$$

This is of course simply a matrix-factorization model, $y = Wx$, where $W = VU$. Controlling capacity by limiting the number $H$ of hidden units exactly corresponds to constraining the rank of $W$, i.e. biasing toward low-dimensional factorizations. Such a low-rank inductive bias is indeed sensible, though computationally intractable to handle with most loss functions.
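The correspondence is immediate to check numerically (the dimensions here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, k = 8, 3, 5
U = rng.normal(size=(H, D))   # input-to-hidden weights
V = rng.normal(size=(k, H))   # hidden-to-output weights
W = V @ U                     # the equivalent single linear map

x = rng.normal(size=D)
# the two-layer linear network and the single matrix W agree on every input
same = np.allclose(V @ (U @ x), W @ x)
# and the number of hidden units bounds the rank of W
rank = int(np.linalg.matrix_rank(W))
```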
However, in the last decade we have seen much success for learning with low-norm factorizations. In such models, we do not constrain the inner dimensionality $H$ of $W = VU$, and instead only constrain, or regularize, the norms of the factors. For example, constraining the Frobenius norms of $U$ and $V$ corresponds to using the trace-norm of $W$ as an inductive bias (Srebro et al., 2004):

$$\|W\|_{\mathrm{tr}} = \min_{W = VU} \tfrac{1}{2}\big(\|U\|_F^2 + \|V\|_F^2\big). \qquad (3)$$
Other norms of the factorization lead to different regularizers.
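The variational characterization (3) can be checked numerically: any factorization upper-bounds the trace-norm (computed here as the nuclear norm), and the balanced factorization obtained from the SVD attains the minimum:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 4))
U = rng.normal(size=(4, 6))
W = V @ U

trace_norm = np.linalg.norm(W, ord='nuc')  # sum of singular values of W
# any factorization W = VU upper-bounds the trace-norm
upper = 0.5 * (np.linalg.norm(U) ** 2 + np.linalg.norm(V) ** 2)

# the balanced factorization from the SVD (split sqrt of singular values
# between the two factors) attains the minimum exactly
P, s, Qt = np.linalg.svd(W, full_matrices=False)
Vb = P * np.sqrt(s)
Ub = np.sqrt(s)[:, None] * Qt
balanced = 0.5 * (np.linalg.norm(Ub) ** 2 + np.linalg.norm(Vb) ** 2)
```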
Unlike the rank, the trace-norm (as well as other factorization norms) is convex, and leads to tractable learning problems (Fazel et al., 2001; Srebro et al., 2004). In fact, even if learning is done by a local search over the factor matrices $U$ and $V$ (i.e. by a local search over the weights of the network), if the dimensionality is high enough and the norm is regularized, we can ensure convergence to a global minimum (Burer & Choi, 2006). This is in stark contrast to the dimensionality-constrained low-rank situation, where the limiting factor is the number of hidden units, and local minima are abundant (Srebro & Jaakkola, 2003).
Furthermore, the trace-norm and other factorization norms are well-justified as sensible inductive biases. We can ensure generalization based on having low trace-norm, and a low trace-norm model corresponds to a realistic factor model with many factors of limited overall influence. In fact, empirical evidence suggests that in many cases low-norm factorizations are a more appropriate inductive bias than low-rank models.
We see, then, that in the case of linear activations (i.e. matrix factorization), the norm of the factorization is in a sense a better inductive bias than the number of weights: it ensures generalization, it is grounded in reality, and it explains why the models can be learned tractably.
Let us interpret the experimental results of Section 2 in this light. Perhaps learning is succeeding not because there is a good representation of the targets with a small number of units, but rather because there is a good representation with small overall norm, and the optimization is implicitly biasing us toward low-norm models. Such an inductive bias might potentially explain both the generalization ability and the computational tractability of learning, even using local search.
Under this interpretation, we really should be using infinite-sized networks, with an infinite number of hidden units. Fitting a finite network (with implicit regularization) can be viewed as an approximation to fitting the “true” infinite network. This situation is also common in matrix factorization: e.g., a very successful approach for training low trace-norm models, and other infinite-dimensional bounded-norm factorization models, is to approximate them using a finite-dimensional representation (Rennie & Srebro, 2005; Srebro & Salakhutdinov, 2010). The finite dimensionality is then not used for capacity (statistical complexity) control at all, but purely for computational reasons. Indeed, increasing the allowed dimensionality generally improves generalization performance, as it allows us to better approximate the true infinite model.
4 Infinite Size, Bounded Norm Networks
In this final section, we consider a possible model for infinite-sized norm-regularized networks. Our starting point is global weight decay, i.e. adding a regularization term that penalizes the sum of squares of all weights in the network, as might be approximately introduced by some implicit regularization. Our result in this section is that this global $\ell_2$ regularization is equivalent to a Convex Neural Network (Convex NN; Bengio et al., 2005): an infinite network with $\ell_1$ regularization on the top layer. Note that such models are rather different from infinite networks with $\ell_2$ regularization on the top layer, which reduce to linear methods with some specific kernel (Cho & Saul, 2009; Bach, 2014). Note also that our aim here is to explain what neural networks are doing, rather than to match the performance of deep models with a known shallow model as done by, e.g., Lu et al. (2014).
For simplicity, we will focus on single-output networks ($k = 1$), i.e. networks which compute a function $f : \mathbb{R}^D \to \mathbb{R}$. We first consider finite two-layer networks (with a single hidden layer) and show that $\ell_2$ regularization on the weights in both layers is equivalent to a unit-norm constraint on the incoming weights of each unit in the hidden layer, combined with $\ell_1$ regularization on the top unit:
Theorem 1. Let $\ell$ be a loss function and $(x_1, y_1), \ldots, (x_n, y_n)$ be training examples. Then

$$\min_{u_h,\, v_h} \; \sum_{i=1}^{n} \ell\Big(\sum_{h=1}^{H} v_h [\langle u_h, x_i\rangle]_+ ,\, y_i\Big) + \frac{\lambda}{2} \sum_{h=1}^{H} \big(\|u_h\|_2^2 + |v_h|^2\big) \qquad (4)$$

is the same as

$$\min_{\|u_h\|_2 \le 1,\; v_h} \; \sum_{i=1}^{n} \ell\Big(\sum_{h=1}^{H} v_h [\langle u_h, x_i\rangle]_+ ,\, y_i\Big) + \lambda \sum_{h=1}^{H} |v_h|. \qquad (5)$$

Proof. By the inequality between the arithmetic and geometric means, we have

$$\frac{1}{2}\big(\|u_h\|_2^2 + |v_h|^2\big) \;\ge\; \|u_h\|_2\,|v_h|.$$

The right-hand side can always be attained, without changing the input-output mapping, by the rescaling $u_h \mapsto u_h \sqrt{|v_h| / \|u_h\|_2}$ and $v_h \mapsto v_h \sqrt{\|u_h\|_2 / |v_h|}$. The reason we can rescale the weights without changing the input-output mapping is that the rectified linear unit is piecewise linear, and the piece a hidden unit is on is invariant to positive rescaling of its weights: for any $c > 0$, $v_h [\langle u_h, x\rangle]_+ = (v_h c) [\langle u_h / c, x\rangle]_+$. Finally, since the product $\|u_h\|_2\,|v_h|$ and the input-output mapping are both invariant to such rescaling, we can always choose $\|u_h\|_2 \le 1$, with the penalty becoming $\lambda \sum_h |v_h|$. ∎
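The two ingredients of the argument, positive homogeneity of the rectified linear unit and the AM-GM inequality, are easy to verify for a single unit:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=5)      # incoming weights of one hidden unit
v = 2.7                     # its outgoing weight
x = rng.normal(size=5)
relu = lambda z: np.maximum(z, 0.0)

out = v * relu(u @ x)
# positive homogeneity: moving scale between layers leaves the output unchanged
c = np.linalg.norm(u)
out_rescaled = (v * c) * relu((u / c) @ x)

# AM-GM: the l2 penalty dominates the product of norms ...
penalty = 0.5 * (np.linalg.norm(u) ** 2 + v ** 2)
product = np.linalg.norm(u) * abs(v)

# ... with equality for the balanced rescaling used in the proof
t = np.sqrt(abs(v) / c)
ub, vb = u * t, v / t
balanced_penalty = 0.5 * (np.linalg.norm(ub) ** 2 + vb ** 2)
```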
First we recall the definition of a convex NN. Let $\mathcal{G} \subseteq \{u \in \mathbb{R}^D : \|u\|_2 \le 1\}$ be a fixed “library” of possible weight vectors, and let $\mu_+$ and $\mu_-$ be positive (unnormalized) measures over $\mathcal{G}$ representing the positive and negative parts of the weights of each unit. Then a “convex neural net” is given by predictions of the form

$$f(x) = \int_{\mathcal{G}} [\langle u, x\rangle]_+ \, d\mu_+(u) \;-\; \int_{\mathcal{G}} [\langle u, x\rangle]_+ \, d\mu_-(u) \qquad (6)$$

with regularizer (i.e. complexity measure) $\|\mu_+\| + \|\mu_-\| = \mu_+(\mathcal{G}) + \mu_-(\mathcal{G})$. This is simply an infinite generalization of a network with $\ell_1$ regularization on the second layer: if $\mathcal{G} = \{u_1, \ldots, u_H\}$ is finite, (6) is equivalent to

$$f(x) = \sum_{h=1}^{H} v_h [\langle u_h, x\rangle]_+ \qquad (7)$$

with $v_h = \mu_+(\{u_h\}) - \mu_-(\{u_h\})$ and regularizer $\sum_h |v_h| = \|v\|_1$. Training a convex NN is then given by:

$$\min_{\mu_+,\, \mu_- \ge 0} \; \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big) + \lambda \big(\|\mu_+\| + \|\mu_-\|\big). \qquad (8)$$

Moreover, even if $\mathcal{G}$ is infinite, and even continuous, there will always be an optimum of (8) which is a discrete measure with support of size at most $n + 1$ (Rosset et al., 2007). That is, (8) can be equivalently written as:

$$\min_{v_h,\; u_h \in \mathcal{G}} \; \sum_{i=1}^{n} \ell\Big(\sum_{h=1}^{n+1} v_h [\langle u_h, x_i\rangle]_+ ,\, y_i\Big) + \lambda \sum_{h=1}^{n+1} |v_h|, \qquad (9)$$

which is the same as (5), with $H = n + 1$.
The difference between the network (5) and the infinite network (8) is in learning versus selecting the hidden units, and in that in (8) we have no limit on the number of units used. That is, in (8) we have all possible units in $\mathcal{G}$ available to us, and we merely need to select which we want to use, without any constraint on the number of units used, only on the overall $\ell_1$ norm. But the equivalence of (8) and (9) establishes that as long as the number of allowed units is large enough, the two are equivalent.
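Selecting units from a fixed library with an $\ell_1$ penalty on the top layer is a convex problem; here is a minimal sketch using proximal gradient descent (the squared loss and the random unit-norm library are our illustrative choices, not the paper's):

```python
import numpy as np

def fit_top_layer_l1(X, y, G, lam=0.1, steps=1000, lr=0.01):
    """Convex NN with a finite library: fix the hidden units G (rows are
    unit-norm input weights) and solve the l1-regularized top-layer problem
    by proximal gradient descent (ISTA)."""
    Phi = np.maximum(X @ G.T, 0.0)          # one RELU feature per library unit
    v = np.zeros(G.shape[0])
    n = len(y)
    for _ in range(steps):
        g = Phi.T @ (Phi @ v - y) / n       # gradient of the squared loss
        v = v - lr * g
        v = np.sign(v) * np.maximum(np.abs(v) - lr * lam, 0.0)  # soft-threshold
    return v

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(X @ np.ones(5))
G = rng.normal(size=(50, 5))
G /= np.linalg.norm(G, axis=1, keepdims=True)  # unit-norm library, as in Theorem 1
v = fit_top_layer_l1(X, y, G)

def objective(v):
    Phi = np.maximum(X @ G.T, 0.0)
    return 0.5 * np.mean((Phi @ v - y) ** 2) + 0.1 * np.abs(v).sum()
```

The soft-thresholding step is what drives many top-layer weights exactly to zero, i.e. what performs the “selection”.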
In summary, learning and selecting are equivalent if we have sufficiently many hidden units, and Theorem 1 gives an alternative justification for employing $\ell_1$ regularization when the input-to-hidden weights are fixed and normalized to have unit norm: namely, it is equivalent to $\ell_2$ regularization on all weights, which can be achieved by weight decay, or perhaps by implicit regularization via stochastic gradient descent.
The above equivalence also holds for networks with multiple output units, i.e. $k > 1$, where the $\ell_1$ regularization on $v$ is replaced with the group-lasso regularizer $\sum_h \|v_h\|_2$, with $v_h \in \mathbb{R}^k$ the vector of outgoing weights of hidden unit $h$. Indeed, for matrix factorizations (i.e. with linear activations), such a group-lasso regularized formulation is known to be equivalent to the trace norm (3) (see Argyriou et al., 2007).
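One direction of this equivalence is again easy to check numerically: for any factorization with unit-norm incoming weights, the group-lasso penalty upper-bounds the trace-norm (equality requires minimizing over factorizations):

```python
import numpy as np

rng = np.random.default_rng(0)
k, H, D = 4, 6, 5
U = rng.normal(size=(H, D))
U /= np.linalg.norm(U, axis=1, keepdims=True)   # unit-norm incoming weights u_h
V = rng.normal(size=(k, H))                     # column h: outgoing weights v_h
W = V @ U                                       # W = sum_h v_h u_h^T

group_lasso = np.linalg.norm(V, axis=0).sum()   # sum_h ||v_h||_2
trace_norm = np.linalg.norm(W, ord='nuc')
```

The inequality follows from the triangle inequality for the nuclear norm applied to the rank-one terms $v_h u_h^{\top}$.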
- Anthony & Bartlett (1999) Anthony, Martin and Bartlett, Peter L. Neural network learning: Theoretical foundations. Cambridge University Press, 1999.
- Argyriou et al. (2007) Argyriou, Andreas, Evgeniou, Theodoros, and Pontil, Massimiliano. Multi-task feature learning. Advances in neural information processing systems, pp. 41–48, 2007.
- Bach (2014) Bach, Francis. Breaking the curse of dimensionality with convex neural networks. http://www.di.ens.fr/~fbach/fbach_cifar_2014.pdf, 2014.
- Bengio et al. (2005) Bengio, Yoshua, Roux, Nicolas L., Vincent, Pascal, Delalleau, Olivier, and Marcotte, Patrice. Convex neural networks. Advances in neural information processing systems, pp. 123–130, 2005.
- Burer & Choi (2006) Burer, Samuel and Choi, Changhui. Computational enhancements in low-rank semidefinite programming. Optimization Methods and Software, 21(3):493–512, 2006.
- Cho & Saul (2009) Cho, Youngmin and Saul, Lawrence K. Kernel methods for deep learning. Advances in neural information processing systems, pp. 342–350, 2009.
- Daniely et al. (2014) Daniely, Amit, Linial, Nati, and Shalev-Shwartz, Shai. From average case complexity to improper learning complexity. STOC, 2014.
- Fazel et al. (2001) Fazel, Maryam, Hindi, Haitham, and Boyd, Stephen P. A rank minimization heuristic with application to minimum order system approximation. Proceedings of American Control Conference, pp. 4734–4739, 2001.
- Kearns & Valiant (1994) Kearns, Michael and Valiant, Leslie. Cryptographic limitations on learning boolean formulae and finite automata. Journal of the ACM (JACM), 41(1):67–95, 1994.
- Livni et al. (2014) Livni, Roi, Shalev-Shwartz, Shai, and Shamir, Ohad. On the computational efficiency of training neural networks. Advances in Neural Information Processing Systems, pp. 855–863, 2014.
- Lu et al. (2014) Lu, Zhiyun, May, Avner, Liu, Kuan, Garakani, Alireza Bagheri, Guo, Dong, Bellet, Aurélien, Fan, Linxi, Collins, Michael, Kingsbury, Brian, Picheny, Michael, and Sha, Fei. How to scale up kernel methods to be as good as deep neural nets. Technical Report, arXiv:1411.4000, 2014.
- Rennie & Srebro (2005) Rennie, Jasson D. M. and Srebro, Nathan. Fast maximum margin matrix factorization for collaborative prediction. Proceedings of the 22nd International Conference on Machine Learning, pp. 713–719. ACM, 2005.
- Rosset et al. (2007) Rosset, Saharon, Swirszcz, Grzegorz, and Srebro, Nathan. $\ell_1$ regularization in infinite dimensional feature spaces. In COLT, pp. 544–558. Springer, 2007.
- Sherstov (2006) Klivans, Adam R. and Sherstov, Alexander A. Cryptographic hardness for learning intersections of halfspaces. In Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on, pp. 553–562. IEEE, 2006.
- Sipser (2006) Sipser, Michael. Introduction to the Theory of Computation. Thomson Course Technology, 2006.
- Srebro & Jaakkola (2003) Srebro, Nathan and Jaakkola, Tommi S. Weighted low-rank approximations. ICML, pp. 720–727, 2003.
- Srebro & Salakhutdinov (2010) Srebro, Nathan and Salakhutdinov, Ruslan. Collaborative filtering in a non-uniform world: Learning with the weighted trace norm. In Advances in Neural Information Processing Systems, pp. 2056–2064, 2010.
- Srebro et al. (2004) Srebro, Nathan, Rennie, Jason, and Jaakkola, Tommi S. Maximum-margin matrix factorization. Advances in neural information processing systems, pp. 1329–1336, 2004.
For the convenience of the reader, we formalize here the hardness of learning feed-forward neural networks mentioned in the Introduction. The results are presented in a way that is appropriate for feed-forward networks with RELU activations, but they are really a direct implication of recent results about learning intersections of halfspaces. For historical completeness we note that hardness of learning logarithmic-depth networks was already established by Kearns & Valiant (1994), and that the more recent results we discuss here (Sherstov, 2006; Daniely et al., 2014) also establish hardness of learning depth-two networks, subject to perhaps simpler cryptographic assumptions. The presentation and construction here are similar to those of Livni et al. (2014).
Question. Is there a sample complexity function $m(D, H)$ and an algorithm $A$ that takes as input a labeled sample $S = ((x_1, y_1), \ldots, (x_n, y_n))$ and returns a description of a function $\hat{f} : \mathbb{R}^D \to \{\pm 1\}$ such that the following is true:

For any $D$, any $H$ and any distribution $\mathcal{D}$ over $\mathbb{R}^D \times \{\pm 1\}$, if:

- There exists a feed-forward neural network with RELU activations, with $D$ inputs and $H$ hidden units, implementing a function $g$ such that $\Pr_{(x,y)\sim\mathcal{D}}[g(x) = y] = 1$ (i.e. for $\mathcal{D}$, the label can be perfectly predicted by a network of size $H$).
- The input $S$ to $A$ is drawn i.i.d. from $\mathcal{D}$.
- $n \ge m(D, H)$ (i.e. $A$ is provided with enough training data).

then:

- With probability at least $1/2$, algorithm $A$ returns a function $\hat{f}$ such that $\Pr_{(x,y)\sim\mathcal{D}}[\hat{f}(x) \neq y] \le \frac{1}{2} - \epsilon$ for some non-negligible $\epsilon$ (that is, at least half the time the algorithm succeeds in learning a function with non-trivial error).
- $m(D, H) \le \mathrm{poly}(D, H)$ (i.e. the sample complexity required by the algorithm is polynomial in the network size; if we needed a super-polynomial number of samples, we would have no hope of learning in polynomial time).
- The function $\hat{f}$ that corresponds to the description returned by $A$ can be computed in time $\mathrm{poly}(D, H)$ from its description (i.e. the representation used by the learner can be a feed-forward network of any size polynomial in $D$ and $H$, or any other representation that can be efficiently computed).
- $A$ runs in time $\mathrm{poly}(D, H, n)$.
Theorem. Subject to the cryptographic assumptions in Daniely et al. (2014), there is no algorithm that satisfies the conditions in the above question.
In fact, there is no algorithm satisfying the conditions even if we require that the labels can be perfectly predicted by a network with a single hidden layer with any super-constant, e.g. logarithmic, number of hidden units.
Proof. We show that every intersection of $t$ homogeneous halfspaces over $\mathbb{R}^D$ with integer normals can be realized, with unit margin, by a feed-forward neural network with $2t$ hidden units in a single hidden layer. For each hyperplane $\langle w_i, x\rangle \ge 0$, where $w_i \in \mathbb{Z}^D$, we include two units in the hidden layer: $[\langle w_i, x\rangle + 1]_+$ and $[\langle w_i, x\rangle]_+$. We set the incoming weights of the output node to be $+2$ for the first unit of each pair and $-2$ for the second, with bias $-(2t - 1)$. Therefore, this network realizes the following function:

$$f(x) = 2\sum_{i=1}^{t} \Big([\langle w_i, x\rangle + 1]_+ - [\langle w_i, x\rangle]_+\Big) - (2t - 1).$$

Since all inputs and all weights are integer, the pre-activations $\langle w_i, x\rangle$ are integer, each difference $[\langle w_i, x\rangle + 1]_+ - [\langle w_i, x\rangle]_+$ is zero or one (one exactly when $\langle w_i, x\rangle \ge 0$), and so $f$ realizes the intersection of the halfspaces with unit margin: $f(x) \ge 1$ if $x$ satisfies all the halfspaces, and $f(x) \le -1$ otherwise. Hence the hypothesis class of intersections of halfspaces is a subset of the hypothesis class of feed-forward neural networks with $2t$ hidden units in a single hidden layer. We complete the proof by applying Theorem 5.4 in Daniely et al. (2014), which states that for any super-constant $t$, subject to the cryptographic assumptions in Daniely et al. (2014), the hypothesis class of intersections of $t$ homogeneous halfspaces with integer normals is not efficiently PAC learnable (even improperly).³ ∎

³Their Theorem 5.4 talks about unrestricted halfspaces, but their construction in Section 7.2 uses only integer data and homogeneous halfspaces with integer normals.
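A construction along these lines can be verified directly (the two units $[\langle w_i, x\rangle + 1]_+$ and $[\langle w_i, x\rangle]_+$ per halfspace are one standard way to realize the zero/one indicator on integer inputs; `intersection_net` is our illustrative name):

```python
import numpy as np

def intersection_net(Wmat):
    """RELU network with 2t hidden units realizing the intersection of the
    homogeneous halfspaces <w_i, x> >= 0 (rows w_i of Wmat, integer entries),
    with unit margin on integer inputs."""
    t = len(Wmat)
    def f(x):
        z = Wmat @ x                                   # integer pre-activations
        ind = np.maximum(z + 1, 0) - np.maximum(z, 0)  # 1 iff z >= 0, 0 iff z <= -1
        return 2 * ind.sum() - (2 * t - 1)
    return f

W = np.array([[1, 1], [1, -1]])
f = intersection_net(W)
inside = f(np.array([2, 1]))    # satisfies both halfspaces
outside = f(np.array([-2, 1]))  # violates the first halfspace
```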
We proved here that even for a super-constant number of hidden units no algorithm can satisfy the conditions in the question. A similar result can be shown for a polynomial number of hidden units subject to the weaker cryptographic assumptions in Sherstov (2006).
The Theorem tells us not only that we cannot expect to fit a small network to data even if the data is generated by that network (since doing so would give us an efficient learning algorithm, contradicting the Theorem), but also that we cannot expect to learn by using a much larger network. That is, even if we know that the labels can be perfectly predicted by a small network, we cannot expect to have a learning algorithm that learns a much larger (but poly-sized) network achieving non-trivial error. In fact, being representable by a small network is not enough to ensure tractable learning no matter what representation the learning algorithm uses (e.g. a much larger network, a mixture of networks, a tree over networks, etc.). This is a much stronger statement than just saying that fitting a network to data is NP-hard. Also, precluding the possibility of tractable learning when the labels are exactly explained by some small unknown network of course also precludes the possibility of achieving low error when the labels are only approximately explained by some small unknown network (i.e. of noisy or “agnostic” learning).