Is deeper better? It depends on locality of relevant features

05/26/2020 · by Takashi Mori, et al.

It has been recognized that a heavily overparameterized artificial neural network exhibits surprisingly good generalization performance in various machine-learning tasks. Recent theoretical studies have attempted to unveil the mystery of overparameterization. In most of those previous works, overparameterization is achieved by increasing the width of the network, while the effect of increasing the depth has been less well understood. In this work, we investigate the effect of increasing the depth within an overparameterized regime. To gain insight into the advantage of depth, we introduce local and global labels as abstract but simple classification rules. It turns out that the locality of the relevant feature for a given classification rule plays an important role: our experimental results suggest that deeper is better for local labels, whereas shallower is better for global labels. We also compare the results of finite networks with those of the neural tangent kernel (NTK), which is equivalent to an infinitely wide network with a proper initialization and an infinitesimal learning rate. It is shown that the NTK does not correctly capture the depth dependence of the generalization performance, which indicates the importance of feature learning rather than lazy learning.


1 Introduction

Deep learning has achieved unparalleled success in various artificial-intelligence tasks such as image classification [1, 2] and speech recognition [3]. Remarkably, in modern machine-learning applications, impressive generalization performance has been observed in the overparameterized regime, in which the number of parameters in the network is much larger than the number of training data samples. Contrary to what classical learning theory teaches us, an overparameterized network can fit random labels and yet generalizes very well without serious overfitting [4]. We still do not have a general theory that explains why deep learning works so well.

Recently, the learning dynamics and the generalization power of heavily overparameterized wide neural networks have been studied extensively. It has been reported that training an overparameterized network easily achieves zero training error without getting stuck in local minima of the loss landscape [4, 5], and mathematically rigorous results have also been obtained [6, 7]. From a different point of view, the theory of the neural tangent kernel (NTK) has been developed as a new tool to investigate overparameterized networks of infinite width [8, 9]; it explains in a simple way why a sufficiently wide neural network can achieve a global minimum of the training loss.

As for generalization, the “double-descent” phenomenon has attracted much attention [10, 11]. The standard bias-variance tradeoff picture predicts a U-shaped test-error curve [12], but instead we find a double-descent curve, which tells us that increasing the model capacity beyond the interpolation threshold results in improved performance. This finding has triggered detailed studies of the behavior of the bias and the variance in the overparameterized regime [13, 14]. The double-descent phenomenon is not explained by traditional complexity measures such as the Vapnik-Chervonenkis dimension and the Rademacher complexity [15], and hence new complexity measures for deep neural networks that can yield better generalization bounds are being sought [16, 17, 18, 19, 20, 21].

These theoretical efforts mainly focus on the effect of increasing the network width, while the benefits of the network depth remain unclear. It is known that the expressivity of a deep neural network grows exponentially with the depth rather than the width [22]; see also [23, 24]. However, it is far from clear whether exponential expressivity really leads to better generalization [25, 26]. It is also nontrivial whether typical problems encountered in practice require such high expressivity. Although some works [27, 28] have shown that there exist simple and natural functions that are efficiently approximated by a network with two hidden layers but not by a network with one hidden layer, a recent work [29] has demonstrated that, when trained with a gradient-based optimization algorithm, a deep network can only learn functions that are well approximated by a shallow network, which indicates that the benefits of depth are not due to the high expressivity of deep networks. Some other recent works have reported no clear advantage of depth in the overparameterized regime [30, 31].

To gain insight into the advantage of depth, in the present paper we report an experimental study of the depth and width dependence of generalization in abstract but simple, well-controlled classification tasks with fully connected neural networks. We find that whether a deep network outperforms a shallow one depends on the properties of the relevant features for a given classification rule.

In this work, we introduce local labels and global labels, both of which give simple mappings between inputs and output class labels. By “local”, we mean that the label is determined by only a few components of the input vector. A global label, on the other hand, is given by a sum of local terms and is determined by all components of the input. Our experiments show strong depth dependences of the generalization error for these simple input-output mappings. In particular, we find that deeper is better for local labels, while shallower is better for global labels. The implication of this result is that depth is not always advantageous; rather, the locality of the relevant features gives us a clue for understanding when depth helps.

We also compare the generalization performance of a trained network of finite width with that of the kernel method with the NTK. The latter corresponds to the infinite-width limit of a fully connected network with a proper initialization and an infinitesimal learning rate [8], which is referred to as the NTK limit. We find that, even as the width increases, in many cases the generalization error at an optimal learning rate does not converge to the NTK limit. In such cases, a finite-width network generalizes much better than kernel learning with the NTK. In the NTK limit, the network parameters stay close to their initial values during training, which is called lazy learning [32]; hence the result mentioned above indicates the importance of feature learning, in which the network parameters change so as to learn the relevant features.

2 Setting

We consider a classification task with a training dataset $\mathcal{D} = \{(x_\mu, y_\mu)\}_{\mu=1}^{N}$, where $x_\mu \in \mathbb{R}^d$ is an input and $y_\mu$ is its label. In this work, we consider binary classification, $y_\mu \in \{0, 1\}$, unless otherwise stated.

2.1 Dataset

Each input $x = (x_1, x_2, \dots, x_d)^\top$ is a $d$-dimensional vector whose components are i.i.d. Gaussian random variables of zero mean and unit variance, where $x^\top$ denotes the transpose of a vector $x$. For each input $x$, we assign a label $y$ according to one of the following rules.

$k$-local label

We randomly fix $k$ integers $i_1 < i_2 < \dots < i_k$ with $1 \leq i_j \leq d$. In the “$k$-local” label, the relevant feature is the product of the $k$ components $x_{i_1}, x_{i_2}, \dots, x_{i_k}$ of an input $x$, that is, the label is determined by

$y = \begin{cases} 1 & \text{if } x_{i_1} x_{i_2} \cdots x_{i_k} > 0, \\ 0 & \text{otherwise.} \end{cases}$   (1)

This label is said to be local in the sense that $y$ is completely determined by just $k$ components of the input $x$.¹The locality here does not necessarily imply that the relevant components are spatially close to each other; such a use of the terminology “$k$-local” is found in the field of quantum computation [33]. For the fully connected networks considered in this paper, we can choose $i_1 = 1$, $i_2 = 2$, …, $i_k = k$ without loss of generality because of the permutation symmetry with respect to the indices of the input vector.

$k$-global label

We again fix an integer $k$. Let us define

$F_k(x) = \sum_{i=1}^{d} x_i x_{i+1} \cdots x_{i+k-1},$   (2)

where the periodic convention $x_{d+j} = x_j$ is used. The $k$-global label for $x$ is defined by

$y = \begin{cases} 1 & \text{if } F_k(x) > 0, \\ 0 & \text{otherwise.} \end{cases}$   (3)

The relevant feature for this label is given by a uniform sum of products of $k$ components of the input vector. Every component of $x$ contributes to this “$k$-global” label, in contrast to the $k$-local label with $k \ll d$.
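To make the two label families concrete, the following minimal NumPy sketch generates Gaussian inputs and assigns $k$-local and $k$-global labels according to Eqs. (1)-(3); the function names and the index choice $i_j = j$ (allowed by the permutation-symmetry remark above) are ours, not the authors' code.

import numpy as np

def k_local_labels(X, k):
    # k-local label: sign of the product of the first k components
    # (indices i_1 = 1, ..., i_k = k, using the permutation symmetry).
    feature = np.prod(X[:, :k], axis=1)
    return (feature > 0).astype(int)

def k_global_labels(X, k):
    # k-global label: sign of the sum over all d cyclic products of
    # k consecutive components (periodic convention x_{d+j} = x_j).
    d = X.shape[1]
    feature = np.zeros(X.shape[0])
    for i in range(d):
        idx = [(i + j) % d for j in range(k)]
        feature += np.prod(X[:, idx], axis=1)
    return (feature > 0).astype(int)

# Example: N Gaussian inputs of dimension d with 2-local and 2-global labels.
rng = np.random.default_rng(0)
N, d, k = 1000, 30, 2
X = rng.standard_normal((N, d))
y_local, y_global = k_local_labels(X, k), k_global_labels(X, k)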

2.2 Network architecture

In the present work, we consider fully connected feedforward neural networks with $L$ hidden layers of width $m$. We call $L$ and $m$ the depth and the width of the network, respectively. The output $f(x) \in \mathbb{R}^2$ of the network for an input vector $x$ is determined as follows:

$z^{(0)} = x, \qquad z^{(l)} = \sigma\!\left(W^{(l)} z^{(l-1)} + b^{(l)}\right) \quad (l = 1, \dots, L), \qquad f(x) = W^{(L+1)} z^{(L)} + b^{(L+1)},$   (4)

where $\sigma(\cdot)$ is the component-wise ReLU activation function, $z^{(l)}$ is the output of the $l$-th hidden layer, and

$W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}, \qquad b^{(l)} \in \mathbb{R}^{n_l} \qquad (l = 1, \dots, L+1)$   (5)

are the weights and the biases, respectively, with $n_0 = d$, $n_1 = \dots = n_L = m$, and $n_{L+1} = 2$. Let us denote by $\theta$ the set of all the weights and biases in the network. We focus on an overparameterized regime, i.e., the number of network parameters (the number of components of $\theta$) exceeds $N$, the number of training data points.
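For concreteness, the architecture of Eqs. (4)-(5) can be written as a small PyTorch module as follows. This is an illustrative sketch (the class and argument names are ours), with $L$ hidden ReLU layers of width $m$ and a linear output layer with two units; setting L = 0 recovers the linear perceptron used later.

import torch.nn as nn

class FullyConnected(nn.Module):
    # Fully connected ReLU network with L hidden layers of width m.
    def __init__(self, d, L, m, n_classes=2):
        super().__init__()
        widths = [d] + [m] * L
        layers = []
        for n_in, n_out in zip(widths[:-1], widths[1:]):
            layers += [nn.Linear(n_in, n_out), nn.ReLU()]
        layers.append(nn.Linear(widths[-1], n_classes))  # linear output layer
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # logits f(x), shape (batch, n_classes)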

2.3 Supervised learning

The network parameters $\theta$ are adjusted to correctly classify the training data. This is done by minimizing the softmax cross-entropy loss $\mathcal{L}(\theta)$ given by

$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{\mu=1}^{N} \log \frac{e^{f_{y_\mu}(x_\mu)}}{\sum_{c} e^{f_c(x_\mu)}},$   (6)

where the $c$-th component of the network output $f(x)$ is denoted by $f_c(x)$. The main results of our paper do not change for other standard loss functions such as the mean-squared error.

The network is trained by stochastic gradient descent (SGD) with learning rate $\eta$ and mini-batch size $B$. That is, for each mini-batch $\mathcal{B}_t \subset \mathcal{D}$ with $|\mathcal{B}_t| = B$, the network parameters $\theta_t$ at time step $t$ are updated as

$\theta_{t+1} = \theta_t - \eta \nabla_{\theta} \mathcal{L}_{\mathcal{B}_t}(\theta_t),$   (7)

where $\mathcal{L}_{\mathcal{B}_t}$ denotes the loss of Eq. (6) evaluated on the mini-batch $\mathcal{B}_t$. Throughout the paper, the mini-batch size $B$ is fixed, whereas the learning rate $\eta$ is optimized before training (the details are explained later). Biases are initialized to zero, and weights are initialized using the Glorot initialization [34].²We also tried the He initialization [35] and confirmed that the results are similar to those obtained by the Glorot initialization, in particular when the input vectors are normalized.

The trained network classifies an input $x$ into the class given by $\hat{y}(x) = \arg\max_{c} f_c(x)$. Let us then define the training error as

$\epsilon_{\mathrm{train}} = \frac{1}{N} \sum_{\mu=1}^{N} \mathbb{1}\!\left[\hat{y}(x_\mu) \neq y_\mu\right],$   (8)

that is, the misclassification rate for the training data $\mathcal{D}$, where $\mathbb{1}[\cdot]$ is the indicator function. We train our network until $\epsilon_{\mathrm{train}} = 0$ is achieved, i.e., until all the training samples are correctly classified, which is possible in an overparameterized regime.
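A schematic training loop corresponding to Eqs. (6)-(8) is sketched below. It assumes the FullyConnected module above, uses PyTorch's built-in softmax cross-entropy and Glorot initialization, checks the training error every 50 epochs and caps training at 2500 epochs as in the protocol of Sec. 3; the default batch size is our assumption, since the paper's fixed value of $B$ is not reproduced here.

import torch
from torch.utils.data import DataLoader, TensorDataset

def glorot_init(model):
    # Glorot (Xavier) initialization for weights, zero biases [34].
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            torch.nn.init.xavier_uniform_(module.weight)
            torch.nn.init.zeros_(module.bias)

def train_to_zero_error(model, X, y, lr, batch_size=50, max_epochs=2500):
    # SGD on the softmax cross-entropy loss until all training samples
    # are classified correctly (epsilon_train = 0). batch_size is assumed.
    glorot_init(model)
    loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(max_epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
        if epoch % 50 == 0:  # check the training error every 50 epochs
            with torch.no_grad():
                train_err = (model(X).argmax(dim=1) != y).float().mean().item()
            if train_err == 0.0:
                return epoch
    return None  # zero training error not reached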

For a training dataset $\mathcal{D}$, we first perform 10-fold cross validation to optimize the learning rate $\eta$ using the Bayesian optimization method [36], and then perform the training via SGD using the full training dataset. In the optimization of $\eta$, we minimize the misclassification ratio on the validation data.
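The learning-rate selection protocol can be emulated as follows. This sketch substitutes a plain random search over log-spaced learning rates for the Bayesian optimization package of [37], and assumes the train_to_zero_error helper above together with data tensors X_train, y_train and a make_model factory defined elsewhere.

import numpy as np
import torch
from sklearn.model_selection import KFold

def cv_error(lr, X, y, make_model, n_splits=10):
    # 10-fold cross-validation estimate of the misclassification ratio
    # on held-out folds for a given learning rate.
    errors = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True).split(np.arange(len(X))):
        model = make_model()
        train_to_zero_error(model, X[train_idx], y[train_idx], lr)
        with torch.no_grad():
            pred = model(X[val_idx]).argmax(dim=1)
        errors.append((pred != y[val_idx]).float().mean().item())
    return float(np.mean(errors))

# Stand-in for Bayesian optimization: random search over log-spaced learning rates.
candidates = 10.0 ** np.random.uniform(-4, 0, size=20)
best_lr = min(candidates, key=lambda lr: cv_error(lr, X_train, y_train, make_model))
# The final model is then trained on the full training set with best_lr.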

The generalization performance of a trained network is measured by computing the test error. We prepare test data $\mathcal{D}_{\mathrm{test}} = \{(x_\nu^{\mathrm{test}}, y_\nu^{\mathrm{test}})\}_{\nu=1}^{N_{\mathrm{test}}}$ independently of the training data $\mathcal{D}$. The test error is defined as the misclassification ratio on $\mathcal{D}_{\mathrm{test}}$, i.e.,

$\epsilon_{\mathrm{test}} = \frac{1}{N_{\mathrm{test}}} \sum_{\nu=1}^{N_{\mathrm{test}}} \mathbb{1}\!\left[\hat{y}(x_\nu^{\mathrm{test}}) \neq y_\nu^{\mathrm{test}}\right],$   (9)

where $\hat{y}(x)$ is the prediction of our trained network. In the experiments discussed in Sec. 3, the number of test samples $N_{\mathrm{test}}$ is fixed.

2.4 Neural Tangent Kernel

Suppose a network of depth $L$ and width $m$ with output $f(x; \theta)$. When the network is sufficiently wide and the learning rate is sufficiently small, the network parameters stay close to their randomly initialized values $\theta_0$ during training, and hence $f(x; \theta)$ is approximated by a linear function of $\theta$: $f(x; \theta) \approx f(x; \theta_0) + \nabla_{\theta} f(x; \theta_0) \cdot (\theta - \theta_0)$. As a result, the minimization of the mean-squared error $\frac{1}{N} \sum_{\mu=1}^{N} \| f(x_\mu; \theta) - t_\mu \|^2$, where $t_\mu$ is the one-hot representation of the label $y_\mu$, is equivalent to kernel regression with the NTK $\Theta_{cc'}(x, x')$ defined as

$\Theta_{cc'}(x, x') = \mathbb{E}_{\theta_0}\!\left[ \nabla_{\theta} f_c(x; \theta_0) \cdot \nabla_{\theta} f_{c'}(x'; \theta_0) \right],$   (10)

where $\mathbb{E}_{\theta_0}[\cdot]$ denotes the average over random initializations $\theta_0$ [8].
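For a finite network, the kernel of Eq. (10) can be estimated by averaging gradient inner products over random initializations. The sketch below is ours and computes a single (diagonal) output component of the kernel with automatic differentiation, given a make_model factory that returns a freshly initialized network such as the FullyConnected module above.

import torch

def empirical_ntk(make_model, x1, x2, n_init=10, out_index=0):
    # Monte-Carlo estimate of Theta(x1, x2) = E_theta0[ grad f_c(x1) . grad f_c(x2) ]
    # over n_init random initializations, for output component c = out_index.
    values = []
    for _ in range(n_init):
        model = make_model()            # fresh random initialization theta_0
        params = list(model.parameters())
        g1 = torch.autograd.grad(model(x1)[out_index], params)
        g2 = torch.autograd.grad(model(x2)[out_index], params)
        values.append(sum((a * b).sum() for a, b in zip(g1, g2)).item())
    return sum(values) / n_init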

Let us consider a network whose biases and weights are randomly initialized as $b^{(l)}_i = \beta \tilde{b}^{(l)}_i$ with $\tilde{b}^{(l)}_i \sim \mathcal{N}(0, 1)$ and $W^{(l)}_{ij} = \tilde{W}^{(l)}_{ij} / \sqrt{n_{l-1}}$ with $\tilde{W}^{(l)}_{ij} \sim \mathcal{N}(0, 1)$ for every $l$, respectively, where $n_l$ is the number of neurons in the $l$-th layer, i.e., $n_0 = d$ and $n_1 = \dots = n_L = m$. The parameter $\beta$ controls the impact of the bias terms, and we choose $\beta$ in our numerical experiment following Jacot et al. [8]. By using the ReLU activation function, we can give an explicit expression of the NTK that is suited for numerical calculations. Such formulas are given in the Supplementary Material (Appendix A).

It is shown that the NTK takes the form $\Theta_{cc'}(x, x') = \delta_{cc'} \Theta(x, x')$, and the minimization of the mean-squared error with an infinitesimal weight decay yields the output function

$f_c(x) = \sum_{\mu, \nu = 1}^{N} \Theta(x, x_\mu) \left(\Theta^{-1}\right)_{\mu\nu} t_{\nu, c},$   (11)

where $(\Theta^{-1})_{\mu\nu}$ is the inverse of the Gram matrix $\Theta_{\mu\nu} = \Theta(x_\mu, x_\nu)$ and $t_{\nu, c}$ is the $c$-th component of the one-hot label $t_\nu$. An input $x$ is classified into $\hat{y}(x) = \arg\max_c f_c(x)$.
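Given the scalar kernel $\Theta$, the prediction of Eq. (11) reduces to solving a linear system with the Gram matrix. A small NumPy sketch (ours; a tiny ridge term stands in for the infinitesimal weight decay) is:

import numpy as np

def ntk_predict(K_train, K_test, y_train, n_classes=2, ridge=1e-8):
    # Kernel regression with NTK Gram matrices.
    # K_train: (N, N) kernel among training inputs.
    # K_test:  (M, N) kernel between test and training inputs.
    # y_train: (N,) integer labels; regression targets are their one-hot vectors.
    T = np.eye(n_classes)[y_train]                # one-hot targets, shape (N, n_classes)
    alpha = np.linalg.solve(K_train + ridge * np.eye(len(K_train)), T)
    f_test = K_test @ alpha                       # outputs f_c(x) for each test point
    return f_test.argmax(axis=1)                  # predicted class labels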

3 Experimental results

We now present our experimental results. For each data point, the training dataset is fixed and we optimize the learning rate $\eta$ via 10-fold cross validation with the Bayesian optimization method (we used the package provided in [37]). We then use the optimized $\eta$ to train our network. At every 50 epochs we compute the training error $\epsilon_{\mathrm{train}}$, and we stop the training once $\epsilon_{\mathrm{train}} = 0$. For the fixed dataset and the optimized learning rate $\eta$, the training is performed 10 times, and we calculate the average and the standard deviation of the test error $\epsilon_{\mathrm{test}}$.

3.1 1-local and 1-global labels

Figure 1: Test error against the number of training data samples for several network architectures specified by the depth and width for (a) the 1-local label and (b) the 1-global label. Test errors calculated by the NTK of depth 1 and 7 are also plotted. Error bars are smaller than the symbols.

For the 1-local and 1-global labels, the relevant feature is a linear function of the input vector. Therefore, in principle, even a linear network can correctly classify the data. Figure 1 shows the generalization errors of nonlinear networks of varying depth and width as well as those of the linear perceptron (a network of zero depth); the input dimension $d$ is fixed. We also plot the test errors calculated by the NTK, but we postpone the discussion of the NTK until Sec. 3.3.

Figure 1 shows that, for both the 1-local and 1-global labels, the test error decreases with the network width, and a shallower network shows better generalization than a deeper one. The linear perceptron shows the best generalization performance, which is natural because it is the simplest network capable of learning the relevant feature associated with the 1-local or 1-global label. Remarkably, the test errors of the nonlinear networks are not much larger than those of the linear perceptron, even though the nonlinear networks are far more complex than the linear perceptron.

For a given network architecture, we do not see any important difference between the results for the 1-local and 1-global labels, which can be explained by the fact that these labels are transformed into each other by a Fourier transformation of the input vectors.

3.2 Opposite depth dependences for $k$-local and $k$-global labels with $k \geq 2$

For $k \geq 2$, it turns out that the experimental results show opposite depth dependences for $k$-local and $k$-global labels. Let us first consider $k$-local labels with $k = 2$ and $3$. Figures 2 (a) and (b) show test errors against $N$ for various networks for the 2-local and the 3-local labels, respectively; the input dimension $d$ is fixed for each label. We see that the test error strongly depends on the network depth: a deeper network generalizes better than a shallower one. It should be noted that the better-performing deeper network contains far fewer trainable parameters than the wide shallower one, and yet the former outperforms the latter for the 2-local label as well as for the 3-local label at large $N$, which implies that increasing the number of trainable parameters does not necessarily imply better generalization. For $k$-local labels with $k \geq 2$, the network depth is more strongly correlated with generalization than the network width.

From Fig. 2 (b), it is obvious that some of the networks fail to learn the 3-local label for small $N$. We also see that the error bars of the test error are large for networks of small width. The error bar represents the variance due to initialization and training. By increasing the network width, both the variance and the test error decrease. This result is consistent with the recent observation in the lazy regime that increasing the network width results in better generalization because it reduces the variance due to initialization [14].

Figure 2: Test error against the number of training data samples $N$ for several network architectures specified by the depth and the width, for (a) the 2-local label, (b) the 3-local label, (c) the 2-global label, and (d) the 3-global label. Error bars indicate the standard deviation of the test error over 10 iterations of the network initialization and training. Test errors calculated by the NTK of depth 1 and 7 are also plotted.
Figure 3: Depth dependence of the test error for a fixed number of training samples with the 2-local and 2-global labels. The dimension of the input vectors is fixed (to different values for the 2-local and the 2-global labels). The network width is fixed to 500. Error bars indicate the standard deviation over 10 iterations of the training using the same dataset.

Next, we consider $k$-global labels with $k = 2$ and 3. The input dimension is again fixed for each label. We plot test errors against $N$ in Figs. 2 (c) and (d) for the 2-global and the 3-global labels, respectively. Again we find strong depth dependences, but now shallow networks outperform deep ones, contrary to the results for $k$-local labels. We also find strong width dependences: the test error of a wider network decreases more quickly with $N$. In particular, for the 3-global label, the improvement of generalization with increasing $N$ is subtle for narrow networks, and increasing the width makes the decrease of the test error with $N$ much faster [see Fig. 2 (d)].

To examine the effect of depth in more detail, we also plot the depth dependence of the test error for a fixed number of training samples. We prepare training datasets for the 2-local and the 2-global labels, with the input dimension fixed for each label. Using these training samples, networks of varying depth $L$ and fixed width are trained until $\epsilon_{\mathrm{train}} = 0$. The test errors of the trained networks are shown in Fig. 3. For the 2-local label, the test error decreases with $L$, whereas it increases with $L$ for the 2-global label. Thus, Fig. 3 clearly shows the opposite depth dependences for local and global labels.

3.3 Comparison between finite networks and NTKs

In Figs. 1 and 2, test errors calculated using the NTK are also plotted. For $k = 1$ (Fig. 1), the generalization performance of the NTK is comparable to that of finite networks. For the 2-global label [Fig. 2 (c)], the test error obtained by the NTK is comparable to or lower than that of finite networks.

The crucial difference is seen for the $k$-local labels with $k = 2$ and 3 and for the 3-global label. In Figs. 2 (a) and (b), we see that the NTK almost completely fails to classify the data, although finite networks succeed in doing so. In the case of the 3-global label, the NTK of depth 7 correctly classifies the data, while the NTK of depth 1 fails [see Fig. 2 (d)]. In these cases, the test error of a finite network does not seem to converge to that of the NTK as the network width increases.

The NTK has been proposed as a theoretical tool to investigate the infinite-width limit, but it should be kept in mind that the learning rate must be sufficiently small to reach the NTK limit [8, 9]. The discrepancy between a wide network and the NTK in Fig. 2 stems from the strong learning-rate dependence of the generalization performance. In our experiment, the learning rate has been optimized by 10-fold cross validation. If the optimized learning rate is not small enough for a given width, the trained network may not be described by the NTK even in the infinite-width limit.

In Fig. 4 (a), we plot the learning-rate dependence of the test error for the 2-local and the 2-global labels in the network of depth 1 and width 2000. We observe a sharp learning-rate dependence for the 2-local label, in contrast to the 2-global label. In Fig. 4 (b), we compare the learning-rate dependences of the test error for depths $L = 1$ and $L = 7$ in the case of the 3-global label (the width is 2000 in both cases). We see that the learning-rate dependence for $L = 1$ is much stronger than that for $L = 7$, which is consistent with the fact that the NTK fails only for $L = 1$. It should be noted that Fig. 4 (b) shows that the deep network ($L = 7$) outperforms the shallow one ($L = 1$) in the regime of small learning rates, while the shallow one performs better than the deep one at their optimal learning rates.

Figure 4 also shows that the test error for a sufficiently small learning rate approaches the one obtained by the corresponding NTK. Therefore, the regime of small learning rates is identified as a lazy-learning regime, while larger learning rates correspond to a feature-learning regime. The sharp learning-rate dependences found here underscore the theoretical and practical importance of feature learning.

Figure 4: Learning-rate dependence of the test error. (a) Numerical results for the 2-local and 2-global labels in the network with depth 1 and width 2000. (b) Numerical results for the 3-global label in networks with depth 1 and 7 (the width is set to 2000 in both cases). The dotted lines show the test error calculated by the NTK. When the learning rate is sufficiently small, the test error of a finite network approaches that of the corresponding NTK. Each curve is plotted up to the maximum learning rate beyond which zero training error is not achieved within 2500 epochs (in some cases training fails owing to divergence of the network parameters during training). Error bars indicate the standard deviation over 10 iterations of the training.

4 Conclusion

In this work, we have studied the effect of increasing the depth in classification tasks. Instead of using real data, we have employed an abstract setting with random inputs and simple classification rules, because such a simple setup helps us understand under what circumstances deeper networks perform better or worse. We find that the locality of the relevant feature for a given classification rule plays a key role.

We note that the advantage of depth for local labels is not due to the high expressivity of deep networks. If a network can accurately classify data with the $k$-local label for a small input dimension, it can in principle classify data with an arbitrarily large input dimension $d$, because the $k$-local label depends on only $k$ of the $d$ components. Using this fact, it is confirmed that a small network with one hidden layer of width of about 10-100 can express the 2-local and the 3-local labels almost perfectly.³This fact does not at all mean that such a small network can actually learn the local label for large $d$ with a gradient-based algorithm. In other words, learning the $k$-local label for small $k$ does not require high expressive power. Nevertheless, a deeper network outperforms a shallower one.

It is also an interesting observation that shallower networks do better than deeper ones for the $k$-global label. This result shows that depth is not always beneficial. In future studies, we hope to investigate which properties of the data, other than the locality studied here, result in an advantage or disadvantage of depth.

Broader Impact

It is an important practical problem to find neural-architecture design principles suited to specific machine-learning tasks. Although our paper is motivated by a theoretical question, namely why deep networks perform so well compared with shallow networks, this work also bears on the practical problem mentioned above. Our work indicates that a deeper architecture is better for local features, whereas a shallower architecture is better for global features. Local features are expected to be important in typical image-classification tasks, and our work suggests that a deep architecture should be used for such tasks, which is consistent with practical experience. Furthermore, if we could find a transformation of local features into global ones, then even shallow networks should be able to classify the data with high accuracy after the transformation, which would be of practical merit. It would also be important to identify problems whose labels are effectively global (e.g., problems where Fourier analysis works well, finance, weather forecasting). The present research can lead to better solutions in these fields. In short, our theoretical work suggests a guiding principle for future studies on architecture design.

References

Appendix A Explicit expression of the NTK

We consider a network whose biases and weights are randomly initialized as $b^{(l)}_i = \beta \tilde{b}^{(l)}_i$ with $\tilde{b}^{(l)}_i \sim \mathcal{N}(0, 1)$ and $W^{(l)}_{ij} = \tilde{W}^{(l)}_{ij} / \sqrt{n_{l-1}}$ with $\tilde{W}^{(l)}_{ij} \sim \mathcal{N}(0, 1)$ for every $l$, where $n_l$ is the number of neurons in the $l$-th layer, i.e., $n_0 = d$ and $n_1 = \dots = n_L = m$. In the infinite-width limit $m \to \infty$, the pre-activation at every hidden layer tends to an i.i.d. Gaussian process with covariance $\Sigma^{(l)}(x, x')$ that is defined recursively as

$\Sigma^{(1)}(x, x') = \frac{1}{n_0} x^\top x' + \beta^2, \qquad \Sigma^{(l+1)}(x, x') = \mathbb{E}_{(u, v) \sim \mathcal{N}(0, \Lambda^{(l)})}\!\left[\sigma(u)\sigma(v)\right] + \beta^2, \qquad \Lambda^{(l)} = \begin{pmatrix} \Sigma^{(l)}(x, x) & \Sigma^{(l)}(x, x') \\ \Sigma^{(l)}(x', x) & \Sigma^{(l)}(x', x') \end{pmatrix},$   (12)

for $l = 1, \dots, L$. We also define

$\dot{\Sigma}^{(l+1)}(x, x') = \mathbb{E}_{(u, v) \sim \mathcal{N}(0, \Lambda^{(l)})}\!\left[\dot{\sigma}(u)\dot{\sigma}(v)\right],$   (13)

where $\dot{\sigma}$ is the derivative of $\sigma$. The NTK is then expressed as $\Theta_{cc'}(x, x') = \delta_{cc'} \Theta(x, x')$, where

$\Theta(x, x') = \sum_{l=1}^{L+1} \Sigma^{(l)}(x, x') \prod_{l'=l+1}^{L+1} \dot{\Sigma}^{(l')}(x, x').$   (14)

The derivation of this formula is given by Arora et al. [9].

Using the ReLU activation function $\sigma(x) = \max(0, x)$, we can further calculate $\Sigma^{(l+1)}$ and $\dot{\Sigma}^{(l+1)}$, obtaining

$\Sigma^{(l+1)}(x, x') = \frac{\sqrt{\Sigma^{(l)}(x, x)\, \Sigma^{(l)}(x', x')}}{2\pi} \left[ \sin\phi^{(l)} + \left(\pi - \phi^{(l)}\right) \cos\phi^{(l)} \right] + \beta^2$   (15)

and

$\dot{\Sigma}^{(l+1)}(x, x') = \frac{\pi - \phi^{(l)}}{2\pi},$   (16)

where $\phi^{(l)} = \arccos\!\left[\Sigma^{(l)}(x, x') / \sqrt{\Sigma^{(l)}(x, x)\, \Sigma^{(l)}(x', x')}\right]$.

For $x = x'$, we obtain $\Sigma^{(l+1)}(x, x) = \frac{1}{2}\Sigma^{(l)}(x, x) + \beta^2$. By solving Eqs. (15) and (16) iteratively, the NTK in Eq. (14) is obtained.⁴When $\beta = 0$ (no bias), the equations are further simplified: $\dot{\Sigma}^{(l+1)}(x, x') = (\pi - \phi^{(l)})/(2\pi)$ and $\Sigma^{(l+1)}(x, x') = \frac{\|x\|\, \|x'\|}{2^{l} n_0 \pi} \left[\sin\phi^{(l)} + (\pi - \phi^{(l)})\cos\phi^{(l)}\right]$, where $\phi^{(1)}$ is the angle between $x$ and $x'$, and $\phi^{(l)}$ is iteratively determined by $\cos\phi^{(l+1)} = \frac{1}{\pi}\left[\sin\phi^{(l)} + (\pi - \phi^{(l)})\cos\phi^{(l)}\right]$.
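For reference, the recursion of Eqs. (12)-(16) can be implemented directly. The following NumPy sketch (ours, not the authors' code) computes the scalar kernel $\Theta(x, x')$ for a single pair of inputs, with an assumed default value for $\beta$; the Gram matrix needed in Eq. (11) is obtained by evaluating it on all pairs of training inputs.

import numpy as np

def relu_ntk(x1, x2, L, beta=0.1):
    # Scalar NTK Theta(x1, x2) of a fully connected ReLU network with L hidden
    # layers, following Eqs. (12)-(16). beta is the bias-scale parameter
    # (default value assumed here).
    n0 = len(x1)
    # Layer-1 covariances, Eq. (12): Sigma^(1) = x.x'/n_0 + beta^2.
    s11 = x1 @ x1 / n0 + beta**2
    s22 = x2 @ x2 / n0 + beta**2
    s12 = x1 @ x2 / n0 + beta**2
    theta = s12  # Theta^(1) = Sigma^(1)
    for _ in range(L):
        cos_phi = np.clip(s12 / np.sqrt(s11 * s22), -1.0, 1.0)
        phi = np.arccos(cos_phi)
        # Eq. (15): Sigma^(l+1); the diagonal case gives Sigma^(l+1)(x,x) = Sigma^(l)(x,x)/2 + beta^2.
        s12_new = np.sqrt(s11 * s22) * (np.sin(phi) + (np.pi - phi) * cos_phi) / (2 * np.pi) + beta**2
        s11, s22 = s11 / 2 + beta**2, s22 / 2 + beta**2
        # Eq. (16): Sigma_dot^(l+1).
        s_dot = (np.pi - phi) / (2 * np.pi)
        # Recursive form of Eq. (14): Theta^(l+1) = Theta^(l) * Sigma_dot^(l+1) + Sigma^(l+1).
        theta = theta * s_dot + s12_new
        s12 = s12_new
    return theta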