
Interplay between depth of neural networks and locality of target functions

01/28/2022
by   Takashi Mori, et al.

It has been recognized that heavily overparameterized deep neural networks (DNNs) exhibit surprisingly good generalization performance in various machine-learning tasks. Although benefits of depth have been investigated from different perspectives such as the approximation theory and the statistical learning theory, existing theories do not adequately explain the empirical success of overparameterized DNNs. In this work, we report a remarkable interplay between depth and locality of a target function. We introduce k-local and k-global functions, and find that depth is beneficial for learning local functions but detrimental to learning global functions. This interplay is not properly captured by the neural tangent kernel, which describes an infinitely wide neural network within the lazy learning regime.



1 Introduction

Deep neural networks (DNNs) have achieved unparalleled success in various tasks of artificial intelligence such as image classification [1, 2] and speech recognition [3]. Empirically, DNNs often outperform other machine-learning methods such as kernel methods and Gaussian processes, but little is known about the mechanism underlying this outstanding performance.

To elucidate the benefits of depth, numerous studies have investigated properties of DNNs from various perspectives. Approximation theory focuses on the expressive power of DNNs [4]. Although the universal approximation theorem states that a sufficiently wide neural network with a single hidden layer can approximate any continuous function, the expressivity of a DNN grows exponentially with depth rather than width [5, 6, 7, 8]. In statistical learning theory, the decay rate of the generalization error in the large-sample asymptotics has been analyzed. For learning generic smooth functions, shallow networks or other standard methods with linear estimators, such as kernel methods, already achieve the optimal rate [9], and hence the benefits of depth are not obvious. On the other hand, for learning smooth functions with special properties such as hierarchical compositionality [10] or spatially inhomogeneous smoothness [11], or for learning a certain class of non-smooth functions [12], DNNs have been shown to achieve faster decay rates of the generalization error than linear estimators.

Although these existing theoretical efforts have revealed interesting and nontrivial properties of DNNs, they do not adequately explain the empirical success of deep learning. Crucially, in modern machine-learning applications, impressive generalization performance has been observed in the overparameterized regime, in which the number of parameters in the network greatly exceeds the number of training data samples [13, 14, 15]. The asymptotic decay rate of the generalization error, which has been studied in statistical learning theory, does not cover the overparameterized regime. As for approximation theory, it is far from clear whether the high expressive power of DNNs is really beneficial in practical applications [16, 17, 18, 19, 20]. A recent work [21] has demonstrated that a DNN trained by a gradient-based optimization algorithm can only learn functions that are well approximated by a shallow network, indicating that the benefits of depth are not due to the high expressivity of DNNs. Thus, the benefits of depth for the generalization ability of overparameterized DNNs remain elusive.

In this work, we numerically investigate the effect of depth in learning simple functions for which no evidence of benefits of depth is found in existing theories. We focus on the locality property of target functions and introduce k-local and k-global functions. A k-local function is given by a product of k pre-fixed entries of the input vector, whereas a k-global function is defined as a global sum of k-local functions (we will later consider more general target functions in section 3.3.3). We find that depth is beneficial for learning k-local functions but rather detrimental to learning k-global functions.

We also show that the effect of depth is not correctly captured by the theory of the neural tangent kernel (NTK) [22], which describes an infinitely wide neural network optimized by stochastic gradient descent (SGD) with an infinitesimal learning rate. Since the NTK is associated with the lazy learning regime [23], in which network parameters stay close to their initial values, the failure of the NTK to capture the effect of depth implies the importance of feature learning, in which parameters change to learn relevant features.

1.1 Our contribution

We summarize our contribution below.

  • We find that benefits of depth in the overparameterized regime are present even for very simple functions such as k-local ones (section 3.1). Although it is sometimes emphasized that DNNs can express complex functions, this result shows that the benefits of depth are not solely attributed to high expressivity.

  • We find that depth is not always beneficial, as is clearly demonstrated for learning k-global functions (section 3.1).

  • Those results are not explained by the NTK, which describes the lazy learning regime (section 3.1). This fact implies the importance of the feature learning regime, which corresponds to large learning rates (section 3.2).

  • The opposite depth dependence for k-local and k-global functions is also observed for noisy labels (section 3.3.1), for a classification task with the cross-entropy loss (section 3.3.2), and for more general local and global functions (section 3.3.3). Thus, our results are robust.

2 Setup

We consider supervised learning of a target function f* with a training dataset D = {(x_μ, y_μ)}_{μ=1,…,N}, where x_μ is a d-dimensional input vector and y_μ = f*(x_μ) is its label. Each input vector is assumed to be a d-dimensional random Gaussian vector, x_μ ~ N(0, I_d), where N(0, I_d) denotes the Gaussian distribution with zero mean and covariance given by the d-dimensional identity matrix I_d. We mainly consider noiseless data, but noisy labels are considered in section 3.3.1.

In the following, we summarize the setup of our experiments.

2.1 Target functions

In this work, instead of developing a general mathematical theory for a wide class of target functions f*, we show experimental results for concrete target functions. We focus on the locality of target functions and introduce k-local and k-global functions. Let us randomly fix k integers i_1, i_2, …, i_k with 1 ≤ i_1 < i_2 < ⋯ < i_k ≤ d. (For the fully connected neural networks (FNNs) considered in this paper, we can choose i_1 = 1, i_2 = 2, …, i_k = k without loss of generality because of the permutation symmetry of the indices of input vectors.) A k-local function is then defined as

f^*_{\mathrm{loc}}(x) = x_{i_1} x_{i_2} \cdots x_{i_k},    (1)

i.e., a product of the k entries x_{i_1}, …, x_{i_k} of x. This function is "local" in the sense that it depends on only k entries of the input data (this property may also be called "sparsity" rather than "locality"; in this work, however, we say that such a function is local as opposed to "global" functions). We consider the case of k ≪ d. On the other hand, a k-global function is defined by a global sum of k-local functions as follows:

f^*_{\mathrm{glo}}(x) = \frac{1}{\sqrt{d}} \sum_{i=1}^{d} x_i x_{i+1} \cdots x_{i+k-1},    (2)

where we impose the periodic boundary condition x_{d+i} = x_i. The scaling factor 1/\sqrt{d} is introduced to make typical values of f^*_{\mathrm{glo}}(x) for x ~ N(0, I_d) independent of d. In contrast to k-local functions, every component of x contributes equally to a k-global function.
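As a concrete illustration, the following is a minimal NumPy sketch of the k-local and k-global targets of Eqs. (1) and (2) as reconstructed above; the function names and the sample values of d, N, and k are ours and purely illustrative, not taken from the paper.

```python
import numpy as np

def k_local(X, k):
    """k-local target, Eq. (1): product of the first k entries of each input
    (taking i_1 = 1, ..., i_k = k, as allowed by permutation symmetry)."""
    return np.prod(X[:, :k], axis=1)

def k_global(X, k):
    """k-global target, Eq. (2): sum of k-local terms over all cyclic windows,
    scaled by 1/sqrt(d) so that typical values are independent of d."""
    n, d = X.shape
    out = np.zeros(n)
    for i in range(d):
        window = [(i + j) % d for j in range(k)]  # periodic boundary x_{d+i} = x_i
        out += np.prod(X[:, window], axis=1)
    return out / np.sqrt(d)

# Gaussian inputs x ~ N(0, I_d); the values of d, N, k below are illustrative.
rng = np.random.default_rng(0)
d, N, k = 50, 1000, 2
X = rng.standard_normal((N, d))
y_local, y_global = k_local(X, k), k_global(X, k)
```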

2.2 Network architecture

In this work, we consider fully connected neural networks (FNNs) with L hidden layers, each of which has W nodes. We call L and W the depth and the width of the network, respectively. The weights and biases of the l-th layer are denoted by W^{(l)} and b^{(l)}, respectively, and θ denotes the set of all the weights and biases in the network. The output of the network is determined as z^{(0)} = x, z^{(l)} = σ(W^{(l)} z^{(l-1)} + b^{(l)}) for l = 1, 2, …, L, and f(x; θ) = W^{(L+1)} z^{(L)} + b^{(L+1)}, where z^{(l)} is the output of the l-th layer and σ(·) is the component-wise ReLU activation function.

In comparing the performance for different L, we fix the total number of parameters P. Since the number of parameters is roughly given by LW^2, the width W is determined for a given L as the closest integer satisfying

LW^2 \simeq P.    (3)

In this work, we focus on the overparameterized regime P ≫ N, in which DNNs empirically show astonishing generalization performance [13, 14, 15].
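The following PyTorch sketch illustrates the setup of this subsection: a depth-L, width-W ReLU network and the choice of W from the parameter budget P via Eq. (3). The function names and parameter values are ours and meant only as placeholders.

```python
import math
import torch.nn as nn

def width_for_budget(P, L):
    """Eq. (3): with roughly L * W^2 parameters in total, choose W as the
    integer closest to sqrt(P / L)."""
    return max(1, round(math.sqrt(P / L)))

def make_fnn(d, L, W):
    """Fully connected network with L hidden ReLU layers of width W and a
    scalar output f(x; theta)."""
    layers, n_in = [], d
    for _ in range(L):
        layers += [nn.Linear(n_in, W), nn.ReLU()]
        n_in = W
    layers.append(nn.Linear(n_in, 1))
    return nn.Sequential(*layers)

# Compare different depths at a (roughly) fixed parameter budget P.
P, d = 100_000, 50
for L in (1, 2, 4, 8):
    W = width_for_budget(P, L)
    net = make_fnn(d, L, W)
    print(L, W, sum(p.numel() for p in net.parameters()))
```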

2.3 Training procedure

The network parameters θ are adjusted to fit the training data samples through minimization of the loss function

\mathcal{L}(\theta) = \frac{1}{N} \sum_{\mu=1}^{N} \left[ f(x_\mu; \theta) - y_\mu \right]^2,    (4)

which is nothing but the mean-squared error. The training of the network is carried out by SGD,

\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}_{B_t}(\theta_t),    (5)

with learning rate η and mini-batch size B (we fix B throughout the paper), where B_t ⊂ {1, 2, …, N} with |B_t| = B denotes the mini-batch at the t-th step and

\mathcal{L}_{B_t}(\theta) = \frac{1}{B} \sum_{\mu \in B_t} \left[ f(x_\mu; \theta) - y_\mu \right]^2    (6)

denotes the mini-batch loss.

Biases are initialized at zero, and weights are initialized using the Glorot initialization [24]. Every 50 epochs, we measure the loss function and stop the training if the measured value falls below a fixed threshold. We checked that our conclusion is not sensitive to the threshold value for stopping the training.

Before the training, we first perform 10-fold cross validation to optimize the learning rate η via Bayesian optimization [25] (we used the package provided by Nogueira [26]). We then train the network via SGD with the optimized η. The generalization performance of the trained network is measured by computing the test error

\varepsilon_{\mathrm{test}} = \frac{1}{M} \sum_{\mu=1}^{M} \left[ f(x'_\mu; \hat{\theta}) - y'_\mu \right]^2,    (7)

where \hat{\theta} denotes the parameters of the trained network, and D' = {(x'_μ, y'_μ)}_{μ=1,…,M} is a test dataset independent of the training dataset D, with x'_μ ~ N(0, I_d) and y'_μ = f*(x'_μ). The test-set size M is fixed throughout the paper.
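A schematic PyTorch version of the training and evaluation procedure of Eqs. (4)-(7) is sketched below. The mini-batch size, stopping threshold, and epoch budget are placeholders, since the paper's specific values are not preserved in this copy, and the learning-rate search by 10-fold cross validation with Bayesian optimization is omitted.

```python
import torch
import torch.nn as nn

def glorot_init(net):
    """Glorot (Xavier) weight initialization with zero biases, as in Sec. 2.3."""
    for m in net.modules():
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)
            nn.init.zeros_(m.bias)

def train_sgd(net, X, y, lr, batch_size=64, threshold=1e-4, max_epochs=2500):
    """SGD on the mean-squared error, Eqs. (4)-(6). Every 50 epochs the full
    training loss is measured and training stops once it falls below `threshold`
    (batch_size, threshold, and max_epochs are placeholder values)."""
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    N = X.shape[0]
    for epoch in range(1, max_epochs + 1):
        perm = torch.randperm(N)
        for s in range(0, N, batch_size):
            idx = perm[s:s + batch_size]
            loss = ((net(X[idx]).squeeze(-1) - y[idx]) ** 2).mean()  # mini-batch loss, Eq. (6)
            opt.zero_grad()
            loss.backward()
            opt.step()
        if epoch % 50 == 0:
            with torch.no_grad():
                full_loss = ((net(X).squeeze(-1) - y) ** 2).mean().item()
            if full_loss < threshold:
                break
    return net

def test_error(net, X_test, y_test):
    """Test error of Eq. (7) on an independent test set."""
    with torch.no_grad():
        return ((net(X_test).squeeze(-1) - y_test) ** 2).mean().item()
```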

2.4 Neural tangent kernel

Following Arora et al. [27] and Cao and Gu [28], let us consider an FNN of depth L and width W whose biases and weights are randomly initialized as b^{(l)}_i = β \tilde{b}^{(l)}_i with \tilde{b}^{(l)}_i ~ N(0, 1) and W^{(l)}_{ij} = \tilde{w}^{(l)}_{ij} / \sqrt{n_{l-1}} with \tilde{w}^{(l)}_{ij} ~ N(0, 1) for every l, where n_l is the number of nodes in the l-th layer, i.e., n_0 = d and n_1 = ⋯ = n_L = W. The parameter β controls the impact of the bias terms, and we follow Jacot et al. [22] in our choice of β for the numerical experiments. Let us denote by \tilde{θ} the set of all the scaled weights and biases {\tilde{w}^{(l)}, \tilde{b}^{(l)}}. The network output is then written as f(x; \tilde{θ}).

When the network is sufficiently wide and the learning rate \tilde{η} for \tilde{θ} is sufficiently small (we remark that the scaled learning rate \tilde{η} for \tilde{θ} can remain finite in the large-width limit [29, 30], which means that the original learning rate η for θ must be rescaled with the width in order to enter the NTK regime), the scaled parameters \tilde{θ} stay close to their randomly initialized values during training, and hence f(x; \tilde{θ}) is approximated by a linear function of \tilde{θ}:

f(x; \tilde{\theta}) \approx f(x; \tilde{\theta}_0) + \nabla_{\tilde{\theta}} f(x; \tilde{\theta}_0) \cdot (\tilde{\theta} - \tilde{\theta}_0).    (8)

As a result, the minimization of the loss function is equivalent to kernel regression with the NTK defined as

\Theta(x, x') = \mathbb{E}_{\tilde{\theta}_0} \left[ \nabla_{\tilde{\theta}} f(x; \tilde{\theta}_0) \cdot \nabla_{\tilde{\theta}} f(x'; \tilde{\theta}_0) \right],    (9)

where \mathbb{E}_{\tilde{\theta}_0}[\cdot] denotes the average over random initializations \tilde{\theta}_0. By using the ReLU activation, we can give an explicit expression of the NTK that is suited for numerical calculations; see Appendix A for the details.

It is shown that the minimization of the loss function using the NTK yields the output function

f(x) = \sum_{\mu, \nu = 1}^{N} \Theta(x, x_\mu) \left( \Theta_N^{-1} \right)_{\mu\nu} y_\nu,    (10)

where \Theta_N^{-1} is the inverse of the Gram matrix (\Theta_N)_{\mu\nu} = \Theta(x_\mu, x_\nu).
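For reference, the NTK prediction of Eq. (10) amounts to standard kernel regression. The sketch below assumes a callable ntk_fn(A, B) returning the kernel matrix Θ(a_i, b_j), for example the explicit ReLU NTK of Appendix A; the small ridge term is ours, added purely for numerical stability.

```python
import numpy as np

def ntk_predict(ntk_fn, X_train, y_train, X_test, ridge=1e-8):
    """Kernel regression of Eq. (10):
    f(x) = sum_{mu,nu} Theta(x, x_mu) (Theta_N^{-1})_{mu nu} y_nu."""
    gram = ntk_fn(X_train, X_train)      # Gram matrix Theta_N
    cross = ntk_fn(X_test, X_train)      # Theta(x, x_mu) for the test inputs
    alpha = np.linalg.solve(gram + ridge * np.eye(len(X_train)), y_train)
    return cross @ alpha
```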

Figure 1: Depth dependence of the test error for the 1-local and 1-global target functions. Solid curves show numerical results for neural networks at various values of the depth L with a fixed number of parameters P. Error bars are typically smaller than the symbols. Dashed lines show numerical results for the NTK.

3 Experimental results

We now present our experimental results. First, we show the depth dependence of the test error at the optimized learning rate. We will see that depth is beneficial for local functions but not for global ones, and that this nontrivial interplay between depth and locality is not explained by the NTK. Next, we investigate the dependence on the learning rate. We will see that although the results for small learning rates agree with those for the NTK, the optimal learning rate is often found in the feature learning regime, which is not described by the NTK. This result implies the importance of feature learning in understanding the benefits of depth in DNNs.

Figure 2: Depth dependence of the test error for the k-local and k-global target functions with k = 2 (left panel) and k = 3 (right panel). Solid curves show numerical results for neural networks at various values of the depth L with a fixed number of parameters P. Error bars are typically smaller than the symbols. Dashed lines show numerical results for the NTK.

3.1 Opposite depth dependence for k-local and k-global functions

We now investigate the depth dependence of the test error. It turns out that the results for linear target functions (k = 1) qualitatively differ from those for nonlinear ones (k ≥ 2). We therefore first show experimental results for k = 1 and then discuss the more intriguing cases of k ≥ 2.

Numerical results for k = 1 are shown in Fig. 1, where the input dimension d, the number of training samples N, and the number of parameters P are fixed. We find similar depth dependences for the 1-local and 1-global functions, which indicates that locality does not matter for linear functions. A shallow network (L = 1) outperforms DNNs with L ≥ 2, although the test error is non-monotonic in L. The NTK also predicts that a shallow network is better, but it does not reproduce the non-monotonicity.

The results qualitatively change for nonlinear target functions with k ≥ 2. We show numerical results for k-local and k-global functions with k = 2 (left) and k = 3 (right) in Fig. 2. The input dimension d and the number of training samples N are chosen separately for the 2-local, 2-global, 3-local, and 3-global functions, while the number of parameters P is the same in all cases. Since the values of d and N differ among the target functions, it is not meaningful to quantitatively compare test errors across different target functions. Rather, we focus on the depth dependence of the test error, which is not sensitive to the choice of d and N.

For local functions, the test error of a shallow network with L = 1 is much higher than that of DNNs. The test error tends to decrease as the depth increases, which means that depth is beneficial for learning k-local functions. On the other hand, for global functions, a shallow network shows much better performance than DNNs, which means that depth is rather detrimental to learning global functions.

These results tell us that depth is beneficial even for very simple functions, but that it does not always help generalization. Whether we should use DNNs thus depends on the locality of the target function (or of the relevant features within the data).

Remarkably, the NTK is not a good approximation of a neural network at the optimal learning rate (compare the solid and dashed lines in Fig. 2), except for the 2-global target function and the 3-global target function at large depths. The interplay between depth and locality is not correctly captured by the NTK. For example, for the 2-local function, the test error calculated by the NTK increases with depth, whereas it decreases for neural networks. For the 3-global function, the NTK seems to be a relatively good approximation for large L, but it predicts that a shallow network with L = 1 generalizes more poorly than DNNs, which is not the case for neural networks.

Figure 3: Learning-rate dependence of the test error for the 3-local (left panel) and 3-global (right panel) functions. Solid lines show numerical results for neural networks with a fixed width W. Error bars are typically smaller than the symbols. Dashed lines show test errors calculated using the NTK.

3.2 Learning rate dependence

In section 3.1, we found that the NTK does not correctly explain the depth dependence of the test error at the optimal learning rate. The fact that the NTK describes the lazy learning regime, which corresponds to small learning rates [29], indicates that generalization strongly depends on the learning rate and that the optimal learning rate should lie in the feature learning regime [30].

We now investigate the learning-rate dependence of the test error. Numerical results for k-local and k-global functions are shown in Fig. 3. In Fig. 3, instead of fixing the number of parameters, we fix the width W of the hidden layers. Each curve in Fig. 3 is plotted up to the maximum learning rate beyond which a loss value below the stopping threshold is not achieved within 2500 epochs (for large learning rates, training often fails due to divergence of the network parameters).

We find that the NTK (dashed lines in Fig. 3) is a good approximation in the small learning-rate regime, but not in the large learning-rate regime. Figure 3 also shows that the optimal learning rate is often found in the large learning-rate regime, which is why the NTK cannot capture the interplay between locality and depth.

3.3 Robustness of the interplay between depth and locality

In this section, we show that the interplay between depth and locality observed in section 3.1 is robust. We will show that the same depth dependence is observed for (i) noisy labels, (ii) a classification task with the cross-entropy loss, and (iii) more general local and global functions.

3.3.1 Noisy labels

Figure 4: Depth dependence of the test error for the 2-local (left panel) and 2-global (right panel) functions. Error bars are typically smaller than the symbols. Numerical results are presented for neural networks with a fixed number of parameters P.

We discuss the effect of noise in the labels of the training dataset D: y_μ = f*(x_μ) + ξ_μ, where ξ_μ is zero-mean Gaussian noise whose standard deviation characterizes the noise strength. The loss function is still given by Eq. (4), now with the noisy labels. The generalization performance is measured by the test error on the noiseless test dataset D', i.e., Eq. (7) is used. The depth dependence of the test error for the 2-local and 2-global functions is presented for various values of the noise strength in Fig. 4. We find that label noise does not change the conclusion that depth is beneficial for local functions but not for global functions.
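A minimal sketch of the noisy-label setup, assuming the 2-local target of Eq. (1); the values of d, N, and the noise strength below are placeholders, since the paper's values are not reproduced in this copy.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, k, noise_std = 50, 1000, 2, 0.1    # illustrative values; noise_std is a placeholder
X = rng.standard_normal((N, d))
f_star = np.prod(X[:, :k], axis=1)       # 2-local target, Eq. (1)
y_noisy = f_star + noise_std * rng.standard_normal(N)
# Training minimizes Eq. (4) with y_noisy; the test error of Eq. (7) is still
# computed against noiseless labels on an independent test set.
```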

3.3.2 Classification

Figure 5: Depth dependence of the misclassification rate on a test dataset for the classification task defined by the sign of the k-local and k-global functions with k = 2 (left panel) and k = 3 (right panel). Error bars are typically smaller than the symbols.

The opposite depth dependence of the generalization performance for local and global functions is also found in the classification setup. We now consider the binary classification problem defined by the sign of the k-local or k-global function f*: the label for an input x is y = sign(f*(x)). We employ the cross-entropy loss as the loss function. Every 50 epochs, we measure the training accuracy and stop the training once 100% accuracy is achieved (we have checked that continuing the training further does not change the conclusion). The generalization performance is measured by the misclassification rate on the test dataset.

The depth dependence of the misclassification rate is shown for k-local and k-global functions with k = 2 (left) and k = 3 (right) in Fig. 5. Here, (d, N) are chosen as (500, 10000) for the 2-local function, (100, 10000) for the 2-global function, (100, 20000) for the 3-local function, and (40, 20000) for the 3-global function. We find that the generalization performance is improved by increasing the depth for k-local functions, whereas it is worsened for k-global functions.

This conclusion is identical to that in the regression setup discussed so far. The interplay of depth and locality is not limited to such a specific setup.
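The classification variant can be sketched in PyTorch as follows, assuming binary labels y = sign(f*(x)) encoded as {0, 1} and the logistic (cross-entropy) loss; the batch size and epoch budget are placeholders of ours.

```python
import torch
import torch.nn as nn

def sign_labels(f_star_values):
    """Binary labels y = sign(f*(x)), encoded as {0, 1} for the cross-entropy loss."""
    return (f_star_values > 0).float()

def train_classifier(net, X, y01, lr, batch_size=64, max_epochs=2500):
    """Cross-entropy training; every 50 epochs the training accuracy is checked
    and training stops once it reaches 100%, as described in Sec. 3.3.2."""
    loss_fn = nn.BCEWithLogitsLoss()
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    N = X.shape[0]
    for epoch in range(1, max_epochs + 1):
        perm = torch.randperm(N)
        for s in range(0, N, batch_size):
            idx = perm[s:s + batch_size]
            loss = loss_fn(net(X[idx]).squeeze(-1), y01[idx])
            opt.zero_grad()
            loss.backward()
            opt.step()
        if epoch % 50 == 0:
            with torch.no_grad():
                acc = ((net(X).squeeze(-1) > 0).float() == y01).float().mean().item()
            if acc == 1.0:
                break
    return net

def misclassification_rate(net, X_test, y01_test):
    """Fraction of misclassified test points."""
    with torch.no_grad():
        pred = (net(X_test).squeeze(-1) > 0).float()
    return (pred != y01_test).float().mean().item()
```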

3.3.3 More general local and global functions

Figure 6: Depth dependence of the test error for generalized 2-local and 2-global functions for two choices of the function g (left and right panels). Error bars are typically smaller than the symbols. Dashed lines show test errors calculated by the NTK.

So far, we have investigated the depth dependence of the generalization performance of DNNs for learning the specific k-local and k-global functions given by Eqs. (1) and (2). We have seen that depth is beneficial for k-local functions but rather detrimental to learning k-global functions. Here, we extend the notions of k-local and k-global functions and test whether this conclusion still holds for the extended local and global functions.

Let us introduce a certain (possibly smooth) function g(x_1, x_2, …, x_k) of k variables, where k is a positive integer. For a fixed set of indices i_1, i_2, …, i_k with 1 ≤ i_1 < i_2 < ⋯ < i_k ≤ d, a generalized k-local function is written as

f^*_{\mathrm{loc}}(x) = g(x_{i_1}, x_{i_2}, \dots, x_{i_k}).    (11)

The corresponding k-global function is defined as

f^*_{\mathrm{glo}}(x) = \frac{1}{\sqrt{d}} \sum_{i=1}^{d} g(x_i, x_{i+1}, \dots, x_{i+k-1}).    (12)

Equations (1) and (2) correspond to the simple choice g(x_1, x_2, …, x_k) = x_1 x_2 ⋯ x_k. Equations (11) and (12) are thus extensions of Eqs. (1) and (2), respectively.

The interplay between depth and locality observed in section 3.1 is expected to hold for a more general class of g. It is clearly an important problem to support this statement theoretically. The fact that the NTK fails to explain this interplay implies that we should investigate generalization performance in the feature learning regime, which will require new theoretical tools; we therefore postpone it to future studies. Instead, we numerically test whether the same conclusion holds for two other examples of g.

Numerical results for the local and global functions corresponding to these two examples of g are shown in Fig. 6. For each choice of g, the values of d and N are set separately for the local function (11) and the global function (12). We find that, in both cases, DNNs outperform a shallow network when the target function is local, whereas a shallow network outperforms DNNs when the target function is global. In this way, the interplay between depth and locality observed in section 3.1 is robust against the change of the function g.
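A NumPy sketch of the generalized targets of Eqs. (11) and (12) is given below. The paper's two specific choices of g are not preserved in this copy, so the example g used here is purely illustrative.

```python
import numpy as np

def general_local(X, g, k):
    """Generalized k-local function, Eq. (11): g applied to the first k entries."""
    return g(X[:, :k])

def general_global(X, g, k):
    """Generalized k-global function, Eq. (12): sum of g over all cyclic
    windows of length k, scaled by 1/sqrt(d)."""
    n, d = X.shape
    out = np.zeros(n)
    for i in range(d):
        window = [(i + j) % d for j in range(k)]
        out += g(X[:, window])
    return out / np.sqrt(d)

# Purely illustrative choice of g (not one of the paper's examples).
g_example = lambda Z: np.tanh(np.prod(Z, axis=1))
```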

4 Conclusion

We have seen that, in the overparameterized regime, depth is beneficial for local functions but not for global functions. In previous works [5, 6, 7, 8], the benefits of DNNs have been partially attributed to their high expressivity, which suggests that the benefits of depth should be most evident for highly complex target functions. However, our k-local functions given by Eq. (1) are very simple, which clearly shows that the benefits of depth presented in this work are not due to the high expressive power of DNNs.

It is also an interesting observation that depth is rather detrimental to learning global target functions. While there are many studies on the benefits of depth, it is equally important to figure out when depth is disadvantageous.

As demonstrated in section 3.3, the above conclusion is robust against several changes of the setting, which indicates that some fundamental underlying mechanism exists. In particular, the results in section 3.3.3 show that the interplay of depth and locality is not a special property of the specific functions of Eqs. (1) and (2). Rather, this interplay appears to be a general property of a certain class of local and global functions written in the form of Eqs. (11) and (12), respectively. It is an open problem to understand such a fundamental mechanism theoretically.

Since this interplay is not observed in the lazy learning regime, in which the NTK is an adequate theoretical tool, we should theoretically investigate the feature learning regime to understand the mechanism behind it. A new theoretical tool will be required, and so we leave it as an important future problem.

Here, we have to be content with presenting an intuitive argument in this direction. Since information on an input vector is lost as it propagates through the network layer by layer [6, 31], DNNs are expected to be suited to local target functions, for which most elements of an input vector are irrelevant (so that we should be willing to throw away information about the data). By utilizing the chaotic nature of information processing in DNNs [6, 31], a local change of an input vector can be successively amplified through the hidden layers while irrelevant information is discarded. In contrast, global target functions depend on all elements of an input vector, and hence information about the input must be retained up to the output layer. In that case, depth can rather be detrimental to generalization.

The above argument is still primitive. It is a challenging theoretical problem to establish a precise mathematical theory.

References

Appendix A Explicit expression of the NTK

We consider a network whose biases and weights are randomly initialized as b^{(l)}_i = β \tilde{b}^{(l)}_i with \tilde{b}^{(l)}_i ~ N(0, 1) and W^{(l)}_{ij} = \tilde{w}^{(l)}_{ij} / \sqrt{n_{l-1}} with \tilde{w}^{(l)}_{ij} ~ N(0, 1) for every l, where n_l is the number of neurons in the l-th layer, i.e., n_0 = d and n_1 = ⋯ = n_L = W. In the infinite-width limit W → ∞, the pre-activations at every hidden layer tend to an i.i.d. Gaussian process with covariance Σ^{(h)}, which is defined recursively as

\Sigma^{(h)}(x, x') = \mathbb{E}_{(u, v) \sim \mathcal{N}(0, \Lambda^{(h)})} \left[ \sigma(u) \sigma(v) \right] + \beta^2, \qquad \Lambda^{(h)} = \begin{pmatrix} \Sigma^{(h-1)}(x, x) & \Sigma^{(h-1)}(x, x') \\ \Sigma^{(h-1)}(x', x) & \Sigma^{(h-1)}(x', x') \end{pmatrix},    (13)

for h = 1, 2, …, L. We also define

\dot{\Sigma}^{(h)}(x, x') = \mathbb{E}_{(u, v) \sim \mathcal{N}(0, \Lambda^{(h)})} \left[ \sigma'(u) \sigma'(v) \right],    (14)

where \sigma'(z) is the derivative of \sigma(z). The NTK is then expressed as

\Theta(x, x') = \sum_{h=1}^{L+1} \Sigma^{(h-1)}(x, x') \prod_{h'=h}^{L+1} \dot{\Sigma}^{(h')}(x, x'),    (15)

where we set \dot{\Sigma}^{(L+1)}(x, x') \equiv 1.

The derivation of this formula is given by Arora et al. [27].

Using the ReLU activation function σ(z) = max(0, z), we can evaluate the Gaussian expectations in Eqs. (13) and (14) explicitly [29], obtaining

\Sigma^{(h)}(x, x') = \frac{\sqrt{\Sigma^{(h-1)}(x, x)\,\Sigma^{(h-1)}(x', x')}}{2\pi} \left[ \sin\varphi^{(h-1)} + \left(\pi - \varphi^{(h-1)}\right)\cos\varphi^{(h-1)} \right] + \beta^2, \qquad \varphi^{(h-1)} := \arccos\frac{\Sigma^{(h-1)}(x, x')}{\sqrt{\Sigma^{(h-1)}(x, x)\,\Sigma^{(h-1)}(x', x')}},    (16)

and

\dot{\Sigma}^{(h)}(x, x') = \frac{\pi - \varphi^{(h-1)}}{2\pi}.    (17)

For h = 0, we have \Sigma^{(0)}(x, x') = x^\top x' / d + \beta^2. By solving Eqs. (16) and (17) iteratively, we obtain the NTK in Eq. (15). (When β = 0, i.e., no bias, the equations simplify further: \varphi^{(0)} is the angle between x and x', and \varphi^{(h)} is determined recursively by \cos\varphi^{(h)} = \left[ \sin\varphi^{(h-1)} + (\pi - \varphi^{(h-1)})\cos\varphi^{(h-1)} \right] / \pi.)
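For completeness, a NumPy sketch of the ReLU NTK recursion, following Eqs. (16) and (17) as reconstructed above, is given below. The normalization follows one standard convention and may differ from the paper's original expressions by constant factors, so it should be read as illustrative rather than as the authors' exact implementation.

```python
import numpy as np

def relu_ntk(X1, X2, L, beta=0.0):
    """ReLU NTK between the rows of X1 and X2 for a network with L hidden layers,
    iterating Eqs. (16)-(17) and accumulating the sum of Eq. (15) via the
    recursion Theta <- Theta * Sigma_dot + Sigma."""
    d = X1.shape[1]
    sig = X1 @ X2.T / d + beta**2                    # Sigma^{(0)}(x, x')
    diag1 = np.sum(X1**2, axis=1) / d + beta**2      # Sigma^{(0)}(x, x)
    diag2 = np.sum(X2**2, axis=1) / d + beta**2
    theta = sig.copy()
    for _ in range(L):
        norm = np.sqrt(np.outer(diag1, diag2))
        cos = np.clip(sig / norm, -1.0, 1.0)
        phi = np.arccos(cos)
        sig_dot = (np.pi - phi) / (2 * np.pi)                                     # Eq. (17)
        sig = norm / (2 * np.pi) * (np.sin(phi) + (np.pi - phi) * cos) + beta**2  # Eq. (16)
        diag1 = diag1 / 2 + beta**2                  # diagonal case, phi = 0
        diag2 = diag2 / 2 + beta**2
        theta = theta * sig_dot + sig                # builds up the sum of Eq. (15)
    return theta
```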