and computer vision (Gao and Ji, 2019). One major reason behind the great success is that the classification function of a deep neural network can be highly nonlinear and express a highly complicated function (Bengio and Delalleau, 2011). Consequently, a fundamental question lies in how nonlinear and how complex the function of a deep neural network is. Model complexity measures (Montufar et al., 2014; Raghu et al., 2017) address this question. Recent progress in model complexity measures directly facilitates advances in many directions of deep neural networks, such as model architecture design, model selection, performance improvement (Hayou et al., 2018), and overfitting detection (Hawkins, 2004). The challenges in measuring model complexity are tackled from different angles. For example, the influences of model structure on complexity have been investigated, including layer width, network depth, and layer type. The power of width is discussed, and a single hidden layer network with a finite number of neurons is proved to be a universal approximator (Hornik et al., 1989; Barron, 1993). With the exploration of deep network structures, some recent studies pay attention to the effectiveness of deep architectures in increasing model complexity, known as depth efficiency (Lu et al., 2017; Bengio and Delalleau, 2011; Cohen et al., 2016; Eldan and Shamir, 2016). Bounds on the model complexity of some specific model structures have been proposed, from sum-product networks (Delalleau and Bengio, 2011) to piecewise linear neural networks (Pascanu et al., 2013; Montufar et al., 2014). Model parameters (e.g., weights and biases of layers) also play important roles in model complexity. For example, may be considered more complex than according to their function forms. However, if the parameters of the two functions are , , , and , and are then two coincident lines. This example demonstrates the importance of model parameters on complexity. Raghu et al.
(2017) propose a complexity measure for neural networks with piecewise linear activation functions by measuring the number of linear regions through a trajectory path between two instances. Their proposed complexity measure reflects the effect of model parameters to some degree. However, the approach of Raghu et al. (2017) cannot be directly generalized to neural networks with curve activation functions, such as Sigmoid (Kilian and Siegelmann, 1993) and Tanh (Kalman and Kwasny, 1992). At the same time, in some specific applications, curve activation functions are found to be superior to piecewise linear activation functions. For example, many financial models use Tanh rather than ReLU (Ding et al., 2015). A series of state-of-the-art studies speed up and simplify the training of neural networks with curve activation functions (Ioffe and Szegedy, 2015). This motivates our study on the model complexity of deep neural networks with curve activation functions.
In this paper, we develop a complexity measure for deep fully-connected neural networks with curve activation functions. Previous studies on deep models with piecewise linear activation functions use the number of linear regions to model the nonlinearity and measure model complexity (Pascanu et al., 2013; Montufar et al., 2014; Raghu et al., 2017; Novak et al., 2018). To generalize this idea, we develop a piecewise linear approximation to approach target deep models with curve activation functions. Then, we measure the number of linear regions of the approximation as an indicator of the target model complexity. The piecewise linear approximation is designed under two desiderata. First, to guarantee the approximation degree, we require a direct approximation of the function of the target model rather than simply mimicking its behavior or performance, as the mimic learning approach does (Hinton et al., 2015). The rationale is that two functions having the same behavior on a set of data points may still be very different, as illustrated in Figure 1. Therefore, approximation using the mimic learning approach (Hinton et al., 2015) is not enough. Second, to compare the complexity values of different models, the complexity measure has to be principled. The principle we follow is to minimize the number of linear regions given an approximation degree threshold. Under these two desiderata, the minimum number of linear regions constrained by a certain approximation degree can be used to reflect the model complexity. Technically, we propose the linear approximation neural network (LANN for short), a piecewise linear framework to approximate a target deep model with curve activation functions. A LANN shares the same layer width, depth, and parameters with the target model, except that it replaces every activation function with a piecewise linear approximation.
An individual piecewise linear function is designed as the activation function on every neuron to satisfy the above two desiderata. We analyze the approximation degree of LANNs with respect to the target model, then devise an algorithm to build LANNs that minimizes the number of linear regions. We provide an upper bound on the number of linear regions formed by LANNs, and define the complexity measure using the upper bound.
To demonstrate the usefulness of the complexity measure, we explore its utility in analyzing the training process of deep models, especially the problem of overfitting (Hawkins, 2004). Overfitting occurs when a model is more complicated than the ultimately optimal one, and thus the learned function fits too closely to the training data and fails to generalize, as illustrated in Figure 1. Our results show that the occurrence of overfitting is positively correlated with the increase of model complexity. Besides, we observe that regularization methods for preventing overfitting, such as $L^1$ and $L^2$ regularizations (Goodfellow et al., 2016), constrain the increase of model complexity. Based on this finding, we propose two simple yet effective approaches for preventing overfitting by directly constraining the growth of model complexity.
The rest of the paper is organized as follows. Section 2 reviews related work. In Section 3 we provide the problem formulation. In Section 4 we introduce the linear approximation neural network framework. In Section 5 we develop the complexity measure. In Section 6 we explore the training process and overfitting from the viewpoint of the complexity measure. Section 7 concludes the paper.
2. Related Work
The study of model complexity dates back several decades. In this section, we review related work on the model complexity of neural networks from two aspects: model structures and parameters.
2.1. Model Structures
Model structures, such as layer width, layer depth, and layer type, may have a strong influence on model complexity. The power of layer width in shallow neural networks was investigated decades ago (Hornik et al., 1989; Barron, 1993; Cybenko, 1989; Maass et al., 1994). Hornik et al. (1989) propose the universal approximation theorem, which states that a single-hidden-layer feedforward network with a finite number of neurons can approximate any continuous function under some mild assumptions. Some later studies (Barron, 1993; Cybenko, 1989; Maass et al., 1994) further strengthen this theorem. However, the layer width required by the universal approximation theorem can be exponentially large. Lu et al. (2017) extend the universal approximation theorem to deep networks with bounded layer width. Recently, deep models have been empirically found to be more effective than shallow ones. A series of studies focus on exploring the advantages of deep architectures from a theoretical view, which is called depth efficiency (Bengio and Delalleau, 2011; Cohen et al., 2016; Poole et al., 2016; Eldan and Shamir, 2016). Those studies show that the complexity of a deep network can only be matched by a shallow one with exponentially more nodes. In other words, the complexity of the function of a deep architecture is exponential in depth but only polynomial in layer width. Some studies bound the model complexity with respect to certain structures or activation functions (Delalleau and Bengio, 2011; Du and Lee, 2018; Montufar et al., 2014; Bianchini and Scarselli, 2014; Poole et al., 2016). Delalleau and Bengio (2011) study sum-product networks and use the number of monomials to reflect model complexity. Pascanu et al. (2013) and Montufar et al. (2014) investigate fully connected neural networks with piecewise linear activation functions (e.g., ReLU and Maxout), and use the number of linear regions as a representation of complexity.
However, studies that consider model complexity only from the structural perspective cannot distinguish between two models with similar structures, a capability that is needed for problems such as understanding model training.
2.2. Model Parameters
Besides structures, the values of model parameters, including layer weights and biases, also play a central role in model complexity measures. The complexity of a model is sensitive to the values of its parameters. Raghu et al. (2017) propose a complexity measure for DNNs with piecewise linear activation functions. They follow the previous studies on DNNs with piecewise linear activation functions and use the number of linear regions as a reflection of model complexity (Pascanu et al., 2013; Montufar et al., 2014). To measure how many linear regions a data manifold is split into, Raghu et al. (2017) build a trajectory path from one input instance to another, then estimate model complexity by the number of linear region transitions along the trajectory path. Their trajectory length measure not only reflects the influences of model structures on model complexity, but is also sensitive to model parameters. They further study Batch Norm (Ioffe and Szegedy, 2015) using the complexity measure. Later, Novak et al. (2018) generalize the trajectory measure to investigate the relationship between complexity and generalization of DNNs with piecewise linear activation functions. However, the complexity measure using trajectory (Raghu et al., 2017) cannot be directly generalized to curve activation functions. In this paper, we propose a complexity measure for DNNs with curve activation functions by building their piecewise linear approximations. Our proposed measure can reflect the influences of both model structures and parameters.
3. Problem Formulation
A deep (fully connected) neural network (DNN for short) consists of a series of fully connected layers. Each layer includes an affine transformation and a nonlinear activation function. In classification tasks, let $f: \mathbb{R}^d \to \mathbb{R}^c$ represent a DNN model, where $d$ is the number of features of inputs, and $c$ the number of class labels. For an input instance $x \in \mathbb{R}^d$, $f$ can be written in the form of
$$f(x) = V\, h_L(h_{L-1}(\cdots h_1(x))) + b^*,$$
where $V$ and $b^*$, respectively, are the weight matrix and the bias vector of the output layer, $f(x)$ is the output vector corresponding to the $c$ class labels, $L$ is the number of hidden layers, and $h_i$ is the $i$-th hidden layer in the form of
$$h_i(z) = \phi(W_i z + b_i), \quad i = 1, \ldots, L,$$
where $W_i$ and $b_i$ are the weight matrix and the bias vector of the $i$-th hidden layer, respectively, and $\phi$ is the activation function. In this paper, if $z$ is a vector, we use $\phi(z)$ to represent the vector obtained by separately applying $\phi$ to each element of $z$. The commonly used activation functions can be divided into two groups according to their algebraic properties. First, a piecewise linear activation function is composed of a finite number of pieces of affine functions. Some commonly used piecewise linear activation functions include ReLU (Nair and Hinton, 2010) and hard Tanh (Nwankpa et al., 2018). With a piecewise linear $\phi$, the DNN model $f$ is a continuous piecewise linear function. Second, a curve activation function is a continuous nonlinear function whose geometric shape is a smooth curved line. Commonly used curve activation functions include Sigmoid (Kilian and Siegelmann, 1993) and Tanh (Kalman and Kwasny, 1992). With a curve activation function $\phi$, the DNN model $f$ is a curve function. In this paper, we are interested in fully connected neural networks with curve activation functions. We focus on two typical curve activation functions, Sigmoid (Kilian and Siegelmann, 1993) and Tanh (Kalman and Kwasny, 1992). Our methodology can be easily extended to other curve activation functions.
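The layered form described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names `affine` and `forward`, the toy weights, and the choice of Tanh as the activation are assumptions made here, not part of the paper.

```python
import math

def affine(W, b, z):
    """Return W z + b for a weight matrix W (list of rows) and bias vector b."""
    return [sum(w * x for w, x in zip(row, z)) + b_k for row, b_k in zip(W, b)]

def forward(x, hidden, V, b_out, phi=math.tanh):
    """Evaluate a fully connected network.

    hidden is a list of (W_i, b_i) pairs, one per hidden layer. Each hidden
    layer applies phi element-wise after its affine transformation; the
    output layer (V, b_out) is affine only.
    """
    z = x
    for W, b in hidden:
        z = [phi(a) for a in affine(W, b, z)]
    return affine(V, b_out, z)

# Toy network (hypothetical weights): 2 features, hidden widths 3 and 2, 2 classes.
hidden = [
    ([[0.5, -1.0], [1.5, 0.2], [-0.7, 0.9]], [0.0, 0.1, -0.2]),
    ([[1.0, -0.5, 0.3], [-1.2, 0.8, 0.4]], [0.05, -0.05]),
]
V = [[1.0, -1.0], [-0.5, 2.0]]
b_out = [0.0, 0.1]

scores = forward([0.3, -0.6], hidden, V, b_out)
predicted_class = max(range(len(scores)), key=scores.__getitem__)
```

Passing a piecewise linear function as `phi` (for example a hard Tanh) turns the same network into a continuous piecewise linear function, which is the distinction the two activation groups draw.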
Given a target model, which is a trained fully connected neural network with curve activation functions, we want to measure the model complexity. Here, the complexity reflects how nonlinear, or how curved, the function of the network is. Our complexity measure should take both the model structure and the parameters into consideration. To measure the model complexity, our main idea is to obtain a piecewise linear approximation of the target model, then use the number of linear segments of the approximation to reflect the target model complexity. This idea is inspired by the previous studies on DNNs with piecewise linear activation functions (Montufar et al., 2014; Novak et al., 2018; Raghu et al., 2017). To make our idea of measuring by approximation feasible, the approximation should satisfy two requirements. First, the quality (degree) of approximation should be guaranteed. Measuring complexity by the nonlinearity of an approximation is only meaningful if the approximation is very close to the function of the target model. In this case, the mimic learning approach (Hinton et al., 2015), which approximates by learning a student model under the guidance of the target model outputs, is not suitable, since it learns the behavior of the target model on a specific dataset and cannot guarantee generalizability, as illustrated in Figure 1. To ensure the closeness of the approximation functions to the target models, we propose the linear approximation neural network (LANN). A LANN is an approximation model that builds piecewise linear approximations to the activation functions in the target model. To make the approximation degree controllable and flexible, we design an individual approximation function for the activation function on every neuron separately according to their status distributions (Section 4.1).
Furthermore, we define a measure of approximation degree in terms of approximation error and analyze it through error propagation (Section 4.2). Second, the approximation should be constructed in a principled manner. To understand the rationale of this requirement, consider the example in Figure 2, where the target model is a curved line (the solid curve). One approximation (the red line in Figure 2(a)) is built using as few linear segments as possible. Another approximation (the red line in Figure 2(b)) evenly divides the input domain into small pieces and then approximates each piece using linear segments. Both of them can approximate the target model to a required approximation degree and can reflect the complexity of the target model. However, we should not use the first approximation on some occasions and the second on others to measure the complexity of the target model, since they are built following different protocols. To make the complexity measure comparable, the approximation should be constructed under a consistent protocol. We suggest constructing approximations under the protocol of using as few linear segments as possible (Section 4.3), and thus the minimum number of linear segments required to satisfy the approximation degree can reflect the model complexity.
4. LANN Architecture
To develop our complexity measure, we propose LANN, a piecewise linear approximation to the target model. In this section, we first introduce the architecture of LANN. Then, we discuss the degree of approximation. Last, we propose the algorithm of building a LANN.
4.1. Linear Approximation Neural Network
The function of a deep model with piecewise linear activation functions is piecewise linear, and has a finite number of linear regions (Montufar et al., 2014). The number of linear regions of such a model is commonly used to assess the nonlinearity of the model, i.e., the complexity (Montufar et al., 2014; Raghu et al., 2017). Motivated by this, we develop a piecewise linear approximation of the target model with curve activation functions, then use the number of linear regions of the approximation model as a reflection of the complexity of the target model. The approximation model we propose is called the linear approximation neural network (LANN).
Definition 1 (Linear Approximation Neural Network).
Given a fully connected neural network $f$, a linear approximation neural network $g$ is an approximation of $f$ in which each activation function $\phi$ in $f$ is replaced by a piecewise linear approximation function $\ell$.
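A LANN shares the same layer depth and width, as well as the same weight matrices and bias vectors, as the target model, except that it approximates every activation function using an individual piecewise linear function. This brings two advantages. First, designing an individual approximation function for each neuron makes the approximation degree of a LANN to the target model flexible and controllable. Second, the number of subfunctions of neurons is able to reflect the nonlinearity of the network. These two advantages will be further discussed in Section 4.2 and Section 5, respectively. A piecewise linear function $\ell(z)$ consisting of $k$ subfunctions (linear regions) can be written in the following form.
$$\ell(z) = \begin{cases} \alpha_1 z + \beta_1, & \eta_0 < z \le \eta_1 \\ \alpha_2 z + \beta_2, & \eta_1 < z \le \eta_2 \\ \quad\vdots \\ \alpha_k z + \beta_k, & \eta_{k-1} < z \le \eta_k \end{cases}$$
where $\alpha_i, \beta_i \in \mathbb{R}$ are the parameters of the $i$-th subfunction. Given a variable $z$, the $i$-th subfunction is activated if $z \in (\eta_{i-1}, \eta_i]$, denoted by $s(z) = i$. Let $\alpha^* = \alpha_{s(z)}$ and $\beta^* = \beta_{s(z)}$ be the parameters of the activated subfunction. We have $\ell(z) = \alpha^* z + \beta^*$. Let $\phi_{i,j}$ be the activation function of the neuron $\{i,j\}$, which represents the $j$-th neuron in the $i$-th layer. Then, $\ell_{i,j}$ is the approximation of $\phi_{i,j}$. Let $\ell_i = \{\ell_{i,1}, \ldots, \ell_{i,m_i}\}$ be the set of approximation functions for the $i$-th hidden layer, where $m_i$ is the width of the $i$-th hidden layer. The $i$-th layer of a LANN can be written as
$$h'_i(z) = \ell_i(W_i z + b_i).$$
Then, a LANN is in the form of
$$g(x) = V\, h'_L(h'_{L-1}(\cdots h'_1(x))) + b^*.$$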
Since the composition of piecewise linear functions is piecewise linear, a LANN is a piecewise linear neural network. A linear region of the piecewise linear neural network can be represented by the activation pattern (this term follows the convention in (Raghu et al., 2017)):
Definition 2 (Activation pattern).
An activation pattern of a piecewise linear neural network is the set of activation statuses of all neurons, denoted by where is the activation status of neuron .
Given an arbitrary input , the corresponding activation pattern is determined. With the fixed , the transformation of of any layer
is reduced to a linear transformation that can be written in the following square matrix.
where and are the parameters of the activated subfunction of neuron , and are determined by . The piecewise linear neural network is reduced to a linear function with
An activation pattern corresponds to a linear region of the piecewise linear neural network. Given two different activation patterns, the matrix of at least one layer is different, and so are the corresponding linear functions. Thus, a linear region of the piecewise linear neural network can be expressed by a unique activation pattern. That is, the activation pattern represents the linear region including .
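The reduction from a piecewise linear network to a single affine map, once the activation pattern is fixed, can be checked numerically. The following is a minimal sketch under stated assumptions: a hard Tanh with three subfunctions stands in for the per-neuron approximation functions, the network weights are hypothetical, and the names `status`, `forward_with_pattern`, `PIECES`, and `THRESHOLDS` are introduced here for illustration only. The reduced map is kept as a separate matrix `A` and offset `c` rather than a single augmented square matrix.

```python
import bisect

# Hypothetical piecewise linear activation (hard Tanh) with three subfunctions:
# (slope, intercept) pairs and the thresholds separating adjacent subfunctions.
PIECES = [(0.0, -1.0), (1.0, 0.0), (0.0, 1.0)]
THRESHOLDS = [-1.0, 1.0]

def status(z):
    """Index of the subfunction activated by pre-activation z (left-open intervals)."""
    return bisect.bisect_left(THRESHOLDS, z)

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def forward_with_pattern(x, hidden, V, b_out):
    """Evaluate the network and also return the activation pattern and the
    affine map (A, c) that the network reduces to on the linear region of x."""
    n = len(x)
    A = [[1.0 if r == k else 0.0 for k in range(n)] for r in range(n)]
    c = [0.0] * n
    z, pattern = x, []
    for W, b in hidden:
        pre = [p + b_k for p, b_k in zip(matvec(W, z), b)]
        s = [status(p) for p in pre]
        pattern.append(tuple(s))
        alpha = [PIECES[k][0] for k in s]
        beta = [PIECES[k][1] for k in s]
        z = [a * p + t for a, p, t in zip(alpha, pre, beta)]
        # Compose the affine maps: diag(alpha) (W (A x + c) + b) + beta.
        WA, Wc = matmul(W, A), matvec(W, c)
        A = [[a * w for w in row] for a, row in zip(alpha, WA)]
        c = [a * (wc + b_k) + t for a, wc, b_k, t in zip(alpha, Wc, b, beta)]
    out = [o + b_k for o, b_k in zip(matvec(V, z), b_out)]
    c = [o + b_k for o, b_k in zip(matvec(V, c), b_out)]
    A = matmul(V, A)
    return out, tuple(pattern), (A, c)

hidden = [([[1.0, -2.0], [0.5, 0.5]], [0.0, 0.2]),
          ([[2.0, 1.0], [-1.0, 1.0]], [0.1, 0.0])]
V, b_out = [[1.0, -1.0]], [0.3]

x = [0.4, 0.1]
out, pattern, (A, c) = forward_with_pattern(x, hidden, V, b_out)
linear_out = [o + c_k for o, c_k in zip(matvec(A, x), c)]  # agrees with out
```

Two inputs that share an activation pattern are mapped by the same `(A, c)`, while an input with a different pattern generally yields a different affine map.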
4.2. Degree of Approximation
We measure the complexity of models with respect to approximation degree. We first define a measure of approximation degree using approximation error. Then, we analyze approximation error of LANN in terms of neuronal approximation functions.
Definition 3 (Approximation error).
Let $\ell$ be an approximation function of $\phi$. Given input $z$, the approximation error of $\ell$ at $z$ is $e(z) = |\ell(z) - \phi(z)|$. Given a deep neural network $f$ and a linear approximation neural network $g$ learned from $f$, we define the approximation error of $g$ to $f$ as the expectation of the absolute distance between their outputs:
$$\mathcal{E}(g; f) = \mathbb{E}\big[\,|g(x) - f(x)|\,\big].$$
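In practice, an expectation of this kind can be estimated by averaging over sampled inputs. The sketch below is an assumption-laden illustration rather than the paper's procedure: it uses a single Tanh neuron with a scalar output as the target, a hard Tanh as a crude piecewise linear stand-in, standard normal inputs, and the hypothetical helper name `approximation_error`.

```python
import math
import random

def hard_tanh(z):
    """A crude piecewise linear stand-in for tanh, used only for illustration."""
    return max(-1.0, min(1.0, z))

def neuron(x, w, b, act):
    """One neuron: activation applied to an affine function of the input."""
    return act(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b)

def approximation_error(f, g, samples):
    """Monte Carlo estimate of E[|g(x) - f(x)|] over the given inputs."""
    return sum(abs(g(x) - f(x)) for x in samples) / len(samples)

rng = random.Random(0)
samples = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(2000)]
w, b = [0.8, -0.5], 0.1  # hypothetical parameters shared by target and approximation

f = lambda x: neuron(x, w, b, math.tanh)   # target with a curve activation
g = lambda x: neuron(x, w, b, hard_tanh)   # same parameters, piecewise linear activation
err = approximation_error(f, g, samples)
```

Because `f` and `g` share weights and biases, the estimate isolates the error introduced by replacing the activation function, which is the quantity the definition is concerned with.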
A LANN is learned by conducting piecewise linear approximation to every activation function. The approximation of every activation may produce an approximation error. The approximation error of a LANN is the accumulation of all neurons’ approximation errors. In literature (Raghu et al., 2017; Eldan and Shamir, 2016), approximation error of activation is treated as a small perturbation added to a neuron, and is observed to grow exponentially through forward propagation. Based on this, we go a step further to estimate the contribution of perturbation of every neuron to the model output by analyzing error propagation. Consider a target model and its LANN approximation . According to Definition 3, the approximation error of of corresponding to neuron can be rewritten as . Suppose the same input instance is fed into and simultaneously. After the forward computation of the first hidden layers, let be the output difference of the -th hidden layer between and , and for the -th layer. Let denote the input to the -th layer, also the output of the -th layer of . We can compute by
The absolute value of is
To keep the discussion simple, we write . The first term of the righthand side of Eq. (10) is
where is a vector consisting of every neuron’s approximation error of the -th layer. Applying the first-order Taylor expansion to the second term of Eq. (10), we have:
Assuming and being independent, the expectation of is
where the error , where denotes the error in , in other words, the disturbances of on the distribution of . Since is a vector where the elements correspond to the neurons in the -layer layer, the expectation of is
is probability density function (PDF) of neuron. We notice that consists of a linear transformation followed by activation . Therefore, the Jacobian matrix can be computed by . The -th row of is
where the subscript means the -th row of the matrix. The above process describes the propagation of approximation error through the -th hidden layer. Applying the propagation calculation recursively from the first hidden layer to the output layer, we have the following result.
Theorem 4 (Approximation error propagation).
Given a deep neural network and a linear approximation neural network learned from . The approximation error
where, for ,
here sums up the -th columns, is the amplification coefficient of reflecting its amplification in the subsequent layers to influence the output, and is independent from the approximation of and is only determined by . When is small and the approximation of is very close to , the error can be ignored, is roughly considered a linear combination of with amplification coefficient .
4.3. Approximation Algorithm
We use the LANN with the smallest number of linear regions that meets the requirement of approximation degree, which is measured by the approximation error , to assess the complexity of a model. Unfortunately, the actual number of linear regions corresponding to the data manifold (Bishop, 2006) in the input space is unknown. To tackle this challenge, we notice that a piecewise linear activation function with $k$ subfunctions contributes $k-1$ hyperplanes to the input-space partition (Montufar et al., 2014). Motivated by this, we propose to minimize the number of hyperplanes, with the expectation that doing so also minimizes the number of linear regions. Formally, under a requirement of approximation degree , our algorithm learns a LANN model with minimum . Before presenting our algorithm, we first introduce how we obtain the PDF of neuron .
4.3.1. Distribution of activation function
In Section 4.2, in order to compute and , we introduce the probability density function of neuron . To compute , the distribution of the activation function is involved. The distribution of an activation function describes how the outputs (or inputs) of a neuronal activation function are distributed with respect to the data manifold. It is influenced by the parameters of previous layers and the distribution of input data. Since the common curve activation functions are bounded to a small output range, to simplify the calculation, we study the posterior distribution of an activation function (Frey and Hinton, 1999; Ioffe and Szegedy, 2015) instead of the input distribution. To estimate the posterior distribution, we use kernel density estimation (KDE) (Silverman, 2018) with a Gaussian kernel, and use the outputs of the activation function on the training dataset as the distributed samples . We have where the bandwidth is chosen by the rule-of-thumb estimator (Silverman, 2018). To compute and , we uniformly sample points within the output range of , where . We then use the expectation on these samples as an estimation of .
The output of is smooth and lies in a small range. Setting a large sample size does not lead to obvious improvement in the expectation estimation. In our experiments, we set . Notice that is the output of . The corresponding input is . Thus, . is computed in the same way.
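The estimation steps above (a Gaussian KDE over activation outputs, a rule-of-thumb bandwidth, then a density-weighted average over uniformly spaced output values) can be sketched as follows. Several details are assumptions made for this illustration: the specific rule-of-thumb variant `0.9 * min(std, IQR / 1.34) * n^(-1/5)`, the 200 evaluation points, the synthetic neuron outputs, and all function names.

```python
import math
import random
import statistics

def silverman_bandwidth(samples):
    """Rule-of-thumb bandwidth 0.9 * min(std, IQR / 1.34) * n^(-1/5)."""
    n = len(samples)
    std = statistics.stdev(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4)
    spread = min(std, (q3 - q1) / 1.34) if q3 > q1 else std
    return 0.9 * spread * n ** (-0.2)

def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates the Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def pdf(t):
        return norm * sum(math.exp(-0.5 * ((t - s) / bandwidth) ** 2) for s in samples)
    return pdf

def expected_value(fn, pdf, low, high, num_points=200):
    """Density-weighted average of fn over uniformly spaced points in [low, high]."""
    step = (high - low) / (num_points - 1)
    points = [low + k * step for k in range(num_points)]
    weights = [pdf(t) for t in points]
    return sum(w * fn(t) for w, t in zip(weights, points)) / sum(weights)

# Synthetic stand-in for the outputs of one Tanh neuron on a training set.
rng = random.Random(0)
outputs = [math.tanh(rng.gauss(0.0, 1.0)) for _ in range(300)]

bandwidth = silverman_bandwidth(outputs)
pdf = gaussian_kde(outputs, bandwidth)

# Expected error of approximating tanh by its tangent at 0 (the line y = z).
# An output value t corresponds to the input z = atanh(t), so the error is |z - t|.
expected_err = expected_value(lambda t: abs(math.atanh(t) - t), pdf, -0.99, 0.99)
```

Working on the bounded output range keeps the sampling grid small, and each output value is mapped back to its input through the inverse activation when the error is evaluated.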
4.3.2. Piecewise linear approximation of activation
To minimize , the piecewise linear approximation function of an arbitrary neuron is initialized with a linear function (). Then every new subfunction is added to to minimize the value of . Every subfunction is a tangent line of . The initialization is the tangent line at , which corresponds to the linear regime of the activation function (Ioffe and Szegedy, 2015). A new subfunction is added to the next tangent point , which is found from the set of uniformly sampled points . That is,
where subscript means that with additional tangent line of is used in computing . Algorithm 1 shows the pseudocode of determining the next tangent point.
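The greedy choice of the next tangent point can be illustrated with a small sketch for Tanh. It is not Algorithm 1 itself: the sample grid, the Gaussian-shaped weights that stand in for the estimated density, and the names `tangent`, `pwl_tanh`, `expected_error`, and `next_tangent_point` are all assumptions made here.

```python
import math

def tangent(p):
    """Slope and intercept of the tangent line of tanh at p."""
    slope = 1.0 - math.tanh(p) ** 2
    return slope, math.tanh(p) - slope * p

def pwl_tanh(z, points):
    """Piecewise linear approximation built from tangent lines at `points`.

    tanh is concave for z >= 0 and convex for z <= 0, so the tangents at
    non-negative points are combined by a minimum and those at non-positive
    points by a maximum. `points` must contain 0.0.
    """
    if z >= 0:
        return min(a * z + b for a, b in (tangent(p) for p in points if p >= 0))
    return max(a * z + b for a, b in (tangent(p) for p in points if p <= 0))

def expected_error(points, grid, weights):
    """Weighted mean of |pwl(z) - tanh(z)| over the sample grid."""
    total = sum(w * abs(pwl_tanh(z, points) - math.tanh(z)) for z, w in zip(grid, weights))
    return total / sum(weights)

def next_tangent_point(points, grid, weights):
    """Greedy step: the grid point whose tangent line lowers the error the most."""
    candidates = [z for z in grid if z not in points]
    return min(candidates, key=lambda z: expected_error(points + [z], grid, weights))

grid = [-4.0 + 0.25 * k for k in range(33)]        # uniformly spaced inputs in [-4, 4]
weights = [math.exp(-0.5 * z * z) for z in grid]   # hypothetical density over inputs

points = [0.0]                                     # start from the tangent at 0
before = expected_error(points, grid, weights)
points.append(next_tangent_point(points, grid, weights))
after = expected_error(points, grid, weights)
```

Each added tangent line contributes one more subfunction, so repeating the greedy step trades a larger number of linear pieces for a smaller expected error.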
4.3.3. Building LANNs
To minimize , the algorithm starts with initializing every approximation function with a linear function (). Then, we iteratively add a subfunction to the approximation function of a certain neuron to decrease to the most degree in each step. In Eq. (20), when building a LANN, the error cannot be ignored because is large. The amplification coefficient of lower layer is exponentially larger than that of the upper layer. Otherwise, error grows exponentially from lower to upper layer. Deriving this formula to get the exact weight of is complicated. A simple way is to roughly consider each to be equally important in the algorithm. Specifically, for a neuron from the first layer, small is desired due to a large magnitude of even through . Another neuron from the last hidden layer, its amplification coefficient is with the lowest magnitude over all layers but is not ignorable and may influence the distribution of neuron status, thus approximation with small is desired to decrease the value of and .
Algorithm 2 outlines the LANN building algorithm. To reduce the amount of computation, we set a batch size to process a group of neurons at a time. The complexity of Algorithm 2 is . The time cost of the first loop is . The second loop repeats times, and within each iteration the computation cost is , where is the sample size of , is the number of instances of .
5. Model Complexity
The number of linear regions in LANN reflects how nonlinear, or how complex the function of the target model is. In this section, we propose an upper bound to the number of linear regions, then propose the model complexity measure based on the upper bound. The idea of measuring model complexity using the number of linear regions is common in piecewise linear neural networks (Pascanu et al., 2013; Raghu et al., 2017; Montufar et al., 2014; Novak et al., 2018). We generalize their results to the LANN model, of which the major difference is that, in LANN, each piecewise linear activation function has different form and different number of subfunctions.
Theorem 1 (Upper bound).
Given a linear approximation neural network with hidden layers. Let be the width of the -th layer and the number of subfunctions of . The number of linear regions of is upper bounded by .
Please see Appendix A.1 for the proof of Theorem 1. This theorem indicates that the number of linear regions is polynomial with respect to layer width and exponential with respect to layer depth. This is consistent with the previous studies on the power of neural networks (Bianchini and Scarselli, 2014; Bengio and Delalleau, 2011; Poole et al., 2016; Eldan and Shamir, 2016). Meanwhile, the value of reflects the nonlinearity of the corresponding neuron according to the status distribution of activation functions. The distribution is influenced by both model parameters and data manifold. Thus, this upper bound reflects the impact of model parameters on complexity. Based on this upper bound, we define the complexity measure.
Definition 0 (Complexity measure).
Given a deep neural network and a linear approximation neural network learned from with approximation degree , the -approximation complexity measure of is
This complexity measure is essentially a simplification of our proposed upper bound by logarithm. We recommend to select from the range of when converges to a constant. (Appendix A.2)
6. Insight from Complexity
In this section, we conduct several empirical studies to shed more light on the complexity measure. First, we investigate the various contributions of hidden neurons to model stability. Then, we examine the changing trend of model complexity in the training process. After that, we study the occurrence of overfitting and the effect of $L^1$ and $L^2$ regularizations. Finally, we propose two new simple and effective approaches to prevent overfitting.
| Dataset | Sec 6.1 | Sec 6.2 | Sec 6.3, 6.4 |
|---|---|---|---|
| MNIST | L3M300 | L3M100, L6M100, L3M200 | - |
| CIFAR | L3M300 | L3M200, L6M200, L3M400 | L3M(768,256,128) |
Our experiments and evaluations are conducted on both a synthetic dataset (Two-Moons, generated by the sklearn.datasets.make_moons API) and real-world datasets (MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky and Hinton, 2009)). To demonstrate that the reliability of the complexity measure does not depend on model structures, we design multiple model structures. We use an approximation degree of 0.1 for the complexity measure in all experiments, which sits in our suggested range for all models we used. Table 1 summarizes the model structures we used, where L3 indicates that the network has 3 hidden layers, M300 means that each layer contains 300 neurons, and M(32,128,16) means that the first, second, and third layers contain 32, 128, and 16 neurons, respectively. Subscripts and stand for the activation functions Sigmoid and Tanh, respectively.
6.1. Hidden Neurons and Stability
As discussed in Section 4.2, the amplification coefficient (Eq. 20) is defined by the multiplication of through subsequent layers. measures the magnification effect of the perturbation on neuron in subsequent layers. In other words, the amplification coefficient reflects the effect of a neuron on model stability. Figure 3 visualizes the amplification coefficients of trained models on the MNIST and CIFAR datasets, showing that neurons from the lower layers have greater amplification factors. To exclude the influence of varying layer widths, all layers of each model have equal width. Besides the amplification coefficient, we also visualize , the error accumulation of all previous layers. According to our analysis, is expected to have the opposite trend to : of upper layers is expected to be exponentially larger than that of lower layers. Figure 4 shows the error accumulation on the same models. To verify that a small perturbation at a lower layer can cause greater influence on the model outputs than one at an upper layer, we randomly ablate neurons (i.e., fix the neuron output to 0) from one layer of a well-trained model and observe the number of instances whose prediction labels are consequently flipped. The results of ablating different layers are shown in Figure 5.
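The ablation procedure described above can be sketched as follows. This is only an illustration of the mechanics: the network here has random, untrained weights and synthetic inputs (the experiment in the text uses well-trained models on MNIST and CIFAR), and the names `forward`, `predict`, `count_flips`, and `random_layer` are hypothetical.

```python
import math
import random

def forward(x, hidden, V, b_out, ablate=None):
    """Forward pass of a Tanh network. `ablate` is an optional
    (layer_index, set_of_neuron_indices) pair whose outputs are forced to 0."""
    z = x
    for i, (W, b) in enumerate(hidden):
        z = [math.tanh(sum(w * v for w, v in zip(row, z)) + b_k) for row, b_k in zip(W, b)]
        if ablate is not None and ablate[0] == i:
            z = [0.0 if j in ablate[1] else v for j, v in enumerate(z)]
    return [sum(w * v for w, v in zip(row, z)) + b_k for row, b_k in zip(V, b_out)]

def predict(x, hidden, V, b_out, ablate=None):
    """Index of the largest output score."""
    scores = forward(x, hidden, V, b_out, ablate)
    return max(range(len(scores)), key=scores.__getitem__)

def count_flips(inputs, hidden, V, b_out, layer, num_ablate, rng):
    """Ablate `num_ablate` random neurons in one hidden layer and count the
    inputs whose predicted label changes."""
    width = len(hidden[layer][0])
    ablate = (layer, set(rng.sample(range(width), num_ablate)))
    return sum(
        predict(x, hidden, V, b_out) != predict(x, hidden, V, b_out, ablate)
        for x in inputs
    )

rng = random.Random(0)

def random_layer(n_out, n_in):
    """Random Gaussian weights and zero biases (a stand-in for trained parameters)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)], [0.0] * n_out

hidden = [random_layer(8, 4), random_layer(8, 8), random_layer(8, 8)]
V, b_out = random_layer(3, 8)
inputs = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]

flips_per_layer = [count_flips(inputs, hidden, V, b_out, layer, 2, rng) for layer in range(3)]
```

Comparing the entries of `flips_per_layer` across layers mirrors the comparison reported in Figure 5; with untrained random weights the counts only demonstrate the procedure and say nothing about the trend observed in the paper.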
6.2. Complexity in Training
In this experiment, we investigate the trend of changes in model complexity in the training process. Figure 6 shows the periodically-recorded model complexity measure during training based on the 0.1-approximation complexity measure . From this figure, we can observe the soaring model complexity along with the training, which indicates that the learned deep neural networks become increasingly complicated. Figure 6 sheds light on how the model structure influences the complexity measure. Particularly, it is clear to see that increases in both width and depth can increase the model complexity. Furthermore, with the same number of neurons, the complexity of a deep and narrow model (L6M100 on MNIST, L6M200 on CIFAR) is much higher than a shallow and wide one (L3M200 on MNIST, L3M400 on CIFAR). This agrees with the existing studies on the effectiveness of width and depth of DNNs (Montufar et al., 2014; Pascanu et al., 2013; Bengio and Delalleau, 2011; Eldan and Shamir, 2016).
6.3. Overfitting and Complexity
The complexity measure through LANNs can be used to understand overfitting. Overfitting usually occurs when training a model that is unnecessarily flexible (Hawkins, 2004). Due to their high flexibility and strong ability to accommodate curvilinear relationships, deep neural networks suffer from overfitting if they are learned by maximizing the performance on the training set rather than discovering patterns that generalize to new data (Goodfellow et al., 2016). Previous studies (Hawkins, 2004) show that an overfitting model is more complex than one that does not overfit. This idea is intuitively demonstrated by the polynomial fit example in Figure 1.
Regularization is an effective approach to preventing overfitting by adding a regularizer to the loss function, especially $L^1$ and $L^2$ regularization (Goodfellow et al., 2016). $L^1$ regularization results in a sparser model, and $L^2$ regularization results in a model with small weight parameters. A natural hypothesis is that these regularization approaches can succeed in restricting the model complexity. To verify this, we train deep models on the MOON dataset with and without regularization. After 2,000 training epochs, their decision boundaries and complexity measures are shown in Figure 7. The results demonstrate the effectiveness of $L^1$ and $L^2$ regularization in preventing overfitting and constraining the increase of model complexity. We also measure model complexity during training, after each epoch on CIFAR, with and without $L^1$ and $L^2$ regularization. The results are shown in Figure 8. Figure 8(a) shows the overfitting degree measured by , and Figure 8(b) shows the corresponding complexity measure . The results verify the conjecture that regularization constrains the increase of model complexity.
6.4. New Approaches for Preventing Overfitting
Motivated by the well-observed significant correlation between the occurrence of overfitting and the increasing model complexity, we propose two approaches to prevent overfitting by directly suppressing the rising trend of the model complexity during training.
6.4.1. Neuron Pruning
From the definition of complexity (Def. 2), we know that constraining model complexity , i.e., restraining the variable for each neuron, is equivalent to constraining the non-linearity of the distribution of a neuron. Thus, we can periodically prune the neurons with the largest values of , after each training epoch. This is inspired by the fact that a larger value of implies a higher probability that the distribution is located in the nonlinear range and therefore requires a larger . Pruning neurons with a potentially large degree of non-linearity can effectively suppress the rise of model complexity. At the same time, pruning a limited number of neurons is unlikely to significantly decrease model performance. Practical results demonstrate that this approach, though simple, is quite efficient and effective.
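One way the pruning step could look in practice is sketched below. The pruning score is an assumption: the example uses the mean absolute pre-activation of each neuron as a stand-in for the quantity named in the text, and the function name `prune_layer`, the argument layout, and the pruning fraction are hypothetical.

```python
import numpy as np

def prune_layer(W_in, b, W_out, pre_acts, frac):
    """Prune the `frac` fraction of neurons with the largest score in one hidden layer.

    W_in:     (n_inputs, n_neurons) incoming weights
    b:        (n_neurons,) biases
    W_out:    (n_neurons, n_next) outgoing weights
    pre_acts: (n_samples, n_neurons) pre-activation values on a batch
    """
    # Assumed score: mean |pre-activation|, large when a neuron sits in the nonlinear range.
    score = np.abs(pre_acts).mean(axis=0)
    n_prune = int(round(frac * score.size))
    if n_prune == 0:
        return W_in, b, W_out, np.array([], dtype=int)
    pruned = np.argsort(score)[-n_prune:]
    # Work on copies so the caller's parameters are left untouched.
    W_in, b, W_out = W_in.copy(), b.copy(), W_out.copy()
    W_in[:, pruned] = 0.0   # remove incoming connections
    b[pruned] = 0.0
    W_out[pruned, :] = 0.0  # remove the neuron's contribution to the next layer
    return W_in, b, W_out, pruned

# Toy usage with random parameters; in training this would run after each epoch.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
W_in, b, W_out = rng.normal(size=(5, 8)), np.zeros(8), rng.normal(size=(8, 3))
pre = X @ W_in + b
W_in2, b2, W_out2, pruned = prune_layer(W_in, b, W_out, pre, frac=0.125)
print(pruned)
```

Zeroing the weights rather than deleting rows and columns keeps all array shapes fixed, which makes it simple to apply between epochs of an otherwise unchanged training loop.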
6.4.2. Customized $L^1$ Regularization
This approach assigns a customized coefficient to every column of the weight matrix when applying $L^1$ regularization. Each column corresponds to a specific neuron and receives the coefficient:
One explanation is that equals the expectation of the first-order derivative of . With a larger value of , the distribution is located with higher probability in the linear range of the activation function (). The customized approach assigns larger sparsity penalty weights to more linearly distributed neurons, so neurons with more nonlinear distributions can maintain their expressive power. Another way to understand this approach is through Eq. (19), . That is, the formulation of customized $L^1$ can be interpreted as a constraint on , which results in a smaller as well as a smaller . Customized $L^1$ is more flexible than normal $L^1$ regularization and thus behaves better with a large penalty weight.
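A per-neuron $L^1$ penalty of this kind might be computed as in the following sketch. It rests on two assumptions that are not confirmed by the text as extracted: the coefficient of a neuron is taken to be the batch mean of the derivative of a tanh activation at its pre-activations, and the coefficients are rescaled so that their mean equals a penalty weight `lam`. All function and variable names are hypothetical.

```python
import numpy as np

def tanh_grad(t):
    """First-order derivative of tanh."""
    return 1.0 - np.tanh(t) ** 2

def customized_l1(W, pre_acts, lam):
    """Per-column L1 penalty with a neuron-specific coefficient.

    W:        (n_inputs, n_neurons) weight matrix, one column per neuron
    pre_acts: (n_samples, n_neurons) pre-activation values on a batch
    lam:      target mean of the coefficients (the ordinary L1 penalty weight)
    """
    # Assumed coefficient: expected derivative of the activation, close to 1
    # for neurons operating in the linear range and close to 0 when saturated.
    coef = tanh_grad(pre_acts).mean(axis=0)
    coef = coef * (lam / coef.mean())  # rescale so that mean(coef) == lam
    penalty = float((coef * np.abs(W).sum(axis=0)).sum())
    return penalty, coef

def plain_l1(W, lam):
    """Ordinary L1 penalty with a single weight for all columns."""
    return lam * float(np.abs(W).sum())

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 4))
pre = rng.normal(size=(100, 4)) * np.array([0.1, 0.5, 2.0, 5.0])  # increasingly saturated neurons
penalty, coef = customized_l1(W, pre, lam=0.01)
print(coef, penalty, plain_l1(W, 0.01))
```

When all neurons have the same pre-activation distribution, every coefficient equals `lam` and the customized penalty reduces to the ordinary one; the two differ only when some neurons are more saturated than others.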
Figure 9 compares the decision boundaries of the models trained with different regularization approaches on the MOON dataset. Table 2 records the corresponding complexity measure and the number of split linear regions over the input space. Figure 10 shows the overfitting and complexity measures during the training of models on CIFAR. In our experiments, the neuron pruning percentage is set to . These figures demonstrate that neuron pruning can constrain overfitting and model complexity while still retaining satisfactory model performance. We scale the customized coefficient to so that its mean value is equal to the penalty weight of $L^1$, denoted by . Our results show that, with a small penalty weight, the customized approach behaves similarly to normal $L^1$. With a large penalty weight, the performance of the model trained with normal $L^1$ is affected, and its test accuracy decreases by , whereas the customized approach retains its performance (Appendix A.3).
7. Conclusion
In this paper, we develop a complexity measure for deep neural networks with curve activation functions. In particular, we first propose the linear approximation neural network (LANN), a piecewise linear framework, to both approximate a given DNN model to a required approximation degree and minimize the number of resulting linear regions. After providing an upper bound on the number of linear regions formed by LANNs, we define the complexity measure based on this upper bound. To examine the effectiveness of the complexity measure, we conduct an empirical analysis, which demonstrates a positive correlation between the occurrence of overfitting and the growth of model complexity during training. From the viewpoint of our complexity measure, further analysis reveals that regularization indeed suppresses the increase of model complexity. Based on this discovery, we finally propose two approaches to prevent overfitting by directly constraining model complexity: neuron pruning and customized regularization. There are several future directions, including generalizing our proposed linear approximation neural network to other network architectures (e.g., CNNs and RNNs).
References
- Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory 39 (3), pp. 930–945. Cited by: §1, §2.1.
- On the expressive power of deep architectures. In International Conference on Algorithmic Learning Theory, pp. 18–36. Cited by: §1, §2.1, §5, §6.2.
- On the complexity of neural network classifiers: a comparison between shallow and deep architectures. IEEE transactions on neural networks and learning systems 25 (8), pp. 1553–1565. Cited by: §2.1, §5.
- Pattern recognition and machine learning. springer. Cited by: §4.3.
- State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE ICASSP, pp. 4774–4778. Cited by: §1.
- . In Conference on Learning Theory, pp. 698–728. Cited by: §1, §2.1.
- Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems 2 (4), pp. 303–314. Cited by: §2.1.
- Shallow vs. deep sum-product networks. In Advances in NIPS, pp. 666–674. Cited by: §1, §2.1.
- Deep learning for event-driven stock prediction. In Proceeding of the 24th IJCAI, Cited by: §1.
- On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206. Cited by: §2.1.
- The power of depth for feedforward neural networks. In Conference on learning theory, pp. 907–940. Cited by: §1, §2.1, §4.2, §5, §6.2.
- Variational learning in nonlinear gaussian belief networks. Neural Computation 11 (1), pp. 193–213. Cited by: §4.3.1.
- Graph representation learning via hard and channel-wise attention networks. In Proceedings of the 25th ACM SIGKDD, pp. 741–749. Cited by: §1.
- Deep learning. MIT press. Cited by: §1, §6.3.
- The problem of overfitting. Journal of chemical information and computer sciences 44 (1), pp. 1–12. Cited by: §1, §1, §6.3.
- On the selection of initialization and activation function for deep neural networks. arXiv preprint arXiv:1805.08266. Cited by: §1.
- Distilling the knowledge in a neural network. stat 1050, pp. 9. Cited by: §1, §3.
- Multilayer feedforward networks are universal approximators. Neural networks 2 (5), pp. 359–366. Cited by: §1, §2.1.
- Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §1, §2.2, §4.3.1, §4.3.2.
- Why tanh: choosing a sigmoidal function. In [Proceedings 1992] IJCNN, Vol. 4, pp. 578–581. Cited by: §1, §3.
- On the power of sigmoid neural networks. In Proceedings of the 6th annual conference on Computational learning theory, pp. 137–143. Cited by: §1, §3.
- Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto. Cited by: §6.
- Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §6.
- The expressive power of neural networks: a view from the width. In Advances in NIPS, pp. 6231–6239. Cited by: §1, §2.1.
- A comparison of the computational power of sigmoid and boolean threshold circuits. In Theoretical Advances in Neural Computation and Learning, pp. 127–151. Cited by: §2.1.
- On the number of linear regions of deep neural networks. In Advances in NIPS, pp. 2924–2932. Cited by: §A.1, §1, §1, §2.1, §2.2, §3, §4.1, §4.3, §5, §6.2.
- Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th ICML, pp. 807–814. Cited by: §3.
- Sensitivity and generalization in neural networks: an empirical study. In ICLR, Cited by: §1, §2.2, §3, §5.
- Activation functions: comparison of trends in practice and research for deep learning. arXiv preprint arXiv:1811.03378. Cited by: §3.
- On the number of response regions of deep feed forward networks with piece-wise linear activations. arXiv preprint arXiv:1312.6098. Cited by: §A.1, §1, §1, §2.1, §2.2, §5, §6.2.
- Exponential expressivity in deep neural networks through transient chaos. In Advances in NIPS, pp. 3360–3368. Cited by: §2.1, §5.
- On the expressive power of deep neural networks. In Proceedings of the 34th ICML-Volume 70, pp. 2847–2854. Cited by: §A.1, §1, §1, §2.2, §3, §4.1, §4.1, §4.2, §5.
- Density estimation for statistics and data analysis. Routledge. Cited by: §4.3.1.
- Unsupervised clinical language translation. Proceedings of the 25th ACM SIGKDD. Cited by: §1.
Appendix A Proof and Discussions
A.1. Proof of Theorem 1
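First of all, according to (Montufar et al., 2014; Raghu et al., 2017; Pascanu et al., 2013), the total number of linear regions into which $k$ hyperplanes divide the input space $\mathbb{R}^d$ is upper bounded by $\sum_{i=0}^{d}\binom{k}{i}$, whose upper bound can be obtained using the binomial theorem:
$$\sum_{i=0}^{d}\binom{k}{i} \le (k+1)^d. \tag{25}$$
Now consider the first hidden layer $h_1$ of a LANN model. A piecewise linear function consisting of $k$ subfunctions contributes $k-1$ hyperplanes to the splitting of the input space. The first layer contains $m_1$ neurons, with the $j$-th neuron consisting of $k_{1,j}$ subfunctions. So $h_1$ contributes $\sum_{j=1}^{m_1} k_{1,j} - m_1$ hyperplanes to the splitting of the input space, and divides $\mathbb{R}^d$ into linear regions whose number is upper bounded (Eq. 25) by
$$\Big(\sum_{j=1}^{m_1} k_{1,j} - m_1 + 1\Big)^d.$$
Now move to the second hidden layer $h_2$. Each linear region produced by the first layer can be divided by the hyperplanes of $h_2$ into at most $\big(\sum_{j=1}^{m_2} k_{2,j} - m_2 + 1\big)^d$ smaller regions. Thus, the total number of linear regions generated by $h_1$ and $h_2$ is at most
$$\Big(\sum_{j=1}^{m_1} k_{1,j} - m_1 + 1\Big)^d \Big(\sum_{j=1}^{m_2} k_{2,j} - m_2 + 1\Big)^d.$$
Recursively repeat this calculation up to the last hidden layer $h_L$. Finally, the number of linear regions produced by the LANN is at most
$$\prod_{i=1}^{L}\Big(\sum_{j=1}^{m_i} k_{i,j} - m_i + 1\Big)^d.$$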
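The counting argument in this proof can be checked numerically with a short sketch. The hyperplane-arrangement count and its binomial bound are standard results; the layer-wise product form used in `lann_region_bound` is an assumption reconstructed from the steps of the proof, and all function names are hypothetical.

```python
from math import comb

def arrangement_regions(k, d):
    """Maximum number of regions cut out of R^d by k hyperplanes: sum_{i=0}^{d} C(k, i)."""
    return sum(comb(k, i) for i in range(d + 1))

def layer_hyperplanes(pieces):
    """A neuron with p linear pieces has p - 1 breakpoints, hence p - 1 hyperplanes."""
    return sum(p - 1 for p in pieces)

def lann_region_bound(pieces_per_layer, d):
    """Assumed layer-wise bound: product over layers of (hyperplanes + 1)^d."""
    bound = 1
    for pieces in pieces_per_layer:
        bound *= (layer_hyperplanes(pieces) + 1) ** d
    return bound

# 3 lines split the plane into at most 7 regions, below the binomial bound (3 + 1)^2 = 16.
print(arrangement_regions(3, 2))                   # 7
# Two layers in a 2-D input space: neurons with [3, 3] and [2, 2] linear pieces.
print(lann_region_bound([[3, 3], [2, 2]], d=2))    # 225
```

The product grows with every additional layer, whereas adding pieces to neurons within a layer only increases the base of one factor, which is consistent with the depth-versus-width comparison in Section 6.2.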
A.2. Suggested Range of
In this section, we suggest a range of for using a LANN as a complexity measure. A suitable value of makes the complexity measure trustworthy and stable. When the value of is large, the measure may be unstable and unable to reflect the real complexity. A small value of therefore seems preferable; however, a smaller value makes the LANN approximation more costly to construct. How small should be? Based on an analysis of the curve of the approximation error, we provide an empirical range.
We first analyze the curve of the approximation error from several aspects. The approximation error is the optimization objective of the LANN-building algorithm (Algorithm 2), so it decreases over the training epochs (Figure 11(a)). Meanwhile, the absolute value of the first-order derivative of , which represents the contribution of the current epoch's operation to the decrease of the approximation error , is called the approximation gain here and denoted by . Our algorithm ensures that, at any time, is expected to be larger than the gain of all remaining possible operations. Figure 11(b) shows the curve of the approximation gain. Because we ignore the error in the algorithm, the curve of the approximation gain in practice has a small range of jitter, but the decreasing trend is guaranteed. We also consider the derivative of , formally the absolute value of the second-order derivative of the approximation error , denoted by . The second-order derivative reflects the changing trend of the approximation gain . It is easy to show that decreases as the number of training epochs increases: if not, after a finite number of epochs we would have . But in fact, since never decreases to 0, the operation of each epoch has a non-zero influence on , and thus will not be 0. Figure 11(c) shows the trend of .
As Figure 11 shows, the changing trends of , and are close to each other. Each curve decreases quickly at the beginning and then gradually flattens toward convergence. This agrees with our algorithm design. After flattens, the following relationships hold: , , . Suppose there is an epoch in the flat region of , and are its first-order and second-order derivatives. We show the changing trends of the flat regions in Figure 12. According to Figure 11 and the above analysis, the curve after is basically stable. We estimate the total gain in approximation error that the remaining epochs can bring. Suppose there exists a such that, epochs after , reaches 0. Then the gain of the remaining epochs is the gain of the next epochs. Suppose is constant, . The gain of the remaining epochs is then estimated by . We analyze and from the viewpoint of this remaining-gain estimate. In practice, and keep decreasing. If and become stable with very similar decreasing trends, the estimated remaining gain of should be close to the estimates for the epochs around . Supposing the above condition holds, we have: , where is the derivative of . That is, the downward trends of and are basically similar, and holds. As a result, the fact that of an epoch almost equals the value calculated for its neighbors demonstrates that the derivatives of and are almost the same. The gain of the remaining epochs is expected to be relatively stable, and each subsequent epoch will not greatly influence the value of . In this case, is relatively stable. In conclusion, for the construction of a LANN based on a specific target model, is suggested, where is the starting point at which converges to a constant.
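The quantities analyzed above can be approximated from a recorded error curve by finite differences, as in the sketch below. Both the discretization and the remaining-gain formula are assumptions: the formula treats the gain as shrinking linearly at a constant rate until it reaches zero, so that the remaining gain is the area of a triangle. The names `gains` and `remaining_gain` are hypothetical.

```python
def gains(errors):
    """Finite-difference versions of the approximation gain and its rate of change.

    errors: approximation error recorded after each epoch.
    Returns (g1, g2), where g1[t] = |errors[t+1] - errors[t]| is the gain of an epoch
    and g2[t] = |g1[t+1] - g1[t]| is the absolute change of the gain.
    """
    g1 = [abs(b - a) for a, b in zip(errors, errors[1:])]
    g2 = [abs(b - a) for a, b in zip(g1, g1[1:])]
    return g1, g2

def remaining_gain(g, s):
    """Triangle estimate of the total gain still obtainable.

    If the current gain g shrinks by a constant s per epoch, it reaches 0 after
    g / s epochs, and the accumulated gain is about g * (g / s) / 2.
    """
    if s <= 0:
        return float("inf")  # gain is not shrinking, so no finite estimate
    return g * g / (2.0 * s)

# Toy error curve that halves at every epoch.
errors = [1.0, 0.5, 0.25, 0.125]
g1, g2 = gains(errors)
print(g1)                               # [0.5, 0.25, 0.125]
print(g2)                               # [0.25, 0.125]
print(remaining_gain(g1[-1], g2[-1]))   # 0.0625
```

Comparing this estimate across neighboring epochs gives a simple numerical check of whether the curve has entered the flat region discussed in the text.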
To make two LANNs comparable, find a that satisfies and . This ensures, to some degree, the stability of the complexity measure of the target model, since the estimated gains of the remaining epochs of the two LANNs are nearly the same. In practical experiments, the value of is used to check whether the value of is reasonable. In our experiments, we choose a uniform and verify its rationality. Our experimental results suggest that, for relatively simple networks (e.g., 3 layers with widths in the hundreds), is good enough since converges. In Figure 13, we show the changing trends on CIFAR to demonstrate that is a reasonable value in our experiments.
A.3. More Experimental Results
A.3.1. Extension of Section 6.4
In Section 6.4, we report that customized $L^1$ regularization is more flexible than normal $L^1$ regularization, so that it behaves better with a large penalty weight. We note that customized $L^1$ maintains the prediction performance on the CIFAR test dataset, while that of normal $L^1$ is about lower. Figure 14 shows the corresponding prediction accuracy on the training and test datasets.
A.3.2. Complexity Measure is Data Insensitive
To verify whether our LANN-based complexity measure is data sensitive, we measure the approximation error of LANNs on the test dataset. Table 3 compares the approximation errors on the training dataset (the dataset used to build the LANNs) and the test dataset. The results show that LANNs achieve very close approximation errors on the training and test datasets, which demonstrates that our complexity measure is data dependent but data insensitive.