Currently, the structure of neural networks is mainly developed by hand-crafted design [simonyan2014very, szegedy2015going] or neural architecture search [baker2016designing]. Theoretical guidance for understanding the behavior of deep networks is still lacking. One of the most successful architectures, the residual network (ResNet) [he2016deep], introduces identity mappings to enable the training of very deep models. ResNets are also used as the base model for a series of computer vision tasks such as scene segmentation [chen2017deeplab] and action recognition [tran2018closer]. Despite this success, the understanding of ResNets is mainly supported by empirical analyses and experimental evidence, apart from some attempts from an optimization view [li2018optimization]. Recently, the connection between ResNets and dynamical systems has inspired researchers to unravel the physics of residual networks using the rich theories and techniques of differential equations [weinan2017proposal, haber2017stable].
In these studies, it is noted that the Euler method for ordinary differential equations (ODEs) has the same formulation as the iterative updates of ResNet, so that ResNet can be viewed as a discrete dynamical system. In this way, parameter learning in neural networks is translated into its continuous counterpart as an optimal control problem [chen2018neural, li2018icml, li2017maximum, behrmann2018invertible]. Based on the Euler method, some studies introduce multi-step or higher-order discretization [Lu2018, he2019ode] and fractional optimal control [jia2019focnet] to construct more powerful network structures for different tasks. Other studies analyze the stability of residual networks and propose more stable and robust structures [haber2017stable, chang2018reversible, ruthotto2018deep, eldad2019].
An appropriate time stepping for discretization methods of ODEs is crucial for stability and efficiency [ascher1998computer]. A small step size renders an accurate solution, but requires more steps for a fixed evolution time. Chang et al. adopt a multi-level strategy to adjust the time step size for ResNet training [chang2017multi]. A recent study suggests a small step size for more stable and robust ResNets [zhang2019towards]. However, an overly small step size would smooth the feature learning. From a dynamical system view, the evolution time of a network with a fixed depth and a small step size is too short for the system to evolve from the initial state to the final linearly separable state. In numerical methods for ODEs, adaptive time stepping strategies, such as the Runge-Kutta-Fehlberg (RKF) method [hairer1991solving] shown in Figure 1, attain a good trade-off between stability and cost. Can we also design an adaptive time stepping for ResNets to ensure both stability and performance?
In this study, we analyze the effects of time stepping on the stability and performance of residual networks, and point out that each step size should be aware of the previous steps and of the weight parameters in the current step. We develop an adaptive controller that connects different steps through an LSTM and takes the parameters of each step as input to output a set of coefficients that decide the current step size. In doing so, the network is trained with variable step sizes and evolution time, and the time stepping is optimized jointly with the network to give it better stability and performance. More importantly, because the controller is data-independent, our performance gains come with no additional cost in the inference phase.
The contributions of this study can be listed as follows:
We analyze the correspondence between ODEs and ResNets, establish a stability condition for ResNets with step sizes and weight parameters, and point out the effects of step sizes on the stability and performance.
Based on our analyses, we develop a self-adaptive time stepping controller to enable optimizing variable step sizes and evolution time jointly with the network training.
Experiments on ImageNet and CIFAR demonstrate that our method is able to improve both stability and accuracy. The improvements come with no additional cost in inference phase. We also test the application of our method to two non-residual network structures.
View ResNet as a Dynamical System
The forward propagation in a residual network block [he2016deep] can be written as:
$$x_{n+1} = x_n + \Delta t\, f(x_n, W_n), \quad n = 0, 1, \dots, N-1, \tag{1}$$
where $f$ is the residual function for each step, and $N$ is the network depth. Here we add the step size $\Delta t$ in a multiplicative way with the residual branch. Usually a unit of $f$ consists of a linear (convolution) operator with parameters $W_n$ followed by a non-linear activation. When $\Delta t = 1$, it reduces to the original form of ResNets. Regarding $\Delta t$ as a fixed step size, we see that Eq. (1) can be interpreted as the forward Euler discretization of the following initial value problem (IVP) [weinan2017proposal, haber2017stable]:
$$\frac{dx(t)}{dt} = f(x(t), W(t)), \quad x(0) = x_0, \quad t \in [0, T], \tag{2}$$
where the features $x(t)$ and parameters $W(t)$ are viewed in their continuous limit as functions of time $t$. The evolution time $T$ corresponds to the network depth $N$. In doing so, residual networks are interpreted as the discrete counterpart of dynamical systems, and parameter learning is equivalent to solving an optimal control problem with respect to the ODE system in Eq. (2) [chen2018neural, li2018icml, haber2018learning]. Related studies use the stability condition of the forward Euler method to analyze the stability of ResNets and propose better structures [chang2018reversible, eldad2019]. We show that time stepping is crucial for the stability and performance of ResNets.
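To make the correspondence concrete, the residual update $x_{n+1} = x_n + \Delta t\, f(x_n, W_n)$ can be sketched as a forward Euler loop. This is a minimal NumPy illustration, not the authors' implementation: the residual branch `residual_fn` (a dense linear map followed by ReLU), the number of layers, and the weight scale are all assumptions chosen for brevity.

```python
import numpy as np

def residual_fn(x, W):
    """Hypothetical residual branch: a linear map followed by ReLU."""
    return np.maximum(W @ x, 0.0)

def resnet_forward(x0, weights, dt=1.0):
    """Residual / forward Euler update: x_{n+1} = x_n + dt * f(x_n, W_n)."""
    x = x0
    for W in weights:
        x = x + dt * residual_fn(x, W)
    return x

rng = np.random.default_rng(0)
weights = [0.1 * rng.standard_normal((4, 4)) for _ in range(8)]  # assumed toy weights
x0 = rng.standard_normal(4)

out_resnet = resnet_forward(x0, weights, dt=1.0)   # dt = 1 is the original ResNet form
out_small = resnet_forward(x0, weights, dt=0.01)   # same depth, much shorter evolution time
```

With a fixed depth, the total evolution time in this sketch is `dt * len(weights)`, so shrinking `dt` keeps the output close to the input.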
Time Stepping for the Euler method
Given a linear problem $\frac{dx}{dt} = \lambda x$, we have its forward Euler discretization as $x_{n+1} = (1 + \Delta t\, \lambda)\, x_n$, where $\Delta t$ is the fixed step size and $n = 0, 1, \dots, N-1$. Assuming that $\tilde{x}_0 = x_0 + \epsilon$ is the initial value suffering from a perturbation $\epsilon$, we have:
$$\tilde{x}_n - x_n = (1 + \Delta t\, \lambda)^n\, \epsilon, \tag{3}$$
which indicates that, as $n$ grows, the perturbation is controllable if $|1 + \Delta t\, \lambda| \le 1$. As a more general case, the forward Euler method for the non-linear system Eq. (2) is stable when the following condition holds [ascher1998computer]:
$$\max_i \left| 1 + \Delta t\, \lambda_i(J) \right| \le 1, \tag{4}$$
where $\lambda_i(J)$ denotes the $i$-th eigenvalue of the Jacobian matrix defined as $J = \partial f / \partial x$. From the stability condition, we can see that a small step size is required to obtain a stable solution. In practical implementations, the step size should yield a stable solution while being as large as possible to reduce the number of iterative steps for a fixed evolution time $T$. Thus, the choice of a time stepping scheme is crucial for both stability and efficiency.
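The linear case can be checked numerically. The sketch below runs forward Euler on $dx/dt = \lambda x$ from a clean and a perturbed initial value and reports the gap after a number of steps. The values of `lam`, `dt`, `n_steps`, and `eps` are illustrative assumptions, not settings from the paper.

```python
def euler_linear(x0, lam, dt, n_steps):
    """Forward Euler on dx/dt = lam * x with a fixed step size."""
    x = x0
    for _ in range(n_steps):
        x = x + dt * lam * x
    return x

def perturbation_after(lam, dt, n_steps, x0=1.0, eps=1e-3):
    """Gap between the perturbed and the clean trajectory after n_steps."""
    clean = euler_linear(x0, lam, dt, n_steps)
    perturbed = euler_linear(x0 + eps, lam, dt, n_steps)
    return abs(perturbed - clean)

lam = -10.0
stable = perturbation_after(lam, dt=0.05, n_steps=50)    # |1 + dt * lam| = 0.5, gap shrinks
unstable = perturbation_after(lam, dt=0.25, n_steps=50)  # |1 + dt * lam| = 1.5, gap grows
```

The gap is scaled by $|1 + \Delta t\, \lambda|$ at every step, so the only difference between the two runs is which side of 1 that factor falls on.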
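The Runge-Kutta-Fehlberg (RKF) method [hairer1991solving], as an adaptive time stepping scheme, attains a good trade-off between stability and efficiency. It uses the $p$-th (usually $p = 4$) order Runge-Kutta method to compute the current solution $x_{n+1}$, and the local truncation error is:
$$e_{n+1} = \left\| x(t_n + \Delta t_n) - x_{n+1} \right\|, \tag{5}$$
where $\Delta t_n$ is the current step size. The exact value $x(t_n + \Delta t_n)$ is approximated by the $(p+1)$-th order form, denoted as $\hat{x}_{n+1}$. Then the new step size can be adjusted as:
$$\Delta t_{n+1} = \gamma\, \Delta t_n \left( \frac{\mathrm{tol}}{e_{n+1}} \right)^{1/(p+1)}, \tag{6}$$
where $\gamma$ is a factor and $\mathrm{tol}$ is a tolerance error. The method adaptively increases or reduces the next step size according to the agreement between $x_{n+1}$ and $\hat{x}_{n+1}$.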
As a simple example, we consider an initial value problem whose analytical solution can be easily derived. As shown in Figure 1, the adaptive RKF method offers a stable solution but requires significantly fewer steps over the evolution period than a small fixed step size. If we use a large step size to reduce the number of steps, the solution becomes unstable. Thus, an adaptive time stepping scheme is crucial for the stability and efficiency of solutions to ODE systems.
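The trade-off can be reproduced with a small experiment. The sketch below is not RKF: as a simpler stand-in it uses an embedded Euler/Heun pair (orders 1 and 2) with the same kind of error-based step control. The test problem $dx/dt = -50x$, the tolerance, and the safety factor `gamma` are assumptions chosen for illustration.

```python
import math

def euler_fixed(f, t0, x0, t_end, n_steps):
    """Forward Euler with a fixed step size (t_end - t0) / n_steps."""
    dt = (t_end - t0) / n_steps
    t, x = t0, x0
    for _ in range(n_steps):
        x = x + dt * f(t, x)
        t += dt
    return x

def adaptive_euler_heun(f, t0, x0, t_end, dt=0.1, tol=1e-4, gamma=0.9):
    """Adaptive stepping with an embedded Euler (order 1) / Heun (order 2) pair."""
    t, x, n_accepted = t0, x0, 0
    while t_end - t > 1e-12:
        dt = min(dt, t_end - t)
        k1 = f(t, x)
        k2 = f(t + dt, x + dt * k1)
        x_low = x + dt * k1                  # first-order estimate
        x_high = x + 0.5 * dt * (k1 + k2)    # second-order estimate
        err = abs(x_high - x_low)            # local error estimate
        if err <= tol:                       # accept the step
            t, x, n_accepted = t + dt, x_high, n_accepted + 1
        # grow or shrink the next step from the error estimate
        dt = gamma * dt * math.sqrt(tol / max(err, 1e-16))
    return x, n_accepted

def decay(t, x):
    return -50.0 * x

x_adaptive, n_adaptive = adaptive_euler_heun(decay, 0.0, 1.0, 1.0)
x_fixed_small = euler_fixed(decay, 0.0, 1.0, 1.0, n_steps=4000)  # stable but many steps
x_fixed_large = euler_fixed(decay, 0.0, 1.0, 1.0, n_steps=20)    # few steps but unstable
```

In this setup the adaptive run takes small steps while the solution changes quickly and larger ones afterwards, so it stays bounded with far fewer accepted steps than the small fixed step size, while the large fixed step size diverges.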
Time Stepping for ResNets
As a discrete counterpart of dynamical systems, ResNet has similar behaviors to the discretization method for ODEs. We show that time stepping causes similar effects on the stability and performance of ResNets.
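Proposition 1. Consider a ResNet with $N$ residual blocks and variable step sizes $\Delta t_n$ for each step. Let $\epsilon_0$ be the perturbation coming from noise or an adversary, so that the perturbed input satisfies $\tilde{x}_0 = x_0 + \epsilon_0$. We have:
$$\left\| \tilde{x}_N - x_N \right\| \le \left\| \epsilon_0 \right\| \prod_{n=0}^{N-1} \left( 1 + \Delta t_n \left\| W_n \right\|_2 \right), \tag{7}$$
where $\| W_n \|_2$ denotes the spectral norm of the weight matrix in each residual block.
See Appendix A for the proof.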
We note that the robustness of ResNets to perturbation is affected by the network depth, the spectral norm of each weight matrix, and each step size. Eq. (7) shows that the stability of ResNets is conditioned on each layer in a stacking way, which is consistent with previous findings [veit2016residual]. For each layer, it is suggested that the weight matrix should have a small spectral norm. Since $\| W \|_2 \le \| W \|_F$, weight decay, which regularizes the Frobenius norm, is effective for training models that are robust to input perturbations. However, it shrinks the weight matrix in all directions and discards information about the input features. Some studies propose a spectral norm regularizer [yoshida2017spectral, miyato2018spectral] or a Jacobian regularizer [sokolic2017robust] to improve the stability.
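The dependence on depth, spectral norms, and step sizes can be checked on a toy network. The sketch below assumes a residual branch of the form ReLU applied to a dense linear map (a simplification of a real convolutional block) and compares the measured output perturbation with the product bound $\|\epsilon_0\| \prod_n (1 + \Delta t_n \|W_n\|_2)$. All sizes, weights, and step values are illustrative.

```python
import numpy as np

def forward(x, weights, steps):
    """Toy ResNet: x_{n+1} = x_n + dt_n * relu(W_n x_n)."""
    for W, dt in zip(weights, steps):
        x = x + dt * np.maximum(W @ x, 0.0)
    return x

def perturbation_bound(eps_norm, weights, steps):
    """Product bound eps_norm * prod_n (1 + dt_n * ||W_n||_2)."""
    bound = eps_norm
    for W, dt in zip(weights, steps):
        bound *= 1.0 + dt * np.linalg.norm(W, 2)  # ord=2 on a matrix is the spectral norm
    return bound

rng = np.random.default_rng(0)
weights = [0.3 * rng.standard_normal((8, 8)) for _ in range(10)]
steps = [0.5] * 10
small_steps = [0.05] * 10

x0 = rng.standard_normal(8)
eps0 = 1e-3 * rng.standard_normal(8)
eps_norm = np.linalg.norm(eps0)

actual = np.linalg.norm(forward(x0 + eps0, weights, steps) - forward(x0, weights, steps))
bound = perturbation_bound(eps_norm, weights, steps)
bound_small = perturbation_bound(eps_norm, weights, small_steps)
```

Because ReLU is 1-Lipschitz, the measured gap `actual` never exceeds `bound` in this toy setting, and shrinking every step size tightens the bound.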
However, an overly small step size would smooth the feature learning. Denoting the loss function as $L$, in ResNets with variable step sizes $\Delta t_n$, we have the gradient backpropagated to layer $n$ as:
$$\frac{\partial L}{\partial x_n} = \frac{\partial L}{\partial x_N} \left( I + \frac{\partial}{\partial x_n} \sum_{k=n}^{N-1} \Delta t_k\, f(x_k, W_k) \right), \tag{8}$$
which shows that the backpropagated information for each layer comes from two terms. When the step size is too small, the second term would vanish, and the gradients for each layer would be the same as $\partial L / \partial x_N$. This would make the network inefficient and lacking in representation power. From the dynamical system perspective, if a network with a fixed depth has a small step size, its corresponding optimal control problem would have a short evolution time, which increases the difficulty of transforming the data space from the initial state to the expected linearly separable state.
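The vanishing of the second term can be seen in a simplified setting. Assuming a purely linear residual branch $f(x, W) = Wx$ and a shared fixed step size (both simplifications, not the paper's setup), the end-to-end Jacobian $\partial x_N / \partial x_0$ is the product of the factors $I + \Delta t\, W_k$, and its deviation from the identity measures how much the layers contribute beyond passing the gradient straight through.

```python
import numpy as np

def end_to_end_jacobian(weights, dt):
    """Jacobian of x_N w.r.t. x_0 for x_{n+1} = x_n + dt * W_n x_n."""
    dim = weights[0].shape[0]
    J = np.eye(dim)
    for W in weights:
        J = (np.eye(dim) + dt * W) @ J
    return J

rng = np.random.default_rng(0)
weights = [0.5 * rng.standard_normal((4, 4)) for _ in range(6)]  # assumed toy weights

# Deviation of the Jacobian from the identity for a tiny and a unit step size.
dev_small = np.linalg.norm(end_to_end_jacobian(weights, 1e-4) - np.eye(4))
dev_large = np.linalg.norm(end_to_end_jacobian(weights, 1.0) - np.eye(4))
```

With the tiny step size the Jacobian is almost exactly the identity, so every layer would receive nearly the same gradient as the last one.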
Therefore, similar to the discretization for ODE systems, ResNets also need an adaptive time stepping to enable a good trade-off between the stability and performance. A related study [zhang2018smooth] proposes to optimize step sizes as explicit parameters. We note that independent step sizes are not self-aware and cannot be adjusted adaptively. Inspired by our analyses, we propose a self-adaptive time stepping controller that is dependent on the weight matrices and aware of previous steps.
In this section, we first introduce our design in the optimal control view, and then describe the components of our self-adaptive time stepping controller. Finally, we analyze the complexity of our method and show implementation details.
The Optimal Control View
In our analyses, we note that the product of the step size and the spectral norm of the weight matrix in each layer decides the stability. Directly calculating the spectral norm requires a singular value decomposition (SVD), which makes the training inefficient. Here we introduce a controller that outputs the current step size dependent on the convolution parameters in this layer. Besides, similar to the design of the RKF method, the new step size should remember previous step sizes to avoid a sharp increase or reduction. In line with these views, denoting the controller as $g$ parameterized by $\theta$, we have the corresponding optimal control problem as:
$$\begin{aligned} \min_{W, \theta}\; & \frac{1}{M} \sum_{i=1}^{M} L\big(x_N^{(i)}, y^{(i)}\big) + R(W, \theta), \\ \text{s.t.}\; & x_{n+1}^{(i)} = x_n^{(i)} + \Delta t_n\, f\big(x_n^{(i)}, W_n\big), \quad \Delta t_n = g(W_0, \dots, W_n; \theta), \quad n = 0, 1, \dots, N-1, \end{aligned} \tag{9}$$
where $L$ is the loss function, $R$ is the regularization, $y^{(i)}$ is the label of the input image $x_0^{(i)}$, and $M$ is the number of samples. The optimal control problem in discrete time [kwakernaak1972linear] looks for the best control parameters $W$ and $\theta$ that minimize the cost of this dynamical system. From Eq. (9) we can see that the system has variable step sizes $\Delta t_n$ and evolution time $T = \sum_{n=0}^{N-1} \Delta t_n$. In implementations, the controller is jointly optimized with the network training, so that an optimal time stepping can be searched to give the network better stability and performance.
Self-adaptive Time Stepping Controller
Since the time stepping controller takes the convolution parameters as input and is aware of previous steps, we parametrize the controller as an LSTM that connects different steps. In implementations, we split the step size into a vector $\Delta t_n \in \mathbb{R}^{C}$ with the same number of channels $C$ as the feature in the current step. The product between the residual branch and the step size is replaced with a channel-wise multiplication. We find this helps to improve the training stability and accuracy.
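An illustration of our method is shown in Figure 2. Denote the convolution parameters of the $n$-th layer as $W_n \in \mathbb{R}^{k_h \times k_w \times C_{in} \times C_{out}}$, where $k_h$, $k_w$ are the kernel sizes, and $C_{in}$, $C_{out}$ are the numbers of channels for the input and output, respectively. In order to acquire representative information of the residual function, we average $W_n$ along the input dimension and get $\bar{w}_n \in \mathbb{R}^{k_h k_w C_{out}}$ after reshaping. We concatenate them if there are multiple convolution layers in the residual branch. The vector $\bar{w}_n$ first goes through a transformation layer that reduces the dimension to $1/r$ of the $C$ channels in the current layer, where $r$ is the reduction:
$$v_n = \delta(P \bar{w}_n + b_P), \tag{10}$$
where $\delta$ refers to the ReLU function, $P$ is the transformation matrix, $b_P$ is the bias vector, and $v_n \in \mathbb{R}^{C/r}$ is the input for the LSTM. The hidden unit $h_n$ of the LSTM has the same dimension as $v_n$, i.e. $h_n \in \mathbb{R}^{C/r}$. The interaction between inner states and gates at the $n$-th layer goes through the following steps:
$$\begin{aligned} i_n &= \sigma(U_i v_n + V_i h_{n-1} + b_i), \\ \phi_n &= \sigma(U_\phi v_n + V_\phi h_{n-1} + b_\phi), \\ o_n &= \sigma(U_o v_n + V_o h_{n-1} + b_o), \\ \tilde{c}_n &= \tanh(U_c v_n + V_c h_{n-1} + b_c), \\ c_n &= \phi_n \odot c_{n-1} + i_n \odot \tilde{c}_n, \\ h_n &= o_n \odot \tanh(c_n), \end{aligned} \tag{11}$$
where $i_n$, $\phi_n$, and $o_n$ are the input, forget, and output gates, $c_n$ is the cell state, $\sigma$ refers to the sigmoid function, and $\odot$ denotes element-wise multiplication. After the above processes, another fully connected layer transforms the hidden unit $h_n$ into the step size vector $\Delta t_n$:
$$\Delta t_n = \sigma(Q h_n + b_Q), \tag{12}$$
where $Q \in \mathbb{R}^{C \times C/r}$, $b_Q \in \mathbb{R}^{C}$, and $\sigma$ denotes the non-linear sigmoid function that restricts the elements of the current step size to the range $(0, 1)$. The forward propagation in the current step is:
$$x_{n+1} = x_n + \Delta t_n \otimes f(x_n, W_n), \tag{13}$$
where $\otimes$ represents the channel-wise multiplication. We have one controller for each size stage in ResNet, and all the parameters above are shared within the same size stage.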
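The pipeline described above (average the kernel over the input channels, apply a fully connected layer with ReLU, run one LSTM step, then apply a fully connected layer with sigmoid) can be sketched in NumPy. This is an illustrative forward pass under assumptions, not the authors' code: the class name `StepSizeController`, the random initialization, the stacked gate layout, the reduction value, and the use of a random array as a stand-in for the residual branch output are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def project_conv(W):
    """Average a (k_h, k_w, C_in, C_out) kernel over the input-channel axis, then flatten."""
    return W.mean(axis=2).reshape(-1)

class StepSizeController:
    """Hypothetical controller: FC + ReLU, one LSTM cell shared across layers, FC + sigmoid."""

    def __init__(self, in_dim, channels, reduction=4, seed=0):
        rng = np.random.default_rng(seed)
        hidden = channels // reduction
        self.P = 0.1 * rng.standard_normal((hidden, in_dim))          # input transformation
        self.b_p = np.zeros(hidden)
        self.U = 0.1 * rng.standard_normal((4 * hidden, 2 * hidden))  # all gates on [v_n, h_{n-1}]
        self.b_u = np.zeros(4 * hidden)
        self.Q = 0.1 * rng.standard_normal((channels, hidden))        # output transformation
        self.b_q = np.zeros(channels)
        self.h = np.zeros(hidden)  # hidden state carried from layer to layer
        self.c = np.zeros(hidden)  # cell state carried from layer to layer

    def step(self, w_bar):
        v = np.maximum(self.P @ w_bar + self.b_p, 0.0)
        z = self.U @ np.concatenate([v, self.h]) + self.b_u
        i, f, o, g = np.split(z, 4)  # input, forget, output gates and candidate
        self.c = sigmoid(f) * self.c + sigmoid(i) * np.tanh(g)
        self.h = sigmoid(o) * np.tanh(self.c)
        return sigmoid(self.Q @ self.h + self.b_q)  # every entry lies in (0, 1)

rng = np.random.default_rng(1)
C = 16
kernels = [0.1 * rng.standard_normal((3, 3, C, C)) for _ in range(3)]
controller = StepSizeController(in_dim=3 * 3 * C, channels=C)
step_sizes = [controller.step(project_conv(W)) for W in kernels]

x = rng.standard_normal((C, 8, 8))          # one feature map, channels first
residual = rng.standard_normal((C, 8, 8))   # stand-in for the residual branch output
x_next = x + step_sizes[0][:, None, None] * residual  # channel-wise multiplication
```

The controller only ever sees the kernels, never the features, so in this sketch the list `step_sizes` could be computed once after training and stored, which is the property the data-independence argument relies on.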
We note that the channel-wise attention technique [hu2017] has a similar formulation to Eq. (13). Our method differs from theirs in that our controller is data independent and does not rely on the feature space. What our time stepping aims to optimize is part of the structural information. When training finishes, our method discards the controller and has no considerable additional cost in the inference phase (except a small amount of computation for multiplying by the step sizes). This cannot be realized by attention methods, which are feature dependent. Besides, our experiments show that our method is compatible with the attention method.
We denote the time stepping controller using LSTM as TSC, and call it the LSTM-based TSC when comparing variants. In addition to this structure, we also consider two other versions as its counterparts. We analyze their complexities in this subsection and compare their performance in experiments. The first one removes the LSTM layers in Eq. (11) and keeps only the input and output transformation layers, without sharing their parameters. It is similar to the module of two fully-connected layers in [hu2017]. We refer to this version as the variant without LSTM layers. This structure is dependent on the convolution parameters but not aware of previous steps. The other one does not use a controller, and only introduces the step sizes as explicit parameters that are independent of the weight matrices and previous steps. We refer to this version as the explicit-parameter variant.
When training finishes, only the optimized step sizes need to be stored and the controller can be removed, which leads to little additional cost in the inference phase. As for training complexity, the variant with explicit step-size parameters has the same cost as in inference since it does not have a controller. Between the LSTM-based TSC and the variant without LSTM layers, the former uses fewer parameters because of the weight sharing in the LSTM, while the latter requires less computation because it does not have the LSTM layers in Eq. (11). Assume that the network has $S$ feature size stages (as an example, $S = 3$ for CIFAR and $S = 4$ for ImageNet), that there are $L_s$ layers in the $s$-th stage, and that $C_s$ is the number of channels per layer. We compare the parameter complexity of the three methods in the training and inference phases in Table 1, and compare their performance and overhead in experiments.
In the experimental section, we test our proposed methods on ResNet and its variants. Here we show the details of our implementation. For ResNet structures without bottleneck, there are two convolution layers in each residual block, both with $3 \times 3$ kernels. We concatenate the two projected parameter vectors of layer $n$, whose length is determined by the number of output channels in this layer. For ResNet structures with bottleneck, there are three convolution layers in each residual block, and the kernel size is $1 \times 1$ for the first and third layers and $3 \times 3$ for the bottleneck layer. We only concatenate the first and third projected parameter vectors, whose length is determined by the number of bottleneck channels. The reduction is set separately for ResNets without bottleneck and for ResNets with bottleneck. An illustration of our method on ResNet-34 (without bottleneck) and ResNet-50 (with bottleneck) is shown in Appendix B.
We conduct experiments on CIFAR-10, CIFAR-100, and ImageNet to validate our time stepping controller on ResNet and its variants. We also test the application of our method to two non-residual network structures.
Datasets and Training Details
For the training set of the ImageNet dataset, we adopt the standard data augmentation scheme [he2016deep]. A crop is randomly sampled from the image or its horizontal flip. The input images are normalized by the mean and standard deviation of each channel. All models are trained on the training set and we report the single center crop error rate on the validation set. For CIFAR-10 and CIFAR-100, we adopt a standard data augmentation scheme by zero-padding the images with 4 pixels on each side and then randomly sampling a crop from each image or its horizontal flip. The images are normalized by mean and standard deviation.
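The CIFAR augmentation can be sketched as follows. The helper names `augment` and `normalize` are hypothetical, the crop is assumed to return to the original image size, and the per-channel statistics in the usage lines are computed from a single random image only to keep the example self-contained (in practice they would come from the training set).

```python
import numpy as np

def augment(img, rng, pad=4):
    """Zero-pad an (H, W, C) image, randomly crop back to (H, W), randomly flip."""
    h, w, _ = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="constant")  # pads with 0
    top = rng.integers(0, 2 * pad + 1)    # crop offset in [0, 2 * pad]
    left = rng.integers(0, 2 * pad + 1)
    crop = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]              # horizontal flip
    return crop

def normalize(img, mean, std):
    """Normalize each channel by the given mean and standard deviation."""
    return (img - mean) / std

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))             # stand-in for one CIFAR image
out = augment(img, rng)
out_norm = normalize(out, img.mean(axis=(0, 1)), img.std(axis=(0, 1)))
```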
For the ImageNet dataset, we train for 100 epochs with an initial learning rate of 0.1, and drop the learning rate every 30 epochs. A mini-batch has 256 images distributed among 8 GPUs. For CIFAR-10 and CIFAR-100, we train for 300 epochs with a mini-batch of 64 images. The learning rate is set to 0.1 and divided by 10 at two milestones of the training procedure. For our results on CIFAR, we run 3 times with different seeds and report mean values.
In order to test the effectiveness of our proposed LSTM time stepping controller, we conduct experiments on ImageNet and compare it with its two counterparts: the variant without LSTM layers and the variant with explicit step-size parameters. We apply our methods to ResNet-50. As shown in Table 2, our re-implementation performs slightly better than the reported result. When armed with the explicit-parameter variant, ResNet has a small performance improvement due to the step sizes introduced as explicit parameters, which reveals that a trainable step size benefits ResNet performance. The variant without LSTM layers and the LSTM-based TSC have larger improvements, which shows that making the controller dependent on the convolution parameters contributes to better performance. The LSTM-based TSC is further aware of previous steps because of the memory brought by the LSTM, and improves top-1 accuracy by 0.79% over the baseline. The training curves of the LSTM-based TSC and the baseline are compared in Figure 3; our method is superior throughout the whole training procedure. The ablation study demonstrates the effectiveness of our design, in which the step sizes depend on the convolution parameters and are aware of previous steps. We use the LSTM controller for our later experiments.
Besides, the optimized step sizes are part of the structural information and are not data dependent. Thus, when training finishes, only the step sizes need to be stored, and the additional overhead introduced by our controller can be spared in inference. As shown in Table 2, compared with the baseline, these three methods have little additional cost in inference. For training, the explicit-parameter variant has the same parameters and computation as in inference because it does not use a parameterized controller. The LSTM-based TSC uses fewer parameters but more computation than the variant without LSTM layers in training, which is consistent with our analysis in the prior section.
Improving the Performance
| Model | Re-implementation: error (%) | Re-implementation: params (M) | Re-implementation: GFLOPs | With TSC: error (%) (gain) | With TSC: params (M) (train / infer) | With TSC: GFLOPs (train / infer) |
|---|---|---|---|---|---|---|
| ResNet-18 | 29.41 | 11.69 | 1.81 | 28.80 (0.61) | 13.52 / 11.69 | 1.81 / 1.81 |
| ResNet-34 | 26.03 | 21.80 | 3.66 | 25.39 (0.64) | 23.63 / 21.80 | 3.66 / 3.66 |
| ResNet-50 | 24.42 | 25.56 | 3.86 | 23.63 (0.79) | 27.83 / 25.57 | 3.89 / 3.86 |
| ResNet-101 | 22.94 | 44.55 | 7.58 | 22.24 (0.70) | 46.82 / 44.58 | 7.64 / 7.58 |
| ResNeXt-50 | 22.84 | 25.03 | 3.77 | 22.23 (0.61) | 27.30 / 25.04 | 3.80 / 3.77 |
| ResNeXt-101 | 21.88 | 44.18 | 7.51 | 21.17 (0.71) | 46.45 / 44.21 | 7.57 / 7.51 |
| SENet-50 | 23.27 | 28.09 | 3.87 | 22.75 (0.52) | 30.36 / 28.10 | 3.90 / 3.87 |
| SENet-101 | 22.37 | 49.33 | 7.60 | 21.82 (0.55) | 51.60 / 49.36 | 7.66 / 7.60 |
We add our time stepping controller to ResNet families with different depths and to different variants, including ResNeXt [xie2016aggregated] and SENet [hu2017], to validate its ability to improve performance. As shown in Table 3, for fair comparison we re-implement the baseline methods, and most of our re-implementations outperform the reported numbers.
When armed with our time stepping controller, it is shown that ResNets in different depths consistently have a 0.6%-0.8% improvement on performance. ResNet-50 has the largest accuracy gain. We see that our methods introduce a small number of parameters and computation for training, and nearly no considerable extra cost for inference.
We also add our time stepping controller on ResNet variants to test the scalability. The implementation of ResNeXt is similar to ResNet. It is shown that our method is also effective to ResNeXts. For ResNeXt-50, it has a top-1 accuracy improvement of 0.61%, while ResNeXt-101 has a larger gain of 0.71%.
Compared with feature operating modules, such as the channel-wise attention in SENet, our method may not have strong advantages in improving performance, because attention methods are data dependent, while ours searches for adaptive step sizes, which are independent of features and belong to structural information. In spite of this, we note that TSC-ResNet-101 (22.24% top-1 error rate) surpasses SENet-101 (22.37% top-1 error rate) while using fewer parameters and less computation. Our performance gain has little extra cost in inference, which cannot be realized by the feature-dependent SENet. We also show that our time stepping controller is compatible with SENets, even though they share a propagation formulation similar to Eq. (13). Our method reduces the top-1 error rate by 0.52% for SENet-50 and by 0.55% for SENet-101.
From Table 3, we observe that a deeper model benefits more from our time stepping controller in general. We believe the reason is that a deeper network suffers more from the effects of step sizes. As indicated by Eq. (7), when the depth increases, the cumulative influence of the spectral norms of the weight matrices and the step sizes becomes larger. In this case, an inappropriate step size would heavily impede the stability and performance of the network. This is in line with the intuition that deeper networks are more difficult to train. Our method uses an adaptive time stepping controller to adjust the step sizes jointly with the network training, and thus helps more for deeper networks.
Improving the Stability
In order to test the ability of our time stepping controller to improve the stability, we conduct experiments to show the controller’s robustness to perturbations and increasing depths.
We train ResNet-56 on CIFAR-10 with different fixed step sizes ($\Delta t = 0.01$ and $\Delta t = 1$) and with our time stepping controller. After training, we inject Gaussian noise at different levels into the input for inference on the test set. The level of perturbation is decided by the standard deviation of the synthetic Gaussian noise. As shown in Figure 4, when the noise level increases, the accuracies of $\Delta t = 0.01$ and $\Delta t = 1$ both drop quickly. The loss of $\Delta t = 1$ increases sharply compared with that of $\Delta t = 0.01$. This is in line with our analysis in Eq. (7) that a small step size in each layer helps to bound the adverse effect caused by perturbations. However, the performance of $\Delta t = 0.01$ is significantly worse than that of $\Delta t = 1$. As a comparison, our time stepping controller offers adaptive step sizes, and TSC shows a moderate loss increment as the noise level rises. Although TSC has a higher loss than $\Delta t = 0.01$ when the noise level is larger than 0.5, the accuracies of TSC are consistently better than those of $\Delta t = 0.01$ and $\Delta t = 1$. This demonstrates that our time stepping controller improves the robustness of ResNet to perturbations, and offers a good trade-off between stability and performance.
We further test the robustness of our method to increasing depths. As shown in Figure 5, we train ResNet and TSC-ResNet with different depths on the CIFAR-10 and CIFAR-100 datasets. For shallow residual networks, TSC-ResNet and ResNet have similar performances. When depth increases, the accuracies of ResNets approach a plateau, and then have a drop for very deep networks. A similar result is also reported in [zhang2019towards]. As a comparison, the performance of TSC-ResNet is more stable. It keeps a slow increment in accuracy for large depths. We note that the performance gap between TSC-ResNet and ResNet is larger for deeper networks in general, which is consistent with our observation on the ImageNet experiments in the prior section. This demonstrates our analysis that deeper networks suffer from an accumulated instability, and thus gain more benefits from our adaptive time stepping controller.
As shown in Table 4, we average the optimized time step size vectors of TSC-ResNet-50 and TSC-ResNet-101 in different layers. We find that the step sizes in the shallow layers of both TSC-ResNet-50 and TSC-ResNet-101 are centered around 0.5. For the layers in the last size stage (Conv-5_1, 5_2, 5_3), the step sizes approach 1. Conv-3_1 and 4_1 keep the initial state and are only mildly affected by our method, because each of our time stepping controllers considers previous steps, while these layers correspond to the first time step in their size stage. Besides, the shallow layers should have small step sizes to avoid accumulated instability, whereas layers Conv-5_1, 5_2, and 5_3 should enlarge their step sizes to achieve strong feature transformations for the final representation.
We also observe that, among Conv-5_1, 5_2, and 5_3, deeper layers converge to larger step sizes. For the same layers among Conv-5_1, 5_2, and 5_3, TSC-ResNet-101 converges to a larger step size. This reveals that a deeper layer or a deeper model requires a larger step size, which is in line with our experimental results that deeper models gain more benefits from our method in general. We believe the reason is that a deeper network corresponds to a longer evolution time and suffers more from inappropriate time stepping. We also conjecture that the effects of our adaptive time stepping on shallow layers mainly ensure stability, while the adjustments for deep layers contribute to the performance gains.
Extension to Non-residual Networks
Our analyses and the development of our time stepping controller are based on the correspondence between residual networks and discrete dynamical systems. In order to test the scalability of our method to other networks, we add the controller to two non-residual network structures, DenseNet [huang2016densely] and CliqueNet [Yang_2018_CVPR]. The results are shown in Appendix C.
In this study, we use the correspondence between residual networks and discrete dynamical systems to unravel the physics of ResNets. We analyze the stability condition of the Euler method and ResNet propagation, and point out the effects of step sizes on the stability and performance of ResNets. Inspired by the adaptive time stepping in numerical methods of ODEs, we develop an adaptive time stepping controller that is dependent on the parameters of the network and aware of previous steps to adaptively adjust the step sizes and evolution time. Experiments on ImageNet, CIFAR-10, and CIFAR-100 show that our method is able to improve both performance and stability, without introducing much overhead in inference phase. Our method can also be applied to other non-residual network structures.
Z. Lin is supported by NSF China (no.s 61625301 and 61731018), Major Scientific Research Project of Zhejiang Lab (no.s 2019KB0AC01 and 2019KB0AB02), and Beijing Academy of Artificial Intelligence. J. Wu is supported by the Fundamental Research Funds of Shandong University and SenseTime Research Fund for Young Scholars.
Appendix A: Proof of Proposition 1
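Proposition 1. Consider a ResNet with $N$ residual blocks and variable step sizes $\Delta t_n$ for each step. Let $\epsilon_0$ be the perturbation coming from noise or an adversary, so that the perturbed input satisfies $\tilde{x}_0 = x_0 + \epsilon_0$. We have:
$$\left\| \tilde{x}_N - x_N \right\| \le \left\| \epsilon_0 \right\| \prod_{n=0}^{N-1} \left( 1 + \Delta t_n \left\| W_n \right\|_2 \right),$$
where $\| W_n \|_2$ denotes the spectral norm of the weight matrix in each residual block.
We have the propagation of ResNet in the $n$-th layer as:
$$x_{n+1} = x_n + \Delta t_n\, f(x_n, W_n).$$
Denote the perturbation in the $n$-th layer as $\epsilon_n = \tilde{x}_n - x_n$. Then we have,
$$\epsilon_{n+1} = \epsilon_n + \Delta t_n \left( f(x_n + \epsilon_n, W_n) - f(x_n, W_n) \right).$$
We simplify the residual branch in the $n$-th layer as a composite function composed of the linear operator $W_n$ and the ReLU non-linear activation $D_n$, which is a diagonal matrix. An entry of $D_n$ equals one if the corresponding element of $W_n x_n$ is positive, and zero otherwise. As pointed out by [yoshida2017spectral], the non-linearity in neural networks usually comes from piecewise linear functions, such as ReLU, max pooling, etc. Because the perturbation $\epsilon_n$ is small, $\tilde{x}_n$ lies in a neighborhood of $x_n$, and the residual function behaves as a linear operator near $x_n$. Then we have:
$$\left\| \epsilon_{n+1} \right\| = \left\| (I + \Delta t_n D_n W_n)\, \epsilon_n \right\| \le \left( 1 + \Delta t_n \left\| D_n W_n \right\|_2 \right) \left\| \epsilon_n \right\| \le \left( 1 + \Delta t_n \left\| W_n \right\|_2 \right) \left\| \epsilon_n \right\|,$$
where $\| \cdot \|_2$ denotes the matrix spectral norm defined as:
$$\left\| W \right\|_2 = \max_{x \neq 0} \frac{\left\| W x \right\|_2}{\left\| x \right\|_2}.$$
Applying this inequality recursively from $n = 0$ to $n = N - 1$ gives the bound in Eq. (7).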
Appendix B: TSC-ResNet Structure on ImageNet
We show the structures of TSC-ResNet-34 (without bottleneck) and TSC-ResNet-50 (with bottleneck) in Table 6.
|112×112|conv, 7|
|56×56|max pool, 3×3, stride 2|
|56×56|conv-, fc-|conv-, fc-|
|28×28|conv-, fc-|conv-, fc-|
|14×14|conv-, fc-|conv-, fc-|
|7×7|conv-, fc-|conv-, fc-|
|1×1|global average pool, fc-1000, softmax|
Appendix C: Extension to Non-residual Networks
Our analyses and the development of our time stepping controller are based on the correspondence between residual networks and discrete dynamical systems. In order to test the scalability of our method to other networks, we add the controller to two non-residual network structures, DenseNet [huang2016densely] and CliqueNet [Yang_2018_CVPR]. Although the two network structures do not directly have the identity mapping as a shortcut path, they introduce the dense connection, by which the addition in ResNet propagation is replaced with concatenation. Similarly, we use the parameters that generate the new feature at each step as the input of our controller. The new feature is channel-wise multiplied with the output step size, and then concatenated with the old features to form the current layer. For DenseNet, the setting is specified by the number of channels for each new feature and the network depth. For CliqueNet, we use their setting, which is specified by the number of channels for each new feature and the number of layers in the network.
As shown in Table 5, when armed with our time stepping controller, both DenseNet and CliqueNet have performance improvements on the CIFAR-10 and CIFAR-100 datasets. Although the two networks do not have the identity mapping to be closely connected with the discretization of dynamical systems, our experiments indicate that our method is scalable and can be applied to other non-residual networks.