We consider learning problems that are based on the continuous neural network formulation as in Haber_2017 ; ChRuBeDu2018 , and note that there has been considerable interest in related ODE- and PDE-inspired formulations as of late weinan2017proposal ; chang2018reversible ; chaudhari2018deep ; lu2017beyond ; ruthotto2018deep . In this setting, the neural network is cast as a time-dependent ordinary differential equation (ODE) which describes the flow of data elements (the featurevectors) through the neural network as
Here, describes the state of a neural network of width , and represents the network parameters (weights). maps the input feature vectors to the network dimension; the right-hand-side then determines the flow of the data element through the network. typically consists of an affine transformation that is parameterized by
, and a nonlinear transformation that is applied element-wise using an activation function, i.e.
is a linear transformation, such as a convolution, applied to the state, is a bias added to the state, and
denotes the nonlinear activation function applied componentwise, such as the ReLU activation.
In contrast to traditional neural networks that transform the network state at prescribed layers, the continuous formulation prescribes the rate of change of the network state using an ODE parameterized by . This control function
is to be learned during training by solving a constrained optimization problem. For example, in supervised learning, the optimization problem aims to match the network outputto a desired output that is given from the training data set:
denotes a loss function that measures the distance of the network output at final timeto the desired output, and denotes a regularization term, such as a Tikhonov regularization on and/or its time derivative, that stabilizes the optimization problem numerically. Training is generally considered successful, if the network parameters generalize well to new, previously unseen data, which is represented as a validation data set.
To solve the neural network flow numerically, classical time-integration schemes are applied that discretize the ODE in equation (1) on a given time-grid , and solve the resulting equations one after another for each time step . In this scenario, each time-step is associated with one layer of a classical artificial neural network, leading to the traditional network propagation with and being the network state and weights at layer , respectively. In fact, it has been observed that many state-of-the-art artificial neural networks can be interpreted as discrete time-integration schemes of the parametrized ODE in equation (1) (see e.g. LuZhLiDo2017 ). For example, an explicit Euler scheme to discretize (1) gives
which resembles the classical ResNet HeZaReSu2016 with an additional time-step size . We note that in the classical ResNet.
The continuous network formulation opens the door for leveraging various optimization schemes and optimal control theory that has been developed for ODE and PDE constraints Li1971 ; Tr2010 ; BiGhHeBl2003 . In particular, stability analysis of explicit time-integration schemes suggests favoring many-layer networks utilizing a small step size in order to ensure a numerically stable network propagation. Networks based on numerical time-integration schemes can therefore easily involve hundreds or thousands of discrete layers (time-steps) GuRuScCyGa2018 ; ChRuBeDu2018 ; Haber_2017 .
However, training such huge networks comes with both numerical and computational challenges. First of all, the serial nature of time evolution creates a barrier for parallel scalability. If network states are propagated through the network in a serial manner, as is done with classical training methods, an increase in the number of layers (i.e. more time steps, and smaller steps sizes) results in an equally larger time-to-solution. In order to address this serial bottleneck, a layer-parallel multigrid scheme has been developed in GuRuScCyGa2018 to replace the serial forward and backward network propagation. In this scheme, the network layers are distributed onto multiple compute units, and an iterative multigrid method is applied to solve the network propagation equations inexactly in parallel, across the layers. This iterative scheme converges to the same solution as serial network propagation. The result is that runtimes remain nearly constant when the numbers of layers and compute resources are increased commensurately (weak scaling), and that runtimes can be significantly reduced for a fixed number of layers and increasing compute resources (strong scaling). The layer-parallel multigrid method will be summarized in Section 2.1.
In addition, many-layer networks increase the complexity of the underlying optimization problem. In particular, considering more and more layers and hence more and more network parameters creates highly non-convex optimization landscapes which require proper initialization in order to be trained effectively. A number of schemes have been developed for initializing plain and residual neural networks. Commonly implemented techniques include Glorot glorot10 and He he2015delving ; however, this is still an active area of research, with new methods being proposed (e.g. hanin2018start ; humbird2018deep ; boxinit2019 ).
In this paper, we investigate a multilevel network initialization strategy that successively increases the number of network layers during layer-parallel multigrid training. The first use of such a multilevel (or “cascadic”) initialization strategy for an ODE network was done in the context of layer-serial training in Haber_2017
. Here, a sequence of increasingly deeper network training problems are solved, and each new deeper network is initialized with the interpolated learned parameters from the previous coarser network. We refer to this process as a “nested iteration”, because of this terminology’s long history. Nested iteration broadly describes a process (originally for numerical PDEs/ODEs) that starts by finding the solution for a relatively coarse problem representation, where the solution cost is relatively cheap. This coarse solution is then used as an inexpensive initial guess for the same problem, but at a finer resolution. Nested iteration was first discussed in a multigrid context at least in 1981hackbusch1981convergence , and the concept of nested iterations for solving numerical PDEs goes back at least to 1972 kronsjo1972design ; Kr1975 . Nested iteration is often called “full multigrid” in the multigrid context, and famously provides optimal solutions to some elliptic problems with unknowns BrHeMc2000 ; TrOo2001 . Nested iteration, especially when only regions with large error are marked for refinement, can lead to remarkably efficient solvers AdMaMcNoRuTa2011 ; DeMa2008 . We will take advantage of this inherent efficiency to cheaply find good initializations for ODE networks, by successively refining the time grid of the ODE network, adding more and more time-grid levels to the multigrid hierarchy.
In this work, we put the nested iteration (or cascadic) initialization idea into the context of an iterative layer-parallel multigrid solver, which re-uses the coarser grid levels during multigrid cycling. We investigate two interpolation strategies (constant and linear) and their influence on the layer-parallel multigrid convergence and training performance. We further investigate the effect of multilevel initialization as a regularization force, with the desire that deep network training can become less sensitive towards hyperparameters and variation in network parameter initializations. In other words, the goal is that with a better initial guess for the network parameters, the training process becomes more robust and less sensitive.
2 Methods: Layer-Parallel Training and Nested Iteration
2.1 Layer-Parallel Training via Multigrid
In this section, we summarize the layer-parallel training approach as presented in GuRuScCyGa2018 . For a full description of the multigrid training scheme, the reader is referred to the original paper, and references therein.
At the core of the layer-parallelization technique is a parallel multigrid algorithm that is applied to the discretized ODE network, replacing the serial forward (and backward) network propagation. Consider the discretized ODE network propagation to be written as
For example, one can choose , thus letting denote the right hand side of (6) for a ResNet with step size and on a uniform time grid. Classical sequential training solves (7) forward in time, starting from for either a general feature vector , or for a batch of feature vectors , finally stopping at the network output . On the other hand, the layer-parallel multigrid scheme solves (7) by simultaneously computing all layers with an iterative nonlinear multigrid scheme (FAS, BrHeMc2000 ) applied to the network layer domain. This multigrid scheme essentially computes inexact forward- and backward-propagations in an iterative fashion, such that the process converges to the layer-serial solution.
To compute these inexact propagations, a hierarchy of ever coarser layer-grid levels is created, which in turn accelerate convergence to the layer-serial solution on the finest level. A coarser level is created by assigning every -th layer to the next coarser level, giving the step size at the -th level, with level being the finest. For each level, this assignment results in a partitioning of the layers into - (fine-grid) and - (coarse-grid) layers.
Each multigrid iteration then cycles through the different layer-grid levels while applying smoothing operations to update (improve) the network state at - and - layers, which can be updated in parallel in an alternating fashion. Each - or -layer smoothing operation consists of an application of the layer propagator , using the step size of the current grid, to update the - or - layer states, see Figure 1. Additionally, each coarse level has a correction term that involves the residual of states between successive layer-grid levels, which is a part of the FAS method. Typically, -smoothing is applied on each grid level which refers to a sucessive application of -smoothing, then -smoothing, then -smoothing again. Note that these smoothing operations are highly parallel as - and -point sweeps are applied locally on each layer interval and independently from each other. Serial propagation only occurs on the coarsest level, where the problem size is trivial.
At convergence, the layer-parallel multigrid solver reproduces the same network output as serial propagation (up to a tolerance). However it enables concurrency across the layers. This concurency creates a cross-over point in terms of computational resources, after which the layer-parallel multigrid approach provides speedup over the layer-serial computation.
The same layer-parallel multigrid strategy can be applied to parallelize both the forward propagation and the backpropagation to compute the gradient with respect to the network weights. For the latter, the layer propagator at each grid level propagates partial derivatives of with respect to and backwards through the time domain (through the layers), again locally and in parallel.
The layer-parallel solvers could in principle be substituted for forward- and backpropagation in any gradient-based optimization scheme for solving (4) when training. One would only replace the sequential forward- and backpropagation with the layer-parallel multigrid iterations. However due to its iterative nature, the layer-parallel scheme is very well suited for simultaneous optimization approaches that utilize inexact gradient information to update the network weights. In GuRuScCyGa2018 , the simultaneous One-shot method is applied, which performs a fixed number of multigrid iterations for the network state and its derivative, with being as low as
, before each update of the network weights. Therein, a deterministic optimization approach has been applied, involving second order Hessian information. However the use of this approach with stochastic optimization, e.g., stochastic gradient descent method (SGD), is possible, although not yet tested numerically.
2.2 Nested Iteration (Multilevel) Initialization of Deep Neural Networks
Our proposed answer for initializing deep networks is nested iteration, where a trained coarse network, with fewer time-steps, is interpolated to initialize a finer network, with more time-steps. Our nested iteration algorithm is depicted in Figure 2 and Algorithm 1, with the following notation. Let the total number of nested iteration levels be , where is the finest level (i.e., largest network) and the superscript denotes quantities on nested iteration level , e.g., and are the coarsest-level state and control (weights and biases) variables for all time steps (layers). Additionally, let be a set of size containing level dependent optimization iteration counts, e.g., denotes the number of optimization iterations to carry out on the final, finest level . Line 5 of Algorithm 1 represents one optimization iteration of the layer-parallel training approach to update the current weights , applying inner layer-parallel multigrid iterations.
Finally, we define the interpolation operator to interpolate the weights and biases to the next finer level . The interpolation here is a uniform refinement in time with refinement factor 2. That is, there are exactly twice as many time-steps on the finer grid as on the coarser grid . Thus, if and the initial number of layers were , the final network would have layers, as depicted in Figure 2. We allow for two types of interpolation, piece-wise constant in time and linear in time, and thus define as
for with being the number of layers on nested iteration level .
The weights on the coarsest nested iteration level are initialized according to the overall hyperparameters, e.g., zero weights for internal layers, and random weights from a certain distribution for the opening and closing layers. These strategies are discussed in the results Section 3. Additionally, the choice between linear and piece-wise constant interpolation now becomes another hyperparameter to choose. In our experiments, both strategies performed similarly, in terms of multigrid convergence and training and validation accuracy. Thus, we report results only for piece-wise constant interpolation in Section 3.
One key detail in Algorithm 1 is the choice of the number of inner layer-parallel multigrid iterations . It has been observed numerically, that a small choice of during early optimization iterations can lead to steep drops in the training and validation accuracy after the interpolation step in line 8. Rather than being related to the interpolation itself, we observed that the true cause of these steep drops were inaccurate multigrid solves which lead to big errors in the gradient, and thus poor updates to the weights. We therefore enforce more multigrid iterations immediately after interpolation, in particular we choose for the first optimization iterations after interpolation to the new grid level, and after that (i.e. after ). In general, we recommend to apply enough multigrid iterations that ensure to drop the multigrid error below a relative error tolerance, such as guaranteeing a relative error drop of the multigrid residual of 4 orders of magnitude. It is important to note that this issue does not occur when using nested iteration with sequential training, as in Haber_2017 .
Other possible enhancements to the algorithm, such as refinement by other factors than 2 and other interpolation formulas are topics for future research.
We use two machine learning problems to demonstrate the performance of the nested iteration algorithm:
The first problem is referred to as “peaks”, and suggested in Haber_2017
. The task is to classify particles as members of five distinct sets. We train ondata points consisting of particle positions
, while membership in the sets is defined by a vector of probabilities(unit vectors).
The second example is a hyperspectral image segmentation problem using the Indian Pines image data set indianpinesdata . The classification task for this problem is to assign each pixel of a pixel image to one of classes representing the type of land-cover (e.g. corn, soy, etc…), based on spectral reflectance bands representing portions of the electromagnetic spectrum. We choose pixels for training.
For both examples, we choose a ResNet architecture as in (6), with a ReLU activation function that is smoothed around zero. The linear transformations inside each layer consist of a dense matrix of network weights.
In order to provide a fair basis for comparison between the nested iteration and non-nested iteration training simulations we introduce a common “work unit.” In all examples below, the work unit is defined as the average wall-clock run time for each iteration of the non-nested iteration on the fine grid. Thus the number of work units required for a non-nested iteration is equal to the number of optimization iterations. For the nested iteration, this rescaling of the run time provides a common basis for comparison to the non-nested iteration. In addition, the metric used for comparison below is the validation accuracy. This is measured by withholding a subset of the data from training and checking the performance of inference on those values. The percentage (ranging from to ), of correctly classifying members of that data set is the validation accuracy.
3.1 Peaks Example
Two versions of the peaks problem are run with residual networks of width and . For both the nested and non-nested cases, processors are used for the inner layer-parallel solve. The nested iteration is run with a schedule starting with coarse steps, steps, and fine level steps. For non-nested iteration optimization steps are taken (the number of steps was chosen so the run time of non-nested was nearly the same as nested iteration). The final simulation time for the peaks problem is set to .
A challenge facing neural networks usage is the range of parameters associated with both their design and training. In addition, due to the variability of the loss surface, the initial guess for the controls and state variables can have a dramatic impact on the quality of the training. The tables in Figure 3 show statistics for independent training runs, for each of sets of parameters, yielding a total of uniquely trained neural networks. The parameters selected for this study were the initial magnitude of the randomly initialized weights denoted , and the Tikhonov regularization parameter denoted . Based on a larger hyperparameter scan, we selected two values for each parameter that yield the greatest validation accuracy for nested iteration and non-nested iteration. As a result and , yielding a total of four parameters.
The top table shows results for a network with width , while the bottom shows the results for width . In both cases, there are
residual layers, plus an opening and classification layer. The tables show that the nested iteration achieves better validation accuracy on average for both network configurations. In addition, there is less variation as measured by both the standard deviation and the range of extrema using nested iteration. We attribute this to the use of a sequence of coarse grid problems to define an improved initial guess for the fine grid. At the coarse level, because of the reduced parameter space, the variation seen in the objective surface is potentially not as large. When only the fine simulation is used in non-nested iteration, the likelihood that the training algorithm gets stuck in a local minima early in the process is likely increased. In effect, the nested iteration is behaving like a structural regularization approach for the objective surface.
|Std. Dev||5.18 %||6.08 %|
Figure 4 shows the validation accuracy of the peaks problem as a function of work units for the individual best runs of nested iteration and non-nested iteration. The non-nested iteration corresponds to a single red line. The three levels of the nested iteration are plotted in a sequence of colors that show the achievement for each level of the algorithm. Again this is a function of work units, so these levels are scaled relative to the cost of a single fine level optimization step (some variability may occur due to the use of a backtracking algorithm).
In both the top image (for width residual networks) and the lower image (width ), the nested iteration has clearly superior validation accuracy as compared to the non-nested iteration. Moreover, the accuracy achieved for any number of work units is larger with nested iteration.
3.2 Indian Pines Example
For the Indian Pines example, we use a residual neural network with width , that contains at the fine level residual layers, plus opening and classification layers. For nested iteration, we use three levels with the coarsest having residual layers. The final simulation time is . All runs were performed on processors, implying the coarse grid contains residual layer per processor.
Figure 5 compares the validation accuracy of training with the non-nested algorithm to two nested iteration strategies. For all cases, the validation accuracy is plotted as a function of work units performed throughout the optimization solver. The first nested iteration strategy uses a schedule that performs optimization iterations on the coarse level, on the medium, and just iterations on the fine. This approach is designed to reduce the run time assuming that most of the work of training can be done on coarser levels. Note, as a result of not iterating on the fine grid as much, this may result in a lower achieved validation accuracy (indeed this is born out by the results). The top plot in the figure shows the non-nested iteration (red line), and the three different levels of the nested iteration (multiple colors). Considering only the coarse problem, it is clear that the nested iteration achieves higher validation accuracy in less computational time. This can be attributed to the much greater number of iterations taken. Where the increase in speed is a result of a shallower network. Moreover considering the entire run, the nested iteration has larger validation accuracy after just work units.
The lower image shows a similar story. However, this time the schedule for the nested iteration uses optimization steps at each level. Here again its clear that training for many steps on the coarse level yields rapid improvements. Overall, higher validation accuracy is still achieved in all cases for a fixed number of work units. Relative to the previous schedule (top image), the validation accuracy of this uniform schedule is improved, though at the cost of longer training times.
In this work, a nested iteration strategy for network initialization that enhances recent advances in layer-parallel training methodologies is developed for ODE networks. This approach uses a training algorithm where a sequence of neural networks are successively trained for a geometrically increasing number of layers. Interpolation operators are defined that transfer the weights between the levels. Results presented for the Peaks and Indian Pines classification example problems show that nested iteration can achieve greater accuracy at less computational cost. An exciting additional benefit was observed for the peaks problem. In this case the nested iteration also provided a structural regularization effect that resulted in reduced variation over repeated runs in a hyperparameter sweep. A more thorough investigation of this result, and greater improvements to the nested iteration and layer-parallel algorithms are the subject of future work.
The work of E. C. Cyr was supported by Sandia National Laboratories and the DOE Early Career Research Program. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. The views expressed in the article do not necessarily represent the views of the U.S. Department of Energy or the United States Government. S. Günther was supported by Lawrence Livermore National Laboratory. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07-NA27344. LLNL-PROC-798920.
This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.
- (1) Adler, J., Manteuffel, T.A., McCormick, S.F., Nolting, J., Ruge, J.W., Tang, L.: Efficiency based adaptive local refinement for first-order system least-squares formulations. SIAM Journal on Scientific Computing 33(1), 1–24 (2011)
- (2) Baumgardner, M.F., Biehl, L.L., Landgrebe, D.A.: 220 band aviris hyperspectral image data set: June 12, 1992 indian pine test site 3 (2015). DOI doi:/10.4231/R7RX991C. URL https://purr.purdue.edu/publications/1947/1
- (3) Biegler, L.T., Ghattas, O., Heinkenschloss, M., van Bloemen Waanders, B.: Large-scale pde-constrained optimization: An introduction. In: L.T. Biegler, M. Heinkenschloss, O. Ghattas, B. van Bloemen Waanders (eds.) Large-Scale PDE-Constrained Optimization, pp. 3–13. Springer Berlin Heidelberg (2003)
- (4) Briggs, W.L., Henson, V.E., McCormick, S.F.: A multigrid tutorial, 2nd edn. SIAM, Philadelphia, PA, USA (2000)
Chang, B., Meng, L., Haber, E., Ruthotto, L., Begert, D., Holtham, E.:
Reversible architectures for arbitrarily deep residual neural networks.
In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
Chaudhari, P., Oberman, A., Osher, S., Soatto, S., Carlier, G.: Deep relaxation: partial differential equations for optimizing deep neural networks.Research in the Mathematical Sciences 5(3), 30 (2018)
- (7) Chen, T.Q., Rubanova, Y., Bettencourt, J., Duvenaud, D.K.: Neural ordinary differential equations. In: Advances in neural information processing systems, pp. 6571–6583 (2018)
- (8) Cyr, E.C., Gulian, M., Patel, R., Pergeo, M., Trask, N.: Robust training and initialization of deep neural networks: An adaptive basis viewpoint. In: Submitted to the MSML2020 (Mathematical and Scientific Machine Learning Conference) (2019)
- (9) De Sterck, H., Manteuffel, T., McCormick, S., Nolting, J., Ruge, J., Tang, L.: Efficiency-based -and -refinement strategies for finite element methods. Numerical Linear Algebra with Applications 15(2–3), 89–114 (2008)
- (10) Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics 9, 249–256 (2010)
- (11) Günther, S., Ruthotto, L., Schroder, J., Cyr, E., Gauger, N.: Layer-parallel training of deep residual neural networks. SIAM Journal on Data Science (2019 (submitted)). ArXiv preprint arXiv:1812.04352
- (12) Haber, E., Ruthotto, L.: Stable architectures for deep neural networks. Inverse Problems 34(1), 014004 (2017). DOI 10.1088/1361-6420/aa9a90. URL http://dx.doi.org/10.1088/1361-6420/aa9a90
- (13) Hackbusch, W.: On the convergence of multi-grid iterations. Beiträge Numer. Math 9, 213–239 (1981)
- (14) Hanin, B., Rolnick, D.: How to start training: The effect of initialization and architecture. In: Advances in Neural Information Processing Systems, pp. 571–581 (2018)
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.
In: Proceedings of the IEEE international conference on computer vision, pp. 1026–1034 (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Humbird, K.D., Peterson, J.L., McClarren, R.G.: Deep neural network initialization with decision trees.IEEE transactions on neural networks and learning systems 30(5), 1286–1295 (2018)
- (18) Kronsjö, L.: A note on the nested iterations method. BIT Numerical Mathematics 15(1), 107–110 (1975)
- (19) Kronsjö, L., Dahlquist, G.: On the design of nested iterations for elliptic difference equations. BIT Numerical Mathematics 12(1), 63–71 (1972)
- (20) Lions, J.L.: Optimal control of systems governed by partial differential equations (1971)
- (21) Lu, Y., Zhong, A., Li, Q., Dong, B.: Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121 (2017)
- (22) Lu, Y., Zhong, A., Li, Q., Dong, B.: Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121 (2017)
- (23) Ruthotto, L., Haber, E.: Deep neural networks motivated by partial differential equations. arXiv preprint arXiv:1804.04272 (2018)
- (24) Tröltzsch, F.: Optimal control of partial differential equations: theory, methods, and applications, vol. 112. American Mathematical Soc. (2010)
- (25) Trottenberg, U., Oosterlee, C., Schller, A.: Multigrid. Academic Press, London, UK (2001)
- (26) Weinan, E.: A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics 5(1), 1–11 (2017)