1 Introduction
As transistors get smaller, the amount of power per unit volume no longer remains constant as Dennard and coauthors predicted in 1974 (Dennard et al., 1974). After over thirty years, Dennard’s scaling law ended in 2005, and CPU manufacturers are no longer able to increase clock frequencies significantly (Koomey et al., 2011). The physical limitations of silicon-based microprocessors gave rise to computing architectures with many cores. The reduction in cost and wide availability of multicore processors revolutionized the field of neural networks, especially for computer vision tasks. The recent success of neural networks in many learning tasks can therefore, in large part, be attributed to advances in computer architectures. Alternative computing technologies (e.g., quantum computers) are not likely to be available soon. Thus any future advances in more efficient training of neural networks will have to come from algorithms that can exploit distributed computer architectures.
Currently, the greatest obstacle to the development of distributed optimization algorithms for neural network training is the serial nature of forward and backward propagation. The parameters of the neural network can only be updated after the forward propagation algorithm propagates the data from the first to the last layer and the backpropagation algorithm propagates the gradient information back through all the layers of the network to the first layer. This serial structure of the training process severely hinders the efficiency of distributed optimization algorithms for deep neural networks. As a result, all the widely used frameworks for training neural networks offer only data parallelism, and the problem of layerwise parallelism remains open.
We propose a novel distributed optimization algorithm that breaks the serial nature of forward/backward propagation and allows for layerwise parallelism. The proposed algorithm exploits the interpretation of neural networks, and Residual Neural Networks in particular, as dynamical systems. Several authors have recently adopted the dynamical-systems point of view (see, e.g., (Haber & Ruthotto, 2018; E, 2017; Chen et al., 2018; Li et al., 2017) and Section 1.1 for a discussion of related work). Reformulating the problem of Residual Neural Network (RNN) optimization as a continuous-time optimal control problem allows us to model the layers of a neural network as the discretization time-points of a continuous-time dynamical system. Thus RNN training can be interpreted as a classical optimal control problem. The results of this paper hold for Residual Neural Networks; it is possible to extend the dynamical-systems view to other architectures, but in this paper we focus on RNNs. Using the interpretation of RNN training as an optimal control problem, we decompose the neural network across time (layers) and optimize the different subsystems in parallel. The central insight of this paper is that if the state and costate (adjoint) of the dynamical system are approximately known at intermediate points, then we can parallelize the time dimension of the system and perform forward/backward propagations in parallel. This justifies the name “parallel-in-time” method, because we parallelize across the time dimension of the optimal control model. A significant challenge is to produce approximately correct state and costate information for the optimal control problem. We address this challenge by using a coarse discretization of the problem in a phase we call the global prediction phase. The global prediction phase is followed by a local correction phase that attempts to improve the predicted optimal states and costates of the control model solution.
1.1 Related Work & Contributions
Layerwise optimization of NNs. Several authors have identified the limitations that backpropagation imposes on distributed algorithms. As a result, many approaches have been proposed, including ADMM and block-coordinate descent (Zeng et al., 2018), delayed gradients (Huo et al., 2018), synthetic gradients (Jaderberg et al., 2017), proximal methods (Lau et al., 2018), penalty-based methods (Carreira-Perpinan & Wang, 2014; Huo et al., 2018), and alternating minimization methods (Choromanska et al., 2018). Our method (especially the local correction phase) is related to the synthetic gradient approach in (Jaderberg et al., 2017). The major differences between the synthetic gradients of (Jaderberg et al., 2017) and our approach are that we exploit the dynamical-systems view and a multilevel discretization scheme to approximate the costate (adjoint) variables quickly. We also establish the convergence of our method under weaker conditions than in (Jaderberg et al., 2017). In Section 5 we also show that our method outperforms even an improved version of the synthetic gradient method.
The dynamical systems point of view. Several authors have formulated neural network training as an optimal control problem in continuous time. For example, the authors in (Haber & Ruthotto, 2018) adopted this point of view to develop more stable architectures for neural networks. In (Li et al., 2017) the authors used this approach to propose a maximum-principle-based method as an alternative to backpropagation. The work in (Li et al., 2017)
also allows layerwise optimization, but unfortunately, it was slower than stochastic gradient descent (SGD). In
(Chen et al., 2018) the authors proposed to use classical ODE solvers to compute the forward and backward equations associated with NN training. We note that none of the existing approaches used the dynamical systems viewpoint to allow more parallelization in the optimization process. A recent exception is the method in (Günther et al., 2018), where the authors used the so-called parareal method to parallelize across the time (layer) dimension of neural networks. However, the parareal method is known to be problematic for nonlinear systems (Gander & Hairer, 2008), and the efficiency reported by the authors is lower than that of our method and of other methods such as synthetic and delayed gradients.
Contributions. We exploit coarse and fine discretizations of continuous systems to compute predictions for the optimal states/costates of optimal control problems. The state/costate predictions allow us to develop a layerwise (parallel-in-time) optimization algorithm for RNNs (Section 3). We exploit the structure of the neural network training problem to show that the value function of the problem can be decomposed (Lemma 2.1). We also discuss the relationship between stochastic gradient descent, backpropagation and the costate (adjoint) variables in optimal control (Lemma 2.2). We establish appropriate conditions under which the proposed algorithm converges (Theorems 4.1 & 4.2). We report encouraging numerical results in Section 5.
2 Optimal Control & Residual Neural Networks
The focus of this paper is Residual Neural Networks (RNNs) because they achieve state-of-the-art performance in many learning tasks and can be reformulated as continuous-time optimal control problems. This reformulation offers several benefits, such as stable classifiers, memory-efficient algorithms and further insights into how and why they work so well (see, e.g.,
(E, 2017; Li et al., 2017; Haber & Ruthotto, 2018; Chen et al., 2018)). In this section, we review the dynamical systems viewpoint of RNN training. We review the links between the costate (dual) variables of optimal control problems and the most widely used methods to train neural networks, namely stochastic gradient descent and backpropagation (see Lemma 2.2). In Lemma 2.1 we make a simple observation that enables the decomposition of the optimal control problem across different initial conditions. This observation greatly simplifies the proof for a randomized version of our algorithm. For the origins and advantages of the formulation below we refer to (E, 2017; Haber & Ruthotto, 2018), where training with residual neural networks is reformulated as the following optimal control problem.
(1)   \Phi(y) := \min_{K(\cdot)} \; \frac{1}{s}\sum_{i=1}^{s} C(x_i(T)) + \int_0^T R(K(t))\,dt, \qquad \text{s.t. } \dot{x}_i(t) = f(x_i(t), K(t)), \quad x_i(0) = y_i, \quad i = 1,\dots,s.
Where C is a data fidelity term, and R is a regularizer. We note that the label for the i-th data-point has been subsumed in the definition of C. The regularizer could, in general, also be a function of the state x(t) and time t. Regularization is typically used to prevent overfitting and is therefore applied to the parameters K(t) of the network. Hence we assume that the regularizer is only a function of the control parameter K(t). The notation y = (y_1, …, y_s) is used to denote the initial conditions of the dynamics in (1), and we use y_i to denote the i-th element in y. The function f describes the activation dynamics applied at time t. For example, for the parameters K = (W, b), the activation function may be defined as f(x, K) = σ(Wx + b), where σ is applied element-wise. For the usual activation, regularization and data fidelity terms used in practice, the optimal control problem above is well defined (see also (Bardi & Capuzzo-Dolcetta, 1997)). The function Φ defined in (1) is called the value function of the problem. The value function, and its derivatives, will play a crucial role in this paper, and below we make a simple observation regarding the dimensionality of the value function. Using the principle of optimality (Bertsekas, 2017) we can rewrite the original model in (1) as follows,
(2)   \Phi(y) = \min_{K(\cdot)} \; \int_0^{T/2} R(K(t))\,dt + V\big(x(T/2)\big), \qquad \text{s.t. } \dot{x}_i(t) = f(x_i(t), K(t)), \quad x_i(0) = y_i, \quad i = 1,\dots,s,
Where V is defined as follows,
(3)   V(z) = \min_{K(\cdot)} \; \frac{1}{s}\sum_{i=1}^{s} C(x_i(T)) + \int_{T/2}^{T} R(K(t))\,dt, \qquad \text{s.t. } \dot{x}_i(t) = f(x_i(t), K(t)), \quad x_i(T/2) = z_i, \quad i = 1,\dots,s.
We take advantage of the structure of the problem to show that the value function can be written as a sum of identical functions from R^d to R.
Lemma 2.1.
Let V denote the optimal value function of (3). Then for any z = (z_1, …, z_s), there exists a function W : R^d → R such that V(z) = Σ_{i=1}^{s} W(z_i).
The result above simplifies the problem dramatically. It allows us to approximate the (s·d)-dimensional value function with a sum of d-dimensional functions.
2.1 Discretization
To approximate the problem in (1) we use a discretization scheme to obtain the following finite-dimensional optimization problem.
(4)   \min_{\{K_j\}} \; \frac{1}{s}\sum_{i=1}^{s} C(x_{i,N}) + h\sum_{j=0}^{N-1} R(K_j), \qquad \text{s.t. } x_{i,j+1} = F(x_{i,j}, K_j), \quad x_{i,0} = y_i, \quad i = 1,\dots,s.
Where we used a discretization parameter h > 0 such that, with Nh = T, we obtain the discrete-time optimal control problem in (4) with N time-steps. We use the notation F to denote an explicit discretization scheme of the dynamics in (1). For example, if an explicit Euler scheme is used then F(x, K) = x + h f(x, K). When we use the simple explicit Euler scheme with h = 1, the formulation above reduces to the standard RNN architecture. The deficiencies in terms of numerical stability of the explicit Euler scheme are well known in the numerical analysis literature (Butcher, 2016). In (Haber & Ruthotto, 2018) the authors argue that the use of the explicit Euler scheme for RNNs explains numerical issues related to the training of NNs such as exploding/vanishing gradients and classifier instability. To resolve these numerical issues the authors in (Haber & Ruthotto, 2018) propose to use stable discretization schemes. We follow the same line of reasoning, and show that a further advantage of appropriately defined discretization schemes is that they can be used to design convergent distributed algorithms to solve (4). Before we describe our approach we review serial-in-time methods and explain why it is difficult to design efficient distributed algorithms for (4).
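To make the role of the discretization scheme concrete, the following sketch (our own illustration, not the authors' code; the names f, euler_step and forward_solve, and the choice of tanh as the activation, are assumptions) shows how the explicit Euler propagator F(x, K) = x + h f(x, K) reproduces a standard residual layer when h = 1:

```python
import numpy as np

def f(x, W, b):
    """Activation dynamics f(x, K) = sigma(W x + b), with K = (W, b) and sigma = tanh."""
    return np.tanh(W @ x + b)

def euler_step(x, W, b, h):
    """One explicit Euler step x_{j+1} = x_j + h * f(x_j, K_j).
    With h = 1 this is exactly a standard residual (ResNet) layer."""
    return x + h * f(x, W, b)

def forward_solve(x0, Ws, bs, h):
    """Propagate the state through all N layers (time-steps), storing the trajectory."""
    xs = [x0]
    for W, b in zip(Ws, bs):
        xs.append(euler_step(xs[-1], W, b, h))
    return xs

# A toy 3-layer fine model on a 2-dimensional state.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((2, 2)) for _ in range(3)]
bs = [rng.standard_normal(2) for _ in range(3)]
traj = forward_solve(np.array([1.0, -1.0]), Ws, bs, h=0.1)
assert len(traj) == 4  # initial state plus one state per layer
```

Replacing euler_step with a different stable propagator (e.g., a Verlet step) changes the architecture without changing the rest of the training loop, which is the point of working with the abstract scheme F.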
Multilevel Discretization. Since (1) is a continuous-time model we can approximate it using different levels of discretization. To keep the notation simple and compact we assume that we want to solve (1) using a step-size of h_f. This gives rise to the model in (4), with N_f time-steps/layers. In our terminology the number of time-steps refers to the number of layers in the NN, and we use the terms time-steps and layers interchangeably. We call the model with a step-size of h_f the fine model. Later on we will take advantage of a coarse discretization of (1), and we refer to the resulting model as the coarse model. The time-step parameter of the coarse model is denoted by h_c > h_f, and we use N_c to denote the number of time-steps in the coarse model. The coarse model has fewer time-steps than the fine model, and it is therefore faster to optimize. As a practical example, suppose we are interested in the solution of (1) using a fine model with 1024 steps (layers), and suppose we also use a coarse model with 64 steps/layers. A forward/backward pass of the coarse model will then be roughly 16 times faster. The use of multiple levels of discretization is a well-known technique in the solution of optimal control problems, and has its origins in the multigrid literature (Briggs et al., 2000). From the multigrid literature we will use the idea of interpolation operators in order to transform a trajectory from the coarse discretization grid to the fine grid (see (Briggs et al., 2000; Haber & Ruthotto, 2018)).
2.2 Serial-in-Time Optimization
The most popular optimization algorithm for (4) is batch stochastic gradient descent (Curtis & Scheinberg, 2017; Bottou et al., 2018). Stochastic first-order methods compute the gradient of the objective function in (4) with respect to the parameters using backpropagation. In optimal control algorithms, the same gradient information is computed by solving the costate equations associated with the dynamics in (4) backwards. The connections between costates, adjoints, Lagrange multipliers and backpropagation are well known (see, e.g., (Baydin et al., 2018; LeCun et al., 1988)). We adopt the language used in scientific computing, and call the forward propagation the forward solve, and the backward propagation the backward solve. The forward solve is specified in Algorithm 1. This algorithm plays the same role as forward propagation in conventional NN algorithms. The difference is that we do not rely on explicit Euler discretization but use a discretization scheme with a forward propagator that is stable and consistent with the dynamics in (1). We use stable in the sense used in numerical analysis (see (Butcher, 2016)) and consistent in the sense used in optimization (see (Polak, 1997)). The backward solve is specified in Algorithm 2. The purpose of the backward solve is to generate the information needed to compute the gradient of the objective function of (4) with respect to the controls K_j. We use a stable and consistent scheme for the backward solve too. After the forward and backward equations are solved, the information generated is used in some algorithmic mapping denoted by M. This mapping generates a (hopefully) improved set of controls. A full iteration of a serial (in time) stochastic first-order algorithm consists of a forward solve, a backward solve, followed by an update of the controls. We state the standard serial-in-time algorithm in Algorithm 3.
In order to make our terminology more concrete, we show that for a specific choice of M, the procedure above reduces to the standard stochastic gradient method with backpropagation.
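The serial iteration (forward solve, backward solve, control update) can be sketched as follows for the explicit-Euler/tanh model, with gradient descent as the algorithmic mapping. This is a minimal illustration of ours, not the authors' implementation; the quadratic data-fidelity term C(x) = ½‖x − target‖² and all names are assumptions. Note how the costate p plays the role of the backpropagated error:

```python
import numpy as np

def forward(x0, Ws, bs, h):
    """Forward solve: x_{j+1} = x_j + h * tanh(W_j x_j + b_j)."""
    xs = [x0]
    for W, b in zip(Ws, bs):
        xs.append(xs[-1] + h * np.tanh(W @ xs[-1] + b))
    return xs

def backward(xs, Ws, bs, h, p_T):
    """Backward solve: propagate the costate p from the terminal condition
    p_N = grad C(x_N) down to layer 0, collecting per-layer control gradients."""
    p, grads = p_T, []
    for j in reversed(range(len(Ws))):
        a = np.tanh(Ws[j] @ xs[j] + bs[j])
        dz = (1.0 - a**2) * (h * p)              # chain rule through the nonlinearity
        grads.append((np.outer(dz, xs[j]), dz))  # (grad w.r.t. W_j, grad w.r.t. b_j)
        p = p + Ws[j].T @ dz                     # costate step p_j from p_{j+1}
    return list(reversed(grads))

def iteration(x0, target, Ws, bs, h, lr):
    """One full serial iteration: forward solve, backward solve, control update."""
    xs = forward(x0, Ws, bs, h)
    p_T = xs[-1] - target                        # grad of C(x) = 0.5||x - target||^2
    for j, (gW, gb) in enumerate(backward(xs, Ws, bs, h, p_T)):
        Ws[j] -= lr * gW
        bs[j] -= lr * gb
    return 0.5 * np.sum((xs[-1] - target) ** 2)

rng = np.random.default_rng(1)
Ws = [rng.standard_normal((2, 2)) * 0.1 for _ in range(4)]
bs = [np.zeros(2) for _ in range(4)]
losses = [iteration(np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                    Ws, bs, h=0.25, lr=0.1) for _ in range(200)]
assert losses[-1] < losses[0]  # the objective decreases over the iterations
```

The entirely sequential dependency is visible in the code: backward cannot start until forward returns, and each loop iteration needs the previous one, which is exactly the obstacle the parallel-in-time method removes.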
3 A ParallelinTime Method
It is challenging to parallelize optimization algorithms for the model in (4) because the backward solve cannot start before the forward solve finishes. Moreover, it is not possible to parallelize the forward solve because the state x_{j+1} cannot be computed before x_j is computed. Similarly, the costate p_{j+1} must be computed before p_j in the backward solve. Thus most algorithms only allow for data parallelism (e.g., across batches or in the calculation of f), but not across time/layers (see Section 1.1 for some exceptions).
(5)   p_{i,N} = \nabla C(x_{i,N}), \qquad p_{i,j} = \nabla_x F(x_{i,j}, K_j)^{\top} p_{i,j+1}, \quad j = N-1, \dots, 0.
Suppose (somehow) we had an approximately optimal state x(T/2) at the intermediate time T/2 and the corresponding costate variable p(T/2). Suppose we also had two processors, Processor A and Processor B; then we could potentially halve the cost of a full iteration of Algorithm 3. We achieve this reduction in time by using Processor A to do a backward solve from time T/2 to time 0, followed by a forward solve from time 0 to time T/2. In parallel, Processor B is able to do a forward solve from time T/2 to time T, followed by a backward solve from time T to time T/2. Thus, with two processors we can (potentially) halve the cost of a full iteration of any stochastic first-order method for (4). In reality, the reduction in time due to the extra processing power will be less than half (due to communication). The difficult issue is how to compute the approximately optimal intermediate state x(T/2) and costate p(T/2). To address this question we proceed in two phases: a prediction phase, followed by a correction phase. To be more precise, we first construct a coarse discretization of (1) which we solve approximately using Algorithm 3. We call this phase the global prediction phase because it generates global information regarding the optimal trajectory. The global prediction phase generates useful information, but it is not exact. To correct the prediction we have a second phase that we call the local correction phase. The local correction is the time-parallel phase where we solve discretized versions of (2) to find better initial conditions for (3) in parallel. Below we explain these two phases in turn.
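The two-processor schedule can be sketched as follows. The solves are replaced by trivial numeric stand-ins (placeholders of ours, not the real forward/backward solves); the point is only the communication pattern: both processors work concurrently on their half of the time horizon and exchange the interface state and costate at a synchronization step.

```python
from concurrent.futures import ThreadPoolExecutor

def processor_A(x_half, p_half):
    """Subsystem on [0, T/2]: backward solve from T/2 to 0 (using the predicted
    costate p_half), control update, then forward solve from 0 to T/2.
    The numeric update below merely stands in for the refreshed state at T/2."""
    return x_half + 0.1

def processor_B(x_half):
    """Subsystem on [T/2, T]: forward solve from T/2 to T (using the predicted
    state x_half), backward solve from T to T/2, control update.
    The numeric update below merely stands in for the refreshed costate at T/2."""
    return -x_half

# Predictions at the interface, as produced by the global prediction phase.
x_half, p_half = 1.0, -1.0
with ThreadPoolExecutor(max_workers=2) as pool:
    for _ in range(3):  # local correction iterations
        fut_A = pool.submit(processor_A, x_half, p_half)
        fut_B = pool.submit(processor_B, x_half)
        # Synchronization step: exchange interface state and costate.
        x_half, p_half = fut_A.result(), fut_B.result()
```

Because the two submissions run concurrently, the wall-clock cost of one correction iteration is roughly the cost of the slower of the two half-horizon solves, rather than their sum.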
3.1 Global Prediction Phase
In the global prediction phase we construct an approximate solution of (1) using a coarse discretization scheme. Our main assumption regarding the discretization scheme is that it provides a consistent approximation in the sense of (Polak, 1997). We use a large step-size h_c to approximate the model in (1) with a finite-dimensional optimization problem. We use a standard algorithm (e.g., Algorithm 3) to perform some iterations on the coarse model. We call this phase the global prediction phase because we use the coarse model to quickly generate global information about the solution of the model in (1). In this context, the global information is contained in the forward trajectory (good initial conditions to initialize the local correction phase), and the backward trajectory (sensitivity information regarding the initial conditions at time T/2). We also obtain a good initial point for the parameters K. The coarse model is only used to generate predictions regarding the optimal states and costates of the control problem. The predictions are corrected in parallel using the local correction phase described below.
3.2 Local Correction Phase
Using the information generated from the global prediction phase we split the original model into two subsystems. The first subsystem is responsible for identifying the optimal initial conditions for the second subsystem. The second subsystem receives the initial conditions from the first subsystem and is responsible for solving the classification problem. The second subsystem also passes information back to the first subsystem in the form of sensitivity information (from the costate variables).
The local correction phase is shown in Algorithm 4 for the case when the original model is decomposed into two subsystems. The left column in Algorithm 4 describes the steps to optimize the first subsystem, and the right column describes the steps for the second subsystem. Note that the two subsystems are solved in parallel. This is why we call the algorithm parallel-in-time. Our method computes the optimal parameters for times t ≥ T/2 without waiting for exact information from the past (times t < T/2).
We first describe the work that Processor A performs (left column) on the first subsystem, i.e., the optimization of (4) from time 0 to T/2 using a time-step of h_f. To start a backward solve at iteration k from time T/2 we need the state x^k(T/2), the controls K^k, and the costate p^k(T/2). In the first iteration of the local correction phase (k = 0), we compute x^0(T/2) and K^0 by simple interpolation from the coarse model. In subsequent iterations, x^k(T/2) and K^k are also available to the local correction phase of the first subsystem. This is because the first subsystem has all the information required from time 0 to time T/2. Unfortunately, the costate variable is not available at time T/2, because to compute it we need to perform a backward solve from time T to time T/2. Using coarse information only is not sufficient to build a good approximation of the costate variables. Instead, we approximate p^k(T/2) from state/costate observations from the prediction phase, along with state/costate observations collected from the second subsystem. In the first iteration of the local correction phase we only have the observations from the prediction phase to approximate p^0(T/2) (information from the second subsystem is not yet available). We use I^k to denote the set of state/costate pairs observed by the first subsystem at iteration k of the local correction phase. The prediction phase, after m iterations of the coarse model, produces the initial information set I^0 consisting of m state/costate pairs at time T/2. We use the state/costate pairs observed so far to (approximately) solve the following regression problem,
(6)   \min_{A, b} \; \sum_{(x, p) \in I^k} \| A x + b - p \|^2.
Using the solution of the linear regression problem above we can approximate the costate at any state x as p̂(x) = Âx + b̂ (where Â and b̂ are approximate solutions to the regression problem above). After the backward solve finishes, we update the controls, do a forward solve and pass the state at time T/2 to the second subsystem. We next discuss how to solve the problem from time T/2 to time T, and how to update the information set I^k. In the first iteration of the local correction phase, Processor B (right column in Algorithm 4) receives the approximate state x^0(T/2) and controls K^0 from the global prediction phase. Starting from x^0(T/2) it performs a forward solve, a backward solve and updates its controls. After the backward solve finishes, Processor B passes sensitivity information in the form of the costate variables to Processor A. Processor A then sets I^{k+1} = I^k ∪ {(x^k(T/2), p^k(T/2))}. The same steps are then repeated by both processors. We use the notation p̂(x; I^k) when the regression problem in (6) is solved with the information set I^k. With a slight abuse of notation, we write p̂(x^k(T/2); I^k) to denote the approximate costate information obtained using the solution of the regression problem in (6).
Remark 1 (More than two subsystems).
We described the algorithm using only two subsystems. To use more than two subsystems, we observe that the second subsystem is a standard optimal control problem from time T/2 to time T. We can therefore use the same procedure described in this section to divide the second subsystem into two. Continuing in this way, we can divide the system into as many subsystems as required.
Remark 2 (Minibatches and asynchronous computation).
To simplify the notation we did not describe the algorithm using minibatches. In order to use minibatches we just change Algorithms 1 and 2 to use minibatches. We then change the notation so that x^{k,b}_j denotes the state at iteration k for batch b at time-step j (with similar notation for the control and costate variables). Finally, Algorithm 4 has a synchronization step after each subsystem completes a single iteration. This is how the algorithm was analyzed and implemented, and we leave the asynchronous version for future work.
4 Convergence Analysis
In this section we summarize the theoretical convergence results for Algorithm 4. All proofs and technical lemmas appear in the supplementary material. Algorithm 4 is quite general because we do not specify the algorithmic mapping M used in the update step, or the discretization scheme used to derive (4). In order to keep the convergence analysis as close to the numerical implementation as possible, we use the algorithmic mapping specified in Lemma 2.2. It is possible to establish similar convergence results for other schemes too. For example, because our method can decompose a large network into smaller subnetworks, it might be possible to use second-order methods. We leave such refinements of the scheme to future work. Moreover, as discussed in Remark 2, the fact that the algorithm has a synchronization step simplifies the analysis. This simplifying assumption allows us to analyze the algorithm as if it were run on a single processor.
The starting point of our analysis is the result in Lemma 2.1. It allows us to compute the optimal value function as a sum of d-dimensional functions, as opposed to a single (s·d)-dimensional function. This is an important insight, because we know from the maximum principle (Fattorini, 1999) that at the optimal solution the costate equals the gradient of the value function along the optimal trajectory (where the i-th trajectory corresponds to the i-th initial condition in (1)). Since W is a function from R^d to R, it follows that the adjoint ∇W is a function from R^d to R^d. In fact, in the proof of Lemma 2.2 we show that,
\[ p_i(T/2) = \nabla W\big(x_i(T/2)\big), \quad i = 1, \dots, s. \]
The equation above explains why the costate variables provide sensitivity information for the first subsystem. The main additional assumption we need to prove the convergence of Algorithm 4 (beyond the assumptions needed for any stochastic first-order method) is that there exists an ε > 0 such that,
(7)   \big\| \hat{p}(x; I^k) - p(x) \big\| \le \epsilon \quad \text{for all } k.
We note that, at least in principle, such an ε is guaranteed to exist. The more interesting question is, of course, whether this constant is small or not. The theoretical results below assume an arbitrary ε. In practice, we found that this constant is small. In Section 5 we conjecture that this constant is small because we use data from several levels of discretization to construct the information set I^k. As was mentioned in the introduction, the regression step in the local correction phase has similarities with the so-called synthetic gradient method proposed in (Jaderberg et al., 2017). The conceptual differences between synthetic gradients and the proposed method were detailed in Section 1.1. However, the convergence analysis and assumptions are different too. For example, in (Jaderberg et al., 2017) the authors assume that the synthetic gradient coincides with the true gradient. Our assumptions are weaker. In addition, the convergence analysis in (Jaderberg et al., 2017) is for gradient descent and not for stochastic gradient descent. Our first convergence result is for the case where the full gradient is used.
Theorem 4.1 (Reduction in objective function).
Suppose that assumption (7) holds and that the step-size is sufficiently small. Then the objective function of (4) is reduced at every iteration up to a term controlled by ε, and in particular,
\[ \Phi(K^{k+1}) \le \Phi(K^k) - c_1 \big\| \nabla \Phi(K^k) \big\|^2 + c_2\,\epsilon, \]
where the scalars c_1 and c_2 are positive and depend only on L (the Lipschitz constant of ∇Φ) and the step-size.
We note that if ε = 0, then Theorem 4.1 gives exactly the same result as the gradient descent method for (4). Our next result deals with the case when the algorithm is run using minibatches.
Theorem 4.2.
We note that for the theorem above to hold, some additional restrictions on the step-sizes are required (see the online supplement for additional discussion).
5 Numerical Experiments
In this section, we report preliminary numerical results for Algorithm 4. This paper aims to establish a framework for time-parallel training of neural networks, not to report extensive numerical experiments; still, it is essential to show that the algorithm is promising. We implemented the algorithm on a standard computer with a quad-core processor and 8GB of RAM, and we decomposed the system into only two subsystems. We believe that the performance of our algorithm will improve with more than two subsystems, and when ported to systems with a large number of cores. Below we report results from three datasets. The first two datasets (concentric ellipses and swiss roll) are relatively low dimensional. To save space, we refer to (Haber & Ruthotto, 2018) for a description of these two datasets. The third dataset is the well-known MNIST dataset. For large-scale problems, the cost of approximately solving a coarse model becomes a less significant part of the overall cost. Therefore, we believe that the efficiency of our method will improve for large-scale problems. Indeed, we see that for the MNIST dataset our algorithm achieves good speedups even for relatively shallow neural networks. Extensive numerical experiments are, however, beyond the scope of the current paper. The code is available from the authors’ GitHub page (and in the online supplement during the review phase).
In Section 1.1 we mentioned several approaches for layerwise (time-wise) parallelization of neural networks. Because the synthetic gradient method of (Jaderberg et al., 2017) is closely related to Algorithm 4, we only compare against a stable version of the synthetic gradient approach. We also compare different variants of our method against the data-parallel implementation of SGD in PyTorch 0.4.1. It would be interesting to compare all the different approaches described in Section 1.1, but this is not the aim of this paper. In (Huo et al., 2018) the authors compared several layerwise parallelization frameworks and concluded that their method, when tested on ResNet architectures on the CIFAR datasets, outperformed others and achieved speedups of 15% to 50% without significant loss of accuracy. Below we report similar results, but with higher parallel efficiency. In (Huo et al., 2018) the authors also compared their method against the synthetic gradient method in (Jaderberg et al., 2017) and found that their method significantly outperforms synthetic gradients. Their implementation of synthetic gradients, however, was based on network architectures that were shown to be unstable in (Haber & Ruthotto, 2018). So in our view, more careful numerical experiments are needed in order to decide the merits of the different approaches; these are beyond the scope of this paper.
Discretization schemes. We considered two discretization schemes to derive the model in (4). The first is based on an explicit Euler scheme and the second on symplectic integration using the Verlet method. We chose the explicit Euler scheme because it gives rise to the standard ResNet architecture. We chose the Verlet scheme because it was shown to perform well in previous work (Haber & Ruthotto, 2018). We also tested our method with and without the global prediction phase. When the global prediction phase is used we refer to our method as the multilevel parallel-in-time algorithm; when we do not use the global prediction phase we call our method the single-level parallel-in-time method. The regression step in our single-level parallel-in-time method is similar to the decoupled neural interfaces method with synthetic gradients of (Jaderberg et al., 2017). However, our single-level method is implemented with a stable discretization scheme. It is shown below that the discretization scheme has a significant impact on the parallel efficiency and accuracy of the method.
The coarse models we used are exactly half the size of the fine models (e.g., if the fine model has 64 layers (steps) then the coarse model is constructed with 32 layers (steps)).
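The mapping of a coarse control trajectory onto the fine grid can be sketched with a simple piecewise-linear prolongation. This is a modeling choice on our part (the paper only specifies that a multigrid-style interpolation operator is used, and the function name is ours):

```python
import numpy as np

def prolong(coarse, n_fine):
    """Map a coarse trajectory of controls onto the fine time grid by
    piecewise-linear interpolation (a simple multigrid prolongation operator).
    `coarse` has one row per coarse time-step, one column per control coordinate."""
    coarse = np.asarray(coarse)
    t_coarse = np.linspace(0.0, 1.0, coarse.shape[0])
    t_fine = np.linspace(0.0, 1.0, n_fine)
    # interpolate each control coordinate independently
    return np.stack([np.interp(t_fine, t_coarse, coarse[:, k])
                     for k in range(coarse.shape[1])], axis=1)

K_coarse = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])  # 3 coarse steps
K_fine = prolong(K_coarse, 5)                               # 5 fine steps
assert K_fine.shape == (5, 2)
```

The endpoints of the coarse trajectory are reproduced exactly, and intermediate fine-grid layers receive interpolated controls, which gives the fine model a warm start after the global prediction phase.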
Accuracy of serial and parallel-in-time methods. The first observation from our results is that the accuracy of our method (especially with the Verlet discretization scheme) is similar to the accuracy obtained with the data-parallel SGD method. It is clear from Figures 5, 6 and 7 (in the supplement) that the parallel-in-time method produces results with accuracy similar to that of Stochastic Gradient Descent (SGD).
Speedup and parallel efficiency results. Figure 1 summarizes the speedup obtained from our method against the data-parallel implementation of SGD in PyTorch. For our parallel-in-time method we include the time needed to solve the coarse (when used) and fine models in the speedup calculations. We report results for the explicit discretization scheme without the global prediction phase, i.e., using a single-level discretization; in Figure 1 we refer to this method as the single-level Euler method. We also report results using the multilevel scheme (i.e., using the global prediction phase) and the Verlet discretization scheme; in Figure 1 we refer to this method as the multilevel Verlet method. We observed similar speedups for both the ellipse and SwissRoll datasets, and so to save space we only report the results from the SwissRoll dataset in Figure 1 (see Figure 4 in the online supplement for the ellipse dataset). From Figure 1 we see that for relatively small datasets (e.g., the ellipse and SwissRoll datasets) there is a cutoff point (around 32 to 64 layers) after which our method is faster than the data-parallel implementation. Since we only use two processors, we observe that, for deep networks, our method achieves an efficiency of about 75%. This is substantially better than the speedup efficiency of 50% reported in (Huo et al., 2018). Our results compare even more favorably with the speedup efficiency of 34% reported in (Günther et al., 2018) for an alternative parallel-in-time method.
In Figure 2 we summarize the results for the MNIST dataset. For MNIST we see that our method is faster than SGD even for relatively shallow networks (4 layers). From these results we see that our method achieves much better speedups on MNIST than the speedups reported in the literature for other methods. Moreover, the efficiency of over 75% on the MNIST dataset suggests that the communication overheads of our method are small. These results support our claim that our method will have an advantage over existing methods for larger models: for large models, the time spent solving the coarse model is a small proportion of the total solution time.
Increased stability due to the global prediction phase. To test the impact of the global prediction phase using the coarse model, we report the mean-square errors from the regression step in the backward solve of Algorithm 4. When the global prediction step is not used, our method is similar to the synthetic gradient method from (Jaderberg et al., 2017). We see from Figure 3 that when the global prediction phase is not used, the mean-square error of the regression step varies significantly in the first 20 iterations before converging to a nonzero value. When the global prediction phase is used, we see from Figure 3 (left y-axis) that the MSE is an order of magnitude lower and eventually converges to zero. These results explain why our method is so efficient, and they provide empirical validation for the assumption in (7) required to prove the convergence of our method. The results in Figure 3 are for the Swiss-Roll dataset with 512 layers using the Verlet discretization scheme. Similar behavior was observed in the other models.
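The quantity tracked in Figure 3 is a mean-square error between the costates predicted by the regression step and the costates later produced by the exact backward solve. A minimal sketch of this computation, using hypothetical stand-in values rather than quantities from our experiments:

```python
# Minimal sketch of the mean-square error plotted in Figure 3: the error
# between regression-predicted costates and the costates from the exact
# backward solve. All numbers below are hypothetical stand-ins.

def mean_square_error(predicted, exact):
    """Average squared componentwise difference between two equal-length vectors."""
    assert len(predicted) == len(exact)
    return sum((p - e) ** 2 for p, e in zip(predicted, exact)) / len(predicted)

predicted_costates = [0.10, -0.42, 0.33, 0.05]
exact_costates     = [0.12, -0.40, 0.30, 0.06]

mse = mean_square_error(predicted_costates, exact_costates)
print(f"{mse:.2e}")  # prints: 4.50e-04
```

A decreasing sequence of these errors over iterations corresponds to the convergence to zero observed when the global prediction phase is enabled.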
6 Conclusions
We proposed a novel parallel-in-time distributed optimization method for neural networks. The method exploits the dynamical-systems view of residual neural networks to parallelize the optimization process across time (layers). We discussed how to take advantage of multilevel discretization schemes in order to predict the optimal states and costates of the control model, and we established the convergence of our method. Our initial numerical results suggest that the proposed method matches the accuracy of Stochastic Gradient Descent, reduces computation time significantly, and achieves higher parallel efficiency than existing methods. The method can be improved in several ways, including an asynchronous implementation, the use of more than two discretization levels, and decomposing the original network into several components. More detailed numerical experiments are needed to understand the behavior of the method for large datasets, but the initial efficiency results are extremely encouraging.
References
 Bardi & Capuzzo-Dolcetta (1997) Bardi, M. and Capuzzo-Dolcetta, I. Optimal control and viscosity solutions of Hamilton-Jacobi-Bellman equations. Systems & Control: Foundations & Applications. Birkhäuser Boston, Inc., Boston, MA, 1997.
 Baydin et al. (2018) Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18:1–43, 2018.
 Bertsekas (2017) Bertsekas, D. P. Dynamic programming and optimal control. Vol. I. Athena Scientific, Belmont, MA, fourth edition, 2017. ISBN 9781886529434; 1886529434; 1886529086.
 Bottou et al. (2018) Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for largescale machine learning. SIAM Review, 60(2):223–311, 2018.
 Briggs et al. (2000) Briggs, W. L., McCormick, S. F., et al. A multigrid tutorial, volume 72. SIAM, 2000.

 Butcher (2016) Butcher, J. C. Numerical methods for ordinary differential equations. John Wiley & Sons, 2016.
 Carreira-Perpinan & Wang (2014) Carreira-Perpinan, M. and Wang, W. Distributed optimization of deeply nested systems. In Artificial Intelligence and Statistics, pp. 10–19, 2014.
 Chen et al. (2018) Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations. arXiv preprint arXiv:1806.07366, 2018.
 Choromanska et al. (2018) Choromanska, A., Kumaravel, S., Luss, R., Rish, I., Kingsbury, B., Tejwani, R., and Bouneffouf, D. Beyond backprop: Alternating minimization with coactivation memory. arXiv preprint arXiv:1806.09077, 2018.

Curtis & Scheinberg (2017)
Curtis, F. E. and Scheinberg, K.
Optimization methods for supervised machine learning: From linear models to deep learning.
In Leading Developments from INFORMS Communities, pp. 89–114. INFORMS, 2017.  Dennard et al. (1974) Dennard, R., Gaensslen, F., Rideout, V., Bassous, E., and LeBlanc, A. Design of ionimplanted mosfet’s with very small physical dimensions. IEEE Journal of SolidState Circuits, 9(5):256–268, 1974.
 E (2017) E, W. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, Mar 2017. ISSN 2194-671X. doi: 10.1007/s40304-017-0103-z.
 Fattorini (1999) Fattorini, H. O. Infinite-dimensional optimization and control theory, volume 62 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge, 1999. ISBN 0521451256. doi: 10.1017/CBO9780511574795.
 Gander & Hairer (2008) Gander, M. J. and Hairer, E. Nonlinear convergence analysis for the parareal algorithm. In Domain decomposition methods in science and engineering XVII, pp. 45–56. Springer, 2008.
 Günther et al. (2018) Günther, S., Ruthotto, L., Schroder, J., Cyr, E., and Gauger, N. Layerparallel training of deep residual neural networks. arXiv preprint arXiv:1812.04352, 2018.
 Haber & Ruthotto (2018) Haber, E. and Ruthotto, L. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2018.
 Huo et al. (2018) Huo, Z., Gu, B., Yang, Q., and Huang, H. Decoupled parallel backpropagation with convergence guarantee. arXiv preprint arXiv:1804.10574, 2018.
 Jaderberg et al. (2017) Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., Silver, D., and Kavukcuoglu, K. Decoupled neural interfaces using synthetic gradients. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1627–1635. JMLR.org, 2017.
 Koomey et al. (2011) Koomey, J., Berard, S., Sanchez, M., and Wong, H. Implications of historical trends in the electrical efficiency of computing. IEEE Annals of the History of Computing, 33(3):46–54, 2011.
 Lau et al. (2018) Lau, T. T.-K., Zeng, J., Wu, B., and Yao, Y. A proximal block coordinate descent algorithm for deep neural network training. arXiv preprint arXiv:1803.09082, 2018.
 LeCun et al. (1988) LeCun, Y., Touresky, D., Hinton, G., and Sejnowski, T. A theoretical framework for backpropagation. In Proceedings of the 1988 connectionist models summer school, volume 1, pp. 21–28. CMU, Pittsburgh, Pa: Morgan Kaufmann, 1988.
 Li et al. (2017) Li, Q., Chen, L., Tai, C., and Weinan, E. Maximum principle based algorithms for deep learning. The Journal of Machine Learning Research, 18(1):5998–6026, 2017.
 Polak (1997) Polak, E. Optimization, volume 124 of Applied Mathematical Sciences. Springer-Verlag, New York, 1997. ISBN 0387949712. doi: 10.1007/978-1-4612-0663-7. Algorithms and consistent approximations.
 Zeng et al. (2018) Zeng, J., Lau, T. T.-K., Lin, S., and Yao, Y. Block coordinate descent for deep learning: Unified convergence guarantees. arXiv preprint arXiv:1803.00225, 2018.
7 Supplementary Material for: Predict Globally, Correct Locally: Parallel-in-Time Optimal Control of Neural Networks
7.1 Notation
In this section we briefly define our notation. We write $x_t^{k,b,h}$ for the state at time $t$, iteration $k$, batch $b$, with discretization parameter $h$. When it is clear from context we drop $b$ and $h$ and write $x_t^k$ instead. When it is relevant to identify or sum over the different timesteps we use the notation $x^k = (x_0^k, \dots, x_T^k)$ to denote the vector form of the states. We use the same conventions for the exact costate variables $p_t^k$, the approximate costates $\tilde{p}_t^k$, and the control parameters/weights $\theta_t^k$. Below all norms are $\ell_2$ norms. Algorithm 4 has a synchronization step. This assumption implies that if the algorithmic mapping in Lemma 2.2 is used then the control parameters are updated as follows,
(8) $\theta^{k+1} = \theta^k - \alpha\, d^k,$
where $d^k$ is the search direction assembled from the states and costates produced by the forward and backward solves. We note that if gradient descent is used to solve (4) then Algorithm 3 reduces to
$\theta^{k+1} = \theta^k - \alpha\, \nabla J(\theta^k),$
where $J$ is the objective function of (4), and the gradient is calculated using backpropagation. Below we also use the shorthand notation $g^k = \nabla J(\theta^k)$, so that with this simplified notation the iteration above reads $\theta^{k+1} = \theta^k - \alpha\, g^k$.
When the algorithm is run with minibatches, we use $\mathbb{E}[\cdot]$ to denote the total expectation of a random variable, and $\mathbb{E}_k[\cdot]$ to denote the expectation conditional on the information available up to and including iteration $k$.
7.2 Assumptions
In this section we outline our main assumptions.

The objective function ($J$) in (4) has a Lipschitz continuous gradient. We denote the Lipschitz constant by $L$.

The error in the adjoint calculation satisfies the inequality,
(9)
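Written out, a standard form of the first assumption above is the following; the symbols $J$ for the objective of (4) and $L$ for the Lipschitz constant are our choices here, for illustration:

```latex
\|\nabla J(\theta) - \nabla J(\theta')\| \le L\,\|\theta - \theta'\|
\qquad \text{for all } \theta,\, \theta'.
```

A standard consequence of this inequality, used throughout smooth-optimization convergence proofs, is the descent inequality $J(\theta') \le J(\theta) + \nabla J(\theta)^{\top}(\theta' - \theta) + \tfrac{L}{2}\|\theta' - \theta\|^2$.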
7.3 Proofs for Section 2
Proof.
(Of Lemma 2.1).
This result can easily be established by induction.
At the terminal time we must have,
Therefore the lemma is true at the terminal time. Suppose now that the lemma is true at some timestep; we show that it is true at the preceding timestep as well. Given an optimal solution, by assumption we have,
and by a suitable choice of candidate solution the result follows. ∎
Proof.
(Of Lemma 2.2).
At the terminal time, it follows from the boundary condition enforced by Algorithm 2 that the approximate and exact costates coincide there. Taking one step of the backward solve, we obtain,
Using the same argument recursively we conclude that,
(10) 
Using the preceding equation we obtain,
It follows that,
and we conclude that SGD with this learning rate produces the same iterations as Algorithm 3. ∎
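The equivalence in Lemma 2.2 between the backward (costate) solve and backpropagation can be checked numerically on a toy problem. The sketch below uses a hypothetical scalar residual step $x_{t+1} = x_t + h\,\theta_t \tanh(x_t)$ as stand-in dynamics (not the paper's network), and verifies that the costate recursion reproduces a finite-difference gradient of the terminal loss:

```python
import math

# Toy scalar residual network: x_{t+1} = x_t + h * theta_t * tanh(x_t).
# We check that the backward costate recursion gives the same gradient
# as a finite-difference approximation. All values are illustrative.

def forward(x0, thetas, h):
    """Forward propagation: returns the full state trajectory."""
    xs = [x0]
    for th in thetas:
        xs.append(xs[-1] + h * th * math.tanh(xs[-1]))
    return xs

def loss(xT, target=1.0):
    return 0.5 * (xT - target) ** 2

def adjoint_gradients(x0, thetas, h, target=1.0):
    """Backward (costate) solve: p_T = dL/dx_T, then propagate p backward."""
    xs = forward(x0, thetas, h)
    p = xs[-1] - target                               # terminal costate
    grads = [0.0] * len(thetas)
    for t in reversed(range(len(thetas))):
        x, th = xs[t], thetas[t]
        grads[t] = h * math.tanh(x) * p               # dL/dtheta_t
        p = p + h * th * (1 - math.tanh(x) ** 2) * p  # costate recursion
    return grads

x0, h = 0.3, 0.1
thetas = [0.5, -0.2, 0.8, 0.1]
adj = adjoint_gradients(x0, thetas, h)

# Central finite-difference check of dL/dtheta_0.
eps = 1e-6
tp = list(thetas); tp[0] += eps
tm = list(thetas); tm[0] -= eps
fd = (loss(forward(x0, tp, h)[-1]) - loss(forward(x0, tm, h)[-1])) / (2 * eps)
print(abs(adj[0] - fd) < 1e-6)  # prints: True
```

The costate update multiplies by the Jacobian of the residual step, which is exactly what reverse-mode backpropagation does layer by layer.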
7.4 Proofs for Section 4
Proof.
(Of Theorem 4.1) Because the objective has a Lipschitz continuous gradient, it follows from (8) that,
and under the stepsize condition the result follows.
∎
Next we discuss the case when Algorithm 4 is run using minibatches. In this case, the iteration in (8) is replaced with the following,
where the gradient is replaced by a sampled (randomized) version. The following assumption on the sampling process is standard (see, e.g., (Bottou et al., 2018)),
(11) 
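A standard form of this sampling assumption, following (Bottou et al., 2018), is the following; the symbols $g^k$ for the sampled gradient at iteration $k$ and $\sigma^2$ for the variance bound are our choices here, for illustration:

```latex
\mathbb{E}_k\!\left[\,g^k\,\right] = \nabla J(\theta^k),
\qquad
\mathbb{E}_k\!\left[\,\|g^k - \nabla J(\theta^k)\|^2\,\right] \le \sigma^2 .
```

That is, the minibatch gradient is an unbiased estimator of the true gradient with bounded conditional variance.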
We now establish some technical lemmas that are needed for the proof of Theorem 4.2.
Lemma 7.1.
Suppose that
Then,
Proof.
Lemma 7.2.
Suppose that
(12) 
Then,
Proof.
Lemma 7.3.
Suppose that Algorithm 4 is run with a fixed stepsize such that,
(14) 
then after $K$ iterations the following holds,
Proof.
Taking expectations on the bound obtained in Lemma 7.2 and using the law of total expectation we obtain,
Summing from the first iteration to iteration $K$ we obtain,
Rearranging the inequality above we obtain the required result. ∎