1 Introduction
In this work, we consider the following optimal control problem (OCP) in the discrete-time setting:
(OCP)  $\min_{\{u_t\}_{t=0}^{T-1}} \; \phi(x_T) + \sum_{t=0}^{T-1} \ell_t(x_t, u_t), \quad \text{s.t.} \quad x_{t+1} = f_t(x_t, u_t),$
where $x_t$ and $u_t$ represent the state and control at each time step $t$. $f_t$, $\ell_t$, and $\phi$ respectively denote the nonlinear dynamics, intermediate cost, and terminal cost functions. OCP aims to find a control trajectory, $\{u_t\}_{t=0}^{T-1}$, such that the accumulated cost over the finite horizon $T$ is minimized. Problems of the form of OCP appear in multidisciplinary areas, since it describes a generic multi-stage decision-making problem, and have gained commensurate interest recently in deep learning.
Central to the research along this line is the interpretation of DNNs as discrete-time nonlinear dynamical systems, in which each layer is viewed as a distinct time step (weinan2017proposal). The dynamical-systems perspective provides a mathematically sound explanation for the success of certain DNN architectures (lu2019understanding). It also enables principled architecture design by bringing in rich analysis from numerical differential equations (lu2017beyond; long2017pde) and discrete mechanics (greydanus2019hamiltonian; zhong2019symplectic) when learning problems inherit physical structure. In the continuum limit of depth, chen2018neural parametrized an ordinary differential equation (ODE) directly using DNNs, with liu2019neural later extending the framework to stochastic dynamics. From the optimization viewpoint, iterative algorithms for solving OCP have been shown to admit a control interpretation (hu2017control). When the network weight is recast as the control variable, OCP describes, without any loss of generality, the training objective composed of layerwise losses (e.g. weight regularization) and a terminal loss (e.g. cross-entropy). This connection has been explored mathematically in many recent works (han2018mean; seidman2020robust; liu2019deep). While these works provide theoretical statements on convergence and generalization, the algorithmic development remains relatively limited. Specifically, previous works primarily focus on applying the first-order optimality conditions, provided by the Pontryagin principle (boltyanskii1960theory), to architectures restricted to residual blocks or discrete weights (li2018optimal; li2017maximum).
In this work, we take a parallel path based on Approximate Dynamic Programming (ADP, bertsekas1995dynamic), a technique particularly designed to solve complex Markovian decision processes. For this class of problems, ADP has been shown to be numerically superior to direct optimization such as Newton's method, since it takes into account the temporal structure inherent in OCP (liao1992advantages). The resulting update features a locally optimal feedback policy at each stage (see Fig. 1), in contrast to the one derived from Pontryagin's principle. In the application to DNNs, we show through experiments that these policies help improve training convergence and reduce sensitivity to hyperparameter changes.
Of particular interest among practical ADP algorithms is Differential Dynamic Programming (DDP, jacobson1970differential). DDP is a second-order method that has been used extensively for complex trajectory optimization in robotics (tassa2014control; posa2016optimization), and in this work we further show that existing first- and second-order methods for training DNNs can be derived from DDP as special cases (see Table 1). Such an intriguing connection can be beneficial to both sides. While we can leverage recent advances in efficient curvature factorization of the loss Hessian (martens2015optimizing; george2018fast) to relieve the computational burden in DDP, computing feedback policies stands as an independent module; thus it can be integrated into existing first- and second-order methods.
The concept of a feedback mechanism has already shown up in the study of network design, where the terminology typically refers to connections between modules over training (chung2015gated; wen2018deep) or successive prediction for vision applications (zamir2017feedback; li2019feedback). Conceptually, shama2019adversarial and huh2019feedback are perhaps most related to our work; there, the authors proposed to reuse the discriminator of a Generative Adversarial Network as a feedback module during training or test time. We note, however, that neither the problem formulation nor the mapping space of the feedback module is the same as ours. Our feedback policy originates from optimal control theory, in which the control update must compensate for state disturbances throughout propagation.
The paper is organized as follows. In Section 2 we go over background on the optimality conditions for OCP and review the DDP algorithm. The connection between DNN training and trajectory optimization is solidified in Section 3, with a practical algorithm presented in Section 4. We provide empirical results and future directions in Sections 5 and 6.
2 Preliminaries
2.1 Optimality Conditions to OCP
The development of the optimality conditions for OCP dates back to the 1960s, characterized by the Pontryagin Maximum Principle (PMP) and Dynamic Programming (DP). We detail the two approaches below.
Theorem 1 (Discrete-time PMP (todorov2006optimal)). Let $\{u_t^*\}_{t=0}^{T-1}$ be an optimal control trajectory for OCP and $\{x_t^*\}_{t=0}^{T}$ the corresponding state trajectory. Define the Hamiltonian $H_t(x_t, p_{t+1}, u_t) := \ell_t(x_t, u_t) + p_{t+1}^\top f_t(x_t, u_t)$. Then there exists a costate trajectory $\{p_t^*\}$ such that

(1)  $x_{t+1}^* = \nabla_p H_t(x_t^*, p_{t+1}^*, u_t^*) = f_t(x_t^*, u_t^*),$
(2)  $p_t^* = \nabla_x H_t(x_t^*, p_{t+1}^*, u_t^*), \quad p_T^* = \nabla_x \phi(x_T^*),$
(3)  $u_t^* = \arg\min_{u} H_t(x_t^*, p_{t+1}^*, u).$
The discrete-time PMP theorem can be derived using the KKT conditions, in which the costate is equivalent to the Lagrange multiplier. As we will see in Section 3.1, the adjoint dynamics Eq. (2) have a direct link to Backpropagation. Note that the solution to Eq. (3) admits an open-loop process in the sense that it does not depend on the state variables. This is in contrast to the Dynamic Programming (DP) principle, in which a feedback policy is considered.
Theorem 2 (DP (bellman1954theory)). Define the value function $V_t$ backward in time by $V_T(x) = \phi(x)$ and

(4)  $V_t(x) = \min_{u} \; \ell_t(x, u) + V_{t+1}(f_t(x, u)).$

Then $V_0(x_0)$ is the optimal objective value of OCP, and the minimizer $u_t^*(x) = \arg\min_u \; \ell_t(x, u) + V_{t+1}(f_t(x, u))$ gives an optimal feedback policy.
Hereafter we denote the objective involved at each decision stage in Eq. (4) as the Bellman objective $Q_t$:

(5)  $Q_t(x_t, u_t) := \ell_t(x_t, u_t) + V_{t+1}(f_t(x_t, u_t)).$
The principle of DP recasts the problem of minimizing over a control sequence as a sequence of minimizations over each control. The value function $V_t$ summarizes the optimal cost-to-go at each stage, provided the following control subsequence is also optimal. $u_t^*(\cdot)$ is an optimal feedback law in a globally convergent closed-loop system.
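The backward recursion in Eq. (4) can be made concrete on a toy tabular problem. The following sketch (the chain problem and all names are illustrative, not from the paper) computes the value function and feedback policy for a 5-state, 5-step problem:

```python
import numpy as np

# Finite-horizon tabular DP: V_t(x) = min_u [ l(x,u) + V_{t+1}(f(x,u)) ].
# Toy chain: states 0..4, controls {-1, 0, +1}, goal state 4.
T, n_states = 5, 5
controls = [-1, 0, 1]

def f(x, u):                      # deterministic dynamics, clipped to the grid
    return min(max(x + u, 0), n_states - 1)

def cost(x, u):                   # stage cost: distance to goal plus control effort
    return abs(x - 4) + 0.1 * abs(u)

V = np.abs(np.arange(n_states) - 4).astype(float)   # terminal cost phi(x)
policy = []
for t in reversed(range(T)):      # Bellman backward recursion
    V_new = np.empty(n_states)
    pi_t = np.empty(n_states, dtype=int)
    for x in range(n_states):
        q = [cost(x, u) + V[f(x, u)] for u in controls]
        pi_t[x] = int(np.argmin(q))
        V_new[x] = min(q)
    V, policy = V_new, [pi_t] + policy

# The optimal feedback policy drives any initial state toward the goal.
x = 0
for t in range(T):
    x = f(x, controls[policy[t][x]])
print(x)  # -> 4
```

Note that the policy is a feedback law: it prescribes a control for every state at every stage, not merely for the single trajectory that is eventually simulated.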
2.2 Differential Dynamic Programming
Trajectory optimization algorithms typically rely on solving the optimality equations from PMP or DP. Unfortunately, solving the Bellman equation in high-dimensional problems is infeasible without approximation, well known as Bellman's curse of dimensionality. To mitigate the computational burden of the minimization involved at each stage, one can replace the Bellman objective in Eq. (4) with its second-order approximation. Such an approximation is central to Differential Dynamic Programming (DDP), a second-order method that inherits a similar Bellman optimality structure while being computationally efficient. Given a nominal trajectory $(\bar{x}, \bar{u}) := \{(\bar{x}_t, \bar{u}_t)\}$, DDP iteratively optimizes its accumulated cost, where each iteration consists of a backward and a forward pass, detailed below.
Backward pass: At each time step, DDP expands the Bellman objective $Q_t$ at $(\bar{x}_t, \bar{u}_t)$ up to second order, i.e. $Q_t(\bar{x}_t + \delta x_t, \bar{u}_t + \delta u_t) \approx Q_t(\bar{x}_t, \bar{u}_t) + \delta Q_t(\delta x_t, \delta u_t)$, where $\delta Q_t$ is expanded as^{1}

$\delta Q(\delta x, \delta u) = Q_x^\top \delta x + Q_u^\top \delta u + \tfrac{1}{2}\delta x^\top Q_{xx}\delta x + \tfrac{1}{2}\delta u^\top Q_{uu}\delta u + \delta u^\top Q_{ux}\delta x,$

^{1} We drop the time step $t$ in all derivatives of $Q$ for simplicity.
with each term given by

(6a)  $Q_x = \ell_x + f_x^\top V'_x$
(6b)  $Q_u = \ell_u + f_u^\top V'_x$
(6c)  $Q_{xx} = \ell_{xx} + f_x^\top V'_{xx} f_x + V'_x \cdot f_{xx}$
(6d)  $Q_{uu} = \ell_{uu} + f_u^\top V'_{xx} f_u + V'_x \cdot f_{uu}$
(6e)  $Q_{ux} = \ell_{ux} + f_u^\top V'_{xx} f_x + V'_x \cdot f_{ux}$
$V'$ denotes the value function at the next time step, and $V'_x$, $V'_{xx}$ its derivatives evaluated at the next state. The dot notation in Eq. (6c)-(6e) represents the product of a vector with a 3-D tensor. The optimal perturbation to this approximate Bellman objective admits an analytic form given by

(7)  $\delta u_t^*(\delta x_t) = k_t + K_t \delta x_t, \quad k_t := -Q_{uu}^{-1} Q_u, \quad K_t := -Q_{uu}^{-1} Q_{ux},$

where $k_t$ and $K_t$ respectively denote the open-loop and feedback gains. Note that this policy is only optimal locally around the nominal trajectory, where the second-order approximation remains valid. The backward pass repeats the computation in Eq. (6)-(7) backward in time, with the first and second derivatives of the value function also computed recursively by
(8)  $V_x = Q_x - Q_{ux}^\top Q_{uu}^{-1} Q_u, \quad V_{xx} = Q_{xx} - Q_{ux}^\top Q_{uu}^{-1} Q_{ux}.$
Forward pass: In the forward pass, we simply simulate a new trajectory by applying the feedback policy Eq. (7) computed at each stage. Then, a new Bellman objective is expanded along the updated trajectory, and the whole procedure repeats until the termination condition is met. Under mild assumptions, DDP admits quadratic convergence for systems with smooth dynamics (mayne1973differential) and is numerically superior to Newton's method (liao1992advantages). We summarize the DDP algorithm in Alg. 1.
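The backward/forward structure above can be sketched compactly on a linear-quadratic problem, where the second-order expansion of the Bellman objective is exact and a single DDP sweep recovers the optimal controller. The problem setup and names below are illustrative, not from the paper:

```python
import numpy as np

# One DDP backward/forward sweep on a linear-quadratic problem.
# Dynamics x' = A x + B u; stage cost 0.5 x'Qc x + 0.5 u'R u; terminal 0.5 x'Qf x.
np.random.seed(0)
n, m, T = 3, 2, 20
A = np.eye(n) + 0.1 * np.random.randn(n, n)
B = 0.1 * np.random.randn(n, m)
Qc, R, Qf = np.eye(n), 0.1 * np.eye(m), 10.0 * np.eye(n)

def rollout(x0, us, ks=None, Ks=None, xs_nom=None):
    xs, new_us, x = [x0], [], x0
    for t in range(T):
        u = us[t]
        if ks is not None:                        # apply the feedback policy (7)
            u = us[t] + ks[t] + Ks[t] @ (x - xs_nom[t])
        new_us.append(u)
        x = A @ x + B @ u
        xs.append(x)
    return xs, new_us

x0 = np.ones(n)
us = [np.zeros(m) for _ in range(T)]
xs, _ = rollout(x0, us)                           # nominal trajectory

# Backward pass: Eq. (6)-(8) with linear dynamics (tensor terms vanish).
Vx, Vxx = Qf @ xs[T], Qf
ks, Ks = [None] * T, [None] * T
for t in reversed(range(T)):
    Qx, Qu = Qc @ xs[t] + A.T @ Vx, R @ us[t] + B.T @ Vx
    Qxx, Quu, Qux = Qc + A.T @ Vxx @ A, R + B.T @ Vxx @ B, B.T @ Vxx @ A
    ks[t] = -np.linalg.solve(Quu, Qu)             # open-loop gain
    Ks[t] = -np.linalg.solve(Quu, Qux)            # feedback gain
    Vx = Qx - Qux.T @ np.linalg.solve(Quu, Qu)    # value recursion, Eq. (8)
    Vxx = Qxx - Qux.T @ np.linalg.solve(Quu, Qux)

# Forward pass: simulate with the feedback policy.
xs_new, us_new = rollout(x0, us, ks, Ks, xs)

def total_cost(xs, us):
    c = sum(0.5 * x @ Qc @ x + 0.5 * u @ R @ u for x, u in zip(xs[:-1], us))
    return c + 0.5 * xs[-1] @ Qf @ xs[-1]

print(total_cost(xs, us), total_cost(xs_new, us_new))
```

For nonlinear dynamics the same sweep would be iterated, typically with line search on the open gain, until convergence.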
3 Training DNNs as Trajectory Optimization
3.1 Optimal Control Formulation
Recall that DNNs can be interpreted as dynamical systems in which each layer is viewed as a distinct time step. Consider for instance the mapping in feedforward networks,
(9)  $x_{t+1} = \sigma_t(h_t), \quad h_t = g_t(x_t, u_t),$
where $x_t$ and $x_{t+1}$ represent the activations at layers $t$ and $t+1$, with $h_t$ being the pre-activation. $\sigma_t$ and $g_t$ respectively denote the nonlinear activation function and the affine transform parametrized by the weight $u_t$. Eq. (9) can be seen as a dynamical system propagating the activation vector using $f_t := \sigma_t \circ g_t$. It is natural to ask whether the necessary conditions in the PMP theorem relate to first-order optimization methods in DNN training. This is indeed the case, as pointed out in li2017maximum and liu2019deep.
Lemma 3 ((li2017maximum)). The forward pass of a feedforward network solves the state dynamics Eq. (1), and Backpropagation solves the adjoint dynamics Eq. (2), with the costate $p_t$ given by the gradient of the total loss wrt the activation $x_t$.
Lemma 3 follows by expanding the RHS of Eq. (1) and (2):

$x_{t+1} = \nabla_p H_t = f_t(x_t, u_t), \quad p_t = \nabla_{x_t} H_t = \nabla_{x_t} \ell_t + (\nabla_{x_t} f_t)^\top p_{t+1}.$

Thus, the first two equations in the PMP correspond to the forward propagation and the chain rule used in Backpropagation. Notice that we now have the subscript $t$ in the gradient operator, since the dimensions of $x_t$ and $u_t$ may change with time. When the Hamiltonian is differentiable wrt $u_t$, one can attempt to solve Eq. (3) by iteratively taking gradient descent steps. This leads to the familiar update:

(10)  $u_t^{(k+1)} = u_t^{(k)} - \eta \nabla_{u_t} H_t,$

where $k$ and $\eta$ denote the update iteration and step size.
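The correspondence can be checked numerically. The sketch below (a hypothetical 2-layer tanh network, not the paper's code) runs the costate recursion of Eq. (2) and compares the resulting weight derivative $\nabla_u H$ against a finite-difference estimate of the loss gradient:

```python
import numpy as np

# The costate recursion p_t = f_x^T p_{t+1} is exactly the chain rule of
# Backpropagation, and dH/dW_t recovers the usual weight gradient.
np.random.seed(1)
dims = [4, 3, 2]
Ws = [np.random.randn(dims[i + 1], dims[i]) for i in range(2)]
x0, target = np.random.randn(4), np.random.randn(2)

def forward(Ws):
    xs, hs = [x0], []
    for W in Ws:
        h = W @ xs[-1]            # pre-activation h_t = g_t(x_t, u_t)
        hs.append(h)
        xs.append(np.tanh(h))     # f_t(x_t, u_t) = tanh(W_t x_t)
    return xs, hs

xs, hs = forward(Ws)
p = xs[-1] - target               # p_T = grad of phi = 0.5 ||x_T - target||^2
grads = [None, None]
for t in (1, 0):                  # backward (adjoint) pass
    dh = (1 - np.tanh(hs[t]) ** 2) * p   # through the activation
    grads[t] = np.outer(dh, xs[t])       # dH/dW_t = (df/dW_t)^T p_{t+1}
    p = Ws[t].T @ dh                     # costate dynamics p_t = f_x^T p_{t+1}

# Check against a finite-difference derivative of the loss.
def loss(Ws):
    return 0.5 * np.sum((forward(Ws)[0][-1] - target) ** 2)

eps, W_pert = 1e-6, Ws[0].copy()
W_pert[0, 0] += eps
fd = (loss([W_pert, Ws[1]]) - loss(Ws)) / eps
print(abs(fd - grads[0][0, 0]) < 1e-4)
```

Here there is no intermediate cost, so the Hamiltonian derivative reduces to the Backpropagation gradient; adding a weight-decay term would contribute an extra $\nabla_{u_t}\ell_t$ to each `grads[t]`.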
We now extend Lemma 3 to the batch setting. The following proposition will become useful as we proceed to the algorithm design in the next section.
Proposition 4.
(Informal; see Appendix A for the full version) Let $\bar{x}_t$ denote the augmented state over a batch of size $B$. Consider the new objective given by the average loss over the batch. Then, iteratively solving the "augmented" PMP equations is equivalent to applying minibatch gradient descent with Backpropagation. Specifically, the derivative of the augmented Hamiltonian $\bar{H}_t$ takes the exact form of the minibatch gradient update:

(11)  $u_t^{(k+1)} = u_t^{(k)} - \frac{\eta}{B} \sum_{i=1}^{B} \nabla_{u_t} H_t^{i},$

where $H_t^{i}$ denotes the Hamiltonian evaluated along the trajectory of the $i$-th sample.
Proposition 4 suggests that in the batch setting, we aim to find a single open-loop control that can drive every initial state in the batch to its designated target. Though seemingly trivial, this is a formulation distinct from OCP, since the optimal policy typically varies with the initial state.
It should be stressed that despite the appealing connection between optimal control and DNN training, the two methodologies are not completely interchangeable. We note that the dimension of the activation typically changes across layers to extract effective latent representations. This unique property poses difficulties when one wishes to adopt analysis from the continuous-time framework. Despite the recent attention on treating networks as a discretization of an ODE (lu2017beyond; han2018mean), we note that this formulation restricts the applicability of optimal control theory to networks with residual architectures, as opposed to the generic dynamics discussed here.
3.2 Trajectory Optimization Perspective
In this part we draw a new connection between the training procedure of DNNs and trajectory optimization. Let us first revisit Backpropagation with gradient descent, shown in Alg. 2. During the forward propagation, we treat the weight $u_t$ as the nominal control that simulates the state trajectory $x_t$. Then, the loss gradient is propagated backward through the adjoint dynamics, implicitly moving in the direction suggested by the Hamiltonian. The control update, in this case the first-order derivative, is computed simultaneously and later applied to the system.
It should be clear at this point that Alg. 2 resembles DDP in several ways. Starting from the nominal trajectory, both procedures carry certain information about the objective, either the costate $p_t$ or the value derivatives $(V_x, V_{xx})$, backward in time in order to compute the control update. Since DDP computes the feedback policy at each time step, an additional forward simulation is required in order to apply the update. The computation graphs of the two processes are summarized in Fig. 1. In the following proposition we make this connection formal and provide conditions under which the two algorithms become equivalent.
Proposition 5.
Assume $Q_{ux} = 0$ at all stages; then the dynamics of the value derivative can be described by the adjoint dynamics, i.e. $V_x(x_t) = p_t$. In this case, DDP is equivalent to stagewise Newton by having

(12)  $\delta u_t = -Q_{uu}^{-1} Q_u = -Q_{uu}^{-1} \nabla_{u_t} H_t.$

If furthermore we have $Q_{uu}^{-1} = \eta I$, then DDP degenerates to Backpropagation with gradient descent.
We leave the full proof in Appendix C and provide some intuition for the connection between first-order derivatives. First, notice that we always have $V_x(x_T) = \nabla_x \phi(x_T) = p_T$ at the horizon without any assumption. It can be shown by induction that when $Q_{ux} = 0$ for all the remaining stages, $V_x$ will take the same update equation as $p_t$, i.e. $V_x(x_t) = Q_x = \ell_x + f_x^\top V_x(x_{t+1})$,
and Eq. (12) follows immediately by $Q_u = \ell_u + f_u^\top V_x(x_{t+1}) = \nabla_{u_t} H_t$.
The feedback policy under this assumption degenerates to $\delta u_t^*(\delta x_t) = k_t = -Q_{uu}^{-1} Q_u$,
which is equivalent to the stagewise Newton method (de1988differential). It can readily be seen that we recover the gradient descent update by having an isotropic inverse covariance, $Q_{uu}^{-1} = \eta I$.
[Table 1: Existing optimizers recovered as special cases of DDP, ranging from first-order and adaptive first-order methods to stagewise Newton, depending on the approximations applied to $Q_{uu}$ and $Q_{ux}$.]
Proposition 5 states that the backward pass in DDP collapses to Backpropagation when $Q_{ux}$ vanishes. To better explain the role of $Q_{ux}$ during optimization, consider the illustrative 2D example in Fig. 1. Given an arbitrary objective expanded at $(\bar{x}_t, \bar{u}_t)$, standard second-order methods compute the Hessian wrt $u$ and apply the resulting Newton update. DDP differs from them in that it also computes the mixed partial derivatives, i.e. $Q_{ux}$. The resulting law has the same open-loop update but with an additional term linear in the state. This feedback term, shown as the red arrow, compensates when the state moves away from $\bar{x}_t$ during the update. In Sec. 5.2, we conduct an ablation study and discuss the effect of this deviation on DNN training. The connection between DDP and existing methods is summarized in Table 1.
4 DDP Optimizer for Feedforward Networks
In this section we discuss a practical implementation of DDP for training feedforward networks. Due to space constraints, we leave the complete derivation to Appendix B.
4.1 Batch DDP Formulation
We first derive the differential Bellman objective when feedforward networks are used as the dynamics. Recall the propagation formula in Eq. (9) and substitute this new dynamics into OCP by setting $f_t = \sigma_t \circ g_t$. After some algebra, we can show that^{2}^{2} We omit the subscript $t$ again for simplicity.
(13) 
where the derivatives of the value function are taken wrt the pre-activation $h_t$. Computing the layerwise feedback policy and the value function remains the same as in Eq. (7) and (8).
The computational overhead in Eq. (13) can be mitigated by leveraging the structure of feedforward networks. Since the affine transform $g_t$ is bilinear in $x_t$ and $u_t$, its second derivatives wrt each single argument vanish, and the remaining mixed-derivative tensor admits a sparse structure. For fully-connected layers, the computation can be simplified to
(14) 
For the coordinate-wise nonlinear transform $\sigma_t$, the first and second derivatives are a diagonal matrix and a diagonal tensor. In most learning instances, stagewise losses involve weight decay alone; thus the state-dependent cost terms also vanish.
Note that Eq. (13) describes the expansion along a single trajectory with the dynamics given by the feedforward network. To adapt it to the setting of batch trajectory optimization, we augment the activation space over the batch of size $B$, in the spirit of Proposition 4. It should be stressed, however, that despite drawing inspiration from the augmented Hamiltonian framework, the resulting representation does not admit a clean form such as the average over individual updates in Eq. (11). To see why this is the case, observe that at the horizon $T$, the derivative of the augmented value function can indeed be expressed through the individual per-sample derivatives. Such an independent structure is not preserved through the backward pass, since for $t < T$ the augmented value function instead takes a form
in which the second-order expansion mixes quantities averaged over the batch. The intuition is that when optimizing batch trajectories with the same control law, the Bellman objective of each sample couples with the others through the second-order expansion of the augmented dynamics. Consequently, the value function will no longer be independent across samples soon after leaving the terminal horizon. We highlight this trait, which distinguishes batch DDP from its original representation.
[Table 2: Training loss and test accuracy (%) on WINE, IRIS, DIGITS, MNIST, and Fashion-MNIST for SGD, RMSProp, KFAC, EKFAC, KWNG, and the DDP variants; the top two results on each dataset are highlighted. Vanilla DDP is N/A on MNIST and Fashion-MNIST.]
4.2 Practical Implementation
Directly applying batch DDP to optimize DNNs can be impractical. For one, training inherits stochasticity due to minibatch sampling. This differs from standard trajectory optimization, where the initial state remains unaltered throughout optimization. Secondly, the control dimension is typically orders of magnitude higher than that of the state; thus inverting the Hessian $Q_{uu}$ soon becomes computationally infeasible. In this part we discuss several practical techniques, adopted from both literatures, that stabilize the optimization process and enable real-time training.
First, we apply Tikhonov regularization on $Q_{uu}$ and line search, since both play key roles in the convergence of DDP (liao1991convergence). We note that these regularizations have already shown up in robustifying DNN training, e.g. in vaswani2019painless. From the perspective of trajectory optimization, we should emphasize that without regularization, $Q_{uu}$ loses its positive definiteness whenever $f_u$ has low rank. This is indeed the case in feedforward networks. The observation on low-rank matrices also applies when the dimension of the activation reduces during forward propagation; thus we also apply Tikhonov regularization to $V_{xx}$. This can be seen as placing a quadratic state cost and has been shown to improve robustness on optimizing complex humanoids (tassa2012synthesis). Next, when using DDP for trajectory optimization, one typically has the option of expanding the dynamics up to first or second order. Note that both are still considered second-order methods and generate layerwise feedback policies, except that the former simplifies the problem in a manner similar to Gauss-Newton. The computational benefit and stability obtained by keeping only the linearized dynamics have been discussed thoroughly in the robotics literature (todorov2005generalized; tassa2012synthesis). Hereafter we refer to this version as the DDP optimizer.
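A minimal sketch of the damped gain computation (function and variable names are illustrative): adding $\mu I$ keeps $Q_{uu}$ invertible and positive definite when $f_u$ is low-rank, at the cost of a more conservative step, similar in spirit to trust-region damping.

```python
import numpy as np

# Tikhonov-regularized DDP gains: solve with (Quu + mu * I) instead of Quu.
def regularized_gains(Quu, Qu, Qux, mu=1e-3):
    Quu_reg = Quu + mu * np.eye(Quu.shape[0])
    k = -np.linalg.solve(Quu_reg, Qu)     # open-loop gain
    K = -np.linalg.solve(Quu_reg, Qux)    # feedback gain
    return k, K

# A rank-1 Quu (as arises from a low-rank f_u) is singular without damping.
Quu = np.outer(np.ones(3), np.ones(3))
k, K = regularized_gains(Quu, np.ones(3), np.eye(3), mu=1e-2)
print(np.all(np.isfinite(k)))
```

In practice $\mu$ is typically adapted across iterations together with the line search, increasing when the forward pass fails to reduce the cost.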
Tremendous effort has been spent recently on efficient curvature estimation of the loss landscape during training. This is particularly crucial in enabling the applicability of second-order methods, since inverting the full Hessian, even in a layerwise fashion, can be computationally intractable, and DDP is no exception. Fortunately, DDP can be integrated with these advances. We first notice that many popular curvature-factorization methods contain modules that collect certain statistics during training to compute the preconditioned update. Take EKFAC (george2018fast) for instance: the update for each layer is approximated by an amortized matrix-vector product over the statistics collected up to the most recent iteration. The module bypasses the computation of the Hessian and its inverse by directly estimating the matrix-vector product. We can integrate it with DDP by

(15)
and compute the feedback gain by applying Eq. (15) columnwise. Notice that we replace the loss gradient with the corresponding derivative of the value function, since here we aim to approximate the landscape of the value function. Hereafter we name this approximation DDP-EKFAC.
Integrating DDP with existing optimizers is not restricted to second-order methods. Recall the connection we made in Proposition 5: when $Q_{uu}$ is isotropic, DDP inherits the same structure as gradient descent. In a similar vein, adaptive first-order methods, such as RMSprop (hinton2012neural), approximate $Q_{uu}$ by a diagonal matrix adapted to the running statistics of the gradient, i.e. the diagonal of the inverse covariance. We refer to these two variants as DDP-SGD and DDP-RMSprop.
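To illustrate, a hedged sketch of the DDP-SGD update described above (names are illustrative, not from the paper's code): with the isotropic approximation $Q_{uu} \approx \frac{1}{\eta} I$, the open gain $k = -Q_{uu}^{-1} Q_u$ reduces to the plain SGD step $-\eta Q_u$, while the feedback gain $K = -Q_{uu}^{-1} Q_{ux}$ adds a correction linear in the state deviation observed during the new forward pass.

```python
import numpy as np

# DDP-SGD sketch: open-loop term is the ordinary SGD step; the feedback
# term compensates the state drift dx accumulated during the forward pass.
def ddp_sgd_step(u, Qu, Qux, dx, lr=0.1):
    # u: flattened layer weights; Qu: gradient wrt u; Qux: (dim u, dim x).
    k = -lr * Qu            # open-loop gain: plain SGD update
    K = -lr * Qux           # feedback gain: linear-in-state correction
    return u + k + K @ dx

rng = np.random.default_rng(0)
u, Qu, Qux = np.zeros(6), np.ones(6), rng.normal(size=(6, 3))

# With no state deviation (dx = 0) the update is exactly SGD.
print(np.allclose(ddp_sgd_step(u, Qu, Qux, np.zeros(3)), u - 0.1 * Qu))  # True
```

The same wrapper pattern applies to DDP-RMSprop, with `lr` replaced by the per-coordinate adaptive step.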
5 Experiments
5.1 Batch Trajectory Optimization on Synthetic Data
We first demonstrate the effectiveness of our DDP optimizer in batch trajectory optimization. Recall from the derivation in Sec. 2.2 that the feedback policy is optimal only locally around the region where the Bellman objective is expanded; thus only samples traveling through this region benefit. Since conceptually a DNN classifier can be thought of as a dynamical system guiding each group of trajectories toward the region belonging to its class, we hypothesize that for the DDP optimizer to show its effectiveness on batch training, the feedback policy must act as an ensemble capturing the feedback behavior of each class. To this end, we randomly generate synthetic data points from Gaussian clusters in the input space and train a 3-layer feedforward network from scratch using the DDP optimizer. For all settings, the training error drops to nearly zero. The spectrum of the feedback policy in the prediction layer, sorted in descending order, is shown in Fig. 2. The result shows that the number of nontrivial eigenvalues in the feedback gain matches exactly the number of classes in each problem (indicated by the vertical dashed line). As the state distribution concentrates into clusters through training, the eigenvalues also increase, providing stronger feedback directions for the weight update. This consolidates our batch DDP formulation in Sec. 4.1.

5.2 Performance on Classification Datasets
Next, we validate the performance of the DDP optimizer, along with the variants proposed in the previous section, on classification tasks. In addition to baselines such as SGD and RMSprop (hinton2012neural), we compare with state-of-the-art second-order optimizers, including KFAC (martens2015optimizing), EKFAC (george2018fast), and KWNG (arbel2020wng). We use fully-connected feedforward networks for all experiments. The complete experiment setup and additional results (e.g. accuracy curves) are provided in Appendix D. Table 2 summarizes the performance with the top two results highlighted. Without any factorization, we are not able to obtain results for DDP on (Fashion-)MNIST due to the prohibitive computation when inverting $Q_{uu}$.
On all datasets, DDP achieves better or comparable results against existing methods. Notably, when comparing the vanilla DDP with its approximate variants (see Fig. 3), the latter typically admit smaller variance and sometimes converge to better local minima. The instability of the former may be caused by overestimation of the value Hessian when using only minibatch data, which is mitigated by the amortized method used in EKFAC or the diagonal approximation in first-order methods. This sheds light on the benefit gained by bridging two seemingly disconnected methodologies.
5.3 Effect of Feedback Policies
Sensitivity to Hyperparameters: For optimizers that are compatible with DDP, we observe minor improvements in the final training loss. To better understand the effect of the feedback mechanism, we conduct an ablation analysis comparing training performance across different learning-rate setups, while keeping all other hyperparameters the same. The result on DIGITS is shown in Fig. 4, where the left and right columns correspond to the best-tuned and perturbed learning rates. While all optimizers converge to low training error when best-tuned, the ones integrated with DDP tend to stabilize training when the learning rate changes by some fraction; sometimes they also achieve lower final training error. The performance improvement induced by the feedback policy typically becomes more pronounced as the learning rate increases.
Variance Reduction and Faster Convergence: Table 3 shows the differences in the variance of training loss and test accuracy after the optimizers are integrated with DDP feedback modules. Specifically, we report the variance change relative to each base optimizer compatible with DDP, including SGD, RMSprop, and EKFAC. We keep all hyperparameters the same for each experiment so that the performance difference comes solely from the presence of feedback policies. As shown in Table 3, in most cases the additional updates from DDP stabilize the training dynamics by reducing their variance, and a similar reduction can be observed in the variance of test accuracy over optimization, sometimes by a large margin. Empirically, we find that such stabilization may also lead to faster convergence. In Fig. 6, we report the training dynamics of the top three experiment instances in Table 3 that exhibit large variance reduction when using DDP. In all cases, the optimization paths also admit faster convergence.
To give an explanation from the trajectory-optimization viewpoint, recall that DDP computes layerwise policies, each composed of two terms. The open-loop control $k_t$ is state-independent and, in this setup, equivalent to the original update computed by each optimizer. The feedback term is linear in the state differential $\delta x_t$, which captures the deviation accumulated once the new controls are applied up to layer $t$; we should expect it to grow when a larger step size, and thus a larger difference between controls, is taken at each layer. This implies that the control update at later layers may not be optimal (at least from the trajectory-optimization viewpoint), since the gradient is evaluated at the nominal state rather than the updated one. The difference between the two cannot be neglected, especially during early training, when the objective landscape contains nontrivial curvature everywhere (alain2019negative). The feedback direction is designed to mitigate this gap. In fact, we can show that the feedback term approximately solves
(16) 
with the derivation provided in Appendix C. Thus, by having this additional term throughout the updates, we stabilize the training dynamics by traveling through the landscape cautiously, without going off cliffs.
5.4 Remarks on Computation Overhead
Lastly, Fig. 5 summarizes the computational overhead of the DDP backward pass on DIGITS. Specifically, we compare the wall-clock processing time of DDP-EKFAC and DDP to the computation spent on the curvature-approximation module in the DDP-EKFAC optimizer. The ratio thus suggests the additional overhead required for EKFAC to adopt the DDP framework, which includes the second-order expansion of the value function, computing layerwise feedback policies, etc. While vanilla batch DDP scales poorly with the dimension of the hidden state under the current batch formulation, the overhead in DDP-EKFAC increases only by a constant factor across different architecture setups, such as batch size, hidden dimension, and network depth. We stress that our current implementation is not fully optimized, so there is still room for further acceleration.
6 Conclusion
In this work, we introduce the Differential Dynamic Programming Neural Optimizer (DDP), a new class of algorithms arising from bridging DNN training with optimal control theory and trajectory optimization. This new perspective suggests that existing methods stand as special cases of DDP and can be extended to adopt the framework painlessly. The resulting optimizer features layerwise feedback policies, which help reduce sensitivity to hyperparameters and sometimes improve convergence. We hope this work provides new algorithmic insight and bridges the deep learning and optimal control communities. There are numerous future directions left to explore. For one, deriving update laws for other architectures (e.g. convolution and batch normalization) and applying factorization to other gigantic matrices are essential to the broader applicability of the DDP optimizer.
References
Appendix A Derivation of Batch PMP
a.1 Problem Formulation and Notation
Recall the original OCP for single-trajectory optimization. In its batch setting, we consider the following state-augmented optimal control problem:
(17)  $\min_{\{u_t\}_{t=0}^{T-1}} \; \bar{\phi}(\bar{x}_T) + \sum_{t=0}^{T-1} \bar{\ell}_t(\bar{x}_t, u_t), \quad \text{s.t.} \quad \bar{x}_{t+1} = \bar{f}_t(\bar{x}_t, u_t),$
where $\bar{x}_t$ denotes the state-augmented vector stacking the per-sample states $x_t^{i}$, and $B$ denotes the batch size. $\bar{\ell}_t$ and $\bar{\phi}$ respectively represent the intermediate and terminal costs averaged over the batch. $\bar{f}_t$ consists of $B$ independent mappings given by the per-sample dynamics $f_t$. Consequently, its derivatives can be related to those for each sample by
(18)  