In this work, we consider the following optimal control problem (OCP) in the discrete-time setting:

$$\min_{\bar{u}} J(\bar{u}; x_0) := \phi(x_T) + \sum_{t=0}^{T-1} \ell_t(x_t, u_t) \quad \text{s.t.} \quad x_{t+1} = f_t(x_t, u_t), \tag{OCP}$$

where $x_t \in \mathbb{R}^n$ and $u_t \in \mathbb{R}^m$ represent the state and control at each time step $t$. $f_t$, $\ell_t$, and $\phi$ respectively denote the nonlinear dynamics, intermediate cost, and terminal cost functions. OCP aims to find a control trajectory, $\bar{u} := \{u_t\}_{t=0}^{T-1}$, such that the accumulated cost $J$ over the finite horizon $t \in \{0, \dots, T\}$ is minimized. Problems of the form of OCP appear in many disciplines since they describe a generic multi-stage decision-making problem, and they have recently gained commensurate interest in deep learning.
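To make the objective concrete, the following minimal sketch evaluates the accumulated cost of a given control trajectory by rolling out the dynamics. The function name `rollout_cost` and the scalar instance (tanh dynamics, quadratic control penalty, squared terminal error) are illustrative assumptions, not taken from the paper.

```python
import math

def rollout_cost(x0, controls, f, ell, phi):
    """Return J = phi(x_T) + sum_t ell(x_t, u_t, t) along x_{t+1} = f(x_t, u_t, t)."""
    x, total = x0, 0.0
    for t, u in enumerate(controls):
        total += ell(x, u, t)   # intermediate cost at stage t
        x = f(x, u, t)          # propagate the state one step
    return total + phi(x)       # add the terminal cost

# Hypothetical scalar instance (an assumption for illustration only).
f = lambda x, u, t: math.tanh(u * x)
ell = lambda x, u, t: 0.005 * u * u
phi = lambda x: 0.5 * (x - 0.5) ** 2

J_nominal = rollout_cost(1.0, [0.3, 0.3, 0.3], f, ell, phi)
J_zero = rollout_cost(1.0, [0.0, 0.0, 0.0], f, ell, phi)
```

With all controls set to zero the state collapses to zero after one step, so only the terminal cost `phi(0)` remains.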
Central to the research along this line is the interpretation of DNNs as discrete-time nonlinear dynamical systems, in which each layer is viewed as a distinct time step (weinan2017proposal). The dynamical system perspective provides a mathematically sound explanation for the success of certain DNN architectures (lu2019understanding). It also enables principled architecture design by bringing in rich analysis from numerical differential equations (lu2017beyond; long2017pde) and discrete mechanics (greydanus2019hamiltonian; zhong2019symplectic) when learning problems inherit physical structures. In the continuum limit of depth, chen2018neural parametrized an ordinary differential equation (ODE) directly using DNNs, with liu2019neural later extending the framework to accept stochastic dynamics.
From the optimization viewpoint, iterative algorithms for solving OCP have been shown to admit a control interpretation (hu2017control). When the network weight is recast as the control variable, OCP describes, without loss of generality, the training objective composed of a layer-wise loss (e.g. weight regularization) and a terminal loss (e.g. cross-entropy). This connection has been mathematically explored in many recent works (han2018mean; seidman2020robust; liu2019deep). Although these works provide theoretical statements for convergence and generalization, the algorithmic development remains relatively limited. Specifically, previous works primarily focus on applying first-order optimality conditions, provided by the Pontryagin principle (boltyanskii1960theory), to architectures restricted to residual blocks or discrete weights (li2018optimal; li2017maximum).
In this work, we take a parallel path from Approximate Dynamic Programming (ADP, bertsekas1995dynamic), a technique designed specifically to solve complex Markovian decision processes. For this kind of problem, ADP has been shown to be numerically superior to direct optimization such as Newton's method, since it takes into account the temporal structure inherent in OCP (liao1992advantages). The resulting update features a locally optimal feedback policy at each stage (see Fig. 1), in contrast to the update derived from Pontryagin's principle. In the application to DNNs, we will show through experiments that these policies help improve training convergence and reduce sensitivity to hyper-parameter changes.
Of particular interest to us among practical ADP algorithms is Differential Dynamic Programming (DDP, jacobson1970differential). DDP is a second-order method that has been used extensively for complex trajectory optimization in robotics (tassa2014control; posa2016optimization), and in this work we further show that existing first- and second-order methods for training DNNs can be derived from DDP as special cases (see Table 1). Such an intriguing connection can be beneficial to both sides. On one hand, we can leverage recent advances in efficient curvature factorization of the loss Hessian (martens2015optimizing; george2018fast) to relieve the computational burden in DDP. On the other hand, computing feedback policies stands as an independent module; thus it can be integrated into existing first- and second-order methods.
The concept of a feedback mechanism has already shown up in the study of network design, where the terminology typically refers to connections between modules over training (chung2015gated; wen2018deep) or successive prediction for vision applications (zamir2017feedback; li2019feedback). Conceptually, shama2019adversarial and huh2019feedback are perhaps most related to our work; there the authors proposed to reuse the discriminator from a Generative Adversarial Network as a feedback module during training or test time. We note, however, that neither the problem formulation nor the mapping space of the feedback module is the same as ours. Our feedback policy originates from optimal control theory, in which the control update needs to compensate for state disturbances throughout propagation.
The paper is organized as follows. In section 2 we go over background on the optimality conditions to OCP and review the DDP algorithm. The connection between DNN training and trajectory optimization is solidified in section 3, with a practical algorithm demonstrated in section 4. We provide empirical results and future directions in sections 5 and 6.
2.1 Optimality Conditions to OCP
The development of the optimality conditions to OCP dates back to the 1960s, characterized by Pontryagin's Maximum Principle (PMP) and Dynamic Programming (DP). We detail the two approaches below.
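Theorem 1 (Discrete-time PMP (todorov2006optimal)). Let $\bar{u}^*$ be the optimal control trajectory for OCP and $\bar{x}^*$ be the corresponding state trajectory. Then there exists a co-state trajectory $\bar{p}^* := \{p_t^*\}_{t=1}^{T}$ such that

$$x_{t+1}^* = \nabla_p H_t(x_t^*, p_{t+1}^*, u_t^*), \quad x_0^* = x_0, \tag{1}$$

$$p_t^* = \nabla_x H_t(x_t^*, p_{t+1}^*, u_t^*), \quad p_T^* = \nabla_x \phi(x_T^*), \tag{2}$$

$$u_t^* = \arg\min_{v \in \mathbb{R}^m} H_t(x_t^*, p_{t+1}^*, v), \tag{3}$$

where $H_t(x_t, p_{t+1}, u_t) := \ell_t(x_t, u_t) + p_{t+1}^\top f_t(x_t, u_t)$ is the discrete-time Hamiltonian, and Eq. (2) is called the adjoint equation.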
The discrete-time PMP theorem can be derived using KKT conditions, in which the co-state is equivalent to the Lagrange multiplier. As we will see in section 3.1, the adjoint dynamics Eq. (2) has a direct link to Back-propagation. Note that the solution to Eq. (3) admits an open-loop process in the sense that it does not depend on state variables. This is in contrast to the Dynamic Programming (DP) principle, in which a feedback policy is considered.
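The KKT argument can be sketched in a few lines. The following is a standard derivation written under the assumption that $f_t$, $\ell_t$ and $\phi$ are differentiable, with the Hamiltonian taken as $H_t(x_t, p_{t+1}, u_t) := \ell_t(x_t, u_t) + p_{t+1}^\top f_t(x_t, u_t)$; it is meant as an illustration rather than the proof given in the cited reference. Attaching a multiplier $p_{t+1}$ to each dynamics constraint gives the Lagrangian

$$\mathcal{L} = \phi(x_T) + \sum_{t=0}^{T-1} \Big[ \ell_t(x_t, u_t) + p_{t+1}^\top \big( f_t(x_t, u_t) - x_{t+1} \big) \Big].$$

Setting its partial derivatives to zero yields

$$\frac{\partial \mathcal{L}}{\partial p_{t+1}} = 0 \;\Rightarrow\; x_{t+1} = f_t(x_t, u_t), \qquad \frac{\partial \mathcal{L}}{\partial x_T} = 0 \;\Rightarrow\; p_T = \nabla_x \phi(x_T),$$

$$\frac{\partial \mathcal{L}}{\partial x_t} = 0 \;\Rightarrow\; p_t = \nabla_x \ell_t + \Big(\frac{\partial f_t}{\partial x_t}\Big)^{\!\top} p_{t+1} = \nabla_x H_t, \qquad \frac{\partial \mathcal{L}}{\partial u_t} = 0 \;\Rightarrow\; \nabla_u \ell_t + \Big(\frac{\partial f_t}{\partial u_t}\Big)^{\!\top} p_{t+1} = \nabla_u H_t = 0.$$

The last condition is the first-order stationarity condition of minimizing the Hamiltonian over the control, and the condition on $x_t$ is the backward recursion for the co-state.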
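Theorem 2 (DP (bellman1954theory)). Define a value function $V_t: \mathbb{R}^n \to \mathbb{R}$ at each time step, computed backward in time using the Bellman equation

$$V_t(x_t) = \min_{u_t} \; \ell_t(x_t, u_t) + V_{t+1}(f_t(x_t, u_t)), \quad V_T(x_T) = \phi(x_T). \tag{4}$$

Then $V_0(x_0)$ is the optimal objective value of OCP.

Hereafter we denote the objective involved at each decision stage in Eq. (4) as the Bellman objective $Q_t$:

$$Q_t(x_t, u_t) := \ell_t(x_t, u_t) + V_{t+1}(f_t(x_t, u_t)).$$

The principle of DP recasts the problem of minimizing over a control sequence as a sequence of minimizations over each control. The value function summarizes the optimal cost-to-go at each stage, provided that the subsequent controls are also optimal. The minimizer $u_t^*(x_t)$ of Eq. (4), viewed as a function of the state, is an optimal feedback law in a globally convergent closed-loop system.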
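The backward recursion can be illustrated on a problem small enough to enumerate. The sketch below is a toy instance (five integer states, three controls, horizon three, all chosen here for illustration and not taken from the paper): it runs the Bellman recursion, stores the state-dependent minimizer at each stage, and compares the result with a brute-force search over all control sequences.

```python
from itertools import product

# Toy problem (assumed for illustration): integer states, clamped additive dynamics.
STATES = range(5)
CONTROLS = (-1, 0, 1)
T = 3

def f(x, u):
    return min(max(x + u, 0), 4)      # dynamics, clamped to the state range

def ell(x, u):
    return abs(u)                     # intermediate cost

def phi(x):
    return 2 * (x - 3) ** 2           # terminal cost

def bellman_backward():
    """Return V_0 and the list of feedback policies mu_0, ..., mu_{T-1}."""
    V = {x: phi(x) for x in STATES}   # V_T = phi
    policy = []
    for _ in range(T):
        # Bellman objective Q(x, u) = ell(x, u) + V_next(f(x, u)).
        Q = {x: {u: ell(x, u) + V[f(x, u)] for u in CONTROLS} for x in STATES}
        mu = {x: min(CONTROLS, key=lambda u: Q[x][u]) for x in STATES}
        V = {x: Q[x][mu[x]] for x in STATES}
        policy.insert(0, mu)
    return V, policy

def brute_force(x0):
    """Minimize the accumulated cost by enumerating every control sequence."""
    best = None
    for us in product(CONTROLS, repeat=T):
        x, J = x0, 0
        for u in us:
            J += ell(x, u)
            x = f(x, u)
        J += phi(x)
        best = J if best is None else min(best, J)
    return best

V0, policy = bellman_backward()
```

The brute-force search visits every control sequence, whereas the recursion only evaluates each state-control pair once per stage; the stored `policy` is a feedback law because it maps each state to a control.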
2.2 Differential Dynamic Programming
Trajectory optimization algorithms typically rely on solving the optimality equations from PMP or DP. Unfortunately, solving the Bellman equation in high-dimensional problems is infeasible without approximation, a difficulty known as Bellman's curse of dimensionality. To mitigate the computational burden of the minimization involved at each stage, one can replace the Bellman objective in Eq. (4) with its second-order approximation. Such an approximation is central to Differential Dynamic Programming (DDP), a second-order method that inherits a similar Bellman optimality structure while being computationally efficient.
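Given a nominal trajectory $(\bar{x}, \bar{u})$, DDP iteratively optimizes its accumulated cost, where each iteration consists of a backward and a forward pass, detailed below.

Backward pass: At each time step, DDP expands $Q_t$ at $(x_t, u_t)$ up to second order, i.e. $Q_t(x_t + \delta x_t, u_t + \delta u_t) \approx Q_t(x_t, u_t) + \delta Q_t(\delta x_t, \delta u_t)$, where $\delta Q_t$ is expanded as (we drop the time step in all derivatives of $Q$ for simplicity)

$$\delta Q_t(\delta x_t, \delta u_t) = \frac{1}{2} \begin{bmatrix} 1 \\ \delta x_t \\ \delta u_t \end{bmatrix}^{\top} \begin{bmatrix} 0 & Q_x^\top & Q_u^\top \\ Q_x & Q_{xx} & Q_{xu} \\ Q_u & Q_{ux} & Q_{uu} \end{bmatrix} \begin{bmatrix} 1 \\ \delta x_t \\ \delta u_t \end{bmatrix},$$

with each term given by

$$\begin{aligned} Q_x &= \ell_x + f_x^\top V_x', & Q_u &= \ell_u + f_u^\top V_x', \\ Q_{xx} &= \ell_{xx} + f_x^\top V_{xx}' f_x + V_x' \cdot f_{xx}, & Q_{uu} &= \ell_{uu} + f_u^\top V_{xx}' f_u + V_x' \cdot f_{uu}, \\ Q_{ux} &= \ell_{ux} + f_u^\top V_{xx}' f_x + V_x' \cdot f_{ux}. \end{aligned} \tag{6}$$

Here $V_x'$ and $V_{xx}'$ denote the derivatives of the value function at the next time step $t+1$, and the dot notation represents the product of a vector with a 3D tensor. The optimal perturbation to this approximate Bellman objective admits an analytic form given by

$$\delta u_t^*(\delta x_t) = k_t + K_t \delta x_t, \tag{7}$$

where $k_t := -Q_{uu}^{-1} Q_u$ and $K_t := -Q_{uu}^{-1} Q_{ux}$ respectively denote the open-loop and feedback gains. Note that this policy is only optimal locally around the nominal trajectory where the second-order approximation remains valid. The backward pass repeats the computation of Eq. (6) - (7) backward in time, with the first and second derivatives of the value function also computed recursively by

$$V_x = Q_x - Q_{ux}^\top Q_{uu}^{-1} Q_u, \qquad V_{xx} = Q_{xx} - Q_{ux}^\top Q_{uu}^{-1} Q_{ux}. \tag{8}$$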
Forward pass: In the forward pass, we simply simulate a trajectory by applying the feedback policy computed at each stage. Then, a new Bellman objective is expanded along the updated trajectory, and the whole procedure repeats until the termination condition is met. Under mild assumptions, DDP admits quadratic convergence for systems with smooth dynamics (mayne1973differential) and is numerically superior to Newton’s method (liao1992advantages). We summarize the DDP algorithm in Alg. 1.
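The two passes can be sketched on a scalar problem. The example below is an illustration under simplifying assumptions that are not the paper's setting: linear dynamics $x_{t+1} = a x_t + b u_t$, stage cost $\tfrac{1}{2} r u_t^2$, and terminal cost $\tfrac{1}{2}(x_T - \text{target})^2$, so that all second-order dynamics terms vanish and the quadratic expansion is exact. In this special case a single iteration should already reach the optimum, which gives a simple sanity check. The names `ddp_iteration` and `cost` are hypothetical.

```python
def ddp_iteration(x0, us, a, b, r, target):
    """One DDP iteration for x_{t+1} = a*x + b*u with quadratic costs (scalar case)."""
    # Nominal rollout.
    xs = [x0]
    for u in us:
        xs.append(a * xs[-1] + b * u)

    # Backward pass: propagate V_x, V_xx and store the gains (k_t, K_t).
    V_x, V_xx = xs[-1] - target, 1.0
    gains = []
    for t in reversed(range(len(us))):
        Q_x = a * V_x                   # stage cost has no state dependence
        Q_u = r * us[t] + b * V_x
        Q_xx = a * V_xx * a
        Q_uu = r + b * V_xx * b
        Q_ux = b * V_xx * a
        gains.append((-Q_u / Q_uu, -Q_ux / Q_uu))
        V_x = Q_x - Q_ux * Q_u / Q_uu
        V_xx = Q_xx - Q_ux * Q_ux / Q_uu
    gains.reverse()

    # Forward pass: apply u_t + k_t + K_t * dx_t along the updated trajectory.
    x, new_us = x0, []
    for t, (k, K) in enumerate(gains):
        u = us[t] + k + K * (x - xs[t])
        new_us.append(u)
        x = a * x + b * u
    return new_us

def cost(x0, us, a, b, r, target):
    """Accumulated cost of a control sequence."""
    x, J = x0, 0.0
    for u in us:
        J += 0.5 * r * u * u
        x = a * x + b * u
    return J + 0.5 * (x - target) ** 2

params = dict(a=0.9, b=0.5, r=0.1, target=1.0)
u0 = [0.0] * 5
u1 = ddp_iteration(0.0, u0, **params)
u2 = ddp_iteration(0.0, u1, **params)
J0, J1, J2 = (cost(0.0, u, **params) for u in (u0, u1, u2))
```

For nonlinear dynamics the expansion is only locally valid, so several iterations, usually combined with regularization and line search, would be needed.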
3 Training DNNs as Trajectory Optimization
3.1 Optimal Control Formulation
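Recall that DNNs can be interpreted as dynamical systems in which each layer is viewed as a distinct time step. Consider for instance the mapping in feedforward networks,

$$x_{t+1} = \sigma_t(h_t), \quad h_t = g_t(x_t, u_t) = W_t x_t + b_t. \tag{9}$$

$x_t \in \mathbb{R}^{n_t}$ and $x_{t+1} \in \mathbb{R}^{n_{t+1}}$ represent the activation at layer $t$ and $t+1$, with $h_t \in \mathbb{R}^{n_{t+1}}$ being the pre-activation. $\sigma_t$ and $g_t$ respectively denote the nonlinear activation function and the affine transform parametrized by the weight $u_t := [\mathrm{vec}(W_t), b_t]^\top$. Eq. (9) can be seen as a dynamical system propagating the activation vector $x_t$ using the weight $u_t$.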
It is natural to ask whether the necessary condition in the PMP theorem relates to first-order optimization methods in DNN training. This is indeed the case as pointed out in li2017maximum and liu2019deep.
Lemma 3 ((li2017maximum)). Back-propagation satisfies Eq. (2), and gradient descent iteratively solves Eq. (3).
Thus, the first two equations in PMP correspond to the forward propagation and the chain rule used for Back-propagation. Notice that we now write the gradient with the subscript $t$, i.e. $\nabla_{x_t}$, since the dimensions of $x_t$ and $p_t$ may change with time. When the Hamiltonian is differentiable wrt $u_t$, one can attempt to solve Eq. (3) by iteratively taking gradient descent steps. This leads to the familiar update:

$$u_t^{(k+1)} = u_t^{(k)} - \eta \nabla_{u_t} H_t(x_t, p_{t+1}, u_t^{(k)}), \tag{10}$$

where $k$ and $\eta$ denote the update iteration and step size.
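The correspondence between the adjoint recursion and Back-propagation can be checked numerically. The sketch below assumes a hypothetical scalar network $x_{t+1} = \tanh(w_t x_t)$ with squared terminal loss and weight decay as the layer-wise loss (choices made here for illustration only). The co-state is propagated backward, the derivative of the Hamiltonian wrt each weight is formed, and one gradient step is taken.

```python
import math

def forward(x0, ws):
    """Forward propagation: returns the activations x_0, ..., x_T."""
    xs = [x0]
    for w in ws:
        xs.append(math.tanh(w * xs[-1]))
    return xs

def loss(x0, ws, y, lam):
    """Terminal squared error plus layer-wise weight decay."""
    x_T = forward(x0, ws)[-1]
    return 0.5 * (x_T - y) ** 2 + sum(0.5 * lam * w * w for w in ws)

def hamiltonian_grads(x0, ws, y, lam):
    """Derivative of H_t wrt w_t for every layer, using the adjoint recursion."""
    xs = forward(x0, ws)
    p = xs[-1] - y                                   # p_T = gradient of the terminal cost
    grads = [0.0] * len(ws)
    for t in reversed(range(len(ws))):
        dact = 1.0 - xs[t + 1] ** 2                  # tanh'(w_t * x_t)
        grads[t] = lam * ws[t] + dact * xs[t] * p    # l_u + f_u * p_{t+1}
        p = dact * ws[t] * p                         # adjoint step: p_t = f_x * p_{t+1}
    return grads

x0, y, lam, eta = 1.0, 0.2, 1e-3, 0.1
ws = [0.5, -0.3, 0.8]
grads = hamiltonian_grads(x0, ws, y, lam)
ws_new = [w - eta * g for w, g in zip(ws, grads)]    # one gradient descent step
```

Comparing `grads` against finite differences of `loss` is a quick way to confirm that the backward recursion reproduces the loss gradient.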
We now extend Lemma 3 to accept the batch setting. The following proposition will become useful as we proceed to the algorithm design in the next section.
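Proposition 4 (Informal; see Appendix A for full version). Let $\mathbf{X}_t := [x_t^{(1)}, \dots, x_t^{(B)}]^\top$ denote the augmented state with the batch size $B$. Consider the new objective $J(\bar{u}; \mathbf{X}_0) := \frac{1}{B} \sum_{i=1}^{B} J(\bar{u}; x_0^{(i)})$. Then, iteratively solving the “augmented” PMP equations is equivalent to applying mini-batch gradient descent with Back-propagation. Specifically, the derivative of the augmented Hamiltonian takes the exact form of the mini-batch gradient update:

$$\nabla_{u_t} H_t(\mathbf{X}_t, \mathbf{P}_{t+1}, u_t) = \frac{1}{B} \sum_{i=1}^{B} \nabla_{u_t} H_t(x_t^{(i)}, p_{t+1}^{(i)}, u_t), \tag{11}$$

where $\mathbf{P}_{t+1}$ is the augmented co-state and $p_{t+1}^{(i)}$ is the co-state of the $i$-th sample.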
Proposition 4 suggests that in the batch setting, we aim to find a single open-loop control that can drive every initial state among the batch data to its designated target. Though seemingly trivial, this formulation is distinct from OCP, since the optimal policy typically varies with the initial state.
It should be stressed that despite the appealing connection between optimal control and DNN training, the two methodologies are not completely interchangeable. We note that the dimension of the activation typically changes throughout layers to extract effective latent representations. This unique property poses difficulties when one wishes to adopt analysis from the continuous-time framework. Despite the recent attention on treating networks as a discretization of an ODE (lu2017beyond; han2018mean), we note that this formulation restricts the applicability of optimal control theory to networks with residual architectures, as opposed to the generic dynamics discussed here.
3.2 Trajectory Optimization Perspective
In this part we draw a new connection between the training procedure of DNNs and trajectory optimization. Let us first revisit Back-propagation with gradient descent from Alg. 2. During the forward propagation, we treat the weight as the nominal control $\bar{u}$ that simulates the state trajectory $\bar{x}$. Then, the loss gradient is propagated backward, implicitly moving in the direction suggested by the Hamiltonian. The control update, in this case the first-order derivative, is computed simultaneously and later applied to the system.
It should be clear at this point that Alg. 2 resembles DDP in several ways. Starting from the nominal trajectory, both procedures carry certain information wrt the objective, either $\nabla_{x_t} H_t$ or $(V_x, V_{xx})$, backward in time in order to compute the control update. Since DDP computes the feedback policy at each time step, an additional forward simulation is required in order to apply the update. The computation graph for the two processes is summarized in Fig. 1. In the following proposition we make this connection formal and provide conditions under which the two algorithms become equivalent.
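Proposition 5. Assume $Q_{ux} = 0$ at all stages. Then the dynamics of the value derivative can be described by the adjoint dynamics, i.e. $\nabla_x V_t = p_t$ for all $t$. In this case, DDP is equivalent to the stage-wise Newton method by having

$$Q_u = \nabla_{u_t} H_t(x_t, p_{t+1}, u_t). \tag{12}$$

If furthermore we have $Q_{uu}^{-1} = \eta I$, then DDP degenerates to Back-propagation with gradient descent.

We leave the full proof to Appendix C and provide some intuition for the connection between first-order derivatives. First, notice that we always have $\nabla_x V_T = p_T = \nabla_x \phi(x_T)$ at the horizon without any assumption. It can be shown by induction that when $Q_{ux} = 0$ for all the remaining stages, $\nabla_x V_t$ takes the same update equation as $p_t$,

$$V_x = Q_x - Q_{ux}^\top Q_{uu}^{-1} Q_u = Q_x = \ell_x + f_x^\top V_x',$$

where $V_x'$ is the value derivative at stage $t+1$, and Eq. (12) follows immediately by

$$Q_u = \ell_u + f_u^\top V_x' = \ell_u + f_u^\top p_{t+1} = \nabla_{u_t} H_t.$$

The feedback policy under this assumption degenerates to

$$\delta u_t^* = -Q_{uu}^{-1} Q_u = -Q_{uu}^{-1} \nabla_{u_t} H_t,$$

which is equivalent to the stage-wise Newton method (de1988differential). It can be readily seen that we recover the gradient descent update by having an isotropic inverse covariance, $Q_{uu}^{-1} = \eta I$.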
Proposition 5 states that the backward pass in DDP collapses to Back-propagation when $Q_{ux}$ vanishes. To better explain the role of $Q_{ux}$ during optimization, consider the 2D example illustrated in Fig. 1. Given an arbitrary objective $L$ expanded at $(x_0, u_0)$, standard second-order methods compute the Hessian wrt $u$ and apply the update $\delta u = -L_{uu}^{-1} L_u$. DDP differs from them in that it also computes the mixed partial derivatives, i.e. $L_{ux}$. The resulting update law has the same update but with an additional term linear in the state. This feedback term, shown by the red arrow, compensates when the state moves away from $x_0$ during the update. In Sec. 5.2, we will conduct an ablation study and discuss the effect of this deviation on DNN training. The connection between DDP and existing methods is summarized in Table 1.
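The role of the mixed partial derivative can be seen on a scalar quadratic model. The sketch below uses hypothetical coefficient values chosen only for illustration. It compares the state-independent update with the update that adds the term linear in the state deviation, and also checks the degenerate case of Proposition 5, where a vanishing mixed derivative and an isotropic curvature of $1/\eta$ reduce the update to a gradient descent step.

```python
def delta_Q(dx, du, Q_x, Q_u, Q_xx, Q_uu, Q_ux):
    """Second-order model of the objective around the expansion point (scalar case)."""
    return (Q_x * dx + Q_u * du
            + 0.5 * Q_xx * dx * dx + Q_ux * du * dx + 0.5 * Q_uu * du * du)

def ddp_update(dx, Q_u, Q_uu, Q_ux):
    """Open-loop gain plus feedback gain times the state deviation."""
    k = -Q_u / Q_uu
    K = -Q_ux / Q_uu
    return k + K * dx

# Hypothetical coefficients (assumed values, not from the paper).
coeffs = dict(Q_x=0.2, Q_u=1.0, Q_xx=1.0, Q_uu=2.0, Q_ux=0.8)
dx = 0.5                                             # the state has moved during the update

du_open = ddp_update(0.0, coeffs["Q_u"], coeffs["Q_uu"], coeffs["Q_ux"])
du_feedback = ddp_update(dx, coeffs["Q_u"], coeffs["Q_uu"], coeffs["Q_ux"])
gap = delta_Q(dx, du_open, **coeffs) - delta_Q(dx, du_feedback, **coeffs)

# Degenerate case: no mixed derivative and isotropic curvature 1/eta.
eta = 0.1
du_gd = ddp_update(dx, Q_u=1.0, Q_uu=1.0 / eta, Q_ux=0.0)
```

Here `gap` is the amount by which the feedback update lowers the quadratic model relative to the open-loop update once the state has deviated; it equals one half of the curvature times the squared feedback correction.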
4 DDP Optimizer for Feedforward Networks
In this section we discuss a practical implementation of DDP for training feedforward networks. Due to space constraints, we leave the complete derivation to Appendix B.
4.1 Batch DDP Formulation
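We first derive the differential Bellman objective when feedforward networks are used as the dynamics. Recall the propagation formula in Eq. (9) and substitute this new dynamics into OCP by setting $f_t \equiv \sigma_t \circ g_t$. After some algebra, we can show that (we omit the subscript $t$ again for simplicity)

$$\begin{aligned} Q_x &= \ell_x + g_x^\top V_h, & Q_u &= \ell_u + g_u^\top V_h, \\ Q_{xx} &= \ell_{xx} + g_x^\top V_{hh} g_x + V_h \cdot g_{xx}, & Q_{uu} &= \ell_{uu} + g_u^\top V_{hh} g_u + V_h \cdot g_{uu}, \\ Q_{ux} &= \ell_{ux} + g_u^\top V_{hh} g_x + V_h \cdot g_{ux}, \end{aligned} \tag{13}$$

where $V_h := \sigma_h^\top V_x'$ and $V_{hh} := \sigma_h^\top V_{xx}' \sigma_h + V_x' \cdot \sigma_{hh}$ absorb the derivatives of the activation function, with $V_x'$ and $V_{xx}'$ the value derivatives at the next layer.

The computational overhead in Eq. (13) can be mitigated by leveraging the structure of feedforward networks. Since the affine transform is bilinear in $x_t$ and $u_t$, the terms $g_{xx}$ and $g_{uu}$ vanish. The tensor $g_{ux}$ admits a sparse structure. For fully-connected layers, the computation simplifies further (see Appendix B). For the coordinate-wise nonlinear transform, $\sigma_h$ and $\sigma_{hh}$ are a diagonal matrix and a diagonal tensor. In most learning instances, stage-wise losses typically involve weight decay alone; thus the terms $\ell_x$, $\ell_{xx}$, and $\ell_{ux}$ also vanish.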
Note that Eq. (13) describes the expansion along a single trajectory with the dynamics given by the feedforward networks. To adapt it to the setting of batch trajectory optimization, we augment the activation space to $\mathbb{R}^{B n_t}$ (recall that $B$ is the batch size), in the spirit of Proposition 4. It should be stressed, however, that despite drawing inspiration from the augmented Hamiltonian framework, the resulting representation does not admit a clean form such as averaging over individual updates in Eq. (11). To see why this is the case, observe that at the horizon $T$, the derivative of the augmented value function can indeed be expressed by the individual ones by $\nabla_{\mathbf{X}} V_T = \frac{1}{B} [\cdots, \nabla_x \phi(x_T^{(i)}), \cdots]^\top$. Such an independent structure is not preserved through the backward pass since at $T-1$, the derivative instead takes the form

$$\nabla_{\mathbf{X}} V_{T-1} = \frac{1}{B} \left[ \cdots, \; Q_x^{(i)} - Q_{ux}^{(i)\top} \hat{Q}_{uu}^{-1} \hat{Q}_u, \; \cdots \right]^\top,$$

where $\hat{Q}_{uu}$ and $\hat{Q}_u$ are the averages over $Q_{uu}^{(i)}$ and $Q_u^{(i)}$. The intuition is that when optimizing batch trajectories with the same control law, the Bellman objective of each sample couples with the others through the second-order expansion of the augmented $Q$. Consequently, the value function will no longer be independent soon after leaving the terminal horizon. We highlight this trait, which distinguishes batch DDP from its original representation.
4.2 Practical Implementation
Directly applying batch DDP to optimize DNNs can be impractical. For one, training inherits stochasticity due to mini-batch sampling. This differs from standard trajectory optimization, where the initial state remains unaltered throughout optimization. Secondly, the control dimension is typically orders of magnitude higher than that of the state; thus inverting the Hessian soon becomes computationally infeasible. In this part we discuss several practical techniques, adopted from both literatures, that stabilize the optimization process and enable real-time training.
First, we apply Tikhonov regularization on $Q_{uu}$ and line search, since both play key roles in the convergence of DDP (liao1991convergence). We note that these regularizations have already shown up in robustifying DNN training, in the form of the $\ell_2$-norm and in vaswani2019painless. From the perspective of trajectory optimization, we should emphasize that without regularization, $Q_{uu}$ will lose its positive definiteness whenever $f_u$ has low rank. This is indeed the case in feedforward networks. The observation on low-rank matrices also applies to $V_{xx}$ when the dimension of the activation reduces during forward propagation. Thus we also apply Tikhonov regularization to $V_{xx}$. This can be seen as placing a quadratic state cost and has been shown to improve robustness when optimizing complex humanoids (tassa2012synthesis).

Next, when using DDP for trajectory optimization, one typically has the option of expanding the dynamics up to first or second order. Note that both are still considered second-order methods and generate layer-wise feedback policies, except that the former simplifies the problem in a way similar to Gauss-Newton. The computational benefit and stability obtained by keeping only the linearized dynamics have been discussed thoroughly in the robotics literature (todorov2005generalized; tassa2012synthesis). Thus, hereafter we use the DDP optimizer to refer to this version.
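The loss of positive definiteness is easy to reproduce in a tiny case. The sketch below assumes linearized dynamics with a one-dimensional state and a two-dimensional control (a hypothetical setting chosen for illustration), so that the Gauss-Newton curvature term is a rank-one 2x2 matrix. Adding a small multiple of the identity restores invertibility. All helper names are illustrative.

```python
def gauss_newton_Quu(f_u, V_xx):
    """Curvature f_u^T V_xx f_u for a 1-D state and a control of dimension len(f_u)."""
    return [[fi * V_xx * fj for fj in f_u] for fi in f_u]

def add_tikhonov(M, eps):
    """Return M + eps * I."""
    n = len(M)
    return [[M[i][j] + (eps if i == j else 0.0) for j in range(n)] for i in range(n)]

def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def solve2(M, rhs):
    """Solve a 2x2 linear system by Cramer's rule."""
    d = det2(M)
    return [(rhs[0] * M[1][1] - M[0][1] * rhs[1]) / d,
            (M[0][0] * rhs[1] - M[1][0] * rhs[0]) / d]

# Assumed values for illustration: Jacobian wrt control, value derivatives at the next step.
f_u = [1.0, 2.0]
V_x, V_xx = 0.5, 1.0

Q_u = [fi * V_x for fi in f_u]
Q_uu = gauss_newton_Quu(f_u, V_xx)              # rank one, hence singular
Q_uu_reg = add_tikhonov(Q_uu, 1e-2)             # positive definite after regularization
k = solve2(Q_uu_reg, [-q for q in Q_u])         # open-loop gain with regularized curvature
```

For a symmetric 2x2 matrix, a positive leading entry together with a positive determinant implies positive definiteness, which is why the determinant is a sufficient check here.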
Tremendous efforts have been spent recently on efficient curvature estimation of the loss landscape during training. This is particularly crucial in enabling the applicability of second-order methods, since inverting the full Hessian, even in a layer-wise fashion, can be computationally intractable, and DDP is no exception. Fortunately, DDP can be integrated with these advances. We first notice that many popular curvature factorization methods contain modules that collect certain statistics during training to compute the preconditioned update. Take EKFAC (george2018fast) for instance: the update for each layer is approximated by $\delta u_t \approx -\mathrm{EKFAC}(\nabla_{u_t} J; \mathcal{H})$, where $\mathcal{H}$ denotes the history of statistics collected until the most recent iteration. The module bypasses the computation of the Hessian and its inverse by directly estimating the matrix-vector product. We can integrate it with DDP by

$$k_t \approx -\mathrm{EKFAC}(Q_u; \mathcal{H}), \tag{15}$$

and compute $K_t$ by applying Eq. (15) column-wise to $Q_{ux}$. Notice that we replace $\nabla_{u_t} J$ with $Q_u$ since here we aim to approximate the landscape of the value function. Hereafter we refer to this approximation as DDP-EKFAC.
Integrating DDP with existing optimizers is not restricted to second-order methods. Recall the connection we made in Proposition 5. When $Q_{uu}$ is isotropic, DDP inherits the same structure as gradient descent. In a similar vein, adaptive first-order methods, such as RMSprop (hinton2012neural), approximate $Q_{uu}^{-1}$ by a diagonal matrix adapted to the diagonal of the inverse covariance. We will refer to these two variants as DDP-SGD and DDP-RMSprop.
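The integration pattern described above can be sketched generically: any routine that approximates multiplication by the inverse curvature can be applied once to the first-order term to obtain the open-loop gain, and once per column of the mixed derivative to obtain the feedback gain. The code below is a minimal illustration under that reading; `ddp_gains`, `isotropic`, and `rmsprop_like` are hypothetical names, and the RMSprop-style rule is a simplified stand-in rather than the exact update of any library.

```python
import math

def ddp_gains(Q_u, Q_ux, precondition):
    """Return (k, K), where `precondition(v)` approximates Q_uu^{-1} v.

    Q_u is a list of length m; Q_ux is an m x n nested list.
    """
    m, n = len(Q_ux), len(Q_ux[0])
    k = [-v for v in precondition(Q_u)]
    K = [[0.0] * n for _ in range(m)]
    for j in range(n):                                  # apply the module column-wise
        col = precondition([Q_ux[i][j] for i in range(m)])
        for i in range(m):
            K[i][j] = -col[i]
    return k, K

def isotropic(eta):
    """Isotropic inverse curvature: multiplication by eta (gradient-descent-like)."""
    return lambda v: [eta * vi for vi in v]

def rmsprop_like(second_moment, eta, eps=1e-8):
    """Diagonal inverse curvature built from running second-moment estimates."""
    return lambda v: [eta * vi / (math.sqrt(s) + eps)
                      for vi, s in zip(v, second_moment)]

# Assumed values for illustration.
Q_u = [1.0, -2.0, 0.5]
Q_ux = [[0.2, 0.0], [0.0, 0.4], [0.1, -0.1]]

k_sgd, K_sgd = ddp_gains(Q_u, Q_ux, isotropic(eta=0.1))
k_rms, K_rms = ddp_gains(Q_u, Q_ux, rmsprop_like([1.0, 4.0, 0.25], eta=0.1))
```

Keeping the preconditioner as a callable mirrors the observation that the feedback computation is an independent module: swapping the curvature approximation does not change how the gains are assembled.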
5.1 Batch Trajectory Optimization on Synthetic Data
We first demonstrate the effectiveness of our DDP optimizer in batch trajectory optimization. Recall from the derivation in Sec. 2.2 that the feedback policy is optimal locally around the region where $Q_t$ is expanded; thus only samples traveling through this region benefit. Since conceptually a DNN classifier can be thought of as a dynamical system guiding each group of trajectories toward the region belonging to their classes, we hypothesize that for the DDP optimizer to show its effectiveness on batch training, the feedback policy must act as an ensemble to capture the feedback behavior for each class. To this end, we randomly generate synthetic data points from Gaussian clusters in the input space and train 3-layer feedforward networks (---) from scratch using the DDP optimizer. For all settings, the training error drops to nearly zero. The spectrum distribution, sorted in descending order, of the feedback policy in the prediction layer is shown in Fig. 2. The result shows that the number of nontrivial eigenvalues in the feedback policy matches exactly the number of classes in each problem (indicated by the vertical dashed line). As the state distribution concentrates into bulks through training, the eigenvalues also increase, providing a stronger feedback direction to the weight update. These observations support our batch DDP formulation in Sec. 4.1.
5.2 Performance on Classification Data Set
Next, we validate the performance of the DDP optimizer, along with several variants proposed in the previous section, on classification tasks. In addition to baselines such as SGD and RMSprop (hinton2012neural), we compare with state-of-the-art second-order optimizers, including KFAC (martens2015optimizing), EKFAC (george2018fast), and KWNG (arbel2020wng). We use (fully-connected) feedforward networks for all experiments. The complete experiment setup and additional results (e.g. accuracy curves) are provided in Appendix D. Table 2 summarizes the performance with the top two results highlighted. Without any factorization, we are not able to obtain results for DDP on (Fashion)-MNIST due to the exploding computation when inverting $Q_{uu}$.
On all data sets, DDP achieves better or comparable results against existing methods. Notably, when comparing the vanilla DDP with its approximate variants (see Fig. 3), the latter typically admit smaller variance and sometimes converge to better local minima. The instability in the former can be caused by the overestimation of the value Hessian when using only mini-batch data, which is mitigated by the amortized method used in EKFAC, or by the diagonal approximation in first-order methods. This sheds light on the benefit gained by bridging two seemingly disconnected methodologies.
5.3 Effect of Feedback Policies
Sensitivity to Hyper-parameter: For optimizers that are compatible with DDP, we observe minor improvements in the final training loss. To better understand the effect of the feedback mechanism, we conduct an ablation analysis and compare their training performance across different learning rate setups, while keeping all the other hyper-parameters the same. The result on DIGITS is shown in Fig. 4, where the left and right columns correspond to the best-tuned and perturbed learning rates. While all optimizers are able to converge to low training error when best-tuned, the ones integrated with DDP tend to stabilize the training when the learning rate changes by some fraction; sometimes they also achieve lower training error in the end. The performance improvement induced by the feedback policy typically becomes obvious when the learning rate increases.
Variance Reduction and Faster Convergence: Table 3 shows the variance differences in training loss and test accuracy after the optimizers are integrated with DDP feedback modules. Specifically, we report the difference in variance between each optimizer and its DDP-integrated counterpart, for the optimizers compatible with DDP, including SGD, RMSprop, and EKFAC. We keep all hyper-parameters the same for each experiment so that the performance difference only comes from the existence of feedback policies. As shown in Table 3, in most cases having additional updates from DDP stabilizes the training dynamics by reducing its variance, and a similar reduction can be observed in the variance of testing accuracy over optimization, sometimes by a large margin. Empirically, we find that such a stabilization may also lead to faster convergence. In Fig. 6, we report the training dynamics of the top three experiment instances in Table 3 that have large variance reduction when using DDP. In all cases, the optimization paths also admit faster convergence.
To give an explanation from the trajectory optimization viewpoint, recall that DDP computes layer-wise policies, each composed of two terms. The open-loop control $k_t$ is state-independent and, in this setup, equivalent to the original update computed by each optimizer. The feedback term $K_t \delta x_t$ is linear in $\delta x_t$, which can be expressed by $\delta x_t = F_0^{t-1}(x_0; \bar{u} + \delta\bar{u}) - F_0^{t-1}(x_0; \bar{u})$, where we abuse the notation and denote $F_0^{t-1} := f_{t-1} \circ \cdots \circ f_0$ as the compositional dynamics propagating $x_0$ with the controls of the first $t$ layers. In other words, $\delta x_t$ captures the state differential when the new control is applied until layer $t$, and we should expect it to increase when a larger step size, and thus a larger difference between controls, is taken at each layer. This implies that the control update in the later layers may not be optimal (at least from the trajectory optimization viewpoint) since the gradient is evaluated at $x_t$ instead of $x_t + \delta x_t$. The difference between $x_t$ and $x_t + \delta x_t$ cannot be neglected, especially during early training when the objective landscape contains nontrivial curvature everywhere (alain2019negative). The feedback direction is designed to mitigate this gap. In fact, we can show that $K_t \delta x_t$ approximately solves

$$\min_{\delta u_t} \left\| \nabla_{u_t} Q_t(x_t + \delta x_t, u_t + \delta u_t) - \nabla_{u_t} Q_t(x_t, u_t) \right\|,$$

with the derivation provided in Appendix C. Thus, by having this additional term throughout updates, we stabilize the training dynamics by traveling through the landscape cautiously without going off the cliff.
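A first-order sketch of why the feedback term plays this role is given below. It assumes that $Q_{uu}$ is invertible and that the feedback gain is $K_t = -Q_{uu}^{-1} Q_{ux}$, and it reads "mitigating the gap" as keeping the control gradient of the Bellman objective unchanged after the state has moved by $\delta x_t$; the complete argument is the one referred to in Appendix C. Expanding the gradient to first order,

$$\nabla_{u_t} Q_t(x_t + \delta x_t, u_t + \delta u_t) \approx Q_u + Q_{ux}\, \delta x_t + Q_{uu}\, \delta u_t,$$

so that the mismatch with the gradient at the nominal point is

$$\nabla_{u_t} Q_t(x_t + \delta x_t, u_t + \delta u_t) - \nabla_{u_t} Q_t(x_t, u_t) \approx Q_{ux}\, \delta x_t + Q_{uu}\, \delta u_t.$$

The norm of the right-hand side is minimized, and in fact driven to zero, by

$$\delta u_t = -Q_{uu}^{-1} Q_{ux}\, \delta x_t = K_t\, \delta x_t.$$

In this reading the feedback term cancels, to first order, the change in the control gradient induced by the state deviation.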
5.4 Remarks on Computation Overhead
Lastly, Fig. 5 summarizes the computational overhead during the DDP backward pass on DIGITS. Specifically, we compare the wall-clock processing time of DDP-EKFAC and DDP to the computation spent on the curvature approximation module in the DDP-EKFAC optimizer. The value thus suggests the additional overhead required for EKFAC to adopt the DDP framework, which includes the second-order expansion of the value function, computing layer-wise feedback policies, etc. While the vanilla batch DDP scales poorly with the dimension of the hidden state under the current batch formulation, the overhead in DDP-EKFAC increases only by a constant (from to ) wrt different architecture setups, such as batch size, hidden dimension, and network depth. We stress that our current implementation is not fully optimized, so there is still room for further acceleration.
In this work, we introduce the Differential Dynamic Programming Neural Optimizer (DDP), a new class of algorithms arising from bridging DNN training with optimal control theory and trajectory optimization. This new perspective suggests that existing methods stand as special cases of DDP and can be extended to adopt the framework painlessly. The resulting optimizer features layer-wise feedback policies, which help reduce sensitivity to hyper-parameters and sometimes improve convergence. We hope this work provides new algorithmic insight and bridges the deep learning and optimal control communities. There are numerous future directions left to explore. For one, deriving update laws for other architectures (e.g. convolution and batch-norm) and applying factorization to other gigantic matrices are essential to the applicability of the DDP optimizer.
Appendix A Derivation of Batch PMP
A.1 Problem Formulation and Notation
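Recall the original OCP for single trajectory optimization. In its batch setting, we consider the following state-augmented optimal control problem:

$$\min_{\bar{u}} J(\bar{u}; \mathbf{X}_0) := \Phi(\mathbf{X}_T) + \sum_{t=0}^{T-1} L_t(\mathbf{X}_t, u_t) \quad \text{s.t.} \quad \mathbf{X}_{t+1} = F_t(\mathbf{X}_t, u_t),$$

where $\mathbf{X}_t := [x_t^{(1)}, \dots, x_t^{(B)}]^\top$ denotes the state-augmented vector over each $x_t^{(i)}$, and $B$ denotes the batch size. $L_t := \frac{1}{B} \sum_{i=1}^{B} \ell_t(x_t^{(i)}, u_t)$ and $\Phi := \frac{1}{B} \sum_{i=1}^{B} \phi(x_T^{(i)})$ respectively represent the average intermediate and terminal cost over $\ell_t$ and $\phi$. $F_t$ consists of $B$ independent mappings from each dynamics $f_t(x_t^{(i)}, u_t)$. Consequently, its derivatives can be related to the ones for each sample by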