Deep learning is one of the most rapidly developing areas in modern artificial intelligence with tremendous impact to different industries ranging from the areas of social media, health and biomedical engineering, robotics, autonomy and aerospace systems. Featured with millions of parameters yet without much hand tuning or domain-specific knowledge, Deep Neural Networks (DNNs) match and sometimes exceed human-level performance in complex problems involving visual synthesizing(Krizhevsky et al., 2012), language reasoning (Srivastava & Salakhutdinov, 2012), and long-horizon consequential planning (Silver et al., 2016). The remarkable practical successes, however, do not come as a free lunch. Algorithms for training DNNs are extremely data-hungry. While insufficient data can readily lead to memorizing irrelevant features (Geirhos et al., 2018), data imbalance may cause severe effects such as imposing improper priors implicitly (Wang et al., 2016). Although automating tuning for millions of parameters alleviates inductive biases from traditional engineering, it comes with the drawback of making interpretation and analysis difficult. Furthermore, DNN is known for being vulnerable to small adversarial perturbations (Goodfellow et al., 2014). In fact, researchers have struggled to improve its robustness beyond infinite norm ball attacks (Athalye & Carlini, 2018). Finally, while the deep learning community has developed recipes related to the choice of the underlying organization of a DNN, the process of the overall architectural design lacks solid theoretical understanding and remains a fairly ad-hock process.
The aforementioned issues highlight the need towards the development of a theory for deep learning which will provide a scientific methodology to design DNNs architectures, robustify their performance against external attacks/disturbances, and enable the development the corresponding training algorithms. Given this need, our objective in this paper is to review and present in a systematic way work towards the discovery of deep learning theory. This work relies on concepts drawn primarily from the areas of dynamical systems and optimal control theory, and its connections to information theory and statistical physics.
Theoretical understanding of DNN training from previous works has roughly followed two streams: deep latent representation and stochastic optimization. On the topic of deep representations, the composition of affine functions, with element-wise nonlinear activations, plays a crucial role in automatic feature extraction(Simonyan et al., 2013) by constructing a chain of differentiable process. An increase in the depth of a NN architecture has the effect of increasing its expressiveness exponentially (Poole et al., 2016)
. This naturally yields a highly over-parametrized model, whose loss landscape is known for a proliferation of local minima and saddle points(Dauphin et al., 2014)
. However, the over-fitting phenomena, suggested by the bias-variance analysis, has not been observed during DNN training(Zhang et al., 2016). In practice, DNN often generalizes remarkably well on unseen data when initialized properly (Sutskever et al., 2013).
Generalization of highly over-parametrized models cannot be properly explained without considering stochastic optimization algorithms. Training DNN is a non-convex optimization problem. Due to its high dimensionality, most practically-used algorithms utilize first-order derivatives with aids of adaptation and momentum mechanisms. In fact, even a true gradient can be too expensive to compute on the fly; therefore only an unbiased estimation is applied at each update. Despite these approximations that are typically used to enable applicability, first-order stochastic optimization is surprisingly robust and algorithmically stable(Hardt et al., 2015). The stochasticity stemmed from estimating gradients is widely believed to perform implicit regularization (Chaudhari & Soatto, 2018), guiding parameters towards flat plateaus with lower generalization errors (Keskar et al., 2016). First-order methods are also proven more efficient to escape from saddle points (Lee et al., 2017b), whose number grows exponentially with model dimensionality (Dauphin et al., 2014). Research along this line provides a fundamental understanding on training dynamics and convergence property, despite the analysis is seldom connected to the deep representation viewpoint.
How do the two threads of deep latent organization and stochastic optimization interplay with each other and what are the underlying theoretical connections? These are questions that have not been well-explored and are essential towards the development of a theory for Deep Learning.
Indeed, study of stochastic optimization dynamics often treats DNN merely as a black-box.
This may be insufficient to describe the whole picture.
When using back-propagation (LeCun et al., 1990) to obtain first-order derivatives,
the backward dynamics, characterized by the compositional structure, rescales the propagation made by optimization algorithms,
which in return leads to different representation at each layer.
Frameworks that are able to mathematically characterize these compounding effects will provide more nuanced statements.
One of such attempts has been information bottleneck theory (Shwartz-Ziv & Tishby, 2017),
which describes the dynamics of stochastic optimization using information theory, and connects it to optimal representation via the bottleneck principle.
Another promising branch from Du et al. (2018a, b) showed that
the specific representation, i.e. the Gram matrix, incurred from gradient descent (GD)
characterizes the dynamics of the prediction space and can be used to prove global optimality.
These arguments, however, have been limited to either certain architectures (Saxe et al., 2018)
or noise-free optimization111 We note that global optimality for stochastic gradient descent has been recently proven
yet their convergence theories rely on certain assumptions on the data set in order to have the Frobenius norm of the (stochastic) gradient lower-bounded.
This is in contract the least eigenvalue of the prediction dynamics in
We note that global optimality for stochastic gradient descent has been recently proven inZou et al. (2018); Allen-Zhu et al. (2018)
, yet their convergence theories rely on certain assumptions on the data set in order to have the Frobenius norm of the (stochastic) gradient lower-bounded. This is in contract the least eigenvalue of the prediction dynamics inDu et al. (2018a, b), which is more related to the dynamical analysis in this review. .
In this review, we provide a dynamical systems and optimal control perceptive to DNNs in an effort to systematize the alternative approaches and methodologies. This allows us to pose and answer the following questions: (i) at which state should the training trajectory start, i.e. how should we initialize the weights or hyper-parameters, (ii) through which path, in a distribution sense, may the trajectory traverse, i.e. can we give any prediction of training dynamics on average, (iii) to which fixed point, if exists, may the trajectory converge, and finally (iv) the stability and generalization property at that fixed point. In the context of deep learning, these can be done by recasting DNNs and optimization algorithms as (stochastic) dynamical systems. Advanced tools from signal processing, mean-field theory, and stochastic calculus can then be applied to better reveal the training properties. We can also formulate an optimal control problem upon the derived dynamics to provide principled guidance for architecture and algorithm design. The dynamical and control viewpoints fit naturally with supervised learning and can readily extend to other learning schemes, such as Bayesian learning, adversarial training, and specific forms of meta learning, This highlights the potential to provide more theoretical insights.
The article is organized as follows. In Sec. 2, we will go over recent works related to the dynamical viewpoint. Sec. 3 and 4 demonstrate how we can recast DNNs and stochastic optimizations as dynamical systems, then apply control theory to them. In Sec. 5, we extend our analysis to other learning problems. Finally, we conclude our work and discuss some future directions in Sec. 6.
Notation: We will denote and as the (pre-)activation at layer ( for simplicity). Mapping to the next layer obeys and , where
is a nonlinear activation functional andis an affine transform parametrized by . The full parameter space across all layers is denoted . Given a data set , where and , we aim to minimize a loss , or equivalently the cost
from the control viewpoint. The element of the vector/matrix are respectively denoted asand . We will follow the convention to denote the inner product of two vectors, with
as its generalization to the inner product of two functionals weighted by a probability density. We will always use the subscriptto denote the dynamics. Depending on the context, it can either mean propagation through DNN (Sec. 3) or training iterations (Sec. 4).
2 Related Work
Mean Field Approximation & Gaussian Process: Mean field theory allows us to describe distributions of activations and pre-activations over an ensemble of untrained DNNs in an analytic form. The connection was adapted in Poole et al. (2016) to study how signals propagate through layers at the initialization stage. It implies an existence of an order-to-chaos transition as a function of parameter statistics. While networks in the phase of order suffer from saturated information and vanished gradients, in the chaotic
regime expressions of networks grow exponentially with depth, and exploded gradients can be observed. This phase transition diagram, formally characterized inSchoenholz et al. (2016), provides a necessary condition towards network trainability and determines an upper bound on the number of layers allowable for information to pass through. This analysis has been successfully applied to most commonly-used architectures for critical initialization, including FCN, CNN, RNN, LSTM, ResNet (Schoenholz et al., 2016; Chen et al., 2018a; Gilboa et al., 2019; Xiao et al., 2018; Yang & Schoenholz, 2017). In addition, it can also be used to provide geometric interpretation by estimating statistical properties of Fisher information (Karakida et al., 2018; Pennington et al., 2018). It is worth noticing that all aforementioned works require the limit of layer width and i.i.d. weight priors in order to utilize the Gaussian approximation. Indeed, the equivalence between DNN and Gaussian process has long been known for single-layer neural networks (Williams, 1997) and extended to deeper architectures recently (Matthews et al., 2018; Garriga-Alonso et al., 2018). The resulting Bayesian viewpoint enables uncertainty estimation and accuracy prediction at test time (Lee et al., 2017a).
Implicit Bias in Stochastic Gradient Descent (SGD): There has been commensurate interest in studying the implicit bias stemmed from stochastic optimization. Even without stochasticity, vanilla GD algorithms are implicitly regulated as they converge to max-margin solutions for both linear predictors (Soudry et al., 2018; Gunasekar et al., 2018a) and over-parametrized models, e.g. multi-layers linear networks (Gunasekar et al., 2018b; Ji & Telgarsky, 2018). When the stochasticity is introduced, a different regularization effect has been observed (Wu et al., 2018). This implicit regularization plays a key role in explaining why DNNs generalize well despite being over-parametrized (Zhang et al., 2016; Neyshabur et al., 2017). Essentially, stochasticity pushes the training dynamics away from sharp local minima (Zhu et al., 2018) and instead guides it towards flat plateaus with lower generalization errors (Keskar et al., 2016)
. An alternative view suggests a convolved, i.e. smoothening, effect on the loss function throughout training(Kleinberg et al., 2018). The recent work from Chaudhari & Soatto (2018) provides mathematical intuitions by showing that SGD performs variational inference under certain approximations. Algorithmically, Chaudhari et al. (2016, 2018)
proposed a surrogate loss that explicitly biases SGD dynamics towards flat local minima. The corresponding algorithm relates closely to stochastic gradient Langevin dynamics, a computationally efficient sampling technique originated from Markov Chain Monte Carlo (MCMC) for large-scale problems(Teh et al., 2016; Li et al., 2016).
Information Theory & Statistical Physics: Research along this direction studies the dynamics of Markovian stochastic process and its effect on the deep representation at an ensemble level. For instance, the Information Bottleneck (IB) theory (Tishby & Zaslavsky, 2015; Shwartz-Ziv & Tishby, 2017) studies the training dynamics on the Information Plane described by the mutual information of layer-wise activations. The principle of information bottleneck mathematically characterizes the phase transition from memorization to generalization, mimicking the critical learning periods in biological neural systems (Achille et al., 2019). Applying the same information Lagrangian to DNN’s weights has revealed intriguing properties of the deep representation, such as invariance, disentanglement and generalization (Achille & Soatto, 2018, 2019), despite a recent debate in Saxe et al. (2018) arguing the inability of the findings in references (Tishby & Zaslavsky, 2015; Shwartz-Ziv & Tishby, 2017) to generalize beyond certain architectures. In Valle-Perez et al. (2019), similar statements on the implicit bias has been drawn from the Algorithmic Information Theory (AIT), suggesting the parameter-function map of DNNs is exponentially biased towards simple functions. The information theoretic viewpoint is closely related to statistical physics. In Goldt & Seifert (2017a, b), an upper bound mimicking the second law of stochastic thermodynamics was derived for single-layer networks on binary classification tasks. In short, generalization of a network to unseen datum is bounded by the summation of the Shannon entropy of its weights and a term that captures the total heat dissipated during training. The concept of learning efficiency was introduced as an alternative metric to compare algorithmic performance. Additionally, Yaida (2018) derived a discrete-time master equation at stationary equilibrium and linked it to the fluctuation-dissipation theorem in statistical mechanics.
Dynamics and Optimal Control Theory: The dynamical perspective has received considerable attention recently as it brings new insights to deep architectures and training processes. For instance, Weinan (2017) proposed to view DNNs as a discretization of continuous-time dynamical systems. From such, the propagating rule in the deep residual network (He et al., 2016),
, can be thought of as an one-step discretization of the forward Euler scheme on an ordinary differential equation (ODE),. This interpretation has been leveraged to improve residual blocks in the sense that it achieves more effective numerical approximation (Lu et al., 2017). In the continuum limit of depth, the flow representation of DNNs has made the transport analysis with Wasserstein geometry possible (Sonoda & Murata, 2019). Algorithmically, efficient computational methods have been developed in Chen et al. (2018b); Chen & Duvenaud (2019) to allow parameterization of (stochastic) continuous-time dynamics (e.g. derivative of latent variables) directly with DNNs. When the analogy between optimization algorithms and controllers is further drawn (Hu & Lessard, 2017; An et al., 2018b), standard supervised learning can be recast as a mean-field optimal control problem (Han et al., 2018). This is particularly beneficial since it enables new training algorithms inspired from optimal control literature (Li et al., 2017a; Li & Hao, 2018).
Similar analysis can be applied to SGD by viewing it as a stochastic dynamical system. In fact, most previous discussions on implicit bias formulate SGD as stochastic Langevin dynamics (Pavliotis, 2014). Other stochastic modeling, such as Lèvy process, has been recently proposed (Simsekli et al., 2019). In parallel, stability analysis of the Gram matrix dynamics induced by DNN reveals global optimality of GD algorithms (Du et al., 2018b, a). Applying optimal control theory to SGD dynamics results in optimal adaptive strategies for tuning hyper-parameters, such as the learning rate, momentum, and batch size (Li et al., 2017b; An et al., 2018a).
3 Deep Neural Network as a Dynamical System
As mentioned in Sec. 2, DNNs can be interpreted as finite-horizon nonlinear dynamical systems by viewing each layer as a distinct time step. In Sec 3.1, we will discuss how to explore this connection to analyze information propagation inside DNN. The formalism establishes the foundation of recent works (Du et al., 2018a; Schoenholz et al., 2016; Xiao et al., 2018), and we will discuss its implications in Sec. 3.2. In Sec 3.3
, we will draw the connection between optimization algorithms and controllers, leading to an optimal control formulation of DNN training characterized by mean-field theory. Hereafter we will focus on fully-connected (FC) layers and leave remarks for other architectures, e.g. convolution layers and residual connections, in SMA.
3.1 Information Propagation inside DNN
Recall the dynamics of a FC-DNN at time step , i.e. at layer -th and suppose its weights and biases are initialized by drawing i.i.d. from two zero-mean Gaussians, i.e.
and respectively denote the variance of the weights and biases, and is the dimension of the pre-activation at time
. Central limit theorem (CLT) implies in the limit of large layer widths,, the distribution of pre-activation elements,
, also converges to a Gaussian. It is straightforward to see the distribution also has zero mean and its variance can be estimated by matching the second moment of the empirical distribution ofacross all ,
can be viewed as the normalized squared length of the pre-activation, and we will use it as the statistical quantity of the information signal. The dynamics of , when propagating from time step to , takes a nonlinear form
with the initial condition given by . Notice that despite starting the derivation from random neural networks, the mapping in Eq. (3) admits a deterministic process, depending only on , , and . We highlight this determinism as the benefit gained by mean-field approximation. Schoenholz et al. (2016) showed that for any bounded and finite value of and , there exists a fixed point, , regardless of the initial state .
Similarly, for a pair of input we can derive the following recurrence relation
is the covariance matrix at time and the initial condition is given by . Under the same conditions for to exist, we also have the fixed points for the covariance and correlation, respectively denoted and . The element of the covariance matrix at its fixed point hence takes a compact form
where is Kronecker delta. It is easy to verify that has for the diagonal entries and for the off-diagonal ones. In short, when the mean field theory is applied to approximate distributions of activations and pre-activations, the statistics of the distribution follows a deterministic dynamics as propagating through layers. This statistics can be treated as the information signal and from there the dynamical system analysis can be applied, as we will show in the next subsection.
3.2 Stability Analysis & Implications
Traditional stability analysis of dynamical systems often involves computing the Jacobian at the fixed points. The Jacobian matrix describes the rate of change of system output when small disturbance is injected to input. In this vein, we can define the residual system, , and first-order expand222We refer readers to the Supplementary Material in Xiao et al. (2018) for a complete treatment. it at . The independent evolutions of the two signal quantities, as shown in Eq. (3) and (4), already hint that the Jacobian can be decoupled into two sub-systems. Their eigenvalues are given by,
This eigen-decomposition suggests the information traverses through layers in the diagonal and off-diagonal eigenspace. A fixed point is stable if and only if bothand are less than . In fact, the logarithms of and relate to the well-known Lyapunov exponents in the dynamical system theory, i.e.
where denotes the dynamics of the correlation.
The dynamical system analysis in Eq. (7-10) has several important implications. Recall again that given a DNN, its propagation rule depends only on and . We can therefore construct a phase diagram with and as axes and define a critical line that separates the ordered phase, in which all eigenvalues are less than to stabilize fixed points, and the chaotic phase, in which either or exceeds , leading to divergence. An example of the phase diagram for FC-DNNs is shown in Fig. 1. Networks initialized in the ordered phase may suffer from saturated information if the depth is sufficiently large. They become un-trainable since neither forward nor backward propagation is able to penetrate to the destination layer. Fig. 1 gives an illustration of how and , named depth scale (Schoenholz et al., 2016), predict the trainability of random DNNs. On the other hand, networks initialized along the critical line remain marginal stable, and information is able to propagate through an arbitrary depth without saturating or exploding. It is particularly interesting to note that while the traditional study on dynamical systems focuses on stability conditions, the dynamics inside DNN instead requires a form of transient chaos.
The discussion of the aforementioned critical initialization, despite being crucial in the success of DNN training (Glorot & Bengio, 2010), may seem limited since once the training process begins, the i.i.d. assumptions in order to construct Eq. (2), and all those derivations afterward, no longer hold. Fortunately, it has been empirically observed that when DNN is sufficiently over-parameterized, its weight will be close to the initialization over training iterations (Li & Liang, 2018). In other words, under certain assumptions, the statistical property derived at initialization can be preserved throughout training. This has a strong implication as it can be leveraged to prove the global convergence and optimality of GD (Du et al., 2018b, a). Below we provide the proof sketch and demonstrate how the deep information is brought into the analysis.
Recalling the information defined in Eq. (4-5) for a pair of input, we can thereby construct a Gram matrix , where is the size of the dataset and . The element of represents the information quantity between the data points with the corresponding indices, i.e. . The Gram matrix can be viewed as a deep representation of a matrix form induced by the DNN architecture and dataset at random initialization. The same matrix has been derived in several concurrent works, namely the Neural Tangent Kernel (NTK) in Jacot et al. (2018).
Now, consider a mean squared loss, , where and each element denotes the scalar prediction of each data point . The dynamics of the prediction error governed by the GD algorithm takes an analytical form (Du et al., 2018b) written as
and denote the iteration process and learning rate. Eq. (11) indicates a linear dynamics characterized by the matrix , whose element at initialization is related to the one of by
When the width is sufficiently large, will be close to (precisely is bounded) for all iterations . This, together with the least eigenvalue of being lower-bounded for non-degenerate dataset, concludes the linear convergence to the global minimum.
3.3 Training DNN with Optimal Control
To frame optimization algorithms and training processes into the dynamical system viewpoint, one intuitive way is to interpret optimization algorithms as controllers. As we will show in Sec. 3.3.1, this can be achieved without loss of generality and naturally yields an optimal control formalism of the deep learning training process. Such a connection is useful since the optimality conditions of the former problem are well studied and characterized by the Pontryagin’s Maximum Principle (PMP) and the Hamilton-Jacobi-Bellman (HJB) equation, which we will introduce in Sec. 3.3.2 and 3.3.3, respectively. The fact that back-propagation (LeCun et al., 1990) can be viewed as an approximation of PMP (Li et al., 2017a) opens a room for new optimization algorithms inspired from the optimal control perspective.
3.3.1 Mean-Field Optimal Control Derivation
In Hu & Lessard (2017); An et al. (2018b), a concrete connection was derived between first-order optimization algorithms and PID controllers. To see that, consider the formula of gradient descent and the discrete realization of integral control:
These two update rules are equivalent when we interpret the gradient as tracking error . Both modules are designed to drive certain statistics of a system, either the loss gradient or tracking error, towards zero, by iteratively updating new variables to affect system dynamics. When a momentum term is introduced, it will result in an additional lag component which helps increase the low-frequency loop gain (Franklin et al., 1994). This suggests accelerated convergence near local minima, as observed in the previous analysis (Qian, 1999). In other words, the parameters in DNNs can be recast as the control variables in dynamical systems.
With this viewpoint in mind, we can draw an interesting connection between deep learning training processes and optimal control problems (OCP). In a vanilla discrete-time OCP, the minimization problem takes the form
where and represent state and control vectors. , and respectively denote the dynamics, intermediate cost and terminal cost functions. In this vein, the goal of (supervised) learning is to find a set of optimal parameters at each time step (i.e. layer), , such that when starting from the initial state , its terminal state is close to the target . Dynamical constraints in Eq. (OCP) are characterized by the DNNs, whereas terminal and control-dependent intermediate costs correspond to training loss and weight regularization. Though state-dependent intermediate costs are not commonly-seen in supervised problems until recently (Nøkland & Eidnes, 2019)
, it has been used extensively in the context of deep reinforcement learning to guide or stabilize training, e.g. the auxiliary tasks and losses(Jaderberg et al., 2016; Liu et al., 2017).
Extending Eq. (OCP) to accept batch data requires viewing the input-output pair
as a random variable drawn from a probability measure. This can be done by introducing the mean-field formalism where the analysis is lifted to distribution spaces. The problem becomes searching an optimal transform that propagates the input population to the desired target distribution. The population risk minimization problem can hence be regarded as a mean-field optimal control problem (MF-OCP) (Han et al., 2018),
where denotes the set of essentially-bounded measurable functions555 This implies can be quite discontinuous in time, since the function class we consider only requires to be measurable and essentially bounded. and
is the joint distribution of the initial statesand terminal target . Note that we change our formulation from the discrete-time realization to the continuous-time framework since it is mathematically easier to analyze and offers more flexibilities. The MF-OCP formulation allows us to analyze the optimization of DNN training through two perspectives, namely the maximum principle approach and dynamic programming approach, as we will now proceed.
3.3.2 Mean-Field Pontryagin’s Maximum Principle
The necessary conditions of the OCP problem are described in the celebrated Pontryagin’s Maximum Principle (PMP) (Boltyanskii et al., 1960). It characterizes the conditions that an optimal state-control trajectory must obey locally. Han et al. (2018) derived the mean-field extension of the theorem, which we will restate below. We will focus on its relation with standard DNN optimization, i.e. gradient descent with back-propagation, and refer to Han et al. (2018) for the concrete proof.
Theorem 1 (Mean-Field PMP (Han et al., 2018)).
Assume the following statements are true:
[itemsep= -2 pt,topsep = -2 pt, partopsep = 5 pt, leftmargin = 32 pt]
The function is bounded; , are continuous in ; and , , are continuously differentiable with respect to .
The distribution has bounded support, i.e. for some .
Let be a solution that achieves the infimum of MF-OCP. Then, there exists continuous stochastic processes and , such that
where the Hamiltonian function is given by
and denotes the co-state of the adjoint equation.
Theorem 1 resembles the classical PMP result except that the Hamiltonian maximization condition (13b) is now taken over an expectation w.r.t. . Also, notice the optimal control trajectory admits an open-loop process in the sense that it does not depend on the population distribution. This is in contrast to what we will see from the dynamic programming approach (i.e. Theorem 2).
The conditions characterized by Eq. (13) can be linked to the optimization dynamics in DNN training. First, Eq. (13a) is simply the feed-forward pass from the first layer to the last one. The co-state can be interpreted as the Lagrange multiplier of the objective function w.r.t. the constraint variables (Bertsekas et al., 1995), and its backward dynamics is described in Eq. (13b). Here, we shall regard Eq. (13b) as the back-propagation (Li et al., 2017a). To see that, consider the discrete-time Hamiltonian function,
Substituting it into Eq. (13b
) will lead to the chain rule used derive back-propagation,
where is the gradient of the total loss function w.r.t. the activation at layer .
Finally, the maximization in Eq. (13c) can be difficult to solve exactly since the dimension of the parameter is typically millions for DNNs. We can, however, apply approximated updates iteratively using first-order derivatives (Li et al., 2017a), i.e.
The subscript denotes the time step in the DNN dynamics, i.e. the index of the layer, whereas the superscript represents the iterative update of the parameters in the outer loop. The update rule in Eq. (17) is equivalent to performing gradient descent on the original objective function in OCP,
When the expectation in Eq. (17) is replaced with a sample mean, Han et al. (2018) showed that if a solution of MF-OCP is stable666 The mapping is said to be stable on if for some . , we can find with high probability a random variable in its neighborhood that is a stationary solution of the sampled PMP.
3.3.3 Mean-Field Hamilton-Jacobi-Bellman Equation
The Hamilton-Jacobi-Bellman (HJB) equation (Bellman & Kalaba, 1964) characterizes both necessary and sufficient conditions to the OCP problem. The equation is derived from the principle of Dynamic Programming (DP), which reduces the problem of minimizing over a sequence of control to a sequence of minimization over a single control at each time. This is done by recursively solving the value function (define precisely below), and the obtained optimal policy is a function applied globally to the state space, i.e. a feedback controller with states as input.
Han et al. (2018) adapted the analysis to mean-field extension by considering probability measures as states. Following their derivations, we will consider the class of probability measures that is square integrable on Euclidean space with -Wasserstein metrics, denoted , throughout our analysis. Importantly, this will lead to an infinite-dimensional HJB equation as we will show later. It is useful to begin with defining the cost-to-go and value function, denoted as and :
Note that the expectation is taken over the distribution evolution, starting from and propagating through the DNN architecture. The cost-to-go function is simply a generalization of the MF-OCP objective to varying start time, and its infimum over the control space is achieved by the value function, i.e. the objective in MF-OCP can be regarded as , with as its infimum. Now, we are ready to introduce the mean-field DP and HJB equation.
Theorem 2 (Mean-Field DP & HJB (Han et al., 2018)).
Let the following statements be true:
[itemsep= -2 pt,topsep = -2 pt, partopsep = 5 pt, leftmargin = 32 pt]
, , are bounded; , , are Lipschitz continuous with respect to , and the Lipschitz constants of and are independent of .
Notice that (A1’) and (A2’) are much stronger assumptions as opposed to those from Theorem 1
. This is reasonable since the analysis is now adapted to take probability distributions as inputs. Theorem2 differs from the classical HJB in that the equations become infinite-dimensional. The computation requires the derivative of the value function w.r.t. a probability measure, which can be done by recalling the definition of the first-order variation (Ambrosio et al., 2008) of a function at the probability measure , i.e. , satisfying the following relation:
where is taken to be infinitesimally small. Note that and are functions respectively defined on the probability space and its associated Euclidean space. In other words, the derivative w.r.t. the density is achieved by interplaying probability measures with laws of random variables, to which we can apply a suitable definition of the derivative.
Due to the curse of dimensionality, classical HJB equations can be computationally intractable to solve for high-dimensional problems, let alone its mean-field extension. However, we argue that in the literature of DNN optimization, algorithms with a DP flavor, or at least an approximation of it, have not been well-explored. Research in this direction may provide a principled way to design feedback policies rather than the current open-loop solutions in which weights are fixed once the training ends. This can be particularly beneficial for problems related to e.g. adversarial attack and generalization analyses.
4 Stochastic Optimization as a Dynamical System
We now turn attention to stochastic optimization. Recall that in Sec. 3.3, we bridge optimization algorithms with controllers. In classical control theory, controllers alone can be characterized as separated dynamical systems. Stability analysis is conducted on the compositional system along with the plant dynamics (Franklin et al., 1994). Similarly, the dynamical perspective can be useful to study the training evolution and convergence property of DNN optimization. Unlike the deterministic propagation in Sec. 3.1, stochasticity plays a central role in the resulting system due to the mini-batch sampling at each iteration. Preceding of time corresponds to the propagation of training cycles instead of forwarding through DNN layers. The stochastic dynamical viewpoint forms most of the recent study on SGD (Chaudhari & Soatto, 2018; Chaudhari et al., 2018; An et al., 2018a).
This section is organized as follows. In Sec. 4.1, we will review the statistical property of the mini-batch gradient, which is the foundation for deriving the SGD dynamics. Recast of SGD as a continuous-time stochastic differential equation (SDE), or more generally a discrete-time master equation, will be demonstrated in Sec. 4.2, and 4.3, respectively. The theoretical analysis from the dynamical framework consolidates several empirical observations, including implicit regularization on the loss landscape (Zhang et al., 2016) and phase transition from a fast memorization period to slow generalization (Shwartz-Ziv & Tishby, 2017). It can also be leveraged to design optimal adaptive strategies, as we will show in Sec. 4.4.
4.1 Preliminaries on Stochastic Mini-Batch Gradient
Slightly abuse the notation and denote the averaging training loss on the data set as a function of parameter :
where is the training objective for each sample (c.f. Eq. (OCP)) and includes all compositional functionals of a DNN. We can write the full gradient of the training loss as , where is the gradient on each data point . The covariance matrix of , denoted , is a positive-definite (P.D.) matrix which can be computed deterministically, given the DNN architecture, dataset, and current parameter, as
Note that in practice, the eigen-spectrum of often features an extremely low rank ( for both CIFAR-10 and CIFAR-100 as reported in Chaudhari & Soatto (2018)).
Access of at each iteration is computationally prohibit for large-scale problems. Instead, gradients can only be estimated through a mini batch of i.i.d. samples . When the batch size is large enough, , CLT implies the mini-batch gradient, denoted , has the sample mean and covariance
The last equality in Eq. (23) holds when is sampled i.i.d. with replacement from .777 . The second equality holds since for . †† Remarks for Vanilla and Momentum GD: Dynamics of these systems can be characterized by the spectrum of loss functions. For instance, the eigendecomposition on convex objectives gives full descriptions on convergence speed of GD. Signal residing in the eigenspace with the least eigenvalue resists longest throughout optimization. When the momentum term is introduced, the system can be thought of as a discretization of a damped harmonic oscillator (Goh, 2017), and the vanilla and momentum GD dynamics can be respectively viewed as the over-damped and under-damped systems. This implies an oscillation when the momentum is set far from the critical damping coefficient. Despite focusing on SGD afterwards, most of our analyses can be extended to include momentum by considering a higher-order dynamics or augmented state space. For later purposes, let us also define the two-point noise matrix as . Its entry , or more generally the entry
of a higher-order noise tensor, can be written as
where is the partial derivative w.r.t. on the mini batch. Consequently, we can rewrite Eq. (23) as .
Finally, we will denote the distribution of the parameter at training cycle as and the steady-state distribution as .
4.2 Continuous-time Dynamics of SGD
where is the learning rate and . Now, consider the following continuous-time SDE and its Euler discretization,
In the standard SDE analysis (Øksendal, 2003), and refer to the drift and diffusion function. is the Wiener process, or Brownian motion, in the same dimension of . It is easy to verify that Eq. (28) resembles Eq. (26) if we set , , and . We have therefore derived the continuous-time limit of SGD as the following SDE,
where is proportional to the inverse temperature in thermodynamics.
Simply viewing Eq. (29) already gives us several insights. First, controls the magnitude of the diffusion process. A similar relationship, named noise scale, between the batch size and learning rate has been proposed in Smith & Le (2017); Jastrzebski et al. (2017). These two hyper-parameters, however, are not completely interchangeable since they contribute to different properties of the loss landscape (Wu et al., 2018). Secondly, the stochastic dynamics is characterized by the drift term from the gradient descent flow and the diffusion process from . When the parameter is still far from equilibrium, we can expect the drifting to dominate the propagation. As we approach flat local minima, fluctuations from the diffusion become significant. This drift-to-fluctuation transition has been observed on the Information Plane (Shwartz-Ziv & Tishby, 2017) and can be derived exactly for convex cases (Moulines & Bach, 2011).
Since the stochastic processes in Eq. (29) and (26) are different, the approximation is valid only up to the distribution level. While a more accurate approximation is possible by introducing stochastic modified equations (Milstein, 1994), we will limit the analysis to Eq. (29) and study the resulting dynamics using stochastic calculus and statistical physics.
Dynamics of Training Loss: To describe the propagation of the training loss, , as a function of the stochastic process in Eq. (29), we need to utilize Itô lemma, an extension of the chain rule in the ordinary calculus to the stochastic setting:
Lemma 1 (Itô lemma (Itô, 1951)).
Consider the stochastic process . Suppose and follow appropriate smooth and growth conditions, then for a given function , is also a stochastic process:
where denotes the Hessian. i.e. .
Applying Itô lemma to readily yields the following SDE:
where we denote to simplify the notation. Taking the expectation of Eq. (30) over the parameter distribution and recalling , the dynamics of the expected training loss can be described as
which is also known as the backward Kolmogorov equation (Kolmogoroff, 1931)
, a partial differential equation (PDE) that describes the dynamics of a conditional expectation. Notice that does not appear in Eq. (31) since the expectation of Brownian motion is zero. The term draws a concrete connection between the noise covariance and the loss landscape. In Zhu et al. (2018), this trace quantity was highlighted as a measurement of the escaping efficiency from poor local minima. We can also derive the dynamics of other observables following similar steps.
When training converges, the left-hand side of Eq. (31) is expected to be near zero, i.e.
In other words, the expected magnitude of the gradient signal is balanced off by the expected Hessian-covariance product measured in trace norm. A similar relation can be founded in the discrete-time setting (c.f. Eq. (42)), as we will see later in Sec. 4.3.
Dynamics of Parameter Distribution: The dynamics in Eq. (29) is a variant of the damped Langevin diffusion, which is widely used in statistical physics to model systems interacting with drag and random forces, e.g. the dynamics of pollen grains suspended in liquid, subjected to the friction from Navier-Stokes equations and random collisions from molecules. We synthesize the classical results for Langevin systems in the theorem below and discuss its implication in our settings.
Theorem 3 (Fokker-Plank equation and variational principle (Jordan et al., 1998)).
Consider a damped Langevin dynamics with isotropic diffusion:
where is the damping coefficient. The temporal evolution of the probability density, , is the solution of the Fokker-Plank equation (FPE):
where , and respectively denote the divergence, gradient, and Laplacian operators. Suppose is a potential function satisfying appropriate growth conditions, FPE has a unique stationary solution given by the Gibbs distribution . Furthermore, the stationary Gibbs distribution satisfies the variational principle — it minimizes the following functional
where and . In fact, serves as a Lyapunov function for the FPE, as it decreases monotonically along the dynamics of FPE and converges to its minimum, which is zero, at . In other words, we can rewrite FPE as a form of Wasserstein gradient flow (WGF):
|(WGF of FPE)|
where follows the same definition in Eq. (19) and is equal to by simple algebra.
FPE characterizes the deterministic transition of the density of an infinite ensemble of particles and is also known as the forward Kolmogorov equation (Øksendal, 2003). The form of the Gibbs distribution at equilibrium reaffirms the importance of the temperature , as it determines the sharpness of . While high temperature can cause under-fitting, in the asymptotic limit as , the steady-state distribution will degenerate to point masses located at . From an information-theoretic viewpoint, can be interpreted as the free energy, where and are respectively known as the energy (or evidence) and entropy functionals. Therefore, minimizing balances between the likelihood of the observation and the diversity of the distribution.
Now, in order to apply the variational principle, needs to be treated as a potential function under the appropriate growth conditions, which is rarely the case under the setting of deep learning. Nevertheless, assuming these assumptions hold will lead to an important implication, suggesting that the implicit regularization stemmed from SGD can be mathematically quantified as entropy maximization. Another implication is that in the presence of non-isotropic diffusion during training888 The non-isotropy of is expected since the dimension of the parameter is much larger than the number of data points used for training. Empirical supports can be founded in Chaudhari & Soatto (2018). , the trajectory governed by Eq. (33) has been shown to converge to a different location from the minima of the training loss (Chaudhari & Soatto, 2018). In short, the variational inference implied from Eq. (29) takes the form
which is minimized at . The relationship between and at equilibrium is given by999 The derivation of Eq. (35) is quite involved as it relies on the equivalence between Itô and A-type stochastic integration for the same FPE. We refer readers to Chaudhari & Soatto (2018) for a complete treatment.
where the divergence is applied to the column space of the diffusion matrix. In other words, the critical points of differ from those of the original training loss by the quantity . It can be readily verified that if and only if