In this paper, we study the expressive power of deep artificial neural networks (DNNs), and demonstrate that one can construct DNNs with polynomial complexity to approximate nonsmooth value functions associated with stiff stochastic differential equations (SDEs).
More precisely, for each , we consider the value function of the following -dimensional zero-sum stochastic differential game on a finite time horizon :
where , , are sets of admissible open-loop control strategies (see Section 2.2 for a precise definition), is a (possibly nonsmooth) terminal cost function with at most quadratic growth at infinity, and for each , , , is the solution to the following -dimensional controlled SDE:
where is a matrix, and are respectively and -valued functions, and is a -dimensional Brownian motion defined on a probability space. In the case with , (1.1) degenerates to a controlled ordinary differential equation. Moreover, if one of the sets and is a singleton, the zero-sum game reduces to an optimal control problem.
In this work, we shall allow the coefficients , and to be stiff in the sense that they are Lipschitz continuous (with respect to the Euclidean norm on ) but the Lipschitz constants grow polynomially in the dimension . Such stiff SDEs arise naturally from spatial discretizations of stochastic partial differential equations (SPDEs) by using spectral methods (see e.g. [21, 16, 22, 26, 25]), or finite difference/element methods (see e.g. [16, 2, 12]).
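To make the notion of stiffness concrete, the following minimal sketch computes the Euclidean operator norm of a standard second-order finite-difference Laplacian on (0, 1) with d interior nodes and homogeneous Dirichlet conditions. This operator is a hypothetical stand-in chosen for illustration, not a coefficient taken from the paper; the names `dirichlet_laplacian` and `spectral_norm` are likewise assumptions. For this matrix the norm equals 4(d+1)^2 cos^2(pi/(2(d+1))), so the Lipschitz constant of the linear map grows quadratically in d.

```python
import numpy as np


def dirichlet_laplacian(d: int) -> np.ndarray:
    """Finite-difference Laplacian on (0, 1) with d interior nodes (illustrative choice)."""
    h = 1.0 / (d + 1)  # mesh width
    main = -2.0 * np.ones(d)
    off = np.ones(d - 1)
    return (np.diag(main) + np.diag(off, 1) + np.diag(off, -1)) / h**2


def spectral_norm(a: np.ndarray) -> float:
    """Matrix norm induced by the Euclidean vector norm (largest singular value)."""
    return float(np.linalg.norm(a, 2))


# Lipschitz constant of x -> A x for several dimensions d.
norms = {d: spectral_norm(dirichlet_laplacian(d)) for d in (8, 16, 32, 64)}

for d, value in norms.items():
    print(f"d = {d:3d}   ||A||_2 = {value:10.2f}   ||A||_2 / d^2 = {value / d**2:.3f}")
```

The same matrix is symmetric and negative definite, so its quadratic form is nonpositive even though its norm is large. This is the type of one-sided bound that a monotonicity condition can exploit where a plain Lipschitz estimate cannot.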
For simplicity, let us consider the following uncontrolled SPDE as a motivating example, but similar arguments also apply to controlled SPDEs. Let be an -dimensional Brownian motion on a probability space , , , and denotes the strong dual space of . Here is identified with so that . Then, for given mappings , , and , it has been shown in [21, 16, 22] that the following semilinear SPDE:
admits a solution under the following strong monotonicity condition: there exist some , such that for all , ,
where denotes the duality product of . Now, let be an orthonormal basis of , made of elements in , and for all . Then for each , we can project the SPDE (1.2) onto the subspace and consider a -dimensional Itô-Galerkin approximation of (1.2) in of the form:
where the discrete operators satisfy a monotonicity condition similar to (1.3). Then, under suitable regularity assumptions, one can show the well-posedness of a solution to the finite-dimensional SDE (1.4), and estimate the rate of convergence in terms of the dimension. Note that for SPDEs driven by -valued random fields, one can consider similar Itô-Galerkin SDEs with finite-dimensional noises by truncating the series representation of the (space-time) random process (see  for sufficient conditions under which this extra approximation of the noise preserves the overall convergence order in ).
Now suppose that we are interested in the value functional , where is taken within a neighbourhood of the initial condition in (1.2), and is a given locally Lipschitz cost functional. This is practically important if the exact dynamics of (1.2) are known only up to uncertain initial conditions, or if we would like to compare the optimal cost of a control problem among all initial states (see e.g. [11, 20]). An accurate representation of the value functional is also crucial for the control design in reinforcement learning (see). The convergence of to as suggests approximating the functional by the -dimensional value function with a sufficiently large .
However, we face several difficulties in approximating the -dimensional value function . Recall that the errors of the Galerkin approximations are in general of the magnitude for some (see e.g. [16, 2, 12]). Thus to achieve the accuracy , we need to approximate the value function in the -dimensional Euclidean space with , and the complexity of many classical function approximation methods, e.g. piecewise constant and piecewise linear approximations, will grow exponentially in , i.e., they suffer from the so-called Bellman’s curse of dimensionality. Moreover, the control processes and the nonsmoothness of the terminal costs imply that the value function typically has weak regularity, e.g. is merely locally Lipschitz continuous and could grow quadratically at infinity. This prevents us from approximating the value function by using sparse grid approximations [7, 34], or high-order polynomial expansions . Finally, since the mappings , and in (1.2) could involve differential operators, the Lipschitz constants (with respect to the Euclidean norm) of in (1.4) will in general grow polynomially in dimension . This stiffness of coefficients creates a difficulty in constructing efficient discrete-time dynamics to approximate the time evolution of the Itô-Galerkin SDE (1.4).
In recent years, DNNs have achieved remarkable performance in representing high-dimensional mappings in a wide range of applications (see e.g. [26, 24, 25, 30, 3] and the references therein for applications in optimal control and numerical simulation of PDEs), and it seems that DNNs admit the flexibility to overcome the curse of dimensionality. However, even though there is a vast literature on the approximation theory of artificial neural networks (see e.g. [19, 27, 31, 37, 38, 1, 9, 10, 17, 23, 35, 5, 14, 15, 32]), to the best of our knowledge, only [10, 13, 23] established DNNs’ expression rates for approximating nonsmooth value functions (associated with uncontrolled SDEs with nonstiff affine diffusion coefficients). In this work, we shall extend their results by giving a rigorous proof of the fact that DNNs do overcome the curse of dimensionality for approximating (nonsmooth) value functions of zero-sum games of controlled SDEs with stiff, time-inhomogeneous, nonlinear coefficients.
More precisely, we shall establish that for a wide class of controlled stiff SDEs, to represent the corresponding value functions with accuracy , the number of parameters in the employed DNNs grows at most polynomially in both the dimension of the state equation and the reciprocal of the accuracy (see Theorems 2.1 and 2.3). As a direct consequence of these expression rates, we show that one can approximate the viscosity solution to a Kolmogorov backward PDE with stiff coefficients by DNNs with polynomial complexity (see Corollary 2.2). Moreover, if we further assume that the Galerkin approximation of the controlled SPDE has a convergence rate for some , our result indicates that we can represent the nonlinear functional without the curse of dimensionality.
The approach we take here is first to describe the evolution of a -dimensional controlled SDE (1.1) by a suitable discrete-time dynamical system, and then to construct the desired DNN as a specific realization of the discrete-time dynamics. This is in the same spirit as , where the authors represent an uncontrolled SDE with constant diffusion and nonlinear drift coefficients by its explicit Euler discretization. However, due to the stiffness of the Itô-Galerkin SDEs considered in this paper, such an explicit time discretization will in fact lead to an approximation error depending exponentially on the dimension (cf. [23, Proposition 4.4]), and hence it cannot be used in our construction. We shall overcome this difficulty by approximating the underlying dynamics with its partial-implicit Euler discretization, whose error depends polynomially on the dimension and the (time) stepsize. We also adopt a two-step approximation of the terminal cost function involving truncation and extrapolation, which allows us to construct rectified neural networks for quadratically growing terminal costs; see the discussion below (H.1) for details.
The rest of this paper is structured as follows. Section 2 states the assumptions and presents the main theoretical results of the expression rates. We discuss several fundamental operations of DNNs in Section 3, and analyze a perturbed linear-implicit Euler discretization of SDEs in Section 4. Based on these estimates, we establish the expression rates of rectified neural networks for uncontrolled systems in Section 5, and controlled systems in Section 6. Section 7 offers possible extensions and directions for further research.
2 Main results
In this section, we shall recall the notion of DNN, and state our main results on the expression rates of DNNs for approximating value functions associated with controlled SDEs with stiff coefficients.
We start with some notation which is needed frequently throughout this work. For any given , we denote by
the Euclidean norm of a vector in, by the canonical Euclidean inner product, and by the identity matrix. For a given matrix , we denote by the Frobenius norm of , and by the matrix norm induced by Euclidean vector norms. We shall also denote by a generic constant, which may take a different value at each occurrence. Dependence of on parameters will be indicated explicitly by , e.g. .
Now we introduce the basic concepts of DNNs. By following the notation in [37, 10, 13] (up to some minor changes), we shall distinguish between a deep artificial neural network, represented as a structured set of weights, and its realization, a multi-valued function on . This enables us to construct complex neural networks from simple ones in an explicit and unambiguous way, and further analyze the complexity of DNNs.
Definition 2.1 (Deep artificial neural networks).
Let be the set of DNNs given by
Let be functions defined on , such that for any given , we have , , , and . We shall refer to the quantities , , and as the size, depth, input dimension and output dimension of the DNN , respectively.
For any given activation function, let be the function which satisfies for all that , and let be the realization operator such that for any given and
we have defined recursively as follows: let for all , and let
Roughly speaking, one can describe a DNN by its architecture, that is, the number of layers and the dimensions of all layers , together with the coefficients of the affine functions used to compute each layer from the previous one. Note that Definition 2.1 does not specify a fixed nonlinear activation function in the architecture of a DNN, but instead considers the realization of a DNN with respect to a given activation function. Owing to this flexibility of representation, we can study the approximation capacity of DNNs with arbitrary activation functions (see e.g. Lemma A.1).
For any given DNN , the quantity represents the number of all real parameters, including zeros, used to describe the DNN. We remark that one can also consider the number of non-zero entries of the DNN as in . However, since it is in general difficult to build a sparse architecture with pre-allocated zero entries to approximate a desired value function, we choose to adopt the notion of ‘size’ that counts all parameters, and thus quantify the complexity of the DNN in a conservative manner.
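The distinction between a DNN (a structured set of weights) and its realization (a function) can be sketched as follows. This is a minimal illustration under stated assumptions: a network is stored as a list of (weight, bias) pairs, the activation is applied after every affine map except the last, and the helper names `realize`, `size`, `depth`, `dim_in` and `dim_out` are hypothetical. In particular, `depth` below counts affine maps, which may differ by one from the convention of Definition 2.1.

```python
import numpy as np


def relu(x):
    """Rectified linear unit, applied componentwise."""
    return np.maximum(x, 0.0)


def realize(network, x, activation=relu):
    """Evaluate the realization of `network` at x for a given activation.

    `network` is a list of (W_k, b_k) pairs; the activation is applied after
    every affine map except the last one (assumed convention).
    """
    for w, b in network[:-1]:
        x = activation(w @ x + b)
    w, b = network[-1]
    return w @ x + b


def size(network):
    """Number of all real parameters, zeros included."""
    return sum(w.size + b.size for w, b in network)


def depth(network):
    """Number of affine maps (assumed convention)."""
    return len(network)


def dim_in(network):
    return network[0][0].shape[1]


def dim_out(network):
    return network[-1][0].shape[0]


# Example: a network whose ReLU realization is x -> |x| = relu(x) + relu(-x).
abs_net = [
    (np.array([[1.0], [-1.0]]), np.zeros(2)),
    (np.array([[1.0, 1.0]]), np.zeros(1)),
]
print(realize(abs_net, np.array([-3.0])), size(abs_net), depth(abs_net))
```

Keeping the weights separate from the function makes operations such as composition or parallelization explicit at the level of weight matrices, and the size of the result can be read off directly.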
Motivated by the application to optimal control problems of SPDEs, in the remaining part of this section, we shall construct a sequence of DNNs , such that for each , , represents the value function induced by a -dimensional stiff SDE with the accuracy on . We shall demonstrate that under a monotonicity condition similar to (1.3), the complexity of the constructed DNN depends polynomially on both and , i.e., the DNNs overcome the curse of dimensionality. We first give the results for uncontrolled SDEs with stiff coefficients in Section 2.1, and then extend the results to controlled SDEs with piecewise-constant strategies in Section 2.2.
2.1 Expression rate for SDEs and Kolmogorov PDEs with stiff coefficients
In this section, we present the expression rate of DNNs for approximating value functions induced by nonlinear SDEs with stiff coefficients.
We start by introducing the value functions of interest. For each , we consider the following value function:
where is the strong solution to the following -dimensional SDE:
with a -dimensional Brownian motion defined on a probability space .
We now list the main assumptions on the coefficients.
Let and be fixed constants. For all and , let , and , , be measurable functions satisfying the following conditions, for all and :
The matrix and the functions satisfy the following monotonicity condition:
, and for all .
and admit the following regularity:
and for and .
The functions and enjoy the following properties:
where , and .
Let us briefly discuss the importance of the above assumptions. The monotonicity condition (2.3) in (H.1(a)) is weaker than the finite-dimensional analogue of the strong monotonicity condition (1.3), in the sense that (2.3) involves only the standard Euclidean norm instead of discrete Sobolev norms. The monotonicity, along with the Lipschitz continuity in (H.1(c)), ensures the well-posedness of (2.2) (see e.g. ), and allows us to derive precise regularity estimates (in -norms for ) of the solution to the SDE (2.2) with respect to the coefficients and the initial condition. Note that it is easy to check that if satisfy (H.1(c)) with a Lipschitz constant independent of the dimension , then the coefficients satisfy (H.1(a)). Thus our setting includes the representation result in  as a special case.
We remark that both the monotonicity condition (2.3) and the Lipschitz continuity of are crucial for constructing networks with polynomial complexity to approximate the desired value functions. With the help of the monotonicity condition (H.1(a)), we can demonstrate that both the regularity of the solution to (2.2) and the error estimates of a corresponding partial-implicit Euler scheme depend polynomially on , and , i.e., the Lipschitz constants of the coefficients (see Section 4 for details; see also  for SDEs with merely Lipschitz continuous coefficients, for which the corresponding estimates depend exponentially on the Lipschitz constants of the coefficients). These polynomial dependence results subsequently enable us to construct DNNs with polynomial complexities to approximate the value functions induced by stiff SDEs, including those arising from Galerkin approximations of SPDEs.
On the other hand, the Lipschitz continuity of allows us to construct the desired DNNs through a linear-implicit Euler scheme of (2.2), which is implicit in the linear part of the drift and remains explicit for the nonlinear part of the drift. In this way, we avoid constructing DNNs to approximate the inverse of the nonlinear mapping at each time step.
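A linear-implicit Euler step of the kind described above can be sketched as follows. The sketch assumes an SDE of the form dX = (A X + f(t, X)) dt + g(t, X) dW; since the displayed equation is not reproduced here, this form, the names `a`, `f`, `g`, and the function `linear_implicit_euler` are illustrative assumptions rather than the paper's exact scheme (which, according to Section 4, is a perturbed variant). The linear part is treated implicitly, so each step only requires applying the fixed matrix (I - hA)^{-1}, while the nonlinear drift and the diffusion are evaluated explicitly.

```python
import numpy as np


def linear_implicit_euler(a, f, g, x0, t_end, n_steps, rng):
    """Simulate one path of dX = (A X + f(t, X)) dt + g(t, X) dW (assumed form).

    a : (d, d) matrix, treated implicitly
    f : callable (t, x) -> (d,) nonlinear drift, treated explicitly
    g : callable (t, x) -> (d, m) diffusion matrix, treated explicitly
    """
    d = x0.shape[0]
    h = t_end / n_steps
    # (I - hA) does not depend on the state, so it is inverted once.
    resolvent = np.linalg.inv(np.eye(d) - h * a)
    x = x0.astype(float).copy()
    for n in range(n_steps):
        t = n * h
        diffusion = g(t, x)
        dw = rng.normal(scale=np.sqrt(h), size=diffusion.shape[1])
        # X_{n+1} = (I - hA)^{-1} (X_n + h f(t_n, X_n) + g(t_n, X_n) dW_n)
        x = resolvent @ (x + h * f(t, x) + diffusion @ dw)
    return x


# Deterministic sanity check on a stiff linear problem dX = -100 X dt.
a = -100.0 * np.eye(2)
x0 = np.array([1.0, -2.0])
zero_drift = lambda t, x: np.zeros_like(x)
zero_noise = lambda t, x: np.zeros((x.shape[0], 1))
x_end = linear_implicit_euler(a, zero_drift, zero_noise, x0, 1.0, 10,
                              np.random.default_rng(0))
print(x_end)  # decays, whereas explicit Euler with h = 0.1 multiplies by -9 per step
```

Because the update is an affine map applied to the outputs of `f` and `g`, a step of this form composes naturally with networks approximating the nonlinear coefficients, and no network for the inverse of a nonlinear map is needed.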
Finally, instead of approximating directly on , (H.1(d)) allows us to focus on approximating on a hypercube, and then extend the approximation linearly outside the domain. This is motivated by the fact that approximating a function by neural networks on a prescribed compact set has been better understood than approximating the function globally on (see e.g. [31, 37, 38, 9, 5, 14, 15, 32]). In particular, since can admit quadratic growth at infinity and ReLU networks can only generate piecewise linear functions, for a given small enough , there exists no ReLU network such that the inequality holds for all . Therefore, we adopt a two-step approximation by first approximating with a suitable Lipschitz continuous function , and then representing by a ReLU network on with a desired accuracy; see Proposition 3.1 for the representation results for weighted square functions, which are the commonly used cost functions for PDE-constrained optimal control problems.
To construct neural networks with the desired complexities, we shall assume that the family of functions and can be approximated by ReLU networks without the curse of dimensionality.
Assume the notation of (H.1). Let and be the function satisfying for all . Let , , be a family of DNNs with the following properties, for any given , and :
The DNNs have the same architecture, i.e.,
for some integers , depending on and .
The DNNs admit the following complexity estimates:
The realizations , , and admit the following approximation properties: for all and ,
Since a ReLU network can be extended to an arbitrary depth and width without changing its realization (Lemma A.3), we assume without loss of generality in (H.2(a)) that have the same architecture to simplify our analysis.
The conditions (H.2(b),(c)) imply that the functions and can be approximated by ReLU networks with polynomial complexity in , and . These conditions clearly hold for most sensible discretizations of linear SPDEs, such as the Zakai equation (see e.g. [21, 12]):
where and are second-order and first-order linear differential operators, respectively. Moreover, by virtue of the fact that ReLU networks can efficiently represent the pointwise maximum/minimum operations (see Proposition 3.3), one can see (H.2(b),(c)) also hold for the discretizations of the following Hamilton–Jacobi–Bellman–Isaacs equation, since the (discretized) Hamiltonian can be exactly expressed by ReLU networks:
where , is a bounded open set in , and the Hamiltonian is given by:
and are two given finite sets. Finally, for general semilinear PDEs with bounded solutions, one may consider an equivalent semilinear PDE by truncating the nonlinearity outside a compact set, and approximate the truncated coefficients by DNNs.
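The exact representation of pointwise maximum and minimum by ReLU networks rests on two elementary identities, max(a, b) = a + relu(b - a) and min(a, b) = a - relu(a - b). The sketch below checks them numerically and folds them over a finite family of values, as one would for a Hamiltonian defined through a min or max over finite sets. The function names are hypothetical, and the construction in Proposition 3.3 may be organized differently.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def relu_max(a, b):
    """max(a, b) written with a single ReLU; a itself can be passed through
    a ReLU layer via the identity a = relu(a) - relu(-a)."""
    return a + relu(b - a)


def relu_min(a, b):
    """min(a, b) written with a single ReLU."""
    return a - relu(a - b)


def relu_max_many(values):
    """Maximum over a finite family, obtained by folding the pairwise maximum."""
    out = values[0]
    for v in values[1:]:
        out = relu_max(out, v)
    return out


def relu_min_many(values):
    """Minimum over a finite family, obtained by folding the pairwise minimum."""
    out = values[0]
    for v in values[1:]:
        out = relu_min(out, v)
    return out


rng = np.random.default_rng(1)
family = [rng.normal(size=5) for _ in range(4)]  # four candidate values per grid point
print(relu_max_many(family))
print(relu_min_many(family))
```

Since each identity is exact, no approximation error is introduced by the max/min operations themselves; only the number of parameters grows with the cardinality of the finite sets.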
Note that it has been shown in [33, Theorem 4] that for any given -Lipschitz continuous function of variables and any hypercube in , there exists a deep ReLU network with complexity , which approximates the function in sup-norm with accuracy on the hypercube. Therefore, (H.1(d)) and (H.2(c)) essentially assume that the difference between the terminal function and the deep ReLU network approximating can be controlled by the quadratic growth of outside the hypercube . We refer the reader to Proposition 3.1, where we verify (H.1(d)) and (H.2(c)) for a class of quadratic cost functions.
Now we are ready to state one of the main results of this paper, which shows that one can construct DNNs with polynomial complexity to approximate the value functions induced by nonlinear stiff SDEs. Similar representation results have been shown in  for SDEs with affine drift and diffusion coefficients, and in  for SDEs with nonlinear drift and constant diffusion coefficients. Our results extend these results to SDEs with time-inhomogeneous nonlinear drift and diffusion coefficients. Moreover, we allow the Lipschitz constants of the coefficients to grow with the dimension , which is crucial for the application to SPDE-constrained optimal control problems. The proof of this theorem is given in Section 5.
Suppose (H.1) and (H.2) hold. For each , let be the value function defined in (2.1), and let be a probability measure on satisfying , with the same constant as in (H.1), and some constant independent of .
Then there exists a family of DNNs and a constant , depending only on and , such that for all , , we have , and
The following result is a direct consequence of Theorem 2.1, and shows that one can approximate the viscosity solution to a Kolmogorov backward PDE with stiff coefficients on a bounded domain without the curse of dimensionality.
2.2 Expression rate for controlled SDEs with stiff coefficients
In this section, we extend the expression rates in Section 2.1, and construct DNNs with polynomial complexity to approximate value functions associated with a sequence of controlled SDEs with stiff coefficients.
We start by introducing the set of admissible strategies. Let be the set of intervention times defined as:
For each , we consider the following piecewise-constant deterministic strategies: for ,
where , , are given finite subsets of for some . Note that can be the coefficients of a parameterized control policy in the sense that if on , the state equation is controlled by a policy on , where are some prescribed basis functions.
Now for each , we consider a two-player zero-sum stochastic differential game, where the “inf-player” aims to minimize a particular function over all strategies , while the “sup-player” aims to maximize it over all strategies . The value function is given by:
where for each , , , is the strong solution to the following -dimensional controlled SDE:
with a -dimensional Brownian motion defined on a probability space . For simplicity, here we do not take into account any running costs of the state process , but it is straightforward to extend our results to control problems with running costs. Moreover, the result would not change if the linear part of the drift is also controlled.
We then state the assumptions on the coefficients of (2.6) for deriving the expression rates of DNNs. Roughly speaking, we assume (H.1) and (H.2) hold uniformly in terms of the control parameters. However, we would like to point out that even though the functions are continuous in time, the controlled drift and diffusion of (2.6) are discontinuous in time due to the jumps in the control processes.
Let , and be fixed constants. Let the set be defined as in (2.4). For all and , let , be a subset of for some , and , , , be measurable functions with the following properties:
The cardinality of the set satisfies .
Assume the notation of (H.3). Let and be the function satisfying for all . Let , , be a family of DNNs with the following properties, for any given , and :
Suppose (H.3) and (H.4) hold. For each , let be the value function defined in (2.5), and let be a probability measure on satisfying , with the same constant as in (H.1), and some constant independent of .
Then there exists a family of DNNs and a constant , depending only on and , such that for all , , we have , and