Model reduction of controlled dynamical systems has been a long standing, and as yet, unsettled challenge in control theory. The benefits are clear: a low dimensional approximation of a high dimensional system can be manipulated with a simpler controller, and can be simulated at lower computational cost. A complex, high dimensional system may even be replaced by a simpler model all together leading to significant cost savings, as in circuit design, while the “important variables” of a system might shed light on underlying physical or biological processes. Reduction of linear dynamical systems has been treated with some success to date. As we describe in more detail below, model reduction in the linear case proceeds by reducing the dimension of the system with an eye towards preserving its essential input-output behavior, a notion directly related to “balancing” observability and controllability of the system. The nonlinear picture, however, is considerably more involved.
In this paper we propose a scheme for balanced model-order reduction of general, nonlinear control systems. A key, and to our knowledge, novel point of departure from the literature on nonlinear model reduction is that our approach marries approximation and dimensionality reduction methods known to the machine learning and statistics communities with existing ideas in linear and nonlinear control. In particular, we apply a method similar to kernel PCA as well as function learning in Reproducing Kernel Hilbert Spaces (RKHS) to the problem of balanced model reduction. Working in RKHS provides a convenient, general functional-analytical framework for theoretical understanding as well as a ready source of existing results and error estimates. The approach presented here is also strongly empirical, in that observability and controllability, and in some cases the dynamics of the nonlinear system are estimated from simulated or measured trajectories. This emphasis on the empirical makes our approach broadly applicable, as the method can be applied without having to tailor anything to the particular form of the dynamics.
The approach we propose begins by constructing empirical estimates of the observability and controllability Gramians in a high (or possibly infinite) dimensional feature space. The Gramians are simultaneously diagonalized in order to identify directions which, in the feature space, are both the most observable and the most controllable. The assumption that a nonlinear system behaves linearly when lifted to a feature space is far more reasonable than assuming linearity in the original space, and then carrying out the linear theory hoping for the best. Working in the high dimensional feature space allows one to perform linear operations on a representation of the system’s state and output which can capture strong nonlinearities. Therefore a system which is not model reducible using existing methods, may become reducible when mapped into such a nonlinear feature space. This situation closely parallels the problem of linear separability in data classification: A dataset which is not linearly separable might be easily separated when mapped into a nonlinear feature space. The decision boundary is linear in this feature space, but is nonlinear in the original data space.
Nonlinear reduction of the state space already opens the door to the design of simpler controllers, but is only half of the picture. One would also like to be able to write a closed, reduced dynamical system whose input-output behavior closely captures that of the original system. This problem is the focus of the second half of our paper, where we again exploit helpful properties of RKHS in order to provide such a closed system.
The paper is organized as follows. In the next section we provide the relevant background for model reduction and balancing. We then adapt and extend balancing techniques described in the background to the current RKHS setting in Section III. Section IV then proposes a method for determining a closed, reduced nonlinear control system in light of the reduction map described in Section III. Finally, Section V provides experiments illustrating an application of the proposed methods to a specific nonlinear system.
Several approaches have been proposed for the reduction of linear control systems in view of control, but few exist for finite or infinite-dimensional controlled nonlinear dynamical systems. For linear systems the pioneering “Input- Output balancing” approach proposed by B.C. Moore observes that the important states are the ones that are both easy to reach and that generate a lot of energy at the output. If a large amount of energy is required to reach a certain state but the same state yields a small output energy, the state is unimportant for the input-output behavior of the system. The goal is then to find the states that are both the most controllable and the most observable. One way to determine such states is to find a change of coordinates where the controllability and observability Gramians (which can be viewed as a measure of the controllability and the observability of the system) are equal and diagonal. States that are difficult to reach and that don’t significantly affect the output are then ignored or truncated. A system expressed in the coordinates where each state is equally controllable and observable is called its balanced realization.
A proposal for generalizing this approach to nonlinear control systems was advanced by J. Scherpen , where suitably defined controllability and observability energy functions reduce to Gramians in the linear case. In general, to find the balanced realization of a system one needs to solve a set of Hamilton-Jacobi and Lyapunov equations (as we will discuss below). Moore 
proposed an alternative, data-based approach for balancing in the linear case. This method uses samples of the impulse response of a linear system to construct empirical controllability and observability Gramians which are then balanced and truncated using Principal Components Analysis (PCA, or POD). This data-driven strategy was then extended to nonlinear control systems with a stable linear approximation by Lall et al., by effectively applying Moore’s method to a nonlinear system by way of the Galerkin projection. Despite the fact that the balancing theory underpinning their approach assumes a linear system, Lall and colleagues were able to effectively reduce some nonlinear systems.
Phillips et al.  has also studied reduction of nonlinear circuit models in the case of linear but unbalanced coordinate transformations and found that approximation using a polynomial RKHS could afford computational advantages. Gray and Verriest mention in  that studying algebraically defined Gramian operators in RKHS may provide advantageous approximation properties, though the idea is not further explored. Finally, Coifman et al. 
discuss reduction of an uncontrolled stochastic Langevin system. There, eigenfunctions of a combinatorial Laplacian, built from samples of trajectories, provide a set of reduction coordinates but does not provide a reduced system. This method is related to kernel principal components (KPCA) using a Gaussian kernel, however reduction in this study is carried out on a simplified linear system outside the context of control.
Ii-a Balancing of Linear Systems
Consider a linear control system
where is controllable, is observable and is Hurwitz. We define the controllability and the observability Gramians as, respectively,
These two matrices can be viewed as a measure of the controllability and the observability of the system . For instance, consider the past energy , , defined as the minimal energy required to reach from in infinite time
and the future energy , , defined as the output energy generated by releasing the system from its initial state , and zero input for , i.e.
for and . In the linear case, it can be shown that and The columns of span the controllable subspace while the nullspace of coincides with the unobservable subspace. As such, and (or their estimates) are the key ingredients in many model reduction techniques. It is also well known that and satisfy the Lyapunov equations 
The idea behind balancing is to find a representation where the system’s observable and controllable subspaces are aligned so that reduction, if possible, consists of eliminating uncontrollable states which are also the least observable. More formally, we would like to find a new coordinate system such that
where . If is controllable and is observable, then there exists a transformation such that the state space expressed in the transformed coordinates is balanced and
. Typically one looks for a gap in the singular valuesfor guidance as to where truncation should occur. If we see that there is a such that , then the states most responsible for governing the input-output relationship of the system are while are assumed to make negligible contributions.
If is unstable then the controllability and observability quantities defined in (II-A) are undefined since the integrals will be unbounded. There may, however, still exist solutions to the Lyapunov equations (II-A) when is unstable, and these solutions will be unique if and only if . In this case balancing may be carried out as usual by finding (if possible) a transformation such that where is again diagonal and positive semidefinite [26, 10]. Other approaches to balancing unstable linear systems exist (see [9, 28] for the method of LQG balancing for example).
Ii-B Balancing of Nonlinear Systems
In the nonlinear case, the energy functions and in (II-A) and (II-A) are obtained by solving both a Lyapunov and a Hamilton-Jacobi equation. Here we follow the development of Scherpen . Consider the nonlinear system
with , , , , for , and . Moreover, assume the following Hypothesis.
Assumption A: The linearization of (II-B) around the origin is controllable, observable and is asymptotically stable.
 If the origin is an asymptotically stable equilibrium of on a neighborhood of the origin, then for all , is the unique smooth solution of
under the assumption that (II.1) has a smooth solution on . Furthermore for all , is the unique smooth solution of
under the assumption that (II.1) has a smooth solution on and that the origin is an asymptotically stable equilibrium of on .
With the controllability and the observability functions on hand, the input-normal/output-diagonal realization of system (II-B) can be computed by way of a coordinate transformation. More precisely,
Analogous to the linear case, the system’s states can be sorted in order of importance by sorting the singular value functions, and reduction proceeds by removing the least important states.
In the above framework for balancing of nonlinear systems, one needs to solve (or numerically evaluate) the PDEs (II.1), (II.1) and compute the coordinate change , however there are no systematic methods or tools for solving these problems. Various approximate solutions based on Taylor series expansions have been proposed [13, 12, 7]. Newman and Krishnaprasad  introduce a statistical approximation based on exciting the system with white Gaussian noise and then computing the balancing transformation using an algorithm from differential topology. As mentioned earlier, an essentially linear empirical approach was proposed in . In this paper, we combine aspects of both data-driven approaches and analytic approaches by carrying out balancing in a suitable RKHS.
Iii Empirical Balancing of Nonlinear Systems in RKHS
We consider a general nonlinear system of the form
with , , , , and . Let be the reachable set from the initial condition .
Hypothesis H: The system (III) is zero-state observable, its linearization around the origin is controllable, and the origin of is asymptotically stable.
We treat the problem of estimating the observability and controllability Gramians as one of estimating an integral operator from data in a reproducing kernel Hilbert space (RKHS) . Our approach hinges on the key modeling assumption that the nonlinear dynamical system is linear in an appropriate high (or possibly infinite) dimensional lifted feature space. Covariance operators in this feature space and their empirical estimates are the objects of primary importance and contain the information needed to perform model reduction. In particular, the (linear) observability and controllability Gramians are estimated and diagonalized in the feature space, but capture nonlinearities in the original state space. The reduction approach we propose adapts ideas from kernel PCA (KPCA)  and is driven by a set of simulated or sampled system trajectories, extending and generalizing the work of Moore  and Lall et al. .
In the development below we lift state vectors of the system (III) into a reproducing kernel Hilbert space .
Iii-a Elements of Learning Theory
In this section, we give a brief overview of reproducing kernel Hilbert spaces as used in statistical learning theory. The discussion here borrows heavily from[5, 27]. Early work developing the theory of RKHS was undertaken by N. Aronszajn .
Let be a Hilbert space of functions on a set . Denote by the inner product on and let be the norm in , for and . We say that is a reproducing kernel Hilbert space (RKHS) if there exists such that
has the reproducing property, i.e. , .
spans , i.e. .
will be called a reproducing kernel of . will denote the RKHS with reproducing kernel .
(Mercer kernel map) A function is called a Mercer kernel if it is continuous, symmetric and positive definite.
The important properties of reproducing kernels are summarized in the following proposition
If is a reproducing kernel of a Hilbert space , then
for and (positive definitness).
Let . The following kernels, defined on a compact domain , are Mercer kernels: (Linear), (Polynomial), (Gaussian) .
Let be a symmetric and positive definite function. Then, there exists a Hilbert space of functions defined on admitting as a reproducing Kernel.
Conversely, let be a Hilbert space of functions satisfying
Then, has a reproducing kernel .
Every sequence of functions which converges strongly to a function in , converges also in the pointwise sense, that is, , for any point . Further, this convergence is uniform on every subset of on which is bounded.
Let be a positive definite kernel on a compact domain or a manifold . Then there exists a Hilbert space and a function such that
is called a feature map, and a feature space111 The dimension of the feature space can be infinite and corresponds to the dimension of the eigenspace of the integral operator
The dimension of the feature space can be infinite and corresponds to the dimension of the eigenspace of the integral operatordefined as if is a Mercer kernel, for and is a Borel measure on ..
The fact that Mercer kernels are positive definite and symmetric reminds us of similar properties of Gramians and covariance matrices. This is an essential fact that we are going to use in the following.
RKHS play an important in learning theory whose objective is to find an unknown function from random samples
. For instance, assume that the random probability measure that governs the random samples isand is defined on . Let be a compact subset of and . If we define the least square error of as
then the function that minimzes the error is the regression function
where is the conditional probability measure on . Since is unknown, neither nor is computable. We only have the samples . The error is approximated by the empirical error by
for , plays the role of a regularizing parameter. In learning theory, the minimization is taken over functions from a hypothesis space often taken to be a ball of a RKHS associated to Mercer kernel , and the function that minimizes the empirical error is
where the coefficients is solved by the linear system
and is taken as an approximation of the regression function . Hence, minimizing over the (possibly infinite dimensional) Hilbert space, reduces to minimizing over . The series (III-A) converges absolutely and uniformly to .
We call learning the process of approximating the unknown function from random samples on .
In the following, we assume that the kernels are continuous and bounded by
Iii-B Empirical Gramians in RKHS
Following , we estimate the controllability Gramian by exciting each coordinate of the input with impulses while setting . One can also further excite using rotations of impulses as suggested in , however for simplicity we consider only the original signals proposed in . Let be the -th excitation signal, and let be the corresponding response of the system. Form the matrix , so that is seen as a data matrix with column observations given by the respective responses . Then is given by
We can approximate this integral by sampling the matrix function within a finite time interval assuming the regular partition . This leads to the empirical controllability Gramian
As described in , the observability Gramian is estimated by fixing , setting for , and measuring the corresponding system output responses . As before, assemble the responses into a matrix . The observability Gramian and its empirical counterpart are given by
where . The matrix can be thought of as a data matrix with column observations
so that corresponds to the response at time of the single output coordinate to each of the (separate) initial conditions . This convention will lead to greater clarity in the steps that follow.
Iii-C Kernel PCA
Kernel PCA  will be a helpful starting point for understanding the approach to balanced reduction introduced in this paper. We briefly review the relevant background here. Kernel PCA (KPCA) generalizes linear PCA by carrying out PCA in a high dimensional feature space defined by a feature map . Taking the feature map and given the set of data , we can consider PCA in the feature space by simply working with the covariance of the mapped vectors,
denotes the tensor product between two vectors in. We will assume the data are centered in the feature space so that . If not, data may be centered according to the prescription in .
The principal subspaces are computed by diagonalizing , however as is shown in , one can equivalently form the matrix of kernel products for , and solve the eigenproblem
then we have that
, and the non-zero eigenvalues ofand coincide.
The eigenvectorsof are then normalized so that the eigenvectors of have unit norm in the feature space, leading to the condition . Assuming this normalization convention, sort the eigenvectors according to the magnitudes of the corresponding eigenvalues in descending order, and form the matrix
Similarly, form the matrix of sorted eigenvectors of . The first principal components of a vector in the feature space are then given by . It can be shown however (see ) that principal components in the feature space can be computed in the original space with kernels using the map
Iii-D Model Order Reduction Map
The method we propose consists, in essence, of collecting samples and then performing a process similar to “simultaneous principal components analysis” on the controllability and observability Gramian estimates in the (same) RKHS. As mentioned above, given a choice of the kernel defining a RKHS , principal components in the feature space can be computed implicitly in the original input space using . It is worth emphasizing however that we will be co-diagonalizing two Gramians in the feature space by way of a non-orthogonal transformation; the process bears a resemblance to (K)PCA, and yet is distinct. Indeed the favorable properties associated with an orthonormal basis are no longer available, the quantities we will in practice diagonalize are different, and the issue of data-centering must be considered with some additional care.
First note that the controllability Gramian can be viewed as the sample covariance of a collection of vectors, scaled by
and the observability Gramian can be similarly viewed as the sample covariance of a collection of vectors
where the are defined in Equation (1).
We can thus consider three quantities of interest:
The controllability kernel matrix of kernel products
for where we have re-indexed the set of vectors to use a single linear index.
The observability kernel matrix ,
for , where we have again re-indexed the set for simplicity.
The Hankel kernel matrix ,
for , .
We have chosen the suggestive terminology “Hankel kernel matrix” above because the square-roots of the nonzero eigenvalues of the matrix are the empirical Hankel singular values of the system mapped into feature space, where we assume the system behaves linearly. This assertion will be proved immediately below. Note that ordinarily, and will be rank deficient.
Before proceeding we consider the issue of data centering in feature space. PCA and kernel PCA assume that the data have been centered in order to make the problem translation invariant. In the setting considered here, we have two distinct sets of data: the observability samples and the controllability samples. A reasonable centering convention centers the data in each of these datasets separately. Let denote the matrix whose columns are the observability samples mapped into feature space by , and let be the matrix similarly built from the feature space representation of the controllability samples. Then and
. Assume for the moment that there areobservability data samples and controllability samples, and let denote the length , vectors of all ones, respectively. We can define centered versions of the feature space data matrices as
where and . We will need two centered quantities in the development below. The first centered quantity we consider is the centered version of , namely . Although one cannot compute explicitly from the data, we can compute by observing that
The second quantity we’ll need a centered version of the empirical observability feature map
where is the state variable and the observability samples are again indexed by a single variable as in Equation (III-D). Centering follows reasoning similar to that of the Hankel kernel matrix immediately above:
Note: Throughout the remainder of this paper we will drop the special notation and assume that are centered appropriately.
With the quantities defined above, we can co-diagonalize the empirical Gramians (balancing) and reduce the dimensionality of the state variable (truncation) in feature space by carrying out calculations in the original data space. As the system is assumed to behave linearly in the feature space, the order of the model can be reduced by discarding small Hankel values , and projecting onto the subspace associated with the first largest eigenvalues. The following key result describes this process:
Theorem III.4 (Balanced Reduction in Feature Space)
Balanced reduction in feature space can be accomplished by applying the state-space reduction map given by
where if , and is the empirical observability feature map.
We assume the data have been centered in feature space. Let be a matrix with columns , so that is the feature space controllability Gramian counterpart to Equation (3). Similarly, let be a matrix with columns , so that is the feature space observability Gramian counterpart to Equation (4). Since by definition , we also have that and . In general the Gramians are infinite dimensional whereas the kernel matrices are necessarily of finite dimension.
We now carry out linear balancing on in the feature space (RKHS). First, take the SVD of so that
The last equality in Equation (9) follows since is symmetric and therefore is too. The linear balancing transformation is then given by , and one can readily verify that . Here, inverses should be interpreted as peudo-inverses when appropriate222Such as in the case of when the number of data points is less than the dimension of the RKHS.. From Equations (8)-(10), we see that and thus . We can project an arbitrary mapped data point onto the (balanced) “principal” subspace of dimension spanned by the first rows of by computing
where is the empirical observability feature map, recalling that is the matrix formed by taking the top eigenvectors of by Equation (10).
We note that square roots of the non-zero eigenvalues of are exactly the Hankel singular values of the system mapped into the feature space, under the assumption of linearity in the feature space. This can be seen by noting that , where refers to the non-zero eigenvalues of its argument.
Iv Closed Dynamics of the Reduced System
Given the nonlinear state space reduction map , a remaining challenge is to construct a corresponding (reduced) dynamical system on the reduced state space which well approximates the input-output behavior of the original system on the original state space. Setting
and applying the chain rule,
where refers to an appropriate notion (to be defined) of the inverse of . However we are faced with the difficulty that the map is not in general injective (even if ), and moreover one cannot guarantee that an arbitrary point in the RKHS has a non-empty preimage under . We propose an approximation scheme to get around this difficulty: The dynamics will be approximated by an element of an RKHS defined on the reduced state space. When is assumed to be known explicitly it can be approximated to a high degree of accuracy. An approximate, least-squares notion of will be given to first or second order via a Taylor series expansion, but only where it is strictly needed – and at the last possible moment – so that a first or second order approximation will not be as crude as one might suppose. We will also consider, as an alternative, a direct approximation of which takes into account further properties of the reproducing kernel as well as the fact that the Jacobian is to be evaluated at in particular. In both cases, the important ability of the map to capture strong nonlinearities will not be significantly diminished.
Iv-a Representation of the dynamics in RKHS
The vector-valued map can be approximated by a composing a set of regression functions (one for each coordinate) in an RKHS, with the reduction map . It is reasonable to expect that this approximation will be better than directly computing using, for instance, a Taylor expansion approximation for which may ignore important nonlinearities at a stage where crude approximations must be avoided.
Let denote a reduced state variable, and concatenate the input examples so that , and is a set of input-output training pairs describing the -th coordinate of the map . The training examples should characterize “typical” behaviors of the system, and can even re-use those trajectories simulated in response to impulses for estimating the Gramians above. We will seek the function which minimizes
here is a regularization parameter. We have chosen the square loss, however other suitable loss functions may be used. It can be shown that in this case takes the form , where defines the RKHS (and is unrelated to used to estimate the Gramians). Note that although our notation takes the RKHS for each coordinate function to be the same, in general this need not be true: different kernels may be chosen for each function. Here the comprise a set of coefficients learned using the regularized least squares (RLS) algorithm. The kernel family and any hyper-parameters can be chosen by cross-validation. For notational convenience we will further define the vector-valued empirical feature map
for . In this notation where .
A broad class of systems seen in the literature  are also characterized by separable dynamics of the form . In this case one need only estimate the functions and from examples and .
Iv-B Approximation of the Jacobian Contribution
We turn to approximating the component appearing in Equation (11).
Iv-B1 Inverse-Taylor Expansion
A simple solution is to compute a low-order Taylor expansion of and then invert it using the Moore-Penrose pseudoinverse to obtain the approximation. For example, consider the first order expansion . Then we can approximate (in the first-order, least-norm sense) as
We may start with , but periodically update the expansion in different regions of the dynamics if desired. A good expansion point could be the estimated preimage of returned by the algorithm proposed in . If is the centered version of the length vector defined by (III-D), then
where is the length vector of all ones. An example calculation of in the case of a polynomial kernel is given in the section immediately below.
Iv-B2 Exploiting Kernel Properties
For certain choices of the kernel defining the Gramian feature space , one can exploit the fact that and its derivative bear a special relationship, and potentially improve the estimate for
. Perhaps the most commonly used off-the-shelf kernel families are the polynomial and Gaussian families. For any two kernels with hyperparametersand (respectively) in one of these classes, we have that . We’ll consider the polynomial kernel of degree , in particular; the Gaussian case can be derived using similar reasoning. For a polynomial kernel we have that
Recalling that and , if were invertible then we would have
The map is not injective however, and in addition the fibers of may be potentially empty, so we must settle for an approximation. It is reasonable then to define as the solution to the convex optimization problem
where is defined as in Equation (III-D). If a point has a pre-image in this definition is consistent with composing with the formal definition and noting that in this case . Furthermore, a trajectory of the closed dynamical system on the reduced statespace need not (and may not) have a counterpart in the original statespace by virtue of the way in which