1 Introduction
Classic stochastic control methods such as LQG hinge on the mathematical fact that the family of Gaussian distributions is closed under affine transformations. This allows uncertainty to be propagated in a tractable manner under the Gaussianity assumption. Robust control methods, such as classic tube model predictive control, also rely on the linearity of the dynamics to propagate polytopic uncertainty. However, when we move beyond those assumptions to the territory of nonlinear dynamics and non-Gaussian noise, uncertainty propagation becomes difficult or even intractable.
A central component in robust and stochastic control is how to represent the system uncertainty. To this end, point estimates, ellipsoidal uncertainty sets, Gaussian distributions, or randomized computer simulations have been used in various applications. Along with those, a few families of uncertainty propagation methods have been proposed, e.g., generalized polynomial chaos approximation and Gaussian processes. The representation and propagation affect control as well as system identification. For example, a parameter estimation problem typically uses the likelihood function as its criterion for goodness-of-fit, which requires assuming a distribution family.

Consider the illustrative sketch in Figure 1. How can we measure the differences between system models? If the dynamical system of interest is deterministic (left), then we can simply compare the solutions of the systems in the sense of Euclidean distance.
Now consider Figure 1 (right), which illustrates the evolution of two stochastic dynamical systems (with arbitrary distributions). Due to the stochasticity, the solutions are distributions, so comparison in Euclidean distance is no longer feasible. How do we incorporate statistical information without imposing strong distributional assumptions? The question we address is how to represent, compare, and propagate the randomness in dynamical systems in a quantitative way.
This work proposes to embed uncertain state distributions into a reproducing kernel Hilbert space (RKHS), enabling quantification and algebraic operations therein. The main contributions of this work are the following.

We introduce the RKHS embedding method of kernel probabilistic programming in the context of uncertain dynamical systems. The resulting formulations of representation, comparison, and propagation methods are our main contributions.

Through distinct numerical examples, we demonstrate the flexible and distribution-free nature of our method as a unifying tool for dynamical systems.

We propose a recursive reduced set method to propagate the system uncertainty by iteratively solving an optimization problem. This yields fixed-size kernel expansions rather than allowing the number of uncertainty realizations to explode over time, as is commonly encountered in robust control [e.g., Scokaert and Mayne (1998); Tempo et al. (2012) Chapter 12; Campi et al. (2019)].
1.0.1 Notation:
In this work, the state of an uncertain dynamical system is denoted by $x(t, \omega)$, where $t$ is time and $\omega$ is the uncertainty in the system, e.g., a disturbance process. The Stieltjes integral $\int f(x)\, dP(x)$ denotes integration w.r.t. either a continuous or a discrete probability measure $P$. The symbol $\mathcal{X}$ often denotes a nonempty set and $\mathcal{H}$ a reproducing kernel Hilbert space (RKHS). We write $X \sim P$ to denote that the random variable (RV) $X$ follows the distribution law $P$.

2 Background & related work
2.1 Uncertainty quantification in dynamical systems
A stochastic dynamical system is a system whose evolution is affected by nondeterministic disturbances. To illustrate our idea concretely, let us consider a simple example of a stochastic ODE that was studied in Xiu and Karniadakis (2002):
(1)  $\frac{dy}{dt}(t) = -k\, y(t), \qquad y(0) = \hat{y},$
where the parameter
is uncertain. This uncertainty may enter as (a) timeinvariant, or, more generally (b) timevarying stochastic process. For a moment, let us consider case (a) and the parameter follows a certain distribution law
. This is hence an initial value problem (IVP) with uncertain parameter. The problem of quantifying the distribution of the solution to the IVP (1) is typically referred to as the uncertainty quantification in dynamical systems.One somewhat trivial approach of quantifying the solution is the Monte Carlo sampling that samples from the underlying distribution. Then we deterministically integrate the ODE with those realizations of uncertain parameter to obtain solutions . We may later extract the moment statistics of by its Monte Carlo estimation.
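The Monte Carlo procedure above can be sketched in a few lines. This is only an illustration, not the paper's setup: the Gaussian law $k \sim \mathcal{N}(1, 0.25^2)$, the sample size, and the function name are our assumptions, and we exploit the closed-form solution $y(t) = \hat{y}\, e^{-kt}$ of the decay ODE.

```python
import numpy as np

def propagate_monte_carlo(n_samples=1000, t=1.0, y0=1.0, seed=0):
    """Monte Carlo uncertainty propagation for dy/dt = -k*y, y(0) = y0.

    The decay rate k is uncertain; here we assume k ~ N(1, 0.25^2) purely
    for illustration. Each sampled k yields a deterministic solution
    y(t) = y0 * exp(-k * t), so no numerical ODE solver is needed.
    """
    rng = np.random.default_rng(seed)
    k = rng.normal(1.0, 0.25, size=n_samples)   # realizations of the uncertain parameter
    y_t = y0 * np.exp(-k * t)                   # deterministic solve per realization
    return y_t

samples = propagate_monte_carlo()
# Moment statistics of the solution via Monte Carlo estimation
mean_est, var_est = samples.mean(), samples.var()
```

For this choice of distribution the exact mean of $y(1)$ is $e^{-1 + 0.25^2/2} \approx 0.38$, which the Monte Carlo estimate recovers up to sampling error.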
In numerical analysis, one representative thread of work in uncertainty quantification is the generalized polynomial chaos (gPC) method. It expands the solution of the IVP (1) as a series in orthogonal polynomial basis functions $\Phi_i$,
$y(t, \omega) = \sum_i y_i(t)\, \Phi_i(\omega).$
This expansion is mathematically elegant in that the bases $\Phi_i$ account for the stochasticity while the expansion coefficients $y_i$ are deterministic. From this point onward, we can either follow Galerkin's method to propagate the coefficients through the dynamical system as in Xiu and Karniadakis (2002), or use the so-called stochastic collocation to numerically integrate the IVP (1) at certain collocation nodes as in Xiu (2009).
Another well-developed methodology in Bayesian statistics is the Gaussian process (GP). Intuitively, a GP generalizes the Gaussian distribution to a distribution over functions. For example, dynamics described by a difference equation $x_{t+1} = f(x_t)$ can be modeled by a GP prior, i.e., $f \sim \mathcal{GP}(m, k)$, where $\mathcal{GP}(m, k)$ is shorthand for a GP with mean function $m$ and covariance function $k$. Given data samples, it can be shown that the posterior predictive distribution for an unseen state is also Gaussian. This allows us to quantify the distribution of the solution. We refer to Rasmussen and Williams (2006) for an accessible introduction.

Mathematically speaking, gPC and GP are both surrogate function classes enabled by function approximation theory. This paper does not focus on analyzing the function approximation aspect. Instead, the embedding method we propose in Section 3 calls for a shift in our way of thinking about statistical distributions.
2.2 Goodness-of-fit measure for system identification
System identification studies how to construct mathematical models of dynamical systems from observed data. While this paper does not directly propose a system identification algorithm, we show how the proposed concept can impact how we analyze and compare models in terms of goodness-of-fit. To give readers a concrete example of this new concept, we revisit the parameter and variability estimation (PVE) proposed by Mohammadi et al. (2015) in Section 4.2. PVE may be thought of as an alternative to least-squares (LSQ) estimation, whose underlying assumption is that the system noise follows an additive Gaussian distribution with fixed variance. In contrast, PVE is based on the robust optimization idea that the disturbance should lie within an ellipsoidal uncertainty set, without making any assumptions on the distribution family. The PVE method then formulates the identification of the uncertainty ellipsoid as a semidefinite programming (SDP) problem. More details are provided in that paper.

It can be relatively difficult to compare an LSQ point estimate (or MLE in general) with, e.g., PVE because they do not share the same likelihood function. This paper proposes a unifying framework for comparing different system models' goodness-of-fit.
2.3 Reproducing kernel Hilbert space (RKHS) embedding
A kernel is a real-valued bivariate function $k(x, x')$ on $\mathcal{X} \times \mathcal{X}$. It is said to be positive definite if $\sum_{i,j=1}^{n} a_i a_j\, k(x_i, x_j) \ge 0$ for any $n \in \mathbb{N}$, $a_i \in \mathbb{R}$, and $x_i \in \mathcal{X}$. In addition, it is a reproducing kernel of an RKHS $\mathcal{H}$ if (i) $k(x, \cdot) \in \mathcal{H}$ for all $x \in \mathcal{X}$, and (ii) $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$, $x \in \mathcal{X}$. The latter is known as the reproducing property of $\mathcal{H}$. Choosing $f = k(x', \cdot)$ for some $x' \in \mathcal{X}$ and applying the reproducing property yield the kernel trick
(2)  $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}, \qquad \phi(x) := k(x, \cdot).$
That is, we can view the kernel evaluation $k(x, x')$ as a generalized similarity measure between $x$ and $x'$ after mapping them into the feature space $\mathcal{H}$. We refer to $\phi$ defined above as the canonical feature map associated with the kernel $k$. One of the most common kernels on $\mathbb{R}^d$ is the Gaussian kernel
(3)  $k(x, x') = \exp\!\left( -\frac{\| x - x' \|^2}{2 \sigma^2} \right),$
where $\sigma > 0$ is a bandwidth parameter.
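For concreteness, a minimal implementation of the Gaussian kernel (3) might look as follows; the function name and default bandwidth are our own illustrative choices.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).

    Accepts scalars or same-shaped vectors; sigma is the bandwidth.
    """
    x, y = np.atleast_1d(x), np.atleast_1d(y)
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))
```

Note the kernel is symmetric, equals 1 when its arguments coincide, and decays as the inputs move apart, matching the similarity-measure reading of the kernel trick (2).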
An important application of kernel methods is in representing probability measures via kernel mean embedding (KME) [Smola et al. (2007)]. This line of work can be thought of as a systematic way of endowing unstructured data with representations in a Hilbert space to provide the ability to perform algebraic operations therein. Mathematically, we define the KME of a random variable as follows.
Definition 2.1
(Kernel mean embedding) Given a random variable $X \sim P$ and a kernel $k$ for which $\mathbb{E}\,[\sqrt{k(X, X)}] < \infty$, we define the kernel mean embedding of $X$ as the function
$\mu_X(\cdot) := \mathbb{E}_{X \sim P}\,[k(X, \cdot)] = \int k(x, \cdot)\, dP(x).$
This function is a member of the RKHS $\mathcal{H}$ associated with the kernel $k$.
In the rest of the paper, we follow the convention in kernel machine learning and simply write this function as $\mu_X \in \mathcal{H}$ to emphasize that it is an element of the RKHS $\mathcal{H}$. It has been shown that if $k$ belongs to a class of kernels known as characteristic kernels, then $\mu_X$ uniquely determines the distribution $P$; see, e.g., Fukumizu et al. (2004); Sriperumbudur et al. (2010).

To help readers gain a concrete understanding, we outline common kernels and the information their KMEs preserve in Table 1. We then give examples of the explicit forms of the KMEs.
Table 1: Common kernels and the information preserved by their KMEs.

  Linear                 Mean of the distribution
  Polynomial (order p)   Moments up to order p
  Gaussian               All information
  Exponential            All information
2.3.1 Example (Second-order polynomial kernel embedding)
Suppose the kernel in question is the polynomial kernel of order two, $k(x, x') = (x x' + 1)^2$; the KME is then given by

(4)  $\mu_X(z) = \mathbb{E}_X\,[(X z + 1)^2] = \mathbb{E}[X^2]\, z^2 + 2\, \mathbb{E}[X]\, z + 1.$
This shows that the RKHS associated with this KME consists of quadratic functions whose coefficients preserve statistical information up to the second order (mean and variance), but not higher. In general, $p$-th order polynomial kernel embeddings preserve moment information up to order $p$. A richer kernel embedding, e.g., the Gaussian kernel embedding, may preserve information up to infinite order.
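The moment-preservation claim in (4) can be checked numerically: under $k(x, x') = (x x' + 1)^2$, the empirical KME evaluated at any query point depends on the samples only through their first two raw moments. The sketch below (function names are ours) verifies this identity.

```python
import numpy as np

def poly2_kme(samples, query):
    """Empirical KME under k(x, x') = (x*x' + 1)^2, evaluated at query points."""
    s = np.asarray(samples, float)[:, None]
    q = np.asarray(query, float)[None, :]
    return np.mean((s * q + 1.0) ** 2, axis=0)

def poly2_kme_from_moments(m1, m2, query):
    """The same embedding written via the first two raw moments, as in (4):
    E[(X q + 1)^2] = m2*q^2 + 2*m1*q + 1."""
    q = np.asarray(query, float)
    return m2 * q ** 2 + 2.0 * m1 * q + 1.0
```

The two functions agree exactly (up to floating point), confirming that this embedding carries no information beyond the mean and the second moment.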
Remark. Readers familiar with polynomial approximation may recognize that the integrand in (4) can be expanded in certain Wiener–Askey polynomial bases. However, our method differs from the philosophy of gPC expansion in that it does not seek a finite-order truncation for approximating functions. Rather, we make use of the kernel trick (2) to represent similarity in data even in an infinite-dimensional feature space. This gives rise to the power of kernel machine learning.
2.3.2 Example (Exponential kernel embedding)
Given the exponential kernel $k(x, x') = \exp(x x')$, the KME is
$\mu_X(z) = \mathbb{E}\,[\exp(X z)].$
This is the moment-generating function. Notably, replacing $z$ by $-z$ yields the Laplace transform.

The following result in Song (2008) gives the consistency of a sample-based estimator for $\mu_X$.
Lemma 2.1
(Consistency of the empirical KME) Given i.i.d. samples $x_1, \dots, x_N \sim P$, the empirical embedding converges in RKHS norm,

(5)  $\Big\| \frac{1}{N} \sum_{i=1}^{N} k(x_i, \cdot) - \mu_X \Big\|_{\mathcal{H}} = O_p\big(N^{-1/2}\big).$
Furthermore, Schölkopf et al. (2015) and Simon-Gabriel et al. (2016) proved the estimation consistency for KMEs of transformations of RVs under more general conditions. They term their approach kernel probabilistic programming (KPP).
Lemma 2.2
(KME consistency) Suppose $f$ is a continuous function, and $k$, $k'$ are continuous kernels on the domain and codomain of $f$, respectively. Under mild conditions [cf. Theorem 1 of Simon-Gabriel et al. (2016)], the following is true:
$\Big\| \sum_{i=1}^{N} w_i\, k'(f(x_i), \cdot) - \mu_{f(X)} \Big\|_{\mathcal{H}'} \to 0 \quad \text{whenever} \quad \Big\| \sum_{i=1}^{N} w_i\, k(x_i, \cdot) - \mu_X \Big\|_{\mathcal{H}} \to 0.$
Notably, we do not require the samples to be i.i.d. This result equips us with algebraic tools to learn an embedding of $f(X)$ directly from that of $X$. In the rest of the paper, by KPP we mean the RKHS embedding method that performs algebraic operations on KMEs via transformations of random samples.
In the context of reinforcement learning, e.g., in Nishiyama et al. (2012); Grünewälder et al. (2012); Boots et al. (2013), RKHS embeddings of conditional distributions, which differ from ours, were used to learn dynamics. We share the common thread of using RKHS embeddings but differ in a few important aspects, e.g., our use of KPP and of numerical integration for continuous-time systems. See Song et al. (2013) for more details and references.

3 Approach
3.1 Representing uncertainty with RKHS embeddings
KPP, introduced in the last section, gives us a powerful tool to represent distributions without any parametric assumptions. In the context of dynamical systems, if we view the evolution of the system uncertainty as transformations of random variables (RVs), then KPP naturally becomes a tool to propagate system uncertainty. An important motivation of our methodology is that it shall be distribution-free, i.e., it shall not impose assumptions on the uncertainty distributions (e.g., Gaussianity).
In a nutshell, we represent the distribution of $x_t$, the state of the dynamical system (continuous- or discrete-time), by its KME $\mu_{x_t}$ and the corresponding sample-based estimator given by

(6)  $\hat{\mu}_{x_t} = \sum_{i=1}^{N} w_i\, k(x_t^i, \cdot),$

where a simple choice is $w_i = \frac{1}{N}$. As discussed in Section 2.3, the KME with a second-order polynomial kernel preserves the first-order (nominal state) and second-order (variance) information commonly used in stochastic control. In this light, we may view our method as a generalization of Monte Carlo moment estimation.
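A sample-based KME estimator as in (6), with the uniform weights $w_i = 1/N$ and a Gaussian kernel, can be sketched as a closure over the samples; the function name and kernel choice are illustrative.

```python
import numpy as np

def empirical_kme(samples, sigma=1.0):
    """Sample-based KME estimator (6) with uniform weights w_i = 1/N:
    returns the function mu_hat(z) = (1/N) sum_i k(x_i, z), here with
    a Gaussian kernel of bandwidth sigma."""
    x = np.asarray(samples, float)
    def mu_hat(z):
        return float(np.mean(np.exp(-((x - z) ** 2) / (2.0 * sigma ** 2))))
    return mu_hat
```

The returned object is a genuine function in the RKHS span of the samples: evaluating it at a point just applies the kernel trick (2) against each stored sample.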
3.2 Goodnessoffit measure for uncertain system models
As suggested in Figure 1 (right), it may be difficult to quantitatively compare stochastic system models directly. For example, say we have identified an LSQ point estimate and another estimate described by a distribution over the uncertain parameter, based on two different system identification methods. We cannot simply compare goodness-of-fit via the likelihood objectives, as they may differ across identification methods. Furthermore, the parameter descriptions may also differ (e.g., point estimate vs. set description). In addition, can we be certain, in a quantitative manner, that the systems behave differently after the propagation through dynamics?¹

¹ We note the Kullback–Leibler (KL) divergence is not distribution-free: we need distribution functions to calculate its estimate.
We summon the strength of the KME to endow almost arbitrary data types with a notion of distance through the Hilbert space embedding. This allows us to compare systems by performing statistical two-sample tests [cf. Gretton et al. (2012)] on the simulation samples.
Given the state distribution embeddings of two different systems computed as in (6), we may measure how different the two stochastic systems are by straightforwardly computing their distance in the embedding Hilbert space, $\| \mu_X - \mu_Y \|_{\mathcal{H}}$. This quantity is also known as the maximum mean discrepancy (MMD). Using the kernel trick (2) and the estimator in (6), we obtain the following.
Lemma 3.1
(Sample-based estimator for the RKHS distance; MMD) Given two sets of samples $\{x_i\}_{i=1}^{N}$ and $\{y_j\}_{j=1}^{M}$ from simulations of two dynamical systems, a sample-based estimator for $\| \mu_X - \mu_Y \|_{\mathcal{H}}$ is given by

(7)  $\| \hat{\mu}_X - \hat{\mu}_Y \|_{\mathcal{H}}^2 = \frac{1}{N^2} \sum_{i, i'} k(x_i, x_{i'}) - \frac{2}{N M} \sum_{i, j} k(x_i, y_j) + \frac{1}{M^2} \sum_{j, j'} k(y_j, y_{j'}),$
where we omit the time index for conciseness. More details can be found in Schölkopf et al. (2002).
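A direct transcription of the estimator (7) for scalar samples and a Gaussian kernel might read as follows (the biased, V-statistic form; clipping at zero guards against negative round-off before the square root).

```python
import numpy as np

def mmd(x, y, sigma=1.0):
    """Biased sample estimator of the RKHS distance ||mu_X - mu_Y||_H
    (maximum mean discrepancy), using a Gaussian kernel of bandwidth sigma."""
    def gram(a, b):
        d2 = (a[:, None] - b[None, :]) ** 2
        return np.exp(-d2 / (2.0 * sigma ** 2))
    x, y = np.asarray(x, float), np.asarray(y, float)
    # The three Gram-matrix means correspond to the three sums in (7).
    mmd2 = gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean()
    return float(np.sqrt(max(mmd2, 0.0)))
```

Identical sample sets give a distance of zero, while well-separated distributions give a clearly positive value, which is what the two-sample-test reading of the MMD requires.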
We propose that goodness-of-fit may again be straightforwardly measured by the RKHS distance $\| \mu_X - \mu_Y \|_{\mathcal{H}}$, where $\mu_X$ and $\mu_Y$ are the embeddings of the state distributions under two uncertain parameter descriptions.
This is powerful in that it can compare arbitrary (unknown) uncertain systems. We demonstrate this flexibility in Section 4.2.
3.3 Uncertainty propagation via KPP
We have thus far discussed the use of KME as representation and goodnessoffit measure for uncertainty in stochastic systems. In this section, we propose to use KPP for uncertainty propagation, which is at the core of many stochastic control algorithms.
We first present two different views of uncertainty propagation in systems. From a statistical standpoint, they correspond to the diagonal and U-statistics estimators. In particular, the U-statistics estimator is known to have lower variance than the diagonal estimator but to require more samples. We then present a novel recursive application of the so-called reduced set method for propagating stochastic dynamics forward.
Direct propagation via KPP:
The main idea of this algorithm is simple: sample a realization of the uncertainty and evolve the system as if it were deterministic. The steps are presented in Algorithm 1. By doing so, we view the deterministic evolution as algebraic operations performed on the uncertainty.
In Step 4, if the underlying system model is continuous-time, we rely on numerical integration to propagate the samples forward. Let us consider the integral of the dynamics function (deterministic or random) over the time period $[t, t+h]$,
$x(t+h) = x(t) + \int_t^{t+h} f(x(\tau), \omega)\, d\tau.$
In practice, this integral is often intractable, and numerical integration is performed to approximate its value, where $h$ may denote the step size of a one-step numerical integration rule.

One immediate question is: how will the integration error affect the embedding estimate? I.e., is the following true?

(8)  $\hat{\mu}^{h}_{x(t)} \longrightarrow \mu_{x(t)} \quad \text{as } h \to 0,\ N \to \infty\,?$
By virtue of Lemma 2.2, we obtain the following result.
Lemma 3.2
(Consistency² of KPP estimation with numerical integration) Suppose $k$ is a continuous positive-definite kernel and the samples are chosen either via i.i.d. sampling or collocation rules. The KPP estimator produced by a one-step numerical integration rule with step size $h$ in Algorithm 1 is consistent, i.e.,

(9)  $\lim_{h \to 0,\ N \to \infty} \big\| \hat{\mu}^{h}_{x(t)} - \mu_{x(t)} \big\|_{\mathcal{H}} = 0.$

² With a slight overload of terminology, we note the term "consistent" is used in both the statistics and the numerical analysis communities, referring to the asymptotic convergence of statistical estimators and of numerical integration rules, respectively.
The proof (given in the appendix) is a direct consequence of the consistency of numerical integration and that of the KPP estimator. Similar propagation methods are used in stochastic collocation in conjunction with gPC [Xiu (2009)]. In the above algorithm, the propagated samples are used to represent the distributions via their RKHS embeddings. In the next section, we shall see a nontrivial generalization of the above algorithm.
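A minimal sketch of this direct propagation for the scalar decay example, using forward Euler as the one-step integration rule; the dynamics, sample sizes, and the Euler choice are illustrative assumptions rather than the paper's exact algorithm.

```python
import numpy as np

def propagate_direct(f, x0_samples, omega_samples, n_steps, h):
    """Algorithm-1-style direct propagation (sketch): draw one uncertainty
    realization per trajectory up front, then evolve each trajectory
    deterministically with a one-step rule (forward Euler here).

    f            : dynamics f(x, omega) returning dx/dt, vectorized
    x0_samples   : initial-state samples, one per trajectory
    omega_samples: uncertainty realizations, one per trajectory
    """
    x = np.array(x0_samples, dtype=float)
    for _ in range(n_steps):
        x = x + h * f(x, omega_samples)   # deterministic Euler step per realization
    return x  # propagated samples; feed into the KME estimator (6)

# Hypothetical example: dy/dt = -k y with uncertain rate k, integrated to t = 1
rng = np.random.default_rng(0)
k = rng.normal(1.0, 0.1, 500)
y_T = propagate_direct(lambda y, k: -k * y, np.ones(500), k, n_steps=100, h=0.01)
```

The propagated samples `y_T` play the role of the $x_t^i$ in (6): embedding them gives the sample-based KME of the state distribution at the final time.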
3.3.1 Recursive reduced set KPP for uncertainty propagation:
One nuance is encountered when considering more general descriptions of uncertainty beyond simple parameter uncertainty. For simplicity, let us restrict the discussion to discrete-time dynamics and assume that the uncertainty enters as discrete realizations $\omega_1, \dots, \omega_t$ of a stochastic process at each time step. In this case, the system state at time $t$ is a function of all previous-step uncertainties, which we write as $x_t = F_t(\omega_1, \dots, \omega_t)$.

From a statistical standpoint, this is a transformation of multiple RVs. To estimate the KME, one can either use the diagonal estimator, which corresponds to the already-discussed Algorithm 1,
$\hat{\mu}_{x_t} = \frac{1}{N} \sum_{i=1}^{N} k\big(F_t(\omega_1^i, \dots, \omega_t^i), \cdot\big),$
or the U-statistics estimator, which delivers smaller variance,

(10)  $\hat{\mu}_{x_t} = \frac{1}{N^t} \sum_{i_1=1}^{N} \cdots \sum_{i_t=1}^{N} k\big(F_t(\omega_1^{i_1}, \dots, \omega_t^{i_t}), \cdot\big).$
The downside of the U-statistics estimator is that it may involve exponentially many samples of the uncertain random variables (disturbances). To relieve this sample complexity while still capturing the statistical distribution, Schölkopf et al. (2015) proposed to use the reduced set method to compute multi-step transformations of RVs with only a subset of samples. Intuitively, the reduced set method seeks a (small) set of expansion points $\{z_j\}$ and weights $\{\beta_j\}$ such that the expansion $\sum_j \beta_j\, k(z_j, \cdot)$ approximates a U-statistics estimator such as (10).
We propose the recursive reduced set kernel probabilistic programming for uncertainty propagation in Algorithm 2.
(11) 
(12) 
(13) 
Intuitively, at every time step, we look for a subset of all samples (of the U-statistics samples) to serve as new expansion points. One step of our recursion is similar in spirit to efficient quadrature rules [Gauss (1815)]. The optimization problem in Step 5 has two main tasks: finding the expansion points and the weights simultaneously. There is a wide range of techniques for treating this problem (see Chapter 18 of Schölkopf et al. (2002)). In Section 4, we give a numerical example as a proof of concept for this procedure.
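To illustrate the weight-finding half of the Step-5 subproblem: once candidate expansion points $z$ have been chosen, the weights minimizing the RKHS distance to the full-sample embedding solve a quadratic problem whose closed-form solution is a linear system $K_{zz}\, \beta = K_{zx}\, \mathbf{1}/N$. The small ridge term below is our addition for numerical stability, not part of the paper's formulation.

```python
import numpy as np

def reduced_set_weights(x_full, z, sigma=1.0, ridge=1e-10):
    """Naive reduced-set step (sketch): given full samples x_full and chosen
    expansion points z, find weights beta minimizing
        || sum_j beta_j k(z_j, .) - (1/N) sum_i k(x_i, .) ||_H^2,
    a quadratic problem with closed-form solution K_zz beta = K_zx 1/N.
    Gaussian kernel with bandwidth sigma; ridge regularizes the solve."""
    def gram(a, b):
        return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * sigma ** 2))
    z, x_full = np.asarray(z, float), np.asarray(x_full, float)
    K_zz = gram(z, z) + ridge * np.eye(len(z))
    target = gram(z, x_full).mean(axis=1)   # K_zx applied to the uniform weights 1/N
    return np.linalg.solve(K_zz, target)
```

As a sanity check, taking the expansion points equal to the full sample set recovers the uniform weights $1/N$, since the full-sample embedding then approximates itself.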
4 Numerical experiments
We present numerical examples that vary in their uncertain system descriptions and types of tasks, showcasing the flexibility of the proposed framework.
4.1 Uncertain ODE
Let us revisit the example of the stochastic ODE in (1). In this experiment, the uncertain parameter $k$ is drawn from a mixture-of-Gaussians distribution. We then draw another parameter $\bar{k}$ from a unimodal Gaussian whose mean and variance are chosen such that its first two moments match those of $k$, i.e.,

(14)  $\mathbb{E}[\bar{k}] = \mathbb{E}[k], \qquad \operatorname{Var}[\bar{k}] = \operatorname{Var}[k].$
Their distributions are illustrated in Figure 2 (top).
We wish to answer: can we quantify the different behaviors of these systems without imposing distributional assumptions? To this end, we apply Algorithm 1 to obtain two sets of propagated system states. The corresponding state KME estimators are then computed using (6).
As illustrated in Figure 2 (bottom), the two moment-matched parameter distributions result in distinct system behaviors over time. To quantitatively compare the behaviors, we compute the RKHS distance over time between those two embeddings using (7).
To understand how different kernels preserve statistical properties of the system, we also plot the embedding distances associated with polynomial kernels of increasing order in Figure 3 (bottom). We clearly observe that different polynomial kernels capture different orders of moment information. Notably, the two system states seem to have similar means, and therefore the first-order polynomial kernel does not differentiate the two systems. As we increase the order of the polynomial kernel, the difference becomes evident.
We then show the Gaussian kernel embeddings in Figure 3 (top), where we vary the bandwidth parameter $\sigma$. It can be shown that the Gaussian kernel keeps track of statistical moments up to infinite order. If the bandwidth is large, the kernel treats all inputs as similar, so the RKHS distance is small. On the other hand, if the bandwidth is small, the kernel treats all inputs as different. In the limiting case of the bandwidth, it can be shown that the KME estimate reduces to the usual Monte Carlo estimate [cf. Section 3.3 in Schölkopf et al. (2015)].
4.2 Distribution-free goodness-of-fit for identification
To demonstrate that the proposed technique is a good measure of goodness-of-fit for arbitrary uncertainty distributions, we apply it to the PVE example in Mohammadi et al. (2015). Different from the uncertainty description in Section 4.1, this is typically a "worst-case" scenario where the model needs to account for the worst-case realization of disturbances, described by ellipsoidal uncertainty sets.
In this example, the model to be identified is a time-varying autoregressive exogenous-input (ARX) model with uncertain time-varying parameters. To generate the data, we follow the setup of Mohammadi et al. (2015) and choose the true uncertain parameters uniformly at random³ within the true ellipsoidal uncertainty set. The identification was performed according to that paper. As a result, we obtain the estimated ellipsoid and the LSQ parameter estimate, as illustrated in Figure 4 (left).

³ We do not use this information when measuring goodness-of-fit. In addition, we note the original data generation procedure may be extended beyond uniform sampling, as robust optimization only concerns the support of the distribution.
We evolve the system for 600 steps using the true model and the two different uncertain parameter models (PVE vs. a Gaussian-distributed parameter resulting from the LSQ assumption, with the variance estimated by the sample variance as in that paper). The resulting state trajectories are shown in Figure 4 (right). We apply Algorithm 1 and compute the RKHS distance between the embeddings of each estimated model and the true model. Figure 5 demonstrates that the PVE model (red) matches the true model better than the LSQ model (black) in the sense of the RKHS metric. This comparison is performed under no distributional assumptions.
Interestingly, in the original paper, LSQ obviously did not deliver the parameter variability that fits the true generating distribution. However, one may ask: is the LSQ-estimated parameter nonetheless able to deliver similar system behaviors? Previously there was no unified way to compare them. This paper provides a unifying framework to quantitatively answer this question.
4.3 Recursive reduced set KPP for uncertainty propagation
Relying on the statistical consistency results in Section 2, uncertainty propagation can be performed straightforwardly using Algorithm 1. In this section, we demonstrate the use of Algorithm 2. As a proof of concept, let us consider a simple discretetime stochastic system
We emphasize that the distribution of the disturbance could be made fairly arbitrary (and non-stationary); the proposed propagation method places no restrictions on this distribution. The system evolution is illustrated in Figure 6 (left).
At every time step, Algorithm 2 is applied to find a reduced-set embedding of the current state; then the dynamics are propagated forward again. Note we use the naive reduced set method following Simon-Gabriel et al. (2016), which simply samples expansion points from the full set and then solves a quadratic minimization problem for the coefficients as in Step 5. A more sophisticated method will be introduced in future work. This procedure is repeated for 10 time steps, as illustrated in Figure 6 (left). We compare the RKHS approximation error of the recursive reduced set method in Algorithm 2 against that of the diagonal estimator of Algorithm 1. While we do not know the true embedding, we approximate it with a large-sample estimate using 500 samples. The approximation errors in the RKHS metric are illustrated in Figure 6 (right). We observe faster convergence in both mean and variance for the recursive reduced set method in Algorithm 2.
5 Discussion
This paper proposed to use kernel probabilistic programming as a unifying tool for treating uncertainties in dynamical systems. We demonstrated concrete numerical procedures for such applications. This new concept calls for a change in the mode of thinking: KPP characterizes the uncertainties in systems, and hence propagating the embeddings amounts to propagating the uncertainty through the dynamics.
It is our ongoing endeavor to investigate more sophisticated optimization procedures as well as mathematically rigorous analysis of convergence in Algorithm 2 with more general numerical integration. Another important direction is robust control design and state estimation using RKHS embeddings.
Compared with existing popular methods such as gPC or GP, RKHS embedding methods for dynamical systems are still in early development. This paper serves as a call to action for their wider application to robust and stochastic control.
We would like to thank Mario Zanon for providing the code to reproduce the PVE experiment.
References
Boots et al. (2013). Hilbert space embeddings of predictive state representations. arXiv preprint arXiv:1309.6819.
Campi et al. (2019). Scenario optimization for MPC. In Handbook of Model Predictive Control, pp. 445–463.
Fukumizu et al. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research 5, pp. 73–99.
Gauss (1815). Methodus nova integralium valores per approximationem inveniendi. Apud Henricum Dieterich.
Gretton et al. (2012). A kernel two-sample test. Journal of Machine Learning Research 13, pp. 723–773.
Grünewälder et al. (2012). Modelling transition dynamics in MDPs with RKHS embeddings. In Proceedings of the 29th International Conference on Machine Learning, pp. 535–542.
Mohammadi et al. (2015). Estimation of uncertain ARX models with ellipsoidal parameter variability. In Proceedings of the 2015 European Control Conference (ECC), pp. 1766–1771.
Muandet et al. (2017). Kernel mean embedding of distributions: a review and beyond. Foundations and Trends in Machine Learning 10 (1–2), pp. 1–141.
Nishiyama et al. (2012). Hilbert space embeddings of POMDPs. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, pp. 644–653.
Rasmussen and Williams (2006). Gaussian Processes for Machine Learning. MIT Press.
Schölkopf et al. (2015). Computing functions of random variables via reproducing kernel Hilbert space representations. Statistics and Computing 25 (4), pp. 755–766.
Schölkopf et al. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Scokaert and Mayne (1998). Min-max feedback model predictive control for constrained linear systems. IEEE Transactions on Automatic Control 43 (8), pp. 1136–1142.
Simon-Gabriel et al. (2016). Consistent kernel mean estimation for functions of random variables. In Advances in Neural Information Processing Systems, pp. 1740–1748.
Smola et al. (2007). A Hilbert space embedding for distributions. In Proceedings of the 18th International Conference on Algorithmic Learning Theory (ALT), pp. 13–31.
Song (2008). Learning via Hilbert space embedding of distributions. Ph.D. thesis, The University of Sydney.
Song et al. (2013). Kernel embeddings of conditional distributions: a unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine 30 (4), pp. 98–111.
Sriperumbudur et al. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research 11, pp. 1517–1561.
Tempo et al. (2012). Randomized Algorithms for Analysis and Control of Uncertain Systems: With Applications. Springer.
Xiu and Karniadakis (2002). The Wiener–Askey polynomial chaos for stochastic differential equations. SIAM Journal on Scientific Computing 24 (2), pp. 619–644.
Xiu (2009). Fast numerical methods for stochastic computations: a review. Communications in Computational Physics 5 (2–4), pp. 242–272.
Appendix A Proof of Lemma 3.2
We provide a proof for the consistency of Algorithm 1. By the triangle inequality,

(15)  $\big\| \hat{\mu}^{h}_{x(t)} - \mu_{x(t)} \big\|_{\mathcal{H}} \le \big\| \hat{\mu}^{h}_{x(t)} - \hat{\mu}_{x(t)} \big\|_{\mathcal{H}} + \big\| \hat{\mu}_{x(t)} - \mu_{x(t)} \big\|_{\mathcal{H}}.$

In the last inequality, the bound on the first term is due to the continuity of the kernel $k$. In the last line, the first term converges to zero due to the consistency of numerical integration, whereas the second term converges due to the consistency of the KPP estimator (Lemma 2.2).
The convergence rate is $O(h^p) + O_p(N^{-1/2})$. The first term results from a $p$-th order one-step numerical integration rule (e.g., RK$p$) with step size $h$. The second term is triggered by the finite-sample convergence of the KPP estimator, where $N$ is the sample size used by the KPP algorithm.