    # Nonlinear input design as optimal control of a Hamiltonian system

We propose an input design method for a general class of parametric probabilistic models, including nonlinear dynamical systems with process noise. The goal of the procedure is to select inputs such that the parameter posterior distribution concentrates about the true value of the parameters; however, exact computation of the posterior is intractable. By representing (samples from) the posterior as trajectories from a certain Hamiltonian system, we transform the input design task into an optimal control problem. The method is illustrated via numerical examples, including MRI pulse sequence design.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

Building mathematical models of systems from measured input-output data is a task of central importance in a number of fields, from science and engineering, to medicine and finance. Well-chosen inputs can improve the quality of the model by exciting the system in such a way as to extract relevant information. The process of choosing such inputs, subject to experimental constraints, is known as input design.

The information content associated with an input is often quantified by (some scalar function of) the Fisher Information Matrix (FIM) [1, §7.4]. For models in which the mapping from inputs to outputs is linear, the FIM is an affine function of the input spectrum, and so input design in the frequency domain can be formulated as a convex optimization problem. For general nonlinear models, even computation of the FIM can be a challenging task that must be carried out approximately, making optimal input design difficult; cf. related work below for further details.

In this paper, we take a different approach to input design for general nonlinear models, and optimize the posterior distribution over model parameters directly. Using tools from statistical inference, namely Hamiltonian Monte Carlo (HMC) , we construct a Hamiltonian system, trajectories of which correspond to samples from the posterior distribution. The input design problem of shaping the posterior for accurate parameter inference can then be formulated as an optimal control problem for this Hamiltonian system.

#### I-1 Related work

While the approach presented in this paper is applicable to any model satisfying the assumptions stated in §II, our focus shall be on input design for dynamical systems. Input design for linear systems is by now a mature topic, with a large body of literature, cf. e.g. [3, 1, 4, 5] and the references therein. In contrast, results on input design for nonlinear systems are more scarce. Input design for structured systems (i.e. the interconnection of linear dynamics with a static nonlinearity) was studied in [6, 7] and , with the latter proving that D-optimal inputs can be realized by a mixture of Gaussians for certain Wiener systems. Ideas from input design for impulse response systems were extended to nonlinear systems admitting a nonparametric Volterra series representation in . Similarly, design for nonlinear FIR models was considered in [10, 11, 12]. We note that for linearly parametrized static nonlinear systems, input design is a convex problem; this observation motivated the work  which considers inputs restricted to a finite number of amplitude levels, for dynamical systems.

#### I-2 Paper structure

The paper proceeds as follows: a detailed problem specification is provided in §II, followed by a brief primer on HMC in §III. The Hamiltonian optimal control problem is presented in §IV, while approximations required to solve the problem, along with the final proposed algorithm, are stated in §V. Numerical experiments illustrating the method are presented in §VI.

## Ii Problem set-up

In this work, the object of interest is a system, , that accepts inputs and generates measurable outputs

. The relationship between inputs and outputs is modeled by a probability distribution

, where denotes the model parameters. Here is the likelihood, and denotes any prior belief we may have over the model parameters.

###### Assumption 1.

There exists such that the data-generating distribution for is given by .

That is, we assume the existence of ‘true parameters’ , such that, given , the output from is distributed according to . Our goal is to design an input such that can be accurately inferred from the input/output data . In this paper, we shall primarily be concerned with dynamical systems, , for which the inputs and outputs shall be time series, i.e., and . However, the approach applies to any probabilistic input/output system, e.g., systems for which the mapping from to is static.

The main assumption governing the class of probabilistic systems to which our approach is applicable is as follows:

###### Assumption 2.

There exist latent variables such that gradients of the log joint-likelihood, , w.r.t. and , can be computed. Furthermore, any prior distribution over is chosen such that can be computed.

Assumption 2 is necessary for the application of HMC; cf., §III. In this paper, we will work with state-space models for which the joint-likelihood decomposes as

 p(x0|θ)∏Tt=1p(yt|xt,ut,θ)p(xt|xt−1,ut−1,θ),

where the internal states constitute suitable latent variables for Assumption 2. As a concrete example, consider the nonlinear dynamical system

 xt+1 =fθ(xt,ut)+wt, yt=gθ(xt,ut)+vt, (1) wt ∼N(0,Σw,θ), vt∼N(0,Σv,θ), x0∼N(μθ,Σ0,θ),

For this model, and , so is differentiable, when , , , and are differentiable. For the special-case of linear, Gaussian dynamical systems this assumption holds with (no latent variables).

To quantify the notion of ‘accurately inferring’ , we seek inputs that cause the posterior to concentrate about the true parameters, i.e., we seek to minimize

 J(u,y):=Eθ|y[|θ−θ∗|2]=∫|θ−θ∗|2p(θ|u,y)dθ. (2)

Of course, we do not have access to outputs before providing an input , so we shall instead seek to minimize

 ¯J(u):=Ey|θ∗[J(u,y)]=∫J(u,y)p(y|u,θ∗)dy, (3)

i.e., the expected cost, where the expectation is w.r.t. the true data-generating distribution, , given . Both (2) and (3) depend on the true parameters, .

###### Assumption 3.

The ‘true’ parameter value is known.

This is a somewhat unrealistic, albeit ubiquitous, assumption in the experiment design literature [1, §13], as

is the very quantity we wish to estimate. Nevertheless, in applications some reasonable guess for

can usually be made; alternatively, multi-stage adaptive or robust optimization  may be employed to iteratively improve the initial estimate.

Finally, our choice of inputs is subject to constraints. Let denote the set of feasible inputs; common choices include constraints on input amplitude, e.g., , and/or power, e.g., for some . The only requirement for is that projection onto this set can be carried out efficiently.

To summarize, the goal of the paper is to solve the input design problem: . The principal difficulty is computing expected values w.r.t. the posterior distribution for the general class of probabilistic systems considered in this paper. Our key to circumventing this difficulty is the construction of a Hamiltonian system, the trajectories of which correspond to samples from the posterior. This system forms the basis of HMC, which we describe next.

## Iii Hamiltonian Monte Carlo

In this section we provide a brief pragmatic introduction to the use of HMC for statistical inference; for further details, cf. . Consider a physical system, denoted , described by position and momentum , which collectively form the system state . Let and denote the kinetic and potential energy of this system, respectively. The behavior of is then characterized by its Hamiltonian, , and Hamilton’s equations of motion:

 dqidt=∂H∂ρi,dρidt=−∂H∂qi,fori=1,…,d. (4)

The Hamiltonian defines a canonical distribution (in the statistical mechanics sense) for the states , defined by

 pu,y(q,ρ)=(1/Z)exp(−(H(q,ρ))/T), (5)

where is the temperature of the system, and is an appropriate normalization constant. From the perspective of statistical mechanics, a given energy function leads to a probability distribution . For the purpose of statistical inference, HMC takes the opposite perspective: given a probability distribution of interest (a ‘target distribution’), , we can define a Hamiltonian such that the canonical distribution in (5) coincides with . In particular, one can let the positions represent the parameters of interest, i.e. , and choose . The momentum variables enter only for the purpose of constructing the Hamiltonian system; we are not interested in the distribution over , for the purpose of inference. Given this, a popular choice for the kinetic energy is the quadratic function

, which leads to a Gaussian distribution over

. Here, can be interpreted as the (positive definite) inertia matrix. Whenever HMC is used in this paper, the target distribution is , i.e., the posterior distribution over the parameters of , given and .

To generate samples from

, a Markov chain is constructed as follows: given a current state

, a new state is proposed by first re-sampling the momentum from its canonical (Gaussian) distribution, and then forward-simulating the Hamiltonian dynamics (4) for some finite time, . To simulate the Hamiltonian dynamics it is necessary to employ a discrete-time approximation of (4). A popular choice is the so-called ‘leapfrog’ method (which is reversible and preserves volume exactly), one iteration of which involves the following sequence of steps: equationparentequation

 ρi(t+ϵ/2) =ρi(t)−(ϵ/2)∇qiU(q(t)), (6a) qi(t+ϵ) =qi(t)+(ϵ/m)ρi(t+ϵ/2), (6b) ρi(t+ϵ) =ρi(t+ϵ/2)−(ϵ/2)∇qiU(q(t+ϵ)). (6c)

The proposed state is then accepted (as a new sample from ) with probability . Note that the accept/reject step is only required to correct for errors in the numerical integration of (4). To run HMC, the following parameters must be specified; see e.g.  for practical guidelines on these choices.

## Iv Input design via Hamiltonian dynamics

Equipped with this knowledge of HMC, let us first present the intuition for the proposed approach to input design.

### Iv-a Intuitive explanation and illustration of the approach

Consider a probabilistic system and the associated Hamiltonian system , as defined in §II and §III, respectively. Given inputs/outputs from , trajectories of correspond to samples from the posterior . The cost can be thought of, roughly speaking, as the expected distance of a trajectory of , given initial conditions sampled from . To minimize we can think of solving an optimal control problem for : choose (and, indirectly, ) so as to drive the position component () of the state, after some finite time, close to the target state, . This idea is illustrated in Fig. 1, by application to a linear dynamical system

 xt+1=Axt+But+wt, yt=Cxt+Dut+vt. (7)

with and . Let , i.e., these are the unknown parameters that we wish to estimate from data. True parameter values are given by . We shall return to this example in §VI-A, where further details are provided, but for now focus on the two inputs shown in Fig. 1(a): ‘nominal’ and ‘optimized’ . Trajectories of the Hamiltonian systems corresponding to these inputs are shown in Fig. 1(b). Though initialized at the same positions, the (final value of) trajectories corresponding to tend to concentrate around , more so than those corresponding to . In Fig. 1

(c) we plot the posterior distributions (computed exactly, using a Kalman filter), and observe that

is indeed much more ‘peaked’ around compared to .

### Iv-B Computing the cost function with Hamiltonian dynamics

To make this idea more precise, we now relate the cost function defined in (2) to the optimal control problem described above. Let denote the new state obtained by (exactly) evolving the Hamiltonian system (given and ) forward for duration , starting from the initial state . For ease of exposition, suppose that can be evaluated, such that the position of the Hamiltonian can be chosen as . We will relax this assumption in the sequel. Similarly, let . Then, the optimal control problem described above can be formulated as

 minu∈U E¯q,¯ρ[|→qℓ(u,y,¯q,¯ρ)−q∗|2], (8)

where denotes expectation w.r.t. the canonical distribution in (5), given and . We then have the following equivalence between (8) and :

###### Lemma 4.

Consider defined in (2). For all ,

 J(u,y)=E¯q,¯ρ[|→qℓ(u,y,¯q,¯ρ)−q∗|2].
###### Proof.

As the Hamiltonian dynamics (4) leave the canonical distribution invariant , is distributed according to . Furthermore, as and , we have . ∎

It is worth noting that the usual ‘bias-variance decomposition’ applies to

. Recall that denotes expectation w.r.t. . With we have equationparentequation

 Eθ|y[|θ−θ∗|2]=Eθ|y[|θ−μ+μ−θ∗|2] = Eθ|y[|μ−θ∗|2+2(θ−μ)′(μ−θ∗)+|θ−μ|2] = |μ−θ∗|2+tr Eθ|y[(θ−μ)(θ−μ)′],

i.e., minimizing (or solving (8)) is equivalent to minimizing the sum of the 2-norm error between the posterior mean and the true parameters, and the trace of the posterior variance. This can be observed in Fig 1(c).

For ease of exposition, we assumed that could be computed. For general probabilistic systems, e.g., nonlinear state-space models such as (1), this is not the case. However, as stated in Assumption 2, we only require that gradients of are computable, for some latent variable , cf., §II. In this general setting, we augment the state of with these latent variables, and choose . From (6a), HMC requires , where

 −U(q)=logp(q|u,y)∝log(p(y|u,q)p(q|u)).

For , decomposes as

 logp(y|u,x,θ)+logp(x|u,θ)+logp(θ) = logp(y,x|u,θ)+logp(θ),

hence Assumption 2 is sufficient for computation of . We can then optimize a weighted version of (8), , in which and . This weighted version ignores the components of associated with , so that we retain .

To summarize, Lemma 4 established equivalence between the optimal control problem (8) and the cost in (2). As we wish to optimize the expected cost, given in (3), the optimization problem we propose solving for input design is given by

 u∗=argmaxu∈U Ey|θ∗[E¯q,¯ρ[|→qℓ(u,y,¯q,¯ρ)−q∗|2]], (10)

where denotes expectation w.r.t. .

## V Approximate solution to control problem

The optimization problem (10), as stated, is difficult to solve. In this section we propose a number of approximations to obtain a tractable alternative. At the outset, we emphasize that we are only interested in local optimization: i.e., we will assume some nominal input , and seek an improved input by numerical optimization.

### V-a Approximating expected values

There are two expected values in (10), neither of which can be computed exactly for the general probabilistic models under consideration. First, we consider approximation of , i.e., expectation w.r.t. the generative distribution . One approach would be to simply approximate it with a Monte Carlo (MC) average, using samples drawn from . However, in this approach, the dependence of on is lost during optimization of , because is generated with a fixed, nominal . As an alternative, we consider approximate realizations from , where is a deterministic function of . The idea is to capture (some of) the dependence of  on . For example, consider the state-space model in (1). The approximate realization  can be formed by first generating samples of the stochastic processes in (1), and then setting and

 ~yit(u)=gθ∗(~xit(u),ut)+vit, ~xit+1(u)=fθ∗(~xit(u),ut)+wit.

We then approximate (10) with

 1NN∑i=1E¯q,¯ρ[|→qℓ(u,~yi(u),¯q,¯ρ)−q∗|2]. (11)

Note that in practice we always choose , i.e., we use a single deterministic approximation formed by setting the stochastic processes to their mean values, e.g., for the state-space model (1) the values of would each be set to zero.

Next, we turn our attention to the approximation of , i.e., expectation w.r.t. the canonical distribution in (5). As depends on , the natural MC approximation of (11) (for ) would involve samples drawn from defined by . For simplicity, we ignore this dependency, and instead use a MC approximation with samples drawn from , i.e., the canonical distribution defined by the nominal input and the approximate output . Such samples can be generated by running HMC. Thus, with , we approximate (10) (with ) by

 1MM∑i=1|→qℓ(u,~y(u),¯qi,¯ρi)−q∗|2. (12)

### V-B Discrete-time approximation of dynamics

The cost in (12) depends on the mapping , which involves the evolution of the continuous-time dynamical system , cf. (4). For the purpose of optimization, we replace the dynamics in (4) with the discrete-time approximation in (6). Then minimization of (12) w.r.t. can be approximated by the following discrete-time optimal control problem:

 minu∈U 1MM∑i=1|qi(L)−q∗|2, (13) s.t. qi(0)= ¯qi, ρi(0)=¯ρi, i=1,…,M ρi(1)=ρi(0)+h(qi(0)), i=1,…,M qi(k)=qi(k−1)+(ϵ/m)ρi(k), k=1,…,L, ρi(k)=ρi(k−1)+h(qi(k−1)), k=1,…,L,

where , and . We then approximately solve (13), i.e. locally optimize it, using standard numerical optimization methods , such as projected gradient descent.

### V-C Discussion

We conclude this section with a brief discussion of some extensions, limitations, and practical considerations.

Sensitivity to  The choice of (i.e. the best guess of the unknown true parameter) determines the point about which we attempt to concentrate the posterior, , as well as deterministic output approximation, . It is possible to reduce the sensitivity of the former to errors in our guess of

by using a ‘dead-zone’ loss function in (

2), instead of the 2-norm [17, §6.1.2]. Note that does not affect HMC.

Choice of HMC parameters  The method requires the user to specify , cf. §III, which may be tuned via pilot runs (of HMC). For instance, the stepsize can be increased until the acceptance rate falls below a reasonable tolerance; cf.  for guidelines.

Computational complexity  The key factors that influence computational complexity are: i) running HMC, ii) solving (13), and iii) the number of samples . Note that  directly affects both i) and ii). The larger , the more iterations of HMC are required to generate the samples. is typically on the order of 10-100, which means that the cost of solving (13) dominates that of running HMC. The cost of (13) scales linearly with , as it is equivalent to solving simultaneous nonlinear model predictive control (MPC) problems. To reduce the computational burden, it is possible to employ stochastic gradient methods, and use a subset of to compute the gradient at each iteration.

## Vi Numerical examples

### Vi-a Illustration on a linear system

We now return to the illustration presented in §IV-A, and fill-out the remaining details in light of §V. As we can compute exactly, the position of the Hamiltonian system can be set to , i.e., for this system, there is no need to include latent variables in the Hamiltonian state. We randomly generate by sampling from for and project it onto the feasible set , cf. Fig. 1(a). HMC is run (line 1 of Alg. 1) to generate samples from , cf. Fig. 1(b). Again, cf.  for HMC parameters. The optimization problem (13) is then solved, locally with projected gradient descent as in line 1, to obtain , which is shown in Fig. 1(a). Trajectories of the Hamiltonian systems corresponding to each of these inputs, and , are shown in Fig. 1(b). Notice that the final positions of the trajectories corresponding to the optimized input tend to concentrate about . This is reflected in the posterior distribution , plotted in Fig. 1(c), which is more peaked around . This is perhaps unsurprising, given that the amplitude of is typically larger than . For this reason, we also plot the posterior for to demonstrate that the improvement observed after optimization is due to more than just a simple increase in the amplitude of the input.

### Vi-B Nonlinear state-space system

In this section, we consider input design for the following nonlinear state-space model

 xt+1=θx3t−xt1+x2t+ut+wt, yt=xt+x2t+110x3t+vt,

where

are normally distributed according to

. The goal is to design the input , , so as to infer the parameter , subject to constraints . The true parameter value is given by . A nominal input is generated by sampling from , which is then optimized by Alg. 1 using HMC samples. The deterministic output is generated using the mean values of the stochastic quantities, . One such application of this procedure is depicted in Fig. 2. The nominal input , optimized input , and a pseudo-random binary input given by are shown in Fig. 2(a). The posterior distributions corresponding to each of these inputs are plotted in Fig. 2(b). The posterior corresponding to is highly concentrated about . Interestingly, the pseudo-random input leads to a posterior with greater bias and variance than even the nominal input; emphasizing that for nonlinear systems, simply scaling the ‘magnitude’ (amplitude and/or power) of the input does not necessarily lead to better performance, in contrast to the linear case. Fig. 2(c) reports the results of maximum likelihood estimation (MLE) of , using and as shown in Fig. 2

(a). For each of the 500 trials, data is generated by simulating the model (with a random realization of the stochastic quantities), and MLE is performed with the expectation maximization (EM) algorithm

. As suggested by Fig. 2(b), we observe a significant reduction in estimation error when using compared to . The results in Fig. 2 are representative of performance; however, exact results depend on the random realization of . Code to reproduce this example is available in .

### Vi-C Optimal flip angles for MRI

In this section, we consider a magnetic resonance imaging (MRI) pulse sequence design problem, identical to that studied in . We provide a minimal description of the design task; for full details and problem motivation, cf. . The dynamics of the system are given by

 xt+1=[cos(ut)−sin(ut)sin(ut)cos(ut)][τ1x1,t+1−τ1τ2x2,t],
 p(yt|xt,ut,τ)=ytσ2exp(−y2t+x22,t2σ2)I0(ytx2,tσ2),

where denotes the modified Bessel function of the first kind (order zero). The hidden states correspond to magnetization. The observed outputs are Rician-distributed, with . The inputs describe flip-angles in the RF excitation pulses, and as such they are constrained to . The model parameters, , are related to the and relaxation times by , where is the sampling time (between RF pulses). True parameters correspond to and relaxation times of 0.68 and 0.09, respectively. The task is to design a sequence , , so as to estimate given that is known.

To apply our method, we first randomly generate a nominal input (the variance is chosen such that the nominal input is ‘small’). We then proceed with Alg.1, with HMC samples. The deterministic output is generated by taking the expected value of given (note that is a deterministic function of ). We repeat this process for 10 random initializations (of ) and report the results in Fig. 3. Fig. 3(a) shows the posterior distributions for various inputs , including that obtained by dynamic programing (DP), , as proposed in . We make two observations: first, the optimized input demonstrates a significant improvement over the nominal input . Second, in all but one of the trials, leads to a posterior that is more concentrated about , compared to the input from DP . To investigate this further, we perform maximum likelihood estimation of  using these inputs. Fig. 3(c) reports estimation error for 1000 such trials; note that the outputs for these trials were randomly generated from the Ricain distribution. The histogram on the left of Fig. 3(c) compares as shown in Fig. 3(b) to . As suggested by Fig. 3(a), leads to a reduction in the variance of the estimation error. The histogram on the right of Fig. 3 randomly selects one of the optimized inputs from Fig. 3(a) for each (MLE) trial; we also observe a reduction in the error variance, though the effect is less pronounced.

## References

•  L. Ljung, System Identification: Theory for the User, 2nd ed.   Prentice Hall, 1999.
•  S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, “Hybrid Monte Carlo,” Physics letters B, vol. 195, no. 2, pp. 216–222, 1987.
•  G. C. Goodwin and R. L. Payne, Dynamic system identification: experiment design and data analysis.   Academic press New York, 1977, vol. 136.
•  H. Jansson and H. Hjalmarsson, “Input design via LMIs admitting frequency-wise model specifications in confidence regions,” IEEE transactions on Automatic Control, vol. 50, no. 10, pp. 1534–1549, 2005.
•  I. R. Manchester, “Input design for system identification via convex relaxation,” arXiv preprint arXiv:1009.5614, 2010.
•  T. L. Vincent, C. Novara, K. Hsu, and K. Poolla, “Input design for structured nonlinear system identification,” Automatica, vol. 46, no. 6, pp. 990–998, 2010.
•  M. Gevers, M. Caenepeel, and J. Schoukens, “Experiment design for the identification of a simple Wiener system,” in IEEE 51st Annual Conference on Decision and Control (CDC), 2012, pp. 7333–7338.
•  K. Mahata, J. Schoukens, and A. De Cock, “Information matrix and D-optimal design with Gaussian inputs for Wiener model identification,” Automatica, vol. 69, pp. 65–77, 2016.
•  G. Birpoutsoukis, “Volterra series estimation in the presence of prior knowledge,” Ph.D. dissertation, Vrije Universiteit Brussel, 2018.
•  H. Hjalmarsson and J. Mårtensson, “Optimal input design for identification of non-linear systems: Learning from the linear case,” in American Control Conference (ACC), 2007, pp. 1572–1576.
•  C. A. Larsson, H. Hjalmarsson, and C. R. Rojas, “On optimal input design for nonlinear FIR-type systems,” in 49th IEEE Conference on Decision and Control (CDC), 2010, pp. 7220–7225.
•  A. De Cock, M. Gevers, and J. Schoukens, “D-optimal input design for nonlinear FIR-type systems: A dispersion-based approach,” Automatica, vol. 73, pp. 88–100, 2016.
•  M. Forgione, X. Bombois, P. M. Van den Hof, and H. Hjalmarsson, “Experiment design for parameter estimation in nonlinear systems based on multilevel excitation,” in European Control Conference (ECC), 2014, pp. 25–30.
•  C. R. Rojas, J. S. Welsh, G. C. Goodwin, and A. Feuer, “Robust optimal experiment design for system identification,” Automatica, vol. 43, no. 6, pp. 993–1008, 2007.
•  R. M. Neal, “MCMC using Hamiltonian dynamics,” Handbook of Markov Chain Monte Carlo, pp. 113–162, 2011.
•  M. Diehl, H. J. Ferreau, and N. Haverbeke, “Efficient numerical methods for nonlinear MPC and moving horizon estimation,” in Nonlinear model predictive control.   Springer, 2009, pp. 391–417.
•  S. Boyd and L. Vandenberghe, Convex Optimization.   Cambridge: Cambridge University Press, 2004.
•  J. Umenberger, “Code to reproduce selected examples.” 2019. [Online]. Available: {https://github.com/umenberger}
•  T. B. Schön, A. Wills, and B. Ninness, “System identification of nonlinear state-space models,” Automatica, vol. 47, no. 1, pp. 39–49, 2011.
•  J. Maidens, A. Packard, and M. Arcak, “Parallel dynamic programming for optimal experiment design in nonlinear systems,” in IEEE 55th Conference on Decision and Control (CDC), Las Vegas, NV, USA, 2016, pp. 2894–2899.