## 1 Introduction

Analyzing the dynamics of nonlinear systems has always been an important area of research. This field includes nonlinear system identification ogunnaike1994process ; nelles2013nonlinear , modal analysis of dynamical systems rudy2018data ; tu2013dynamic , system identification using neural networks narendra1990identification ; hunt1992neural , and supervised learning of time-series data, also known as function approximation or regression alaeddini2018linear ; ting2011locally ; vijayakumar2000locally ; cleveland1988locally ; junior2015regional .

Most of the dynamical systems we want to understand and control are either stochastic or have inherent noise components, which makes learning their models challenging. Moreover, in model-based control atkeson1998nonparametric ; AndrewBagnell2014 and optimal control ha2017multiscale , the model must often be learned online as measurement data arrive throughout the process. Hence, the learning process not only has to produce accurate and precise approximations of the system's real-world behavior but must also be fast enough not to delay the control process. For example, in model-based reinforcement learning atkeson1998nonparametric ; AndrewBagnell2014 , an agent first learns a model of the environment and then uses that model to decide which action is best to take next. If computing the transition dynamics takes too long, the predicted policy is no longer useful. On the other hand, if the control policy is learned from an imprecise transition model, the policy cannot be used without significant modifications, as shown in Atkeson (1994) atkeson1994using . These learning procedures can be combined with Bayesian inference techniques to capture the stochasticities and/or uncertainties in the dynamical systems as well as the nonlinearities meier2014efficient ; neal2012bayesian ; kononenko1989bayesian .

One of the popular methods to learn the transition dynamics is supervised learning. Usually, there are three different ways to construct the learning criteria: global, local, or a combination of both. From another perspective, these methods are classified into memory-based (lazy) or memory-less (eager) methods based on whether they use the training set in the prediction process. An example of the former methodology is the $k$-nearest neighbor algorithm, and for the latter, the artificial neural network is a good representative. In an artificial neural network, the target function is approximated globally during training, implying that we do not need the training set to run inference for a new query. Therefore, it requires much less memory than a lazy learning system. Moreover, post-training queries have no effect on the learned model itself, and we get the same result every time for a given query. On the other hand, in lazy learning, the training set expands with every new query, and, thus, the model's predictions change over time.

### 1.1 Global Regression Methods for Learning Dynamics
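The lazy/eager contrast above can be made concrete with a minimal sketch: a $k$-nearest-neighbor regressor must retain the training set at query time, whereas a globally fitted model keeps only its coefficients. The data and model choices below are illustrative, not from the paper.

```python
# Lazy (memory-based) vs. eager (memory-less) regression on toy 1-D data.
import numpy as np

rng = np.random.default_rng(0)
X_train = np.linspace(0.0, 1.0, 50)
y_train = np.sin(2 * np.pi * X_train) + 0.1 * rng.standard_normal(50)

def knn_predict(x_query, k=3):
    """Lazy learner: keeps the whole training set and averages the
    targets of the k nearest inputs at query time."""
    idx = np.argsort(np.abs(X_train - x_query))[:k]
    return y_train[idx].mean()

# Eager learner: a global cubic fit; the training set could be discarded
# once the coefficients are estimated.
coeffs = np.polyfit(X_train, y_train, deg=3)

def poly_predict(x_query):
    return np.polyval(coeffs, x_query)

# Both predictions should be close to sin(pi/2) = 1 at x = 0.25.
print(knn_predict(0.25), poly_predict(0.25))
```

Note that `knn_predict` needs `X_train` and `y_train` at every call, while `poly_predict` only needs the four fitted coefficients, which is the memory distinction the text describes.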

Most regression algorithms are global in that they minimize a global loss function. Gaussian processes (GPs) are popular global methods rasmussen2004gaussian that allow us to estimate the hyperparameters under uncertainty. However, their computational cost, $O(N^3)$ for $N$ observations, limits their applications in learning the transition dynamics for robot control. There have been efforts both in sparsifying Gaussian process regression (GPR) snelson2006sparse ; titsias2009variational ; quia2010sparse ; matthews2016sparse ; hensman2015mcmc and in developing online krizhevsky2012imagenet ; wilson2016stochastic and incremental gijsberts2013real ; csato2002sparse algorithms for sparse GPR to make it applicable to robotic systems. Another challenging question is how to construct these processes, for instance, how to initialize the hyperparameters or define a reasonable length scale. van2017convolutional presents convolutional Gaussian processes, in which a variational framework is adopted for approximation in GP models. The variational objective minimizes the KL divergence across the entire latent process matthews2016scalable , which guarantees an exact model approximation given enough resources. The computational complexity of this algorithm is $O(NM^2)$, for $M \ll N$ inducing points, through sparse approximation titsias2009variational . Moreover, it optimizes a non-Gaussian likelihood function opper2009variational .

### 1.2 Dynamics Learning with Spatially Localized Basis Functions
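The $O(N^3)$ bottleneck mentioned above is easy to locate in a minimal GP regression sketch: it is the factorization of the $N \times N$ kernel matrix. The kernel choice, noise level, and data below are illustrative assumptions.

```python
# Exact GP posterior mean; the Cholesky factorization of the N x N Gram
# matrix is the O(N^3) step that sparse approximations avoid.
import numpy as np

def rbf_kernel(A, B, length_scale=0.2):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / length_scale**2)

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0.0, 1.0, 40))
y = np.sin(2 * np.pi * X) + 0.05 * rng.standard_normal(40)

K = rbf_kernel(X, X) + 0.05**2 * np.eye(40)          # kernel + noise variance
L = np.linalg.cholesky(K)                            # O(N^3) factorization
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y

k_star = rbf_kernel(np.array([0.25]), X)
mean_star = float(k_star @ alpha)                    # posterior mean at x* = 0.25
```

Sparse methods such as titsias2009variational replace the $N \times N$ factorization with computations involving only $M$ inducing points, giving the $O(NM^2)$ cost quoted above.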

Receptive field-weighted regression (RFWR) in schaal1998constructive is one of the prominent algorithms that motivates many incremental learning procedures. Receptive fields in nonparametric regression are constructed online and discarded right after prediction. Locally weighted regression (LWR), and a more advanced version of it, termed locally weighted projection regression (LWPR) vijayakumar2000locally , are two popular variants of RFWR, especially in control and robotics. The idea is that a local model is trained incrementally within each receptive field, independent of the other receptive fields. Prediction is then made by blending the results of all the local models. The size and shape of the receptive fields and the bias on the relevance of the individual input dimensions are the parameters to be learned in these models. In addition, the learning algorithm allocates new receptive fields and prunes extra ones as needed when new data are observed. The local models are usually simple, such as low-order polynomials hastie1993local . These methods are robust against negative interference since the local models are trained independently. This contrasts with neural network models and is discussed in schaal1998constructive .
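The receptive-field idea can be sketched as follows: each local model is fit by weighted least squares within its own RBF receptive field, and a query blends the local outputs by normalized weights. The centers, bandwidth, and ridge term below are illustrative assumptions, not values from the cited algorithms.

```python
# Receptive-field sketch: one weighted linear model per RBF receptive
# field, blended by normalized weights at prediction time.
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, 200)
y = np.sin(3 * X) + 0.05 * rng.standard_normal(200)

centers = np.linspace(-1.0, 1.0, 21)   # receptive field centers (assumed fixed)
bandwidth = 0.1                        # receptive field width (assumed)

def fit_local_models():
    models = []
    for c in centers:
        w = np.exp(-0.5 * ((X - c) / bandwidth) ** 2)    # receptive field weights
        Phi = np.column_stack([np.ones_like(X), X - c])  # local linear basis
        # Weighted least squares with a tiny ridge term for stability.
        beta = np.linalg.solve(Phi.T @ (Phi * w[:, None]) + 1e-6 * np.eye(2),
                               Phi.T @ (w * y))
        models.append(beta)
    return models

def predict(models, x):
    ws = np.exp(-0.5 * ((x - centers) / bandwidth) ** 2)
    preds = np.array([b[0] + b[1] * (x - c) for b, c in zip(models, centers)])
    return float(ws @ preds / ws.sum())  # blend of the local predictions

models = fit_local_models()
```

Each local fit ignores the others entirely, which is the independence property (and the robustness to negative interference) that the text attributes to RFWR-style methods.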

Another natural and interesting comparison is with the mixture-of-experts (ME) model jacobs1991adaptive ; jordan1994hierarchical . ME models are global learning systems in which the experts compete globally to cover the training data; they address the bias-variance problem fortmann2012understanding by finding the right number of local experts and the optimal distance metric for computing the weights in locally weighted regression atkeson1990using . In xu1995alternative , the authors define a mixture of Gaussians as the gating network and train both the gating network and the local models with a single-loop analytical version of the expectation-maximization (EM) algorithm. Schaal et al. compare the two algorithms in schaal1998constructive and conclude that the performance of ME depends on the number of experts and their distribution in the input space. Moreover, ME needs a larger training set than RFWR to provide comparably accurate results. The results of the two algorithms become indistinguishable as we increase the amount of training data and/or lower the noise. However, these conditions are difficult to achieve, since a large enough training set is not always available and the signal-to-noise ratio is not manually defined. Nevertheless, LWR and LWPR require many tuning parameters whose optimal values are highly data-dependent. Therefore, we cannot always obtain robust and accurate results, because the highly localized training strategy of these methods does not allow the local models to benefit from the other models in their vicinity.

### 1.3 Combining Local and Global Regression Methods for Dynamics Learning

Usually, the online performance of global methods depends on how well the distance metric is learned offline, which in turn depends on how well the training data represent the input space. There are also several concerns with purely local linear models, as discussed above. Therefore, one should develop methods that address both sets of problems.

Local Gaussian regression (LGR) in meier2014efficient is a probabilistic alternative to LWR that transforms it into a localized inference procedure by employing a variational approximation. This method combines the best of the global and local regression frameworks by coupling the well-known Bayesian regression framework with radial basis function (RBF) features for nonlinear function approximation. It is a top-down approach that takes advantage of the efficiency of local regression and the accuracy of global Bayesian regression while incorporating local features. In this method, maximizing the likelihood function is facilitated by introducing hidden targets for every basis function, which act as links connecting the observations to the unknown parameters via Bayes' law. The likelihood function is maximized using the variational expectation-maximization (EM) algorithm neal1998view ; Jordan1999 ; Tzikas2008 . The EM algorithm is a popular method that iteratively maximizes the likelihood function without explicitly computing it. Variational EM is an alternative that approximates the posterior distribution with a factorized function of the hidden variables in the model Tzikas2008 . The use of variational EM makes LGR a fast learning algorithm. A graphical model representation of this algorithm is shown in Fig. 1.

Another useful characteristic of LGR is that it optimizes the RBF length scale so that the minimum number of local models is used for learning. This optimization reduces the size of the model significantly. In addition, LGR adopts the idea of automatic relevance determination (ARD) mackay1992bayesian ; tipping2001sparse ; neal2012bayesian , which results in a sparse model, as described later in Section 2.2. Both of these adjustments yield a compact model that occupies little memory compared to the space needed to store the training set.

### 1.4 Validity of the EM Algorithm

The EM algorithm is a popular iterative method for solving maximum likelihood estimation problems, and performance guarantees have been established for generic EM algorithms balakrishnan2017statistical ; wu1983convergence . However, there are very few theorems that prove the general convergence properties of EM algorithms, especially when applied to complex maximum likelihood problems. In problems with a bounded or convex loss function, it is possible to find a generalization bound on the approximations made by the EM algorithm using Rademacher complexity cherkassky1999model ; kakade2009complexity ; grunwald2017tight ; meir2003generalization . This line of research links the Bayesian and frequentist approaches together germain2016pac . The idea is that the Bayesian approach is powerful for constructing an estimator, whereas the frequentist approach provides a methodology for better evaluation. Therefore, by putting them together, one can come up with a more robust model design. However, these methods are limited to simple classes of problems opper1999general and are usually not applicable to hierarchical Bayesian models. This lack of applicability is primarily due to the fact that the likelihood function cannot be described using a bounded or convex function.

Moreover, most of the bounds found using the VC dimension or the Rademacher complexity are data-dependent, making them less popular for model (algorithm) evaluation. For instance, meir2003generalization establishes data-dependent bounds for mixture algorithms with convex constraints and an unbounded log-likelihood function. However, this is rarely the case when building complex models for high-dimensional datasets. Another effort has been to investigate whether the parameters updated by the EM algorithm converge to a stationary point of the log-likelihood function gunawardana2005convergence ; gupta2011theory . gunawardana2005convergence restate the EM algorithm as a generalized alternating minimization procedure and use an information-geometric framework for describing such algorithms and analyzing their convergence properties. They show that the optimal parameters found by the variational EM procedure constitute a stationary point of the likelihood function if and only if the parameters locally minimize the variational approximation error. This can only happen in two ways. The first is if the variational error has stationary points at the stationary points of the likelihood function, which is guaranteed only if we know those stationary points prior to estimation. The other is if the variational error is independent of the parameters, which is not possible if the variational family makes independence assumptions to ensure tractability. Therefore, models without such independence assumptions have lower variational errors than those with them gunawardana2005convergence .

Dempster et al. dempster1977maximum prove a general convergence theorem for the EM algorithm. However, the proof is constructed on flawed assumptions. wu1983convergence and boyles1983convergence point out flaws in the proof and present a counterexample for which convergence of the generalized EM algorithm to the maxima of the likelihood function does not result in a single set of parameters. The convergence of these parameters is, therefore, highly dependent on the initial guess and the properties of the likelihood function.

### 1.5 Summary

In this paper, we adopt the LGR model presented in meier2014efficient , and modify it suitably for batch-wise learning of stochastic dynamics. We term this model the Batch-hierarchical Bayesian linear regression (Batch-HBLR) model. Moreover, we describe all the steps to derive the Bayes update equations for the posterior distributions of the model. We then analyze the convergence of the variational EM algorithm that is at the core of training the model. Subsequently, we evaluate its performance experimentally on three different dynamical systems, including a challenging external force-field actuated micro-robot. Results indicate good performance on all the systems in terms of approximating the dynamics closely with a parsimonious model structure leading to fast computation times during testing. We, therefore, anticipate our Batch-HBLR model to provide a foundation for online model-based reinforcement learning of robot motion control under uncertainty.

## 2 Hierarchical Bayesian Regression

### 2.1 Statistical Background

Prior to formulating the regression model, it is useful to provide some statistical background. The probability density functions (pdfs) of the multivariate normal (Gaussian) distribution for $x \in \mathbb{R}^d$ and of the univariate gamma distribution are written in (1) and (3), respectively.

$$\mathcal{N}(x \mid \mu, \Sigma) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right) \qquad (1)$$

We use the following notation in the rest of the manuscript to refer to a normal distribution, $\mathcal{N}(\mu, \Sigma)$, where $\mu$ is the mean vector and $\Sigma$ is the covariance matrix. The log-likelihood for $N$ i.i.d. samples drawn from $\mathcal{N}(\mu, \Sigma)$ is:

$$\ln p(X \mid \mu, \Sigma) = -\frac{Nd}{2}\ln(2\pi) - \frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}(x_n-\mu)^{\top}\Sigma^{-1}(x_n-\mu) \qquad (2)$$

Here, $X = [x_1, \dots, x_N]$ is a matrix containing all the samples, and $x_n$ represents an individual sample. Moreover, in this manuscript, we use $\ln$ for the natural logarithm.

For a random variable $\lambda$ drawn from the gamma distribution ($\lambda \sim \mathrm{Gam}(a, b)$), we use the following pdf,

$$\mathrm{Gam}(\lambda \mid a, b) = \frac{b^{a}}{\Gamma(a)}\,\lambda^{a-1}e^{-b\lambda} \qquad (3)$$

We use the following notation in the rest of the manuscript to refer to a gamma distribution, $\mathrm{Gam}(a, b)$. Using Stirling's formula for the gamma function, we approximate $\ln\Gamma(a)$ for $a \gg 1$ with $a\ln a - a - \tfrac{1}{2}\ln(a/2\pi)$. Then, we get the following log-likelihood function for the gamma distribution,

$$\ln \mathrm{Gam}(\lambda \mid a, b) \approx a\ln b + (a-1)\ln\lambda - b\lambda - a\ln a + a + \tfrac{1}{2}\ln\!\left(\frac{a}{2\pi}\right) \qquad (4)$$

At this point, we also want to clarify the difference between the two notations $p(x; \theta)$ and $p(x \mid \theta)$. We use the former when $\theta$ is a vector of parameters: the probability is then a function of $\theta$, and we call it the likelihood function. In contrast, the latter denotes the conditional probability of $x$ when $\theta$ is a random variable.
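As a quick numerical sanity check, the two log-densities in (1)-(4) can be implemented directly from the formulas and verified to integrate to one (1-D cases; the grid limits are arbitrary):

```python
# Direct implementation of the normal and gamma log-pdfs, checked by
# numerically integrating the corresponding densities to ~1.
import numpy as np
from math import lgamma

def mvn_logpdf(x, mu, Sigma):
    """Log of (1): multivariate normal density."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))

def gamma_logpdf(lam, a, b):
    """Log of (3): gamma density with shape a and rate b (exact, no Stirling)."""
    return a * np.log(b) - lgamma(a) + (a - 1) * np.log(lam) - b * lam

# Riemann sums of the densities over their (effective) supports.
xs = np.linspace(-10.0, 10.0, 20001)
dx = xs[1] - xs[0]
mass_mvn = sum(np.exp(mvn_logpdf(np.array([v]), np.array([0.0]),
                                 np.array([[2.0]]))) for v in xs) * dx

ls = np.linspace(1e-6, 50.0, 20001)
dl = ls[1] - ls[0]
mass_gam = np.exp(gamma_logpdf(ls, 3.0, 0.5)).sum() * dl
print(mass_mvn, mass_gam)  # both close to 1.0
```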

### 2.2 Bayesian Linear Regression Model with Local Features

In the Bayesian framework, consider the function $f$ and the input variable $x \in \mathbb{R}^d$. We want to predict the function value at an arbitrary location $x^{*}$, using a set of $N$ noisy observations, $y_n = f(x_n) + \epsilon_n$, where $n = 1, \dots, N$. The noise terms $\epsilon_n$ are independently drawn from a zero-mean Gaussian distribution,

$$p(\epsilon_n) = \mathcal{N}(\epsilon_n \mid 0, \beta_y^{-1}) \qquad (5)$$

where $\beta_y$ is the precision. Therefore, one can assume $y_n$ is randomly sampled from a data-generating distribution, and we denote by $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$ the i.i.d. observations of $N$ elements, with $X = [x_1, \dots, x_N]$ and $y = [y_1, \dots, y_N]^{\top}$.

In basic (global) Bayesian regression, the function $f$ is modeled as a linear combination of basis functions; if these functions also include a bias term, then $\phi(x) = [1, \phi_1(x), \dots, \phi_M(x)]^{\top}$.

$$f(x) = w^{\top}\phi(x) \qquad (6)$$

where $w$ are the weights of the linear combination. By definition, $y_n = w^{\top}\phi(x_n) + \epsilon_n$. Therefore, we write the likelihood function in the following form

$$p(y \mid X, w, \beta_y) = \prod_{n=1}^{N}\mathcal{N}\!\left(y_n \mid w^{\top}\phi(x_n),\, \beta_y^{-1}\right) \qquad (7)$$

In local regression, we introduce kernel functions, such as the radial basis function (RBF), to emphasize the local observations around the point of interest. This emphasis results in more accurate predictions, while the computational complexity remains the same as in the global regression problem. A better methodology is to use a Bayesian linear regression comprising $M$ local linear models, with a Gaussian likelihood for the regression parameters ($w_m$) written as:

$$p(y \mid X, W, \beta_y) = \prod_{n=1}^{N}\mathcal{N}\!\left(y_n \,\Big|\, \sum_{m=1}^{M} w_m^{\top}\phi_{nm},\, \beta_y^{-1}\right) \qquad (8)$$

where $\phi_{nm} = \eta_m(x_n)\,\tilde{x}_n$, and $M$ is the number of linear models. Here, we build the feature vector with linear bases plus a bias term, and define it as $\tilde{x}_n = [1, x_n^{\top}]^{\top}$, so $d+1$ is the dimensionality of the weighted feature vector $\phi_{nm}$. The weights for these spatially localized basis functions are defined by the radial basis function (RBF) kernel, whereby the RBF feature weight is given by

$$\eta_m(x_n) = \exp\!\left(-\tfrac{1}{2}(x_n - c_m)^{\top}\Lambda_m^{-1}(x_n - c_m)\right) \qquad (9)$$

The RBF is parameterized by its center $c_m$ and a positive definite matrix $\Lambda_m$, both of which are estimated given the observations $\mathcal{D}$. Here, we consider the diagonal distance matrix $\Lambda_m = \mathrm{diag}(\lambda_{m1}, \dots, \lambda_{md})$.
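The RBF weight in (9) with a diagonal distance metric is only a few lines of code; the center and length scales below are arbitrary illustrative values.

```python
# RBF feature weight eta_m(x) from (9) with a diagonal distance matrix.
import numpy as np

def rbf_weight(x, center, lengthscales):
    """eta(x) = exp(-0.5 (x - c)^T Lambda^{-1} (x - c)),
    with Lambda = diag(lengthscales**2)."""
    z = (x - center) / lengthscales
    return float(np.exp(-0.5 * z @ z))

c = np.array([0.0, 1.0])
ls = np.array([0.5, 2.0])
print(rbf_weight(c, c, ls))                       # 1.0 at the center
print(rbf_weight(np.array([3.0, 1.0]), c, ls))    # nearly 0 far from the center
```

The rapid decay away from the center is what justifies the independence assumption among distant local models made in the next paragraph.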

In this methodology, we implicitly assume independence among the local models, as the correlations between distant models are almost zero (due to the low value of the RBF weight). However, we have not yet used this assumption to reduce the computational cost of the EM algorithm (mentioned earlier in Section 1 and described later in Section 2.3) used to iteratively solve the maximum likelihood estimation problem for the model parameters. To do so, we introduce a hidden target variable $f_{nm}$ for every local model, and infer the parameters of each local model independently. Fig. 2 is a graphical model that illustrates how the hidden variables participate both in the global and the local loops. Therefore, the complete likelihood of this graphical model is,

$$p(y, F \mid X, \theta) = \prod_{n=1}^{N}\mathcal{N}\!\left(y_n \mid \mathbf{1}^{\top}f_n,\, \beta_y^{-1}\right)\prod_{m=1}^{M}\mathcal{N}\!\left(f_{nm} \mid w_m^{\top}\phi_{nm},\, \beta_{f_m}^{-1}\right) \qquad (10)$$

where $f_n = [f_{n1}, \dots, f_{nM}]^{\top}$, so $y_n = \mathbf{1}^{\top}f_n + \epsilon_n$, and $F$ is the $N \times M$ matrix of hidden targets. Later, in Algorithm 1, we use $f_m$ to denote the $m$-th column of $F$.

We place Gaussian priors on the regression parameters $w_m$ (11) and gamma priors on the precision parameters $\alpha_{mi}$. It is worth mentioning that $A_m$ is a diagonal matrix with the precision parameters along the main diagonal, $A_m = \mathrm{diag}(\alpha_{m1}, \dots, \alpha_{m,d+1})$. When we write $\alpha_m$, we refer to the vector that contains all the precision parameters of the local model $m$. Due to this diagonal structure of $A_m$, the model learns the precision parameter for each feature independently. This property is very useful when adopting the idea of automatic relevance determination (ARD) to automatically select the significant parameters by comparing the precisions with a threshold value mackay1992bayesian ; tipping2001sparse ; neal2012bayesian . ARD is an elegant method for sparsifying a learning model with many features. The prior on the weights (11) is a normal distribution; since it is a conjugate prior, the posterior is also a normal distribution.

$$p(w_m \mid \alpha_m) = \mathcal{N}\!\left(w_m \mid 0,\, A_m^{-1}\right), \qquad p(\alpha_{mi}) = \mathrm{Gam}(\alpha_{mi} \mid a_0, b_0) \qquad (11)$$

After training, some of these precision parameters usually converge to large numbers, indicating that the corresponding feature does not play a significant role in representing the data. Thus, it is reasonable to ignore those features in order to reduce the dimensionality of the regression model. This technique, called pruning, largely alleviates the problems arising from the curse of dimensionality.
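The pruning rule itself is a one-line comparison: features whose learned precision exceeds a threshold have weights pinned near zero and are dropped. The precision values and threshold below are made-up illustrations, not outputs of the algorithm in the text.

```python
# ARD-style pruning sketch: keep only the features with small precision.
import numpy as np

alpha = np.array([1.2, 0.8, 5.1e4, 2.3, 9.7e6])  # hypothetical learned precisions
threshold = 1e3                                  # assumed pruning threshold
keep = alpha < threshold
print(np.flatnonzero(keep))                      # indices of retained features
```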

The joint prior distribution of all the model parameters is:

$$p(W, \mathcal{A}, \beta) = p(\beta_y)\prod_{m=1}^{M} p(\beta_{f_m})\, p(w_m \mid \alpha_m)\prod_{i=1}^{d+1} p(\alpha_{mi}) \qquad (12)$$

To summarize, the model has hidden variables $\{F, W, \mathcal{A}, \beta\}$ and parameters $\theta = \{c_m, \Lambda_m\}_{m=1}^{M}$.

### 2.3 Variational Inference of Hidden Variables and Models Parameters

We now want to infer the posterior probability of the hidden variables and estimate the model parameters $\theta$. One of the most popular methods to do so is maximum likelihood (ML). According to this approach, the ML estimate is obtained as

$$\theta^{\mathrm{ML}} = \arg\max_{\theta}\; p(y \mid X, \theta) \qquad (13)$$

As mentioned earlier, we use the EM algorithm to solve the ML estimation problem, and follow the exposition in Neal1998 and Bishop2006 . More specifically, we employ the variational EM framework as described below.

We rewrite the log-likelihood function as

$$\ln p(y \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p) \qquad (14)$$

with

$$\mathcal{L}(q, \theta) = \int q(Z)\,\ln\frac{p(y, Z \mid \theta)}{q(Z)}\, dZ \qquad (15)$$

and

$$\mathrm{KL}(q \,\|\, p) = -\int q(Z)\,\ln\frac{p(Z \mid y, \theta)}{q(Z)}\, dZ \qquad (16)$$

where $q(Z)$ is an arbitrary probability distribution over the hidden variables $Z$, and $\mathrm{KL}(q\|p)$ is the Kullback-Leibler (KL) divergence between $q(Z)$ and the posterior $p(Z \mid y, \theta)$. Since $\mathrm{KL}(q\|p) \geq 0$, we have $\ln p(y \mid \theta) \geq \mathcal{L}(q, \theta)$. Hence, $\mathcal{L}(q, \theta)$ is a lower bound of the log-likelihood function. The EM algorithm maximizes the lower bound using a two-step iterative procedure. Assume that the current value of the parameters is $\theta^{old}$. The E-step maximizes the lower bound with respect to $q(Z)$, which happens when $q(Z) = p(Z \mid y, \theta^{old})$. In this case, the lower bound is equal to the log-likelihood because $\mathrm{KL}(q\|p) = 0$. In the M-step, we keep $q(Z)$ constant, and maximize the lower bound with respect to the parameters to find a new value $\theta^{new}$.
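The E/M alternation can be illustrated on a toy problem: EM for a two-component 1-D Gaussian mixture with unit variances and equal, fixed mixing weights. The lower-bound argument above shows up numerically as a data log-likelihood that never decreases. All data and initial values below are illustrative.

```python
# EM for the two means of a 1-D Gaussian mixture; the log-likelihood
# is monotonically non-decreasing across iterations.
import numpy as np

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(2.0, 1.0, 150)])
mu = np.array([-0.5, 0.5])  # initial guess for the two means

def loglik(mu):
    comp = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) / np.sqrt(2 * np.pi)
    return float(np.log(0.5 * comp.sum(axis=1)).sum())

history = [loglik(mu)]
for _ in range(30):
    # E-step: responsibilities q(z) = p(z | x, mu_old) tighten the bound.
    comp = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
    r = comp / comp.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete-data log-likelihood over mu.
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    history.append(loglik(mu))
```

After a few iterations the means settle near the true component centers at $\pm 2$, and `history` is non-decreasing throughout, as guaranteed by the bound decomposition in (14)-(16).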

Now, if we substitute $q(Z) = p(Z \mid y, \theta^{old})$ in the lower bound and expand (15), we get

$$\mathcal{L}(q, \theta) = \int p(Z \mid y, \theta^{old})\,\ln p(y, Z \mid \theta)\, dZ - \int p(Z \mid y, \theta^{old})\,\ln p(Z \mid y, \theta^{old})\, dZ \qquad (17)$$

where the second term on the right-hand side is the entropy of $q(Z)$ and does not depend on $\theta$. The first term, which is usually named $Q(\theta, \theta^{old})$, is the expectation of the likelihood of the complete data $(y, Z)$. $Q(\theta, \theta^{old})$ is commonly maximized in the M-step of the EM algorithm in the signal processing literature. To summarize, the two steps of the EM algorithm are,

$$\text{E-step:} \quad q(Z) \leftarrow p(Z \mid y, \theta^{old}) \qquad (18)$$

$$\text{M-step:} \quad \theta^{new} = \arg\max_{\theta}\; Q(\theta, \theta^{old}) \qquad (19)$$

In the variational EM algorithm, we approximate the posterior with a factorized function. In this approximation, the hidden variable $Z$ is partitioned into $K$ disjoint groups $Z_1, \dots, Z_K$, and $q(Z)$ is factorized in the same way, $q(Z) = \prod_{j=1}^{K} q_j(Z_j)$. Given this assumption, the lower bound can be formulated as

$$\mathcal{L}(q, \theta) = \int q_j(Z_j)\,\ln \tilde{p}(y, Z_j \mid \theta)\, dZ_j - \int q_j(Z_j)\,\ln q_j(Z_j)\, dZ_j + \mathrm{const} \qquad (20)$$

where

$$\ln \tilde{p}(y, Z_j \mid \theta) = \mathbb{E}_{i \neq j}\!\left[\ln p(y, Z \mid \theta)\right] + \mathrm{const} \qquad (21)$$

The bound in (20) is maximized when the KL distance becomes zero, which is the case for $q_j(Z_j) = \tilde{p}(y, Z_j \mid \theta)$. In other words, the optimal distribution is obtained from the following equation,

$$\ln q_j^{*}(Z_j) = \mathbb{E}_{i \neq j}\!\left[\ln p(y, Z \mid \theta)\right] + \mathrm{const} \qquad (22)$$

To summarize, the two steps of the variational EM algorithm are given by,

$$\text{E-step:} \quad \ln q_j^{new}(Z_j) = \mathbb{E}_{i \neq j}\!\left[\ln p(y, Z \mid \theta^{old})\right] + \mathrm{const}, \quad j = 1, \dots, K \qquad (23)$$

$$\text{M-step:} \quad \theta^{new} = \arg\max_{\theta}\; \mathbb{E}_{q}\!\left[\ln p(y, Z \mid \theta)\right] \qquad (24)$$

These equations are the set of consistency conditions for the maximum of the lower bound under the factorized approximation of the posterior. They do not give us an explicit solution, as each factor depends on the other factors. Hence, we need to iterate through all the factors and replace each of them one-by-one with its revised version. We now derive the updates for every factor of the hidden variables by considering the factorized posterior distribution,

$$q(F, W, \mathcal{A}) = q(F)\prod_{m=1}^{M} q(w_m, \alpha_m) \qquad (25)$$

where $W = \{w_m\}_{m=1}^{M}$ and $\mathcal{A} = \{\alpha_m\}_{m=1}^{M}$.

Note that assuming factorization between $F$ and $(W, \mathcal{A})$ automatically results in a factorized posterior distribution over all $w_m$ and $\alpha_m$, and every element of $F$. To approximate the natural logarithm of the posterior distribution for each factor, we take the expected value of the complete-data log-likelihood over all the variables except the one that we are solving for. In the resulting expression, we only keep the terms containing the variable of interest. By doing so, the variational Bayes update equations for the posterior mean and covariance of the parameters of each linear model are found in Equations (26) through (43).

$$\ln q^{*}(w_m) = \mathbb{E}\!\left[\sum_{n=1}^{N}\ln\mathcal{N}\!\left(f_{nm} \mid w_m^{\top}\phi_{nm},\, \beta_{f_m}^{-1}\right) + \ln\mathcal{N}\!\left(w_m \mid 0,\, A_m^{-1}\right)\right] + \mathrm{const} \qquad (26)$$

In the rest of the manuscript, we use the $\langle\cdot\rangle$ symbol to refer to the first moment of the approximate posterior of a variable. For instance, $\langle w_m \rangle = \mu_{w_m}$ and $\langle \alpha_{mi} \rangle = \hat{a}_{\alpha}/\hat{b}_{\alpha_{mi}}$. Moreover, $\langle w_m w_m^{\top} \rangle = \Sigma_{w_m} + \mu_{w_m}\mu_{w_m}^{\top}$.

The posterior distribution of $w_m$ is a normal distribution; therefore, $\ln q^{*}(w_m)$ is a quadratic function of $w_m$, which we refer to as $g(w_m)$. From (2), we know that the negative inverse of the covariance matrix is equal to the second derivative of $g$ with respect to $w_m$. The derivatives of the right-hand side of (26) are:

$$\frac{\partial g}{\partial w_m} = \langle\beta_{f_m}\rangle\sum_{n=1}^{N}\phi_{nm}\left(\langle f_{nm}\rangle - \phi_{nm}^{\top}w_m\right) - \langle A_m \rangle\, w_m \qquad (27)$$

$$\frac{\partial^{2} g}{\partial w_m\,\partial w_m^{\top}} = -\langle\beta_{f_m}\rangle\sum_{n=1}^{N}\phi_{nm}\phi_{nm}^{\top} - \langle A_m \rangle \qquad (28)$$

Moreover, $\partial g/\partial w_m = 0$ at the mean; hence, by setting (27) equal to zero and solving for $w_m$, we get the mean of the posterior:

$$\mu_{w_m} = \langle\beta_{f_m}\rangle\, \Sigma_{w_m}\sum_{n=1}^{N}\phi_{nm}\,\langle f_{nm}\rangle \qquad (29)$$

$$\Sigma_{w_m} = \left(\langle\beta_{f_m}\rangle\sum_{n=1}^{N}\phi_{nm}\phi_{nm}^{\top} + \langle A_m \rangle\right)^{-1} \qquad (30)$$

$\langle A_m \rangle$ is a diagonal matrix, and we refer to the elements on its main diagonal by $\langle\alpha_{mi}\rangle$, where $i = 1, \dots, d+1$.

Similarly, we find the variational Bayes update for every element $\alpha_{mi}$ of the precision vector $\alpha_m$.

$$\ln q^{*}(\alpha_{mi}) = \mathbb{E}\!\left[\ln\mathcal{N}\!\left(w_{mi} \mid 0,\, \alpha_{mi}^{-1}\right)\right] + \ln\mathrm{Gam}(\alpha_{mi} \mid a_0, b_0) + \mathrm{const} \qquad (31)$$

Hence,

$$\hat{a}_{\alpha} = a_0 + \frac{1}{2} \qquad (32)$$

$$\hat{b}_{\alpha_{mi}} = b_0 + \frac{\langle w_{mi}^{2} \rangle}{2} \qquad (33)$$

We observe that $\hat{a}_{\alpha}$ is the same for all the models ($m$) and every individual element ($i$) of the precision vector $\alpha_m$. Therefore, we use $\hat{a}_{\alpha}$ instead of $\hat{a}_{\alpha_{mi}}$ later in Algorithm 1.

Similarly, we derive the variational Bayes update for the posterior of the precision parameter $\beta_{f_m}$ by computing

$$\ln q^{*}(\beta_{f_m}) = \mathbb{E}\!\left[\sum_{n=1}^{N}\ln\mathcal{N}\!\left(f_{nm} \mid w_m^{\top}\phi_{nm},\, \beta_{f_m}^{-1}\right)\right] + \ln\mathrm{Gam}(\beta_{f_m} \mid a_0, b_0) + \mathrm{const} \qquad (34)$$

The updates are

$$\hat{a}_{\beta} = a_0 + \frac{N}{2} \qquad (35)$$

$$\hat{b}_{\beta_m} = b_0 + \frac{1}{2}\sum_{n=1}^{N}\left\langle\left(f_{nm} - w_m^{\top}\phi_{nm}\right)^{2}\right\rangle \qquad (36)$$

In (36), the summation term is the approximate variance of the local model $m$. Again, since $\hat{a}_{\beta}$ is identical for all the local models, we use $\hat{a}_{\beta}$ instead of $\hat{a}_{\beta_m}$.

Finally, the variational Bayes update equations for the posterior mean and covariance of each local model target $f_n$ are found through the following steps:

$$\ln q^{*}(f_n) = \mathbb{E}\!\left[\ln\mathcal{N}\!\left(y_n \mid \mathbf{1}^{\top}f_n,\, \beta_y^{-1}\right) + \sum_{m=1}^{M}\ln\mathcal{N}\!\left(f_{nm} \mid w_m^{\top}\phi_{nm},\, \beta_{f_m}^{-1}\right)\right] + \mathrm{const} \qquad (37)$$

where $f_n = [f_{n1}, \dots, f_{nM}]^{\top}$, $\mathbf{1}$ is the $M$-dimensional vector of ones, $B = \mathrm{diag}\!\left(\langle\beta_{f_1}\rangle, \dots, \langle\beta_{f_M}\rangle\right)$, and $\mu_n = \left[\langle w_1 \rangle^{\top}\phi_{n1}, \dots, \langle w_M \rangle^{\top}\phi_{nM}\right]^{\top}$.

$$\ln q^{*}(f_n) = -\frac{\langle\beta_y\rangle}{2}\left(y_n - \mathbf{1}^{\top}f_n\right)^{2} - \frac{1}{2}\left(f_n - \mu_n\right)^{\top}B\left(f_n - \mu_n\right) + \mathrm{const} \qquad (38)$$

We re-write the right-hand side of (38) as an exponential function ($q^{*}(f_n) \propto e^{g(f_n)}$), where $g$ is a quadratic function of $f_n$ and is defined as

$$g(f_n) = -\frac{\langle\beta_y\rangle}{2}\left(y_n - \mathbf{1}^{\top}f_n\right)^{2} - \frac{1}{2}\left(f_n - \mu_n\right)^{\top}B\left(f_n - \mu_n\right) \qquad (39)$$

To find the parameters of $q^{*}(f_n)$, we compute the first and second derivatives of $g$:

$$\frac{\partial g}{\partial f_n} = \langle\beta_y\rangle\,\mathbf{1}\left(y_n - \mathbf{1}^{\top}f_n\right) - B\left(f_n - \mu_n\right) \qquad (40)$$

$$\frac{\partial^{2} g}{\partial f_n\,\partial f_n^{\top}} = -\langle\beta_y\rangle\,\mathbf{1}\mathbf{1}^{\top} - B \qquad (41)$$

Hence, the covariance matrix is $\Sigma_{f} = \left(B + \langle\beta_y\rangle\,\mathbf{1}\mathbf{1}^{\top}\right)^{-1}$. Using the Sherman-Morrison formula, we reformulate the covariance matrix as

$$\Sigma_{f} = B^{-1} - \frac{\langle\beta_y\rangle\, B^{-1}\mathbf{1}\mathbf{1}^{\top}B^{-1}}{1 + \langle\beta_y\rangle\,\mathbf{1}^{\top}B^{-1}\mathbf{1}} \qquad (42)$$

Let us suppose $s = 1 + \langle\beta_y\rangle\sum_{m=1}^{M}\langle\beta_{f_m}\rangle^{-1}$. Then, the diagonal elements of $\Sigma_f$ are $\langle\beta_{f_m}\rangle^{-1} - \langle\beta_y\rangle\,\langle\beta_{f_m}\rangle^{-2}/s$. Note that the covariance matrix of the local models does not depend on the individual samples.
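The Sherman-Morrison step used for the covariance above is easy to verify numerically for a diagonal matrix plus a rank-one update (the sizes and values below are arbitrary):

```python
# Check (A + u u^T)^{-1} = A^{-1} - A^{-1} u u^T A^{-1} / (1 + u^T A^{-1} u).
import numpy as np

rng = np.random.default_rng(4)
A = np.diag(rng.uniform(1.0, 2.0, 5))   # diagonal, invertible (plays the role of B)
u = rng.standard_normal(5)              # rank-one direction

Ainv = np.diag(1.0 / np.diag(A))
direct = np.linalg.inv(A + np.outer(u, u))
sherman = Ainv - (Ainv @ np.outer(u, u) @ Ainv) / (1.0 + u @ Ainv @ u)
print(np.allclose(direct, sherman))     # True
```

Because $B$ is diagonal, the Sherman-Morrison form replaces an $M \times M$ inversion with elementwise operations, which is why the reformulation is worthwhile here.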

The maximum of $g$ is attained where its first derivative vanishes. Setting (40) to zero and solving for $f_n$ gives us

$$\mu_{f_n} = \Sigma_{f}\left(\langle\beta_y\rangle\, y_n\,\mathbf{1} + B\,\mu_n\right) \qquad (43)$$
