1 Introduction
The present paper is concerned with the development of probabilistic coarsegrained models for highdimensional dynamical systems with a view of enabling multiscale simulation. We describe a unified treatment of complex problems described by large systems of deterministic or stochastic ODEs and/or large number of data streams. Such systems arise frequently in modern multiphysics applications either due to the discrete nature of the system (e.g. molecular dynamics) or due to discretization of spatiotemporal models (e.g. PDEs):
(1) 
where (e.g. , ). Stochastic versions are also frequently encountered:
(2) 
where is a driving stochastic process (i.e. Wiener process). Uncertainties could also appear in the initial conditions that accompany the aforementioned systems of equations.
Even though the numerical solution of (stochastic) ODEs is a wellstudied subject and pertinent computational libraries are quite mature, traditional schemes are impractical or infeasible for systems which are highdimensional and exhibit a large disparity in scales. This is because most numerical integrators must use timesteps of the order of the fastest scales which precludes solutions over long time ranges that are of interest for physical and engineering purposes. In the context of atomistic simulations, practically relevant time scales exceed typical integration steps of by several orders of magnitude [3]. Furthermore, when numerical solutions of transient PDEs are sought, resolution and accuracy requirements give rise to systems with more than degrees of freedom [102, 22, 128, 73] where the integration time steps are slaved by fast reaction rates or high oscillation frequencies. This impedes their solution and frequently constitutes computationally infeasible other important tasks such as stability analysis, sensitivity, design and control.
Multiscale dynamical systems exist independently of the availability of mathematical models. Large numbers of time series appear in financial applications, meteorology, remote sensing where the phenomena of interest unfold also over a large range of time scales [139, 80]. A wealth of time series data is also available in experimental physics and engineering which by themselves or in combination with mathematical models can be useful in analyzing underlying phenomena [106, 88, 108] by deriving reduced, predictive descriptions.
Quite frequently the time evolution of all the observables is irrelevant for physical and practical purposes and the analysis is focused on a reduced set of variables or reaction coordinates obtained by an appropriate mapping . The goal is then to identify a closed, deterministic or stochastic system of equations with respect to , e.g. :
(3) 
In the context of equilibrium theormodynamics where enmsemble averages with respect to the invariant distribution of are of interest, coarsegraining amounts to freeenergy computations [25]. In the nonequilibrium case and when an invariant distribution exists, a general approach for deriving effective dynamics is based on MoriZwanzig projections [146, 68, 27, 28, 29, 34]
. Other powerful numerical approaches to identify the dynamical behavior with respect to the reduced coordinates include transition path sampling, the transfer operator approach, the nudged elastic band, the string method, Perron cluster analysis and spectral decompositions
[39, 42, 43, 49, 40, 105]. Marked efforts in chemical kinetics have led to an array of computational tools such as computational singular perturbation [98, 99], the intrinsic lowdimensional manifold approach [104, 144] and others [120, 113, 90]. Notable successes in overcoming the timescale dilemma have also been achieved in the context of MD simulations [97, 134, 135, 132] (or Hamiltonian systems in general [112, 100, 127]).In several problems, physical or mathematical arguments have led analysts to identify a few, salient features and their interdependencies that macroscopically describe the behavior of very complex systems consisting of a huge number of individuals/agents/components/degrees of freedom. These variables parameterize a lowdimensional, attracting, invariant, “slow” manifold characterizing the longterm process dynamics [71]. Hence the apparent complexity exhibited in the highdimensionality and the multiscale character of the original model is a pretext of a much simpler, latent structure that, if revealed, could make the aforementioned analysis tasks much more tractable. The emergence of macroscopic, coherent behavior has been the foundation of coarsegrained dynamic models that have been successful in a wide range of applications. The coarsegrained parameterization and associated model depend on the analysis objectives and particularly on the time scale one wishes to make predictions. Modern approaches with general applicability such as the Equationfree method [92] or Heterogeneous Multiscale Method (HeMM,[47]) are also based on the availability of reduced coordinates and in the case of HeMM of a macroscopic model which is informed and used in conjunction with the microscale model.
Largely independently of the developments in the fields of computational physics and engineering, the problem of deriving, predictive reducedorder models for a large number of time series that potentially exhibit multiple scales has also been addressed in statistics and machine learning communities
[140, 53] with applications in network analysis [33], environmetrics [141], sensor network monitoring [142, 117], moving object tracking [4], financial data analysis [5], computer model emulation [103]. Significant advances have been achieved in modeling [131], forecasting [143] and developing online, scalable algorithms [51, 126, 91, 94, 89, 145]. that are frequently based on the discovery of “hidden variables” that provide insight to the intrinsic structure of streaming data [54, 35, 110, 109, 85].The present paper proposes a datadriven alternative that is able to automatically coarsegrain highdimensional systems without the need of preprocessing and availability of physical insight. The data is most commonly obtained by simulations of the most reliable, finestscale (microscopic) model available. This is used to infer a lowerdimensional description that captures the dynamic evolution of the system at a coarser scale (i.e. a macroscopic model). The majority of available techniques address separately the problems of identifying appropriate reduced coordinates and the effective dynamics in this lowerdimensional representation. It is easily understood that the solution of one affects the other. We propose a general framework where these two problems are simultaneously solved and coarsegrained models are built from the ground up. We propose procedures that concurrently infer the macroscopic dynamics and their mapping the highdimensional, finescale description. As a result no errors or ambiguity are introduced when the finescale model needs to be reinitialized in order to obtain additional simulation data. To that end, we advocate a largely unexplored in computational physics perspective based on the Bayesian paradigm which provides a rigorous foundation for learning from data. It is capable of quantifying inferrential uncertainties and, more importantly, uncertainty due to information loss in the coarsegraining process.
We present a Bayesian statespace model where the reduced, coarsegrained dynamics are parametrized by tractable, lowdimensional dynamical models. These can be viewed as experts offering opinions on the evolution of the highdimensional observables. Each of these modules could represent a single latent regime and would therefore be insufficient by itself to explain the whole system. As is often the case with real experts, their opinions are valid under very specific regimes. We propose therefore a framework for dynamically synthesizing such models in order to obtain an accurate global representation that retains its interpretability and computational tractability.
An important contribution of the paper, particularly in view of enabling simulations of multiscale systems, is online inference algorithms based on Sequential Monte Carlo that scale linearly with the dimensionality of the observables (Equation (1)). These allow the recursive assimilation of data and recalibration of the coarsegrained dynamics. The Bayesian framework adopted provides probabilistic predictive estimates that can be employed in the context of adaptive timeintegration. Rather than determining integration time steps based on traditional accuracy and stability metrics, we propose using measures of the predictive uncertainty in order to decide how long into the future the coarsegrained model can be used. When the uncertainty associated with the predicitve estimates exceeds the analyst’s tolerance, the finescale dynamics can be consistently reinitialized in order to obtain additional data that sequentially update the coarsegrained model.
In agreement with some recently proposed methodologies [92, 93], the datadriven strategy can seamlessly interact with existing numerical integrators that are wellunderstood and reliably implemented in several legacy codes. In addition, it is suitable for problems where observational/experimental data must be fused with mathematical descriptions in a rigorous fashion and lead to improved analysis and prediction tools.
The structure of the rest of the paper is as follows. In Section 2 we describe the proposed framework in relation with the stateoftheart in dimensionality reduction. We provide details of the probabilistic model proposed in the context of Bayesian StateSpace models in Section 2.2. Section 2.3
is devoted to inference and learning tasks which involve a locallyoptimal Sequential Monte Carlo sampler and an online ExpectationMaximization scheme. The utilization of the coarsegrained dynamics in the context of a Bayesian (adaptive) timeintegration scheme is discussed in
2.4 and numerical examples are provided in Section 3.2 Proposed Approach
2.1 From staticlinear to dynamicnonlinear dimensionality reduction
The inherent assumption of all multiscale analysis methods is the existence of a lowerdimensional parameterization of the original system with respect to which the dynamical evolution is more tractable at the scales of interest. In some cases these slow variables can be identified a priori and the problem reduces to finding the necessary closures that will give rise to a consistent dynamical model. In general however one must identify the reduced space as well as the dynamics within it.
A prominent role in these efforts has been held by Principal Component Analysis (PCA) based methods. With small differences and depending on the community other terms such as Proper Orthogonal Decomposition (POD) or KarhunenLoève expansion (KL), Empirical Orthogonal Functions (EOF) have also been used. PCA finds its roots in the early papers by Pearson
[111] and Hotelling [84] and was originally developed as a staticdimensionality reduction technique. It is based on projections on a reduced basis identified by the leading eigenvectors of the covariance matrix
. In the dynamic case and in the absence of closed form expressions for the actual covariance matrix, samples of the process at distinct time instants are used in order to obtain an estimate of the covariance matrix:(4) 
where is the empirical mean. If there is a spectral gap after the first eigenvalues and is the matrix whose columns are the leading normalized eigenvectors of then the reducedorder model is defined with respect to . The reduced dynamics can be identified by a Galerkin projection (or a PetrovGalerkin projection) of the original ODEs in Equation (1):
(5) 
Hence the reduced space
is approximated by a hyperplane in
and the projection mapping linear (Figure 1(a)). While it can be readily shown that the projection adopted is optimal in the mean square sense for stationary Gaussian processes, it is generally not so in cases where nonGaussian processes or other distortion metrics are examined. The application of PCAbased techniques, to highdimensional, multiscale dynamical systems poses several modeling limitations. Firstly, the reduced space might not be sufficiently approximated by a hyperplane of dimension . Even though this assumption might hold locally, it is unlikely that this will be a good global approximation. Alternatively, the dynamics of the original process might be adequately approximated ondimensional hyperplane but this hyperplane might change in time. Secondly, despite the fact that the projection on the subspace spanned by the leading eigenvectors captures most of the variance of the original process, in cases where this variability is due to the fast modes, there is no guarantee that
will account for the longrange, slow dynamics which is of primary interest in multiscale systems. Thirdly, the basic assumption in the estimation of the covariance matrix, is that the samples are drawn from the same distribution, i.e. that the process is stationary. A lot of multiscale problems however involve nonstationary dynamics (e.g. nonequilibrium MD [78, 30]). Hence even if a stationary reducedorder process provides a good, local, approximation to , it might need to be updated in time. Apart from the aforementioned modeling issues, significant computational difficulties are encountered for highdimensional systems () and large datasets () as the leading eigenvectors of large matrices (of dimension proportional to or ) need to be evaluated. This effort must be repeated, if more samples become available (i.e. increases) and an update of the reducedorder model is desirable. Recent efforts have concentrated on developing online versions [137] that circumvent this problem.The obvious extension to the linear projections of PCA is nonlinear dimensionality reduction techniques. These have been the subject of intense research in statistics and machine learning in recent years ([121, 115, 130, 44, 123, 10, 9]) and fairly recently have found their way to computational physics and multiscale dynamical systems (e.g. [32, 96, 107, 59]
). They are generally based on calculating eigenvectors of an affinity matrix of a weighted graph. While they circumvent the limiting, linearity assumption of standard PCA, they still assume that the underlying process is stationary (Figure
1(b)). Even though the system’s dynamics might be appropriately tracked on a lowerdimensional subspace for a certain time period, this might not be invariant across the whole time range of interest. The identification of the dynamics on the reducedspace is not as straightforward as in standard PCA and in most cases, a deterministic or stochastic model is fit directly to the projected data points [31, 50, 58]. More importantly since the inverse mapping from the manifold to is not available analytically, approximations have to be made in order to find preimages in the dataspace [11, 50]. From a computational point of view, the cost of identifying the projection mapping is comparable to standard PCA as an eigenvalue problem on a matrix has to be solved. Updating those eigenvalues and the nonlinear projection operator in cases where additional data become available implies a significant computational overhead although recent efforts [122] attempt to overcome this limitation.A common characteristic of the aforementioned techniques is that even though the reduced coordinates are learned from a finite amount of simulation data, there is no quantification of the uncertainty associated with these inferences. This is a critical component not only in cases where multiple sets of reduced parameters and coarsegrained models are consistent with the data, but also for assessing errors associated with the analysis and prediction estimates. It is one of the main motivations for adopting a probabilistic approach in this project. Statistical models can naturally deal with stochastic systems that frequently arise in a lot of applications. Most importantly perhaps, even in cases where the finescale model is deterministic (e.g. Equation (1)), a stochastic reduced model provides a better approximation that can simultaneously quantify the uncertainty arising from the information loss that takes place during the coarsegraining process [52, 95].
A more general perspective is offered by latent variable models where the observed data (experimental or computationally generated) is augmented by a set of hidden variables [13]. In the case of highdimensional, multiscale dynamical systems, the latent model corresponds to a reducedorder process that evolves at scales of practical relevance.
Complex distributions over the observables can be expressed in terms of simpler and tractable joint distributions over the expanded variable space. Furthermore,
structural characteristics of the original, highdimensional process can be revealed by interpreting the latent variables as generators of the observables.In that respect, a general setting is offered by Hidden Markov Models (HMM,
[64]) or more generally StateSpace Models (SSM) [18, 65, 80]. These assume the existence of an unobserved (latent) process described by a (stochastic) ODE:(6) 
which gives rise to the observables as:
(7) 
where and are unknown stochastic processes (to be inferred from data) and ,
are unknown measurable functions. The transition equation defines a prior distribution on the coarsegrained dynamics whereas the emission equation, the mapping that connects the reducedorder representation with the observable dynamics. The object of Bayesian inference is to learn the unobserved (unknown) model parameters from the observed data. Hence the coarsegrained model and its relation to the observable dynamics are inferred from the data.
The form of Equations (6) and (7) affords general representations. Linear and nonlinear PCA models arise as special cases by appropriate selection of the functions and random processes appearing in the transition and emission equations. Note for example that the transition equation (Equation (6)) for in the case of the PCAbased models reviewed earlier is given by Equation (5) and the emission equation (Equation (7)) that relates latent and observed processes is linear, deterministic and specified by the matrix of leading eigenvectors .
An extension to HMM is offerered by switchingstate models [74, 24, 72, 124] which can be thought of as dynamical mixture models [23, 66]. The latent dynamics consist of a discrete process that takes values, each corresponding to a distinct dynamical behavior. This can be represented by an
dimensional vector
whose entries are zero except for a single one which is equal to one and represents the active mode/cluster. Most commonly, the timeevolution of is modeled by a firstorder stationary Markov process:(8) 
where is the transition matrix and . In addition to , processes parameterize the reducedorder dynamics (see also discussion in section 2.2). Each is activated when . In the linear version (Switching Linear Dynamic System, SLDS ^{1}^{1}1sometimes referred to as jumplinear or conditional Gaussian models) and conditioned on , the observables arise by a projection from the active as follows:
(9) 
where are matrices () and is a positive definite matrix. Such models provide a natural, physical interpretation according to which the behavior of the original process is segmented into regimes or clusters, the dynamics of which can be lowdimensional and tractable. From a modeling point of view, the idea of utilizing a mixture of simple models provides great flexibility [16, 14, 125, 70] as it can be theorized that given a large enough number of such components, any type of dynamics can be sufficiently approximated. In practice however, a large number might be needed, resulting in complex models containing a large number of parameters.
Such mixture models have gained prominence in recent years in the machine learning community. In [15] for example, a dynamic mixture model was used to analyze a huge number of time series, each corresponding to a word in the English vocabulary as they appear in papers in the journal Science. The latent discrete variables represented topics and each topic implied a distribution on the space of words. As a result, not only a predictive summary (dimensionality reduction) of the highdimensional observables was achieved but also an insightful deconstruction of the original time series was made possible. In fact current research in statistics has focused on infinite mixture models where the number of components can be automatically inferred from the data ([129, 21, 12, 56, 57]). In the context of computer simulations of highdimensional systems, such models have been employed by [55, 82, 80, 79, 81] where maximum likelihood techniques were used to learn the model parameters.
In the next sections we present a novel model that generalizes SLDS. Unlike mixture models which assume that is the result of a single reducedorder process at a time, we propose a partialmembership model (referred to henceforth as PartialMembership Linear Dynamic System, PMLDS) which allows observables to have fractional memberships in multiple clusters. The latent, building blocks are experts [86, 87, 77, 75] which, on their own, provide an incomplete, biased prediction but when their “opinions” are appropriately synthesized, they can give rise to a highly accurate aggregate model.
From a modeling perspective such an approach has several appealing properties. The integrated coarsegrained model can be interpretable and lowdimensional even for large, multiscale systems as its expressive ability does not hinge upon the complexity of the individual components but rather is a result of its factorial character ([67]). Intricate dynamical behavior can be captured and decomposed in terms of simple building blocks. It is highlysuited for problems that lack scale separation and where the evolution of the system is the result of phenomena at a cascade of scales. Each of these scales can be described by a latent process and the resulting coarsegrained model will account not only for the slow dynamics but also quantify the predictive uncertainty due to the condensed, fastvarying features.
From an algorithmic point of view we present parallelizable, online inference/learning schemes, which can recursively update the estimates produced as more data become available i.e. if the time horizon of the observables increases. Unlike some statistical applications where long time series are readily available, in the majority of problems involving computational simulations of highdimensional, multiscale systems, data is expensive (at least over large time horizons) as they imply calls to the microscopic solvers. The algorithms presented are capable of producing predictive estimates “on the fly” and if additional data is incorporated, they can readily update the model parameters. In addition, such schemes can take advantage of the natural tempering effect of introducing the data sequentially which can further facilitate the solution of the global estimation problem. More importantly perhaps, the updating schemes discussed have linear complexity with respect to the dimensionality of the original process .
2.2 PartialMembership Linear Dynamic System
We present a hierarchical Bayesian framework which promotes sparsity, interpretability and efficiency. The framework described can integrate heterogeneous building blocks and allows for physical insight to be introduced on a casebycase basis. When dealing with highdimensional molecular ensembles for example, each of these building blocks might be an (overdamped) Langevin equation with a harmonic potential [17, 80, 83]. It is obvious that such a simplified model would perhaps provide a good approximation under specific, limiting conditions (e.g. at a persistent metastable state) but definitely not across the whole time range of interest. Due to its simple parameterization and computational tractability, it can easily be trained to represent one of the “experts” in the integrated reducedmodel. It is known that these models work well under specific regimes but none of them gives an accurate global description. In the framework proposed, they can however be utilized in a way that combines their strengths, but also probabilistically quantifies their limitations.
A transient, nonlinear PDE can be resolved into several linear PDEs whose simpler form and parameterization makes them computational tractable over macroscopic time scales and permits a coarser spatial discretization. Their combination with timevarying characteristics can give rise to an accurate global approximation. Their simplicity and limited range of applicability would preclude their individual use. In the framework proposed however, these simple models would only serve as good local approximants and their inaccurate predictions would be synthesized into an accurate, global model.
2.2.1 Representations with simple building blocks
We describe a probabilistic, dynamic, continuoustime, generative model which relates a sequence the observations at discrete time instants with a number of hidden processes. The proposed model consists of hidden processes () which are assumed to evolve independently of each other and are described by a set of (stochastic) ODEs:
(10) 
This equation essentially implies a prior distribution on the space of hidden processes parameterized by a set of (unknown a priori) parameters . It should be noted that while the proposed framework allows for any type of process in Equation (10), it is desirable that these are simple, in the sense that the parameters are lowdimensional and can be learned swiftly and efficiently. We list some desirable properties of the prior models [125]:

Stationarity: Unless specific prior information is available, it would be unreasonable to impose a time bias on the evolution of any of the reduced dynamics processes. Hence it is important that the models adopted are a priori stationary. Note that the posterior distributions might still exhibit nonstationarity.

Correlation Decay: It is easily understood that for any and , the correlation and should decay monotonically as goes to . This precludes models that do not explicitly account for the time evolution of the latent processes and assume that hidden states are not timedependent (e.g. static PCA models).

Other: Although this is not necessary, we adopt a continuous time model in the sense of [136] with an analytically available transition density which allows inference to be carried out seamlessly even in cases where the observables are obtained at nonequidistant times. As a result the proposed framework can adapt to the granularity of the observables and also provide exact probabilistic predictions at any time resolution.
The logistic normal distribution was used to model the weights associated with each of the hidden processes depicted in Figure
2. In particular, an isotropic OrnsteinUhlenbeck process with , and . Graphs depict the resulting observable time history (left column) and its histogram (right column) arising from the unobserved processes in Figure 2 and for three values of . It is noted that time histories exhibit fast and slow scales of the processes in Figure 2. Furthermore, by changing a single parameter (i.e. ) one can obtain two, three or a single metastable state (right column  peaks of the histogram).Although more complex models can be adopted we assume here that independent, isotropic OrnsteinUhlenbeck (OU) processes are used to model the hidden dynamics . The OU processes used comply with the aforementioned desiderata. In particular, the following parameterization is employed:
(11) 
where are Wiener processes (independent for each ), , and are positive definite matrices of dimension . The aforementioned model has a Gaussian invariant (stationary) distribution . The transition density denoted by for time separation is also a Gaussian where:
(12) 
It is not expected that simple processes on their own will provide good approximations to the essential dynamics exhibited in the data . In order to combine the dynamics implied by the processes in Equation (10), we consider an dimensional process such that and and define an appropriate prior. The coefficients express the weight or fractional membership to each process/expert at time [76]. We use the logisticnormal model [15] as a prior for . It is based on a Gaussian process, whose dynamics are also prescribed by an isotropic OU process:
(13) 
and the transformation:
(14) 
The invariant and transition densities of are obviously identical to the ones for with appropriate substitution of the parameters. The hidden processes and associated weights give rise to the observables as follows (compare with Equation (9)):
(15) 
where are matrices () and is a positive definite matrix. The aforementioned equation implies a series of linear projections on hyperplanes of dimension . The dynamics along those hyperplanes are dictated by a priori independent process . It is noted however the reduced dynamics are simultaneously described by all the hidden processes (Figure 4). This is in contrast to PCA methods where a single such projection is considered and mixture PCA models where even though several hidden processes are used, at each time instant it is assumed that a single one is active. Due to the factorial nature of the proposed model, multiple dynamic regimes can be captured by appropriately combining a few latent states. While mixture models (Figure 1(c)) provide a flexible framework, the number of hidden states might be impractically large. As it is pointed out in [67], in order to encode for example a time sequence with of information one would need distinct states. It is noted that even though linear projections are implied by Equation (15) and Gaussian noise is used, the resulting model for is nonlinear and nonGaussian as it involves the factorial combination of processes with which are a posteriori nonGaussian.
The parameters express the relative importance of the various reduced models or equivalently the degree to which each data point is associated with each of the reduced dynamics . It is important to note that the proposed model allows for time varying weights and can therefore account for the possibility of switching between different regimes of dynamics. Figure 3 depicts a simple example () which illustrates the flexibility of the proposed approach.
The unknown parameters of the coarsegrained model consist of:

dynamic variables denoted for notational economy by (i.e. , , for ).
2.3 Inference and learning
Inference in the probabilistic graphical model described involves determining the probability distributions associated with the unobserved (hidden) static
and dynamic parameters of the model. In the Bayesian setting adopted this is the posterior distribution of the unknown parameters of the coarsegrained model. Given the observations (computational or experimental) of the original, highdimensional process , we denote the posterior by :(16) 
The normalization constant is not of interest when sampling for or but can be quite useful for model validation purposes.
A telescopic decomposition holds for the likelihood which according to Equation (15) is given by:
(17) 
where the densities in the product are described in Equation (20). Equation (15) defines the likelihood which is basically the weighted product of the likelihoods under each of the hidden processes/experts :
(18) 
where the normalizing constant ensures that the density intergrates to one with respect to . According to Equation (15):
(19) 
The likelihood can be written in a more compact form as:
(20) 
where:
(21) 
and:
(22) 
The firstorder Markovian processes adopted for the prior modeling of the dynamic parameters (Equations (11), (13), 12) ) imply that:
(23) 
where is the prior on the initial condition which in this work is taken to be the stationary distribution of the underlying OU processes (see discussion in Section 2.2.1) and denoted for notational economy by .
The posterior encapsulates uncertainties arising from the potentially stochastic nature of the original process as well as due to the fact that a finite number of observations were used. The difficulty of the problem is that both the dynamic () and the static parameters () are unknown. We adopt a hybrid strategy whereby we sample from the full posterior for the dynamic parameters and provide point estimates for the static parameters . If uniform priors are used for then the procedure proposed reduces to a maximum likelihood estimation. Nonuniform priors have a regularization effect which can promote the identification of particular features.
While the hybrid strategy proposed is common practice in pertinent models ([64]), in the current framework it is also necessitated by the difficulty in sampling in the highdimensional state space of (note that the projection matrices in particular are of the dimension of the observables and ) as well as the need for scalability in the context of highdimensional systems. The static parameters are estimated by maximizing the logposterior.
(24) 
Maximization of is more complex than a standard optimization task as it involves integration over the unobserved dynamic variables . While maximization can be accelerated by using gradientbased techniques (e.g. gradient ascent), the dimensionality of makes such an approach impractical as it can be extremely difficult to scale the parameter increments. We propose therefore adopting an ExpectationMaximization framework (EM) which is an iterative, robust scheme that is guaranteed to increase the logposterior at each iteration ([41, 64]). It is based on constructing a series of increasing lower bounds of the logposterior using auxiliary distributions :
(25) 
It is obvious that this inequality becomes an equality when in place of the auxiliary distribution the posterior is selected. Given an estimate at step , this suggests iterating between an Expectation step (Estep) whereby we average with respect to to evaluate the lower bound:
(26) 
and a Maximization step (Mstep) with respect to (and in particular the first part in Equation (26) since the second does no not depend on ):
(27) 
As the optimal auxiliary distributions are intractable, we propose employing a Sequential Monte Carlo (SMC or particle filter, [45, 37]) scheme for estimating the expectations in the MStep, i.e. Equation (27). SMC samplers provide a parallelizable framework for nonlinear, nonGaussian filtering problems whereby the target distribution is represented with a population of particles and weights such that the expectation in Equation (27) can be approximated by:
(28) 
In section 2.3.1 we discuss a particle filter that takes advantage of the particular structure of the posterior and employs the locally optimal importance sampling distribution. Nevertheless, SMC samplers involve sequential importance sampling, and their performance decays with increasing as the dimension of the state space increases even when resampling and rejuvenation mechanisms are employed ([7]). Recent efforts based on exponential forgetting have shown that the accuracy of the approximation can be improved (while keeping the number of particles fixed) by employing smoothing ([69]) over a fixedlag in the past ([20]).
In this paper we make use of an approximate but highly efficient alternative proposed in [6, 7, 8]. This is based on the socalled splitdata likelihood (SDL) first discussed in [116], which consists of splitting the observations into blocks (overlapping or nonoverlapping) of length and using the pseudolikelihood which arises by assuming that these blocks are independent. It is shown in [7]
that this leads to an alternative KullbackLeibler divergence contrast function and under some regularity conditions that the set of parameters optimizing this contrast function includes the true parameter. Because the size of the blocks is fixed, the degeneracy of particle filters can be averted and the quality of the approximations can be further improved by applying a backward smoothing step over each block (
[69]). Let denote the index of the block of length considered and and . If . The likelihood is approximated by:(29) 
When has reached a stationary regime with invariant density , then for any , are identically distributed according to:
(30) 
In a batch EM algorithm using the splitdata likelihood and the block of data, the Mstep would involve maximization with respect to of (see also Equation (27)):
(31) 
We utilize an online EM algorithm where the iteration numbers coincide with the block index (i.e. ) which effectively implies that the estimates for are updated every time a new data block is considered. The expectation step (Estep) is replaced by a stochastic approximation ([119, 19]) while the Mstep is left unchanged. In particular, at iteration ():
(32) 
and update the value of the parameters as:
(33) 
The algorithm relies on a nonincreasing sequence of positive stepsizes such that and . In this work we adopted with . Naturally the integrals above over the hidden dynamic variables are estimated using SMCbased, particulate approximations of . For small the convergence will in general be slow as the splitblock likelihood assumption will be further from the truth. For larger , convergence is faster but the performance of the filter decays. For that purpose we also employed a backward smoothing filter over each block using the algorithm described in [69]. The computational cost of the smoothing algorithm is .
In practice, and in particular for the exponential distributions utilized in the proposed model (e.g. Equation (
20)), the EM iterations reduce to calculating a set of (multivariate) sufficient statistics . In particular, instead of the logposterior lower bound in Equation (32) we update the sufficient statistics as follows:(34) 
where denotes the set of sufficient statistics associated with the block of data . Specific details are provided in the Appendix. It is finally noted, that learning tasks in the context of the probabilistic model proposed, should also involve identifying the correct structure (e.g. the number of different experts ). While this problem poses some very challenging issues which are currently the topic of active research in various contexts (e.g. nonparametric methods), this paper is exclusively concerned with parameter learning. In section 3, we discuss Bayesian validation techniques for assessing quantitatively the correct model structure which are computationally feasible due to the efficiency of the proposed algorithms.
2.3.1 Locally optimal Sequential Monte Carlo samplers
In this section we discuss Monte Carlo approximations of the expectations appearing in Equation (34) with respect to the density . Note that in order to simplify the notation we consider an arbitrary block of length (e.g. )and do not explicitly indicate the iteration number of the EM algorithm. Hence the target density is:
(35) 
where the dynamic variables are (Equation (21)). Based on earlier discussions, the evolution dynamics of these variables are independent i.e.:
(36) 
and:
(37) 
Since there is a deterministic relation between and (Equation (14)) we use them interchangeably. In particular we use in the evolution equations since the initial and transition densities are Gaussian (Equation (13)) and in the likelihood densities as the expressions simplify in Equation (15)). The initial and transition densities for are also Gaussian. Given that are a priori independent, we have that:
(38) 
where the mean is given by Equation (12) and (from Equation (12) as well).
SMC samplers operate on a sequence of target densities which, for any , are approximated by a set of random samples (or particles) . These are propagated forward in time using a combination of importance sampling, resampling and MCMCbased rejuvenation mechanisms [38, 37, 138, 133]. Each of these particles is associated with an importance weight () which is updated sequentially along with the particle locations in order to provide a particulate approximation:
(39) 
where is the Dirac function centered at . Furthermore, for any measurable (as in Equation (34)) and [36, 26, 20]:
(40) 
The particles are constructed recursively in time using a sequence of importance sampling densities . The importance weights are determined from the fact that:
(41) 
Let the particulate approximation of . Note that for and for the Gaussian initial densities of the proposed model, this consists of exact draws and weights . At time we proceed as follows ([45]):

Sample and set

Compute incremental weights:
(42) and update the weights:
(43) 
Compute and if perform multinomial resampling to obtain a new population with equally weighted particles ( was used in this study). Set and go to step 1.
It can be easily established ([46]) that the locally optimal importance sampling density is:
(44) 
In practice, it is usually impossible to sample from and/or calculate the integral in the denominator. As a result, approximations are used which nevertheless result in nonzero variance in the estimators. In this paper we take advantage of the fact that the transition density of as well as the likelihood, conditioned on are Gaussians and propose an importance sampling density of the form:
(45) 
This implies using the prior to draw
Comments
There are no comments yet.