A common class of models used for time series modelling and prediction is the class of state space models (SSMs). This class includes nonlinear structures, like stochastic volatility models, regime switching models, mixture models, and models with random dynamic jumps; plus linear structures, such as linear Gaussian unobserved component models. (See Durbin and Koopman, 2001, Harvey et al., 2004, and Giordani et al., 2011, for extensive reviews).
The key feature of SSMs is their dependence on hidden, or latent, ‘local’ variables, or states, which govern the dependence of the observed data, in conjunction with a vector of unknown ‘global’ parameters. This feature leads to inferential challenges with, for example, the likelihood function for the global parameters being analytically unavailable, except in special cases. Whilst frequentist methods have certainly been adopted (seeDanielsson and Richard, 1993, Ruiz, 1994, Andersen and Sørensen, 1996, Gallant and Tauchen, 1996, Sandmann and Koopman, 1998, Bates, 2006, Ait-Sahalia and Kimmel, 2007, and Ait-Sahalia et al., 2020
, amongst others), it is arguable that Bayesian Markov chain Monte Carlo (MCMC) methods have become the most common tool for analysing general SSMs, with such techniques expanded in more recent times to accommodate particle filtering, via pseudo-marginal variants such as particle MCMC (PMCMC) (Andrieu et al., 2011; Flury and Shephard, 2011). See Giordani et al. (2011) and Fearnhead (2011) for a detailed coverage of this literature, including the variety of MCMC-based algorithms adopted therein.
Whilst (P)MCMC methods have been transformative in the SSM field, they do suffer from certain well-known limitations. Most notably, they require that either the (complete) likelihood function is available in closed form or that an unbiased estimator of it is available. Such methods also do not necessarily scale well to high-dimensional problems; that is, to models with multiple observed and/or state processes. If the assumed data generating process (DGP) is intractable, inference can proceed using approximate Bayesian computation (ABC) (Dean et al., 2014; Creel and Kristensen, 2015; Martin et al., 2019), since ABC requires only simulation - not evaluation - of the DGP. However, ABC also does not scale well to problems with a large number of parameters (see, e.g., Corollary 1 in Frazier et al., 2018 for details).
Variational Bayes (VB) methods (see Blei et al., 2017 for a review) can be seen as a potential class of alternatives to either (P)MCMC- or ABC-based inference in SSMs. In particular, and in contrast to these methods, VB scales well to high-dimensional problems, using optimization-based techniques to effectively manage a large number of unknowns (Tran et al., 2017; Quiroz et al., 2018; Koop and Korobilis, 2018; Chan and Yu, 2020; Loaiza-Maya et al., 2021).
In this short paper, we make three contributions to the literature on the application of VB to SSMs. The first contribution is to highlight the fundamental issue that lies at the heart of the use of VB in an SSM setting, linking this to an issue identified elsewhere in the literature as the ‘incidental parameter problem’ (Neyman and Scott, 1948; Lancaster, 2000). In brief, without due care, the application of VB to the local parameters in an SSM leads to a lack of Bayesian consistency for the global parameters. We believe this to be a novel insight into the role of VB in SSMs, and one that can lead to best practice, if heeded. The second contribution is to review some existing variational methods, and to link their prospects for consistency to the manner in which they do, or do not, circumvent the incidental parameter problem. Thirdly, we undertake a numerical comparison of several competing variational methods, in terms of both inferential and predictive accuracy. The key findings are that: i) correct management of the local variables leads to inferential accuracy that closely matches that of exact (MCMC-based) Bayes; ii) inadequate treatment of the local variables leads, in contrast, to noticeably less accurate inference; iii) predictive accuracy shows some robustness to inferential inaccuracy, but only for small sample sizes. Once the size of the sample is very large, the consistency (or otherwise) of a VB method impinges on predictive accuracy, with a clear ranking becoming evident across the methods for some DGPs; with certain VB methods unable to produce similar out-of-sample accuracy results to exact Bayes in some settings.
Throughout the remainder, we make use of the following notational conventions. Generic are used to denote densities, and is used to denote posteriors conditioned only on data, and where the conditioning is made explicit depending on the situation. For any arbitrary collection of data , we abbreviate this collection as . For a sequence , the terms , and have their usual meaning. Similarly, we let denote . We let denote a metric on . The proofs of all theoretical results, certain definitions, plus additional numerical results, are included in the Supplementary Appendix.
2 State Space Models: Exact Inference
An SSM is a stochastic process consisting of the pair , where is a Markov chain taking values in the measurable space , and is a process taking values in a measure space , such that, conditional on , the sequence is independent. The model is formulated through the following state and measurement equations: for a vector of unknown random parameters
taking values in the probability space, where admits the density function ,
where , , and are known functions and and
are independent sequences of i.i.d. random variables independent of the initial value, which is distributed according to the initial measure . The system in (1) and (2) gives rise to the following conditional and transition densities:
where denotes the transition kernel wrt the measure . For simplicity, throughout the remainder we disregard terms dependence on the initial measure and the invariant measure , when no confusion will result.
Given the independence of conditional on , and the Markovian nature of , the complete data likelihood is
The (average) observed data log-likelihood is thus
and the maximum likelihood estimator (MLE) ofis . As is standard knowledge, is available in closed form only for particular forms of and the canonical example being when (1) and (2) define a linear Gaussian state space model (LGSSM). Similarly, for denoting the prior density, the exact (marginal) posterior for , defined as
is available (e.g. via straightforward MCMC methods) only in limited cases, the LGSSM being one such case. In more complex settings and/or settings where either or , or both, are high-dimensional, accessing (4) can be difficult, with standard MCMC methods leading to slow mixing, and thus potentially unreliable inferences (Betancourt, 2018).
To circumvent these issues, recent research has suggested the use of variational methods for SSMs: these methods can be used to approximate either the log-likelihood function in (3) or the marginal posterior in (4), depending on the mode of inference being adopted. The focus of this paper, as already highlighted, is on variational Bayes and, in particular, on the accuracy of such methods in SSMs. However, as part of the following section we also demonstrate the asymptotic behaviour of frequentist variational point estimators of , as this result will ultimately help us interpret the behavior of the variational posterior in SSMs.
3 State Space Models: Variational Inference
The idea of VB is to produce an approximation to the joint posterior in (4) by searching over a given family of distributions for the member that minimizes a user-chosen divergence measure between the posterior of interest and the family. This replaces the posterior sampling problem with one of optimization over the family of densities used to implement the approximation. We now review the use of variational methods in SSMs, paying particular attention to the Markovian nature of the states.
VB approximates the posterior by minimizing the KL divergence between a family of densities , with generic element , and :
Optimizing the KL divergence directly is not feasible since it depends on the unknown ; the very quantity we are trying to approximate. However, minimizing the KL divergence between and is equivalent to maximizing the so-called variational evidence lower bound (ELBO):
which we can access. Hence, for a given class , we may define the variational approximation as
The standard approach to obtaining is to consider a class of product distributions
with often restricted to be mean-field, i.e., independent of , , and independent of .
Regardless of the variational family adopted, , and hence , involve both and . The product form of allows us to write:
where the last line follows from Fubini’s theorem and the fact that , by assumption, is a proper density function, for all Further, defining
by Jensen’s inequality
Thus can be viewed as an approximation (from below) to the observed data log-likelihood. Defining
the can then be expressed as
This representation decomposes into three components, two of which only depend on the variational approximation of the global parameters , and a third component, , that Yang et al. (2020) refer to as the average (wrt ) “Jensen’s gap”, which encapsulates the error introduced by approximating the latent states using a given variational class. While the first and last term in the decomposition can easily be controlled by choosing an appropriate class for , it is the average Jensen’s gap that ultimately determines the behavior of the variational approximation.
3.2 Consistency of variational point estimators
The decomposition in (9) has specific implications for variational inference in SSMs, which can be most readily seen by first considering the case where we only employ a variational approximation for the states, and consider point estimation of the parameters . In this case, we can think of the variational family as , where is the Dirac delta function at , and we can then write
where we abuse notation and represent functions with arguments only by the parameter value , and also make use of the short-hand notation for Define the variational point estimator as
At a minimum, we would hope that the variational estimator converges to the same point as the MLE. To deduce the behavior of , we employ the following high-level regularity conditions.
(i) The parameter space is compact, and . (ii) There exists a deterministic function , continuous for all , and such that . (iii) For some value , for all , there exists a such that .
Low level regularity conditions that imply Assumption 3.1 are given in Douc et al. (2011). Since the main thrust of this paper is to deduce the accuracy of variational methods in SSMs, and not to focus on the technical details of the SSMs in particular, we make use of high-level conditions to simplify the exposition and reduce necessary technicalities that may otherwise obfuscate the main point.
The following result shows that consistency of (for ) is guaranteed if the variational family for the states is ‘good enough’.
Define , and note that . If Assumption 3.1 is satisfied, and if , then .
The above result demonstrates that for the variational point estimator to be consistent, the (infeasible) average Jensen’s gap must converge to zero. Intuitively, this requires that the error introduced by approximating the states grows more slowly than the rate at which information accumulates in our observed sample, i.e., . The condition is stated at the true value, , rather than at the estimated value, as it will often be easier to deduce satisfaction of the condition, or otherwise, at convenient points in the parameter space.
As the following example illustrates, even in the simplest SSMs, the scaled (average) Jensen’s gap need not vanish in the limit, and can ultimately pollute the resulting inference on .
Example 3.1 (Linear Gaussian Model).
Consider the following SSM,
with and independent sequences of i.i.d. standard normal random variables. We observe a sequence from the above model, but the states are unobserved. Furthermore, consider that are unknown while is known.
We make use of the autoregressive nature of the state process to approximate the posterior for via the variational family:
When evaluated at ,
is the actual (infeasible) joint distribution of the states, and thus should provide a reasonable approximation to the state posterior.
Let and , . (i) Consider that and known, then the variational estimator is consistent if and only if . (ii) Consider that and known, then the variational estimator for is consistent if and only if .
3.3 Lack of Bayes consistency of the variational posterior
While the above results pertain to variational point estimators of , a similar result can be stated in terms of the so-called ‘idealized’ variational posterior. To state this result, we approximate the state posterior using the class of variational approximations,
where denotes the vector of so-called ‘variational parameters’ that characterize the elements in . With reference to (7), making the dependence of on the variational parameter explicit leads to the criterion , where in (7) is replaced by . Optimizing over for fixed yields the profiled criterion,
and the ‘idealized’ variational posterior for ,
Note that, unlike with the frequentist optimization problem, the idealized VB posterior incorporates a component of Jensen’s gap directly into the definition of that posterior. A sufficient condition for the ‘VB ideal’to concentrate onto is that is the maximum of a well-defined limit counterpart to .
(i) There exists a map such that . (ii) There exist a deterministic function and a such that the following are satisfied: (a) for all there exists some such that ; (b) . (iii) For any , . (iv) For all large, .
Under Assumption 3.2, for any ,
Assumption 3.2(2.b) implies that converges to (uniformly in and ); while part (2.a) is an identification condition and states that is maximized at some , which may differ from . This identification condition makes clear that if , then and the idealized posterior for will not concentrate onto . This can be interpreted explicitly in terms of Jensen’s gap as defined in (8) by recalling that under Assumption 3.1, , and by considering the limit of (the scaled) Jensen’s gap evaluated at ,
for some . If Assumption 3.2(2.a) is satisfied at , then , and .
show that, regardless of whether one conducts variational frequentist or Bayesian inference in SSMs, consistent inference forwill require that a version of Jensen’s gap converges to zero. Moreover, as Example 3.1 has demonstrated, this is not likely to occur even in simple SSMs.
4 Existing Variational Approaches: Implications for Inference
As the previous discussion illustrates, the need to approximate the posterior of introduces a discrepancy between the exact posterior, obtained via standard methods, and that which results from VB methods. In this way, we can view the latent states as incidental or nuisance parameters (see Lancaster, 2000, for a review), which are needed to make feasible the overall optimization problem, but which, in and of themselves, are not the object of interest. As the theoretical results demonstrate, the introduction of, and requirement to perform inference on, the incidental parameters means that the (ideal) VB posterior may not concentrate onto the true parameter value that has generated the data, . It is then clear that VB methods must somehow solve the incidental parameter problem if they wish to obtain reliable inference for .
That being said, the incidental parameter problem has not stopped researchers from using VB methods to conduct inference in SSMs. The general conclusions elucidated above apply, in principle, to all such methods. However, it is useful to explore specific categories of VB methods in greater detail, and comment on their ability to deliver consistent inference for . We begin with reference to a variational approach to point estimation of , followed by brief outlines of two classes of approach that target the posterior for via variational methods. In Section 5, we continue the exposition of VB methods (only), but explicitly within the context of prediction. The implications of Bayesian consistency (or lack thereof) for predictive accuracy are discussed therein, and with numerical examples used to document the differences between various approaches.
4.1 Profiling out the states
The point estimation approach proposed by Westling and McCormick (2019) splits the optimization procedure into two parts: one for conditional on ; and one for conditional on . Such an approach allows one to view variational estimators as (profiled) M-estimators. In particular, this profiled strategy posits a class of variational posteriors for , , which depend on parameters , and proposes to estimate by maximizing the profiled criterion in (10). As detailed in Section 3, the optimizer of the limit criterion will not in general coincide with the optimizer of , i.e. Hence, consistency of such a variational estimator for is not guaranteed. This is indeed highlighted by example in Westling and McCormick (2019), where the authors show that inconsistency can occur, even in the case of independent observations, if delicate care is not taken with the choice of variational class for
4.2 Integration approaches
A possible VB approach is to first ‘integrate out’ the latent states so that there is no need to perform joint inference on . Such an approach can be motivated by the fact that if we take (i.e. take the variational approximation for to be equivalent to the exact posterior for conditional on ), then we can rewrite as
with the final line exploiting the fact that integrates to one for all Thus, if we are able to use as our variational approximation for the states the actual (conditional) posterior, we can transform a variational problem for into a variational problem for alone.
The above approach is adopted by Loaiza-Maya et al. (2021), and is applicable in any case where draws from can be reliably and cheaply obtained, with the resulting draws then used to ‘integrate out’ the states via the above KL divergence representation. While the approach of Loaiza-Maya et al. (2021) results in the above simplification, the real key to their approach is that it can be used to unbiasedly estimate the gradient of (equivalent, in turn, to the gradient of the joint ELBO in (6), by the above argument). This, in turn, allows optimization over to produce an approximation to the posterior . Indeed, such an approach can be applied in many SSMs, such as unobserved component models like the LGSSM, in which draws from
can be generated exactly via, for example, forward (Kalman) filtering and backward sampling (Carter and Kohn, 1994; Frühwirth-Schnatter, 1994); or various nonlinear models (e.g. those featuring stochastic volatility), in which efficient Metropolis- Hastings-within-Gibbs algorithms are available (Kim et al., 1998; Jacquier et al., 2002; Primiceri, 2005; Huber et al., 2020).
In cases where we are not able to sample readily from it may still be possible to integrate out the states using particle filtering methods. To this end, assume that we can obtain an unbiased estimate of the observed data likelihood using a particle filter, which we denote by . We follow Tran et al. (2017) and write as to make the estimator’s dependence on the random filtering explicit through the dependence on a random variable , with subsequently defined by the condition . For denoting the density of , Tran et al. (2017) consider VB for the augmented posterior
which, marginal of , has the correct target posterior due to the unbiasedness of the estimator . The authors refer to the resulting method as variational Bayes with an intractable likelihood function (VBIL). The VBIL posteriors can be obtained by considering a variational approximation to that minimizes the KL divergence between and :
where in this case
For fixed , , but in general is a biased estimator of , from which it follows that However, in contrast to the general approximation of the states discussed in Section 3, which intimately relies on the choice of the approximating density , VBIL can achieve consistent inference on by choosing an appropriate number of particles in the production of .
To see this, we recall that a maintained assumption in the literature on PMCMC methods is that, for all and
, the conditional mean and variance of the densitysatisfy , and , where is bounded uniformly over ; see, e.g., Assumption 1 in Doucet et al. (2015) and Assumption 1 in Tran et al. (2017). However, in general, is assumed to be chosen so that and , . Note that, under this choice for , for any
From this condition, we see that the VBIL inference problem is asymptotically the same as the VB inference problem for alone. Consequently, existing results on the posterior concentration of VB methods for alone can be used to deduce posterior concentration of the VBIL posterior for .
4.3 Structured approximations of the states
Yet another approach for dealing with variational inference in the presence of states is to consider a structured approximation that allows for a dynamic updating of the approximation for the posterior of the states. Such an approximation can be achieved by embedding in the class of variational densities an analytical filter, like the Kalman filter. Koop and Korobilis (2018) propose the use of the Kalman filter within VB (VBKF) as a means of approximating the posterior density of the states using Kalman recursions. In particular, the authors approximate the posterior by approximating the relationship between and , which may in truth be non-linear in , by the random walk model , with , and then use Kalman filtering to update the states in conjunction with a linear approximation to the measurement equation. Using this formulation, the variational approximation is of the form , where and where the terms are explicitly calculated using the Kalman recursion: and where is the Kalman gain, is the predicted variance of the state, and in the application of Koop and Korobilis (2018), .
While the solution proposed by the VBKF is likely to lead to better inference on the states, especially when behaves like a random walk, ultimately we are still ‘conducting inference’ on , and thus we still encounter the incidental parameter problem as a consequence. Indeed, taking as the variational family for the Kalman filter approximation yields, at time , a conditionally normal density with mean and variance . Hence, we have a variational density that has the same structure as in Lemma 3.2, but which allows for a time varying mean and variance. Given this similarity, there is no reason to suspect that such an approach will yield inferences that are consistent. Indeed, further intuition can be obtained by noting that, in the VBKF formulation, the simplification of the state equation means that we disregard any dependence between the states and the values of that drive their dynamics.
The variational approach of Chan and Yu (2020) can be viewed similarly: the suggested algorithm assumes and exploits a particular dynamic structure for the states that allows for analytical (posterior) updates and thus leads to computationally simple estimates for the variational densities of As with the VBKF approach, the assumed nature of the state process used by Chan and Yu (2020) to estimate implies that, in general, it is unlikely that Bayesian consistency can be achieved. Since this method plays a role in the numerical prediction exercises below, we forgo further discussion of the specific details of the approach until Section 6.
5 VB-based Prediction: Implications for Predictive Accuracy
As highlighted by the above analysis and, indeed, as also acknowledged by other authors (e.g. Koop and Korobilis, 2018; Gunawan et al., 2020), VB provides, at best, an approximation to the posterior and, as a result, may well yield less accurate inferences than those produced by the exact posterior. However, as highlighted by Quiroz et al. (2018) and Frazier et al. (2021), amongst others, VB approximations can perform admirably in predictive settings, in the sense of replicating the out-of-sample accuracy achieved by exact predictives, when such comparators are available. (See Frazier et al., 2019 for a comparable finding in the context of predictions based on ABC.) Therefore, even though the VB posterior may not necessarily converge to the true value that has generated the data (as shown in Lemma 3.3), so long as the value onto which it is concentrating is not too far away, it may be that predictions produced by VB approaches will also perform well in practice.
In this section, we shed further light on the phenomena of the predictive accuracy of VB methods, and connect the performance of these methods to the inconsistency for that can result as the sample size diverges. The results suggest that, while there is little difference between variational methods in predictive settings with either a small sample size or a small number of out-of-sample observations, there is a clear hierarchy in terms of predictive accuracy across methods as the sample size becomes larger and as the out-of-sample evaluation increases.
Recall the conditional density of given and is , so that the predictive pdf for can be expressed as
where the last line follows from the Markovianity of the state transition equation (see equation (1)). In many large SSMs, using MCMC methods to estimate (11) is infeasible or prohibitive computationally, due to the difficulty of sampling from Instead, VB methods can produce an estimate of by approximating, in various ways, the two pieces in equation (11) underlined as (1) and (2). All such methods replace the second underlined term by some approximate posterior for , but differ in how they access the first underlined term. In all the cases of which we are aware, we can separate VB methods for prediction in SSMs into two classes: a class which makes explicit use of a variational approximation to the states, to replace ; and a class that uses an accurate simulation-based estimate of . We discuss these two strategies in greater detail in the following section.
5.2 Methods of producing the variational predictive
5.2.1 Approximation approaches
The VB methods that approximate by constructing an approximation to all make use of a variational approximation of , in addition to using the structure of the state equation. To illustrate this, it is perhaps easiest to consider the case where we seek to estimate (11) by generating values of and using as our estimate of the kernel density obtained from the simulations. In this way, we can see that simulation of requires simulating the following random variables, in sequence:
More precisely, consider a fixed value of drawn from some variational approximation of , call it. Given the realization , we simulate from the VB approximation of the states . Next, given , we can generate from by generating from the transition density of the states, , and under the draws and . Lastly, is generated according to the conditional distribution . While the above steps are simple to implement, the critical point to realize is that since has not been generated from , in general is not a draw from . Hence, the draw does not correctly reflect the structure of the assumed model, and cannot be viewed as being a draw from the exact predictive density in (11).
Notable uses of the above approach to prediction appear in Quiroz et al. (2018), Koop and Korobilis (2018) and Chan and Yu (2020). While similar in form and structure, these three specific approaches are distinct in the sense that the each use different methods to construct (in addition to the differences in the construction of ) and thus to generate .
5.2.2 Simulation approaches
As an alternative, one may estimate using exact draws of , and , conditional on the draw of from some For example, if draws from the exact posterior of the states, , are readily available via an efficient MCMC algorithm, can be estimated via the same set of steps as delineated above, apart from being drawn directly from , rather than some ; see, for example, Loaiza-Maya et al. (2021). In this case, is a draw from and, consequently, the draw correctly reflects the model structure. Moreover, due to the Markovian nature of (1), posterior draws of the full vector of states are not required, only draws of As such, any forward (particle) filtering method is all that is required to produce draws of that are conditional on the full vector of observations.