Constructing Metropolis-Hastings proposals using damped BFGS updates
This paper considers the problem of computing Bayesian estimates of system parameters and functions of them on the basis of observed system performance data. This is a previously studied issue where stochastic simulation approaches have been examined using the popular Metropolis--Hastings (MH) algorithm. This prior study has identified a recognised difficulty of tuning the proposal distribution so that the MH method provides realisations with sufficient mixing to deliver efficient convergence. This paper proposes and empirically examines a method of tuning the proposal using ideas borrowed from the numerical optimisation literature around efficient computation of Hessians so that gradient and curvature information of the target posterior can be incorporated in the proposal.READ FULL TEXT VIEW PDF
Particle Metropolis-Hastings enables Bayesian parameter inference in gen...
Monte Carlo inference has asymptotic guarantees, but can be slow when us...
Pseudo-marginal Metropolis-Hastings (pmMH) is a versatile algorithm for
Pseudo-marginal Metropolis-Hastings (pmMH) is a powerful method for Baye...
Bayesian models are naturally equipped to provide recursive inference be...
We propose a new kernel for Metropolis Hastings called Directional Metro...
We present a new multiple-try Metropolis-Hastings algorithm designed to ...
Constructing Metropolis-Hastings proposals using damped BFGS updates
State-space models (SSMs) are ubiquitous within many scientific disciplines including system identification (Ljung, 1999) and finance (Durbin and Koopman, 2012). A common problem within SSMs is the estimation of the parameters given some observed data . In this paper, we consider this problem for SSMs expressed by
where , and denote known densities parameterized by . Here, and denote the state and the observation from the system at time . Note that this parameterisation (1) includes most non-linear and non-Gaussian SSMs and an input can be added as arguments to and .
One common approach for estimating in (1) is to employ the implied one-step-ahead prediction distribution to form the likelihood
of the observed data. Then an estimate of is given by the maximising argument of the likelihood, i.e.,
which is the well-known maximum likelihood (ML) estimate (Ljung, 1999). In practice, is intractable for most SSMs but can be estimated unbiasedly using particle methods (Doucet and Johansen, 2011), These methods can be employed within e.g., gradient-based optimisation (Poyiadjis et al., 2011) and the Expectation–Maximisation algorithm (Schön et al., 2011) to approximately solve the non-convex intractable problem in (3).
where is a prior distribution for that can be used to encode prior user information about the parameters. However, this posterior is intractable as the likelihood (2) cannot be computed in closed-form.
Instead, we make use of a stochastic simulation approach to address this difficulty which constructs a random number generator such that
This is a widely applied approach in the statistics literature where there has been an explosion of its use in applications in the last decade.
A standard approach to building a suitable random number generator is to employ (particle) Metropolis–Hastings (MH; Robert and Casella, 2004; Andrieu et al., 2010) which is a very general algorithm for computing realisations from if it can be evaluated point-wise. Unfortunately achieving reasonable convergence requires careful tuning of the algorithm, which essentially entails leveraging information about the unknown posterior.
This difficulty is well recognised within the literature which is usually mitigated by employing adaptive methods (Andrieu and Thoms, 2008) and the inclusion of geometric information (Girolami and Calderhead, 2011). The latter approach requires the computation of the Hessian of the log-posterior, which can be challenging to compute directly even for a linear Gaussian SSMs where standard Kalman methods are applicable.
The problem is even worse when particle methods (Doucet and Johansen, 2011) are employed to estimate the latent state and the likelihood. This is the result of the empirical observation that the Hessian estimates obtained using particle methods often are noisy and inaccurate even using a large amount of particles. However, sometimes the gradient estimates are accurate even using a small amount of particles. It is therefore of interest to study the problem of estimating the Hessian using noisy gradient information.
The contribution of this paper is explore the use of damped Brodyen–Fletcher–Goldfarb–Shanno (BFGS) updates for approximating the local curvature encoded by the Hessian using only gradient information. Ths is very widely employed approach for solving smooth numerical optimisation problems (Nocedal and Wright, 2006).
Related work regarding the use of BFGS within MH includes Zhang and Sutton (2011) where the authors apply this idea to regression problems. This work is extended to a class of SSMs with intractable likelihoods by Dahlin et al. (2015b)
. The major novelty in the present paper comes from using the damped BFGS update to ensure that the Hessian is positive semi-definite even when particle methods are employed. Finally, the proposed approach delivers superior performance compared to earlier attempts to make use of BFGS within MH. Hence, we obtain good performance of MH without the need for tedious user-tuning which is a step towards automated Bayesian inference methods.
There exists a suite of so-called Markov chain Monte Carlo (MCMC; Robert and Casella, 2004
) methods for constructing a Markov chain producing realisationswith user-specified invariant distribution . Since under mild assumptions realisations from Markov chains have distribution convergent to the invariant distribution of the chain, this provides a means to build a random number generator (5) with arbitrary target distribution .
These samples can then be employed to approximate the posterior. Given the posterior, point estimates such as the conditional mean
can be obtained. These are of interest as they possess a minimum mean squared error property and does not rely on asymptotic results as the ML estimator (3). Moreover, error bounds for each element
of the estimated parameter vector can be obtained by computing the (marginal) posterior density
This results in that the expected value of any arbitrary (measurable) function given by
can be approximated by
using samples from the random number generator. Furthermore, we have that the estimator obeys the strong law of large numbers and is consistent, i.e.,
Choosing then gives an approximation to the conditional mean estimate (6) and choosing as an appropriate indicator function then gives approximations of as sample histograms.
The MH algorithm is arguably one of the most widely employed MCMC techniques to implement (5). It operates by taking an arbitrary proposal Markov chain and modulating it by randomly accepting realisations from this base
chain. The acceptance probability depends on the target. In theory (but not in practice) the proposal distribution can be selected quite freely but in the majority of cases a Gaussian proposal is used,
where denotes the candidate parameter with and denoting a mean and covariance function, respectively.
After generating a candidate parameter via (10), it is accepted with the acceptance probability
where . We set if the candidate parameter is accepted and if it is rejected. Note that, we are only required to be able to point-wise evaluate to implement MH as presented in Algorithm 1.
An essential point is that the speed of the convergence (9) depends on how correlated the realisations are. The more uncorrelated the faster the convergence and hence the better the approximation (8) for a given finite number
of realisations. This is well understood in the MCMC literature, where the variance in the stochastic approximation (8) is established as being proportional to the integrated autocorrelation
which also is known as the inefficiency factor (IF). In turn, the correlation of the realisations is critically dependent on the choice of the proposal .
From the work of Girolami and Calderhead (2011) it is known that the mixing (i.e. autocorrelation) can be greatly improved by the inclusion of gradient and curvature information regarding the posterior into the proposal (10). This is especially important when the posterior is non-isotropic, i.e., some parameters influence the value of the posterior to a larger degree than others. The influence can also vary over the parameter space, which makes local information about the curvature important to increase mixing.
A significant further problem is that the curvature information is difficult to obtain in an efficient manner for general SSMs when particle methods are employed within MH as discussed in the introduction. The gradient of the log-posterior is quite simple to estimate efficiently using Fisher’s identity (Cappé et al., 2005),
where either a Kalman or particle smoother can be employed to compute or approximate the expectation.
The Louis’ identity (Cappé et al., 2005) can be used in an analogue manner to estimate the negative Hessian of the log-posterior . Unfortunately, the resulting estimator often suffers from large noise sensitivity which results in frequent loss of positive definiteness. This is problematic as the inverse of the Hessian is often included in the proposal (10) as its covariance.
Another approach is to compute an estimate of the Hessian by using pilot runs. This results in a so-called pre-conditioning matrix, which can be used to scale the proposal. However, results from numerical studies in e.g., Girolami and Calderhead (2011), Nemeth et al. (2016) and Dahlin et al. (2015a) indicate that this parameter-independent approach is sub-optimal in terms of mixing.
To address these problems, we propose to leverage knowledge from the optimisation literature, where the curvature information often is estimated from gradient information. This is the approach used in the highly successful quasi-Newton algorithms that allow for optimising non-linear functions. Again, this is of interest as gradient estimates often are accurate and relatively computationally cheap to obtain using particle methods in comparison to Hessian estimates of the posterior.
In classical mathematical optimisation, the class of so-called quasi-Newton methods (Nocedal and Wright, 2006) were developed to incorporate curvature information into the search direction calculation in order to accelerate convergence. The details of such methods can be easily found in standard references such as Nocedal and Wright (2006), but in essence, these methods utilise gradient and iterate information in order to form estimates of the Hessian, or its inverse.
One of the most celebrated quasi-Newton methods is the BFGS approach, which employs a rank-2 update to the current Hessian estimate to form a better estimate via the recursion
It can be observed that the Hessian estimate will remain positive-definite if for all iterations. In an optimisation setting this condition on can be guaranteed by a a line-search algorithm that satisfies Wolfe conditions, see (Nocedal and Wright, 2006, Chapter 8). Unfortunately, in the MH setting, this cannot be enforced since it would result in a Markov chain that converges to a single point.
To ameliorate this problem, here we employ the so-called damped BFGS method where is replaced by via
In addition to the damping term, we further employ a limited memory (Nocedal and Wright, 2006, Chapter 9) implementation of the damped-BFGS approach so that the computational load remains modest.
To construct a good MH proposal, the gradient information together with the Hessian estimate from BFGS will be used in the mean and covariance function entering (10). A typical choice resulting from a second-order Taylor expansion (Dahlin et al., 2015a) of the log-posterior is
where denotes a step size specified by the user. Hence, we can see the proposal as a local Gaussian approximation of the posterior, which should allow for efficient sampling from it. Another way of motivating (15) is to see it as a random walk on a Riemannian manifold, see Girolami and Calderhead (2011) for details.
The inclusion of the BFGS algorithm to estimate the Hessian requires us to make some changes to MH. The main problem is that information from iterations back in the algorithm is used to construct the proposal distribution. In the standard version of MH, only information from the last iteration is allowed to be used in the proposal due to the Markov property.
To solve this problem, we are required to extend the Markov chain from a first-order chain to an -order chain. This allows MH to retain its validity as discussed by Zhang and Sutton (2011) and Dahlin et al. (2015b). The major algorithmic change to Algorithm 1 is that the gradient and Hessian is computed in Step 4 using a smoother and Algorithm 2. Moreover, the proposal step in MH is replaced by sampling from
We provide three numerical illustrations to gain understanding about the proposed algorithm and compare it to other alternatives in the literature. The implementation details are summarised in Appendix A and the source code can be downloaded as described in Section 6.
We begin by considering a linear Gaussian state-space (LGSS) model as it is possible to solve the state inference problem exactly using the Kalman smoother. This enables us to compute the log-posterior and its gradients using exact recursions, which will give an indication about the optimal performance of various MH proposals. The model is given by
with and , and . A synthetic data set consisting of a realisation with observations is simulated from the model using the parameters .
We make use of Monte Carlo simulations using the same data to compute the IF (12), and estimate the computational time for different proposals. Table 1 summarises the median results from these simulations, which include the acceptance rate, the fraction of Hessian estimates that are corrected, the maximum of IF and the time required for each iteration and the time required to obtain one effective sample from the posterior. For the IF, we also provide the IQR (the distance between the and quantiles).
The BFGS-type proposals are benchmarked against pre-conditioned versions of MH0 and MH1 denoted pMH0 and pMH1, respectively. In pMH1, we set , where denotes an estimate of the posterior covariance computed using pilot runs. In pMH0, we use the same approach as for pMH1 but also set .
Three different BFGS proposals are used: (d)amped, (i)gnoring the curvature condition and (e)nforcing the curvature condition. For the latter two, the Hessian estimates are often negative definite and therefore require some correction. We apply the three different methods outlined in Appendix B
: (flip)ping the negative eigenvalues, (reg)ularising the estimate and the (hyb)rid method.
The time per effective sample (in milliseconds) is presented in the right-most column in the table. We note that the proposed method based on damped BFGS to approximate the curvature information locally performs the best. It requires milliseconds to produce one sample from the posterior. This is smaller than for pMH0, which would be the standard approach in this setting. Furthermore, we note that the other BFGS-type proposals require a large amount of corrections of the Hessian, which is not desirable this might introduce numerical instability.
Figure 1 presents the posterior estimate, Markov chain trace with its corresponding ACF for a particular case in the simulation study. We note that the mixing is much better for the BFGS-based proposal (lower) compared with pMH0 (upper) and pMH1 (middle). Furthermore, the ACF for the BFGS-based proposal exhibit a quite different behaviour compared with the other two proposals due to the step dependence in the Markov chain. Comparing the posterior estimates, we conclude that the proposed method generates good estimates centered around the correct parameter and with reasonable variance.
We repeat the same experiment when the Kalman filter and smoother is replaced by a particle filter and fixed-lag particle smoother as described inDahlin et al. (2015a). Again, Table 1 summarises the results with all timings now expressed in seconds. The pMH0 and pMH1 perform worse in this case due to the noise in the estimates of the log-posterior and its gradients.
However, the proposed method performs well and generates one effective sample from the posterior every seconds. This is a substantial decrease compared with the other methods and in particular an acceleration by at least a factor of two compared with the quasi-Newton approach proposed by Zhang and Sutton (2011) and Dahlin et al. (2015b). Hence, the proposed method outperforms both the standard approaches to designing a proposal and with current state-of-the-art in quasi-Newton proposals.
To demonstrate a practical application of the proposed method, we consider the problem of estimating the volatility of Bitcoin prices between November 7, 2015 and November 7, 2017. The log-returns (the change in percent of the Bitcoin price between two days) is presented in the upper part of Figure 2 as green dots. Note that the log-returns have zero mean but that the variance is changing over time. We aim to capture this change in volatility (as this variance is known) by the model
which is a so-called stochastic volatility (SV) model with leverage. Here, the unknown latent parameters are the same as in the LGSS model together with the correlation , i.e., . The aim is to estimate the log-volatility given the data, which can be done by marginalising over the posterior estimate of the parameters, see Andrieu et al. (2010) or Dahlin and Schön (2017).
The resulting estimates of the log-volaility (the latent state) is presented in the middle plot of Figure 2. Note that it varies over the time period and is large when the Bitcoin prices are volatile and exhibit large day-to-day changes. Furthermore, the mean of the log-volatility process is quite large, which translates into that the log-volatility typically is large (compared with e.g., prices of stocks). The correlation is probably quite close to zero. This is quite different from which is typical for stocks, which implies the large drops in stock prices raises the volatility (as investors sell their assets).
This information is very useful in many financial applications such as pricing futures on Bitcoins as well as computing various risk measures required to be presented by banks and financial institutions to regulatory agencies.
The numerical illustrations indicate the the proposed method can outperform many existing methods used to create good proposals for MH. Furthermore, we would like to again underline that the proposed method requires basically no pilot runs, which are required for all pre-conditioned methods. Moreover, the damped BFGS approach always provides a positive definite estimate of the Hessian, so no potentially numerically unstable Hessian correction is required.
Furthermore, the gradient information is crucial for estimation in large dimension parameter spaces, which are common in SSMs and in transfer function models. All these benefits could potentially allow for a wide adoption of MH for identifying dynamical systems.
There are plenty of interesting avenues for future work within the scope of this paper. SR1 updates (Nocedal and Wright, 2006, Ch. 6.2) are an alternative to BFGS which are known to provide more accurate estimates of the Hessian in many cases. Furthermore, trust region approaches from optimisation could potentially be useful in MH to protect with problems with numerical stability. Finally, more extensive numerical evaluations are required for models with larger parameter spaces. In this case, alternatives to or better algorithms for particle smoothing are required to obtain reasonable gradient estimates.
The source code and data used in this paper are available from GitHub https://github.com/compops/qnmh-sysid2018/ and via Docker (see README.md).
Inference in hidden Markov models. Springer Verlag, 2005.
The Oxford Handbook of Nonlinear Filtering. Oxford University Press, 2011.
In Sections 5.1 and 5.2, we use a standard Kalman filter with the RTS smoother to compute the log-posterior and its gradients. Furthermore, a bootstrap particle filter is employed with particles and a fixed-lag smoother with lag , see Dahlin et al. [2015a] for an algorithmic description. The number of particles is selected using the results in Doucet et al. . We initialise all MH algorithms in the true parameters for simplicity and run them for iterations and discard the first as burn-in.
The pre-conditioning matrices for MH0 and MH1 are computed using a number of pilot runs. The step-lengths are selected using existing rule-of-thumbs [Roberts and Rosenthal, 1998] as and when using Kalman methods and and when using particle methods [Nemeth et al., 2016]. For the qMH approaches, we use memory length and make use of a random walk proposal for the first iterations with step lengths for all three parameters. The step size is used for all qMH algorithms after the initial iterations.
A reparametrization of the LGSS model is done to make all the parameters in the Markov chain unconstrained (able to assume any real value) given by
where are the new states of the Markov chain. This change of variables introduces a Jacobian term into the acceptance probability, see Dahlin and Schön [2017, Section 6.3.2]. Finally, we use the following prior densities
denotes a truncated Gaussian distribution onand
denotes the Gamma distribution with mean.
In Section 5.3, we make use of the same settings as for the LGSS model but increase to . Moreover, we change the priors ( is kept as before) slightly to
which better reflect the parameter values usually found in real-world data. The correlation is reparametrized in the same manner as . The Bitcoin data is computed as , where denotes the daily exchange rates versus the US Dollar obtained from https://www.quandl.com/BITSTAMP/USD.
The first approach regularise the Hessian by
with denoting the smallest (negative) eigenvalue. This shifts all eigenvalues to be positive. The second method flips the eigenvalues,
denotes the matrix of eigenvectors and the diagonal matrix of the absolute value of the eigenvalues of, respectively. The third approach is the hybrid method from Dahlin et al. [2015a]. In which the estimate is replaced by a global approximation of the posterior covariance
where denotes the sample estimate of the posterior covariance computed using the latter half of the burn-in phase.