Constructing Metropolis-Hastings proposals using damped BFGS updates

01/04/2018, by Johan Dahlin et al.

This paper considers the problem of computing Bayesian estimates of system parameters, and of functions of them, on the basis of observed system performance data. This is a previously studied issue where stochastic simulation approaches have been examined using the popular Metropolis–Hastings (MH) algorithm. This prior work has identified a recognised difficulty: tuning the proposal distribution so that the MH method delivers realisations with sufficient mixing for efficient convergence. This paper proposes and empirically examines a method of tuning the proposal using ideas borrowed from the numerical optimisation literature on the efficient computation of Hessians, so that gradient and curvature information about the target posterior can be incorporated in the proposal.

1 Introduction

State-space models (SSMs) are ubiquitous within many scientific disciplines, including system identification (Ljung, 1999) and finance (Durbin and Koopman, 2012). A common problem within SSMs is the estimation of the parameters θ given some observed data y_{1:T} = {y_1, …, y_T}. In this paper, we consider this problem for SSMs expressed by

x_0 ∼ μ_θ(x_0),  x_t | x_{t−1} ∼ f_θ(x_t | x_{t−1}),  y_t | x_t ∼ g_θ(y_t | x_t),   (1)

where μ_θ, f_θ and g_θ denote known densities parameterised by the unknown parameters θ ∈ Θ ⊆ R^p. Here, x_t and y_t denote the state and the observation from the system at time t. Note that this parameterisation (1) includes most non-linear and non-Gaussian SSMs, and a known input u_t can be added as an argument to f_θ and g_θ.

One common approach for estimating θ in (1) is to employ the implied one-step-ahead prediction distribution p_θ(y_t | y_{1:t−1}) to form the likelihood

p_θ(y_{1:T}) = ∏_{t=1}^{T} p_θ(y_t | y_{1:t−1})   (2)

of the observed data. Then an estimate of θ is given by the maximising argument of the likelihood, i.e.,

θ̂_ML = argmax_θ p_θ(y_{1:T}),   (3)

which is the well-known maximum likelihood (ML) estimate (Ljung, 1999). In practice, p_θ(y_{1:T}) is intractable for most SSMs, but it can be estimated unbiasedly using particle methods (Doucet and Johansen, 2011). These methods can be employed within, e.g., gradient-based optimisation (Poyiadjis et al., 2011) and the Expectation–Maximisation algorithm (Schön et al., 2011) to approximately solve the non-convex and intractable problem in (3).
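
To make the particle-based likelihood estimate concrete, the following minimal Python sketch implements a bootstrap particle filter with multinomial resampling for a generic SSM of the form (1). The function and argument names are our own, and the snippet is an illustration of the idea rather than the implementation accompanying the paper.

```python
import numpy as np

def bootstrap_pf_loglik(y, sample_initial, sample_transition, log_obs_density,
                        num_particles=500, rng=None):
    """Estimate the likelihood (2), returned on log scale, using a bootstrap
    particle filter with multinomial resampling at every step."""
    rng = np.random.default_rng() if rng is None else rng
    x = sample_initial(num_particles, rng)      # draws from mu_theta(x_0)
    log_lik = 0.0
    for t in range(len(y)):
        x = sample_transition(x, rng)           # propagate through f_theta
        log_w = log_obs_density(y[t], x)        # weight with g_theta(y_t | x_t)
        max_lw = np.max(log_w)
        w = np.exp(log_w - max_lw)
        log_lik += max_lw + np.log(np.mean(w))  # estimate of p_theta(y_t | y_{1:t-1})
        idx = rng.choice(num_particles, size=num_particles, p=w / np.sum(w))
        x = x[idx]                              # multinomial resampling
    return log_lik
```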

In this paper, we take another approach to estimating θ by using the Bayesian paradigm (Peterka, 1981; Robert, 2007). This amounts to computing the posterior

p(θ | y_{1:T}) = p(θ) p_θ(y_{1:T}) / p(y_{1:T}) ∝ p(θ) p_θ(y_{1:T}),   (4)

where p(θ) is a prior distribution for θ that can be used to encode prior user information about the parameters. However, this posterior is intractable as the likelihood (2) cannot be computed in closed form.

Instead, we make use of a stochastic simulation approach to address this difficulty, which constructs a random number generator such that

θ^(k) ∼ p(θ | y_{1:T}),  for k = 1, 2, …, K.   (5)

This is a widely applied approach in the statistics literature, where its use in applications has grown rapidly over the last decade.

A standard approach to building a suitable random number generator is to employ (particle) Metropolis–Hastings (MH; Robert and Casella, 2004; Andrieu et al., 2010), which is a very general algorithm for computing realisations from the posterior if it can be evaluated point-wise. Unfortunately, achieving reasonable convergence requires careful tuning of the algorithm, which essentially entails leveraging information about the unknown posterior.

This difficulty is well recognised within the literature and is usually mitigated by employing adaptive methods (Andrieu and Thoms, 2008) or by including geometric information (Girolami and Calderhead, 2011). The latter approach requires the computation of the Hessian of the log-posterior, which can be challenging to compute directly even for a linear Gaussian SSM where standard Kalman methods are applicable.

The problem is even worse when particle methods (Doucet and Johansen, 2011) are employed to estimate the latent state and the likelihood. This is the result of the empirical observation that Hessian estimates obtained using particle methods often are noisy and inaccurate, even when using a large number of particles. However, the gradient estimates are often accurate even when using a small number of particles. It is therefore of interest to study the problem of estimating the Hessian using noisy gradient information.

The contribution of this paper is to explore the use of damped Broyden–Fletcher–Goldfarb–Shanno (BFGS) updates for approximating the local curvature encoded by the Hessian using only gradient information. This is a very widely employed approach for solving smooth numerical optimisation problems (Nocedal and Wright, 2006).

Related work regarding the use of BFGS within MH includes Zhang and Sutton (2011), where the authors apply this idea to regression problems. This work is extended to a class of SSMs with intractable likelihoods by Dahlin et al. (2015b). The major novelty in the present paper comes from using the damped BFGS update to ensure that the Hessian estimate is positive semi-definite even when particle methods are employed. Furthermore, the proposed approach delivers superior performance compared with earlier attempts to make use of BFGS within MH. Hence, we obtain good performance of MH without the need for tedious user tuning, which is a step towards automated Bayesian inference methods.

2 Sampling from the posterior

There exists a suite of so-called Markov chain Monte Carlo (MCMC; Robert and Casella, 2004) methods for constructing a Markov chain producing realisations {θ^(k)}_{k≥1} with a user-specified invariant distribution π(θ) = p(θ | y_{1:T}). Since, under mild assumptions, realisations from Markov chains have distributions convergent to the invariant distribution of the chain, this provides a means to build a random number generator (5) with an arbitrary target distribution π.

These samples can then be employed to approximate the posterior. Given the posterior, point estimates such as the conditional mean

θ̂ = E[θ | y_{1:T}] = ∫ θ p(θ | y_{1:T}) dθ   (6)

can be obtained. These are of interest as they possess a minimum mean squared error property and do not rely on asymptotic results, unlike the ML estimator (3). Moreover, error bounds for each element θ_i of the estimated parameter vector can be obtained by computing the (marginal) posterior density

p(θ_i | y_{1:T}) = ∫ p(θ | y_{1:T}) dθ_{−i},   (7)

where θ_{−i} denotes the vector θ without its i-th element. Unfortunately, both (6) and (7) require the evaluation of multidimensional integrals, which can be computationally challenging, particularly as p grows.

The samples produced by the random number generator (5) side-step this problem, as the expected value of an arbitrary (measurable) function φ of the parameters, given by

E[φ(θ) | y_{1:T}] = ∫ φ(θ) p(θ | y_{1:T}) dθ,

can be approximated by

φ̂_K = (1/K) ∑_{k=1}^{K} φ(θ^(k)),   (8)

using K samples from the random number generator. Furthermore, this estimator obeys the strong law of large numbers and is consistent, i.e.,

φ̂_K → E[φ(θ) | y_{1:T}]  almost surely as K → ∞.   (9)

Choosing φ(θ) = θ then gives an approximation to the conditional mean estimate (6), and choosing φ as an appropriate indicator function gives approximations of the marginal posteriors (7) as sample histograms.
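
As a minimal illustration of (8), the snippet below computes the conditional mean and a marginal histogram from a set of posterior samples; the samples here are placeholder draws, whereas in practice they come from the MH algorithm described next.

```python
import numpy as np

# Placeholder draws standing in for samples from p(theta | y_{1:T}); shape (K, p).
theta_samples = np.random.default_rng(0).normal(size=(5000, 3))

# Conditional mean estimate (6): choose phi(theta) = theta in (8).
theta_cm = theta_samples.mean(axis=0)

# Marginal posterior (7) for element i as a sample histogram: choose phi as the
# indicator function of theta_i falling into each bin.
i = 0
density, edges = np.histogram(theta_samples[:, i], bins=50, density=True)
```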

The MH algorithm is arguably one of the most widely employed MCMC techniques for implementing (5). It operates by taking an arbitrary proposal Markov chain and modulating it by randomly accepting realisations from this base chain. The acceptance probability depends on the target π(θ). In theory (but not in practice) the proposal distribution can be selected quite freely, but in the majority of cases a Gaussian proposal is used,

q(θ' | θ^(k−1)) = N(θ'; μ(θ^(k−1)), Σ(θ^(k−1))),   (10)

where θ' denotes the candidate parameter, and μ(·) and Σ(·) denote a mean and covariance function, respectively.

After generating a candidate parameter via (10), it is accepted with the acceptance probability

α(θ', θ^(k−1)) = min{1, [π(θ') q(θ^(k−1) | θ')] / [π(θ^(k−1)) q(θ' | θ^(k−1))]},   (11)

where π(θ) = p(θ | y_{1:T}) denotes the target. We set θ^(k) = θ' if the candidate parameter is accepted and θ^(k) = θ^(k−1) if it is rejected. Note that we are only required to be able to evaluate π(θ) point-wise (up to a normalising constant) to implement MH as presented in Algorithm 1.

Inputs: the number of iterations K, an initial parameter θ^(0) and the proposal (10). Output: the Markov chain {θ^(k)}_{k=1}^{K}.

 

1:  Compute π(θ^(0)).
2:  for k = 1 to K do
3:     Sample the candidate θ' using (10).
4:     Compute π(θ') using Kalman or particle methods.
5:     Sample ω_k uniformly over [0, 1].
6:     if ω_k ≤ α(θ', θ^(k−1)) given by (11) then
7:        Accept θ', i.e. θ^(k) = θ'.
8:     else
9:        Reject θ', i.e. θ^(k) = θ^(k−1).
10:     end if
11:  end for
Algorithm 1 Metropolis–Hastings (MH)
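
The following Python sketch implements Algorithm 1 with a symmetric Gaussian random-walk version of the proposal (10), in which case the proposal densities cancel in (11); log_target stands for a point-wise (possibly estimated) evaluation of log π(θ) and is assumed to be supplied by the user.

```python
import numpy as np

def metropolis_hastings(log_target, theta0, num_iters, step_size, rng=None):
    """Algorithm 1 with a symmetric Gaussian random-walk proposal (10)."""
    rng = np.random.default_rng() if rng is None else rng
    theta0 = np.asarray(theta0, dtype=float)
    chain = np.zeros((num_iters + 1, theta0.size))
    chain[0] = theta0
    log_pi = log_target(theta0)                      # Step 1
    for k in range(1, num_iters + 1):                # Step 2
        theta_prop = chain[k - 1] + step_size * rng.standard_normal(theta0.size)
        log_pi_prop = log_target(theta_prop)         # Steps 3-4
        # Steps 5-6: acceptance probability (11); proposal terms cancel by symmetry.
        if np.log(rng.uniform()) < log_pi_prop - log_pi:
            chain[k], log_pi = theta_prop, log_pi_prop   # Step 7: accept
        else:
            chain[k] = chain[k - 1]                      # Step 9: reject
    return chain
```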

An essential point is that the speed of the convergence in (9) depends on how correlated the realisations are: the less correlated they are, the faster the convergence and hence the better the approximation (8) for a given finite number K of realisations. This is well understood in the MCMC literature, where the variance of the stochastic approximation (8) is established as being proportional to the integrated autocorrelation

IF = 1 + 2 ∑_{τ=1}^{∞} ρ_τ,   (12)

where ρ_τ denotes the autocorrelation of the realisations at lag τ. This quantity is also known as the inefficiency factor (IF). In turn, the correlation of the realisations is critically dependent on the choice of the proposal q.
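
For reference, a simple way to estimate (12) from a finite chain is sketched below; the truncation of the sum at the first non-positive autocorrelation is a common heuristic and not necessarily the estimator used for Table 1.

```python
import numpy as np

def inefficiency_factor(chain, max_lag=250):
    """Estimate the IF (12) for a scalar chain as 1 + 2 * sum of autocorrelations."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    var = np.mean(x ** 2)
    iact = 1.0
    for lag in range(1, min(max_lag, len(x) - 1)):
        rho = np.mean(x[:-lag] * x[lag:]) / var
        if rho <= 0.0:            # truncate at the first non-positive autocorrelation
            break
        iact += 2.0 * rho
    return iact
```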

From the work of Girolami and Calderhead (2011), it is known that the mixing (i.e. the autocorrelation) can be greatly improved by including gradient and curvature information about the posterior in the proposal (10). This is especially important when the posterior is non-isotropic, i.e., when some parameters influence the value of the posterior to a larger degree than others. The influence can also vary over the parameter space, which makes local information about the curvature important for increasing the mixing.

3 Hessian estimation

A significant further problem is that the curvature information is difficult to obtain in an efficient manner for general SSMs when particle methods are employed within MH, as discussed in the introduction. The gradient of the log-posterior is comparatively simple to estimate efficiently using Fisher's identity (Cappé et al., 2005),

∇_θ log p(θ | y_{1:T}) = ∇_θ log p(θ) + E[∇_θ log p_θ(x_{0:T}, y_{1:T}) | y_{1:T}],   (13)

where either a Kalman or a particle smoother can be employed to compute or approximate the expectation.
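
The sketch below indicates one way in which (13) can be evaluated once a smoother has produced weighted state trajectories; the interface, with user-supplied gradients of the transition and observation log-densities, is our own choice and only illustrates the structure of the computation.

```python
import numpy as np

def score_fishers_identity(y, x_traj, weights, theta,
                           grad_log_prior, grad_log_f, grad_log_g):
    """Approximate the gradient of the log-posterior via Fisher's identity (13).

    y:       observations y_{1:T}
    x_traj:  (M, T+1) smoothed state trajectories x_{0:T}
    weights: (M,) normalised trajectory weights from the smoother
    """
    score = np.asarray(grad_log_prior(theta), dtype=float)
    num_traj, T_plus_1 = x_traj.shape
    for m in range(num_traj):
        g = np.zeros_like(score)
        for t in range(1, T_plus_1):
            g += grad_log_f(x_traj[m, t], x_traj[m, t - 1], theta)  # transition term
            g += grad_log_g(y[t - 1], x_traj[m, t], theta)          # observation term
        score += weights[m] * g
    return score
```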

Louis' identity (Cappé et al., 2005) can be used in an analogous manner to estimate the negative Hessian of the log-posterior. Unfortunately, the resulting estimator often suffers from large noise sensitivity, which results in frequent loss of positive definiteness. This is problematic as the inverse of the negative Hessian is often included in the proposal (10) as its covariance.

Another approach is to compute an estimate of the Hessian by using pilot runs. This results in a so-called pre-conditioning matrix, which can be used to scale the proposal. However, results from numerical studies in e.g., Girolami and Calderhead (2011), Nemeth et al. (2016) and Dahlin et al. (2015a) indicate that this parameter-independent approach is sub-optimal in terms of mixing.

3.1 Damped limited-memory BFGS

To address these problems, we propose to leverage knowledge from the optimisation literature, where curvature information often is estimated from gradient information. This is the approach used in the highly successful quasi-Newton algorithms for optimising non-linear functions. Again, this is of interest as gradient estimates often are accurate and relatively cheap to compute using particle methods, in comparison with Hessian estimates of the posterior.

In classical mathematical optimisation, the class of so-called quasi-Newton methods (Nocedal and Wright, 2006) were developed to incorporate curvature information into the search direction calculation in order to accelerate convergence. The details of such methods can be easily found in standard references such as Nocedal and Wright (2006), but in essence, these methods utilise gradient and iterate information in order to form estimates of the Hessian, or its inverse.

One of the most celebrated quasi-Newton methods is the BFGS approach, which employs a rank-2 update of the current Hessian estimate B_k to form a better estimate B_{k+1} via the recursion

B_{k+1} = B_k − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k) + (y_k y_k^T)/(y_k^T s_k),   (14)

where s_k and y_k denote the differences between consecutive iterates and between the corresponding gradients, respectively. It can be observed that the Hessian estimate will remain positive definite if the curvature condition y_k^T s_k > 0 holds for all iterations. In an optimisation setting this condition can be guaranteed by a line-search algorithm satisfying the Wolfe conditions, see (Nocedal and Wright, 2006, Chapter 8). Unfortunately, in the MH setting, this cannot be enforced since it would result in a Markov chain that converges to a single point.

To ameliorate this problem, we here employ the so-called damped BFGS method, where y_k in (14) is replaced by r_k via

r_k = ψ_k y_k + (1 − ψ_k) B_k s_k,  with  ψ_k = 1 if s_k^T y_k ≥ 0.2 s_k^T B_k s_k,  and  ψ_k = (0.8 s_k^T B_k s_k)/(s_k^T B_k s_k − s_k^T y_k) otherwise,

which guarantees that s_k^T r_k > 0 and hence that the updated estimate remains positive definite. In addition to the damping term, we further employ a limited-memory (Nocedal and Wright, 2006, Chapter 9) implementation of the damped BFGS approach so that the computational load remains modest.
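
A minimal sketch of the damped update is given below; the damping threshold of 0.2 is the textbook default from Nocedal and Wright (2006) rather than a value taken from this paper, and the function is written for the full-memory case for clarity.

```python
import numpy as np

def damped_bfgs_update(B, s, y, threshold=0.2):
    """One damped BFGS update of the estimated negative Hessian B.

    s is the difference between two iterates and y the difference between the
    corresponding gradients of the objective. Damping replaces y by r so that
    s^T r > 0 and the updated estimate stays positive definite.
    """
    Bs = B @ s
    sBs = float(s @ Bs)
    sy = float(s @ y)
    if sy >= threshold * sBs:
        psi = 1.0
    else:
        psi = (1.0 - threshold) * sBs / (sBs - sy)
    r = psi * y + (1.0 - psi) * Bs
    # Rank-2 update (14) with y replaced by the damped vector r.
    return B - np.outer(Bs, Bs) / sBs + np.outer(r, r) / float(s @ r)
```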

4 Quasi-Newton-based proposals

To construct a good MH proposal, the gradient information together with the Hessian estimate from BFGS is used in the mean and covariance function entering (10). A typical choice, resulting from a second-order Taylor expansion (Dahlin et al., 2015a) of the log-posterior, is

q(θ' | θ^(k−1)) = N(θ'; θ^(k−1) + (ε²/2) Ĥ ĝ(θ^(k−1)), ε² Ĥ),   (15)

where ĝ(θ^(k−1)) denotes the estimated gradient of the log-posterior, Ĥ denotes the inverse of the estimated negative Hessian at θ^(k−1), and ε denotes a step size specified by the user. Hence, we can see the proposal as a local Gaussian approximation of the posterior, which should allow for efficient sampling from it. Another way of motivating (15) is to see it as a random walk on a Riemannian manifold, see Girolami and Calderhead (2011) for details.
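
To illustrate, the snippet below draws a candidate from a proposal of the form (15), given a gradient estimate and a positive-definite estimate of the negative Hessian, and also returns the log-density needed in (11); the interface is our own.

```python
import numpy as np
from scipy.stats import multivariate_normal

def qn_proposal(theta, grad, neg_hessian, step_size, rng=None):
    """Draw a candidate from the Gaussian proposal (15) and return its log-density."""
    rng = np.random.default_rng() if rng is None else rng
    H = np.linalg.inv(neg_hessian)               # local covariance approximation
    mean = theta + 0.5 * step_size ** 2 * H @ grad
    cov = step_size ** 2 * H
    theta_prop = rng.multivariate_normal(mean, cov)
    log_q = multivariate_normal.logpdf(theta_prop, mean=mean, cov=cov)
    return theta_prop, log_q
```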

The inclusion of the BFGS algorithm to estimate the Hessian requires us to make some changes to MH. The main problem is that information from up to M iterations back in the algorithm is used to construct the proposal distribution. In the standard version of MH, only information from the last iteration is allowed to be used in the proposal, due to the Markov property.

To solve this problem, we are required to extend the Markov chain from a first-order chain to an M-th order chain. This allows MH to retain its validity, as discussed by Zhang and Sutton (2011) and Dahlin et al. (2015b). The major algorithmic change to Algorithm 1 is that the gradient and the Hessian are computed in Step 4 using a smoother and Algorithm 2. Moreover, the proposal step in MH is replaced by sampling from

θ' ∼ q(θ' | θ^((k−M):(k−1))),   (16)

where the mean and covariance functions in (10) are formed from the gradient and the damped BFGS estimate of the Hessian computed from the last M iterates, using the procedure in Algorithm 2. Finally, the assignment in Step 9 of Algorithm 1 is changed for the case when the candidate parameter is rejected, since the proposal is now centred around a different element of the extended chain.

Inputs: the last M states θ^((k−M):(k−1)) of the chain together with their gradients and log-target values. Output: the candidate parameter θ'.

 

1:  Extract the unique elements from θ^((k−M):(k−1)) and sort them in ascending order (with respect to the log-target) to obtain θ̃^(1:N).
2:  if N > 1 then
3:     Initialise the Hessian estimate B_0.
4:     for i = 1 to N − 1 do
5:        Calculate s_i and y_i based on the i-th pair of consecutive elements in θ̃^(1:N).
6:        Carry out the update (14) to obtain B_i.
7:     end for
8:     Set Ĥ_k = B_{N−1}^{−1}.
9:  else
10:     Set Ĥ_k to a fixed fallback estimate.
11:  end if
12:  Sample from (16) to obtain θ'.
Algorithm 2 Quasi-Newton proposal
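
A sketch of Steps 1 to 11 of Algorithm 2 is given below, reusing the damped_bfgs_update function from Section 3.1; the initial scaling and the fallback covariance are placeholders rather than the values used in the paper.

```python
import numpy as np

def build_qn_hessian(thetas, grads, log_targets, init_scale=0.01, fallback_cov=None):
    """Approximate the inverse negative Hessian from the memory of the chain.

    thetas, grads: (M, p) arrays of stored parameters and gradients of the
    log-target; log_targets: (M,) array of the corresponding log-target values.
    """
    p = thetas.shape[1]
    # Step 1: unique parameters, sorted in ascending order of the log-target.
    _, first_idx = np.unique(thetas, axis=0, return_index=True)
    order = sorted(first_idx, key=lambda i: log_targets[i])
    if len(order) < 2:
        # Step 10: too little information, fall back to a fixed covariance.
        return fallback_cov if fallback_cov is not None else init_scale * np.eye(p)
    B = np.eye(p) / init_scale                    # Step 3: initial Hessian estimate
    for j in range(1, len(order)):                # Steps 4-7: damped updates of (14)
        s = thetas[order[j]] - thetas[order[j - 1]]
        # Estimate the negative Hessian of the log-target, so negate the
        # difference of the log-target gradients.
        y = -(grads[order[j]] - grads[order[j - 1]])
        B = damped_bfgs_update(B, s, y)
    return np.linalg.inv(B)                       # Step 8: invert the final estimate
```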

5 Numerical illustrations

We provide three numerical illustrations to gain understanding about the proposed algorithm and compare it to other alternatives in the literature. The implementation details are summarised in Appendix A and the source code can be downloaded as described in Section 6.

5.1 LGSS model with synthetic data

We begin by considering a linear Gaussian state-space (LGSS) model as it is possible to solve the state inference problem exactly using the Kalman smoother. This enables us to compute the log-posterior and its gradients using exact recursions, which will give an indication about the optimal performance of various MH proposals. The model is given by

x_{t+1} | x_t ∼ N(x_{t+1}; μ + φ(x_t − μ), σ_v²),   (17a)
y_t | x_t ∼ N(y_t; x_t, σ_e²),   (17b)

with unknown parameters θ = {μ, φ, σ_v}. A synthetic data set consisting of a realisation with T observations is simulated from the model using a set of true parameters θ*.
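
For completeness, a sketch of simulating synthetic data from a model of the form (17) is given below; the parameter values at the bottom are placeholders for illustration and the stationary initialisation of the state is an assumption.

```python
import numpy as np

def simulate_lgss(T, mu, phi, sigma_v, sigma_e, rng=None):
    """Simulate T observations from an LGSS model of the form (17)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.zeros(T + 1)
    y = np.zeros(T)
    # Assumed stationary initialisation of the state.
    x[0] = mu + sigma_v / np.sqrt(1.0 - phi ** 2) * rng.standard_normal()
    for t in range(T):
        x[t + 1] = mu + phi * (x[t] - mu) + sigma_v * rng.standard_normal()
        y[t] = x[t + 1] + sigma_e * rng.standard_normal()   # observe the new state
    return x, y

# Placeholder parameter values for illustration only.
states, observations = simulate_lgss(T=500, mu=0.5, phi=0.9, sigma_v=0.3, sigma_e=0.1)
```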

We make use of repeated Monte Carlo simulations on the same data to compute the IF (12) and to estimate the computational time for the different proposals. Table 1 summarises the median results from these simulations, which include the acceptance rate, the fraction of Hessian estimates that are corrected, the maximum IF over the parameters, the time required per iteration and the time required to obtain one effective sample from the posterior. For the IF, we also provide the IQR (the distance between the 25% and 75% quantiles).

Alg.     Reg.   Acc.   Cor.   max IF (IQR)   Time/iter.   Time/samp.

Kalman

pMH0     -      0.12   -                     0.62         55
pMH1     -      0.37   -                     1.14         130
dBFGS    -      0.76   -                     2.16
iBFGS    flip   0.52   0.95                  1.83         77
iBFGS    reg    0.45   0.95                  1.77         61
iBFGS    hyb    0.60   0.95                  1.84         76
eBFGS    hyb    0.62   1.00                  1.89         64

Particles

pMH0     -      0.04   -                     0.3          211
pMH1     -      0.25   -                     0.3          103
dBFGS    -      0.32   -                     0.3
iBFGS    hyb    0.33   0.93                  0.3          27

Table 1: Performance statistics as the median over the Monte Carlo runs for different proposals in MH. Acc. denotes the acceptance rate, Cor. the fraction of corrected Hessian estimates, and the time columns the time per iteration and per effective sample (in milliseconds for the Kalman runs and in seconds for the particle runs).

The BFGS-type proposals are benchmarked against pre-conditioned versions of MH0 and MH1, denoted pMH0 and pMH1, respectively. In pMH1, we set the covariance function in (10) to ε² Σ̂, where Σ̂ denotes an estimate of the posterior covariance computed using pilot runs. In pMH0, we use the same approach as for pMH1 but also set the gradient term in the mean function to zero.

Figure 1: The posterior estimates (left), trace plots (centre) and ACF estimates (right) for one of the parameters using pMH0 (top), pMH1 (middle) and dBFGS (bottom). The dashed lines in the left and centre plots indicate the estimated posterior mean. The dashed lines in the right plots indicate the confidence intervals. The grey lines in the left plots indicate the prior distribution.

Three different BFGS proposals are used: (d)amped, (i)gnoring the curvature condition and (e)nforcing the curvature condition. For the latter two, the Hessian estimates are often negative definite and therefore require some correction. We apply the three different methods outlined in Appendix B: (flip)ping the negative eigenvalues, (reg)ularising the estimate and the (hyb)rid method.

The time per effective sample (in milliseconds) is presented in the right-most column of the table. We note that the proposed method based on damped BFGS, which approximates the curvature information locally, performs the best and requires the least time to produce one effective sample from the posterior, in particular less than pMH0, which would be the standard approach in this setting. Furthermore, we note that the other BFGS-type proposals require a large number of corrections of the Hessian, which is not desirable as this might introduce numerical instability.

Figure 1 presents the posterior estimate and the Markov chain trace with its corresponding ACF for a particular case in the simulation study. We note that the mixing is much better for the BFGS-based proposal (bottom) compared with pMH0 (top) and pMH1 (middle). Furthermore, the ACF for the BFGS-based proposal exhibits quite different behaviour compared with the other two proposals, due to the memory introduced into the Markov chain. Comparing the posterior estimates, we conclude that the proposed method generates good estimates centred around the correct parameter and with reasonable variance.

5.2 LGSS model revisited

We repeat the same experiment with the Kalman filter and smoother replaced by a particle filter and a fixed-lag particle smoother as described in Dahlin et al. (2015a). Again, Table 1 summarises the results, with all timings now expressed in seconds. pMH0 and pMH1 perform worse in this case due to the noise in the estimates of the log-posterior and its gradients.

However, the proposed method performs well and requires the least time to generate one effective sample from the posterior. This is a substantial decrease compared with the other methods, and in particular an acceleration by at least a factor of two compared with the quasi-Newton approach proposed by Zhang and Sutton (2011) and Dahlin et al. (2015b). Hence, the proposed method outperforms both the standard approaches to designing a proposal and the current state of the art in quasi-Newton proposals.

Figure 2: Top: the log-returns (green) for Bitcoin and their estimated confidence intervals (orange) using the model, together with the estimate of the log-volatility. Bottom: the posterior estimates for μ, φ, σ_v and ρ obtained by dBFGS. The dotted and grey lines indicate the estimated posterior mean and the prior distributions, respectively.

5.3 SV model with Bitcoin price data

To demonstrate a practical application of the proposed method, we consider the problem of estimating the volatility of Bitcoin prices between November 7, 2015 and November 7, 2017. The log-returns (the change in percent of the Bitcoin price between two days) are presented in the upper part of Figure 2 as green dots. Note that the log-returns have zero mean but that their variance changes over time. We aim to capture this time-varying variance (known as the volatility) by the model

x_{t+1} = μ + φ(x_t − μ) + σ_v v_t,   y_t = exp(x_t / 2) e_t,

where v_t and e_t are standard Gaussian noise variables with correlation ρ. This is a so-called stochastic volatility (SV) model with leverage. Here, the unknown parameters are the same as in the LGSS model together with the correlation ρ, i.e., θ = {μ, φ, σ_v, ρ}. The aim is to estimate the log-volatility x_t given the data, which can be done by marginalising over the posterior estimate of the parameters, see Andrieu et al. (2010) or Dahlin and Schön (2017).
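
A sketch of simulating from an SV model with leverage of the form written above is given below, with the state and observation noises drawn jointly with correlation ρ; all numerical choices are placeholders.

```python
import numpy as np

def simulate_sv_leverage(T, mu, phi, sigma_v, rho, rng=None):
    """Simulate log-returns y and log-volatility x from an SV model with leverage."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.zeros(T + 1)
    y = np.zeros(T)
    x[0] = mu + sigma_v / np.sqrt(1.0 - phi ** 2) * rng.standard_normal()
    noise_cov = np.array([[1.0, rho], [rho, 1.0]])
    for t in range(T):
        v, e = rng.multivariate_normal(np.zeros(2), noise_cov)
        x[t + 1] = mu + phi * (x[t] - mu) + sigma_v * v   # log-volatility
        y[t] = np.exp(0.5 * x[t]) * e                     # log-return
    return x, y

# Placeholder parameter values for illustration only.
log_vol, returns = simulate_sv_leverage(T=730, mu=1.0, phi=0.95, sigma_v=0.2, rho=-0.1)
```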

The resulting estimate of the log-volatility (the latent state) is presented in the middle plot of Figure 2. Note that it varies over the time period and is large when the Bitcoin prices are volatile and exhibit large day-to-day changes. Furthermore, the mean of the log-volatility process is quite large, which means that the volatility typically is large (compared with, e.g., prices of stocks). The estimated correlation ρ is probably quite close to zero. This is quite different from the negative values typical for stocks, where large drops in stock prices raise the volatility (as investors sell their assets).

This information is very useful in many financial applications, such as pricing futures on Bitcoin, as well as for computing various risk measures that banks and financial institutions are required to report to regulatory agencies.

6 Conclusions

The numerical illustrations indicate that the proposed method can outperform many existing methods used to create good proposals for MH. Furthermore, we would like to again underline that the proposed method requires essentially no pilot runs, which are required for all pre-conditioned methods. Moreover, the damped BFGS approach always provides a positive definite estimate of the Hessian, so no potentially numerically unstable Hessian correction is required.

Furthermore, the gradient information is crucial for estimation in high-dimensional parameter spaces, which are common in SSMs and in transfer function models. All these benefits could potentially allow for a wider adoption of MH for identifying dynamical systems.

There are plenty of interesting avenues for future work within the scope of this paper. SR1 updates (Nocedal and Wright, 2006, Ch. 6.2) are an alternative to BFGS and are known to provide more accurate estimates of the Hessian in many cases. Furthermore, trust-region approaches from optimisation could potentially be useful in MH to protect against problems with numerical stability. Finally, more extensive numerical evaluations are required for models with larger parameter spaces. In this case, alternatives to, or better algorithms for, particle smoothing are required to obtain reasonable gradient estimates.

The source code and data used in this paper are available from GitHub https://github.com/compops/qnmh-sysid2018/ and via Docker (see README.md).

References

  • Andrieu and Thoms [2008] C. Andrieu and J. Thoms. A tutorial on adaptive MCMC. Statistics and Computing, 18(4):343–373, 2008.
  • Andrieu et al. [2010] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
  • Cappé et al. [2005] O. Cappé, E. Moulines, and T. Rydén. Inference in hidden Markov models. Springer Verlag, 2005.
  • Dahlin and Schön [2017] J. Dahlin and T. B. Schön. Getting started with particle Metropolis-Hastings for inference in nonlinear dynamical models. Journal of Statistical Software, 2017. In press.
  • Dahlin et al. [2015a] J. Dahlin, F. Lindsten, and T. B. Schön. Particle Metropolis-Hastings using gradient and Hessian information. Statistics and Computing, 25(1):81–92, 2015a.
  • Dahlin et al. [2015b] J. Dahlin, F. Lindsten, and T. B. Schön. Quasi-Newton particle Metropolis-Hastings. In Proceedings of the 17th IFAC Symposium on System Identification (SYSID), pages 981–986, Beijing, China, October 2015b.
  • Doucet and Johansen [2011] A. Doucet and A. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan and B. Rozovsky, editors, The Oxford Handbook of Nonlinear Filtering. Oxford University Press, 2011.
  • Doucet et al. [2015] A. Doucet, M. K. Pitt, G. Deligiannidis, and R. Kohn. Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. Biometrika, 102(2):295–313, 2015.
  • Durbin and Koopman [2012] J. Durbin and S. J. Koopman. Time series analysis by state space methods. Oxford University Press, 2 edition, 2012.
  • Girolami and Calderhead [2011] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):1–37, 2011.
  • Ljung [1999] L. Ljung. System identification: theory for the user. Prentice Hall, 1999.
  • Nemeth et al. [2016] C. Nemeth, C. Sherlock, and P. Fearnhead. Particle Metropolis-Adjusted Langevin Algorithms. Biometrika, 103(3):701–717, 2016.
  • Nocedal and Wright [2006] J. Nocedal and S. Wright. Numerical optimization. Springer Verlag, 2 edition, 2006.
  • Peterka [1981] V. Peterka. Bayesian system identification. Automatica, 17(1):41–53, 1981.
  • Poyiadjis et al. [2011] G. Poyiadjis, A. Doucet, and S. S. Singh. Particle approximations of the score and observed information matrix in state space models with application to parameter estimation. Biometrika, 98(1):65–80, 2011.
  • Robert [2007] C. P. Robert. The Bayesian choice. Springer Verlag, 2007.
  • Robert and Casella [2004] C. P. Robert and G. Casella. Monte Carlo statistical methods. Springer Verlag, 2 edition, 2004.
  • Roberts and Rosenthal [1998] G. O. Roberts and J. S. Rosenthal. Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268, 1998.
  • Schön et al. [2011] T. B. Schön, A. Wills, and B. Ninness. System identification of nonlinear state-space models. Automatica, 47(1):39–49, 2011.
  • Zhang and Sutton [2011] Y. Zhang and C. A. Sutton. Quasi-Newton methods for Markov chain Monte Carlo. In Proceedings of the 2011 Conference on Neural Information Processing Systems (NIPS), Granada, Spain, December 2011.

Appendix A Implementation details

In Sections 5.1 and 5.2, we use a standard Kalman filter together with the RTS smoother to compute the log-posterior and its gradients. Furthermore, a bootstrap particle filter and a fixed-lag particle smoother are employed, see Dahlin et al. [2015a] for an algorithmic description. The number of particles is selected using the results in Doucet et al. [2015]. We initialise all MH algorithms at the true parameters for simplicity, run them for a fixed number of iterations and discard an initial portion of each chain as burn-in.

The pre-conditioning matrices for MH0 and MH1 are computed using a number of pilot runs. The step lengths are selected using existing rules of thumb [Roberts and Rosenthal, 1998], with different values when using Kalman methods and when using particle methods [Nemeth et al., 2016]. For the qMH approaches, we use a fixed memory length M and make use of a random walk proposal during an initial phase, with a common step length for all three parameters. A common step size ε is used for all qMH algorithms after these initial iterations.

A reparametrisation of the LGSS model is carried out to make all the parameters in the Markov chain unconstrained (able to assume any real value), by mapping φ and σ_v to unconstrained counterparts via inverse hyperbolic tangent and logarithm transforms, respectively, so that the transformed parameters are the new states of the Markov chain. This change of variables introduces a Jacobian term into the acceptance probability, see Dahlin and Schön [2017, Section 6.3.2]. Finally, we use prior densities in which φ is assigned a truncated Gaussian distribution and σ_v a Gamma distribution.
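
Assuming the transforms just described, the following sketch shows how the change of variables and its Jacobian log-determinant enter the log-target that MH operates on; the callable log_posterior is a placeholder for an evaluation of the original (constrained) log-posterior.

```python
import numpy as np

def to_unconstrained(mu, phi, sigma_v):
    """Map (mu, phi, sigma_v) to unconstrained coordinates (assumed transforms)."""
    return np.array([mu, np.arctanh(phi), np.log(sigma_v)])

def log_target_unconstrained(eta, log_posterior):
    """Evaluate the log-target in the unconstrained parameterisation.

    eta = (mu, arctanh(phi), log(sigma_v)); the Jacobian log-determinant of the
    inverse transform is added so that MH targets the correct distribution.
    """
    mu, phi_u, log_sv = eta
    phi, sigma_v = np.tanh(phi_u), np.exp(log_sv)
    # d(phi)/d(phi_u) = 1 - tanh(phi_u)^2 and d(sigma_v)/d(log_sv) = sigma_v.
    log_jacobian = np.log(1.0 - phi ** 2) + log_sv
    return log_posterior(mu, phi, sigma_v) + log_jacobian
```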

In Section 5.3, we make use of the same settings as for the LGSS model but with an increased number of particles. Moreover, we change the priors slightly (one of them is kept as before) so that they better reflect the parameter values usually found in real-world data. The correlation ρ is reparametrised in the same manner as φ. The Bitcoin data is computed as y_t = 100 (log s_t − log s_{t−1}), where s_t denotes the daily exchange rate versus the US Dollar obtained from https://www.quandl.com/BITSTAMP/USD.
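
A sketch of this computation is given below; the array of daily rates is a placeholder and the data loading from Quandl is omitted.

```python
import numpy as np

def log_returns(prices):
    """Log-returns in percent: y_t = 100 * (log s_t - log s_{t-1})."""
    s = np.asarray(prices, dtype=float)
    return 100.0 * np.diff(np.log(s))

# Hypothetical daily exchange rates; in the paper these come from Quandl.
daily_rates = np.array([420.0, 425.5, 431.2, 428.9])
y = log_returns(daily_rates)
```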

Appendix B Hessian corrections

The first approach regularises the estimated negative Hessian by adding a scaled identity matrix whose scaling dominates |λ_min|, with λ_min denoting the smallest (negative) eigenvalue. This shifts all eigenvalues to be positive. The second method flips the eigenvalues, replacing the estimate by V |Λ| V^T, where V and |Λ| denote the matrix of eigenvectors and the diagonal matrix of the absolute values of the eigenvalues of the estimate, respectively. The third approach is the hybrid method from Dahlin et al. [2015a], in which the estimate is replaced by a global approximation of the posterior covariance Σ̂, where Σ̂ denotes the sample estimate of the posterior covariance computed using the latter half of the burn-in phase.
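
The three corrections can be sketched as follows; the regularisation constant eps and the convention that the hybrid branch returns the inverse of the global covariance (so that every branch returns a corrected negative Hessian) are our own interface choices.

```python
import numpy as np

def correct_hessian(neg_hessian, method="hyb", global_cov=None, eps=1e-6):
    """Make a possibly indefinite negative-Hessian estimate positive definite."""
    eigvals, eigvecs = np.linalg.eigh(neg_hessian)
    if np.all(eigvals > 0.0):
        return neg_hessian                           # already positive definite
    if method == "reg":
        # Shift all eigenvalues so that they become positive.
        shift = np.abs(eigvals.min()) + eps
        return neg_hessian + shift * np.eye(neg_hessian.shape[0])
    if method == "flip":
        # Replace the eigenvalues by their absolute values.
        return eigvecs @ np.diag(np.abs(eigvals)) @ eigvecs.T
    if method == "hyb":
        # Fall back to a global approximation of the posterior covariance.
        return np.linalg.inv(global_cov)
    raise ValueError("unknown correction method")
```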