On the Computational Complexity of Geometric Langevin Monte Carlo

Manifold Markov chain Monte Carlo algorithms have been introduced to sample more effectively from challenging target densities exhibiting multiple modes or strong correlations. Such algorithms exploit the local geometry of the parameter space, thus enabling chains to achieve a faster convergence rate when measured in number of steps. However, often acquiring local geometric information increases computational complexity per step to the extent that sampling from high-dimensional targets becomes inefficient in terms of total computational time. This paper analyzes the computational complexity of manifold Langevin Monte Carlo and proposes a manifold adaptive Monte Carlo sampler aimed at balancing the benefits of exploiting local geometry with computational requirements to achieve a high effective sample size for a given computational cost. The suggested strategy randomly switches between a local geometric and an adaptive proposal kernel via a schedule to regulate the frequency of manifold-based updates. An exponentially decaying schedule is put forward that enables more frequent updates of geometric information in early transient phases of the chain, while saving computational time in late stationary phases. The average complexity can be manually set depending on the need for geometric exploitation posed by the underlying model.


1 Introduction

Geometric Markov chain Monte Carlo (MCMC) dates back to the work of dua_ken_pen__hyb, which introduced Hamiltonian Monte Carlo (HMC) to unite MCMC with molecular dynamics. Statistical applications of HMC began with its use in neural network models by nea__bay.

Meanwhile, the Metropolis-adjusted Langevin algorithm (MALA) was proposed by rob_ros__opt to employ Langevin dynamics for MCMC sampling. Both HMC and MALA evaluate the gradient of the target density, thereby utilizing local geometric flow.

gir_cal__rie introduced new differential geometric MCMC methods. Given a state θ, gir_cal__rie defines a distance between two probability densities p(y; θ) and p(y; θ + δθ) as the quadratic form δθᵀ G(θ) δθ for an arbitrary metric G(θ). Thus, the position-specific metric G(θ) induces a Riemann manifold in the space of parameterized probability density functions. gir_cal__rie uses G(θ) to define proposal kernels that explore the state space effectively by introducing Riemann manifold Langevin and Hamiltonian Monte Carlo methods.

Computing the geometric entities of differential geometric MCMC methods creates a performance bottleneck that restricts the applicability of the involved methods. For example, manifold MCMC methods require calculating the metric tensor G(θ) of choice. Typically, G(θ) is set to be the observed Fisher information matrix, which equals the negative Hessian of the log-target density at state θ. Consequently, the complexity of manifold MCMC algorithms is dominated by Hessian-related computations, such as the gradient of G(θ) or the inverse of the Hessian.

gir_cal__rie constructed the simplified manifold Metropolis-adjusted Langevin algorithm (SMMALA) that is of the same order of complexity per Monte Carlo iteration but faster than MMALA and RMHMC for target distributions of low complexity. The faster speed of SMMALA over MMALA and RMHMC is explained by lower order terms and constant factors appearing in big-oh notation, which are ordinarily omitted but affect runtime in the case of less costly targets.

SMMALA has been used in conjunction with population MCMC for the Bayesian analysis of mechanistic models based on systems of non-linear differential equations, see cal_gir__sta ; sch_pap__ews . Despite the capacity of SMMALA to exploit local geometric information, and thereby cope with non-linear correlations at a modest increase in the complexity of the target density, for more expensive targets its computational complexity can render its performance inferior to that of other algorithms, such as the Metropolis-adjusted Langevin algorithm (MALA) or adaptive MCMC, see cal_eps_sil__bay .

Various attempts have been made to ameliorate the computational implications of geometric MCMC methods. Along these lines, lan_tha_chr__emu used Gaussian processes to emulate the Hessian matrix and Christoffel symbols associated with the observed Fisher information . sim_bad_cem__sto developed a stochastic quasi-Newton Langevin Monte Carlo algorithm which takes into account the local geometry, while approximating the inverse Hessian by using a limited history of samples and their gradients. Alternatively, per__prox used convex analysis and proximal techniques instead of differential calculus in order to construct a Langevin Monte Carlo method for high-dimensional target distributions that are log-concave and possibly not continuously differentiable.

The present paper serves two purposes. Initially, it studies the computational complexity of geometric Langevin Monte Carlo algorithms. Subsequently, it develops the so-called geometric adaptive Monte Carlo (GAMC) sampling scheme based on a random environment of geometric and adaptive proposal kernels. For expensive targets, a carefully selected random environment of a local geometric Langevin kernel and of adaptive Metropolis kernels can give rise to a sampler with higher sampling efficacy than a sampler based on the local geometric Langevin kernel alone.

2 Background

The role of this section is to provide a brief overview of Langevin Monte Carlo and of adaptive Metropolis, which will be combined in later sections to construct a Monte Carlo sampling method in random environment.

To establish notation, consider a Polish space equipped with the Borel σ-algebra. Let be a sequence in and a subsequence of , where , and denote non-negative integers. Moreover, let be a kernel from to .

It is assumed that for every subsequence of , measure is absolutely continuous with respect to some measure on . Such an assumption ensures existence of the Radon-Nikodym derivative so that

(1)

for any .

The Radon-Nikodym derivative is used in the paper as a proposal density in geometric or adaptive Monte Carlo methods to sample from a possibly unnormalized target density . In the context of Monte Carlo sampling, and are called chain and proposal kernel, respectively.

2.1 Basics of Langevin Monte Carlo

Langevin Monte Carlo (LMC) is a case of Metropolis-Hastings. The normal proposal density at the -th iteration of LMC is given by

(2)

where and denote the state at the -th iteration and the proposed state, respectively. is a positive definite matrix of size and refers to a tuning parameter known as the integration stepsize. The location is a function of , and , whereas the covariance of the proposal kernel depends on and . Both and are defined so that the proposed states admit a Langevin diffusion approximated by a first-order Euler discretization. The Metropolis-Hastings acceptance probability is set to its standard form

(3)

if , and otherwise.

The proposal kernel corresponding to the normal density of equation (2) is defined by setting , , and to be the Lebesgue measure in equation (1).

The integration stepsize, also known as the drift step, is associated with the first-order Euler discretization and significantly affects the rate of state space exploration. If the stepsize is selected to be relatively large, many of the proposed candidates will be far from the current state and are likely to have a low probability of acceptance, so the chain with proposal kernel will have a low acceptance rate. Reducing the stepsize increases the acceptance rate, but the chain takes longer to traverse the state space.

In a Bayesian setting, the target is a possibly unnormalized posterior density , where denotes the available data. Replacing by in (3) makes Langevin Monte Carlo applicable in Bayesian problems.

To fully specify a Langevin Monte Carlo algorithm, the location and covariance of normal proposal (2) need to be defined. In what follows, variations of geometric Langevin Monte Carlo methods are distinguished by their respective proposal location and covariance.

2.2 Metropolis-adjusted Langevin algorithm

If is not a function of the state , the Metropolis-adjusted Langevin algorithm (MALA) arises as

(4)
(5)

The matrix is known as the preconditioning matrix, see rob_stra__lan . It is typically set to be the identity matrix, in which case MALA is defined in its conventional form.

MALA uses the gradient flow to make proposals effectively. According to the theoretical analysis of rob_ros__opt , the optimal scaling has been found to be the value of the drift step which yields a limiting acceptance rate of 0.574 in high-dimensional parametric spaces (as the dimension tends to infinity).
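
To make the mechanics concrete, the following is a minimal Julia sketch of one MALA transition, combining the proposal of (4)-(5) with the acceptance step (3); the function names logtarget and gradlogtarget, as well as the cached Cholesky factor L of the preconditioning matrix M, are illustrative assumptions rather than part of any implementation discussed in this paper.

using LinearAlgebra

function mala_step(θ, logtarget, gradlogtarget, ε, M, L)
    # L is the lower Cholesky factor of M, computed once and reused since M is constant
    μ = θ + 0.5 * ε^2 * (M * gradlogtarget(θ))        # proposal location, cf. (4)
    θprop = μ + ε * (L * randn(length(θ)))            # draw from a normal with covariance ε²M, cf. (5)
    μback = θprop + 0.5 * ε^2 * (M * gradlogtarget(θprop))  # location of the reverse proposal
    # Log densities of the forward and reverse Gaussian proposals (normalizing constants cancel)
    logq_fwd  = -0.5 / ε^2 * dot(θprop - μ, M \ (θprop - μ))
    logq_back = -0.5 / ε^2 * dot(θ - μback, M \ (θ - μback))
    logratio = logtarget(θprop) - logtarget(θ) + logq_back - logq_fwd
    return log(rand()) < logratio ? θprop : θ         # Metropolis-Hastings accept/reject, cf. (3)
end

Because M is constant, its Cholesky factor and inverse can be computed once and cached, in line with the caching discussed in section 3.2.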

2.3 Manifold Langevin Monte Carlo

It is possible to incorporate further geometric structure in the form of a position-dependent metric, see gir_cal__rie ; xif_she_liv__lan . The Langevin diffusion is then defined on a Riemann manifold endowed with the metric. At the -th iteration of the associated manifold Metropolis-adjusted Langevin algorithm (MMALA), candidate states are drawn from a normal proposal density with location and covariance given by

(6)
(7)

where the -th coordinate of is

(8)

, and in (8) denote the respective -th coordinate of , -th element of and -th element of .

As seen from (8), this term increases the computational complexity of operations on the proposal density for target densities with a high number of dimensions or with high correlation between parameters. To reduce the computational cost, the term can be dropped from (6), simplifying the proposal location to

(9)

The method with location and covariance specified by (9) and (7) is known as simplified Metropolis-adjusted Langevin algorithm (SMMALA).

The optimal stepsize for MMALA and SMMALA is empirically suggested by gir_cal__rie to be set so as to obtain an acceptance rate of around 70%; this choice has not yet been analyzed theoretically in a manner analogous to the analysis of MALA scaling by rob_ros__opt .
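
As an illustration, the following Julia sketch forms the SMMALA proposal of (9) and (7), assuming that the position-specific metric is supplied as a function returning the negative Hessian of the log-target; the names metric and gradlogtarget are placeholders, not part of the paper's software.

using LinearAlgebra

function smmala_proposal(θ, gradlogtarget, metric, ε)
    G = metric(θ)                                # position-specific metric, e.g. the negative Hessian
    C = cholesky(Symmetric(G))                   # cubic-cost factorization, reused below
    μ = θ + 0.5 * ε^2 * (C \ gradlogtarget(θ))   # simplified proposal location, cf. (9)
    θprop = μ + ε * (C.U \ randn(length(θ)))     # draw from a normal with covariance ε² G⁻¹, cf. (7)
    return μ, θprop
end

The single Cholesky factorization of the metric serves both the linear solve in the location and the sampling of the proposal, which is the linear algebra cost analyzed in section 3.2.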

2.4 Synopsis of Langevin Monte Carlo algorithms

The proposal mechanisms of MALA, SMMALA and MMALA define valid MCMC methods that converge to the target distribution. In fact, if the metric is constant, then the metric-derivative term in (8) vanishes, so each of SMMALA and MMALA coincides with preconditioned MALA.

Each of the three Langevin Monte Carlo samplers incorporates a different amount of local geometry in the proposal mechanism. MALA makes use only of the gradient of the log-target. SMMALA relies additionally on the position-specific metric tensor. MMALA takes into account the curvature of the manifold induced by the metric, which implies calculating the metric derivatives in (8). Depending on the characteristics of the manifold and its curvature, the proposals of the three Langevin Monte Carlo samplers may exhibit varying efficiency in converging to the target distribution, thus leading to differences in effective sample size.

Increasing the inclusion of local geometry in the proposal mechanism escalates computational complexity. More specifically, MALA, SMMALA and MMALA require computing first-, second- and third-order derivatives of the target, respectively. It is thus clear that there is a trade-off between geometric exploitation of the target from within the proposal density and the associated complexity of the proposal density, which translates to a trade-off between effective sample size and runtime for MALA, SMMALA and MMALA.

2.5 Basics of adaptive Metropolis

Sampling methods that propose samples by using past values of the chain, thus breaking the Markov property, are referred to as adaptive Monte Carlo. The first adaptive Metropolis (AM) algorithm, as introduced by haa_sak_tam__ana , used a proposal kernel based on the empirical covariance matrix of the whole chain at each iteration.

In its first appearance in haa_sak_tam__ana , the AM algorithm was defined for target densities of bounded support to ensure convergence. rob_ros__exa extended AM to work with targets of unbounded support by suggesting a mixture proposal density also based on the empirical covariance matrix of the whole chain at each iteration.

Several variations of the AM algorithm have appeared. For example, haa_lai_mir__dra combined adaptive Metropolis and delayed rejection methodology to construct the delayed rejection adaptive Metropolis (DRAM) sampler, which outperforms its constituent methods in certain situations. More recently, vih__rob introduced the so-called robust adaptive Metropolis (RAM) algorithm, which scales the empirical covariance matrix of the chain to yield a desired mean acceptance rate, typically in multidimensional settings. A thorough overview of adaptive Metropolis methods can be found in gri_wal__ona .

The ergodicity properties of adaptive Monte Carlo were studied in and_mou__ont ; rob_ros__cou . In particular, rob_ros__cou defined the diminishing adaptation and bounded convergence conditions. Their joint satisfaction ensures asymptotic convergence to the target distribution. Thus, diminishing adaptation and bounded convergence provide a useful machinery for constructing adaptive Monte Carlo algorithms.

2.6 The first adaptive Metropolis algorithm

Consider a chain up to iteration generated by AM. haa_sak_tam__ana defined the proposal density of AM for the next candidate state to be normal with mean equal to the current point and covariance based on the empirical covariance matrix

(10)

of the whole history. The constant is set to a small positive value to constrain the empirical covariance within fixed bounds, thereby ensuring convergence for target densities of bounded support. The tuning parameter, which may only depend on the dimension, allows the covariance of the proposal density to be scaled.

It follows from (10) that the empirical covariance at the -th AM iteration can be calculated recursively as

(11)

The sample mean in (11) is also calculable recursively according to

(12)

The recursive equations (11) and (12) make the empirical covariance and sample mean of the chain computationally tractable for an arbitrarily large chain length.
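
For concreteness, a minimal Julia sketch of such recursive moment updates is given below. It follows the standard Welford-style recursion and omits the AM-specific scaling constant and regularization term of (10), so it should be read as a sketch in the spirit of (11) and (12) rather than a verbatim transcription of them.

function update_moments(xbar, C, x, k)
    # xbar, C: sample mean and covariance of the first k-1 states (C may start as a zero matrix);
    # x: the k-th state of the chain; valid for k >= 2
    xbar_new = xbar + (x - xbar) / k
    C_new = ((k - 2) * C + (x - xbar) * (x - xbar_new)') / (k - 1)
    return xbar_new, C_new
end

Each update touches the state once and the covariance matrix once, which is what makes the empirical moments tractable for arbitrarily long chains.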

2.7 Adaptive Metropolis with mixture proposal

rob_ros__exa initiated AM with proposal density at iteration given by

(13)

The acceptance probability for AM in rob_ros__exa is

(14)

if , and otherwise.

The proposal kernel corresponding to the mixture density of equation (13) is defined by setting , , , and to be the Lebesgue measure in equation (1).

The first component of the mixture is updated adaptively, using the whole chain history in the calculation of the empirical covariance matrix, while the second component is introduced to stabilize the algorithm. A small positive constant in (13) ensures convergence for a large family of target densities of unbounded support, including those that are log-concave outside some arbitrary bounded region. Two tuning parameters appear in (13), which may only depend on the dimension; each of them allows the covariance of the respective mixture component to be scaled.
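
A minimal Julia sketch of drawing from a two-component mixture of this kind follows; the particular scaling constants shown are the usual choices of rob_ros__exa and serve here only as illustrative assumptions, since (13) is not reproduced above.

using LinearAlgebra

function am_mixture_draw(x, Σ, β)
    d = length(x)
    if rand() < β
        return x + (0.1 / sqrt(d)) * randn(d)        # fixed stabilizing component
    else
        L = cholesky(Symmetric(Σ)).L                 # adaptive component built on the empirical covariance Σ
        return x + (2.38 / sqrt(d)) * (L * randn(d))
    end
end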

3 Complexity bounds

This section provides upper bounds for the computational complexity of geometric Langevin Monte Carlo. It concludes with a short overview of the computational cost of adaptive Metropolis algorithms.

3.1 Complexity bounds for differentiation

Calculations associated with the proposal and target densities determine the computational cost of Langevin Monte Carlo methods. In brief, the main computational requirements include sampling from and evaluating the proposal, as well as evaluating the target and its derivatives.

According to (5), the proposal covariance of MALA is constant, therefore it is derivative-free. On the other hand, the proposal covariance of SMMALA and MMALA introduced in (7) requires computing the negative Hessian of the log-target as the position-specific metric. As seen from (4), (9) and (6), the proposal locations of MALA, SMMALA and MMALA entail the gradient, the Hessian and the Hessian derivatives of the log-target, respectively. Hence, fully specifying the proposal densities of MALA, SMMALA and MMALA requires up to first-, second- and third-order derivatives of the log-target, respectively.

Assume an -dimensional log-target with complexity , with the notation indicating dependence of the complexity on the dimension. Considering the highest order of log-target differentiation associated with each sampler, the costs incurred for target-related evaluations in MALA, SMMALA and MMALA grow with the cost of first-, second- and third-order differentiation of the log-target, respectively.

3.2 Complexity bounds for linear algebra

Having computed the log-target and its derivatives, the Langevin Monte Carlo normal proposal (2) is available to sample from and evaluate. The major computational cost of evaluating and sampling from the normal proposal (2) is related to linear algebra calculations, namely to the inversion and Cholesky decomposition of the proposal covariance .

A candidate state can be sampled from the normal proposal (2) of a Langevin Monte Carlo method with mean and covariance in the Cholesky approach, by setting the candidate equal to the mean plus the lower Cholesky factor of the covariance times a standard normal vector, see chi_gre__und . So, sampling from the proposal has cubic complexity in the dimension, since it requires the inversion of the metric and the Cholesky decomposition of the resulting covariance, each of which is a cubic-cost operation for standard algorithms.

The acceptance probability (3) of SMMALA and MMALA requires evaluating the normal proposal (2). As has become apparent, a proposal density evaluation has cubic complexity due to the inversion of the metric needed by the proposal covariance.

To be precise, the normal proposal density (2) must be evaluated twice, once at the current and once at the proposed state, due to its appearance in both the numerator and denominator of the acceptance ratio (3). This implies twice the number of log-target differentiations and matrix inversions to compute the required metrics and their inverses. Nevertheless, the scaling factor of two can be omitted in big-O bounds, since the numerics associated with the current state are known from the previous iteration, with a single exception.

In summary, SMMALA and MMALA require the Cholesky factorization to sample from the normal proposal and the matrix inverse to evaluate the normal proposal. Consequently, the cost of linear algebra computations associated with the normal proposal density (2) is cubic in the dimension for each of these two samplers.

The Coppersmith-Winograd algorithm has been optimized to perform matrix multiplication, and therefore matrix inversion, in sub-cubic time, see dav_sto__imp , wil__brea and leg__pow . Hence, optimized implementations of SMMALA and MMALA can evaluate and sample from their proposal densities in sub-cubic time given the log-target and its derivatives.

MALA, as opposed to SMMALA and MMALA, relies on a constant preconditioning matrix, so its inverse and Cholesky factor are evaluated once and cached at the beginning of the simulation, avoiding the cubic penalty at every iteration. Since these quantities are cached upon initializing MALA, the complexity of sampling and evaluating the normal proposal density of MALA is capped by the quadratic form appearing in the proposal density, a quadratic-cost operation. If the preconditioning matrix is set to be the identity matrix, then the quadratic form simplifies to an inner product and the complexity of linear algebra calculations associated with the MALA proposal density further reduces to linear order.

3.3 Differentiation versus linear algebra costs

Adding up the differentiation and linear algebra costs yields the order of complexity for Langevin Monte Carlo. Hence, it follows that MALA, SMMALA and MMALA run in , and time, respectively.

For simple models, the order of complexity of the log-target is not greater than that of the proposal, so it is assumed that the log-target scales as . Thus, for simple models, the respective complexity bounds for MALA, SMMALA and MMALA reduce to , and after dropping scaling factors and lower-order terms. The complexity of the typical MALA with identity preconditioning matrix further reduces to for a simple model.

On the other hand, computationally expensive models are characterized by a log-target cost that dominates the linear algebra cost. For such models, the cost of computations involving the log-target is much higher than the cost of proposal-related calculations. In other words, if the log-target is of high complexity, then derivative calculations dominate linear algebra calculations, which is why the computational cost of manifold MCMC algorithms tends to be reported as a function of the order of derivatives appearing in the algorithm. For instance, the complexity of SMMALA can then be reported simply in terms of its derivative-related cost for a computationally intensive model.

An example of a computationally expensive model is a system of non-linear ordinary differential equations (ODEs), where each log-target calculation requires solving the ODE system numerically. It is then expected that the log-target and its derivative evaluations will dominate the cost of Langevin Monte Carlo simulations.

3.4 Complexity bounds for adaptive Metropolis

From this point forward, the term adaptive Metropolis (AM) will refer to the AM algorithm of rob_ros__exa with mixture proposal (13), as interest is in targets of unbounded support. AM does not evaluate any target-related derivatives. In lieu of differentiation costs, the target-specific complexity of AM is of order .

The components of mixture density (13) are centred at the current state, and the empirical covariance of the adaptive component is computed recursively. Thus, fully specifying the AM proposal density is computationally trivial given the chain history.

Sampling from and evaluating the fully specified normal mixture (13) of AM incurs the typical linear algebra computational costs encountered in Langevin Monte Carlo, namely a Cholesky decomposition and an inversion of the empirical covariance matrix . So, linear algebra manipulations of the AM proposal amount to a complexity of order .

The recursive formula (11) allows the Cholesky factorization of the empirical covariance to be replaced by two rank-one updates and one rank-one downdate, thus reducing the Cholesky runtime bound of AM from cubic to quadratic in the dimension. gill_gol_wal__met and see__low elaborate on low-rank updates for Cholesky decompositions.
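
As an illustration, the following Julia snippet shows how a Cholesky factorization can be maintained under rank-one modifications using the low-rank update routines of the LinearAlgebra standard library; the matrix Σ and the vectors u, v and w are placeholders standing in for the empirical covariance and for the rank-one terms implied by (11).

using LinearAlgebra

n = 5
A = randn(n, n)
Σ = Symmetric(A * A' + n * I)                  # a positive definite stand-in for the empirical covariance
u, v, w = randn(n), randn(n), 0.1 * randn(n)   # placeholder rank-one directions

C = cholesky(Σ)        # cubic-cost factorization, done once
lowrankupdate!(C, u)   # C now factorizes Σ + uu', at quadratic cost
lowrankupdate!(C, v)   # second rank-one update
lowrankdowndate!(C, w) # quadratic-cost downdate; requires the result to remain positive definite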

In total, the computational cost of AM sums up to . It reduces to if optimized algorithms are chosen to invert and if low rank updates are used for factorizing .

For simple targets scaling as , AM takes up to operations, so it is more costly than MALA and cheaper than SMMALA and MMALA. For expensive targets with complexity , AM runs in time, so it is cheaper than MALA, SMMALA and MMALA.

3.5 Summary of complexity bounds

Table 1 summarizes the complexity bounds per step of geometric Langevin Monte Carlo and of adaptive Metropolis for simple and for expensive targets after dropping lower-order terms. Simple and expensive targets are assumed to have respective complexities and .

Method Target
Simple Expensive
MALA
SMMALA
MMALA
AM
Table 1: Complexity bounds per step of MALA, SMMALA, MMALA and AM samplers for simple and expensive targets after dropping lower-order terms. Simple and expensive targets have complexities and , respectively. The complexity of AM for simple targets assumes usage of efficient linear algebra algorithms.

For simple targets, MALA has a lower order of complexity than AM, which in turn has a lower order of complexity than SMMALA and MMALA. For expensive targets, MALA, SMMALA, MMALA and AM share the same order of complexity, albeit with different scaling factors. These scaling factors are negligible for very expensive targets, but they affect the total computational cost for a range of targets of modest to high complexity.

4 Geometric adaptive Monte Carlo

Manifold Langevin Monte Carlo pays a higher computational price than adaptive Metropolis to achieve increased effective sample size via geometric exploitation of the target. To get the best of both worlds, the goal is to construct a Monte Carlo sampler that attains fast mixing per step at a reduced cost per step. Along these lines, the present paper introduces a hybrid sampling method that switches between expensive geometric Langevin Monte Carlo updates and cheap adaptive Metropolis updates.

4.1 Sampling in random environment

The sampler will be defined as a discrete-time stochastic process in an IID random environment. The environment is a sequence of independent random variables, the k-th of which admits a Bernoulli distribution with probability p_k.

Let be the last time before iteration that the geometric kernel was used, defined as a stopping time. The sequence of stopping times induces a sequence of random proposal kernels switching between adaptive proposal kernels and the geometric proposal kernel. GAMC provides a general Monte Carlo sampling scheme, which is instantiated depending on the choice of these kernels.

It is noted that the dimension of the first argument in the definition of random kernel varies between iterations due to the random stopping time .

For every , kal__ran ensures that the Radon-Nikodym derivative of random measure exists almost surely.

Equation (1) is linked to the Radon-Nikodym derivative of random measure by setting , , , and to be the Lebesgue measure.

Using the proposal density , the Metropolis-Hastings acceptance probability at the -th iteration of GAMC is set to

if , and otherwise.

The process can be constructed from the kernels by extending the Ionescu-Tulcea theorem (see nev__the ) to processes in random environment.

A framework for generating chains via random proposal kernels is discussed in rob_ros__cou . Non-Markovian chains in random environment, such as the process constructed via , have received less attention than adaptive Monte Carlo methods in the literature.

4.2 Algorithmic formulation

Algorithm 1 provides a pseudocode representation of the proposed GAMC sampler. The sequence of probabilities is deterministic. Section 4.4 provides a condition on that ensures convergence of the GAMC sampler.

for  to  do : number of iterations
     Sample
     
     if  then Use adaptive kernel
         
     else if  then Use geometric kernel
         
     end if
     Sample Uniform density
     Sample
     
     if   then
         
     else
         
     end if
end for
Algorithm 1 GAMC
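
A schematic Julia rendering of the loop in algorithm 1 is given below; schedule, geometric_step, adaptive_step and metric are placeholder functions standing for the probability sequence of section 4.4, the SMMALA and AM updates, and the position-specific metric, so the sketch conveys only the control flow of GAMC rather than a complete implementation.

using LinearAlgebra

function gamc(θ0, n_iter, schedule, geometric_step, adaptive_step, metric)
    θ = copy(θ0)
    Σ = inv(metric(θ))               # geometric information initializes the proposal covariance
    chain = [copy(θ)]
    for k in 1:n_iter
        if rand() < schedule(k)      # Bernoulli draw selecting the kernel at iteration k
            θ = geometric_step(θ)
            Σ = inv(metric(θ))       # a geometric step re-initializes the empirical covariance
        else
            θ, Σ = adaptive_step(θ, Σ, chain)   # AM step; Σ is updated recursively as in (11)
        end
        push!(chain, copy(θ))
    end
    return chain
end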

At its k-th iteration, the GAMC sampler uses either an AM proposal kernel, dependent on the past states as determined by the stopping time, or an LMC proposal kernel, dependent only on the current state.

Algorithm 1 demonstrates that the proposal covariance is based on the position-specific metric whenever possible and falls back to the empirical covariance otherwise. Thus, the metric-based covariance initializes the empirical covariance, and the latter is recursively updated via (11) until the next geometric update re-initializes it.

4.3 Convergence properties

This section establishes the convergence properties of GAMC. Recall that p_k is the probability of picking the geometric kernel at the k-th iteration of GAMC.

Proposition 1

If the sum of the probabilities p_k over all iterations is finite, then the convergence properties of GAMC are solely determined by the convergence properties of its AM counterpart.

Proof

Due to the Borel-Cantelli lemma, the assumption implies that the GAMC proposal kernel is set to the geometric proposal kernel only a finite number of times almost surely. Hence, if the AM algorithm based on the adaptive proposal kernel of GAMC is ergodic or satisfies the weak law of large numbers, then so does the corresponding GAMC sampler almost surely.

Corollary 1

If the sum of the probabilities p_k over all iterations is finite and the adaptive proposal kernel of GAMC is specified via the mixture proposal density (13), then GAMC satisfies the weak law of large numbers.

Proof

Using the notation of section 2, set the state space to be Euclidean, equipped with the Borel σ-algebra. Let be a possibly unnormalized target density and the associated target distribution defined with respect to the Lebesgue measure.

The AM algorithm of rob_ros__exa , as defined by (13) and (14), satisfies the weak law of large numbers. Hence, according to proposition 1, any chain generated by GAMC also satisfies the weak law of large numbers, in the sense that ergodic averages of any bounded function converge in probability to the corresponding target expectation.

For AM kernels satisfying a set of different conditions (see sak_vih__erg ; bai_rob_ros__ont ; haa_sak_tam__ana ), AM and consequently GAMC are ergodic.

4.4 Choice of schedule for geometric steps

A design decision to make is how to set the sequence of probabilities of choosing geometric over adaptive steps. This choice affects both the convergence properties and the computational complexity of GAMC.

One possibility is to make the frequency of geometric steps more pronounced in early transient phases of the chain and let the computationally cheaper adaptive kernel take over asymptotically in late stationary phases. This possibility is constrained by the requirement of convergence, which in turn can be fulfilled by the condition of proposition 1.

An example of a sequence of probabilities that conforms to these practical guidelines and convergence requirements is

(15)

where is a positive-valued tuning parameter. Larger values of in (15) yield faster reduction in the probability of using the geometric kernel.

The probabilities p_k of GAMC play a role analogous to that of temperature in simulated annealing. Thereby, the sequence of probabilities can be thought of as a schedule for regulating the choice of proposal kernel. There is a rich literature on cooling schedules for simulated annealing (kir_gel_vec__opt ; haj__coo ; loc__sim ; nou_and__aco ; mar_sie__aco ), some of which can be employed as such a schedule.

In this paper, GAMC is equipped with the exponential schedule (15). Under schedule (15), GAMC and AM share similar convergence properties and complexity bounds asymptotically. Yet GAMC has faster mixing per step than AM due to exploitation of local geometric information in early phases of the chain. The tuning parameter in (15) regulates the frequency of geometric steps and therefore the ratio of mixing per step and computational cost per step.

4.5 Expected complexity

The concept of complexity carries three distinct meanings in the context of MCMC. Firstly, MCMC samplers need to be tuned so as to achieve a balance between proposing large enough jumps and ensuring that a reasonable proportion of jumps are accepted. By way of illustration, MALA attains its optimal acceptance rate of 0.574 in the high-dimensional limit by setting its drift step appropriately. Because of this, it is said that the algorithmic efficiency of MALA scales as the number of parameters increases.

Secondly, the quality of MCMC methods depends on their rate of mixing per step. Along these lines, the effective sample size (ESS) is used for quantifying the mixing properties of an MCMC method. The ESS of a chain is interpreted as the number of independent samples that would bear the same amount of variance as the chain.

A third criterion for assessing MCMC algorithms is their computational cost per step. This criterion corresponds to the ordinary concept of algorithmic complexity, as it entails a count of numerical operations performed by an MCMC algorithm. To give an example, the computational complexity of MALA with an identity preconditioning matrix for a simple model is of order , as explained in section 3.3.

Of these three indicators of complexity, ESS and computational runtime are the ones typically used for understanding the applicability of MCMC methods. To get a single-number summary, the ratio of ESS over runtime is usually employed.

The present section states the expected complexity per step of GAMC given the selected length of simulation, while section 5 provides an empirical assessment of GAMC via its ESS and CPU runtime.

Proposition 2

Denote by and the computational complexities per geometric and adaptive Monte Carlo step of GAMC, respectively. The expected complexity per step of GAMC for generating an -length chain is

(16)
Proof

The expected number of geometric steps equals the sum of the probabilities p_k over the iterations of the chain, whence the conclusion follows directly.

Corollary 2

If the exponential schedule (15) is used for regulating the choice of proposal kernel, then the expected complexity per step of GAMC for generating an -length chain can be expressed as

(17)
Proof

Under the exponential schedule (15), observe that

whence (16) yields (17).

Corollary 3

As the number of iterations gets large (), the expected complexity per step of GAMC under the exponential schedule (15) reduces to the complexity of its AM counterpart.

Proof

Since the sum of the probabilities p_k is finite under the exponential schedule (15), their average over the chain length tends to zero, and the bound (17) diminishes asymptotically to the complexity of the adaptive step.

As an example, consider the GAMC sampler with AM proposal kernels induced by (13) and SMMALA proposal kernel as induced by (2), (9) and (7). For such a configuration of GAMC, as seen from table 1, expensive targets with complexity are associated with complexities and in (17). So, the expected complexity per step of GAMC for generating an -length chain is

(18)

for expensive targets, which is bounded below by the AM complexity of and above by the SMMALA complexity of . For instance, setting and in (18) yields an expected complexity per step of GAMC equal to . For a large number of iterations, the expected complexity per step of GAMC in (18) tends to the lower bound of AM complexity (see corollary 3).

More generally, the convergence properties and computational complexity of GAMC are determined asymptotically by the AM proposal kernel used in GAMC. Despite the shared asymptotic properties of GAMC and AM, the SMMALA steps in early transient phases of GAMC provide an improvement in mixing over AM. For example, the settings used in (18) produce an expected number of SMMALA steps that is a potentially sufficient perturbation in early stages of parameter space exploration so as to move to target modes of higher probability mass.
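
As a small numerical illustration, the expected number of geometric steps can be computed directly from the schedule; the concrete form exp(-λk) used below is an assumption standing in for (15), with λ playing the role of the positive tuning parameter.

expected_geometric_steps(λ, n) = sum(exp(-λ * k) for k in 1:n)

# For instance, expected_geometric_steps(0.01, 10_000) ≈ 99.5, so roughly one hundred
# SMMALA steps would be expected, concentrated in the early iterations of the chain.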

4.6 Analytically intractable geometric steps

In practice, challenges in the implementation of manifold MCMC algorithms might raise additional computational implications. In particular, two notoriously recurring issues relate to the Cholesky decomposition of the metric and to the calculation of up to third-order derivatives of the log-target.

Various factors, such as finite-precision floating point arithmetic, can lead to an indefinite proposal covariance matrix, which in turn breaks the Cholesky factorization. Several research avenues have introduced alternative positive definite approximations of indefinite matrices (hig__com01 ; hig__com02 ; hig_stra__and ) and approximate Riemann manifold metric choices (bet__age ; hou__hes ; kle__ada ), which offer proxies for an indefinite covariance matrix.

Non-trivial models can render the analytic derivation of log-target derivatives impossible or impractical. Automatic differentiation (AD), a computationally driven research activity that has evolved since the mid-1950s, helps compute derivatives in a numerically exact way. Indeed, gri__ona has shown that AD is backward stable in the sense of wil__mod . Thus, small perturbations of the original function due to machine precision still yield accurate derivatives calculated via AD.

There are different methods of automatic differentiation, which mainly differ in the way they traverse the chain rule; reverse-mode AD is better suited for functions with many inputs and few outputs, in contrast to forward-mode AD, which is more suitable for functions with few inputs and many outputs (gri_wal__eva ). Consequently, reverse-mode AD is utilized for computing derivatives of probability densities, and finds use in statistical inference. The complexity of reverse-mode AD is no worse than that of the respective analytical derivatives of a target density, but it poses high memory requirements. Hybrid AD procedures combining elements of forward and backward propagation of derivatives can be constructed to achieve a compromise between execution time and memory usage when differentiating functions with both many inputs and many outputs.
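
By way of example, the gradient and the negative Hessian of a log-target can be obtained with ForwardDiff as sketched below; the logtarget function shown is an illustrative stand-in for a model's log-density rather than any model used in this paper.

using ForwardDiff

logtarget(θ) = -0.5 * sum(abs2, θ)                       # illustrative log-density (standard normal, up to a constant)

gradlogtarget(θ) = ForwardDiff.gradient(logtarget, θ)    # first-order information, as needed by MALA
metric(θ) = -ForwardDiff.hessian(logtarget, θ)           # negative Hessian, i.e. the observed Fisher information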

5 Simulation study

In this simulation study, GAMC uses the exponential schedule (15) to switch randomly between the AM kernel specified via mixture (13) and the SMMALA kernel. GAMC is compared empirically against its AM and SMMALA counterparts, as well as against MALA, in terms of mixing and cost per step via three examples. The examples revolve around a multivariate t-distribution with correlated coordinates, and two planetary systems, one with a single planet and one with two planets.

Ten chains are generated by each sampler for each example. A fixed number of iterations is run for the realization of each chain, of which the first portion is discarded as burn-in, and the remaining samples per chain are retained in subsequent descriptive statistics.

To assess the quality of mixing of a sampler, the ESS of each chain generated by the sampler is computed. The ESS of a coordinate of the parameter vector is defined in terms of the estimated ordinary variance and the estimated Monte Carlo variance of the chain associated with that coordinate. The initial monotone sequence estimator of gey__pra is used for calculating the Monte Carlo variance.

To assess the computational cost of a sampler, the CPU runtime of each chain generated by the sampler is recorded. The ESS per parameter coordinate and CPU runtime are reported by taking their respective means across the set of ten simulated chains.

The computational efficiency of a sampler is defined as the ratio of the minimum ESS among all parameter coordinates over the CPU runtime. Finally, the speed-up of a sampler relative to MALA is set to be the ratio of the sampler's efficiency over the efficiency of MALA.
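
In code, these two summaries amount to the following one-liners, where ess is a vector of per-coordinate effective sample sizes and t the corresponding CPU runtime in seconds; the argument names are illustrative.

efficiency(ess, t) = minimum(ess) / t                                      # min ESS over runtime
speedup(ess, t, ess_mala, t_mala) = efficiency(ess, t) / efficiency(ess_mala, t_mala)   # relative to MALA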

The hyperparameter values in (13) and in (15) are used across all simulations, as the result of empirical tuning. On the other hand, the remaining hyperparameter in (13) is set via empirical tuning in the burn-in phase of each chain separately.

Automatic differentiation and the SoftAbs approximation of the metric (bet__age ) are used in all three examples.

The associated numerics and visualizations for the three examples are available in table 2 and appendix A, respectively. Table 2 gathers the ESS, runtime, efficiency and speed-up summaries, as these arise after averaging across the ten simulated chains per sampler. Figures 1 and 2 of appendix A visualize the running mean, autocorrelation and trace of one specimen chain per sampler out of the ten simulated chains.

Student’s t-distribution
Method AR ESS t ESS/t Speed
min mean median max
MALA 0.59 135 159 145 234 9.33 14.52 1.00
AM 0.03 85 118 117 155 17.01 5.03 0.35
SMMALA 0.71 74 87 86 96 143.63 0.52 0.04
GAMC 0.26 1471 1558 1560 1629 31.81 46.23 3.18
One-planet system
Method AR ESS t ESS/t Speed
min mean median max
MALA 0.55 4 76 18 394 57.03 0.07 1.00
AM 0.08 1230 1397 1279 2035 48.84 25.18 378.50
SMMALA 0.71 464 597 646 658 208.46 2.23 33.45
GAMC 0.30 1260 2113 2151 3032 76.80 16.41 246.59
Two-planet system
Method AR ESS t ESS/t Speed
min mean median max
MALA 0.59 5 52 10 377 219.31 0.02 1.00
AM 0.01 18 84 82 248 81.24 0.22 9.05
SMMALA 0.70 53 104 100 161 1606.92 0.03 1.37
GAMC 0.32 210 561 486 1110 328.08 0.64 26.39
Table 2: Comparison of sampling efficacy between MALA, AM, SMMALA and GAMC for the t-distribution, one-planet and two-planet system. AR: acceptance rate; ESS: effective sample size; t: CPU runtime in seconds; ESS/t: smallest ESS across model parameters divided by runtime; Speed: ratio of ESS/t for each sampler over ESS/t for MALA. All tabulated numbers have been rounded to the second decimal place, apart from effective sample sizes, which have been rounded to the nearest integer. The minimum, mean, median and maximum ESS across the effective sample sizes of the twenty, six and eleven parameters (associated with the respective t-distribution, one-planet and two-planet system) are displayed.

A package, called GAMCSampler, implements GAMC using the Julia programming language. GAMCSampler is based on Klara, a package for MCMC inference written in Julia by one of the three authors. GAMCSampler is open-source software available at

https://github.com/scidom/GAMCSampler.jl

along with the three examples of this paper. The packages ForwardDiff (rev_lub_pap__for ) and ReverseDiff, which are also written in Julia, provide forward and reverse-mode automatic differentiation functionality. Among these two AD packages, ForwardDiff has been put into practice in the simulations due to being more mature and more optimized than ReverseDiff.

5.1 Multivariate t-distribution

Monte Carlo samples are drawn from an -dimensional Student- target with degrees of freedom and covariance matrix

(19)

for some constant that determines the level of correlation between parameter coordinates. The elements of each diagonal of the covariance matrix are equal, with the value depending on the offset from the main diagonal. The scale matrix of the t-distribution is a rescaling of (19) chosen so that the covariance matrix of the t-distribution equals (19).
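
A hedged Julia sketch of a covariance matrix with this diagonal structure follows; the specific geometric decay in the constant ρ is an assumption made purely for illustration, since (19) is not reproduced above.

covmatrix(n, ρ) = [ρ^abs(i - j) for i in 1:n, j in 1:n]   # diagonals share a value decaying with the offset |i-j|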

In this example, the setting is not Bayesian, so there is no prior distribution involved (pap_mir_gir__mon ). Instead, MCMC sampling acts as a random number generator to simulate from a -distribution . The simulated chains are randomly initialized away from the zero mode of the -distribution, and they are expected to converge to zero. In other words, the zero mode of is seen as the parameter vector to be estimated.

The dimension of the t-target is set to twenty, relatively high correlation is induced by the choice of constant in (19), and some amount of probability mass is maintained in the tails of the t-distribution by the choice of degrees of freedom. The present example does not reach the realm of a fully-fledged application, especially in terms of log-target complexity, yet it gives a first indication of some common computational costs appearing in more realistic applications, including automatic differentiation and SoftAbs metric evaluations.

Figure (a) displays the running means of four chains that correspond to the seventeenth coordinate of the twenty-dimensional parameter, with a single chain generated by each of MALA, AM, SMMALA and GAMC. The running mean of the chain simulated using MALA does not appear to converge rapidly to the true mode of zero. This accords with theoretical knowledge. In particular, rob_twe__exp and liv_gir__inf have shown that if a target density has tails heavier than exponential or lighter than Gaussian, then a MALA proposal kernel does not yield a geometrically ergodic Markov chain. Furthermore, it can be seen that the chain generated by GAMC converges faster than the chains produced by AM, MALA and SMMALA.

Table 2 reports the minimum, mean, median and maximum ESS of the parameter coordinates. As seen from table 2, GAMC achieves roughly ten times larger ESS in comparison to AM, MALA and SMMALA for the t-distribution example. Figures (b), (a), (c), (e) and (g) show the autocorrelation and trace plots of the four chains whose running means are presented in figure (a). Figure (b) demonstrates that GAMC has the lowest autocorrelation among the four compared samplers. The trace plots provide further circumstantial evidence of the faster mixing of GAMC for the Student-t target. The mixing properties of GAMC in the case of the t-distribution come as a surprise, since GAMC was designed to reduce the cost paid for faster mixing rather than to achieve the fastest possible mixing in absolute terms. GAMC has shorter CPU runtime than SMMALA, but longer runtime than MALA and AM.

With a speed-up of 3.18, GAMC is about three times more efficient than MALA and orders of magnitude more efficient than AM and SMMALA for the Student-t target of this example.

5.2 Radial velocity of a star in planetary systems

The study of exoplanets has emerged as an important area of modern astronomy. While astronomers utilize a variety of different methods for detecting and characterizing the properties of exoplanets and their orbits, each of the prolific methods to date shares several characteristics. First, translating astronomical observations into planet physical and orbital properties requires significant statistical analysis. Second, characterizing planetary systems with multiple planets requires working with high-dimensional parameter spaces. Third, the posterior probability densities are often complex with correlated parameters, non-linear correlations or multiple posterior modes. MCMC has proven invaluable for providing accurate estimates of planet properties and orbital parameters and is now used widely in the field.

For analyzing simple data sets, such as one planet detected at a high signal-to-noise ratio, the random walk Metropolis-Hastings sampler is effective and the choice of MCMC sampling algorithm is unlikely to be important (for__qua ). For analyzing more complex data sets, such as a star with several planets, more care is necessary to avoid poor mixing of the Markov chains. One approach is “artisanal” MCMC, where proposal densities are hand-crafted for a particular problem by making use of physical intuition and validation on simulated data sets (for__im ). However, it is desirable to identify more sophisticated algorithms that can be efficient with minimal tuning or human intervention. Here, GAMC is applied to simulated radial velocity planet search data sets so as to illustrate the potential of the sampler for future astronomical or other scientific applications.

One prolific method for characterizing the orbits of extrasolar planets is the radial velocity method. Astronomers make a series of precise measurements of the line-of-sight velocity of a target star. The velocity of the star changes with time due to the gravitational tug of any planets orbiting it. A basic radial velocity data set consists of a list of observation times , , and measured velocities .

The observed velocity of the star at time is modelled as the unknown velocity plus some measurement error , as seen in (20). For many planetary systems with planets, the stellar line-of-sight velocity can typically be well approximated by (21). Independent Gaussian measurement errors with variances are assumed according to (22). Expressions (20), (21) and (22) introduce the following model for the radial velocity of a star in a planetary system consisting of planets:

(20)
(21)
(22)

In equation (21), is the systemic line-of-sight velocity of the planetary system, is the velocity amplitude induced by the -th planet, is the orbital period of the -th planet, is the orbital eccentricity of the -th planet, is the mean anomaly at time of the -th planet, is the argument of pericenter of the -th planet, and is the true anomaly at time of the -th planet.

The true anomaly is an angle that specifies the location of the planet and star along their orbit at a given time. The true anomaly is related to the eccentric anomaly by a standard trigonometric transformation. The eccentric anomaly E can be calculated from the mean anomaly M and the orbital eccentricity e via Kepler's equation, M = E − e sin(E), solved iteratively. The mean anomaly increases at a linear rate with time.
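
As an illustration of the iterative solver, a minimal Julia implementation of Newton's method for Kepler's equation is given below; the tolerance and iteration cap are illustrative choices, not values taken from the paper's code.

function eccentric_anomaly(M, e; tol = 1e-12, maxiter = 50)
    E = M                                                  # starting guess, adequate for small eccentricity
    for _ in 1:maxiter
        ΔE = (E - e * sin(E) - M) / (1 - e * cos(E))       # Newton step for E - e*sin(E) - M = 0
        E -= ΔE
        abs(ΔE) < tol && break
    end
    return E
end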

A parameter vector of length five is associated with each planet, as seen from (21). Thus, a planetary system with multiple planets involves five model parameters per planet plus the systemic velocity. A shorthand notation can be used to indicate that the stellar line-of-sight velocity (21) depends on these parameters.

According to (22), the sum of squares of the normalized measurement errors follows a chi-squared distribution whose number of degrees of freedom equals the number of observations,

(23)

The log-likelihood arises from (23) and (20) as

(24)

where , , . It is assumed that the measurement uncertainties are known, so , and make up the available data.

For the relatively simple model described by (24) and (21), astronomers commonly use a set of priors elicited during a 2013 SAMSI program on astrostatistics. Modified Jeffreys priors are adopted for the velocity amplitudes and orbital periods. Uniform priors are employed for the orbital eccentricities, velocity offsets and angle offsets.

GAMC is benchmarked on two simulated data sets, of which one consists of a single planet and the other comprises two planets. In each case, observed velocities