Hypocoercivity of Piecewise Deterministic Markov Process-Monte Carlo

In this paper we derive spectral gap estimates for several Piecewise Deterministic Markov Processes, namely the Randomized Hamiltonian Monte Carlo, the Zig-Zag process and the Bouncy Particle Sampler. The hypocoercivity technique we use, presented in (Dolbeault et al., 2015), produces estimates with explicit dependence on the parameters of the dynamics. Moreover, the general framework we consider allows us to compare quantitatively the bounds obtained for the different methods.



1 Introduction

Consider a probability distribution

defined on the Borel -field of some domain or where . Assume that has a density with respect to the Lebesgue measure, also denoted , of the form , where is a continuously differentiable function and is referred to as the potential associated with

. Sampling from such distributions is of interest in computational statistical mechanics and in Bayesian statistics and allows one, for example, to compute efficiently expectations of functions

with respect to

by invoking empirical process limit theorems, e.g. the law of large numbers. In practical set-ups, sampling exactly from

directly is either impossible or computationally prohibitive. A standard and versatile approach to sampling from such distributions consists of using Markov Chain Monte Carlo (MCMC) techniques

[30, 41, 54], where the ability to simulate realizations of ergodic Markov chains leaving the target invariant is exploited. Markov Process Monte Carlo (MPMC) methods are the continuous-time counterparts of MCMC, but their exact implementation is most often impossible on computers and requires additional approximation, such as time discretization of the process in the case of the Langevin diffusion. A notable exception, which has recently attracted significant attention, is the class of MPMC relying on Piecewise Deterministic Markov Processes (PDMP) [17], which, in addition to being simpler to simulate than earlier MPMC methods, are nonreversible, offering the promise of better performance. We now briefly introduce a class of processes covering existing algorithms. The generic mathematical notation we use in the introduction is fairly standard and fully defined at the end of the section.

Known PDMP Monte Carlo methods rely on the use of the auxiliary variable trick, that is the introduction of an instrumental variable and probability distribution defined on an extended domain, of which is a marginal distribution, which may facilitate simulation. In the present set-up, one introduces the velocity variable associated with a probability distribution defined on the -field of , where the subset is assumed to be closed. Standard choices for

include the centered normal distribution with covariance matrix

, where is the

-dimensional identity matrix, the uniform distribution on the unit sphere

, or the uniform distribution on . Let and define the probability measure . The aim is now to sample from the probability distribution .
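The three standard choices for the velocity distribution listed above are all straightforward to sample. A minimal sketch (Python/NumPy; the function names and the choice of dimension are ours, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3  # dimension (illustrative)

def sample_gaussian(rng, d):
    # Centered normal distribution with identity covariance
    return rng.standard_normal(d)

def sample_sphere(rng, d):
    # Uniform on the unit sphere: normalize a Gaussian draw
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)

def sample_hypercube(rng, d):
    # Uniform on {-1, +1}^d, the velocity space of the Zig-Zag process
    return rng.choice([-1.0, 1.0], size=d)
```

Each sampler returns one velocity; drawing a fresh velocity at refreshment events is then a single call.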

We denote by the set of bounded functions of . The PDMP Monte Carlo algorithms we are aware of fall in a class of processes associated with generators of the form, for and ,

(1)

where , for , , and for are operators we specify below, and

(2)

which is assumed to be finite. For any , will be referred to as a jump rate and as the refreshment rate.

In the case where and

is the zero-mean Gaussian distribution on

with covariance matrix , we also consider generators of the form, for any and ,

(3)

where .

For any , the jump operators

we consider are associated with continuous vector fields

of the form, for any and ,

(4)

These operators correspond to reflections of the velocity through the hyperplanes orthogonal to

at the event position , i.e. a flip of the component of the velocity in the direction given by inducing an elastic “bounce” of the position trajectory with the hyperplane. As we shall see, the vector fields are tied to the potential by the relation , required to ensure that

is left invariant by the associated semi-group. Informally, assuming for the moment that

and for some , the corresponding process follows the solution of Hamilton’s equations for a random time of distribution governed by an inhomogeneous Poisson process with rate . When an event occurs and the current state of the process is , one chooses between the possible updates of the state available, with probability proportional to , with the particularity here that the position is left unchanged.

The vector fields and jump rates are linked by the relations for and , together with other conditions, required to ensure that is an invariant distribution of the associated semi-group. A standard choice, sometimes referred to as canonical, consists of choosing jump rates for and .
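To make the canonical choice concrete: for the Bouncy Particle Sampler the canonical rate is the positive part of the directional derivative of the potential along the velocity, the bounce reflects the velocity through the hyperplane orthogonal to the gradient, and for the Zig-Zag process there is one such rate per coordinate. A minimal sketch (Python/NumPy; the quadratic potential and all names are illustrative assumptions, not the paper's code):

```python
import numpy as np

def grad_U(x):
    # Illustrative quadratic potential U(x) = |x|^2 / 2, so grad U(x) = x
    return x

def bps_rate(x, v):
    # Canonical BPS jump rate: positive part of <v, grad U(x)>
    return max(0.0, float(v @ grad_U(x)))

def zz_rates(x, v):
    # Canonical Zig-Zag rates: one rate per coordinate, (v_i * d_i U(x))_+
    return np.maximum(0.0, v * grad_U(x))

def bounce(v, n):
    # Reflect v through the hyperplane orthogonal to n (elastic bounce)
    return v - 2.0 * (v @ n) / (n @ n) * n
```

The reflection reverses the component of the velocity along the gradient while preserving the speed, which is what makes the bounce compatible with invariance of the target.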

Denote by the set of measurable functions such that . We let be the norm induced by the scalar product

(5)

making a Hilbert space.

The operator will be referred to as the refreshment operator, a standard example of which is where is the following orthogonal projector in : for any ,

(6)

in which case the velocity is drawn afresh from the marginal invariant distribution, while the position is left unchanged. In this scenario the informal description of the process given above carries over with added to the rate , and an additional possible update to the velocity chosen with probability proportional to . Another possible choice is the generator of an Ornstein-Uhlenbeck operator leaving invariant.

Throughout the paper we assume the following condition to hold for either or , a condition satisfied by the examples covered in this manuscript.

A 1.
  1. The operator is closed in , generates a strongly continuous contraction semi-group on , i.e. , for any , , for any , and .

  2. is a stationary measure for , i.e. for any , .

  3. There exists a core for such that is dense in and , where is the adjoint of on .

Note that if generates a strongly continuous contraction semi-group then is dense by [27, Theorem 2.12] and the adjoint of on is therefore well-defined and closed by [49, Theorem 5.1.5], and is dense.

We now describe how various choices of and lead to known algorithms. For simplicity of exposition, we assume for the moment that , is the zero-mean Gaussian distribution with covariance matrix and , but as we shall see later our results cover more general scenarios.

  • The particular choice and corresponds to the procedure described in [23] as a motivation for the popular hybrid Monte Carlo method. This process is also known as the linear Boltzmann/kinetic equation in the statistical physics literature [5] or randomized Hamiltonian Monte Carlo [11]. In this scenario the process follows the isocontours of for random times distributed according to an inhomogeneous Poisson law of parameter , triggering events where the velocity is sampled afresh from .

  • The scenario where , and for , where is the canonical basis, corresponds to the Zig-Zag (ZZ) process [6], where the component of the process follows straight lines in the direction which remains constant between events. In this scenario, the choice of to update the velocity, consists of negating its -th component; see also [29] for related ideas motivated by other applications.

  • The standard Bouncy Particle Sampler (BPS) of [51], extended by [12], corresponds to the choice , and .

  • More elaborate versions of the ZZ and BPS processes, motivated by computational considerations, take advantage of the possibility of decomposing the energy as and correspond to the choice [43, 12], where in the former the sign flip operation is replaced with a component swap.

  • It should be clear that one can consider more general deterministic dynamics with , effectively covering the Hamiltonian Bouncy Particle Sampler, suggested in [55].

  • We remark that the well-known Langevin algorithm corresponds to , and the situation where is the Ornstein-Uhlenbeck process.

More general bounces involving randomization (see [55, 58, 44]) can also be considered in our framework, at the cost of additional complexity and reduced tightness of our bounds.
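To make the Zig-Zag description above concrete, the following sketch simulates a one-dimensional Zig-Zag process for a standard Gaussian target with the canonical rate and no refreshment. The closed-form event times exploit the quadratic potential; the target, the absence of refreshment, and all names are our illustrative choices, not the paper's implementation:

```python
import numpy as np

def zigzag_1d_gaussian(T, rng):
    """Zig-Zag sampler for pi = N(0, 1), i.e. U(x) = x^2 / 2, without refreshment.

    With v in {-1, +1}, the event rate along the flow is (v*x + t)_+, whose
    integral inverts in closed form, so event times are exact (no thinning)."""
    x, v, t = 0.0, 1.0, 0.0
    ts, xs = [t], [x]
    while t < T:
        e = rng.exponential(1.0)            # Exp(1) threshold for the next event
        s = v * x                           # directional derivative v * U'(x)
        tau = -s + np.sqrt(max(s, 0.0) ** 2 + 2.0 * e)  # exact event time
        x += v * tau                        # deterministic straight-line motion
        t += tau
        v = -v                              # flip the velocity at the event
        ts.append(t); xs.append(x)
    return np.array(ts), np.array(xs)
```

Between events the position moves on straight lines with slope ±1, producing the characteristic zig-zag trajectory; the piecewise-linear path (not the event skeleton alone) is what targets the Gaussian.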

The main aim of the present paper is to study the long-time behaviour of the class of processes described above using hypocoercivity methods popularized by [57]. More precisely, considering the semigroup associated with the PDMP generator defined above, we aim to find simple and verifiable conditions on and ensuring the existence of and , and their explicit computation in terms of the characteristics of the problem, such that for any and ,

(7)

Establishing such a result is of interest to practitioners for multiple reasons. Explicit bounds may provide insights into expected performance properties of the algorithm in various situations or regimes. In particular the above leads to an upper bound on the integrated autocorrelation, which is a performance measure of Monte Carlo estimators of , , defined by

(8)

where is a trajectory of a PDMP process of generator with distributed according to . For a class of problems of, say, increasing dimension , weak dependence of and on indicates scalability of the method. It is worth pointing out that the result above is equivalent to the existence of and such that for any measure such that

where the leftmost inequality is standard and a consequence of the Cauchy-Schwarz inequality. Our hypocoercivity result therefore also allows characterization of convergence to equilibrium of PDMPs in various scenarios and regimes, leading in particular to the possibility to compare performance of algorithms started from the same initial distribution. Establishing similar results for different metrics may be a useful complement to our characterization of algorithmic computational complexity and is left for future work.

In [46, 57], convergence of the type (7) is established using an appropriate -norm associated with . The method which was developed in these papers is closely related to hypoellipticity theory [39, 26, 37]

for partial differential equations and in particular the kinetic Fokker-Planck equation. Convergence for linear Boltzmann equations was first derived in

[36, 46]. Since then, several works have extended and completed these results [21, 35, 1, 14, 28, 45].

Notation and conventions

The canonical basis of is denoted by and the -dimensional identity matrix . The Euclidean norm on or is denoted by , and is associated with the usual Frobenius inner product for any in or .

Let be a smooth submanifold of , for . For any , denote by the set of -times differentiable functions from to , stands for the subset of bounded functions in with bounded differentials up to order . and stand for and respectively.

For and , stands for the partial derivative of with respect to the -coordinate, if it exists. Similarly, for , , denote by when exists. For , stands for the gradient of defined for any by . For ease of notation, we also denote by the densely defined closed extension of on , see [40, p. 88]. For any , and , define

We set for ,

(9)

and simply stands for . For any , we let denote the Laplacian of . stands for the identity operator. For two self-adjoint operators and on a Hilbert space equipped with the scalar product and norm , denote by if for all . Then, define with domain, if not specified, . For a bounded operator on , we let . is said to be an orthogonal projection if is a bounded symmetric operator and . An unbounded operator is said to be symmetric (respectively anti-symmetric) if for any , (respectively ). If is densely defined, is said to be self-adjoint if . If in addition is closed, is said to be a core for if the closure of is . Denote by the constant function equal to from a set to . For any unbounded operator , we denote by and . For any probability measure on a measurable space , we denote by the Hilbert space of measurable functions satisfying , equipped with the inner product , and . We will use the same notation for vector and matrix fields or , i.e. and no confusion should be possible. When we replace with in this notation. For any denote by the Dirac distribution at . We define the total variation distance between two probability measures on by . For a square matrix we let be its main diagonal and for a vector we let be the square matrix of diagonal and with zeros elsewhere. For we let denote their minimum. For (), we denote by () the Hadamard product between and defined for any () by (). For any , denotes the Kronecker symbol which is if and otherwise. For any , , we let .

2 Main results and organization of the paper

We now state our main results. In the following, for any densely defined operator we let denote its -adjoint. First we specify conditions imposed on the potential .

H 1.

The potential and satisfies

  1. there exists such that, for any , ;

  2. (10)

From [50, 3], H 1-(b) is equivalent to assuming that satisfies a Poincaré inequality on , that is the existence of such that, for any satisfying ,

(11)

Further, H 1-(b) also implies the existence of and such that for any ,

(12)

H 1-(b) indeed implies that the quantity considered is bounded from below; the scaling in front of will appear natural in the sequel. We have opted for this formulation of the assumption required of the potential to favour intuition and link it to the necessary and sufficient condition for geometric convergence of Langevin diffusions, but our quantitative bounds below will be given in terms of the Poincaré constant for simplicity (see [4, Section 4.2] for quantitative estimates of depending on potentially further conditions on ). H 1-(a) is realistic in most applications, can be checked in practice and has the advantage of leading to simplified developments. It is possible to replace this assumption with and rephrase our results in terms of any finite upper bound of this quantity (see [21, Sections 2 and 3]). Finally, the Poincaré inequality (11) implies by [4, Proposition 4.4.2] that there exists such that

(13)
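For intuition on the Poincaré constant appearing in (11), recall the classical Bakry-Émery criterion (a standard fact, stated here for illustration and not taken from this paper): if the potential is strongly convex with constant $m$, then a Poincaré inequality holds with constant $1/m$,

```latex
\nabla^2 U(x) \succeq m\,\mathrm{I}_d \ \text{for all } x,\ m > 0
\quad\Longrightarrow\quad
\operatorname{Var}_\pi(f) \;\le\; \frac{1}{m} \int \lvert \nabla f \rvert^2 \,\mathrm{d}\pi .
```

In particular, the standard Gaussian potential $U(x) = |x|^2/2$ gives $m = 1$, hence a Poincaré constant equal to $1$.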
H 2.

The family of vector fields satisfies

  1. for , ;

  2. for all , ;

  3. for all there exists such that for all ,

    (14)

This assumption is in particular trivially true for the Zig-Zag and the Bouncy Particle Samplers. In turn we assume the jump rates to be related to the family of vector fields through the following conditions.

H 3.

There exist a continuous function , and satisfying for any ,

(15)

such that for any and , .

We note that the canonical choice satisfies these conditions and that the first condition of (15) is equivalent to , implying that for all and therefore that the left hand side inequality in (15) is automatically satisfied. If we further assume the existence of such that for all , then the second inequality is satisfied with and . As remarked in [2], the first condition of (15) holds for rates based on the choice

such that satisfies for all . The canonical choice corresponds to , but the (smooth) choice is also possible.
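The structural requirement behind such choices is that the function generating the rates satisfies the balance identity $\varphi(s) - \varphi(-s) = s$. A quick numerical check (Python/NumPy) for the canonical positive-part choice and for the softplus function, which we assume here is what is meant by the smooth choice, as is common in the PDMP literature:

```python
import numpy as np

def phi_canonical(s):
    # Canonical choice: positive part, phi(s) = max(0, s)
    return np.maximum(0.0, s)

def phi_smooth(s):
    # Softplus, phi(s) = log(1 + e^s) -- our assumption for the smooth choice
    return np.log1p(np.exp(s))

# Both satisfy the balance identity phi(s) - phi(-s) = s
s = np.linspace(-5.0, 5.0, 101)
assert np.allclose(phi_canonical(s) - phi_canonical(-s), s)
assert np.allclose(phi_smooth(s) - phi_smooth(-s), s)
```

For the softplus the identity is exact: $\log(1+e^s) - \log(1+e^{-s}) = \log\big(e^s(1+e^{-s})/(1+e^{-s})\big) = s$.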

H 4.

Assume that and satisfy the following conditions.

  1. is stable under bounces, i.e. for all and , , where is defined by (4).

  2. For any , , we have , for any .

  3. For any bounded and measurable function , such that , ;

  4. has finite fourth order marginal moment

    (16)

    and for any such that , .

Note that in the case where and are rotation invariant, i.e. for any rotation on , and for any , , then H 4-(a)-(b)-(c) are automatically satisfied.

By H 4-(c), we have taking for any and therefore for any such that , . In addition, under H 4-(d), from the Cauchy-Schwarz inequality, we obtain that

(17)

and note that in the Gaussian case we have the relation . Finally, under H 4, for any and , , that is is symmetric on .

In this paper we consider operators on satisfying the following conditions.

H 5.
  1. Functions depending only on the position belong to the kernel of : and for any , ;

  2. satisfies the detailed balance condition: and ;

  3. admits a spectral gap of size on : for any , ; in addition, for any , it holds for any , and .

Typically, is of the form where is a self-adjoint operator on with spectral gap equal to . Then, condition H 5-(a) is equivalent to , which implies that for any , we have

(18)

so that the process associated with preserves the probability measure .

Note that H 5-(a) implies that , whereas H 5-(c) implies that , where is defined by (6). Assumption H 5 is satisfied when , or with the generator of the Ornstein-Uhlenbeck process defined for any by

(19)
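When refreshment is driven by the Ornstein-Uhlenbeck generator of (19), the velocity between jumps solves the SDE $\mathrm{d}V_t = -V_t\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t$ (assuming the standard normalization), which leaves the standard Gaussian invariant. A sketch using the exact one-step transition, which is our discretization choice rather than anything prescribed by the paper:

```python
import numpy as np

def ou_step(v, dt, rng):
    # Exact OU transition: V_{t+dt} = e^{-dt} V_t + sqrt(1 - e^{-2 dt}) Z,
    # with Z ~ N(0, I); no discretization bias, any dt > 0 is valid.
    a = np.exp(-dt)
    return a * v + np.sqrt(1.0 - a * a) * rng.standard_normal(v.shape)

rng = np.random.default_rng(1)
v = rng.standard_normal((100_000, 1))  # start from the invariant N(0, 1)
for _ in range(5):
    v = ou_step(v, 0.5, rng)
# the standard Gaussian is preserved: empirical mean near 0, variance near 1
```

Using the exact transition rather than an Euler scheme guarantees that the Gaussian invariance holds for every step size, mirroring the role of the refreshment operator in the continuous-time process.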
H 6.

The refreshment rate is bounded from below and from above as follows: there exist and such that for all ,

(20)

Under the previous assumptions we can prove exponential convergence of the semigroup.

Theorem 1.

Assume that , given by (1) or (3) satisfies A 1 with and H 1, H 2, H 3, H 4, H 5 and H 6 hold. Then there exist and such that, for any , and ,

(21)

The constants and are given in explicit form in (37) in Theorem 2 (Section 3), in terms of the constant appearing in H 1, H 2, H 4, H 5 and H 6, where can be taken to be given in (39), , and where

(22)

and .

Proof.

The proof is postponed to Section 4.1. ∎

The following details the expected scaling behaviour with of and . The proof can be found in Section 4.3.

Corollary .

Consider the assumptions and notation of Theorem 1. Further suppose that there exists satisfying

(23)

which together with and are independent of . Then and there exists , independent of and , such that for large enough,

(24)

Thus, if , , and are fixed, we get that is in general at most of order if .

We now discuss the assumptions of the theorem, and the application of its conclusion to various instances of PDMP-MC and two examples of potentials. Assumption H 1 is problem-dependent and verifiable in practice, while H 2, H 4, H 5 and H 6 are user-controllable and we have already discussed standard choices satisfying these conditions. More delicate may be establishing that A 1 holds and that is indeed a core for the generator . As shown in [25], BPS and ZZ are well-defined Markov processes whose generators admit as a core, and similar arguments can be used to establish that it is also a core for the RHMC. Further, it is not difficult to show that for the class of processes described earlier, for any , , therefore implying that is an invariant distribution and that A 1 holds.

First we note that the spectral gap is indeed expected to be proportional to , since if is a PDMP with generator of the form (1) or (3) for , then is a PDMP with generator of the same form with . We therefore set below, a condition satisfied when is the uniform distribution on the sphere or