Consistent estimation of the spectrum of trace class data augmentation algorithms

11/02/2017 ∙ by Saptarshi Chakraborty, et al. ∙ University of Florida 0

Markov chain Monte Carlo is widely used in a variety of scientific applications to generate approximate samples from intractable distributions. A thorough understanding of the convergence and mixing properties of these Markov chains can be obtained by studying the spectrum of the associated Markov operator. While several methods to bound/estimate the second largest eigenvalue are available in the literature, very few general techniques for consistent estimation of the entire spectrum have been proposed. Existing methods for this purpose require the Markov transition density to be available in closed form, which is often not true in practice, especially in modern statistical applications. In this paper, we propose a novel method to consistently estimate the entire spectrum of a general class of Markov chains arising from a popular and widely used statistical approach known as Data Augmentation. The transition densities of these Markov chains can often only be expressed as intractable integrals. We illustrate the applicability of our method using real and simulated data.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Markov chain Monte Carlo (MCMC) techniques have become an indispensable tool in modern computations. With major applications in high dimensional settings, MCMC methods are routinely applied in various scientific disciplines. A major application of MCMC is to evaluate intractable integrals. To elaborate, let be an arbitrary measure space and let

be a probability measure on

, with associated probability density (with respect to the measure ). The quantity of interest is the integral

where is a well-behaved function. In many modern applications, the above integral is highly intractable. In particular, it is not available in closed form, a (deterministic) numerical integration is extremely inefficient (often due to the high dimensionality of ), and it can not be estimated by classical Monte Carlo techniques, as random (IID) generation from is not feasible. In such cases, one typically resorts to Markov Chain Monte Carlo (MCMC) methods. Here, a Markov chain

with equilibrium probability distribution

is generated (using any standard MCMC strategies such as Metropolis Hastings, Gibbs sampler etc.) and then a Monte Carlo average based on those Markov chain realizations is used to estimate .

If the Markov chain is Harris ergodic (which is the case if the corresponding Markov transition density (MTD) is strictly positive everywhere), then the cumulative averages based on the Markov chain realizations consistently estimate the integral of interest (see Asmussen and Glynn (2011)). The accuracy of the estimate depends on two factors: (a) the convergence behavior of the Markov chain to its stationary distribution, and (b) the dependence between the successive realizations of the chain at stationarity. An operator theoretic framework provides a unified way of analyzing these two related factors. Let us consider the Hilbert space of real valued functions

with finite second moment with respect to

. This is a Hilbert space where the inner product of is defined as

and the corresponding norm is defined by . Then the Markov transition density (MTD) corresponding to the Markov chain defines an operator that maps to

(1)

We will assume that the Markov chain is reversible. In terms of the associated operator , this means that is self-adjoint. The spectrum of the self-adjoint operator , denoted by , is the set of for which is non-invertible (here denotes the identity operator that leaves a function unchanged). It is known that if is positive, i.e., if for all , (which is the case when is the operator corresponding to a Data Augmentation (DA) Markov chain, see Section 3), then (see, e.g., Retherford (1993)).

In this paper, we will focus on situations when the operator is trace class, i.e., is countable and its elements are summable (Conway (1990, p. 214)). All finite state space Markov chains trivially correspond to trace class operators. Also, in recent years, an increasingly large class of continuous state space Markov chains from statistical applications have been shown to correspond to trace class operators (see, e.g., Choi and Román (2017); Chakraborty and Khare (2017); Pal, Khare, and Hobert (2017); Qin and Hobert (2016); Hobert, Jung, Khare, and Qin (2015); Rajaratnam, Sparks, Khare, and Zhang (2017)). Let , where are the decreasingly ordered eigenvalues of . Then and the difference is called the spectral gap for the compact Markov operator . The spectral gap plays a major role in determining the convergence behavior of the Markov chain. In particular, any can be expressed as where

is the sequence of eigenfunctions corresponding to

, and

(2)

for any positive interger . Hence, determines the asymptotic rate of convergence of to the stationary distribution. Furthermore, provides maximal absolute correlation between and when is large (i.e.,

is sufficiently close to the target), and enables us to compute upper bounds of the asymptotic variance of MCMC estimators based on ergodic averages.

There is a substantial literature devoted to finding a theoretical bound for the second largest eigenvalue of a Markov operator. For finite state space Markov chains, see Lawler and Sokal (1988); Sinclair and Jerrum (1989); Diaconis and Stroock (1991); Saloff-Coste (2004); Yuen (2000); Diaconis and Saloff-Coste (1996); François (2000); Diaconis and Saloff-Coste (1993) to name just a few. In many statistical applications, the Markov chains move on large continuous state spaces, and techniques based on drift and minorization (see Rosenthal (1995); Jones and Hobert (2001)) have been used to get bounds on for some of these Markov chains. However, these bounds can in many cases be way off. Techniques to estimate the spectral gap based on simulation have been developed in Garren and Smith (2000); Raftery and Lewis (1992), and more recently in Qin et al. (2017) for trace class data augmentation Markov chains.

While bounding or estimating the spectral gap is clearly useful, a much more detailed and accurate picture of the convergence can be obtained by analyzing the entire spectrum of the Markov operator. This is apparent from (2), and specific examples can be found in Diaconis et al. (2008); Hobert et al. (2011); Khare and Zhou (2009). Moreover, if we have two competing Markov chains to sample from the same stationary density, having knowledge of their respective spectra allows for a detailed and careful comparison (See Section 4.3 for an illustration). However, the literature for general methods to evaluate/estimate the entire spectrum (all the eigenvalues) of a Markov operator is rather sparse. Adamczak and Bednorz (2015) provide an elegant and simple way of consistently estimating the spectrum of a general Hilbert-Schmidt integral operator with symmetric kernel using approximations based on random matrices simulated from a Markov chain. The approach in Adamczak and Bednorz (2015) can in particular be adapted for estimating the spectra of Markov operators. In fact, as we show in Section 2, in this context, the regularity condition needed for their method is exactly equivalent to the underlying Markov operator being trace class.

However, in order for the approach (and the technical consistency results) in Adamczak and Bednorz (2015) to be applicable, the Markov transition density and the stationary density

are required to be available in closed form. These assumptions are not satisfied by an overwhelming majority of Markov chains arising in modern statistical applications. This is particularly true for the so-called Data Augmentation (DA) algorithm, which is a widely used technique for constructing Markov chains by introducing unobserved/latent random variables. In this context, often, (a) the transition density can only be expressed as an intractable high-dimensional integral, and/or (b) the stationary density is only available up to an unknown normalizing constant 

111one would typically need to evaluate a complicated high-dimensional integral to obtain this constant, which is often infeasible, see Albert and Chib (1993); Hobert et al. (2011); Roy (2012); Polson et al. (2013); Choi and Hobert (2013); Hobert et al. (2015); Qin and Hobert (2016); Pal et al. (2017) to name just a few.

The main objective of this paper is to develop a random matrix approximation method to consistently estimate the spectrum of DA Markov operators for situations where (a) and/or (b) holds. In particular, we show that if the transition densities in the method of

Adamczak and Bednorz (2015) are replaced by appropriate Monte Carlo based approximations, the spectrum of the resulting random matrix consistently estimates the spectrum of the underlying Markov operator (Theorem 3.1). More generally, we show that the method and the result can be easily adapted to situations where the stationary density is known only up to a normalizing constant (Theorem 3.2).

No regularity conditions are needed for our results if the state space , or the latent variable space is finite. We would like to mention that in many statistical applications with finite state spaces, the state space can be extremely large, with millions/billions of states. The intractability of the transition density and the size of the state space often make numerical techniques for eigenvalue estimation completely infeasible. However, as we show in the context of the example in Section 4.3, our method can provide reasonable answers in less than 5 minutes using modern parallel processing machinery. If both the state space and the latent variable space are infinite, two regularity conditions need to be verified in order to use our results. One of them requires the Markov operator to be trace class, and the other one is a variance condition; each require checking that an appropriate integral is finite. An illustration is provided in Section 4.2 for the Gibbs sampler of Polson, Scott, and Windle (2013).

The remainder of the article is organized as follows. In Section 2 we first review the approach developed by Adamczak and Bednorz (2015) for MTDs with closed form expressions. Then we show that in the context of Markov operators, their regularity condition for consistency is equivalent to assuming that the operator is trace class. In Section 3 we introduce our approach for estimating the spectrum of DA Markov operators with intractable Markov transition densities and establish weak and strong consistency of the resulting estimates under a mild regularity assumption. In Section 4.1 we consider a toy normal-normal DA Markov chain (Diaconis et al., 2008), where all the eigenvalues are known, and examine the accuracy of the eigenvalue estimates provided by our algorithm. We then move on to real applications. In Section 4.2 we illustrate our method on the Polya Gamma Markov chain of Polson et al. (2013). We verify that this Markov chain satisfies the regularity condition needed for consistency and work out the first few eigenvalue estimates for the nodal dataset provided in the boot (Canty and Ripley (2017)) R package. In Section 4.3 we consider a Bayesian analysis of the two component normal mixture model and examine two competing DA Markov chains proposed in Hobert et al. (2011) to sample from the resulting posterior distribution. We illustrate the usefulness and applicability of our method by estimating and comparing the first few eigenvalues of the two DA chains for simulated data. We end with a discussion in Section 5.

2 Random Matrix Approximation method of Adamczak and Bednorz (2015)

The objective of this section is to describe the method of operator spectra estimation via random matrices, first proposed in Koltchinskii and Giné (2000) and then in Adamczak and Bednorz (2015) in the context of Markov operators. We begin this section with a brief description of the general method, and then discuss how one can potentially use it to estimate spectra of trace class Markov operators. This discussion is followed by a short lemma that establishes an equivalence between the regularity condition used in Adamczak and Bednorz (2015), and the condition of the Markov chain being trace class.

Let be a Hilbert-Schmidt integral operator (an integral operator whose eigenvalues are square summable) defined through a symmetric (in argmuents) kernel as:

(3)

and interest lies in obtaining , the spectrum of . In general, there does not exist any method of evaluating for arbitrary . However, Koltchinskii and Giné (2000) suggest a novel, elegant and simple approach of estimating via random matrices. Let denote an IID sample of size from the distribution . Then the authors show that a (strong) consistent estimator of is given by the set by the eigenvalues of the random matrix

for large , where denotes the Dirac delta function. The strength of the result lies in the fact that it works for any Hilbert Schmidt operator, irrespective of the dimension and structure of , as long as an IID sample from can be drawn. Unfortunately, IID simulations are not always feasible, especially in high dimensional settings (otherwise there would be need for MCMC!), thus limiting the applicability of the method. Adamczak and Bednorz (2015) generalize Koltchinskii and Giné (2000)’s result by allowing to be an MCMC sample (i.e., realizations of a Markov chain with equilibrium distribution ), and prove consistency of the resulting estimates.

Let denote a positive self-adjoint trace class Markov operator as defined in (1). Of course is Hilbert Schmidt (eigenvalues are summable implies they are square summable), and is symmetric in its argument due to reversibility of the associated Markov chain. Thus by expressing in the form (3) can potentially be estimated by Adamczak and Bednorz (2015)’s method, which only requires an MCMC sample from . The resulting method, which uses the same random data generated during the original run of the Markov chain in the recipe proposed in Adamczak and Bednorz (2015) to estimate the spectrum, will be called the Random Matrix Approximation (RMA) method henceforth, and is described below.

  1. [label = Step 0:, start = 0, leftmargin=1.5cm]

  2. Given a starting point , draw realizations from the Markov chain associated with .

  3. Given , for each pair with , compute the MTDs and the kernels , and construct the matrix

    (4)
  4. Calculate the eigenvalues of and estimate by .

Algorithm 1 Random Matrix Approximation (RMA) method of estimating for a Markov operator with MTD and stationary density available in closed form.

Sacrificing independence and identicalness of the random sample in Adamczak and Bednorz (2015)’s RMA method, however, comes at a price (as compared to Koltchinskii and Giné (2000)’s method, which uses IID samples). In particular, to ensure strong consistency in Adamczak and Bednorz (2015)’s method, an additional regularity condition is required to be satisfied by the Markov operator , namely, a function needs to exist for which for all . Interestingly, as we show in the following Lemma 2.1, this condition for , is equivalent to that of being trace-class in the current setting. At the core of the proof for Lemma 2.1, the following two alternative characterizations of trace class and Hilbert Schmidt operators (see, e.g., Jörgens (1982)) are used. The operator as defined in (1) is trace class if and only if

(5)

whereas it is Hilbert Schmidt if and only if

(6)
Lemma 2.1.

Consider a reversible Markov operator as defined in (1). Define for . Then the following two conditions are equivalent:

  1. [label = ()]

  2. there exists such that and for all .

  3. is trace class.

Proof.

Note that is necessarily non-negative and measurable and symmetric in its arguments (since is self-adjoint). We prove the two implications and separately.

:

Let there exist such that and for all . This means for all . Therefore,

which, from (5), implies that is trace class.

:

Let be trace class. This means is also Hilbert Schmidt, and therefore from (6),

We shall prove the existence of by construction. Let us first denote by the sequence of eigenvalues of . Then by the trace class property of , , and by spectral theorem (see, e.g. Jörgens (1982)), for all , where is an orthonormal basis of . Hence, for all

where and the inequality is due to Cauchy-Schwarz. The proof is completed by noticing that . ∎

As a consequence of Lemma 2.1, we are now in a position to adapt the consistency result from Adamczak and Bednorz (2015) for the RMA method described above in Algorithm 1. Before stating the result, we introduce required notations from Koltchinskii and Giné (2000) and Adamczak and Bednorz (2015). Recall that for any operator (finite or infinite) we use the notation to denote its spectrum. Thus, for a finite matrix , will denote the set of its eigenvalues. Since the Markov operators we consider are trace class (and therefore, Hilbert Schmidt), their spectra can be identified with the sequences of eigenvalues, where is the Hilbert space of all square summable real sequences. Because our goal is to approximate the (possibly infinite) spectrum of an integral operator by the finite spectrum of a matrix, we will identify the latter with an element of , by appending an infinite sequence of zeros to it. Moreover, because in the current setting we only consider positive self-adjoint Markov operators which have non-negative spectra, we shall confine our attention to , the “non-negative half” of . As in Koltchinskii and Giné (2000), the metric we use for comparing spectra is the metric, which is defined for as

(7)

where is the set of all permutations of natural numbers. Note that for any two points on , the above metric can be expressed as an distance of the sorted versions of the two points. In particular, for , if and denotes the points on which have the same coordinates as and , but arranged in a decreasing order, then

(8)

It is to be noted that (8) can be generalized to any points on , not necessarily on (see Koltchinskii and Giné (2000)), but because of non-negativity of the spectra of our interest, we only consider in the current setting.

From the Hoffman–Wielandt inequality (Hoffman and Wielandt (1953), Koltchinskii and Giné (2000, Theorem 2.2)), it follows that for normal operators and ,

(9)

where denotes the Hilbert Schmidt norm of an operator defined by

for any orthonormal basis of . Note that if is finite (i.e., a matrix), say , then where denotes the Frobenious norm of defined as

The following theorem, a rephrasing of Theorem 2.1 from Adamczak and Bednorz (2015) adapted to the current setting using Lemma 2.1, establishes (strong) consistency of the spectrum estimator obtained by RMA method for a positive self adjoint trace class Markov operator.

Theorem 2.1.

Let = be a reversible Markov chain with MTD , invariant measure , and suppose the associated Markov operator as given in (1) is positive and trace class. Define for and let

Then, for every initial measure of the chain , with probability one, as ,

3 A novel Monte Carlo based Random Matrix Approximation method for DA Markov chains

As we see in Section 2, the RMA method of Adamczak and Bednorz (2015) requires evaluation of the ratio for every pair . Unfortunately, as mentioned in the introduction, one or both of and are often intractable and do not have closed form expressions in many statistical applications. This is particularly true in the context of the Data Augmentation (DA) algorithm, where along with the variable of interest , one introduces a latent variable such that generations from the conditional distributions of ( given ) and are possible. Then, given a starting point , at each iteration , one first simulates from the distribution of and then generates , from the distribution of . The ’s generated in this method are retained and used as the required MCMC sample. Hence, the Markov transition density can be written as

(10)

Here and are conditional densities with respect to the measures and respectively, and are simple and easy to sample from in a typical DA algorithm. Unfortunately, the integral providing the Markov transition density of the associated chain often does not have a closed form expression, and cannot be efficiently approximated by a deterministic numerical integration algorithm (usually due to high dimensionality). The RMA method described in Algorithm 1 and the associated consistency result (Theorem 2.1) cannot be applied in these cases. In this section we propose a Monte Carlo based random matrix approximation (MCRMA) algorithm to estimate the spectrum of DA algorithms with intractable transition densities (Algorithm 2). To contrast with MCRMA, we shall call the RMA method described in Section 2 the Exact RMA or ERMA. The consistency of the MCRMA algorithm is established in Theorem 3.1.

Often, in addition to the intractability of the transition density, the stationary density is also available only up to an unknown normalizing constant (which is again hard to estimate in many modern applications as the stationary density is supported on a high-dimensional space). We adapt our algorithm to this situation (Algorithm 3), and establish consistency as well (Theorem 3.2).

3.1 Monte Carlo Random Matrix Approximation (MCRMA) Method

In this section, we will present a method to estimate the spectrum of a DA Markov operator where the transition density in (10) is intractable, but the staionary density is available in closed form. Given realizations of the positive trace class reversible Markov chain with MTD as in (10), the key idea is to approximate for each pair using classical Monte Carlo technique, and then construct an analogue of the RMA estimator that uses the approximate kernels instead of the original. The details of the method are provided in Algorithm 2 below.

  1. [label = Step 0:, start = 0]

  2. Given a starting point , draw realizations from the associated Markov chain . Call .

  3. Given , for each , generate generate IID observations from the density .

  4. For each pair with , construct the Monte Carlo estimate

    define the estimated kernel

    and construct the matrix

    (11)

    where is the Dirac delta function. Observe that is symmetric by construction, with zero diagonal entries.

  5. Calculate the eigenvalues of and estimate by .

Algorithm 2 Monte Carlo Random Matrix Approximation (MCRMA) method of estimating for a positive reversible Markov operator with associated MTD

It is to be noted that for Step 1 in the MCRMA algorithm to be feasible, the density should be easy to sample from. This is typically true for DA algorithms that are used in practice. In fact, the major motivation for using a DA algorithm is that the conditional densities and are easy to sample from, whereas it is hard to directly generate samples from . For Step 2 to be feasible, we need to be available in closed form. Again, this is true in most statistical applications, where is typically a standard density such as multivariate normal, gamma etc. Another crucial thing to note, from a computational point of view, is that the rows of the matrix can be constructed in an embarrassingly parallel fashion (since no relationship is assumed among the elements of ), thereby reducing the running time of the algorithm significantly.

Note that the MCRMA algorithm (Algorithm 2) provides a coarser approximation to the spectrum of as compared to the the ERMA algorithm (Algorithm 1). This is because we use the additional Monte Carlo based approximation (for ) in the constructed random matrices. An obvious and important question is: does an analog of the consistency result for the RMA algorithm (Theorem 2.1) hold in this more complex setting of the MCRMA algorithm? We state and prove such a consistency result below.

Theorem 3.1.

Let = be a reversible Markov chain on the state space with MTD , invariant measure and let associated Markov operator as given in (1) be positive. Suppose the MTD can be written as , where and are conditional densities with respect to measures and . Let denote the first realizations of the Markov chain, and given , construct the matrix as given in (11). Then the following hold:

  1. [label=()]

  2. If is finite, then (strong consistency) as and , almost surely, for any initial measure for the chain .

  3. If is infinite (countable or uncountable), and

    1. [label=()]

    2. is trace class, and

    3. (variance condition)

      where denotes the joint density of , , ,

    then for any initial measure for the chain ,

    1. [label = ()]

    2. (weak consistency) if as , then ,

    3. (strong consistency) if , then almost surely.

Proof.

On the outset, note that, by the triangle inequality and then (9), we have

(12)

Since almost surely (Theorem 2.1) as , therefore, we only need to show the almost sure or in probability convergence (to zero) of the first term on the right hand side of the above inequality, as and . We shall prove this convergence separately for the cases where is finite and infinite.

is finite

Note that, for any matrix , . Therefore,

(13)

as , since the (double) summation includes a finite number of terms, with each term converging almost surely to zero (Monte Carlo convergence). Thus, combining (12), (13) and Theorem 2.1, we have

is infinite:

By the variance condition 2b, we have,

Let be arbitrary. Since the Hilbert Schmidt norm for matrices are the same as the Frobenius norm, and since are both symmetric in its arguments, therefore, we have

Therefore, by Markov inequality,

Therefore,

  1. [label = ()]

  2. if as , then

    since . This means , which, by (12) implies

  3. if as , then

    which, by the Borel-Cantelli theorem, implies almost surely. Therefore, from (12),

Remark 3.1.

Let denote the Markov operator associated with the chain (the Markov chain of the generated latent data), defined for as

(14)

where denotes the Markov transition density of the chain, denotes the stationary density for (associated with ) and is the probability measure associated with . Then Khare and Hobert (2011) show that , which implies that instead of estimating , one can equivalently estimate . Note that,

with and being the same conditional densities as before. Therefore, an analogous MCRMA algorithm for estimating can be similarly formulated. Here, given the realizations , , , one first finds the Monte Carlo approximates of (via IID samples generated from ) at every paired realization , then defines a random matrix with the ratios (times an adjustment factor ), similar to the MCRMA with observations, and finally evaluates eigenvalues of the resulting random matrix. Consequently, an analogous consistency theorem will also hold for the resulting algorithm.

Because the chain is automatically generated as a by-product of the DA algorithm, from a practitioner’s point of view, using instead of makes little difference in MCRMA. However, substantial simplifications on the regularity conditions may be achieved by using . This is particularly true in cases where the latent variable space is finite (however large). In such cases, no regularity condition is required to be satisfied (case 1 in Theorem 3.1) to achieve strong consistency. See Section 4.3 for examples.

Remark 3.2.

Observe that the conditions 2a and 2b in Theorem 3.1 are not very demanding. For example, when the other assumptions are satisfied, if , weak convergence holds. On the other hand when , for some , condition 2b is satisfied, which ensures strong convergence (which, of course, implies weak convergence).

3.2 MCRMA with Specified Only up to a Constant

Note that Step 2 MCRMA method requires construction of a symmetric matrix whose th entry has in the denominator. This is clearly not feasible in cases where is known up to a constant, i.e., is of the form , where is an unknown constant, and the functional form of is completely known. In this section, we propose a simple strategy that adapts Algorithm 2 for such cases. The basic idea, displayed formally in Algorithm 3, is to follow the steps of Algorithm 2 but now with in the denominator of the random matrix instead of , and then simply rescale the eigenvalues so that the largest eigenvalue is 1. Clearly, this nullifies any estimation/evaluation of the normalizing constant. Theorem 3.2 establishes consistency for the resulting estimator by exploiting the fact that the largest eigenvalue of any Markov operator is 1.

  1. [label = Step 0:, start = 0]

  2. Given a starting point , draw realizations from the associated Markov chain . Call .

  3. Given , for each , generate generate IID observations