Rademacher complexity for Markov chains: Applications to kernel smoothing and Metropolis-Hastings

06/06/2018 ∙ by Patrice Bertail, et al.

Following the seminal approach by Talagrand, the concept of Rademacher complexity for independent sequences of random variables is extended to Markov chains. The proposed notion of "block Rademacher complexity" (of a class of functions) follows from renewal theory and makes it possible to control the expected values of suprema (over the class of functions) of empirical processes based on Harris Markov chains as well as the excess probability. For classes of Vapnik-Chervonenkis type, bounds on the "block Rademacher complexity" are established. These bounds depend essentially on the sample size and the probability tails of the regeneration times. The proposed approach is employed to obtain convergence rates for the kernel density estimator of the stationary measure and to derive concentration inequalities for the Metropolis-Hastings algorithm.

1 Introduction

Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space and suppose that $X = (X_i)_{i \in \mathbb{N}}$ is a sequence of random variables on $(\Omega, \mathcal{A}, \mathbb{P})$ valued in $(E, \mathcal{E})$. Let $\mathcal{F}$ denote a countable class of real-valued measurable functions defined on $E$. Let $n \geq 1$ and define

$$Z = \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^{n} \left( f(X_i) - \mathbb{E}[f(X_i)] \right) \right|.$$

The random variable $Z$ plays a crucial role in machine learning and statistics: it can be used to bound the risk of an algorithm (Vapnik, 1998) as well as to study M- and Z-estimators (van der Vaart, 1998); it serves to describe the (uniform) accuracy of function estimates such as the cumulative distribution function (Shorack and Wellner, 2009; Boucheron et al., 2013) or kernel smoothing estimates of the probability density function (Einmahl and Mason, 2000; Giné and Guillou, 2002); in addition, kernel density estimators, as well as their variations such as Nadaraya-Watson estimators, are at the core of many semi-parametric statistical procedures (Akritas and Van Keilegom, 2001; Portier, 2016) in which controlling $Z$-type quantities makes it possible to take advantage of the tightness of the empirical process (van der Vaart and Wellner, 2007). Depending on the class $\mathcal{F}$, many different bounds are known when $X$ forms an independent and identically distributed (i.i.d.) sequence of random variables. A complete picture is given in van der Vaart and Wellner (1996); Boucheron et al. (2013).

The purpose of the paper is to study the behavior of this supremum when the underlying sequence is a Markov chain. The approach taken in this paper is based on renewal theory and is known as the regenerative method, see Smith (1955); Nummelin (1978); Athreya and Ney (1978). Indeed, it is well known that sample paths of a Harris chain may be divided into i.i.d. regeneration blocks. These blocks are defined as data cycles between random times, called regeneration times, at which the chain forgets its past. Hence, many results established in the i.i.d. setup may be extended to the Markovian framework by applying them to (functionals of) the regeneration blocks. Refer to Meyn and Tweedie (2009) for the strong law of large numbers and the central limit theorem, to Levental (1988) for the functional CLT, as well as to Bolthausen (1980); Malinovskiĭ (1987, 1989); Bertail and Clémençon (2004); Bednorz et al. (2008); Douc et al. (2008) for refinements of the central limit theorem and to Adamczak (2008); Bertail and Clémençon (2010); Bertail and Ciołek (2017) for exponential-type bounds.

Other works dealing with concentration inequalities for Markov chains include, among others, Joulin and Ollivier (2010), where a concentration inequality is proved under a curvature assumption; Dedecker and Gouëzel (2015), where the technique of bounded differences is employed to derive a Hoeffding-type inequality; and Wintenberger (2017), which extends the previous result to the case of unbounded chains.

We introduce a new notion of complexity that we call the block Rademacher complexity, which extends the classical Rademacher complexity defined for independent sequences of random variables (Boucheron et al., 2013) to Markov chains. As in the independent case, the block Rademacher complexity is useful to bound the expected values of empirical processes (over some classes of functions) and also serves to control the excess probability. Depending on the probability tails of the regeneration times, which are considered to be either exponential or polynomial, we derive bounds on the block Rademacher complexity of Vapnik-Chervonenkis (VC) types of classes. Interestingly, the obtained bounds bear resemblance to the ones provided in Einmahl and Mason (2000); Giné and Guillou (2001) (for independent observations) as they depend on the variance of the underlying class of functions, which makes it possible to take advantage of classes having small fluctuations.

To demonstrate the usefulness and the generality of the proposed approach, we apply our results to two different problems. The first one tackles uniform bounds for the kernel estimator of the stationary density and illustrates how to handle particular classes of functions whose variance decreases with the sample size. The second problem deals with the popular Metropolis-Hastings algorithm, which furnishes examples of Markov chains that fit our framework.

Kernel density estimator.

The asymptotic properties of kernel density estimators, based on independent and identically distributed data, have been well understood since the 1970s and 1980s (Stute, 1982). However, finite sample properties were only studied at the beginning of the century, when the study of empirical processes over VC classes proved to be a powerful tool to handle such kernel density estimates (Einmahl and Mason, 2000; Giné and Guillou, 2002). The function class of interest in this problem is given by

$$\mathcal{K}_n = \left\{ x \mapsto K\left( \frac{x - y}{h_n} \right) \, : \, y \in \mathbb{R}^d \right\},$$

where $K$ is called the kernel and $(h_n)_{n \geq 1}$ is a positive sequence converging to $0$ called the bandwidth. Based on the property that $\mathcal{K}_n$ is included in some VC class (Nolan and Pollard, 1987), some master results have been obtained by Einmahl and Mason (2000, 2005); Giné and Guillou (2001, 2002) who proved some concentration inequalities, based on the seminal work of Talagrand (1996), making it possible to establish precisely the rate of uniform convergence of kernel density estimators. Kernel density estimates are particular because the variance of each element in $\mathcal{K}_n$ goes to $0$ as $n \to \infty$. This needs to be taken into account to derive accurate bounds, e.g., the one presented in Giné and Guillou (2002). The proposed approach takes care of this phenomenon as, under reasonable conditions, our bound for Markov chains scales at the same rate as the ones obtained in the independent case. Note that our results extend the ones in Azaïs et al. (2016) where, under similar assumptions, consistency is established.

The study of this specific class of statistics for dependent data has only recently received special attention in the statistical literature. To the best of our knowledge, uniform results are limited to the alpha and beta mixing cases when dependency occurs (Peligrad, 1992; Hansen, 2008) by using coupling techniques.

Metropolis-Hastings algorithm.

The Metropolis-Hastings (MH) algorithm is one of the state-of-the-art methods in computational statistics and is frequently used to compute Bayesian estimators (Robert and Casella, 2004). Theoretical results for MH are often deduced from the analysis of geometrically ergodic Markov chains as presented for instance in Mengersen and Tweedie (1996); Roberts and Tweedie (1996); Jarner and Hansen (2000); Roberts and Rosenthal (2004); Douc et al. (2004). Whereas many results on the asymptotic behavior of MH are known, e.g., central limit theorem or convergence in total variation, only few non-asymptotic results are available for such Markov chains; see for instance Łatuszyński et al. (2013) where the estimation error is controlled via a Rosenthal-type inequality. We consider the popular random walk MH, which is at the heart of the adaptive MH version introduced in Haario et al. (2001). Building upon the pioneering works of Roberts and Tweedie (1996); Jarner and Hansen (2000), where geometric ergodicity is established for the random walk MH, we show that whenever the class of functions is VC, the expected value of the supremum of the sums over the points of the chain is bounded by a quantity that depends notably on the distribution of the regeneration times. By further applying this to the quantile function, we obtain a concentration inequality for Bayesian credible intervals.

Outline.

The paper is organized as follows. In section 2, the notations and main assumptions are first set out. Conceptual background related to the renewal properties of Harris chains and the regenerative method is also briefly presented. In section 3, the notion of block Rademacher complexity for Markov chains is introduced as well as the notion of VC classes. Section 4 provides the main result of the paper: a bound on the block Rademacher complexity. Our methodology is illustrated in section 5 on kernel density estimation and MH. Technical proofs are postponed to the Appendix.

2 Regenerative Markov chains

2.1 Basic definitions

In this section, for the sake of completeness, we recall the following important basic definitions and properties of regenerative Markov chains. An interested reader may look into Nummelin (1984a) or Meyn and Tweedie (2009) for a detailed survey of regeneration theory.

Consider a homogeneous Markov chain $X = (X_n)_{n \in \mathbb{N}}$ on a countably generated state space $(E, \mathcal{E})$ with transition probability $P$ and initial probability $\nu$. The assumption that $\mathcal{E}$ is countably generated makes it possible to avoid measurability problems. For any $n \geq 1$, let $P^n$ denote the $n$-th iterate of the transition probability $P$.

Definition 1 (irreducibility).

The chain is $\psi$-irreducible if there exists a $\sigma$-finite measure $\psi$ such that, for every set $B \in \mathcal{E}$ with $\psi(B) > 0$ and for any $x \in E$, there exists $n > 0$ such that $P^n(x, B) > 0$. In words, no matter what the starting point is, the chain visits $B$ with strictly positive probability.

Definition 2 (aperiodicity).

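Assuming $\psi$-irreducibility, there exist $d' \in \mathbb{N}^*$ and disjoint sets $D_1, \ldots, D_{d'}$ (set $D_{d'+1} = D_1$) positively weighted by $\psi$ such that $\psi(E \setminus \cup_{1 \leq i \leq d'} D_i) = 0$ and $P(x, D_{i+1}) = 1$ for all $x \in D_i$. The period $d$ of the chain is the g.c.d. of such integers; the chain is said to be aperiodic if $d = 1$.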

Definition 3 (Harris recurrence).

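Given a set $B \in \mathcal{E}$ and the time $\tau_B = \inf\{n \geq 1 : X_n \in B\}$ at which the chain first enters $B$, a $\psi$-irreducible Markov chain is said to be positive Harris recurrent if for all $B \in \mathcal{E}$ with $\psi(B) > 0$, we have $\mathbb{E}_x[\tau_B] < \infty$ for all $x \in B$.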

Recall that a chain is positive Harris recurrent and aperiodic if and only if it is ergodic (Nummelin, 1984a, Proposition 6.3), i.e., there exists a probability measure $\pi$, called the stationary distribution, such that $\lim_{n \to \infty} \| P^n(x, \cdot) - \pi \|_{\mathrm{tv}} = 0$. The Nummelin splitting technique (presented in the forthcoming section) depends heavily on the notion of small set. Such sets exist for positive Harris recurrent chains (Jain and Jamison, 1967).

Definition 4 (small sets).

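A set $S \in \mathcal{E}$ is said to be $\Psi$-small if there exist a constant $\delta > 0$, a positive probability measure $\Psi$ supported by $S$ and an integer $m \in \mathbb{N}^*$ such that

$$\forall x \in S,\ \forall B \in \mathcal{E}, \quad P^m(x, B) \geq \delta \Psi(B). \qquad (1)$$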

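Throughout the paper, we work under the following generic hypothesis in which the chain is supposed to be Harris recurrent. Let $\mathbb{P}_\nu$ (resp. $\mathbb{P}_x$) denote the probability measure such that $X_0 \sim \nu$ (resp. $X_0 = x$), and let $\mathbb{E}_\nu$ denote the $\mathbb{P}_\nu$-expectation (resp. $\mathbb{E}_x$ the $\mathbb{P}_x$-expectation).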

  (H) The chain $(X_n)_{n \in \mathbb{N}}$ is a positive Harris recurrent aperiodic Markov chain with countable state space $(E, \mathcal{E})$, transition kernel $P(x, dy)$ and initial measure $\nu$. Let $S$ be $\Psi$-small with $m = 1$ and suppose that the hitting time $\tau_S$ satisfies

    (2)

It is only for the sake of clarity that we assume $m = 1$. As explained in Remark 7 below, the study of sums over Harris chains when $m > 1$ can easily be derived from the case $m = 1$.

2.2 The Nummelin splitting technique

The Nummelin splitting technique (Nummelin, 1978; Athreya and Ney, 1978) makes it possible to retrieve all regeneration properties for general Harris Markov chains. It consists in extending the probabilistic structure of the chain in order to construct an artificial atom (Nummelin, 1984b). We start by recalling the definition of regenerative chains.

Definition 5.

We say that a $\psi$-irreducible, aperiodic chain $X$ is regenerative or atomic if there exists a measurable set $A$, called an atom, such that $\psi(A) > 0$ and for all $(x, y) \in A^2$ we have $P(x, \cdot) = P(y, \cdot)$. Roughly speaking, an atom is a set on which the transition probabilities are the same. If the chain visits a finite number of states, then any state (or any subset of states sharing the same transition probabilities) is actually an atom.

Assume that the chain $X$ satisfies the generic hypothesis (H). Then the sample space is expanded in order to define a sequence $(Y_n)_{n \in \mathbb{N}}$ of independent Bernoulli random variables with parameter $\delta$. The construction relies on the mixture representation of $P$ on $S$, namely $P(x, B) = \delta \Psi(B) + (1 - \delta) \frac{P(x, B) - \delta \Psi(B)}{1 - \delta}$, with two components, one of which does not depend on the starting point (implying regeneration when this component is picked in the mixture). The regeneration structure can be retrieved by the following randomization of the transition probability $P$ each time the chain $X$ visits the set $S$:

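  • If $X_n \in S$ and $Y_n = 1$ (which happens with probability $\delta$), then $X_{n+1}$ is distributed according to the probability measure $\Psi$,

  • If $X_n \in S$ and $Y_n = 0$ (which happens with probability $1 - \delta$), then $X_{n+1}$ is distributed according to the probability measure $(1 - \delta)^{-1} (P(X_n, \cdot) - \delta \Psi(\cdot))$.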
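The randomization above can be illustrated on a toy finite-state chain. The following Python sketch is only an illustration under stated assumptions: the three-state kernel `P`, the minorizing measure `PSI`, the constant `DELTA` and the choice of the whole state space as small set are hypothetical and were picked so that the minorization $P(x, \cdot) \geq \delta \Psi(\cdot)$ holds; none of these names come from the paper.

```python
import random

def draw(dist, rng):
    """Sample a key of `dist`, a dict mapping state -> probability."""
    u, acc, last = rng.random(), 0.0, None
    for state, p in dist.items():
        if p <= 0.0:
            continue
        acc += p
        last = state
        if u < acc:
            return state
    return last  # guards against rounding when probabilities sum to just under 1

def residual_kernel(p_x, psi, delta):
    """Second mixture component: (P(x, .) - delta * Psi(.)) / (1 - delta)."""
    states = set(p_x) | set(psi)
    return {z: (p_x.get(z, 0.0) - delta * psi.get(z, 0.0)) / (1.0 - delta)
            for z in states}

def simulate_split_chain(n, x0, P, psi, delta, small_set, seed=0):
    """Return (xs, ys): the n + 1 visited states and the n Bernoulli(delta) draws."""
    rng = random.Random(seed)
    xs, ys = [x0], []
    for _ in range(n):
        x = xs[-1]
        y = 1 if rng.random() < delta else 0
        if x in small_set and y == 1:
            nxt = draw(psi, rng)  # regeneration: next state ignores x
        elif x in small_set:
            nxt = draw(residual_kernel(P[x], psi, delta), rng)
        else:
            nxt = draw(P[x], rng)  # outside the small set, use P itself
        xs.append(nxt)
        ys.append(y)
    return xs, ys

# Hypothetical toy example (assumption): P(x, z) >= DELTA * PSI(z) for all x, z.
P = {0: {0: 0.5, 1: 0.3, 2: 0.2},
     1: {0: 0.2, 1: 0.5, 2: 0.3},
     2: {0: 0.3, 1: 0.2, 2: 0.5}}
PSI = {0: 0.4, 1: 0.4, 2: 0.2}
DELTA = 0.5
SMALL_SET = {0, 1, 2}

xs, ys = simulate_split_chain(1000, 0, P, PSI, DELTA, SMALL_SET)
# The split chain regenerates each time (X_i, Y_i) falls in SMALL_SET x {1}.
regenerations = sum(1 for x, y in zip(xs, ys) if x in SMALL_SET and y == 1)
```

Marginally, the first coordinate still moves according to `P`, since mixing `PSI` with weight `DELTA` and the residual kernel with weight `1 - DELTA` gives back `P(x, .)` on the small set.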

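The bivariate Markov chain $(X_n, Y_n)_{n \in \mathbb{N}}$ is called the split chain. It takes its values in $E \times \{0, 1\}$ and is atomic with atom given by $A = S \times \{1\}$. Define the sequence of regeneration times $(\tau_A(j))_{j \geq 1}$, i.e.

$$\tau_A = \tau_A(1) = \inf\{n \geq 1 : (X_n, Y_n) \in A\}$$

and, for $j \geq 2$,

$$\tau_A(j) = \inf\{n > \tau_A(j-1) : (X_n, Y_n) \in A\}.$$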

It is well known that the bivariate chain inherits all the stability and communication properties of the chain $X$, such as aperiodicity and $\psi$-irreducibility. For instance, the regeneration times have a finite expectation (by the recurrence property); more precisely, it holds that (Azaïs et al., 2016, Lemma 9)

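It is known from regeneration theory (Meyn and Tweedie, 2009) that, given the sequence of regeneration times $(\tau_A(j))_{j \geq 1}$, we can cut our data into block segments or cycles defined by

$$B_j = (X_{1 + \tau_A(j)}, \ldots, X_{\tau_A(j+1)}), \quad j \geq 1,$$

according to the consecutive visits of the chain to the regeneration set $A$. The strong Markov property implies that the sequences $(\tau_A(j+1) - \tau_A(j))_{j \geq 1}$ and $(B_j)_{j \geq 1}$ are i.i.d. (Bednorz et al., 2008, Lemma 3.1). Denote by $\mathbb{P}_A$ the probability measure such that $X_0 \in A$, and by $\mathbb{E}_A$ the associated expectation. The stationary distribution is given by Pitman's occupation measure:

$$\pi(B) = \frac{1}{\mathbb{E}_A[\tau_A]} \mathbb{E}_A \left[ \sum_{i=1}^{\tau_A} \mathbb{I}\{X_i \in B\} \right],$$

where $\mathbb{I}\{\cdot\}$ is the indicator function of the event inside the braces. Let $f : E \to \mathbb{R}$ be a general measurable function. In the following we consider partial sums over regenerative cycles $f(B_j) = \sum_{i = 1 + \tau_A(j)}^{\tau_A(j+1)} f(X_i)$. We denote by $l_n$ the total number of renewals up to time $n$, thus we observe $l_n + 1$ blocks. Notice that the block lengths are also i.i.d. with mean $\mathbb{E}_A[\tau_A]$.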

Remark 1 (random number of blocks).

The number of blocks is random and correlated with the blocks themselves. This causes a major difficulty when deriving second order asymptotic results as well as non-asymptotic results for regenerative Markov chains.

Remark 2 (small set or atom).

The Nummelin splitting technique is useless in the case where the initial chain is already atomic, in which case the atom is simply . For clarity, we choose to focus on the general framework of Harris chains.
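When the chain is already atomic, the block decomposition can be computed directly from the sample path. The sketch below is a minimal illustration, assuming a finite-state path and the convention that a block starts right after a visit to the atom and ends at the next visit (inclusive); the function name `regeneration_blocks` and the toy path are hypothetical.

```python
def regeneration_blocks(path, atom):
    """Cut a sample path at the visits to `atom`.

    Returns (first, blocks, last) where `first` is the initial segment up to
    and including the first visit to the atom, `blocks` is the list of
    complete blocks between consecutive visits, and `last` is the incomplete
    segment after the last visit.
    """
    hits = [i for i, x in enumerate(path) if x in atom]
    if not hits:
        return list(path), [], []
    first = list(path[: hits[0] + 1])
    blocks = [list(path[a + 1 : b + 1]) for a, b in zip(hits, hits[1:])]
    last = list(path[hits[-1] + 1 :])
    return first, blocks, last

# Toy path on the states {0, 1, 2} with atom {0} (illustrative data only).
path = [2, 0, 1, 2, 0, 0, 1, 0, 2, 1]
first, blocks, last = regeneration_blocks(path, atom={0})

# Block sums f(B_j) for f the identity, and block lengths.
block_sums = [sum(b) for b in blocks]
block_lengths = [len(b) for b in blocks]
```

On this toy path the complete blocks are `[1, 2, 0]`, `[0]` and `[1, 0]`, while `[2, 0]` and `[2, 1]` are the first and last (incomplete) segments.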

3 Block Rademacher complexity

3.1 The independent case

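Let $\xi = (\xi_i)_{i \in \mathbb{N}}$ be an i.i.d. sequence of random variables defined on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$, valued in $(E, \mathcal{E})$, with a common distribution on $(E, \mathcal{E})$. Let $\mathcal{F}$ be a countable class of real-valued measurable functions defined on $E$. The Rademacher complexity associated to $\mathcal{F}$ is given by

$$R_n(\mathcal{F}) = \mathbb{E} \left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^{n} \epsilon_i f(\xi_i) \right| \right],$$

where the $(\epsilon_i)_{i \in \mathbb{N}}$ are i.i.d. Rademacher random variables, i.e., taking values $+1$ and $-1$ with probability $1/2$, independent from $\xi$.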
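For a finite class of functions, the Rademacher complexity conditional on an observed sample can be approximated by averaging over random sign vectors. The following Python sketch is an illustration only: it assumes the unnormalized form of the complexity (a supremum of absolute signed sums), and the function name, the toy sample and the toy class are hypothetical.

```python
import random

def rademacher_complexity_mc(sample, functions, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_eps sup_f |sum_i eps_i f(x_i)| given `sample`."""
    rng = random.Random(seed)
    # Precompute f(x_i) once for every function in the (finite) class.
    values = [[f(x) for x in sample] for f in functions]
    total = 0.0
    for _ in range(n_draws):
        eps = [rng.choice((-1, 1)) for _ in sample]  # Rademacher signs
        total += max(abs(sum(e * v for e, v in zip(eps, row))) for row in values)
    return total / n_draws

# Toy data (assumption): uniform sample and a small class of indicators.
sample_rng = random.Random(1)
sample = [sample_rng.random() for _ in range(50)]
small_class = [lambda x: 1.0 if x <= 0.5 else 0.0]
large_class = small_class + [lambda x: 1.0 if x <= 0.25 else 0.0,
                             lambda x: 1.0 if x <= 0.75 else 0.0]

r_small = rademacher_complexity_mc(sample, small_class)
r_large = rademacher_complexity_mc(sample, large_class)
```

Because both calls use the same seed, they see the same sign vectors, so enlarging the class can only increase the estimate.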

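The notion of VC class is powerful because it covers many interesting classes of functions and ensures suitable properties on the Rademacher complexity. The function $U$ is an envelope for the class $\mathcal{F}$ if $|f(x)| \leq U(x)$ for all $x \in E$ and $f \in \mathcal{F}$. For a metric space $(\mathcal{F}, d)$, the covering number $\mathcal{N}(\epsilon, \mathcal{F}, d)$ is the minimal number of balls of radius $\epsilon$ needed to cover $\mathcal{F}$. The metric that we use here is induced by the $L_2(Q)$-norm, denoted by $\| \cdot \|_{L_2(Q)}$ and given by $\| f \|_{L_2(Q)} = \{ \int f^2 \, dQ \}^{1/2}$.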
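To make the notion of covering number concrete, the sketch below builds a greedy cover of a finite class of functions in the $L_2$ norm of an empirical measure. This is only an illustration: a greedy cover gives an upper bound on the covering number (with centers taken inside the class), not its exact value, and the names `l2_distance`, `greedy_cover` and the toy class of indicators are hypothetical.

```python
import math

def l2_distance(f, g, points):
    """L2 distance between f and g under the uniform measure on `points`."""
    return math.sqrt(sum((f(x) - g(x)) ** 2 for x in points) / len(points))

def greedy_cover(functions, points, eps):
    """Greedily pick centers so that every function is within eps of a center."""
    centers = []
    for f in functions:
        # Keep f as a new center only if no existing center is eps-close to it.
        if all(l2_distance(f, c, points) > eps for c in centers):
            centers.append(f)
    return centers

def indicator(t):
    """Indicator function of the half-line (-inf, t]."""
    return lambda x: 1.0 if x <= t else 0.0

# Toy example (assumption): indicators of half-lines evaluated on a grid.
points = [i / 20 for i in range(20)]
functions = [indicator(j / 10) for j in range(11)]

coarse = greedy_cover(functions, points, 1.0)
fine = greedy_cover(functions, points, 0.2)
```

As expected, the number of centers grows when the radius shrinks: a single ball of radius 1 covers the whole class here, while radius 0.2 requires several centers.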

Definition 6.

A class of measurable functions is said to be of VC-type (or Vapnik-Chervonenkis type) for an envelope and admissible characteristic (positive constants) such that and , if for all probability measure on with and every ,

We also assume that the class is countable to avoid measurability issues (but the non-countable case may be handled similarly by using outer probability and additional measurability assumptions, see van der Vaart and Wellner (2007)).

The next theorem is taken from Giné and Guillou (2001), Proposition 2.2, and has been successfully applied to kernel density estimators in Giné and Guillou (2002). Similar approaches are provided for instance in Einmahl and Mason (2005), Proposition 1.

Theorem 1 (Giné and Guillou (2001)).

Let be a measurable uniformly bounded VC class of functions defined on with envelope and characteristic . Let such that for all and . Let be such that for all . Then, whenever , it holds

where is a universal constant.

3.2 The Harris case

To extend the previous approach to any Harris chain , we decompose the chain according to the elements that belong to complete blocks and the elements in and . Assuming that ,

(3)

with the convention that empty sums are $0$. The terms corresponding to the first and last blocks, say and , will be treated separately. Because , where denotes the size of block , it holds that

where . Hence this term is a (random) summation over complete blocks. Recall that and that, under (2), . Thus, aiming to reproduce the Rademacher approach in the i.i.d. setting, we introduce the following block Rademacher complexity of the class ,

where are Rademacher random variables independent from the blocks .
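Given observed regeneration blocks, a conditional version of the block Rademacher complexity can be approximated by Monte Carlo in the same way as in the independent case, with one random sign per block instead of one per observation. The sketch below is illustrative only: it assumes a finite class, takes the block functional to be the sum of $f$ over the block, and uses hypothetical names and toy blocks.

```python
import random

def block_sum(f, block):
    """Value of f on a block: the sum of f over the points of the block."""
    return sum(f(x) for x in block)

def block_rademacher_mc(blocks, functions, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_eps sup_f |sum_j eps_j f(B_j)| given the blocks."""
    rng = random.Random(seed)
    sums = [[block_sum(f, b) for b in blocks] for f in functions]
    total = 0.0
    for _ in range(n_draws):
        eps = [rng.choice((-1, 1)) for _ in blocks]  # one sign per block
        total += max(abs(sum(e * s for e, s in zip(eps, row))) for row in sums)
    return total / n_draws

# Toy blocks (assumption), e.g. obtained by cutting a path at visits to an atom.
blocks = [[1, 2, 0], [0], [1, 0]]
functions = [lambda x: float(x), lambda x: 1.0]
estimate = block_rademacher_mc(blocks, functions)
```

Note that for the constant function equal to one, the block sums are the block lengths, which is why the tails of the regeneration times enter the bounds on this quantity.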

3.3 Block VC classes

Even if the blocks form an independent sequence, we cannot apply directly concentration results for empirical processes over bounded classes, e.g., Theorem 1, because the class of functions formed by is not bounded. To solve this problem we will show that it is possible by an adequate probability transformation to bound the covering number of the functions by the one of the original class of functions for an adequate metric. In particular, we show that the class of functions has a similar size, in terms of covering number, as the class . This in turn will help to extend existing concentration inequalities on to concentration inequalities on .

For this define and let the occupation measure be given by

Introduce the following notations : for any function , let be given by

and for any class of real-valued functions defined on , denote by

(4)

The function that gives the size of the blocks is , defined by,

Let denote the smallest -algebra formed by the elements of the -algebras , , where stands for the classical product -algebra. Let denote a probability measure on . If is a random variable with distribution , then is a random measure, i.e., is a (counting) measure on , almost surely, and for every , is a measurable random variable (valued in ). Henceforth is a random variable and, provided that , the map , defined by

(5)

is a probability measure on .

Lemma 2.

Let be a probability measure on such that and be a class of measurable real-valued functions defined on . Then we have, for every ,

where and are given in (4) and (5), respectively. Moreover if is VC with constant envelope and characteristic , then is VC with envelope and characteristic .

Proof.

The proof is inspired by the proof of Lemma 4.2 presented in Levental (1988). Let be such that (4) holds with a -measurable function. Then, using Jensen’s inequality,

Applying this to the function

when each is the center of an -cover of the space and gives the first assertion of the lemma. To obtain the second assertion, note that is an envelope for . In addition, we have that

From this we derive that, for every ,

Then using the first assertion of the lemma, we obtain for every ,

which implies the second assertion whenever the class is VC for the envelope . ∎

Now that we know that any bounded VC class can be extended to an unbounded VC class defined over the blocks, we consider the bounded case which, unsurprisingly, is shown to remain VC.

Lemma 3.

Let be a probability measure on and be a class of measurable real-valued functions defined on . Then we have, for every ,

where . Moreover if is VC with constant envelope and characteristic , then is VC with envelope and characteristic .

Proof.

The proof follows the same lines as the proof of Lemma 2, replacing by .

4 Main result

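We shall distinguish between the following two assumptions on the regeneration time $\tau_A$. We say that $\tau_A$ has polynomial moments (PM) whenever

  • there exists $p > 1$ such that $\mathbb{E}_A[\tau_A^p] < \infty$,

and has some exponential moments (EM) as soon as

  • there exists $\lambda > 0$ such that $\mathbb{E}_A[\exp(\lambda \tau_A)] < \infty$.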

The following result extends concentration inequalities for empirical processes over independent random variables (Giné and Guillou, 2001, 2002; Einmahl and Mason, 2005), e.g., Theorem 1, to Markov chains.

Theorem 4 (block Rademacher complexity).

Assume that the chain satisfies the generic hypothesis (H). Let be VC with constant envelope and characteristic . Let be such that

For some universal constant , and any such that ,

  (i) if (PM) holds, then

  (ii) if (EM) holds, then

    where .

Proof.

First we show that

(6)

for some universal constant . Then we consider the two cases (i) and (ii) to bound accordingly.
Use the decomposition

(7)

where, for any ,

The first term in (7) represents a classical Rademacher complexity as it is a centered empirical process evaluated over the bounded class . It follows from Lemma 3 that the product class is VC with constant envelope . As by assumption, , we deduce by applying Theorem 1 (with in place of ), that

For the second term in (7), we find

Hence (6) is established. To obtain point (i) simply use Markov’s inequality. To obtain (ii), note that

The last inequality follows from which is implied by whenever . ∎

Remark 3 (geometric ergodicity and condition (EM)).

Condition (EM) is equivalent to each of the following assertions : (i) the geometric ergodicity of the chain , (ii) the (uniform) Doeblin condition, as well as (iii) the Foster-Lyapunov drift condition (see Theorem 16.0.2 in Meyn and Tweedie (2009) for the details). Under this assumption, most classical convergence results (for instance, the law of the iterated logarithm or the central limit theorem) are valid (Meyn and Tweedie, 2009, Chapter 17).

Remark 4 (mixing and (PM)).

We point out that the relationship between (PM) and the rate of decay of mixing coefficients has been investigated in Bolthausen (1982): this condition is typically fulfilled as soon as the sequence of strong mixing coefficients decreases at an arithmetic rate , for some .

Remark 5 (choice of the atom).

Finding a small set $S$ in practice can be done by plotting an estimator of the transition density and finding a zone where the density is bounded from below (in practice, $\Psi$ may simply be chosen to be the uniform distribution over the small set).

The two following results show that the block Rademacher complexity, previously introduced, is useful to control the expected values as well as the excess probability of suprema over classes of functions.

Theorem 5 (expectation bound).

Assume that the chain satisfies the generic hypothesis (H). Let be a countable class of measurable functions bounded by . It holds that

where stands for the initial measure.

Proof.

We rely on the block decomposition (3). First, we apply Lemma 1.2.6 in de la Peña and Giné (1999) to treat the term formed by complete blocks. Denote by the class formed by . We obtain

From the triangle inequality and because

we obtain that . The terms corresponding to incomplete blocks are treated as follows. We have

Using Theorem 5, we now rephrase the result of Adamczak (2008) to obtain a concentration bound for the empirical process involving the Rademacher complexity defined previously.

Theorem 6 (concentration bound, Adamczak (2008)).

Assume that the chain satisfies the generic hypothesis (H), (EM) and there exists such that . Let be a countable class of measurable functions bounded by . Let be such that

Then, for some universal constant , and for depending on the tails of the regeneration time, we have, for all ,

yielding alternatively, that for any with probability we have,

Remark 6 (on Theorem 6).

An explicit value for the constant is difficult to obtain from the results of Adamczak (2008) but would be of great interest in practical applications. Notice that for large the right-hand side of the inequality reduces to the bound , which gives the same rate as in the i.i.d. case.

Remark 7 ($m$ different from $1$).

We have reduced our analysis to the case $m = 1$; however, it is now easy to see how the general case can be handled, up to a modified constant in the bound. Recall that when $m > 1$ the blocks are 1-dependent (see for instance Chen (1999), Corollary 2.3). It follows that we can split the sum as follows

Then notice that, because of the $1$-dependence property, the blocks within each sum are independent, and we now have two sums of independent blocks (each containing roughly half of them) that can be treated separately as we did before.
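The splitting argument of this remark can be sketched in a few lines of Python. The sketch assumes that the sum over blocks is split according to the parity of the block index, which is the usual way to turn a 1-dependent sequence into two sequences of independent terms; the function name and the toy block sums are hypothetical.

```python
def split_odd_even(block_values):
    """Split a 1-dependent sequence into its odd- and even-indexed terms.

    With 1-based indexing, `odd` holds B_1, B_3, ... and `even` holds
    B_2, B_4, ...; terms within each list are at least two indices apart.
    """
    odd = block_values[0::2]
    even = block_values[1::2]
    return odd, even

# Toy block sums f(B_1), ..., f(B_7) (illustrative values only).
block_values = [3.0, -1.0, 2.5, 0.0, 4.0, -2.0, 1.5]
odd, even = split_odd_even(block_values)

# The full sum is recovered as the sum of the two sub-sums.
total = sum(odd) + sum(even)
```

Each of the two lists contains at most half of the terms (rounded up), and the bound for the full sum then follows from the triangle inequality applied to the two sub-sums.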

5 Applications

5.1 Kernel density estimator

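Given $n$ observations of a Markov chain $X$ valued in $\mathbb{R}^d$, the kernel density estimator of the stationary measure is given by

$$\hat{f}_n(x) = \frac{1}{n h_n^d} \sum_{i=1}^{n} K\left( \frac{x - X_i}{h_n} \right), \quad x \in \mathbb{R}^d,$$

where $K : \mathbb{R}^d \to \mathbb{R}$, called the kernel, is such that $\int K(x) \, dx = 1$ and $(h_n)_{n \geq 1}$ is a positive sequence of bandwidths.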

The analysis of the asymptotic behavior of the kernel density estimator is traditionally carried out by studying two terms. The bias term is classically treated by using techniques from functional analysis (Giné and Nickl, 2008, section 4.1.1). The variance term is usually treated using empirical process techniques in the case of independent random variables. In what follows, we provide some results on the asymptotic behavior of the variance term.
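As an illustration of the estimator on dependent data, the sketch below evaluates a one-dimensional kernel density estimate on a simulated Markov chain. All specifics are assumptions made for the example: the Gaussian AR(1) chain, the Epanechnikov kernel (a bounded, compactly supported kernel of bounded variation) and the bandwidth $n^{-1/5}$ are illustrative choices, not the ones studied in the paper.

```python
import random

def epanechnikov(u):
    """Epanechnikov kernel: bounded, supported on [-1, 1], integrates to 1."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kde(x, sample, h, kernel=epanechnikov):
    """One-dimensional kernel density estimate at x with bandwidth h."""
    return sum(kernel((x - xi) / h) for xi in sample) / (len(sample) * h)

def ar1_chain(n, rho=0.5, seed=0):
    """Simulate X_{i+1} = rho * X_i + N(0, 1) noise, started at 0."""
    rng = random.Random(seed)
    x, path = 0.0, []
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, 1.0)
        path.append(x)
    return path

chain = ar1_chain(500)
h = len(chain) ** (-1 / 5)  # illustrative bandwidth choice (assumption)

# Evaluate the estimate on a grid and check numerically that it integrates to ~1.
step = 0.02
grid = [-8.0 + step * k for k in range(801)]
density = [kde(x, chain, h) for x in grid]
integral = step * sum(density)
```

The stationary density of this AR(1) chain is Gaussian with variance $1 / (1 - \rho^2)$, so the estimate can be compared with a known target when experimenting with the bandwidth.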

We shall consider kernel functions that take one of the two following forms,

(8)

where is a bounded function of bounded variation with support . From Nolan and Pollard (1987), the class of function

This property has been used to handle the asymptotic analysis of kernel estimates (Giné and Guillou, 2001) as well as in semiparametric problems, as for instance in Portier and Segers (2017).

Theorem 7.

Assume that the chain satisfies the generic hypothesis (H), the stationary density is bounded, the kernel is given by (8) and , for all . Suppose that and there exists such that .

  (i) If (PM) holds for and , we have

  (ii) If (EM) holds and , we have

Proof.

By virtue of Theorem 5, it suffices to provide for both cases a sufficiently tight bound on . First we consider (i). By Jensen's inequality we have

and for any , we get by using the expression of Pitman’s occupation measure