Estimating β-mixing coefficients

03/04/2011 · Daniel J. McDonald, et al.

The literature on statistical learning for time series assumes the asymptotic independence or "mixing" of the data-generating process. These mixing assumptions are never tested, nor are there methods for estimating mixing rates from data. We give an estimator for the β-mixing rate based on a single stationary sample path and show it is L_1-risk consistent.


1 Introduction

Relaxing the assumption of independence is an active area of research in the statistics and machine learning literature. For time series, independence is replaced by the asymptotic independence of events far apart in time, or “mixing”. Mixing conditions make the dependence of the future on the past explicit, quantifying the decay in dependence as the future moves farther from the past. There are many definitions of mixing of varying strength with matching dependence coefficients (see [8, 6, 3] for reviews), but most of the results in the learning literature focus on β-mixing or absolute regularity. Roughly speaking (see Definition 2.1 below for a precise statement), the β-mixing coefficient at lag a is the total variation distance between the actual joint distribution of events separated by a time steps and the product of their marginal distributions, i.e., the distance from independence.

Numerous results in the statistical machine learning literature rely on knowledge of the β-mixing coefficients. As Vidyasagar [24, p. 41] notes, β-mixing is “just right” for the extension of IID results to dependent data, and so recent work has consistently focused on it. Meir [14] derives generalization error bounds for nonparametric methods based on model selection via structural risk minimization. Baraud et al. [1] study the finite sample risk performance of penalized least squares regression estimators under β-mixing. Lozano et al. [12] examine regularized boosting algorithms under absolute regularity and prove consistency. Karandikar and Vidyasagar [11] consider “probably approximately correct” learning algorithms, proving that PAC algorithms for IID inputs remain PAC with β-mixing inputs under some mild conditions. Ralaivola et al. [19] derive PAC bounds for ranking statistics and classifiers using a decomposition of the dependency graph. Finally, Mohri and Rostamizadeh [15] derive stability bounds for β-mixing inputs, generalizing existing stability results for IID data.

All these results assume not just β-mixing, but known mixing coefficients. In particular, the risk bounds in [14, 15] and [19] are incalculable without knowledge of the rates. This knowledge is never available. Unless researchers are willing to assume specific values for a sequence of β-mixing coefficients, the results mentioned in the previous paragraph are generally useless when confronted with data. To illustrate this deficiency, consider Theorem 18 of [15]:

Theorem 1.1 (Briefly).

Assume a learning algorithm is β̂-stable. Then, for any sample of size n drawn from a stationary β-mixing distribution and any ε > 0, the probability that the difference between the true risk and the empirical risk exceeds ε is bounded by a quantity with a particular functional form involving n, ε, the stability constant, and the β-mixing coefficients.

Ideally, one could use this result for model selection or to control the size of the generalization error of competing prediction algorithms (support vector machines, support vector regression, and kernel ridge regression are a few of the many algorithms known to satisfy β̂-stability). However, the bound depends explicitly on the β-mixing coefficients. To make matters worse, there are no methods for estimating the β-mixing coefficients. According to Meir [14, p. 7], “there is no efficient practical approach known at this stage for estimation of mixing parameters.” We begin to rectify this problem by deriving the first method for estimating these coefficients. We prove that our estimator is consistent for arbitrary β-mixing processes. In addition, we derive rates of convergence for Markov approximations to these processes.

Application of statistical learning results to β-mixing data is highly desirable in applied work. Many common time series models are known to be β-mixing, and the rates of decay are known given the true parameters of the process. Among the processes for which such knowledge is available are ARMA models [16], GARCH models [4], and certain Markov processes (see [8] for an overview of such results). To our knowledge, only Nobel [17] approaches a solution to the problem of estimating mixing rates, by giving a method to distinguish between different polynomial mixing rate regimes through hypothesis testing.

We present the first method for estimating the β-mixing coefficients for stationary time series data. Section 2 defines the β-mixing coefficient and states our main results on convergence rates and consistency for our estimator. Section 3 gives an intermediate result on the convergence of the histogram estimator with β-mixing inputs. Section 4 proves the main results from §2. Section 5 concludes and lays out some avenues for future research.

2 Estimation of β-mixing

In this section, we present one of many equivalent definitions of absolute regularity and state our main results, deferring proof to §4.

To fix notation, let X = {X_t}_{t ∈ Z} be a sequence of random variables, where each X_t is a measurable function from a probability space into a measurable space. A block of this random sequence will be given by X_{i:j} = (X_i, …, X_j), where i ≤ j are integers and j may be infinite. We use similar notation for the sigma fields generated by these blocks and their joint distributions. In particular, σ_{i:j} will denote the sigma field generated by X_{i:j}, and the joint distribution of X_{i:j} will be denoted P_{i:j}.

2.1 Definitions

There are many equivalent definitions of β-mixing (see for instance [8] or [3], as well as Meir [14] or Yu [27]); however, the most intuitive is that given in Doukhan [8].

Definition 2.1 (β-mixing).

For each positive integer a, the coefficient of absolute regularity, or β-mixing coefficient, β(a), is

β(a) := sup_t ‖ P_{−∞:t} ⊗ P_{t+a:∞} − P_{−∞:t, t+a:∞} ‖_TV,   (1)

where ‖·‖_TV is the total variation norm and P_{−∞:t, t+a:∞} is the joint distribution of (X_{−∞:t}, X_{t+a:∞}). A stochastic process is said to be absolutely regular, or β-mixing, if β(a) → 0 as a → ∞.

Loosely speaking, Definition 2.1 says that the coefficient β(a) measures the total variation distance between the joint distribution of random variables separated by a time units and a distribution under which random variables separated by a time units are independent. The supremum over t is unnecessary for stationary random processes, which is the only case we consider here.
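To make Definition 2.1 concrete, here is a small numerical sketch (our own illustration, not taken from the paper; the two-state transition matrix is invented). For a first-order Markov chain on a finite state space, the lag-a coefficient can be computed in closed form as the total variation distance between the joint law of (X_t, X_{t+a}) and the product of its marginals, which for a first-order chain coincides with β(a) itself.

```python
import numpy as np

# Hypothetical two-state Markov chain; the transition matrix is chosen only for illustration.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Stationary distribution pi: the left eigenvector of P with eigenvalue 1, normalized.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi = pi / pi.sum()

def beta_lag(P, pi, a):
    """Total variation distance between the joint law of (X_t, X_{t+a})
    and the product of its marginals, for a stationary chain."""
    Pa = np.linalg.matrix_power(P, a)
    joint = pi[:, None] * Pa      # joint[x, y] = pi(x) * P^a(x, y)
    product = np.outer(pi, pi)    # pi(x) * pi(y)
    return 0.5 * np.abs(joint - product).sum()

for a in (1, 2, 5, 10, 20):
    print(a, round(beta_lag(P, pi, a), 6))
```

For this chain the coefficients decay geometrically in a, matching the intuition that dependence on the past dies off as the lag grows.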

Definition 2.2 (Stationarity).

A sequence of random variables X is stationary when all its finite-dimensional distributions are invariant over time: for all t and all non-negative integers i and j, the random vectors X_{t:t+i} and X_{t+j:t+i+j} have the same distribution.

Our main result requires the method of blocking used by Yu [26, 27]. The purpose is to transform a sequence of dependent variables into a subsequence of nearly IID ones. Consider a sample X_{1:n} from a stationary β-mixing sequence with density f. Let μ and a be non-negative integers such that 2μa = n. Now divide X_{1:n} into 2μ blocks, each of length a. Identify the blocks as follows:

U_j := X_{(2j−2)a+1 : (2j−1)a},   V_j := X_{(2j−1)a+1 : 2ja},   j = 1, …, μ.

Let U = (U_1, …, U_μ) be the entire sequence of odd blocks, and let V = (V_1, …, V_μ) be the sequence of even blocks. Finally, let Ũ = (Ũ_1, …, Ũ_μ) be a sequence of blocks which are independent of X_{1:n} but such that each block has the same distribution as a block from the original sequence:

Ũ_j =_d U_j,   j = 1, …, μ.   (2)

The blocks Ũ are now an IID block sequence, so standard results apply. (See [27] for a more rigorous analysis of blocking.) With this structure, we can state our main result.
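As a concrete illustration of this construction, the following sketch (ours; the AR(1) process and the particular values of μ and a are arbitrary) splits a sample of length n = 2μa into the alternating odd and even blocks described above.

```python
import numpy as np

def make_blocks(x, a):
    """Split x (length n = 2 * mu * a) into mu odd blocks U and mu even blocks V,
    each of length a, following the blocking construction above."""
    n = len(x)
    assert n % (2 * a) == 0, "this sketch needs n = 2 * mu * a exactly"
    mu = n // (2 * a)
    blocks = x.reshape(2 * mu, a)   # consecutive blocks of length a
    U = blocks[0::2]                # odd-numbered blocks
    V = blocks[1::2]                # even-numbered blocks
    return U, V

rng = np.random.default_rng(0)
# AR(1) sample, used only as a stand-in for a beta-mixing sequence.
n, phi = 1200, 0.5
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

U, V = make_blocks(x, a=20)
print(U.shape, V.shape)   # (30, 20) (30, 20)
```

In the theory, the odd blocks U are then compared with the independent surrogate blocks Ũ of equation (2); the surrogates exist only as a device in the proofs and need not be constructed from data.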

2.2 Results

Our main result emerges in two stages. First, we recognize that the distribution of a finite sample depends only on finite-dimensional distributions. This leads to an estimator of a finite-dimensional version of β(a). Next, we let the finite dimension increase to infinity with the size of the observed sample.

For positive integers d, a, and n, define

β^d(a) := ‖ P_{1:d} ⊗ P_{1:d} − P_{(d,a)} ‖_TV,   (3)

where P_{(d,a)} is the joint distribution of two blocks of d consecutive observations separated by a time points, (X_{1:d}, X_{d+a:2d+a−1}). Also, let f̂_d be the d-dimensional histogram estimator of the joint density of d consecutive observations, and let f̂_{(d,a)} be the 2d-dimensional histogram estimator of the joint density of two sets of d consecutive observations separated by a time points.

We construct an estimator of β^d(a) based on these two histograms. (While it is clearly possible to replace histograms with other choices of density estimators, most notably KDEs, histograms in this case are more convenient theoretically and computationally; see §5 for more details.) Define

β̂^d(a) := (1/2) ∫ | f̂_{(d,a)}(x, y) − f̂_d(x) f̂_d(y) | dx dy.   (4)

We show that, by allowing d to grow with n, this estimator will converge on β(a). This can be seen most clearly by bounding the L1-risk of the estimator by its estimation and approximation errors:

E | β̂^d(a) − β(a) | ≤ E | β̂^d(a) − β^d(a) | + | β^d(a) − β(a) |.

The first term is the error of estimating β^d(a) with a random sample of data. The second term is the non-stochastic error induced by approximating the infinite-dimensional coefficient, β(a), with its d-dimensional counterpart, β^d(a).
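The sketch below shows one way such an estimator might be computed with numpy. It is our illustration rather than the authors' code, and it makes several assumptions of our own: the data are rescaled to [0, 1], both histograms share a common grid of equal-width bins, the second block starts a steps after the first block ends, and the plug-in form is half the L1 distance between bin probabilities.

```python
import numpy as np

def beta_hat(x, d, a, nbins):
    """Plug-in histogram sketch of the finite-dimensional coefficient at lag a:
    half the L1 distance between the empirical joint distribution of two
    d-blocks separated by a time steps and the product of the d-block marginal.
    Assumes x has been rescaled to lie in [0, 1]."""
    n = len(x)
    edges = np.linspace(0.0, 1.0, nbins + 1)

    # d-dimensional histogram of consecutive d-tuples.
    starts = np.arange(n - d + 1)
    blocks_d = np.stack([x[i:i + d] for i in starts])
    p_d, _ = np.histogramdd(blocks_d, bins=[edges] * d)
    p_d /= p_d.sum()

    # 2d-dimensional histogram of pairs of d-blocks whose starts are d + a apart.
    starts2 = np.arange(n - (2 * d + a) + 1)
    pairs = np.stack([np.concatenate([x[i:i + d], x[i + d + a:i + 2 * d + a]])
                      for i in starts2])
    p_joint, _ = np.histogramdd(pairs, bins=[edges] * (2 * d))
    p_joint /= p_joint.sum()

    # Product of the d-dimensional marginal with itself on the 2d-dimensional grid.
    p_prod = np.multiply.outer(p_d, p_d)
    return 0.5 * np.abs(p_joint - p_prod).sum()

rng = np.random.default_rng(1)
# AR(1) data squashed into (0, 1) with a logistic map, purely for illustration.
n, phi = 5000, 0.6
z = np.zeros(n)
for t in range(1, n):
    z[t] = phi * z[t - 1] + rng.normal()
x = 1.0 / (1.0 + np.exp(-z))

print(beta_hat(x, d=1, a=1, nbins=10), beta_hat(x, d=1, a=10, nbins=10))
```

Since an AR(1) process mixes geometrically, the estimate at lag 10 should come out smaller than at lag 1; the bias and variance of such estimates are what Theorem 2.4 quantifies.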

Our first theorem in this section establishes consistency of β̂^d(a) as an estimator of β(a) for all β-mixing processes, provided d increases at an appropriate rate. Theorem 2.4 gives finite-sample bounds on the estimation error, while some measure-theoretic arguments contained in §4 show that the approximation error must go to zero as d → ∞.

Theorem 2.3.

Let be a sample from an arbitrary -mixing process. Let where is the Lambert function.222The Lambert function is defined as the (multivalued) inverse of . Thus, is bigger than but smaller than . See for example Corless et al. [5]. Then as .

A finite-sample bound for the estimation error is the first step to establishing consistency of β̂^d(a). This result gives convergence rates for estimation of the finite-dimensional mixing coefficient, and also for Markov processes of known order d, since in this case β^d(a) = β(a).

Theorem 2.4.

Consider a sample X_{1:n} from a stationary β-mixing process, and let the number of blocks, the block length, and the histogram bandwidth be chosen compatibly with the blocking construction of §2.1. Then the estimation error E | β̂^d(a) − β^d(a) | satisfies a finite-sample bound whose dominant terms are two exponential concentration terms and blocking terms involving the β-mixing coefficient at the block length, with constants defined in the proof in §4.

Consistency of the estimator is guaranteed only for certain choices of the bandwidth and of d. Clearly, appropriate growth and decay conditions on these quantities as n → ∞ are necessary. Consistency also requires convergence of the histogram estimators to the target densities. We leave the proof of this theorem for Section 4. As an example to show that this bound can go to zero with proper choices of these quantities, the following corollary proves consistency for first-order Markov processes. Consistency of the estimator for higher-order Markov processes can be proven similarly. These processes are algebraically β-mixing, as shown in, e.g., Nummelin and Tuominen [18].

Corollary 2.5.

Let X_{1:n} be a sample from a first-order Markov process. Then, under the conditions of Theorem 2.4, E | β̂(a) − β(a) | → 0 as n → ∞.

Proof.

Recall that, for a first-order Markov process, β^1(a) = β(a), so only the estimation error of Theorem 2.4 needs to vanish. The bound of Theorem 2.4 goes to zero provided the bandwidth and the block length are chosen appropriately as functions of n; with such choices, both exponential terms in the bound are dominated by quantities that vanish. Therefore, both exponential terms go to 0 as n → ∞. ∎

Proving Theorem 2.4 requires showing the L1 convergence of the histogram density estimator with β-mixing data. We do this in the next section.

3 L1 convergence of histograms

Convergence of density estimators is thoroughly studied in the statistics and machine learning literature. Early papers on the convergence of kernel density estimators (KDEs) include [25, 2, 21]; Freedman and Diaconis [9] look specifically at histogram estimators, and Yu [26] considered the convergence of KDEs for β-mixing data and showed that the optimal IID rates can be attained. Devroye and Györfi [7] argue that L1 is a more appropriate metric for studying density estimation, and Tran [22] proves consistency of KDEs under α- and β-mixing. As far as we are aware, ours is the first proof of L1 convergence for histograms under β-mixing.

Additionally, the dimensionality of the target density is analogous to the order of the Markov approximation. Therefore, the convergence rates we give are asymptotic in the bandwidth, which shrinks as n increases, but also in the dimension, which increases with n. Even under these asymptotics, histogram estimation in this sense is not a high-dimensional problem: the dimension of the target density considered here is on the order of the Lambert W function of n, a rate somewhere between log log n and log n.
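To give a feel for the quantity that Theorem 3.1 controls, the following small simulation (ours; the Beta(2, 5) density, sample size, and bin counts are arbitrary choices) approximates the L1 error ∫ | f̂_n − f | of a one-dimensional histogram at several bandwidths.

```python
import numpy as np
from scipy import stats

f = stats.beta(2, 5)                      # an arbitrary smooth density on [0, 1]
x = f.rvs(size=20000, random_state=42)

def hist_l1_error(x, nbins, ngrid=100000):
    """Approximate the L1 distance between an equal-width histogram density
    estimate on [0, 1] and the true density, via a Riemann sum."""
    edges = np.linspace(0.0, 1.0, nbins + 1)
    counts, _ = np.histogram(x, bins=edges)
    fhat = counts / (len(x) * (1.0 / nbins))            # histogram density values
    grid = np.linspace(0.0, 1.0, ngrid, endpoint=False) + 0.5 / ngrid
    which = np.minimum((grid * nbins).astype(int), nbins - 1)
    return np.mean(np.abs(fhat[which] - f.pdf(grid)))   # grid spacing is 1/ngrid

for nbins in (5, 20, 80, 320):
    print(nbins, round(hist_l1_error(x, nbins), 4))
```

Very wide bins lose to bias and very narrow bins lose to variance; the doubly asymptotic analysis above has to balance this tradeoff while the dimension of the histogram is also allowed to grow.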

Theorem 3.1.

If f̂_n is the histogram estimator based on a (possibly vector-valued) sample X_{1:n} from a β-mixing sequence with stationary density f, then for all ε > 0,

(5)

where the constants are specified in the proof below.

To prove this result, we use the blocking method of Yu [27] to transform the dependent β-mixing sequence into a sequence of nearly independent blocks. We then apply McDiarmid’s inequality to the blocks to derive asymptotics in the bandwidth of the histogram as well as the dimension of the target density. For completeness, we state Yu’s blocking result and McDiarmid’s inequality before proving the doubly asymptotic histogram convergence for IID data. Combining these lemmas allows us to derive rates of convergence for histograms based on β-mixing inputs.

Lemma 3.2 (Lemma 4.1 in Yu [27]).

Let g be a measurable function of the block sequence U, uniformly bounded by M. Then,

| E[ g(U) ] − Ẽ[ g(Ũ) ] | ≤ (μ − 1) M β(a),   (6)

where the first expectation is with respect to the dependent block sequence, U, and Ẽ is with respect to the independent sequence, Ũ.

This lemma essentially gives a method of applying IID results to β-mixing data. Because the dependence decays as we increase the separation between blocks, widely spaced blocks are nearly independent of each other. In particular, the difference between expectations over these nearly independent blocks and expectations over blocks which are actually independent can be controlled by the β-mixing coefficient.
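A quick Monte Carlo sketch of this idea (our own construction; the AR(1) process, the block sizes, and the particular bounded functional are all arbitrary): the expectation of a bounded functional of widely spaced blocks taken from one long sample is close to the same expectation computed on blocks drawn from independent copies of the process.

```python
import numpy as np

rng = np.random.default_rng(3)
phi, a, mu = 0.7, 25, 20          # AR(1) coefficient, block length, number of odd blocks

def ar1(n):
    """Stationary AR(1) sample of length n."""
    x = np.zeros(n)
    x[0] = rng.normal() / np.sqrt(1.0 - phi ** 2)   # start from the stationary law
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

def odd_blocks(x):
    return x[:2 * mu * a].reshape(2 * mu, a)[0::2]   # the mu odd blocks of length a

def g(blocks):
    """A functional of the block sequence bounded by 1: the fraction of blocks
    whose sample mean exceeds 0.5."""
    return np.mean(blocks.mean(axis=1) > 0.5)

reps = 500
dep = np.mean([g(odd_blocks(ar1(2 * mu * a))) for _ in range(reps)])
# Independent version: every block comes from its own independent run of the chain.
ind = np.mean([g(np.stack([ar1(a) for _ in range(mu)])) for _ in range(reps)])
print(round(dep, 3), round(ind, 3))   # nearly equal, as the lemma predicts
```

Because the odd blocks are separated by gaps of length a, their joint behaviour is close to that of genuinely independent blocks, and the discrepancy between the two printed values is of the order that the β-mixing coefficient at lag a allows.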

Lemma 3.3 (McDiarmid Inequality [13]).

Let X_1, …, X_m be independent random variables, with X_i taking values in a set A_i for each i. Suppose that the measurable function g satisfies

| g(x) − g(x′) | ≤ c_i

whenever the vectors x and x′ differ only in the i-th coordinate. Then for any ε > 0,

P( | g(X_1, …, X_m) − E[ g(X_1, …, X_m) ] | ≥ ε ) ≤ 2 exp( −2ε² / Σ_{i=1}^m c_i² ).

Lemma 3.4.

For an IID sample X_{1:n} from some density f on [0, 1]^d,

(7)
(8)

where f̂_n is the histogram estimate using a grid with sides of length h_n.

Proof of Lemma 3.4.

Let p_i be the probability of an observation falling into bin i. Then,

For the second claim, consider the bin centered at the point x, and let the support of f be the union of all such bins. Assume the following:

  1. f is absolutely continuous on the bin, with a.e. partial derivatives ∂f/∂x_i;

  2. each ∂f/∂x_i is absolutely continuous on the bin, with a.e. partial derivatives ∂²f/∂x_i∂x_j;

  3. the second-order partial derivatives are bounded for all i and j.

Using a Taylor expansion

where . Therefore, is given by

since the integral of the second term over the bin is zero. This means that for the bin,

Therefore,

Since each bin is bounded, we can sum over all bins. The number of bins is (1/h_n)^d by definition, so

We can now prove the main result of this section.

Proof of Theorem 3.1.

Let L_n be the L1 loss of the histogram estimator, L_n = ∫ | f̂_n − f |. Here f̂_n(x) is the fraction of observations falling in B(x) divided by the bin volume, where B(x) is the bin containing x. Let f̂_U, f̂_V, and f̂_Ũ be histograms based on the block sequences U, V, and Ũ respectively. Clearly the loss of f̂_n is controlled by the losses of these block-based histograms. Now,

where . Here,

so by Lemma 3.4, as long as for , and , then for all there exists such that for all , . Now applying Lemma 3.2 to the expectation of the indicator of the event gives

where the probability on the right is with respect to the σ-field generated by the independent block sequence Ũ. Since these blocks are independent, showing that the loss satisfies the bounded differences requirement allows for the application of McDiarmid’s inequality (Lemma 3.3) to the blocks. For any two block sequences differing in only one block, we have

Therefore,

4 Proofs

The proof of Theorem 2.4 relies on the triangle inequality and the relationship between total variation distance and the L1 distance between densities.
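That relationship, ‖P − Q‖_TV = ½ ∫ |p − q|, is easy to check numerically on a finite space; in the sketch below (with two made-up distributions on four points) the largest discrepancy over events equals half the L1 distance between the probability vectors.

```python
import itertools
import numpy as np

p = np.array([0.5, 0.2, 0.2, 0.1])   # two arbitrary distributions on four points
q = np.array([0.3, 0.3, 0.1, 0.3])

# Total variation distance: the largest |P(A) - Q(A)| over all events A.
tv = max(abs(p[list(A)].sum() - q[list(A)].sum())
         for r in range(5) for A in itertools.combinations(range(4), r))

print(tv, 0.5 * np.abs(p - q).sum())   # both equal 0.3
```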

Proof of Theorem 2.4.

For any probability measures P and Q defined on the same probability space, with associated densities p and q with respect to some dominating measure ν,

‖P − Q‖_TV = (1/2) ∫ | p − q | dν.

Let P_d be the d-dimensional stationary distribution of the order-d Markov process, i.e., P_{1:d} in the notation of equation (3). Let P_{(d,a)} be the joint distribution of the bivariate random process created by the initial process and itself separated by a time steps. By the triangle inequality, we can upper bound the difference between β̂^d(a) and β^d(a). Let P̂_d and P̂_{(d,a)} be the distributions associated with the histogram estimators f̂_d and f̂_{(d,a)} respectively. Then,

where β̂^d(a) is our estimator and the remaining terms are L1 distances between a density estimator and the target density. Thus,

A similar argument starting from β^d(a) shows that

so we have that

Therefore,

where the constants are those appearing in the statement of the theorem. ∎

The proof of Theorem 2.3 requires two steps, which are given in the following lemmas. The first specifies the histogram bandwidth and the rate at which d (the dimensionality of the target density) goes to infinity. If the dimensionality of the target density were fixed, we could achieve rates of convergence similar to those for histograms based on IID inputs. However, we wish to allow the dimensionality to grow with n, so the rates are much slower, as shown in the following lemma.

Lemma 4.1.

For the histogram estimator in Lemma 3.4, let

with

These choices lead to the optimal rate of convergence.

Proof.

Let for some to be determined. Then we want , , and all as . Call these , , and . Taking and first gives

(9)

Similarly, combining and gives

(10)

Equating (9) and (10) and solving for the bandwidth gives

where W is the Lambert W function. Plugging back into (9) gives that

where

It is also necessary to show that, as d grows, β^d(a) converges to β(a). We now prove this result.

Lemma 4.2.

β^d(a) converges to β(a) as d → ∞.

Proof.

By stationarity, the supremum over t is unnecessary in Definition 2.1, so without loss of generality, let t = 0. Let P− be the distribution on σ_{−∞:0}, and let P+ be the distribution on σ_{a:∞}. Let P−+ be the joint distribution on σ_{−∞:0} ⊗ σ_{a:∞} (the product sigma-field). Then we can rewrite Definition 2.1 using this notation as

Let σ_{−d+1:0} and σ_{a:a+d−1} be the sub-σ-fields of σ_{−∞:0} and σ_{a:∞} consisting of the d-dimensional cylinder sets for the d dimensions closest together, and consider the product σ-field of these two. Then we can rewrite β^d(a) as

(11)

As such, β^d(a) ≤ β(a) for all d and a. We can rewrite (11) in terms of finite-dimensional marginals:

where each measure is restricted to the corresponding finite-dimensional σ-field. Because of the nested nature of these sigma-fields, we have

for all finite d. Therefore, for fixed a, β^d(a) is a monotone increasing sequence in d which is bounded above, and it converges to some limit. To show that this limit equals β(a) requires some additional steps.

Let , which is a signed measure on . Let , which is a signed measure on