Equivariant Variance Estimation for Multiple Change-point Model

08/21/2021 · Ning Hao, et al.

The variance of noise plays an important role in many change-point detection procedures and the associated inferences. Most commonly used variance estimators require strong assumptions on the true mean structure or normality of the error distribution, which may not hold in applications. More importantly, the qualities of these estimators have not been discussed systematically in the literature. In this paper, we introduce a framework of equivariant variance estimation for multiple change-point models. In particular, we characterize the set of all equivariant unbiased quadratic variance estimators for a family of change-point model classes, and develop a minimax theory for such estimators.

1 Introduction

Accurate estimation of the noise variance is crucial for statistical inference. When the noise is additive, one natural and commonly used variance estimator is based on the residual sum of squares (RSS) after mean estimation. However, variance estimation can be easier than mean estimation for many statistical models, especially when the mean is a complex function involving many parameters or is nonparametric. More importantly, many mean estimation procedures require the variance to be known or pre-estimated. Techniques for estimating the variance without estimating the mean function have been developed in the contexts of high dimensional linear regression

(Fan et al., 2012), additive models (Chen et al., 2018), and nonparametric regression (Hall et al., 1990; Tong et al., 2013), among many others.

In this paper, we focus on the problem of noise variance estimation for a multiple change-point model. In particular, consider a sequence of random variables $y_1, \dots, y_n$ satisfying

$$y_i = \theta_i + \varepsilon_i, \qquad i = 1, \dots, n, \qquad (1)$$

$$\theta_i = \mu_j \quad \text{for } \tau_{j-1} < i \le \tau_j, \quad j = 1, \dots, J+1, \qquad (2)$$

where the mean vector $\theta = (\theta_1, \dots, \theta_n)$ is piecewise constant and $\tau = (\tau_1, \dots, \tau_J)$, with the convention $\tau_0 = 0$ and $\tau_{J+1} = n$, is the location vector of change points. We assume that the noises $\varepsilon_1, \dots, \varepsilon_n$ are independent and identically distributed (i.i.d.) with $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$. This multiple change-point model has received and continues to draw significant attention in the literature because of its wide applications in biological science, econometrics, engineering and many other fields; see Frick et al. (2014) and the recent review by Niu et al. (2016).

A premier goal of change-point detection is to estimate and make inferences about the change-point location vector $\tau$. A good variance estimator is vital in many change-point detection procedures. For example, in binary segmentation and related methods, the variance is required to decide when to stop the recursive procedure. In other methods, for example, the screening and ranking algorithm (SaRa) in Niu & Zhang (2012) and the simultaneous multiscale change-point estimator (SMUCE) in Frick et al. (2014), the choice of tuning or thresholding parameters depends on the variance. In general, it is important to gauge the noise level, which determines the optimal detection boundary and detectability of the change-point problem (Arias-Castro et al., 2005). Moreover, an accurate and reliable estimate of the variance is necessary for constructing confidence sets of the change points. In practice, the noise variance is usually needed and estimated as the first step of a change-point analysis. However, most commonly used variance estimators, reviewed in Section 3.1, are based on technical assumptions and can be severely biased when these assumptions fail to hold. The quality of these estimators, such as unbiasedness and efficiency, has been understudied. In fact, to the best of our knowledge, the exact unbiased variance estimator under a finite sample setup has not been discussed before this work. There are two main challenges in error variance estimation for change-point models. First, information on the mean structure, such as the number of change points and the jump magnitudes, is unknown, while complex mean structures often make variance estimation more difficult. Second, the noise may not be Gaussian in practice, while many methods work well only under normality. In spite of the importance of this problem and these issues, there has been no systematic study of variance estimation for the multiple change-point model (1). This work aims to fill this gap.

Our approach is inspired by the classical difference-based variance estimation in nonparametric regression, studied in Rice (1984), Gasser et al. (1986), Müller & Stadtmüller (1987), and Hall et al. (1990), among many others. In particular, Müller & Stadtmüller (1999) innovatively built a variance estimator by regressing the lag-$k$ Rice estimators on the lags, in the context of nonparametric regression with discontinuities. Recent developments along this direction include Tong et al. (2013) and Tecuapetla-Gómez & Munk (2017); see also the recent review by Levine & Tecuapetla-Gomez (2019). These works focused on asymptotic analysis of variance estimation for more flexible models, and hence required much stronger conditions on the number of change points or discontinuities. In contrast to the existing literature, we narrow down to change-point models, but the thrust of our study is to obtain exact and nonasymptotic results regarding the unbiasedness and the minimax risk of the variance estimators, under minimal conditions. To the best of our knowledge, similar results have not appeared in the literature, and they are difficult to obtain without the equivariance framework introduced in this paper.

In this paper, we develop a new framework of equivariant variance estimation. Roughly speaking, we embed the index set on a circle instead of the usual straight line segment, so that the indices 1 and $n$ are neighbors. In other words, there is no ‘head’ or ‘tail’ in the index set, and every position plays the same role. As we will illustrate in Section 2.3, there is a natural cyclic group action on the index set, which leads to an equivariant estimation framework. Under this framework, we are able to characterize all the equivariant unbiased quadratic variance estimators for a family of change-point model classes, and establish a minimax theory on variance estimation. This family of change-point model classes is indexed by a positive integer $L$, which is the minimal distance between change-point locations allowed for any change-point model in the class. In other words, the class with index $L$ includes all change-point models whose mean vector is piecewise constant in $\mathbb{R}^n$ with each constant piece having length at least $L$. In general, a smaller $L$ leads to a broader model class and, hence, a higher minimax risk. In this work, we give both lower and upper bounds in nonasymptotic forms for the minimax risk of equivariant unbiased quadratic estimators for these model classes. Another advantage of the equivariant framework is that it requires minimal assumptions on the noise distribution. In fact, our theoretical analysis relies on no assumption other than the existence of the fourth moment. In particular, the performance of the proposed framework is guaranteed even for skewed or heavy-tailed distributions. We also note that the notion of equivariance has not been sufficiently explored in the literature, with the exception of Olshen et al. (2004), which focuses on short segment detection rather than a framework of equivariant estimation.

To summarize the main contributions of our work: first, we introduce a new framework of equivariant variance estimation, and characterize the equivariant unbiased quadratic variance estimators for a family of change-point model classes. This framework resembles the classical theory of linear unbiased estimation, but is technically more involved. Second, we derive nonasymptotic lower and upper minimax risk bounds for the proposed estimators. In particular, in Corollary 2, we give a surprisingly simple and exact answer to the minimax problem, with an explicit minimax risk, for a broad change-point model class. Third, our approach requires minimal model assumptions on the noise distribution and mean structure, which can hardly be weakened further. Last but not least, we suggest an equivariant variance estimator that is computationally simple and practically useful in applications. As a by-product, we derive the explicit risk of the regression based estimator proposed by Müller & Stadtmüller (1999) and theoretically compare it with that of our method. In the numerical studies, compared to an oracle variance estimator that knows the true mean, the relative efficiency of our method is often within 1.5 across different scenarios. Our estimator significantly outperforms some popular methods in terms of accuracy and robustness.

2 Variance estimation

2.1 Model descriptions

In model (1), we assume that the data vector $y = (y_1, \dots, y_n)$ is observed and indexed by the set $\{1, \dots, n\}$. We define a segment, denoted by $[a, b]$, as a subset of $\{1, \dots, n\}$ consisting of consecutive integers $a, a+1, \dots, b$. The working model (1) is standard and widely used in the literature. Here we make, and emphasize, a key extension. That is, the index set is arranged on a circle, and the indices 1 and $n$ do not play special roles as start and end points. Consequently, a segment $[a, b]$ with $a > b$ is also well-defined; for example, $[n-1, 2] = \{n-1, n, 1, 2\}$. For the mean vector with the form (2), we assume that it consists of segments with constant means, separated by the change points. Denote the common value of $\theta_i$ on the $j$-th segment by $\mu_j$. For a mean vector $\theta$, we denote by $L(\theta)$ the minimal length of all constant segments in $\theta$. The magnitude of $L(\theta)$ is a complexity measure of a change-point model. We will consider a family of nested model classes, where the class with index $L$ is

$$\{\theta \in \mathbb{R}^n : L(\theta) \ge L\}. \qquad (3)$$

In general, the larger $L$ is, the easier the change-point analysis. In particular, when $L = 1$, each observation can have its own mean, different from all others, and there is no sensible change-point problem. Therefore, we only consider the case $L \ge 2$ in this paper. Note that, by definition, $L(\theta) = n$ if $\theta$ is a constant vector, and otherwise $L(\theta) \le n/2$.

Note that the classical model treats the first segment and the last segment of $\theta$ as two separate segments. That is, the index $n$ is treated as a known change point, no matter whether $\theta_n = \theta_1$ or not. The classical model classes can be defined by

(4)

In fact, each classical class is contained in the corresponding circular class (3) by definition, and the containment is strict in general: a mean vector whose first and last segments share the same value, so that they merge into one sufficiently long segment on the circle, may belong to the circular class but not the classical one. The larger generality of the circular classes over the classical ones is usually negligible in real applications. However, as we will see, it is advantageous to work with the family (3) to obtain clean theoretical results.

We use $i$ and $j$ to denote data indices, and $K$ and $L$ to denote lags and segment lengths. Occasionally, an index may go beyond the range $\{1, \dots, n\}$ in formulas; in that case, it is identified with the unique integer in $\{1, \dots, n\}$ congruent to it modulo $n$, so that, for instance, $y_{n+i} = y_i$. An analogous modular convention is used for the indices of change points. The length of a segment $[a, b]$ is defined as the cardinality of the set, which is $b - a + 1$ when $a \le b$ and $n + b - a + 1$ otherwise.

We assume the following condition on the error distribution in this paper.

Condition 1. $\varepsilon_1, \dots, \varepsilon_n$ are i.i.d. with $E(\varepsilon_1) = 0$, $\mathrm{Var}(\varepsilon_1) = \sigma^2$, and $E(\varepsilon_1^4) < \infty$.

We view this assumption as a “minimal” one for the variance estimation problem, because there is no distributional assumption. The existence of the fourth moment is necessary for studying the mean squared error of the variance estimator.

We define two quantities related to the mean structure:

$$\Delta = \sum_{i=1}^{n} (\theta_{i+1} - \theta_i)^2 \quad \text{and} \quad \tilde{\Delta} = \sum_{i=1}^{n-1} (\theta_{i+1} - \theta_i)^2,$$

where $\theta_{n+1} = \theta_1$. In fact, $\Delta$ and $\tilde{\Delta}$ measure the total variation of the mean vector in the squared $\ell_2$-norm, over the circular and classical index sets respectively. There is no change point in the sequence if and only if $\tilde{\Delta} = 0$.

With the convention that $y_{n+i} = y_i$, we define

$$s_k = \frac{1}{2n} \sum_{i=1}^{n} (y_{i+k} - y_i)^2, \qquad k = 1, \dots, n-1,$$

which plays a central role in our variance estimation framework. In fact, $s_k$ can be considered as a circular version of the lag-$k$ Rice estimator, defined as

$$\tilde{s}_k = \frac{1}{2(n-k)} \sum_{i=1}^{n-k} (y_{i+k} - y_i)^2.$$

In particular, $\tilde{s}_1$ is called the Rice estimator, introduced in Rice (1984).
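For concreteness, here is a small Python sketch of the two statistics as reconstructed above; the function names are ours, and the formulas assume the circular convention $y_{n+i} = y_i$ described in this subsection.

```python
import numpy as np

def circular_lag_stat(y, k):
    """Circular lag-k statistic s_k = (1/(2n)) * sum_i (y_{i+k} - y_i)^2,
    with the index i + k interpreted modulo n (no 'head' or 'tail')."""
    n = len(y)
    diff = np.roll(y, -k) - y          # y_{i+k} - y_i, wrapping around the circle
    return np.sum(diff ** 2) / (2 * n)

def rice_lag_stat(y, k):
    """Classical (non-circular) lag-k Rice statistic, without wrap-around."""
    n = len(y)
    diff = y[k:] - y[:-k]
    return np.sum(diff ** 2) / (2 * (n - k))

# rice_lag_stat(y, 1) is the classical Rice (1984) estimator.
```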

2.2 An equivariant approach for variance estimation

The means and covariances of the $s_k$'s can be calculated as follows.

Proposition 1

Under Condition 1, for $1 \le k \le L(\theta)$,

$$E(s_k) = \sigma^2 + \frac{k}{2n}\,\Delta.$$

Moreover, the variances and covariances of $s_1, \dots, s_{L(\theta)}$ can be calculated explicitly; they involve $\sigma^4$, the fourth moment $E(\varepsilon_1^4)$, and the mean structure.

With Proposition 1, we rescale the lag and consider a regression model

$$s_k = \sigma^2 + \Delta\, x_k + e_k, \qquad k = 1, \dots, K, \qquad (5)$$

where $x_k = k/(2n)$ and $e_k$ is a noise term with mean zero and covariance

$$\mathrm{Cov}\{(e_1, \dots, e_K)^{\top}\} = \Sigma, \qquad (6)$$

where $\Sigma$ is a combination of a multiple of the identity matrix $I_K$, a multiple of $\mathbf{1}_K \mathbf{1}_K^{\top}$ ($\mathbf{1}_K$ being the vector of length $K$ with all entries equal to 1), and a matrix determined by the mean structure and the noise moments. As $s_1, \dots, s_K$ and $x_1, \dots, x_K$ are easily calculated from the data, we can estimate the variance, i.e., the intercept in the regression model (5), by the ordinary least squares (OLS) estimator, denoted by $\hat{\sigma}^2_K$. Specifically, let $\bar{x} = K^{-1} \sum_{k=1}^{K} x_k$, $\bar{s} = K^{-1} \sum_{k=1}^{K} s_k$, and $\hat{\Delta} = \sum_{k=1}^{K} (x_k - \bar{x})(s_k - \bar{s}) \big/ \sum_{k=1}^{K} (x_k - \bar{x})^2$; then

$$\hat{\sigma}^2_K = \bar{s} - \hat{\Delta}\, \bar{x}. \qquad (7)$$
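As an illustration (assuming the reading of (5) and (7) given above), the EVE can be computed by a simple linear regression of $s_1, \dots, s_K$ on the lags, taking the intercept; since the OLS intercept is invariant to a linear rescaling of the regressor, regressing on $k$ or on $k/(2n)$ gives the same estimator. The helper `circular_lag_stat` is the sketch from Section 2.1, and the example data are hypothetical.

```python
import numpy as np

def eve_ols(y, K=10):
    """Equivariant variance estimator: OLS intercept from regressing the circular
    lag statistics s_1, ..., s_K on the lags k = 1, ..., K."""
    s = np.array([circular_lag_stat(y, k) for k in range(1, K + 1)])
    lags = np.arange(1, K + 1, dtype=float)
    slope = np.sum((lags - lags.mean()) * (s - s.mean())) / np.sum((lags - lags.mean()) ** 2)
    return s.mean() - slope * lags.mean()   # intercept estimates sigma^2

# Quick check on a hypothetical piecewise constant mean with unit-variance noise
rng = np.random.default_rng(0)
theta = np.repeat([0.0, 2.0, -1.0, 1.0], 250)   # four segments of length 250
y = theta + rng.normal(size=theta.size)
print(eve_ols(y, K=10))                          # should be close to 1
```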
Theorem 1

Assume Condition 1. The OLS estimator is unbiased when . Moreover, if , we have

(8)

If ,

(9)

Theorem 1 gives an exact risk of the variance estimator when the minimal segment length is large enough relative to $K$. Note that in this case the risk depends on the mean vector $\theta$ only through its total variation $\Delta$. Otherwise, the exact risk also depends on other information about the mean besides the total variation $\Delta$; see Theorem 3 for more details. In the proof of Theorem 1 in the appendix, we show that the equality in (9) is achieved by a specific $\theta$ for which the number of segments is even, all segments are of the same length, and the segment means have the same absolute value but alternating signs. Therefore, the upper bound provided in (9) is tight.

There are three summands in the risk formula (8). The first summand is equal to the risk of

$$\hat{\sigma}^2_{\mathrm{oracle}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta_i)^2, \qquad (10)$$

which is the oracle estimator when the true mean is known. According to Proposition 1, the generalized least squares (GLS) estimator based on model (5) is obtained using the covariance matrix (6). Clearly, the GLS estimator depends on the mean structure through the covariance (6). In a special case, the covariance is compound symmetric, so the OLS and GLS estimators coincide (McElroy, 1967), and the corresponding risk can be computed explicitly. Therefore, the first two summands in (8) cannot be reduced for any linear unbiased estimator based on $s_1, \dots, s_K$. We will elaborate on the related minimax theory in subsection 2.4.

We may also calculate the means and covariances of the classical lag statistics $\tilde{s}_k$.

Proposition 2

Under Condition 1 with , for ,

Moreover, if , for ,

and for ,

To the best of our knowledge, Müller and Stadtmüller were the first to construct variance estimators via a regression approach based on the $\tilde{s}_k$'s (Müller & Stadtmüller, 1999). They studied variance estimation and tests for jump points in nonparametric estimation under an asymptotic setting as $n \to \infty$.

Remark. The condition in Proposition 2 means that, when studying the properties of the $\tilde{s}_k$'s, we consider the classical change-point model, in which the first segment starts at index 1 and the last segment ends at index $n$.

Compared with the $s_k$'s, the mean and covariance structure of the $\tilde{s}_k$'s is more complex. Moreover, Proposition 2 requires one more condition, $E(\varepsilon_1^3) = 0$, i.e., zero skewness. The following proposition gives a precise comparison of the OLS estimators based on the $s_k$'s and the $\tilde{s}_k$'s.

Proposition 3

Assume Condition 1, , and . Let be the OLS estimator obtained by using in place of . Then is unbiased when . Moreover, if , we have

If and ,

We call this estimator the Müller-Stadtmüller (MS) estimator. As an immediate consequence of Theorem 1 and Proposition 3, the difference between the variances of the two estimators can be written explicitly when both are unbiased.

It follows that each of the two estimators has the smaller variance under certain configurations of the mean structure and noise moments, and the difference between their variances vanishes asymptotically. So these two estimators often perform similarly, which is also verified by our numerical studies. In this paper, we aim to derive nonasymptotic and exact risk bounds for the variance estimators, which seems too complicated using the $\tilde{s}_k$'s. Therefore, we focus on the $s_k$'s subsequently and introduce the equivariant framework in the next subsection.
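As a sketch (not the authors' code), the MS estimator of Proposition 3 can be computed by replacing the circular statistics with the classical lag-$k$ Rice statistics in the same OLS regression, reusing `rice_lag_stat` from the sketch in Section 2.1; here we regress on the raw lag $k$ for simplicity, while the exact regressor in (5) may be rescaled.

```python
import numpy as np

def ms_estimator(y, K=10):
    """Muller-Stadtmuller style estimator: OLS intercept from regressing the
    classical lag-k Rice statistics on the lags k = 1, ..., K."""
    s = np.array([rice_lag_stat(y, k) for k in range(1, K + 1)])
    lags = np.arange(1, K + 1, dtype=float)
    slope = np.sum((lags - lags.mean()) * (s - s.mean())) / np.sum((lags - lags.mean()) ** 2)
    return s.mean() - slope * lags.mean()   # intercept = variance estimate
```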

2.3 Equivariant unbiased estimation

Geometrically, we can embed the index set $\{1, \dots, n\}$ into the unit circle by the exponential map $i \mapsto \exp(2\pi \mathrm{i}\, i/n)$. The embedded set is invariant under the natural action of $\mathbb{Z}_n$, the cyclic group of order $n$, where the element $1 \in \mathbb{Z}_n$ maps the circle to itself via a rotation by the angle $2\pi/n$. This group action naturally induces a group action of $\mathbb{Z}_n$ on the sample space $\mathbb{R}^n$, where the element 1 maps an $n$-vector $(y_1, \dots, y_n)$ to $(y_2, \dots, y_n, y_1)$. There is another way to represent this group action via circulant matrices. Define $P$ as the circulant permutation matrix whose $(i, j)$ entry equals 1 if $j = i + 1$ (modulo $n$) and 0 otherwise.

Again, we may treat the subscripts in $P$ as numbers modulo $n$. It is easy to verify that $P^a P^b = P^{a+b}$ holds under standard matrix multiplication and $P^n = I_n$, so $\{P, P^2, \dots, P^n\}$ is a group isomorphic to $\mathbb{Z}_n$. Under this isomorphism, the group action can be represented by the matrix multiplication $y \mapsto P^a y$. Note that both the parameter space of the mean vector and the sample space are $\mathbb{R}^n$ for the change-point model. An estimator $\hat{\theta}$ of the mean vector is called equivariant if and only if $\hat{\theta}(P^a y) = P^a \hat{\theta}(y)$ for all $a$, i.e., the estimation procedure commutes with the group action. For the problem of variance estimation, as the group action does not affect the value of the variance parameter $\sigma^2$, a variance estimator $\hat{\sigma}^2$ is equivariant (or simply invariant) if $\hat{\sigma}^2(P^a y) = \hat{\sigma}^2(y)$ for all $a$.
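The invariance of the $s_k$'s under this group action can be checked numerically. The sketch below represents the action of $P^a$ by an explicit cyclic shift of the coordinates and reuses `circular_lag_stat` from Section 2.1; it is a minimal numerical check under our reconstruction, not a formal argument.

```python
import numpy as np

def cyclic_shift(y, a):
    """Action of the group element a in Z_n on the sample space: rotate the
    coordinates of y, i.e., multiply y by the circulant permutation matrix P^a."""
    return np.roll(y, a)

rng = np.random.default_rng(1)
y = rng.normal(size=100)
for a in (1, 7, 50):
    # s_k is invariant under the group action: shifting the data leaves it unchanged
    assert np.isclose(circular_lag_stat(y, 3), circular_lag_stat(cyclic_shift(y, a), 3))
```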

In this sense, $s_k$ is an equivariant (in fact, invariant) version of $\tilde{s}_k$, because the values of the $s_k$'s remain the same under the group action. Consequently, we have

Proposition 4

Each $s_k$ is an equivariant statistic. Under Condition 1, the OLS estimator $\hat{\sigma}^2_K$ is an equivariant variance estimator and is unbiased for any model with $L(\theta) \ge K$.

We consider the class of quadratic estimators of the form $\hat{\sigma}^2 = y^{\top} A y$, or $\sum_{i,j} A_{ij} y_i y_j$, where $A$ is a symmetric matrix. It is straightforward to see that $s_k = y^{\top} A_k y$ for a symmetric matrix $A_k$. That is, the $s_k$'s and their linear combinations are quadratic estimators. It turns out that any equivariant unbiased quadratic variance estimator for the model class with index $L$ must be a linear combination of $s_1, \dots, s_L$, as characterized by the following theorem.

Theorem 2

The set of all equivariant unbiased quadratic variance estimators for the model class with index $L$ is

$$\Big\{ \sum_{k=1}^{L} c_k s_k : \ \sum_{k=1}^{L} c_k = 1, \ \sum_{k=1}^{L} k\, c_k = 0 \Big\}.$$

Interestingly, for $L = 2$ this set consists of only one estimator, namely $2 s_1 - s_2$. As a corollary of Theorems 1 and 2, we have

Corollary 1

The OLS estimator $\hat{\sigma}^2_2 = 2 s_1 - s_2$ is the unique quadratic equivariant unbiased variance estimator for the model class with $L = 2$. Its variance satisfies the bound given in Theorem 1 with $K = 2$.

Before we conclude this subsection, we point out that it is also possible to characterize the unbiased quadratic estimators over the classes of classical change-point models defined in (4). It turns out that this characterization is much more complicated than Theorem 2. Furthermore, the variance of an unbiased estimator over a classical class also depends on the mean vector in a more complicated way. These observations give us another motivation to consider the equivariant estimators over the larger circular classes. We discuss the unbiased estimators over the classical classes in more detail in Appendix D.

2.4 Minimax risk

Theorem 2 states that all equivariant unbiased quadratic estimators for the model class with index $L$ are linear combinations of $s_1, \dots, s_L$, including the OLS estimator studied in subsection 2.2. A natural question is whether the OLS estimator is optimal and, if not, how far it is from an optimal estimator. In this subsection, we answer this question from the perspective of minimax theory.

Consider the class of all equivariant unbiased quadratic estimators over the model class with index $L$. For any such estimator, define its risk up to the factor $\sigma^4$; this risk is scale invariant by definition. As we will show soon, for a fixed model, the risk of the optimal estimator depends on the minimal segment length $L(\theta)$ and the ratio $\Delta/\sigma^2$. Therefore, we consider a model class indexed by two parameters in our minimax analysis, which bound these two quantities from below and above, respectively. Define the minimax risk of all equivariant unbiased estimators over this model class as follows.

(11)

We can solve the minimax problem for the case $L = 2$ as a simple corollary of Theorems 1 and 2.

Corollary 2

is the minimax estimator for model class with minimax risk with equality holding when is a multiple of 4.

Corollary 2 gives an elegant minimax solution for the broadest model class considered in this paper. At the level $L = 2$, the OLS estimator is optimal, no matter what value the ratio $\Delta/\sigma^2$ takes. Intuitively, as $L$ grows and the model class shrinks, we may borrow more information from neighbors because of the piecewise constant mean structure, and obtain a lower minimax risk. Nevertheless, the minimax estimator and the exact risk are difficult to find for larger $L$. We will instead provide both lower and upper bounds for the minimax risk. We first calculate the risk of any equivariant unbiased estimator.

Theorem 3

Let such that and . For , the risk of is

(12)

where is a matrix with

(13)

As shown in the proof of Proposition 5, the quadratic form in (12) is positive definite on the constrained linear space in which the coefficient vector lies. Therefore, we can minimize the risk (12) to obtain the optimal estimator for any model in the class, putting aside the fact that the solution may depend on unknown parameters. Because all estimators in the class are linear combinations of the $s_k$'s, they are also linear estimators of the intercept in model (5). It is not surprising that the optimization problem (12) has the same optimal solution as the (generalized) least squares problem for (5). We state the result formally below.

Proposition 5

There is a unique solution to the optimization problem

Let be the minimizer for a model . Then is the GLS estimator of model (5) with . Moreover, if , then depends on the model only through .

By minimizing (12) subject to the linear constraints, we can easily find the optimal estimator and the corresponding risk for an individual model. Nevertheless, we see from (13) that the risk depends on features of the mean vector that are not functions of the total variation alone. Thus, there is no simple way to characterize the behavior of the optimal solution for all models in the class. As a result, it is a highly nontrivial problem to identify the minimax estimator and the minimax risk.

In Theorem 4, we will provide both lower and upper bounds for the minimax risk. We first introduce the main ideas and some necessary notation. We consider the OLS and GLS estimators and their risks over the model class to bound the minimax risk. For OLS, formula (9) in Theorem 1 implies an upper bound on the minimax risk.

(14)

For GLS, we consider a smaller model class over which the GLS estimator depends on the model only through the ratio $\Delta/\sigma^2$. Specifically, taking the covariance matrix (6) evaluated at the boundary of this class, we define the corresponding GLS estimator based on (5) and this covariance matrix.

The maximal risk of the GLS over can be derived to offer a lower bound of the minimax risk. Finally, we study a GLS estimator based on an upper bound of the covariance structure (6) and its maximal risk over , which leads to a minimax upper bound different from (14).

Let be the sequence defined recursively by with initial values . Define the matrix

and define

(15)

where is the top left entry of the matrix .

Theorem 4

Let be the minimax risk defined in (11), and be a function defined in (15). For the subclass , the GLS estimator is minimax with the risk

The minimax risk on the model class satisfies (14) and

(16)

The function in (15) is defined through the recursive sequence above. Although its explicit expression can be derived, it is complicated and provides little additional insight, so we choose not to present it. Instead, we characterize the behavior of the function around 0 in the following proposition.

Proposition 6

is a nonnegative increasing function on with

This proposition, together with (14), shows that the excess minimax risk of the OLS estimator is bounded by

As an immediate consequence, we have the following corollary.

Corollary 3

The OLS estimator is asymptotically minimax under condition , i.e.,

In Figure 1, we illustrate the minimax risk bounds discussed above. In particular, we plot the upper bounds given by OLS in (14) (labeled OLS-L) and by GLS in (16) (labeled GLS-L). Bound (14) is tighter when the ratio $\Delta/\sigma^2$ is small, and (16) gives a sharper bound when the ratio is large. Two other lines in Figure 1, labeled OLS-2L and GLS-2L, show the risks of the OLS and GLS estimators over a smaller model class, as in (8) and (16). In particular, as stated in Theorem 4, the GLS-2L line gives a lower bound for the minimax risk. All the curves are plotted over a wide range of the ratio. For example, a model class with a large ratio would include a model which changes its mean at a level of 2 standard deviations every 5 data points, or at a level of 4 standard deviations every 20 data points. In general, a large ratio indicates that either the magnitude of the mean changes is large or the mean changes frequently. In the former scenario, we may detect the obvious change points first and reduce the total variation, then estimate the variance, which facilitates the detection of subtle change points. In the latter scenario, it would be difficult to identify all the change points simultaneously even if we knew the true variance. Therefore, it is reasonable to consider variance estimation for a class with a small or moderate ratio. Finally, we conclude that the OLS estimator defined in (7) and considered in Section 2.2 gives a simple and good solution to the variance estimation problem, especially for a model class where the ratio is not too large. We call it the equivariant variance estimator (EVE); its numerical performance will be presented next.

Figure 1: Lower (GLS-2L) and upper (OLS-L, GLS-L) bounds of the minimax risk with respect to . The left and right panels correspond to and respectively.

3 Numerical studies

3.1 Variance estimators for comparison

Many estimators of the variance or standard deviation of the additive noise have been used in recent works on change-point detection. For example, Fryzlewicz (2014) uses a median absolute deviation (MAD) estimator (Hampel, 1974), defined by

(17)

where the median of the corresponding data vector is used for centering, and the constant 1.4826 is the ratio between the standard deviation and the third quartile of the standard Gaussian distribution. One advantage of this estimator is that it is robust against outliers. However, the method depends on a Gaussianity assumption and a sparsity assumption, namely that the mean is a constant vector except for a small number of entries.

Frick et al. (2014) suggest an estimator used in Davies & Kovac (2001),

(18)

which is based on the median of the absolute differences of consecutive observations. This estimator is similar to the MAD estimator except that it does not require the mean to be an almost constant vector. Nevertheless, it still needs normality of the noise.

The Rice estimator, introduced in Rice (1984),

$$\hat{\sigma}^2_{\mathrm{Rice}} = \tilde{s}_1 = \frac{1}{2(n-1)} \sum_{i=1}^{n-1} (y_{i+1} - y_i)^2, \qquad (19)$$

is another popular method; for example, Pique-Regi et al. (2008) use it for variance estimation. It does not depend on Gaussianity of the noise, but it is seriously biased when the total variation of the mean is large. In fact, as an immediate consequence of Proposition 2, the bias of the Rice estimator is $\tilde{\Delta}/\{2(n-1)\}$. The regression based MS estimator (Müller & Stadtmüller, 1999) studied in Proposition 3 can eliminate this bias and improve the efficiency. In spite of its popularity in nonparametric regression (Tong & Wang, 2005; Tong et al., 2013; Tecuapetla-Gómez & Munk, 2017; Levine & Tecuapetla-Gomez, 2019), the regression based variance estimator seems not to have been widely recognized in change-point analysis. We will compare the EVE with these estimators, as well as with the oracle estimator (10), which can be viewed as a benchmark.
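A sketch of these comparison estimators is given below; the scaling constants and the use of first differences follow common implementations and our reading of (17)-(19), so the exact forms should be checked against the original references.

```python
import numpy as np

def mad_estimator(y):
    """MAD-type estimator of the standard deviation, here applied to the scaled
    first differences (a common implementation choice), calibrated for Gaussian noise."""
    d = (y[1:] - y[:-1]) / np.sqrt(2)
    return 1.4826 * np.median(np.abs(d - np.median(d)))

def davies_kovac_estimator(y):
    """Davies-Kovac style estimator: median of absolute consecutive differences,
    rescaled for Gaussian noise; no sparsity assumption on the mean."""
    return 1.4826 * np.median(np.abs(y[1:] - y[:-1])) / np.sqrt(2)

def rice_variance(y):
    """Rice (1984) estimator of the variance, i.e., the lag-1 Rice statistic."""
    return np.sum((y[1:] - y[:-1]) ** 2) / (2 * (len(y) - 1))

def oracle_variance(y, theta):
    """Oracle estimator (10): available only when the true mean theta is known."""
    return np.mean((y - theta) ** 2)
```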

3.2 Simulated data examples

We illustrate the performance of our method using simulated data. We consider three error distributions: the standard Gaussian distribution, a scaled $t$-distribution $\sqrt{2/3}\, t_6$, and a translated exponential distribution $\mathrm{Exp}(1) - 1$, all of which have mean zero and variance one, with fourth moments 3, 6, and 9, respectively. Note that the exponential distribution is non-symmetric with a nonzero third moment. We fix the sample size and consider three mean structures. Specifically, we consider a null model without any change point in Scenario 1, a sparse mean model with few change points in Scenario 2, and a model with frequent changes in Scenario 3, as detailed below.

Scenario 1: the mean vector is constant (no change point).

Scenario 2: the mean is zero except on a few short segments, where it takes nonzero constant values (a sparse mean model with few change points).

Scenario 3: the mean alternates between zero and a nonzero constant over many short segments (frequent changes).
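A minimal simulation sketch for this setup is given below; the sample size and the specific mean vectors for Scenarios 2 and 3 are placeholders (the exact values used in the paper are not reproduced here), while the error generators follow the descriptions above. It reuses `eve_ols` from the sketch in Section 2.2.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 1000  # placeholder sample size

def draw_errors(dist, size):
    """Mean-zero, unit-variance errors: Gaussian, scaled t(6), or centered Exp(1)."""
    if dist == "gaussian":
        return rng.normal(size=size)
    if dist == "t":
        return rng.standard_t(df=6, size=size) * np.sqrt(4 / 6)   # Var(t_6) = 6/4
    if dist == "exp":
        return rng.exponential(scale=1.0, size=size) - 1.0
    raise ValueError(dist)

# Hypothetical mean vectors standing in for Scenarios 1-3
scenarios = {
    "S1": np.zeros(n),                                    # no change point
    "S2": np.repeat([0.0, 1.5, 0.0], [400, 200, 400]),    # a few change points
    "S3": np.tile(np.repeat([0.0, 1.0], 10), n // 20),    # frequent changes
}

for name, theta in scenarios.items():
    y = theta + draw_errors("gaussian", n)
    print(name, np.sqrt(eve_ols(y, K=10)))   # report on the standard deviation scale
```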

We report the simulation results for the different methods by the average values and standard errors over 500 independent replicates for each scenario. Because in practice the standard deviation, rather than the variance, is more often used in inference, we take the square root of all variance estimators and report the results for standard deviation estimation. In total, there are 9 scenarios (3 mean scenarios × 3 error distributions), labeled S1-G, S1-T, …, S3-E in the tables. For example, S1-G indicates Scenario 1 with Gaussian error.

To show the sensitivity of our method to the choice of $K$, we compare the performance of the EVE for $K = 5$, 10, 15, and 20 in Table 1. For the null model (Scenario 1), a larger $K$ leads to better performance, as affirmed in Theorem 1. Nevertheless, the improvement from using a $K$ larger than 10 is marginal. In contrast, in Scenario 3, where there are many change points, there is a serious bias when $K$ is larger than 10. In Scenario 2, a larger $K$ leads to a slightly larger bias but a smaller variance. In this case, our method is not sensitive to the choice of $K$. We observe that the standard errors of all estimators for the exponential and $t$ distributions are larger than for the Gaussian distribution because their fourth moments are larger. This is consistent with Theorem 1.

Scenario K=5 K=10 K=15 K=20 tuned Oracle
S1-G 0.999(0.029) 1.000(0.026) 1.000(0.025) 1.000(0.024) 0.999(0.028) 1.000(0.023)
S1-T 0.999(0.039) 0.999(0.037) 0.999(0.036) 0.999(0.035) 0.999(0.038) 1.000(0.034)
S1-E 0.998(0.048) 0.998(0.046) 0.998(0.046) 0.998(0.046) 0.998(0.047) 0.998(0.046)
S2-G 1.000(0.029) 1.000(0.026) 1.004(0.026) 1.009(0.025) 1.000(0.028) 1.000(0.023)
S2-T 0.999(0.039) 0.999(0.037) 1.003(0.036) 1.008(0.035) 1.000(0.038) 1.000(0.034)
S2-E 0.998(0.049) 0.998(0.046) 1.003(0.046) 1.007(0.046) 0.999(0.047) 0.998(0.046)
S3-G 1.000(0.034) 1.000(0.030) 1.253(0.026) 1.468(0.031) 1.001(0.030) 1.000(0.023)
S3-T 0.999(0.043) 0.999(0.040) 1.254(0.033) 1.469(0.035) 1.000(0.041) 1.000(0.034)
S3-E 0.998(0.052) 0.998(0.049) 1.252(0.041) 1.467(0.041) 0.999(0.049) 0.998(0.046)
Table 1: Average values of the estimators, with standard errors in parentheses, over 500 replicates.

We see that in the first two scenarios the performance of our method is not sensitive to the choice of $K$, while in the third scenario a $K$ larger than 10 leads to serious bias. We therefore develop a simple method to tune $K$. Given a range of candidate values of $K$, we calculate the statistics $s_1, \dots, s_K$ and use them to predict the next lag statistic based on the linear model (5). We then calculate a score for each candidate, in which the noise level is estimated based on the RSS, and a value of $K$ is selected by