 # Detection and Estimation of Local Signals

We study the maximum score statistic to detect and estimate local signals in the form of change-points in the level, slope, or other property of a sequence of observations, and to segment the sequence when there appear to be multiple changes. We find that when observations are serially dependent, the change-points can lead to upwardly biased estimates of autocorrelations, resulting in a sometimes serious loss of power. Examples involving temperature variations, the level of atmospheric greenhouse gases, suicide rates and daily incidence of COVID-19 illustrate the general theory.


## 1 Introduction.

We consider a problem of detection, estimation, and segmentation of local nonlinear signals imbedded in a sequence of observations. As a model, we assume

$$Y_u = \rho Y_{u-1} + \mu(X_u) + \sum_k \xi_k f[(X_u - t_k)/T] + \epsilon_u, \tag{1}$$

where the $t_k$ define the locations, and $f$ the "shape," of the local signals. For the asymptotic analysis given below, we assume that the signals are scaled as indicated, so their widths are proportional to $T$. Initially we assume the $X_u$ are fixed, but for some applications they are random. The $\epsilon_u$ are independent mean 0, normally distributed errors. For the moment we assume their variances are known and equal one, and discuss later how they should be estimated. The nuisance parameters $\mu(X_u)$ depend on the variable $X_u$ and may be constant or a parameterized regression function. They might, for example, be the mean values for a control group, for which it is assumed that all of the $\xi_k = 0$. For our general theory we assume $\rho$ is an unknown constant, but we estimate it differently in different problems.

An important special case that we return to in examples below is $X_u = u$ and $\mu(X_u) = \mu_u$, so with this notation the model becomes

$$Y_u = \rho Y_{u-1} + \mu_u + \sum_k \xi_k f[(u - t_k)/T] + \epsilon_u. \tag{2}$$
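To fix ideas, the following minimal sketch simulates the discrete-time model (2); the function, its defaults, and the single slope change at $t_1 = 100$ are our own illustrative choices, not values used in the paper.

```python
import numpy as np

def simulate(T=200, rho=0.5, mu=0.0, sigma=1.0,
             changes=((100, 3.0),), f=lambda x: np.maximum(x, 0.0),
             seed=0):
    """Simulate Y_u = rho*Y_{u-1} + mu + sum_k xi_k f[(u - t_k)/T] + eps_u,
    with eps_u i.i.d. N(0, sigma^2); `changes` lists (t_k, xi_k) pairs, and
    the default f(x) = max(x, 0) gives the broken-line signal of (4)."""
    rng = np.random.default_rng(seed)
    Y = np.zeros(T + 1)
    for u in range(1, T + 1):
        signal = sum(xi * f((u - t) / T) for t, xi in changes)
        Y[u] = rho * Y[u - 1] + mu + signal + sigma * rng.standard_normal()
    return Y[1:]

y = simulate()
```

With `sigma=0.0` and `rho=0.0` the output reduces to the deterministic signal, which makes the scaling by $T$ easy to check by hand.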

It is occasionally convenient to simplify several basic calculations by considering an alternative continuous time model, for which (2) can be conveniently written

$$dY_u = -\gamma Y_u\,du + \mu_u\,du + \sum_k \xi_k f[(u - t_k)/T]\,du + dW(u) \tag{3}$$

for $0 \le u \le T$, where $W$ is standard Brownian motion, so $dW(u)$ defines white noise residuals.

Specific choices of the function $f$ appear to be appropriate for a number of applications, some of which are discussed below. The special case where $f$ is the indicator of a half line is a frequently discussed "change-point" model, which has been applied to a variety of problems, usually with $\rho$ assumed equal to 0. See, for example, Olshen et al. (2004), Robbins, Gallagher, and Lund (2016) and references given there. Fang, Li and Siegmund (2018) and Fryzlewicz (2014)

address a version of the problem that involves the possibility of multiple change-points, and a major goal is segmentation of the observations to identify their number and locations, while controlling the probability of false positive detections. In this paper we extend the methods of

Fang, Li and Siegmund (2018) to deal with more general signals and with observations that may have autoregressive dependence. See also Baranowski, Chen, and Fryzlewicz (2019).

An interesting special case is a broken line regression model, where $f(x) = x_+$, so the model is given by

$$Y_u = \rho Y_{u-1} + \alpha + \beta[(u - (T+1)/2)/T] + \sum_k \xi_k (u - t_k)_+/T + \epsilon_u. \tag{4}$$

This model has a long history, again primarily for the case of independent observations and at most one break-point in a regression line. An application to monitor kidney function after transplant has been discussed in a series of papers by A. F. M. Smith and others (e.g. Smith and Cook (1980)). Davies (1987) and Knowles and Siegmund (1989) give theoretical analyses for independent observations. Toms and Lesperance (2003) provides an analysis and ecological applications. Several of the examples discussed below involve climatological time series, or day by day newly confirmed cases of COVID-19. Although alternative models are possible, a broken line model provides a conceptual framework that allows us to fit an easy to understand model and to ask whether the breakpoints represent changes of scientific importance.
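Once candidate breakpoints are in hand, the broken-line model (4) with $\rho = 0$ is an ordinary linear model; the following sketch (our own illustration; the function name and interface are invented for this purpose) fits it by least squares.

```python
import numpy as np

def fit_broken_line(y, breakpoints):
    """Least-squares fit of the broken-line model (4) with rho = 0 and the
    breakpoints treated as known: columns are an intercept, the centered
    linear trend (u - (T+1)/2)/T, and one hinge (u - t_k)_+ / T per
    breakpoint.  Returns (alpha, beta, xi_1, ...) and the residuals."""
    T = len(y)
    u = np.arange(1, T + 1)
    X = np.column_stack(
        [np.ones(T), (u - (T + 1) / 2) / T]
        + [np.maximum(u - t, 0.0) / T for t in breakpoints])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, y - X @ coef
```

On noiseless data generated from (4) the fit recovers the coefficients exactly, which is a convenient sanity check for the design matrix.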

In a variety of genomic applications $X_u$ denotes a genomic location, and different applications suggest functions $f$ having different characteristic shapes. For example, for detection of copy number variations (CNV) the indicator of an interval or a half line may be appropriate. For detection of differentially methylated genomic regions Jaffe et al. (2012a) suggests a "bump" function. See simple examples below and elaborations of this model involving covariates in Jaffe et al. (2012b). For ChIP-Seq analysis steeply decaying symmetric functions, or a normal probability density function, are different possibilities (cf. Shin et al. (2013), Schwartzman et al. (2013)). In some applications it is often appropriate to add a scale parameter $\tau$, so the signal located at $t$ is of the form $\xi f[(x - t)/\tau]$ (cf. Siegmund and Worsley (1995)). The model of "paired change-points" of Olshen et al. (2004) can be described similarly, with $f$ the indicator of an interval.

In the special case that $\mu$ is a constant, $X_u = Y_{u-1}$, and $f$ is the indicator of a half line, (1) is the simplest example of a threshold autoregression, as introduced by Tong and studied by Chan, Tong, and others (e.g., Chan and Tong (1990)) in numerous projects. In a still different context, $X_u$ may be a regressor, say a biomarker, and the local signal can be used to study whether a subset of individuals, defined by the value of that biomarker, differ from others in some respect, e.g., response to a treatment.

In some applications $X_u$ may be multidimensional, or the noise distribution may not be normal, with the Poisson distribution representing a particularly interesting alternative.

Although the focus of our paper is on segmentation of multiple signals, we also develop confidence regions for $t$ and jointly for $t$ and $\xi$. After beginning with some basic calculations and a discussion of testing for at most one change in Section 2, segmentation is discussed in Section 3 and illustrated in Section 3.1. New approximations to control the false positive error rate given in Section 3, which combine smooth and not smooth stochastic processes, play an important role. Sections 4 and 5 briefly consider a number of other special cases of our general framework. Additional theory and examples are contained in an online Supplement.

We have used simple models to minimize issues of over-fitting and data re-use that could arguably complicate interpretation of our segmentations. Once segmented, the models become simple linear models, so we also consider a second stage analysis to estimate more accurately the nuisance parameters and the $\xi_k$, which then allows us to re-evaluate the accuracy of the segmentation.

Although large sample theory suggests use of maximum likelihood estimators to estimate the nuisance parameters and $\rho$, we find that in some cases the signals themselves introduce large bias into those estimators. We discuss some ad hoc alternatives below.

## 2 Basic Calculations and the Case of at Most One Change.

To introduce our notation and provide some basic results, we begin with the model given by (2) with one change-point or none, and we consider a test of the hypothesis $\xi = 0$. The log likelihood is given by

$$\ell = -\tfrac{1}{2}\sum_u \big\{Y_u - \rho Y_{u-1} - \alpha - \beta[(u - (T+1)/2)/T] - \xi f[(u - t)/T]\big\}^2,$$

where $\theta = (\rho, \alpha, \beta)$ denotes the nuisance parameters, and we consider a test based on the maximum with respect to $t$ of the standardized score statistic

$$\ell_\xi(0, t, \hat\theta)/\sigma(t), \tag{5}$$

where $\ell_\xi$ denotes differentiation of the log likelihood with respect to $\xi$, $\hat\theta$ is the maximum likelihood estimator of the nuisance parameters under the hypothesis $\xi = 0$, and $\sigma^2(t)$ is the asymptotic variance of the numerator.

We begin with the standard large sample expansion of the numerator in (5) given by

$$\ell_\xi(0, t, \hat\theta) \approx \ell_\xi(0, t, \theta) - I_{\xi,\theta} I_{\theta,\theta}^{-1} \ell_\theta, \tag{6}$$

where $I_{\cdot,\cdot}$ denotes elements of the Fisher information matrix, and by the law of large numbers all quantities on the right hand side are evaluated at $\xi = 0$ and the true values of the other parameters. This expansion is valid in large samples up to terms that are small in probability. To emphasize the structure of this approximation we find it convenient to use the notation $E_0$ to denote expectation under the hypothesis $\xi = 0$ and write (6) in the form $V_t = \ell_\xi(t) - \Psi(t)'A\,\ell_\theta$, where $\Psi(t) = I_{\theta,\xi}(t)$ and $A = I_{\theta,\theta}^{-1}$. Here $\Psi(t)$ is not random although it depends on $t$; $\ell_\theta$ does not depend on $t$, and under the hypothesis it has mean value 0 and covariance matrix $A^{-1}$.

Remark. It may be shown that the decomposition given in (6) does not depend on the normality of the $\epsilon_u$, although we can no longer use the terminology of likelihood, efficient score, etc., and must rely on the central limit theorem to justify the asymptotic normality of probability calculations given below. The special case of Poisson observations is discussed briefly in Section 5.

Calculations yield a number of simple propositions.

Proposition 1.

$$\Sigma(s,t) = E_0(V_s V_t) = E_0[\ell_\xi(s)\ell_\xi(t)] - \Psi(s)'A\Psi(t). \tag{7}$$

In particular the variance of $V_t$ is $\sigma^2(t) = \Sigma(t, t)$.

Remark. In the case the local signal contains a scale parameter, so it takes the form $\xi f[(x - t)/\tau]$, the covariance is similar in its general formulation, but slightly more complicated to compute in examples.

Writing $E_{t,\xi}$ to denote expectation under the alternative, where $t$ and $\xi$ can be vectors, we have

Proposition 2. $E_{t,\xi}(V_s) = \xi\,\Sigma(s, t)$.

Let $Z_t = V_t/\sigma(t)$, and consider the test statistic $\max_t |Z_t|$. The false positive probability for the test that detects a signal when $\max_t |Z_t|$ is large is given by $P_0\{\max_{T_0 \le t \le T_1} |Z_t| > b\}$.

To evaluate this probability, we distinguish essentially two cases, one where $f$ is continuous and the second where it is discontinuous. The case where $f$ is the indicator of a half line or an interval (of unspecified length) is discussed in Fang, Li and Siegmund (2018). Here we assume that $f$ is continuous and piecewise differentiable. An approximation based on Rice's formula is given by

Proposition 3.

$$P_0\Big\{\max_{T_0 \le t \le T_1} |Z_t| > b\Big\} \le 2\Big\{\big(\varphi(b)/(2\pi)^{1/2}\big)\int_{T_0}^{T_1} \big[E_0(\dot Z_t)^2\big]^{1/2}\,dt + 1 - \Phi(b)\Big\}, \tag{8}$$

where $\dot Z_t$ denotes the derivative of $Z_t$ at $t$. For the model where the local signal at $t$ has the form $\xi f[(u - t)/\tau]$, and we maximize over both $t$ and a range of values of $\tau$, a first order approximation becomes

$$P_0\Big\{\max_{t,\tau} |Z_{t,\tau}| > b\Big\} \approx 2\{b\varphi(b)/(2\pi)\}\int_{\tau_0}^{\tau_1}\!\!\int_{T_0}^{T_1} \det\big[E_0(\dot Z \dot Z')\big]^{1/2}\,dt\,d\tau, \tag{9}$$

where $\dot Z = (\partial Z/\partial t, \partial Z/\partial \tau)'$. This approximation can be improved by adding terms involving edge effects and corrections for curvature. The most important is the boundary correction at the minimum value of $\tau$, which is given by the right hand side of (8) evaluated at $\tau = \tau_0$, where now $\dot Z = \partial Z/\partial t$. It is often convenient, when the interval searched is large, to ignore edge effects, which simplifies the integration over $\tau$. See Siegmund and Worsley (1995) or Adler and Taylor (2007) for more detailed approximations.
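As a quick numerical illustration of how the Rice bound (8) behaves, the sketch below compares it with Monte Carlo exceedance frequencies for an arbitrarily chosen smooth stationary Gaussian process, the squared-exponential covariance $r(h) = e^{-h^2/2}$, for which $E_0(\dot Z_t)^2 = -r''(0) = 1$; the interval length, grid, threshold, and replication count are all our own choices.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# Stationary Gaussian process on [0, L] with covariance r(h) = exp(-h^2/2),
# so E_0(Zdot_t^2) = 1 and the integral in (8) equals L.
L, step, b = 10.0, 0.05, 2.5
t = np.arange(0.0, L + step, step)
C = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2)
A = np.linalg.cholesky(C + 1e-8 * np.eye(len(t)))  # jitter for stability

reps = 2000
Z = A @ rng.standard_normal((len(t), reps))          # columns are sample paths
mc = float(np.mean(np.max(np.abs(Z), axis=0) > b))   # simulated P{max|Z_t| > b}

phi = math.exp(-b * b / 2) / math.sqrt(2 * math.pi)  # standard normal density
Phi = 0.5 * (1 + math.erf(b / math.sqrt(2)))         # standard normal cdf
rice_bound = 2 * (phi * L / math.sqrt(2 * math.pi) + 1 - Phi)
```

For these settings the bound is about 0.15, and the simulated frequency falls somewhat below it, as expected for an upper bound that double counts overlapping excursions.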

To evaluate (8) and (9), the following result is useful.

Proposition 4.

$$E_0[\dot Z \dot Z'] = \big[E_0(\dot\ell\,\dot\ell') - \dot\Psi' A \dot\Psi\big]/\sigma^2(t) - \dot\sigma\dot\sigma'/\sigma^2. \tag{10}$$

Finally observe that by Propositions 1 and 2 we have the "matched filter" condition $E_{t,\xi}(V_s) = \xi\,\Sigma(s, t)$, and consequently

Proposition 5. Given $t$ (or $(t, \tau)$), the variable $Z_t$ (or $Z_{t,\tau}$) is sufficient for $\xi$.

For the specific case of (4), it simplifies calculations somewhat to consider the continuous time version given in (3), which amounts to replacing certain sums by integrals, leading to

Proposition 6. Explicit formulas for $\sigma(t)$ and for the covariance function $\Sigma(s, t)$ of the continuous time model are given in the online Supplement.

As expected, this result does not depend on the nuisance parameters $\mu_u$. It is perhaps surprising that it also does not depend on $\gamma$. See the online Supplement for details.

### 2.1 Confidence Regions

It is possible to obtain a confidence region for the parameter $t$ or a joint region for $t$ and $\xi$. Calculation shows that the expected value of $V_s$ equals $\xi$ times the covariance of $V_s$ and $V_t$ (cf. Proposition 2). An easy consequence is that $Z_t$ is sufficient for $\xi$, so the conditional distribution of the process $\{Z_s\}$ given $Z_t$ is the same as it would be under the null hypothesis, $\xi = 0$.

Now we must consider two cases, depending on whether the process $Z_s$ has a jump at $s = t$. The case of a jump was studied by Fang, Li and Siegmund (2018), so here we consider the continuous case. For a similar approach based on a different probability evaluation see Knowles, Siegmund and Zhang (1991). We use the Kac-Slepian model process, which for a mean 0, unit variance, stationary, differentiable Gaussian process, say $U_s$, gives a parabolic approximation for the process in a neighborhood of $s_0$, conditional on $U_{s_0}$ assuming a large value. As a consequence of those arguments, in a neighborhood of $s_0$

$$U_s = U_{s_0} + (s - s_0)\dot U_{s_0} - \frac{(s - s_0)^2}{2}\, U_{s_0}\, E_0\big[(\dot U_{s_0})^2\big] + o\big((s - s_0)^2\big),$$

so

$$\max_s\big(U_s^2 - U_{s_0}^2\big) \approx \dot U_{s_0}^2 \big/ E_0\big[(\dot U_{s_0})^2\big], \tag{11}$$

with errors that converge to 0 in probability. It follows that the conditional distribution of $\max_s(U_s^2 - U_{s_0}^2)$ is asymptotically chi-squared with one degree of freedom, so a $1 - \alpha$ confidence region is given by the set of all $t$ that satisfy $\max_s Z_s^2 - Z_t^2 \le c^2$, where $c^2$ is the $1 - \alpha$ quantile of the chi-squared distribution with one degree of freedom. This is in effect the same result one would obtain by inverting the log likelihood ratio statistic if $Z_t^2$ were the log likelihood of a parameter satisfying standard regularity conditions.
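Computing such a region from a grid of score statistics is immediate; in this small sketch the function name and the hard-coded 0.95 quantile of the chi-squared distribution with one degree of freedom (3.84) are our own choices.

```python
def chi2_conf_region(z, c2=3.8415):
    """Given score statistics Z_t on a grid, return the indices t with
    max_s Z_s^2 - Z_t^2 <= c2, where c2 is the 1 - alpha quantile of the
    chi-squared distribution with one degree of freedom (3.84 for alpha = 0.05)."""
    m = max(v * v for v in z)
    return [i for i, v in enumerate(z) if m - v * v <= c2]
```

On a sharply peaked profile the region collapses to the maximizing index; a broad peak, as in some of the temperature examples below, yields a wide region.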

A joint confidence region for $t$ and $\xi$ is also easily obtained. To the condition that $\max_\delta[Z_{t+\delta}^2 - Z_t^2]$ be sufficiently small, we also require that $(Z_t - \xi\sigma(t))^2$ be sufficiently small. The pairs $(t, \xi)$ that satisfy the inequality in (12) provide a joint confidence region having the confidence coefficient

$$P_{t,\xi}\Big\{\max_\delta\big[Z_{t+\delta}^2 - Z_t^2\big] + (Z_t - \xi\sigma(t))^2 \le c^2\Big\} \tag{12}$$
$$= E_{t,\xi}\Big[P_{t,\xi}\Big\{\max_\delta\big[Z_{t+\delta}^2 - Z_t^2\big] \le c^2 - (Z_t - \xi\sigma(t))^2 \,\Big|\, Z_t\Big\};\ |Z_t - \xi\sigma(t)| \le c\Big] \tag{13}$$

which is asymptotically the probability that a chi-squared random variable with two degrees of freedom is at most $c^2$, again as if standard regularity conditions had been satisfied.

Looking ahead to the possibility of finding joint confidence regions for two (or more) change-points $t_1, t_2$, we face a more subtle argument. First note that with $\xi = (\xi_1, \xi_2)'$ and $V = (V_{t_1}, V_{t_2})'$, a straightforward calculation extending the result given in Proposition 2 (cf. (29) in the online Supplement) shows that $E_{t_1,t_2,\xi}(V) = \Sigma(t_1, t_2)\,\xi$, where $\Sigma(t_1, t_2)$ denotes the covariance matrix of $V$. This produces by a second calculation the conclusion that the log likelihood ratio statistic, given $t_1, t_2$, for testing that $\xi$ is the 0 vector equals

$$(V_{t_1}, V_{t_2})\,\Sigma(t_1, t_2)^{-1}\,(V_{t_1}, V_{t_2})' = \|\tilde Z_{t_1,t_2}\|^2, \tag{14}$$

where $\tilde Z_{t_1,t_2}$ has a standard bivariate normal distribution when $\xi = 0$, and $\|\cdot\|$ denotes the Euclidean norm. It follows from sufficiency and the Kac-Slepian argument used above that

$$\max_{\delta_1,\delta_2}\Big[\|\tilde Z_{t_1+\delta_1,\,t_2+\delta_2}\|^2 - \|\tilde Z_{t_1,t_2}\|^2\Big]$$

has a chi-squared distribution with two degrees of freedom and can be used as above to obtain a joint confidence region for $(t_1, t_2)$.

A joint confidence region for $(t_1, t_2, \xi_1, \xi_2)$ follows by an argument similar to that given above.

Remark. The confidence regions for the change-points are easy to visualize and seem reasonable. The estimates of $\xi$ are more problematic, since selection of the change-point value introduces bias into the estimation of $\xi$, which may be exacerbated by a biased estimate of $\rho$. In Section 3 we consider a different method for estimating the parameters $\xi_k$.

### 2.2 Numerical Examples

We begin with several simple numerical examples of broken line regression, which differs from some other models in the sense that a change of slope can have a lasting effect that makes even a relatively small change easy to detect if it persists.

Example 1. Extreme Precipitation in the United States. Extreme precipitation in the United States is reported by the National Oceanographic and Atmospheric Administration (NOAA). The data give the area of the country where monthly rainfall exceeded the 90th percentile of normal for the 124 years and 8 months beginning in 1895.

Our test for at most one change suggests a slope increase in 1954, with a p-value of 0.001. The value of $\rho$ is estimated to be 0.08. The test applied to extremely low precipitation (area of the country below the 10th percentile) is consistent with the hypothesis of no change in slope.

Example 2. Central England Annual Average Temperature. The average annual temperature in central England from 1659 (Manley (1974)) until 2019 is reported online.

The maximum value of $|Z_t|$ is about 4.60, with $\rho$ estimated to be 0.16, and occurs in 1969. The corresponding (two-sided) p-value is very small. See Figure 1, where it appears that there might also be earlier change-points. Although the mercury thermometer was relatively young in 1659, those early years are interesting since they appear to involve the "little ice age." We return to these data below for our discussion of segmentation. Meanwhile it may be interesting to note that an approximate 95% confidence region based on the detected change in 1969 is (1895, 1990), which reflects the local shape of the plot in Figure 1 and suggests the possibility that an increase in temperature may have actually begun during the late 19th century.

Figure 1: Plot of $Z_t^2$ for central England temperature data, 1659-2017.

### 2.3 Estimation of $\sigma^2$ and $\rho$: A Simulated Example.

Although we have assumed the variance $\sigma^2$ known in our theoretical calculations, in our numerical studies we have estimated it by the residual mean square under the null hypothesis. Other possibilities when the data are independent are estimators based on sums of squares of first or second order differences, which may be preferable. First order differences remove most of the effect of changing mean values, and second order differences also mitigate the effect of slope changes.
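In code, the difference-based estimators amount to one line each; the following sketch (our own, for independent errors) uses the facts that first differences of white noise have variance $2\sigma^2$ and second differences $6\sigma^2$.

```python
import numpy as np

def sigma2_diff(y, order=2):
    """Difference-based variance estimates for independent errors:
    Var(Y_u - Y_{u-1}) = 2*sigma^2 under a locally constant mean, and
    Var(Y_{u+1} - 2*Y_u + Y_{u-1}) = 6*sigma^2; second differences also
    annihilate a linear trend, so slope changes contribute only near the
    breakpoints themselves."""
    d = np.diff(y, n=order)
    return float(np.mean(d ** 2)) / {1: 2.0, 2: 6.0}[order]
```

On data with a mild linear trend both estimators recover the error variance, whereas the raw sample variance would not.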

Regarding correlation of the observations, a useful model is one that helps to control false positive errors with minimal loss of power; and to that end our working model is a first order autoregression. Using a value of $\rho$ that is too small can lead to an increase in false positives. But when there are change-points, the maximum likelihood estimator of $\rho$ under the null hypothesis of no change can be upwardly biased, leading to a loss of power, as we can easily see from the simulated data described below.

The simulated example in Table 1 illustrates the problem of estimating $\rho$ and the effect on the power to detect a change.

For the data in the third and fourth rows of the Table, if we use second order differences to estimate $\sigma^2$, the estimated value of $\rho$ is 0.89 and the maximum $Z$-value is 6.77. In all cases, the maximizing value of $t$ is 94. In most other cases as well the maximizing value of $t$ provides a reasonable estimator of the true value.

It is clear from these examples and others not shown that the existence of a change-point causes the estimator of $\rho$ to be upwardly biased and results in a loss of power compared to using the true value of $\rho$. This is usually worse if there are multiple change-points. If a plot of the data indicates long stretches without change-points, one might use an estimator of $\rho$ from such a segment. For the data in the next to last row of Table 1, an estimate of $\rho$ based on the observations 65-150 yields the value 0.61, which in turn leads to a maximum $Z$-value of 4.31. Although the theory developed here does not justify this approach, numerous simulations suggest that it works reasonably well.
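The bias is easy to reproduce in a toy simulation (ours, not the example of Table 1): an undetected slope change inflates the lag-one autocorrelation of the residuals from a straight-line fit, while a change-free segment gives a value near the truth. All constants below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
T, rho, xi = 200, 0.4, 40.0
u = np.arange(1, T + 1)

# AR(1) errors plus an undetected slope change at t = 150, as in model (4)
eps = np.zeros(T)
for i in range(1, T):
    eps[i] = rho * eps[i - 1] + rng.standard_normal()
y = 0.5 * u / T + xi * np.maximum(u - 150, 0.0) / T + eps

def rho_hat(x):
    """Lag-one autocorrelation of residuals from a straight-line (null) fit."""
    X = np.column_stack([np.ones(len(x)), np.arange(len(x), dtype=float)])
    r = x - X @ np.linalg.lstsq(X, x, rcond=None)[0]
    return float(np.sum(r[1:] * r[:-1]) / np.sum(r ** 2))

rho_full = rho_hat(y)           # upwardly biased by the change-point
rho_segment = rho_hat(y[:140])  # estimate from a change-free segment
```

The full-sample estimate lands well above the true value 0.4 because the unmodeled hinge leaves a slowly varying systematic component in the residuals, which masquerades as serial correlation.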

A second consideration in estimation of $\rho$ is robustness of our procedure against more complex forms of dependence. Without exploring this question in detail, consider again the next to last row of Table 1, but where there is a second order autoregressive coefficient of $-0.2$. A simulation of this case gives an estimator of $\rho$ equal to 0.6 based on all the data, with a maximum $Z$-value of 3.32, and an estimator equal to 0.42 based on the second half of the data, with a maximum $Z$-value of 4.85. Other simulations, not given here, suggest that at least for higher order autoregressive dependence, the first order model we have suggested provides reasonable protection against an inflated false positive error rate.

A different approach (e.g., Robbins, Gallagher, and Lund (2016)) for dealing with dependence in the standard change-point model (i.e., with $f$ the indicator of a half line) is to assume that the estimated residuals from a least squares fit of the null model form a locally dependent stationary process and estimate the autocovariances accordingly. Weak convergence arguments typically indicate that a rescaled process converges weakly to a Gaussian process, often related to a Brownian Bridge, which can be studied to look for change-points. Although this method has some advantages, we prefer our approach for facilitating quantitative statements in the form of hypothesis tests and confidence intervals after a minimal amount of "data snooping" to arrive at a useful model. If we are successful in detecting the signal locations, our model for the other parameters becomes a standard linear model. In Section 3.1 we discuss analysis of this linear model as a test of our approach.

## 3 Segmentation.

We now turn to the problem of segmentation when there may be several change-points. It is helpful conceptually to think of two different cases. In the first the signals exhibit no particular pattern. In the second two or more changes are expected to produce a definite signature. For the problem of changes in levels of the process, paired changes frequently take the form of an abrupt departure from, followed by a return to, a baseline level.

Here we initially concentrate on the case of signals having no apparent pattern and consider versions of the two methods suggested in Fang, Li and Siegmund (2018) for detection of jump changes. First we consider a pseudo-sequential procedure, called Seq below, where it is convenient to write $Z_t(T_0, T_1)$ to denote the statistic when the interval of observation is $[T_0, T_1]$. We also use a minimum sample size $m_0$ and a minimum lag $d_0$, both of which we often set equal to 5. Let $T_0 = 0$ and let $T_1$ be the smallest integer exceeding $T_0 + m_0 + d_0$ such that for some $t \le T_1 - d_0$, the value of $Z_t(T_0, T_1)$ exceeds a threshold $b$ to be determined by a constraint on the probability of false positive error. The first change-point estimate $\hat t$ is taken to be: equal to the value of $t$ maximizing $Z_t(T_0, T_1)$, or the smallest or largest value of $t$ exceeding the threshold for that smallest $T_1$. (In numerical experiments we have been unable to find a consistent preference.) The process is then iterated starting from $\hat t$. The nuisance parameter $\rho$ is assumed to be known. In practice it is estimated from the data, or more often from a subset of the data, to mitigate the effect of the bias discussed above. The other nuisance parameters, $\alpha$ and $\beta$, are estimated from the current, evolving interval of observations, so that the nuisance parameters associated with one change-point do not confound detection of another. It would also be possible to estimate $\rho$ from the currently studied subset of the data, but this estimator appears to be unstable.
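A heavily simplified sketch of the Seq scan may clarify the bookkeeping; it replaces the slope-change score statistic by the jump-in-mean statistic with independent $N(0,1)$ errors ($\rho = 0$, $\sigma = 1$ known), and the function name, threshold, and simulated data are all our own choices, not the paper's.

```python
import numpy as np

def seq_detect(y, b=4.0, m0=5, d0=5):
    """Sketch of the pseudo-sequential (Seq) scan, simplified to independent
    N(0,1) errors and a jump in the mean (f = indicator of a half line).
    The interval [0, T] grows until some t <= T - d0 gives |Z_t(0,T)| > b;
    returns (T, t) at the first exceedance, or None."""
    n = len(y)
    for T in range(m0 + d0, n + 1):
        s = y[:T]
        for t in range(m0, T - d0 + 1):
            # standardized difference of segment means = score statistic
            z = (s[t:].mean() - s[:t].mean()) / np.sqrt(1.0 / t + 1.0 / (T - t))
            if abs(z) > b:
                return T, t
    return None

rng = np.random.default_rng(3)
y = np.concatenate([rng.standard_normal(60), 2.0 + rng.standard_normal(60)])
detected = seq_detect(y)
```

With a jump of two standard deviations at observation 60, detection typically occurs a handful of observations after the change, illustrating the lag between a putative change-point and its detection time.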

To control the global false positive error rate, we want an approximation for

$$Q = P_0\Big\{\max_{m_0 \le t \le T - d_0,\; T \le m} Z_t(0, T) > b\Big\},$$

where $m$ is the number of observations and $m_0, d_0$ represent minimal sample sizes, both of which we frequently take to be 5. To emphasize that the interval of observation is a variable quantity, we also write $Z_t(T_0, T_1)$, etc.

Let $\lambda_t$ and $\beta(t, T)$ denote quantities determined by the covariance structure of the process $Z$. Then

$$Q \approx (2/\pi)^{1/2}\, b\,\varphi(b) \sum_{T=m_0}^{m} \int_{m_0}^{T-d_0} \cdots\; dt. \tag{15}$$

A derivation of the approximation (15) begins from the initial approximation of the probability of interest by a sum over $T$ of the sum over $t$ of an integral of the product of three factors.

The second factor is approximately Gaussian in form. If we choose $b$ large, so that the exceedance probability is very small, the integral is dominated by small values of its argument. The third factor is an expectation involving a suitable random walk, and for small values of its argument it is approximately linear. Putting these two observations together leads to the approximation given by (15).

A slightly more complex calculation leads to a different and apparently preferable approximation, which is based on a similar analysis but incorporates in addition the conditional probability, given the value of $Z_t(0, T)$, that the threshold has not been crossed earlier. By virtue of the Kac-Slepian model process the relevant derivative is an independent normally distributed random variable. This leads to an expression similar to (15), but with the integrand in (15) replaced by

$$b\,\lambda_t^{1/2}\,\beta(t, T)\,\nu\big\{b[2\beta(t, T)]^{1/2}\big\}. \tag{16}$$

In spite of its somewhat different appearance, this second approximation is slightly less conservative than the first, but gives very similar numerical values. In a representative case the first approximation gives a 0.05 threshold of 4.08, while the second gives 3.97. Since these probabilities vary substantially for small values of $m_0$ and $d_0$, and neighboring changes in slope seem difficult to detect and rarely to occur in the data we have studied, we typically use somewhat larger values of $m_0$ and $d_0$, which for this example would reduce the 0.05 threshold given by (16) from 3.97 to 3.89.

The approximation (16) may be used to assign a p-value to the individual detections. For example, suppose we use a threshold $b$ that gives a global false positive error of 0.05. Suppose also that we detect a change at some $t$ when the interval has grown to $T$. The approximation (16) evaluated at the observed $t$ and $T$, which in the case described equals 0.001, can be regarded as a p-value for that detection. This argument can be applied iteratively to obtain p-values for a second and subsequent detections. In view of the lag between a putative change-point and the value of $T$ when detection occurs, it is possible for the sum of p-values to exceed the global significance value, although numerical experimentation indicates that this rarely occurs.

When this process detects a change in slope, for the smallest value of $T$ there typically is a cluster of values of $t$ where $Z_t(0, T)$ exceeds the chosen threshold. In practice we might reasonably choose the smallest such $t$, or the value that maximizes $Z_t(0, T)$. We then iterate the process starting from the selected $\hat t$. Upon iteration the background linear regression is estimated using the data beginning with that $\hat t$. It may not be clear which choice is better in a particular application.

Remarks. (i) Although we use this “pseudo-sequential” method with a given amount of data, it can also be used for actual sequential detection, as in the problem of monitoring kidney transplants discussed in Smith and Cook (1980).

(ii) Although sample paths of the process $Z_t(T_0, T_1)$ are smooth as a function of $t$, as a function of $T_1$ they behave locally like a random walk. A consequence is that to obtain a mathematically precise derivation of the approximation (15), local perturbations of the parameter must be scaled by a small factor that converges to 0 as the sample size increases. See, for example, Fang, Li and Siegmund (2018) for the case of abrupt changes in the mean. A similar remark applies to (17) given below.

A second method of segmentation, called MS below (for Maximum Score Statistic), uses (in obvious notation)

$$\max_{m_0 \le T_0 < t < T_1 \le m} \big|Z_t(T_0, T_1)\big|.$$

An asymptotic approximation to the null probability that this expression exceeds $b$ can be computed by a modification of the argument behind (16), and we obtain the approximating expression

$$(2\pi)^{1/2}\, b^2 \varphi(b) \times \sum_{m_0 \le T_0 < t < T_1 \le m} \cdots, \tag{17}$$

where each summand is defined similarly to the integrand in (16). The evaluation of (17) can be simplified by a summation by parts and the observation that the various functions of the three variables $T_0, t, T_1$ occurring in (17) are in fact functions of the two variables $t - T_0$ and $T_1 - t$.

These methods determine a list of putative local signals together with generally overlapping backgrounds. We consider different algorithms for selecting a set of local signals and backgrounds from this list, with a preference for the additional constraint that no background overlap two local signals. In Fang, Li and Siegmund (2018) examples of particular interest were the shortest background (cf. also Baranowski, Chen, and Fryzlewicz (2019)) and the largest $Z$-value. For the second of these methods we have paid in advance for false positive errors, so we can also consider other methods for searching the list, e.g., subjective consistency with a plot of the data. We also find it convenient computationally to follow the suggestion of Fryzlewicz (2014) by searching a random subset of intervals, which seems to work very well. When it appears that the number of changes is small, as in the examples considered in this paper, one might also use a combination of methods, e.g., searching a small random subset of intervals at a low threshold to generate candidates, then sorting those by hand with tests to detect at least one change at the appropriate higher threshold.

The methods discussed so far are "bottom up" methods in the sense that they attempt to identify one local signal at a time against a background appropriate for that signal. A popular "top down" method in the change-point literature scans all the data looking for a pair of change-points (e.g., Olshen et al. (2004)). In some cases, especially if there is only one change to be detected, the method will "detect" two changes, as required, but it will put one of them near an end of the data sequence, where it can be recognized and ignored. After an initial discovery of one or two change-points, the sequence is broken into two or three parts by those change-points, and the process is iterated as long as new change-points are identified. Here we also consider the statistic $\max_{t - s \ge D} U_{s,t}$, where

$$U_{s,t} = (V_s, V_t)\,\Sigma_{s,t}^{-1}\,(V_s, V_t)', \tag{18}$$

and $D$ is a parameter that represents a minimum distance between changes that we find interesting (usually taken to be 5 or 10 in the examples below). An appropriate threshold may be determined from the approximation (closely related to (9))

$$P\Big\{\max_{s < t,\; t - s \ge D} U_{s,t}^{1/2} > b\Big\} \sim \big[2 b\varphi(b)/(2\pi)\big] \iint_{t - s \ge D} \cdots\; ds\, dt.$$

This approximation is approximately linear in the length of the sequence searched, so iterating the process leads in a final step to roughly the same probability of a false positive as is present in the first step. If we assume that in early iterations only true positives are detected, then the iterative process does not create a multiple comparisons issue. As we will see in the examples below, this method has very good power properties, although its use requires somewhat more thoughtful analysis than the other methods. Since the process $U_{s,t}$ can have relatively large local correlations, when there is only one change to be detected in an interval, and hence we expect to find a second "fake" change near an end-point of the interval searched, that "fake" change may be somewhat distant from the end-point, so it is not clear without additional analysis whether it should be ignored. Use of a linear analysis as in the following section can be very helpful.
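The top-down pair scan can be sketched, for the mean-change model with independent errors, by a least-squares analogue of (18): the statistic below is the drop in residual sum of squares from fitting three segment means rather than one. This is our simplification for illustration, not the exact $U_{s,t}$, and the function name and defaults are invented.

```python
import numpy as np

def pair_scan(y, D=10, m0=5):
    """Top-down scan for a pair of change-points in the mean-change model
    with independent errors: maximize the reduction in residual sum of
    squares from fitting three segment means instead of one, over pairs
    (s, t) with t - s >= D and outer segments of length at least m0."""
    n = len(y)
    c1 = np.concatenate([[0.0], np.cumsum(y)])
    c2 = np.concatenate([[0.0], np.cumsum(y ** 2)])
    def rss(i, j):  # residual sum of squares of y[i:j] about its mean
        return (c2[j] - c2[i]) - (c1[j] - c1[i]) ** 2 / (j - i)
    best, arg = -np.inf, None
    for s in range(m0, n - m0 - D + 1):
        for t in range(s + D, n - m0 + 1):
            stat = rss(0, n) - (rss(0, s) + rss(s, t) + rss(t, n))
            if stat > best:
                best, arg = stat, (s, t)
    return arg, best

rng = np.random.default_rng(4)
y = np.concatenate([rng.standard_normal(50),
                    3.0 + rng.standard_normal(40),
                    rng.standard_normal(50)])
(s_hat, t_hat), _ = pair_scan(y)
```

The prefix-sum trick keeps each candidate pair at constant cost, so the full $O(n^2)$ scan is cheap at the sequence lengths considered in the examples.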

A search algorithm based on the maximum of (18) is easily adapted to obtain joint confidence regions for a pair of change points along the lines discussed toward the end of Section 2.1.

An advantage of the top down feature is that the intervals searched are usually relatively long, so it may be reasonable to estimate different values of $\rho$ for the different intervals searched. If we assume that large changes are detected first, this method would presumably reduce the estimated value of $\rho$ in subsequent searches, but the problem of bias would remain as long as subsequent searches involve intervals containing changes. For the sake of consistency in our examples we have not explored this option.

### 3.1 Applications.

In this section we consider a number of data sets, where broken line regression is arguably an appropriate model, or at least where it allows one to address questions of interest regarding the times of changes in long term trends. Since all these examples involve observational time series, an assumption of independence seems inappropriate, although in several cases the estimated value of $\rho$ is small enough to be assumed equal to zero.

After employing a detection method, we assume that the detections are in fact correct, so our model becomes a standard linear model. Estimation of the parameters of the linear model gives us an idea of the size and importance of different detected changes, the adequacy of the model as judged by the values of $R^2$ and the residual autocorrelation, and the reasonableness of our choice for the value of $\rho$ in our initial segmentation.

Example 3: Greenland temperatures 1901-2016. An example from a site sponsored by the World Bank is the average annual temperature of Greenland, which may be particularly interesting in view of the large amount of water contained in its glaciers. Data are provided for the years 1901-2016. See Figure 4 in the online Supplement for a plot of $Z_t$, which suggests a small increase followed by a decrease early in the 20th century, then a second increase near the end of the century. If we use all 116 observations to estimate $\rho$, we obtain 0.13, and the test to detect at least one change detects a decrease in slope occurring in about 1929. If we use only the first half or the second half of the data to estimate $\rho$, we obtain a value slightly less than 0.1, and with this value detect an increase in 1993. With the smaller estimate of $\rho$, Seq detects three changes: an increase in 1919, a decrease in 1929, and a second increase that seems to occur about 1973, although 1984 gives essentially the same observed value of the statistic. With MS, we detect changes in 1929 and 1984. With slightly relaxed constraints we also barely detect a change in 1919. A linear model incorporating these three changes begins with an initially negative slope, with the first two changes essentially canceling each other, and the third change about three times as large (in magnitude) as the initial negative slope. The effect of putting these changes into the model is to increase $R^2$ from an initial value of 0.28 to approximately 0.7, while reducing the estimated value of $\rho$ from 0.13 to 0.02, which does not test to be different from 0. Although the change at 1919 is borderline, omitting it from the model provides a less satisfactory result (details omitted).

If we use (18) to search for two changes at a time, we find slope changes in 1918, 1933, and 1983, essentially the same as above.

Example 4. Berkeley Earth Central European Temperature Anomalies beginning in 1753. Although both the Hadley Centre and NOAA provide temperature data, their records begin in the mid to late 19th century, which makes it difficult to detect slope changes that may have occurred during the 19th century. The Berkeley Earth web site at

provides measurements going back to the mid 18th century. The results for several European countries are quite similar and suggest that, at least in Europe, increasing temperatures began in the late 19th century, before the recent acceleration that began about 1980. As an example we consider Swiss temperature anomalies from 1753 to 2012. Our test for at least one change identifies a slope increase in 1890 with an estimated autocorrelation of 0.24. The plot of Z_t (Figure 2) contains a broad peak and a local maximum in the late 20th century, which suggests the possibility of more than one change. A 95% confidence interval based on the theory suggested above is [1866, 1972], which leaves open the possibility of a second change. Seq and MS confirm a 19th century change, but they fail to detect one in the 20th century even if the autocorrelation is set to 0. With the estimated autocorrelation of 0.24, the statistic (18) detects changes in both 1897 and 1980. A linear analysis with these two change-points yields an R² of 0.33 and an autocorrelation of 0.03, which does not test to be different from 0. Temperature anomalies of several other countries in Europe exhibit similar, but not identical, behavior. In these cases the test to detect a single change will sometimes pick out a local maximum in the 19th century and sometimes in the 20th, while the statistic (18) detects two changes. This is perhaps not surprising, since the method was designed for such data. For Switzerland, a 90% joint confidence region for the two change-points with the autocorrelation set equal to 0 contains about 2000 pairs of points. Typical examples near the extremes are (1829, 1988), (1888, 1962), and (1890, 1995).

Figure 2: Plot of Z_t for Berkeley Earth Swiss Temperature Anomalies: 1753-2012.

Example 2 (continued): Central England Average Annual Temperature. A challenging case is the central England temperature data for 360 years beginning in 1659. A plot of the data suggests possible changes very early in the series, with a much larger change two to three hundred years later. The first 80-100 years may be less reliable, due to the primitive thermometers available at that time, but the early data are nonetheless interesting in view of the so-called "little ice age," which overlapped these years. As reported above, a test of at least one change produces a significant increase in slope in 1970 with an estimated autocorrelation of 0.159, and a confidence interval going back about a hundred years, reflecting the increase in the plot of Z_t, in Figure 2, that begins in the late 19th century. If we use Seq with the threshold 4.00 and the autocorrelation 0.159, we find an increase in 1693, a decrease about 1708, and a larger increase about 1890. For MS with a threshold of 4.81 and the same value of the autocorrelation, we detect only the change in 1890. If we lower the detection standard to control the global false positive rate at 0.10 and set the autocorrelation to the estimate obtained from either the first half or the second half of the data, we obtain a change in 1696 and a second in either 1890 or 1988, but without a compelling reason to choose one or the other. Figure 3 suggests that either (or both) could be correct. A linear analysis indicates that the model with only one change at either 1890 or 1988 leads to an R² of approximately 0.26 and an autocorrelation of about 0.12. Incorporating the two earlier changes detected by Seq raises R² to about 0.31 and reduces the autocorrelation to about 0.1 for either choice of the third change. Including all four changes raises R² to about 0.33 and decreases the autocorrelation to about 0.08. The statistic (20) to detect two changes at a time detects changes in 1697, 1704, 1730, and 1886, but a linear analysis indicates that the change in 1730 need not be included.
The two earlier changes are detected on the initial search, and hence this is an example where the statistic (20), when applied to an interval containing only one change, may place a second change far enough from an end-point of the interval searched that it is not obvious it should be disregarded.

Example 5: Age-specific suicide rates in the United States. A small-sample example is provided by suicide rates in the US from 1990 through 2017, which can be found on the web site

ourworldindata.org.

Autocorrelation appears to be very small, so we assume it is 0. All methods detect a substantial slope increase about 2000. For MS this is the only change detected, and it produces an R² of 0.91. Seq detects, in addition, a second increase about 2014. The statistic (18) detects three changes: a decrease in slope about 1995 before a larger increase in 1999, and a third increase in 2014. A linear analysis with all three changes yields an R² of 0.97.

## 4 Detection of Jump Changes

We consider again the model (2), now with the mean an unknown constant and f a unit step function, so the t_k are change-points in the level of the observed process. For some applications it may seem more appropriate to use an indicator of a bounded interval, so that the length of a local interval where the mean changes is an unknown constant T, which may vary from one interval to another when there are multiple change-points. Since the case of independent observations has been widely studied, here we concentrate on time series, where the first-order autoregressive dependence discussed above may be pertinent. Again it turns out that change-points introduce bias into the estimation of the autocorrelation, so we consider estimators based on subsets of the data that appear to be free of change-points.

We consider primarily methods taken from Fang, Li and Siegmund (2018), which, as noted above, motivate the structure of the bottom up segmentation methods suggested in this paper.

We also consider the top down method of circular binary segmentation (CBS) (Olshen et al. (2004)), which may have advantages when a change that increases (decreases) the mean value tends to be followed by a change that decreases (increases) it, since CBS directly models such paired change-points. Finally, we consider the analogue of (18), which was not studied in Fang, Li and Siegmund (2018). For jump changes this statistic is defined by the equation (18) with the appropriately modified covariance matrix. In this section we continue to refer to it as the statistic (18).

Two interesting climate-related time series are AMO (Atlantic Multidecadal Oscillation), for which there are data beginning in 1856, and PDO (Pacific Decadal Oscillation), with data from 1900. In both cases the data have been recorded monthly, but we average the monthly data to get annual time series with 163 and 118 values, respectively.
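The monthly-to-annual averaging step can be written in one line; the data here are hypothetical random values standing in for the real AMO or PDO records, which would be read from the corresponding data files.

```python
import numpy as np

# Hypothetical stand-in for 163 years of monthly AMO-style index values.
rng = np.random.default_rng(2)
monthly = rng.normal(0.0, 1.0, 163 * 12)

# Average each block of 12 consecutive months to get one value per year.
annual = monthly.reshape(-1, 12).mean(axis=1)
```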

A visual inspection of the AMO data suggests that there is no change-point before about 1900, so we estimate the autocorrelation from the first 45 observations, obtaining the value 0.38. All methods agree that there is a decrease in the mean at 46 (1901), an increase at 70 (1925), a decrease at about 108 (1963), and the latest increase at 139 (1995). A linear analysis with these change-points returns an R² of about 0.72 and a residual autocorrelation of 0. It may be interesting to observe that the changes in slope detected in the Northern Hemisphere ocean temperature anomalies (Example 6 in Section 3.1) fall very close to the mid-points of these three intervals.

The PDO data are similar, but changes appear to be more frequent, as the name suggests, and a cumulative sum plot appears very noisy. If we estimate the autocorrelation from all the observations, we get 0.55 and detect no changes. If we use observations 1 to 45, we obtain a smaller estimate and detect changes at 48, 76, 99, and 114 by the pseudo-sequential method and by CBS, while the stricter segmentation method that uses all possible background intervals detects changes at 48, 76, and 114. The statistic (18) detects only the first two change-points. A linear analysis with the four detected change-points produces an R² of 0.48 and an estimate of 0.27 for the autocorrelation.

For these data level changes seem to alternate positive and negative directions, so CBS and related methods that look for paired changes appear to have an advantage. Although the pseudo-sequential method detected the same changes in the PDO data as CBS, its detections at 76 and 114 were borderline.

## 5 Non-Gaussian Models.

It is relatively straightforward to extend the analysis given above to generalized linear models, at least for independent observations. For example, suppose that Y_u has the log likelihood

 Y_u θ_u − ψ(θ_u), (20)

where

 θ_u = μ_u + ξ f(u − t).

The efficient score to test the hypothesis ξ = 0 or to segment the sequence is much as before, although computation of the relevant variance is more complicated. Fortunately, at least for the following examples, it seems plausible to assume independent (quasi-)Poisson observations.
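As a concrete sketch, consider the Poisson case with canonical (log) link and θ_u = μ + ξ(u − t)_+. The Rao score statistic for ξ = 0, with μ profiled out under a constant-mean null, is the signed score divided by the square root of the efficient information. This is a simplified stand-in for the paper's statistic (the null model, the data, and the scan range are assumptions for illustration).

```python
import numpy as np

def poisson_slope_score(y, t):
    """Score (Rao) statistic for a slope change at t in a Poisson model with
    canonical link, theta_u = mu + xi * (u - t)_+, evaluated at the null
    xi = 0 with mu profiled out (lambda_hat = mean of y)."""
    n = len(y)
    u = np.arange(n)
    f = np.maximum(u - t, 0).astype(float)
    lam0 = y.mean()
    score = np.sum((y - lam0) * f)
    # Efficient information: Fisher information in xi with mu profiled out.
    info = lam0 * (np.sum(f ** 2) - np.sum(f) ** 2 / n)
    return score / np.sqrt(info)

# Synthetic daily counts: constant rate, then exponential growth after day 40.
rng = np.random.default_rng(3)
u = np.arange(72)
rate = np.exp(2.0 + 0.05 * np.maximum(u - 40, 0))
y = rng.poisson(rate)

# Scan interior candidate change-points; the maximizing t gives a rough
# estimate of the change-point location.
z = np.array([poisson_slope_score(y, t) for t in range(5, 67)])
t_hat = 5 + int(np.argmax(z))
```

The maximum of `z` over t plays the role of the maximum score statistic, and thresholding it gives a detection rule of the kind used in the COVID-19 examples below.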

A perhaps more flexible model than (20) is an asymptotic quasi-likelihood model, e.g., Heyde (1997), where there is the possibility of different models for the mean and the variance. A model very similar to the one we developed in Section 2 is obtained from (20) by interpreting the mean as the conditional expectation of Y_u given the observations up to u − 1, so the efficient score to detect a local signal is a sum of martingale differences. It is then straightforward to extend the methods developed above.

Example 6: Rain Storms in New Zealand. An example modeled as jump changes in a Poisson mean is provided by heavy rain storms in New Zealand since 1900, which have been catalogued at

Both Seq and (18) suggest that there are two periods of increased storm activity: from about 1920 to approximately 1949, and from 1996 to 2010. The statistic MS misses the decrease in level in 1949, but detects the other three changes. A generalized linear model analysis indicates that the four-change model is preferable to three changes, but not surprisingly gives a relatively small z-score to the change in 1949.

A series of interesting, but often challenging, examples of contemporary interest are the slope changes occurring in the day-by-day counts of new COVID-19 cases in various locations. We have used data from ourworldindata.org.

Example 7: COVID-19 in Santa Clara County, California. We consider the 72-day daily history ending on 12 April 2020. The statistic (20), starting from a constant mean value and allowing for a slope change and for overdispersion, detects an increase in slope on the 34th day. The quasi-likelihood statistic with autocorrelation 0 and constant variance for the residuals, i.e., effectively just the single-change statistic of Section 2, detects a slope increase on the same day. Both a linear least squares analysis and a log-linear (Poisson) analysis confirm this change and indicate 0 autocorrelation. The dispersion parameter is estimated to be about 20.

Example 8: COVID-19 in South Korea. South Korea provides an example (84 observations as of 12 April 2020) that involves multiple changes. We use a quasi-likelihood analysis, again with 0 autocorrelation. The statistic Seq detects a slope increase at 27, a decrease at 44, and another increase at 54, all of which seem evident from a plot of the data. The statistic (18) suggests slope changes at observations 32, 41, and 54, while the statistic MS also detects three changes: at 32, 44, and 55.

Remark. Additional examples and discussion of COVID-19 data are contained in the online supplement.

## 6 Discussion.

We have studied several score statistics to detect local signals in the form of changes of level, slope, or (in the online Supplement) the autoregressive coefficient. To segment the observations, we consider two “bottom up” methods patterned after the methods of multiple change-point segmentation in Fang, Li and Siegmund (2018) and one “top down” method, defined in (18).

Estimation of the nuisance parameters poses special problems. For the bottom up methods, in examining an interval of observations for possible change-points, the intercept, level, slope, and variance of the process are estimated locally, i.e., using only the data from the interval under consideration. For the autoregressive coefficient, our theory suggests using the (global) maximum likelihood estimator under the hypothesis of no change. Since this estimator can be badly biased and result in a loss of power when there are change-points, we also consider ad hoc methods based on examination of different parts of the data. For use of (18), which estimates the autocorrelation from the (usually long) interval being searched, a possible method for dealing with variance heterogeneity arising from the early part of the data is to discard initial segments of varying length and consider the stability of the results obtained.

The multiple regression analysis suggested in Section 3.1, based on the assumption that the detected break-points are correct, allows us to see whether our segmentation is reasonable, to estimate the magnitude of the detected signals, and to reconsider our chosen value of the autocorrelation. In most examples it appears that we have used a larger value than necessary, which may result in a loss of power when the discrepancy is large.

A different approach would be to use initially the value 0, or some other small value, of the autocorrelation to detect change-points, apply the multiple regression analysis to discard changes that seem to be superfluous, and then iterate the process with the value of the autocorrelation suggested by the regression analysis. We have not chosen this path because our goal has been to use a value obtained with at most a small amount of "data snooping," in order to claim that, due to minimal re-use of the data, the false positive error rate has been adequately controlled.

Because our emphasis has been on detection of local signals, we have admittedly been selective in our use of examples and for the most part have limited our subsequent linear analysis to what it can tell us about our detection methods. In some examples it seems interesting to ask whether there are explanations for the local changes: e.g., what is the evidence that temperatures began to increase in the late 19th century, perhaps accompanied by a hiatus in the early-to-mid 20th century, before accelerating about 1980? In other examples it appears that our methods may lead to a reasonable regression fit without suggesting that the detected change-points need an explanation.

For problems involving sparse, bump-like changes, as discussed briefly in the Supplement, some superficial analysis suggests that our methods work well, since most of the data are consistent with the null model, and hence global estimation of nuisance parameters does not pose a serious problem. In view of the ubiquity of genomic applications where "bumps" have different signatures suggested by a combination of science and experimental technique, and where the background may involve use of a control group, it seems worthwhile to pursue a more systematic study.

Here we have considered signals in one-dimensional processes, where the number of possible "shapes" of the signals is relatively small. In view of the much larger variety of possible multi-dimensional signal shapes, and the variety of approaches already existing in the literature, a systematic comparative study of that problem may be valuable.

## References

• Adler and Taylor (2007) R. J. Adler and J.  E. Taylor (2007). Random Fields and Geometry Springer-Verlag, New York-Heidelberg-Berlin.
• Baranowski, Chen, and Fryzlewicz (2019) R. Baranowski, Yining Chen, and P. Fryzlewicz (2019). Narrowest-over-threshold detection of multiple change-points and change-point-like features. J. Roy. Statist. Soc. B 81, 649-672.
• Chan and Tong (1990) K.  S. Chan and H. Tong (1990) On likelihood ratio tests for threshold autoregression JRSSB 52, 469-476.
• Davies (1987) R.  B Davies (1987). Hypothesis testing when a nuisance parameter is present only under the alternative Biometrika 74, 33-43.
• Fang, Li and Siegmund (2018) X. Fang, J. Li and D. O. Siegmund (2018). Segmentation and estimation of change-point models: false positive control and confidence regions. arXiv preprint.
• Fryzlewicz (2014) P. Fryzlewicz (2014). Wild binary segmentation for multiple change-point detection. Ann. Statist. 42, 2243–2281.
• Heyde (1997) C. C. Heyde (1997) Quasi-Likelihood and Its Application Springer-Verlag, New York-Heidelberg-Berlin.
• Jaffe, et al. (2012a) A.  E. Jaffe, A.  P. Feinberg, R.  A. Irizarry, and J  T. Leek (2012a). Significance analysis and statistical dissection of variably methylated regions. Biostatistics 13, 166–178.
• Jaffe, et al. (2012b) A.  E. Jaffe, P. Murakami, Hwajin Lee, J  T. Leek, M. Daniele Fallin, A  P. Feinberg, and R.  A. Irizarry (2012b). Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies, Int. J. of Epidemiology 41, 200–209.
• Jones, et al. (2012) P. D. Jones, D. H. Lister, T. J. Osborn, C. Harpham, M. Salmon, and C. P. Morice (2012). Hemispheric and large-scale land surface air temperature variations: An extensive revision and an update to 2010. J. Geophys. Res. 117, D05127, doi:10.1029/2011JD017139.
• Knowles and Siegmund (1989) M. Knowles and D. Siegmund (1989) On Hotelling’s approach to testing for a nonlinear parameter in regression Int. Statist. Rev. 57 205-220.
• Knowles, Siegmund and Zhang (1991) M. Knowles, D. Siegmund and H.  P. Zhang (1991) Confidence regions in semilinear regression Biometrika 79 15-31.
• Loader (1992) C. Loader (1992). A log-linear model for a Poisson process change-point. Ann. Statist. 20 1391-1411.
• Manley (1974) Gordon Manley (1974). Central England temperatures: monthly means 1659 to 1973. Quarterly J. Roy. Meteorological Soc. 100, 389-405.
• Muggeo (2016) V.  M.  R Muggeo (2016) Testing with a nuisance parameter present only under the alternative: a score based approach with application to segmented modeling J. Statistical Computation and Simulation doi: 10.1080/00949655.2016.1149855
• Olshen et al. (2004) A. B. Olshen, E. S. Venkatraman, R. Lucito and M. Wigler (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557–572.
• Raftery and Akman (1986) A. E. Raftery and V. E. Akman (1986). Bayesian analysis of a Poisson process with a change-point. Biometrika 73, 85–89.
• Robbins, Gallagher, and Lund (2016) M.  W. Robbins, C.  M. Gallagher, and R. Lund (2016). A general regression change-point test for time series data, Jour. Amer. Statist. Assoc. 111 670–683.
• Schwartzman   et al. (2013) A. Schwartzman, A. Jaffe, Y. Gavrilov, Y. and C.   E. Meyer (2013). Multiple testing of local maxima for detection of peaks in ChIP-seq data, Ann. Appl. Statist., 7, 471-494.
• Shin   et al. (2013) H.  Shin, T.  Liu, X.  Duan, Y.  Zhang, X.  S. Liu (2013). Computational methodology for ChIP-seq analysis. Quantitative Biology 1, 54–70.
• Siegmund and Worsley (1995) D. O. Siegmund and K. J. Worsley (1995). Testing for a signal with unknown location and scale in a stationary Gaussian random field. Ann. Statist. 23, 608–639.
• Smith and Cook (1980) A. F. M. Smith and D. G. Cook (1980). Straight lines with a change-point: an analysis of some renal transplant data. J. Roy. Statist. Soc. Series C 29, 180-189.
• Toms and Lesperance (2003) J. D. Toms and M. L. Lesperance (2003). Piecewise regression: a tool for identifying ecological thresholds. Ecology 84, 2034-2041.
• Zhang et al. (2010) N. R. Zhang, D. O. Siegmund, H. Ji and J. Z. Li (2010). Detecting simultaneous changepoints in multiple sequences. With supplementary data available online. Biometrika 97, 631–645.
