# Detecting possibly frequent change-points: Wild Binary Segmentation 2 and steepest-drop model selection

Many existing procedures for detecting multiple change-points in data sequences fail in frequent-change-point scenarios. This article proposes a new change-point detection methodology designed to work well in both infrequent and frequent change-point settings. It is made up of two ingredients: one is "Wild Binary Segmentation 2" (WBS2), a recursive algorithm for producing what we call a complete' solution path to the change-point detection problem, i.e. a sequence of estimated nested models containing 0, ..., T-1 change-points, where T is the data length. The other ingredient is a new model selection procedure, referred to as "Steepest Drop to Low Levels" (SDLL). The SDLL criterion acts on the WBS2 solution path, and, unlike many existing model selection procedures for change-point problems, it is not penalty-based, and only uses thresholding as a certain discrete secondary check. The resulting WBS2.SDLL procedure, combining both ingredients, is shown to be consistent, and to significantly outperform the competition in the frequent change-point scenarios tested. WBS2.SDLL is fast, easy to code and does not require the choice of a window or span parameter.

Comments

There are no comments yet.

## Authors

• 8 publications
06/23/2020

### Seeded intervals and noise level estimation in change point detection: A discussion of Fryzlewicz (2020)

In this discussion, we compare the choice of seeded intervals and that o...
01/30/2019

### Detecting multiple generalized change-points by isolating single ones

We introduce a new approach, called Isolate-Detect (ID), for the consist...
06/25/2020

### Discussion of 'Detecting possibly frequent change-points: Wild Binary Segmentation 2 and steepest-drop model selection'

We discuss the theoretical guarantee provided by the WBS2.SDLL proposed ...
06/18/2020

### Short Communication: Detecting Possibly Frequent Change-points: Wild Binary Segmentation 2

This article comments on the new version of wild binary segmentation 2. ...
11/27/2020

### Multiple change point detection under serial dependence: Wild energy maximisation and gappy Schwarz criterion

We propose a methodology for detecting multiple change points in the mea...
11/29/2018

### Segmenting Dynamic Network Data

Networks and graphs arise naturally in many complex systems, often exhib...
08/11/2020

### Frequent or Systematic Changes? discussion on "Detecting possibly frequent change-points: Wild Binary Segmentation 2 and steepest-drop model selection."

We discuss Fryzlewicz's (2020) that proposes WBS2.SDLL approach to detec...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

A recent report (National Research Council, 2013) lists change-point detection as one of the “inferential giants” (common building blocks in knowledge discovery) in massive data analysis. Change-point analysis, be it a posteriori or online, is used in fields as diverse as bioinformatics (Olshen et al., 2004), econometrics and finance (Hansen, 2001; Andreou and Ghysels, 2002; Bai and Perron, 2003), climate and the natural environment (Reeves et al., 2007; Liu et al., 2010; Salarijazi et al., 2012; Pezzatti et al., 2013), autonomous driving (Galceran et al., 2015)

(Ranganathan, 2012), clinical neuroscience (Younes et al., 2014) and quality/reliability monitoring (D’Angelo et al., 2011; Huang and Lyu, 2011; Amiri and Allahyari, 2012).

A common testing ground for a posteriori change-point detection methods, and the focus of this paper, is the model

 Xt=ft+εt,t=1,…,T, (1)

in which is a deterministic, one-dimensional, piecewise-constant signal with change-points whose number and locations are unknown. The sequence is random and such that is exactly or approximately zero. In the simplest case are modelled as i.i.d. , but can also have other marginal distributions and/or exhibit temporal dependence. The task is to estimate and under various assumptions on , the magnitudes of the jumps and the minimum permitted distance between the change-point locations. This article focuses on the i.i.d. distribution for , as even in this simplest case, the above change-point detection problem still poses major challenges for the state of the art, as this article illustrates.

Although performance measures for the above estimation task, naturally, vary across the literature, it can be anticipated that many users will have a preference for methods which

1. accurately estimate the number and locations of any change-points present in , for both

1. signals with few/infrequent change-points, and

2. signals with many/frequent change-points;

2. are fast to execute and scale well to long signals.

Depending on the setting, other desirable features may include e.g. the existence of software, ease of implementation and transferability to other, more complex stochastic models. Even though much methodological and algorithmic progress has been made on the above change-point detection problem in recent years, we show in this paper that very few (if any) state-of-the-art techniques simultaneously possess the above desired features 1(a), 1(b) and 2. In particular, many techniques perform surprisingly poorly for signals with frequent change-points. We first briefly describe the history of the multiple change-point detection problem and the state of the art. For reasons of space, our literature review mainly focuses on the parametric case, i.e. models in which the distribution of can be modelled parametrically, with special emphasis on the i.i.d. Gaussian model for .

A major class of approaches to a posteriori multiple change-point detection is one in which change-points are found by minimising a criterion function consisting of a likelihood-type (or least-squares) term measuring the fit of the estimate to the data plus a penalty term to prevent overfitting. In this category, Yao and Au (1989) consider least-squares estimation of for a fixed and i.i.d. noise. Braun et al. (2000)

extend this work to noise for which the variance is a function of the mean. In the Gaussian case, the SIC (BIC) is used to estimate an unknown but bounded

in Yao (1988), and a more general penalty but also linear in the number of change-points appears in Lee (1995). For an unknown but bounded , Lavielle and Moulines (2000) consider penalised least-squares estimation, with a penalty linear in the number of change-points, for dependent ’s; see also Lavielle (1999) and Lavielle (2005). For a fixed , Pan and Chen (2006) propose a likelihood criterion with a penalty depending not only on the number, but also on the locations of change-points, favouring more uniformly-spread estimated change-points; related ideas appear in Zhang and Siegmund (2007). For an unknown , Lebarbier (2005) proposes least-squares estimation with a penalty originating from the model selection approach of Birgé and Massart (2001). Boysen et al. (2009) use the least-squares criterion with a linear penalty on the number of change-points. More general forms of Schwarz-like penalties are studied, for example, in Wu (2008), Ciuperca (2011) and Ciuperca (2014). The SMUCE estimator of Frick et al. (2014) can also be cast in the penalised cost function framework (while SMUCE is shown to control the FWER, the related method of Li and Munk (2016) controls the FDR). An empirical Bayes approach to change-point estimation in a marginal likelihood framework appears in Du et al. (2016). The PELT algorithm by Killick et al. (2012) and the pruned dynamic programming by Rigaill (2015) accelerate computation of some penalised change-point estimators to linear time in best-case scenarios (while retaining quadratic speed in worst-case ones), and offer fast implementations. Related ideas appear also in Maidstone et al. (2017).

Other attempts to reduce the computational complexity of the problem include Davis et al. (2006)

, who (in a time series setting) use a genetic algorithm to minimise a Minimum Description Length criterion, and

Harchaoui and Lévy-Leduc (2010) who consider the least-squares criterion with a total variation penalty, which enables them to use the LARS algorithm of Efron et al. (2004) to compute the solution in

time. However, the total variation penalty is not an optimal one for change-point detection (in the sense of balancing out type-I and type-II errors, as described in

Brodsky and Darkhovsky (1993) and Cho and Fryzlewicz (2011)). The total variation penalty is also considered in the context of peak/trough detection by Davies and Kovac (2001), who propose the ‘taut string’ approach for fast computation. In the context of multiple change-point detection, it is considered by Rinaldo (2009) (with a subsequent correction published on the author’s web page) and Rojas and Wahlberg (2014) as part of the fused lasso penalty, proposed by Tibshirani et al. (2005) and equivalent to taut string in model (1). Lin et al. (2017) propose a generic post-processing procedure for any -consistent, piecewise-constant estimator of , including in particular the total-variation-based one, which yields consistency for the estimators of under certain assumptions on the distances between the change-points and the jump sizes. Similar results for general “trend filtering” (Tibshirani, 2014) estimators appear in Guntuboyina et al. (2018).

Our experience is that many penalty-based methods in which the penalty has been pre-set by the user (rather than having been chosen adaptively from the data) can struggle to offer uniformly good performance across both signals with infrequent change-points, and signals with frequent/numerous ones; in particular, we illustrate in Section 2.1 that the popular BIC (Yao, 1988) and mBIC (Zhang and Siegmund, 2007) penalties can perform poorly for signals with frequent change-points. An interesting approach to data-driven penalty selection is proposed in Birgé and Massart (2006), who define the so-called minimal penalty; two approaches to estimating this quantity, referred to as ‘dimension jump’ and ‘data-driven slope estimation’ are described in Baudry et al. (2012). Both offer impressive performance for short signals with a limited number of jumps, but their performance degrades for signals with frequent jumps (more so for the dimension jump approach). There is also computational price to pay for selecting the penalty in a data-driven way: the ‘data-driven slope estimation’ approach is particularly slow for long signals. We illustrate these points in Section 4.2. With the exception of this approach, the existing literature does not appear to provide guidance as to the appropriate choice of a penalty for a given dataset.

Another class of approaches to the multiple change-point detection problem is one in which change-points are estimated one by one, in the order in which they appear in the data. Huskova and Slaby (2001) and Eichinger and Kirch (2018) propose the “moving sum” (MOSUM) technique, which requires the choice of an extra bandwidth parameter, but the latter requirement is circumvented in the mosum R package (Meier et al., 2018), which produces estimates combined over a range of automatically chosen bandwidths. The pseudo-sequential procedure of Venkatraman (1992)

, as well as the method of Ross (2015) are based on an adaptation of online detection algorithms to a posteriori situations and work by bounding the Type I error rate of falsely detecting change-points. The ‘Isolate-Detect’ method of

Anastasiou and Fryzlewicz (2018a) is also pseudo-sequential in nature, avoids to some extent the issue of bandwidth/span selection, and offers particularly fast implementation for signals with frequent change-points. However, from the performance point of view, such signals tend to pose major difficulties for many methods in this category, an issue magnified by the fact that many such techniques rely critically on the accurate estimation of a threshold against which to judge the significance of each change-point candidate, but the magnitude of this threshold can be difficult to estimate well in the presence of many change-points. We demonstrate this phenomenon in Section 2.3.2.

Binary segmentation is one of the most popular approaches to multiple change-point detection. Heuristically, it works by fitting the best model with a single change-point to data

, and if the estimated change-point is deemed to be significant, the same procedure is then repeated to the left and to the right of the estimate. Binary segmentation is fast, conceptually simple and easy to implement, but tends not to work well for signals with frequent change-points, the main reason being that for such signals there are no guarantees that the single-change-point model fit at each stage of the procedure will behave well in the presence of multiple change-points in the data segment considered. Possibly the first work to propose binary segmentation in a stochastic process setting is Vostrikova (1981), who shows consistency of binary segmentation for the number and locations of change-points for a fixed . Venkatraman (1992) offers a proof of the consistency of binary segmentation for and the change-point locations, even for increasing with , albeit with sub-optimal assumptions and rates for the locations. In a setting similar to Vostrikova (1981) (for a fixed and with following a linear process), binary segmentation is considered in Bai (1997). Chen et al. (2011) provide a proof of consistency of binary segmentation for the number of change-points in the case of a fixed and i.i.d. Gaussian. Binary segmentation is used for univariate time series segmentation in Fryzlewicz and Subba Rao (2014) and Cho and Fryzlewicz (2012), and for multivariate, possibly high-dimensional time series segmentation in Cho and Fryzlewicz (2015). Circular Binary Segmentation (Olshen et al., 2004; Venkatraman and Olshen, 2007), Wild Binary Segmentation (Fryzlewicz, 2014) and the Narrowest-Over-Threshold method (Baranowski et al., 2018) are designed to improve the performance of binary segmentation. We return to Wild Binary Segmentation later in this section.

A number of techniques do not directly fall into any of the above categories. Wang (1995) uses the fast discrete wavelet transform to detect change-points. Muggeo and Adelfio (2011) take the cumulative sum of and then apply the methodology described in Muggeo (2003) in the resulting piecewise-linear framework. An agglomerative, bottom-up, “tail-greedy” method is proposed in Fryzlewicz (2018). Matteson and James (2014) present a heuristic agglomerative algorithm for multiple change-point detection as a computationally attractive alternative to their divisive one. An early review of some multiple change-point detection methods (in the context of DNA segmentation, but applicable more widely) appears in Braun and Mueller (1998).

Wild Binary Segmentation (WBS, Fryzlewicz, 2014) is one of the state-of-the-art procedures for multiple change-point detection in model (1), for signals with a small or moderate number of change-points. Truong et al. (2018) give WBS “three stars” (their highest grade) for scalability to long signals, and Wang et al. (2018) show certain assumption- and error-rate-optimality of WBS. WBS is designed to deal with the shortcomings of binary segmentations (described above) in the following manner. At the start of the procedure, subsamples of the data, living on intervals , are drawn, with the start- and end-points and selected randomly, independently, uniformly and with replacement. The single-change-point model fit is then performed on each subsample, and the first change-point candidate is selected as the one corresponding to the best fit over the subsamples. The hope is that if is sufficiently large, then this best fit will come from a subsample containing at most one change-point and will therefore lead to the accurate estimation of the given change-point. If the first detected change-point is considered significant, the same procedure is repeated to the left and to the right of it, re-using (for speed) the previously drawn subsamples.

Notwithstanding this improvement over plain binary segmentation, we show in Section 2.1 that the performance of WBS deteriorates dramatically for signals with frequent change-points (as does that of many other state-of-the-art methods). The roots of its poor performance for such signals are multiple, and can be summarised as follows.

1. The recommended default number of subsamples drawn by the WBS (the R packages breakfast (Fryzlewicz, 2017) and wbs (Baranowski and Fryzlewicz, 2015) recommend not exceeding 20000) can be insufficient to ensure adequate performance of WBS for signals with frequent change-points. Indeed, good performance can only be hoped for if

is large enough for the drawn subsamples to include (with a high probability) ones with start- and end-points immediately to the left and to the right of each change-point; this means that the required

can easily be very large for signals with frequent change-points. Increasing the default to ensure this is not a viable solution as this would increase the computation time beyond what is acceptable (and is not needed for signals with no or few change-points). In the remainder of this article, we refer to this issue as the lack of computational adaptivity of the WBS, in the sense that the procedure does not decide automatically the number or the locations of the subsamples drawn.

2. A related point is that for WBS to be able to produce an entire solution path to problem (1) (i.e. estimated models with change-points), we would have to draw all possible subsamples of the data. This would result in a procedure of a cubic computational complexity in , which is computationally inadmissible. Any value of less than this upper limit would not guarantee the computability of the entire solution path. If a solution path algorithm in multiple change-point detection problems does not produce an entire solution path, we refer to this algorithm as incomplete. The use of incomplete solution path algorithms is particularly risky in frequent change-point scenarios, as it may lead to obtaining a partial solution path that is shorter than the true number of change-points, in which case any model selection criterion, however good, will not be able to recover all the change-points. We illustrate this important phenomenon in Section 2.3.1.

3. Fryzlewicz (2014) proposes two model selection criteria for the WBS: one based on thresholding, and the other based on the use of the sSIC (a penalty close to the standard BIC/SIC). Both approaches fail in frequent change-point scenarios. Thresholding does so because its success critically depends on the user’s ability to accurately estimate the variance of the noise , but this task can be difficult in frequent change-point settings, which we show in Section 2.3.2. BIC fails because it places too heavy a penalty on frequent-change-point solutions to the estimation problem 1. The failure of BIC in frequent change-point scenarios occurs for other change-point search algorithms too (see Section 2.1), not just WBS.

The objective of this paper is to introduce the “Wild Binary Segmentation 2” (WBS2) solution path algorithm for multiple change-point problems, and the “Steepest Drop to Low Levels” (SDLL) model selection for WBS2. The resulting multiple change-point detection method, labelled WBS2.SDLL, addresses the above shortcomings of WBS in two separate ways.

Firstly, WBS2 produces an entire solution path, i.e. sequences of solutions to the change-point detection problem with change-points. Among them, WBS2 guarantees (with probability approaching one with ) that the solution with the true number of change-points is such that the estimated change-point locations are near-optimally close to the true change-points . Because it produces the entire solution path, the WBS2 is (unlike WBS), by defintion, a complete procedure, and its completeness is achieved in a computationally efficient way as follows. In the first pass through the data, WBS2 draws only a small number of data subsamples, and marks the first change-point candidate as the arg-max of the resulting absolute CUSUMs. It then uses this change-point candidate to split its domain of operation into two, and again recursively draws

subsamples to the left and to the right of this change-point candidate, and so on. Therefore, WBS2 adaptively decides where to recursively draw the next subsamples, based on the change-point candidates detected so far (to re-express this is the language of machine learning, WBS2 learns the locations of the subsamples to draw automatically and “on the fly”, i.e. as it proceeds through the data and discovers the signal). This is in contrast to the standard WBS, which (non-adaptively) pre-draws all its

subsamples at the start of the procedure. We note that in practice, is of orders of magnitude smaller than ; in the numerical sections of this paper we will be using . WBS2 draws subsamples on every current subdomain of the data as long as its length is , which ensures completeness. By contrast, WBS (non-adaptively) pre-draws all its subsamples at the start of the procedure, which leads to its incompleteness except if all possible subsamples are drawn, which is computationally infeasible for even moderately long signals.

Secondly, the proposed SDLL procedure for estimating from the WBS2 solution path does not use a penalty, and only uses thresholding as a certain secondary check, which is in contrast to the use of thresholding as a primary model selection tool in WBS. This means that the magnitude of the threshold in WBS2.SDLL does not need to be estimated to a high degree of accuracy (which is difficult to do in frequent change-point scenarios). SDLL works by monitoring the ratios of the the pairs of consecutive maximum absolute CUSUM statistics returned by the WBS2 solution path, arranged in decreasing order. The procedure then selects the largest ratio for which the denominator falls under a pre-specified threshold. This is motivated by the fact that those absolute maximised CUSUMs that carry information about the change-points have the same order of magnitude between them, which is larger than the order of magnitude of those CUSUMs that correspond to the no-change-points sections of the data. The presence of this “drop” in the magnitude of the absolute maximised CUSUMs gives the SDLL criterion its name.

Our numerical experiments suggest that unlike the standard WBS, the resulting WBS2.SDLL procedure offers good detection accuracy and fast execution times for signals with frequent change-points. We demonstrate that in doing so, WBS2.SDLL significantly outperforms the state of the art for this signal class. Unlike many of its competitors, WBS2.SDLL does not require the choice of a penalty or the specification of the maximum number of change-points, and is easy to calibrate to return zero estimated change-points for constant signals with a required probability. Both the WBS2 and the SDLL components are conceptually simple and easy to code.

The paper is organised as follows. In Section 2, we illustrate the poor performance of the state of the art for signals with frequent change-points, and we investigate the reasons for WBS’s poor performance. Section 3 motivates and proposes WBS2 and SDLL and investigates their theoretical properties. Our practical recommendations and a comparative simulation study are in Section 4. Section 5 describes the application of WBS2.SDLL to a London house price dataset. The proofs of our theoretical results are in the Appendix.

## 2 Motivation

### 2.1 Frequent change-points: performance of the state of the art

To motivate the need for our new methodology, we start by illustrating the poor performance of WBS and other state-of-the-art multiple change-point detection methods for signals with frequent change-points, using the example of the extreme.teeth signal defined below. We use this signal as a running example throughout the paper to motivate several of our developments, which are then however shown to hold for a wide class of signals both in theory and in practice. Throughout the paper, we work with methods available from the R CRAN repository, and for each method tested, we use the default parameter settings unless stated otherwise. For techniques that require the specification of the maximum number of change-points, we set this to the (rounded) length of the input signal divided by three; in all our examples, the true number of change-points is (well) within the interior of this range. Our R code enabling replication of the results from this paper is available from https://github.com/pfryz/wild-binary-segmentation-2.0. We now list the techniques tested in addition to our new WBS2.SDLL methodology.

1. PELT+BIC and PELT+mBIC: the PELT technique from the R package changepoint (Killick et al., 2016) with the BIC and mBIC penalties, respectively. The methodology is described in Killick et al. (2012).

2. MOSUM: the multiscale MOSUM technique with localised pruning from the R package mosum (Meier et al., 2018). (We do not give results for the multiscale MOSUM technique with bottom-up bandwidth-based merging from the same package as it frequently returned warnings in the examples we tested.)

3. ID: the ‘Isolate-Detect’ technique from the R package IDetect (Anastasiou and Fryzlewicz, 2018b), described in Anastasiou and Fryzlewicz (2018a).

4. FDRSeg: the multiscale FDR-based technique from the R package FDRSeg (Li and Sieling, 2017), proposed in Li and Munk (2016).

5. S3IB: the technique from the R package Segmentor3IsBack (Cleynen et al., 2016). It implements a fast algorithm for minimising the least-squares cost function for change-point detection, as described in Rigaill (2015). The best model is selected via the “oracle” criterion of Lebarbier (2005), which uses the so-called slope heuristics.

6. SMUCE: the multiscale FWER-based technique from the R package stepR (Pein et al., 2018), proposed in Frick et al. (2014).

7. CUMSEG: the linearisation-based technique from the R package cumSeg (Muggeo, 2012), described in Muggeo and Adelfio (2011).

8. FPOP: the ‘functional pruning and optimal partitioning’ technique from the R package fpop (Rigaill and Hocking, 2016) with the BIC penalty. The methodology is described in Maidstone et al. (2017).

9. DDSE and DJUMP: the “data-driven slope estimation” and “dimension jump” techniques from the R package capushe (Arlot et al., 2016), with computation based also on the R package jointseg (Pierre-Jean et al., 2017). The methodology is described in Baudry et al. (2012).

10. TGUH: the tail-greedy, bottom-up agglomerative technique from the R package breakfast (Fryzlewicz, 2017). The methodology is described in Fryzlewicz (2018).

11. WBS-C1.0, WBS-C1.3 and WBS-BIC: the Wild Binary Segmentation methods with the model selection determined via thresholding with constants and and the BIC (respectively), as specified in Fryzlewicz (2014) and implemented in the R package breakfast (Fryzlewicz, 2017). See Section 2.2 for their definitions.

We do not include the (non-parametric) methodology implemented in the R package ecp (James and Matteson, 2014) as this procedure does not appear to be able to return zero estimated change-points. We also do not provide results for the R package strucchange (Zeileis et al., 2002) because of the slow speed of the methodology implemented therein.

Our example in this section is the signal , referred to as extreme.teeth and defined as follows: if ; if . We simulate 100 independent copies of in model (1) with being the extreme.teeth signal and . Note that , the true number of change-points in , equals 199. Figure 1 shows the first 100 observations of the noiseless and noisy .

Figure 2 provides evidence that most of the techniques listed above fail somewhat dramatically for this model, with all but three being able to estimate practically none or only a small proportion of the change-points. The three best performers are DDSE, and two versions of our method, labelled WBS2.SDLL(0.9) and WBS2.SDLL(0.95) (we provide the details later). The DDSE, however, is over three times slower than WBS2.SDLL and estimates zero change-points for two out of the 100 sample paths, which results in compared to . We return to DDSE and the other techniques in the comparative simulation study in Section 4.2. WBS-C1.0, WBS-C1.3, WBS-BIC are among the failing techniques and we now examine the reasons for their failure.

### 2.2 Wild Binary Segmentation

We first review WBS as proposed in Fryzlewicz (2014). Throughout the paper, our main “locator” statistic for indicating the locations of change-point candidates is the CUSUM statistic, defined, for the data portion , as

 ~Xbs,e=√e−bn(b−s+1)b∑t=sXt−√b−s+1n(e−b)e∑t=b+1Xt, (2)

where , with . It is used in different ways in WBS and WBS2. In both these algorithms, we use as the most likely location of a change-point in . This is the same as the location indicated by fitting the best possible piecewise-constant function with one change-point to via least squares. Therefore, the use of the CUSUM statistic is equivalent to the use of maximum likelihood in model (1) in which i.i.d. .

Denote by a set of random intervals , , whose start- and end-points have been drawn (independently with replacement) uniformly from the set . How many intervals to draw is decided by the user (Fryzlewicz (2014) and version 1.0.0 of the R package breakfast (Fryzlewicz, 2017) recommend and , respectively, for signal lengths not exceeding a few thousand). An appropriate choice of is key: on the whole, the larger the value of , the more accurate but slower the procedure. Using pseudocode, the main function of the WBS algorithm, with model selection performed via thresholding with threshold , is defined in the following way.

function WBS(, , )
if  then
STOP
else
set of those indices for which is such that

if  then
add to the set of estimated change-points (with the additional side
information )
WBS(, , )
WBS(, , )
else
STOP
end if
end if
end function

The WBS procedure (with model selection via thresholding with threshold ) is launched by the call WBS(, , ). The WBS-C1.0 and WBS-C1.3 versions, singled out by Fryzlewicz (2014) and used in Section 2.1 above, use, respectively, and , where is the Median Absolute Deviation (MAD) estimator of , computed using on input. An alternative implementation of WBS with thresholding first computes all possible change-point candidates via the call WBS(, , ) and then tests them, as is done in the definition of the routine WBS(, , ), against the threshold .

As an alternative to thresholding, Fryzlewicz (2014) also proposes model selection via the “strengthened Schwarz Information Criterion” (sSIC). The sSIC penalty is close to the standard SIC (BIC) penalty, and therefore we do not review it here in detail. If an estimated change-point arises via the maximisation of over , we refer to this estimate as in this paragraph. The WBS-BIC version of the procedure, used in Section 2.1, initially computes all possible change-point candidates via the call WBS(, , ). It then sorts them in the order of decreasing magnitudes of . Let be the indices (i.e. values of the indexing variable ) corresponding to this sorted order. The WBS-BIC returns as the estimated change-points the set that yields the minimum of the BIC.

The key difference between the WBS and standard binary segmentation algorithms is that each change-point candidate is found by WBS by examining all intervals contained within (rather than examining only itself, as in binary segmentation), and finding the overall arg-max of the CUSUM statistics fitted on each such interval . The reason why this improves on the standard binary segmentation is that if is large enough, each true change-point is guaranteed (with probability approaching one in ) to be contained within at least one of the intervals which contains only that single change-point. This alone provides suitable guarantees for the size of the largest maximum of the CUSUM statistics, and is enough to yield substantial theoretical and practical advantages of WBS over binary segmentation, as similar guarantees are unavailable in the latter method. The collection of intervals is only drawn once at the start of the WBS procedure; therefore, previously-drawn intervals are reused as the algorithm progresses.

### 2.3 Issues affecting Wild Binary Segmentation

#### 2.3.1 Incompleteness of WBS

With the definition of the routine WBS(, , ) as in Section 2.2, the call WBS(, , ) returns all possible change-point candidates in the sense that with , the change-point candidates are not tested against a threshold before they enter the output. Any model selection procedure can then be applied to this output to identify the final set of estimated change-points.

The call WBS(, , ) is defined recursively. The equivalent non-recursive procedure for producing the same set of change-point candidates can be described as follows. For each , let . Initially let and . While , repeat the following pair of operations.

1. (Identification of next change-point candidate)

 m0 := argmaxm:(m,sm,em,bm,|~Xbmsm,em|)∈C|~Xbmsm,em|, P := P∪(m0,sm0,em0,bm0,|~Xbm0sm0,em0|).
2. (Removal of all intervals containing this change-point candidate)

 C:=C∖{(m,sm,em,bm,|~Xbmsm,em|):sm≤bm0

At the end of this process, the set contains the same set of change-point candidates as the output of WBS(, , ). To see the equivalence between these two observe that the two recursive steps WBS(, , ) and WBS(, , ) in the definition of the WBS routine are equivalent to: after the recording of each change-point candidate , remove from consideration all those intervals that contain , and only operate on those intervals that fall entirely within and . In the above non-recursive definition of , this removal is explictly performed in stage 2. The removal is essential for the procedure to avoid repeated estimation of the same change-point multiple times.

Let (where returns the cardinality of its argument when it is a set). can be viewed as an ordered set in the following sense. Let be the order in which the 5-tuples enter . The sequence is non-increasing, as its each next element is a maximum over the progressively shrinking set . In the remainder of this paper, we treat as an ordered set with this particular ordering. Therefore, the ordered set can be viewed as a WBS “solution path”: the change-point candidates contained in are arranged in order of decreasing importance (i.e. decreasing magnitude of ), and therefore any reasonable WBS model selection procedure would typically choose a certain number of the first components of as the final set of estimated change-points.

As a result of the repeated removal of intervals from the set , the cardinality of the solution path is typically much smaller than the cardinality of the initial set . This is illustrated in the following example.

Example 2.1   Consider a sample path simulated from the model , , where is the extreme.teeth signal defined in Section 2.1 and and has been simulated in R on setting the random seed to 1. We simulate (a value recommended in Fryzlewicz (2014)) intervals . On completion of the computation of as described in this section, we obtain . As the length of the solution path is less than 199, the true number of change-points in , no model selection procedure is able to recover all the change-points in this example: indeed, even if we were to accept all change-point candidates in , we would only estimate 119 change-points. Therefore, WBS as proposed in Fryzlewicz (2014) fails in this case. (End of example.)

The above example illustrates that too small a value of can prevent WBS from working well for signals with frequent change-points, regardless even of the model selection procedure used. Increasing the default value of in the hope of obtaining a longer solution path is not a viable option as it can easily slow down the WBS procedure beyond what is acceptable (in particular, drawing all possible sub-intervals of leads to WBS having computational speed ) and is unnecessary for signals with no or few change-points, which require far fewer intervals. In other words, WBS does not adapt to the data in terms of the number or the locations of the intervals it draws. From the user perspective, this lack of adaptivity creates two related major practical issues: (a) in the absence of prior information about the number and frequency of change-points, it is unclear whether any particular value of is large enough in practice, and (b) setting to be very large to be “on the safe side” is not an option from the computational point of view.

We formalise the issue of the length of solution paths in the language of “completeness”, which we now introduce.

###### Definition 2.1

Let be a solution path algorithm in a multiple change-point detection problem which produces a sequence of candidate models with estimated change-points. We refer to as complete if , where is the length of the input data, and incomplete otherwise.

Note that means that every time point is a change-point candidate, so that the candidate model with is maximal. The definition of completeness leads to the following proposition for WBS. The proof is straightforward, so we omit it.

###### Proposition 2.1

The solution path obtained via WBS as described in this section is guaranteed to be complete in the sense of Definition 2.1 if and only if represent all possible sub-intervals of , in which case . Furthermore, if , then the WBS solution path procedure is of computational order .

We argue that completeness is a desirable feature of the solution path component of any change-point detection procedure if it is to be applied in frequent change-point scenarios, as the presence of completeness guarantees that we avoid situations (such as the one illustrated in this section) in which the solution path is too short to enable estimation of all the change-points. Completeness embodies the idea that no change-point candidate should be dismissed as inadmissible at the “solution path” stage: this should only happen at the model selection stage. For both scalability and good performance in frequent change-point scenarios, it is important for a solution path procedure to be both fast and complete, and we have shown in this section that WBS does not simultaneously possess these two features.

#### 2.3.2 Shortcomings of thresholding and BIC as model selection for WBS

Fryzlewicz (2014) proposes two model selection criteria for WBS: one based on thresholding, and the other based on the sSIC, an information criterion for all practical purposes equivalent to the standard SIC (a.k.a. BIC). In this section, we argue that both can easily be inadequate in frequent change-point scenarios (independently of the issue of completeness described in Section 2.3.1).

The success of thresholding as a model selection criterion for WBS critically relies on the user’s ability to accurately estimate . The following example illustrates that this task can be difficult to achieve in frequent change-point models.

Example 2.2   Consider 100 sample paths simulated from the model , , where is the extreme.teeth signal defined in Section 2.1. and

. For each simulated sample path, we evaluate the Median Absolute Deviation (MAD) and Inter-Quartile Range (IQR) estimators (two commonly used robust estimators) of

calibrated for the Gaussian distribution, computed with

on input. Figure 3 shows that both estimators are heavily biased upwards on this example and overestimate by around 25%. The degree of this overestimation prevents thresholding from being an effective model selection tool in this model. (End of example.)

In addition, there is ample empirical evidence that the BIC is a poor model selector in frequent change-point scenarios, also for search algorithms other than WBS. For instance, we showed in Section 2.1 that the PELT method (Killick et al., 2012) equipped with the BIC stopping rule failed on the extreme.teeth example. In brief, this is due to the fact that the BIC term places too heavy a penalty on frequent-change-point solutions.

It is tempting to ask at this point whether choosing a penalty adaptively from the data (so that possibly different penalties are applied in infrequent and frequent change-point scenarios) may provide a successful remedy to this issue. Based on our empirical experience, the best such approach that we have come across is the “data-driven slope estimation” technique implemented in the R package capushe (Arlot et al., 2016) and described in Baudry et al. (2012). However, we show in Section 4.2 that this approach scales poorly to long signals and is comfortably outperformed by WBS2.SDLL in the frequent change-point examples we tested.

## 3 Wild Binary Segmentation 2 and Steepest Drop to Low Levels

### 3.1 WBS2: data-adaptive computation of complete solution path

#### 3.1.1 Computation of the WBS2 solution path

This section describes how the WBS2 solution path is computed. This is the first of the two ingredients of the WBS2.SDLL procedure; the other is the application of the new “Steepest Drop to Low Levels” model selection criterion, described in Section 3.2.

We start with a heuristic description, followed by a formal algorithmic definition. In the first pass through the data, WBS2 draws a small number of data subsamples, with start- and end-points chosen randomly, uniformly and independently with replacement from the set . WBS2 marks the first change-point candidate as the arg-max of the resulting absolute CUSUMs. It then uses this change-point candidate to split its domain of operation into two, and again recursively draws subsamples to the left and to the right of this change-point candidate, and so on. Eventually, on short sub-domains, the number of all possible subsamples will be less than , in which case WBS2 will draw all possible subsamples of each such sub-domain. WBS2 stops and takes no action on a given current sub-domain only if its length is 1. Having completed this step, WBS2 then sorts all thus-obtained change-point candidates in the decreasing order of their corresponding absolute CUSUMs. The resulting object is referred to as the WBS2 solution path. Algorithmically, the main ingredient of the computation of the WBS2 solution path is the routine WBS2.Sol.Path, defined recursively as follows.

function WBS2.Sol.Path(, , )
if  then
STOP
end if
if  then

Draw all intervals , , s.t. .
else
Randomly draw intervals , , s.t.
and drawn uniformly and independently with replacement.
end if

Add to the (unsorted) solution path .
WBS2.Sol.Path(, , )
WBS2.Sol.Path(, , )
end function

The WBS2 solution path is computed in two stages, defined below.

1. (Identification of change-point candidates)

Initialise .

Add elements to the (unsorted) WBS2 solution path by executing the command WBS2.Sol.Path(, , ). We state in Section 3.1.2 that the length of the thus-obtained sequence is .

2. (Sorting of the WBS2 solution path)

Rearrange the elements of so that the sequence of its 4th components, denoted by , is non-increasing, breaking any ties at random. Denote the elements of the resulting (sorted) sequence by .

The philosophy of WBS2 is fundamentally different from WBS. In WBS, all intervals are pre-drawn before the launch of the procedure, and there needs to be a sufficient number of them for all change-points to be detectable from this single draw. By contrast, in WBS2, with each draw of intervals, we only hope to detect one of the remaining undetected change-points (if there are any). For this reason, will typically be much smaller than . We comment in Section 3.1.2 on the theoretical orders of magnitude of ; for now suffice it to say that we use in all numerical examples of this paper.

In comparison to WBS, the computation of the WBS2 solution path is data-adaptive. WBS2 does not use the data in drawing its intervals; by contrast, in WBS2, the location of the domain of each next batch of intervals is determined by the locations of the previously detected change-point candidates. One consequence of this computational adaptivity of WBS2 is that on average, the intervals drawn by WBS2 rapidly become shorter as the procedure progresses, as the procedure operates on shorter and shorter sub-domains of . This makes WBS2 fast; we comment on the theoretical computational complexity of the WBS2 solution path in Section 3.1.2 and actual computation times are given in Section 4.2.

Finally, we note that the computation of the WBS2 solution path requires no other parameters besides , and that it is natural for it to be defined (and indeed implemented) recursively.

#### 3.1.2 Theoretical properties of the WBS2 solution path

We introduce the following technical assumption.

###### Assumption 3.1
1. The sequence is i.i.d. .

2. With the additional notation , , the minimum spacing between change-point satisfies , where with .

3. for .

4. The magnitudes of the jumps satisfy .

These conditions are not strictly necessary for the procedure to work, and various extensions are possible, related to the (joint) distribution of

, the minimum distance between the change-point locations or the minimum magnitude of the jumps. However, the setting of Assumption 3.1 has the benefit of simplifying the technical exposition as much as possible, while still enabling the demonstration of the full mechanics of the procedure.

To facilitate the statement of the theorem below, we introduce the concept of ‘scale’ at which the WBS2 procedure operates as follows. Initially, the procedure operates on the domain , and we refer to this stage as scale 1. It then subdivides this domain into two and operates on the two sub-domains; this is scale 2. The scale parameter then increases by one with each subdivision.

We have the following result.

###### Theorem 3.1

Let follow model (1) and let Assumption 3.1 hold. Let and denote, respectively, the number and the locations of the change-points. Let be the solution path of the WBS2 algorithm, defined in Section 3.1.1. The following statements are true.

1. The length of is .

2. Let be the length of the longest sub-domain on which the WBS2 solution path algorithm operates at scale . Let . The computational complexity of the WBS2 solution path algorithm is .

3. Let be the 3rd components of the elements of . Sort in increasing order, and denote the thus-ordered sequence by . Define the event

 AT = {maxi=1,…,N|ηi−~ηi|≤C1(Δ)logT∩|~Xbksk,ek|∼T1/2fork=1,…,N ∩ |~Xbksk,ek|≤C2(Δ)log1/2Tfork=N+1,…,T−1},

where are positive constants that only depend on the positive parameter below, and the symbol means “is of the order of magnitude of”. We have

 P(AT)≥1−αT−12N(N+1)(2C1(Δ)logT+1)2(1−δ2/9)~M, (3)

where

 αT=1−P(BΔ,T)

with

 BΔ,T={∀1≤l≤m≤T|m−l+1|−1/2∣∣ ∣∣m∑i=lεi∣∣ ∣∣≤σ{2(1+Δ)logT}1/2}.

We now comment on several aspects of Theorem 3.1. Part (i) means that the WBS2 solution path procedure is complete in the sense of Definition 2.1. In part (ii), we investigate the magnitudes of and to gain a better idea of the computational complexity of the WBS2 solution path procedure. Beginning with , we now discuss the minimum number of draws required to ensure that in part (iii) converges to 1 suitably fast. To match the equivalent rate for WBS (see Theorem 3.2 in Fryzlewicz (2014)), we require

 12N(N+1)(2C1(Δ)logT+1)2(1−δ2/9)~M≤T−1,

which is equivalent to

 ~M≥log−1{(1−δ2/9)−1}(logT+log{N(N+1)/2}+2log(2C1(Δ)logT+1)),

and therefore . The maximum magnitude of depends on the location of the splits within each each interval on which the routine WBS2.Sol.Path is executed. In particular, we have the following result (the proof is straightforward, so we omit it).

###### Proposition 3.1

If, in each execution of the routine WBS2.Sol.Path, the split location is such that

 γ≤bm0−s+1e−s+1≤1−γ, (4)

for a certain fixed , then .

It is immediate that condition (4) is automatically satisfied if , i.e. for short intervals. From the proof of Theorem 3.1, it is also automatically satisfied (for small enough, with probability tending to 1 with ) for those intervals which contain previously undetected change-points. Since the locations of the splits in those intervals that do not contain previously undetected change-points is immaterial for the performance of the procedure, this in turn implies that an alternative WBS2 solution path procedure, in which the existing selection of were replaced by constrained maximisation of the form

 (m0,bm0):=argmaxm=1,…,~M,b∈{sm,…,em−1}|~Xbsm,em|s.t. condition (???)

would enjoy exactly the same properties as the original WBS2 solution path procedure, plus would come with a guarantee that . However, we refrain from discussing such a modified procedure further as it would require the choice of , an additional parameter.

Proposition 3.1 and the above discussion regarding the magnitude of imply the following corollary.

###### Corollary 3.1

Under the assumptions of Theorem 3.1 and Proposition 3.1, the computational complexity of the WBS2 solution path algorithm is .

Regarding part (iii) of Theorem 3.1, it is the statement of Lemma 1 in Yao (1988) that for all , at a rate that depends on the magnitude of . The uniform logarithmic bound on the magnitudes of the errors is the same as in WBS and is ‘near-optimal’ in the sense described in Fryzlewicz (2014).

We emphasise that part (iii) does not yet constitute a complete consistency result for the WBS2.SDLL procedure, as it does not specify a model selection process; it merely says that the model with the true number of estimated change-points is such that the locations of all change-points are estimated near-optimally. In Section 3.2, we will describe how to estimate consistently via SDLL and this will lead to a complete consistency result (for the number and locations of estimated change-points) for the WBS2.SDLL method.

### 3.2 “Steepest Drop to Low Levels” model selection

#### 3.2.1 Motivating example and methodology of SDLL model selection

This section introduces our new model selection criterion, labelled “Steepest Drop to Low Levels”, which will be used to select a model along the WBS2 solution path . Given , the model selection task is equivalent to indicating an estimate of (the number of change-points), as the estimated change-point locations will then be given by , if . This section will define our estimator . We start with a motivating example.

Example 3.1   As in Section 2.1 and Example 2.1, we consider the model , , where is the extreme.teeth` signal. However, the set-up in the current example is even more challenging in that we consider i.i.d. as compared to in Section 2.1 and Example 2.1. For a particular realisation of this model, we compute the WBS2 solution path with and plot the resulting sequence (Figure 4, top plot). Recall that by the construction of , this sequence is non-increasing.

The true number of change-points in is . A visual inspection of the top plot in Figure 4 reveals an increasing (negative) gradient in at around . This appears to align well with the assertion of part (iii) of Theorem 3.1, according to which, on a set of high probability, the elements of are of a higher order of magnitude () for than they are for (). This observation provides a basis for our procedure for estimating : our will be an estimator of the location of this change in orders of magnitude, or equivalently the location of this large negative gradient.

In order to perform the estimation of this location, it is natural to work on the log scale: we denote . The reason is that taking the differences leads to the cancellation of the terms for and the cancellation of the terms for (where is the largest for which for a certain constant ). Therefore, the terms for will be bounded in , whereas the single term will tend to infinity with at the speed of . As a consequence, it is tempting to consider setting to the value of for which is the largest. Our new model selection criterion does exactly that, but with a certain finite-sample correction, which we motivate next.

The bottom plot of Figure 4 shows the sequence for . In light of the above discussion, it is unsurprising to see a prominent peak in around (in this particular example, the location of this local peak is ). Yet, setting would have failed in this particular case, as the global maximum of is achieved for . However, one way of checking that leads to an infeasible model in this example is to examine the size of , the smaller term entering the difference . If were to be the true model size, then by part (iii) of Theorem 3.1 we would expect . We use a threshold to decide whether is indeed of this magnitude, and the form of this “universal” threshold is (where is the MAD estimate of with on input); this formula will be motivated in Sections 3.2.2 and 4.1. In this particular example, we have , and therefore is rejected as a feasible model. Our procedure then goes on to examine the next largest , which in this case is ; this leads to examination of , whose magnitude is well under . As a consequence, the model with 202 change-points is accepted and we set (over-estimating the true number of change-points by 3). (End of example.)

We now comment heuristically on some aspects of Example 3.1 before introducing a formal algorithmic description of our new “Steepest Drop to Low Levels” (SDLL) model selection procedure.

The name “Steepest Drop to Low Levels” has its origins in the fact that our model selection procedure does not merely look for the “steepest drop” (= largest negative gradient) in (this occurs at in the above example) but it searches for the steepest drop followed by “low levels” (= values of that fall under the threshold); this occurs at in the example.

We emphasise that the use of thresholding in the SDLL model selection is new and different from the classical use of thresholding in change-point detection problems (as in e.g. Fryzlewicz (2014)). In classical thresholding, a threshold is used as the main model selection device in the sense that its value is used to separate significant from insignificant change-point candidates. This happens in a “continuous” manner, in the sense that even small changes in the threshold value can lead to the exclusion or inclusion of additional candidate change-points and therefore to changes in the selected model. A satisfactory threshold should in theory be linear with , so due to this sensitivity of the selected model to the threshold value, the user should have the ability to estimate accurately, which is not always possible in frequent change-point settings, as illustrated earlier.

By contrast, SDLL uses thresholding only as a secondary model selection device, the main criterion being the size of the negative gradient in . In SDLL, thresholding is used “discretely” in the sense that the consecutive models being tested via thresholding are often very remote from each other in terms of their numbers of change-points; in the example above, we first tested a model with 3 change-points followed by a model with 202 change-points. In other words, we used thresholding not so much to “choose” a model, but to test the feasibility of models pre-ordered according to their respective negative gradients in (rather than according to their numbers of change-points). The model with 3 change-points was rejected as easily as the model with 202 change-points was accepted. This suggests that SDLL does not require the threshold to be selected as precisely as is the case in classical thresholding. This is potentially a useful feature of SDLL as it facilitates accurate model selection in the presence of frequent change-points, when the user is not always able to estimate (and hence the threshold) well.

We are now in a position to introduce a complete algorithm describing our SDLL model selection procedure. It is defined by the WBS2.SDLL routine below, which takes three parameters on input: is a WBS2 solution path, whose computation is outlined in Section 3.1.1; is a threshold; and is a constant. In the remainder of this paper, we use , and this choice will be motivated in Section 3.2.2; the choice of the constants and will be described in Section 4.1.

function WBS2.SDLL(, , )
if  then
STOP
end if
if  then

else

if  then

else
,
if  then

else

end if
end if

end if
end function

If the magnitudes of all fall under , the procedure returns zero change-points. Otherwise, the procedure extracts all those indices for which ; this is to ensure that we only work with those for which as motivated in Example 3.1. Within this range of , it then looks for the maximum negative gradient in , subject to the constraint that , and assigns to be that location; if the constraint is never satisfied, this necessarily means that the end of the range has been reached, in which case .

We now comment on a few important aspects of the SDLL model selection. As described earlier in the discussion of Example 3.1, SDLL is not a classical thresholding model selection procedure, but also does not used a penalty-based approach. Its execution, not including the call to the Sort routine, is of computational order . Under the assumptions of Theorem 3.2, the execution of the Sort routine is of computational order with high probability. Unlike some penalty-based approached, SDLL does not need to know the maximum number of change-points present in the data, in either theory or practice.

#### 3.2.2 Theoretical properties of SDLL model selection

In this section, we formulate a complete consistency theorem for the WBS2 method equipped with the SDLL model selection criterion, as regards the estimation of both the number and the locations of the change-points. We have the following result.

###### Theorem 3.2

Let follow model (1) and let Assumption 3.1 hold. Let and denote, respectively, the number and the locations of the change-points. Let be the solution path of the WBS2 algorithm, defined in Section 3.1.1. Let and be the estimates of the number and the locations (respectively) of the change-points, returned by the SDLL model selection criterion as defined by the WBS2.SDLL routine with , and on input, with and for a large enough constant . Define the events

 Θθ,T = {(1+θ)−1<^σT/σ<1+θ}, ET = {^N=N∩maxi=1,…,^N|ηi−~ηi|≤C1(Δ)logT},

where is a positive constant and