1 Introduction
A recent report (National Research Council, 2013) lists changepoint detection as one of the “inferential giants” (common building blocks in knowledge discovery) in massive data analysis. Changepoint analysis, be it a posteriori or online, is used in fields as diverse as bioinformatics (Olshen et al., 2004), econometrics and finance (Hansen, 2001; Andreou and Ghysels, 2002; Bai and Perron, 2003), climate and the natural environment (Reeves et al., 2007; Liu et al., 2010; Salarijazi et al., 2012; Pezzatti et al., 2013), autonomous driving (Galceran et al., 2015)
(Ranganathan, 2012), clinical neuroscience (Younes et al., 2014) and quality/reliability monitoring (D’Angelo et al., 2011; Huang and Lyu, 2011; Amiri and Allahyari, 2012).A common testing ground for a posteriori changepoint detection methods, and the focus of this paper, is the model
(1) 
in which is a deterministic, onedimensional, piecewiseconstant signal with changepoints whose number and locations are unknown. The sequence is random and such that is exactly or approximately zero. In the simplest case are modelled as i.i.d. , but can also have other marginal distributions and/or exhibit temporal dependence. The task is to estimate and under various assumptions on , the magnitudes of the jumps and the minimum permitted distance between the changepoint locations. This article focuses on the i.i.d. distribution for , as even in this simplest case, the above changepoint detection problem still poses major challenges for the state of the art, as this article illustrates.
Although performance measures for the above estimation task, naturally, vary across the literature, it can be anticipated that many users will have a preference for methods which

accurately estimate the number and locations of any changepoints present in , for both

signals with few/infrequent changepoints, and

signals with many/frequent changepoints;


are fast to execute and scale well to long signals.
Depending on the setting, other desirable features may include e.g. the existence of software, ease of implementation and transferability to other, more complex stochastic models. Even though much methodological and algorithmic progress has been made on the above changepoint detection problem in recent years, we show in this paper that very few (if any) stateoftheart techniques simultaneously possess the above desired features 1(a), 1(b) and 2. In particular, many techniques perform surprisingly poorly for signals with frequent changepoints. We first briefly describe the history of the multiple changepoint detection problem and the state of the art. For reasons of space, our literature review mainly focuses on the parametric case, i.e. models in which the distribution of can be modelled parametrically, with special emphasis on the i.i.d. Gaussian model for .
A major class of approaches to a posteriori multiple changepoint detection is one in which changepoints are found by minimising a criterion function consisting of a likelihoodtype (or leastsquares) term measuring the fit of the estimate to the data plus a penalty term to prevent overfitting. In this category, Yao and Au (1989) consider leastsquares estimation of for a fixed and i.i.d. noise. Braun et al. (2000)
extend this work to noise for which the variance is a function of the mean. In the Gaussian case, the SIC (BIC) is used to estimate an unknown but bounded
in Yao (1988), and a more general penalty but also linear in the number of changepoints appears in Lee (1995). For an unknown but bounded , Lavielle and Moulines (2000) consider penalised leastsquares estimation, with a penalty linear in the number of changepoints, for dependent ’s; see also Lavielle (1999) and Lavielle (2005). For a fixed , Pan and Chen (2006) propose a likelihood criterion with a penalty depending not only on the number, but also on the locations of changepoints, favouring more uniformlyspread estimated changepoints; related ideas appear in Zhang and Siegmund (2007). For an unknown , Lebarbier (2005) proposes leastsquares estimation with a penalty originating from the model selection approach of Birgé and Massart (2001). Boysen et al. (2009) use the leastsquares criterion with a linear penalty on the number of changepoints. More general forms of Schwarzlike penalties are studied, for example, in Wu (2008), Ciuperca (2011) and Ciuperca (2014). The SMUCE estimator of Frick et al. (2014) can also be cast in the penalised cost function framework (while SMUCE is shown to control the FWER, the related method of Li and Munk (2016) controls the FDR). An empirical Bayes approach to changepoint estimation in a marginal likelihood framework appears in Du et al. (2016). The PELT algorithm by Killick et al. (2012) and the pruned dynamic programming by Rigaill (2015) accelerate computation of some penalised changepoint estimators to linear time in bestcase scenarios (while retaining quadratic speed in worstcase ones), and offer fast implementations. Related ideas appear also in Maidstone et al. (2017).Other attempts to reduce the computational complexity of the problem include Davis et al. (2006)
, who (in a time series setting) use a genetic algorithm to minimise a Minimum Description Length criterion, and
Harchaoui and LévyLeduc (2010) who consider the leastsquares criterion with a total variation penalty, which enables them to use the LARS algorithm of Efron et al. (2004) to compute the solution intime. However, the total variation penalty is not an optimal one for changepoint detection (in the sense of balancing out typeI and typeII errors, as described in
Brodsky and Darkhovsky (1993) and Cho and Fryzlewicz (2011)). The total variation penalty is also considered in the context of peak/trough detection by Davies and Kovac (2001), who propose the ‘taut string’ approach for fast computation. In the context of multiple changepoint detection, it is considered by Rinaldo (2009) (with a subsequent correction published on the author’s web page) and Rojas and Wahlberg (2014) as part of the fused lasso penalty, proposed by Tibshirani et al. (2005) and equivalent to taut string in model (1). Lin et al. (2017) propose a generic postprocessing procedure for any consistent, piecewiseconstant estimator of , including in particular the totalvariationbased one, which yields consistency for the estimators of under certain assumptions on the distances between the changepoints and the jump sizes. Similar results for general “trend filtering” (Tibshirani, 2014) estimators appear in Guntuboyina et al. (2018).Our experience is that many penaltybased methods in which the penalty has been preset by the user (rather than having been chosen adaptively from the data) can struggle to offer uniformly good performance across both signals with infrequent changepoints, and signals with frequent/numerous ones; in particular, we illustrate in Section 2.1 that the popular BIC (Yao, 1988) and mBIC (Zhang and Siegmund, 2007) penalties can perform poorly for signals with frequent changepoints. An interesting approach to datadriven penalty selection is proposed in Birgé and Massart (2006), who define the socalled minimal penalty; two approaches to estimating this quantity, referred to as ‘dimension jump’ and ‘datadriven slope estimation’ are described in Baudry et al. (2012). Both offer impressive performance for short signals with a limited number of jumps, but their performance degrades for signals with frequent jumps (more so for the dimension jump approach). There is also computational price to pay for selecting the penalty in a datadriven way: the ‘datadriven slope estimation’ approach is particularly slow for long signals. We illustrate these points in Section 4.2. With the exception of this approach, the existing literature does not appear to provide guidance as to the appropriate choice of a penalty for a given dataset.
Another class of approaches to the multiple changepoint detection problem is one in which changepoints are estimated one by one, in the order in which they appear in the data. Huskova and Slaby (2001) and Eichinger and Kirch (2018) propose the “moving sum” (MOSUM) technique, which requires the choice of an extra bandwidth parameter, but the latter requirement is circumvented in the mosum R package (Meier et al., 2018), which produces estimates combined over a range of automatically chosen bandwidths. The pseudosequential procedure of Venkatraman (1992)
, as well as the method of Ross (2015) are based on an adaptation of online detection algorithms to a posteriori situations and work by bounding the Type I error rate of falsely detecting changepoints. The ‘IsolateDetect’ method of
Anastasiou and Fryzlewicz (2018a) is also pseudosequential in nature, avoids to some extent the issue of bandwidth/span selection, and offers particularly fast implementation for signals with frequent changepoints. However, from the performance point of view, such signals tend to pose major difficulties for many methods in this category, an issue magnified by the fact that many such techniques rely critically on the accurate estimation of a threshold against which to judge the significance of each changepoint candidate, but the magnitude of this threshold can be difficult to estimate well in the presence of many changepoints. We demonstrate this phenomenon in Section 2.3.2.Binary segmentation is one of the most popular approaches to multiple changepoint detection. Heuristically, it works by fitting the best model with a single changepoint to data
, and if the estimated changepoint is deemed to be significant, the same procedure is then repeated to the left and to the right of the estimate. Binary segmentation is fast, conceptually simple and easy to implement, but tends not to work well for signals with frequent changepoints, the main reason being that for such signals there are no guarantees that the singlechangepoint model fit at each stage of the procedure will behave well in the presence of multiple changepoints in the data segment considered. Possibly the first work to propose binary segmentation in a stochastic process setting is Vostrikova (1981), who shows consistency of binary segmentation for the number and locations of changepoints for a fixed . Venkatraman (1992) offers a proof of the consistency of binary segmentation for and the changepoint locations, even for increasing with , albeit with suboptimal assumptions and rates for the locations. In a setting similar to Vostrikova (1981) (for a fixed and with following a linear process), binary segmentation is considered in Bai (1997). Chen et al. (2011) provide a proof of consistency of binary segmentation for the number of changepoints in the case of a fixed and i.i.d. Gaussian. Binary segmentation is used for univariate time series segmentation in Fryzlewicz and Subba Rao (2014) and Cho and Fryzlewicz (2012), and for multivariate, possibly highdimensional time series segmentation in Cho and Fryzlewicz (2015). Circular Binary Segmentation (Olshen et al., 2004; Venkatraman and Olshen, 2007), Wild Binary Segmentation (Fryzlewicz, 2014) and the NarrowestOverThreshold method (Baranowski et al., 2018) are designed to improve the performance of binary segmentation. We return to Wild Binary Segmentation later in this section.A number of techniques do not directly fall into any of the above categories. Wang (1995) uses the fast discrete wavelet transform to detect changepoints. Muggeo and Adelfio (2011) take the cumulative sum of and then apply the methodology described in Muggeo (2003) in the resulting piecewiselinear framework. An agglomerative, bottomup, “tailgreedy” method is proposed in Fryzlewicz (2018). Matteson and James (2014) present a heuristic agglomerative algorithm for multiple changepoint detection as a computationally attractive alternative to their divisive one. An early review of some multiple changepoint detection methods (in the context of DNA segmentation, but applicable more widely) appears in Braun and Mueller (1998).
Wild Binary Segmentation (WBS, Fryzlewicz, 2014) is one of the stateoftheart procedures for multiple changepoint detection in model (1), for signals with a small or moderate number of changepoints. Truong et al. (2018) give WBS “three stars” (their highest grade) for scalability to long signals, and Wang et al. (2018) show certain assumption and errorrateoptimality of WBS. WBS is designed to deal with the shortcomings of binary segmentations (described above) in the following manner. At the start of the procedure, subsamples of the data, living on intervals , are drawn, with the start and endpoints and selected randomly, independently, uniformly and with replacement. The singlechangepoint model fit is then performed on each subsample, and the first changepoint candidate is selected as the one corresponding to the best fit over the subsamples. The hope is that if is sufficiently large, then this best fit will come from a subsample containing at most one changepoint and will therefore lead to the accurate estimation of the given changepoint. If the first detected changepoint is considered significant, the same procedure is repeated to the left and to the right of it, reusing (for speed) the previously drawn subsamples.
Notwithstanding this improvement over plain binary segmentation, we show in Section 2.1 that the performance of WBS deteriorates dramatically for signals with frequent changepoints (as does that of many other stateoftheart methods). The roots of its poor performance for such signals are multiple, and can be summarised as follows.

The recommended default number of subsamples drawn by the WBS (the R packages breakfast (Fryzlewicz, 2017) and wbs (Baranowski and Fryzlewicz, 2015) recommend not exceeding 20000) can be insufficient to ensure adequate performance of WBS for signals with frequent changepoints. Indeed, good performance can only be hoped for if
is large enough for the drawn subsamples to include (with a high probability) ones with start and endpoints immediately to the left and to the right of each changepoint; this means that the required
can easily be very large for signals with frequent changepoints. Increasing the default to ensure this is not a viable solution as this would increase the computation time beyond what is acceptable (and is not needed for signals with no or few changepoints). In the remainder of this article, we refer to this issue as the lack of computational adaptivity of the WBS, in the sense that the procedure does not decide automatically the number or the locations of the subsamples drawn. 
A related point is that for WBS to be able to produce an entire solution path to problem (1) (i.e. estimated models with changepoints), we would have to draw all possible subsamples of the data. This would result in a procedure of a cubic computational complexity in , which is computationally inadmissible. Any value of less than this upper limit would not guarantee the computability of the entire solution path. If a solution path algorithm in multiple changepoint detection problems does not produce an entire solution path, we refer to this algorithm as incomplete. The use of incomplete solution path algorithms is particularly risky in frequent changepoint scenarios, as it may lead to obtaining a partial solution path that is shorter than the true number of changepoints, in which case any model selection criterion, however good, will not be able to recover all the changepoints. We illustrate this important phenomenon in Section 2.3.1.

Fryzlewicz (2014) proposes two model selection criteria for the WBS: one based on thresholding, and the other based on the use of the sSIC (a penalty close to the standard BIC/SIC). Both approaches fail in frequent changepoint scenarios. Thresholding does so because its success critically depends on the user’s ability to accurately estimate the variance of the noise , but this task can be difficult in frequent changepoint settings, which we show in Section 2.3.2. BIC fails because it places too heavy a penalty on frequentchangepoint solutions to the estimation problem 1. The failure of BIC in frequent changepoint scenarios occurs for other changepoint search algorithms too (see Section 2.1), not just WBS.
The objective of this paper is to introduce the “Wild Binary Segmentation 2” (WBS2) solution path algorithm for multiple changepoint problems, and the “Steepest Drop to Low Levels” (SDLL) model selection for WBS2. The resulting multiple changepoint detection method, labelled WBS2.SDLL, addresses the above shortcomings of WBS in two separate ways.
Firstly, WBS2 produces an entire solution path, i.e. sequences of solutions to the changepoint detection problem with changepoints. Among them, WBS2 guarantees (with probability approaching one with ) that the solution with the true number of changepoints is such that the estimated changepoint locations are nearoptimally close to the true changepoints . Because it produces the entire solution path, the WBS2 is (unlike WBS), by defintion, a complete procedure, and its completeness is achieved in a computationally efficient way as follows. In the first pass through the data, WBS2 draws only a small number of data subsamples, and marks the first changepoint candidate as the argmax of the resulting absolute CUSUMs. It then uses this changepoint candidate to split its domain of operation into two, and again recursively draws
subsamples to the left and to the right of this changepoint candidate, and so on. Therefore, WBS2 adaptively decides where to recursively draw the next subsamples, based on the changepoint candidates detected so far (to reexpress this is the language of machine learning, WBS2 learns the locations of the subsamples to draw automatically and “on the fly”, i.e. as it proceeds through the data and discovers the signal). This is in contrast to the standard WBS, which (nonadaptively) predraws all its
subsamples at the start of the procedure. We note that in practice, is of orders of magnitude smaller than ; in the numerical sections of this paper we will be using . WBS2 draws subsamples on every current subdomain of the data as long as its length is , which ensures completeness. By contrast, WBS (nonadaptively) predraws all its subsamples at the start of the procedure, which leads to its incompleteness except if all possible subsamples are drawn, which is computationally infeasible for even moderately long signals.Secondly, the proposed SDLL procedure for estimating from the WBS2 solution path does not use a penalty, and only uses thresholding as a certain secondary check, which is in contrast to the use of thresholding as a primary model selection tool in WBS. This means that the magnitude of the threshold in WBS2.SDLL does not need to be estimated to a high degree of accuracy (which is difficult to do in frequent changepoint scenarios). SDLL works by monitoring the ratios of the the pairs of consecutive maximum absolute CUSUM statistics returned by the WBS2 solution path, arranged in decreasing order. The procedure then selects the largest ratio for which the denominator falls under a prespecified threshold. This is motivated by the fact that those absolute maximised CUSUMs that carry information about the changepoints have the same order of magnitude between them, which is larger than the order of magnitude of those CUSUMs that correspond to the nochangepoints sections of the data. The presence of this “drop” in the magnitude of the absolute maximised CUSUMs gives the SDLL criterion its name.
Our numerical experiments suggest that unlike the standard WBS, the resulting WBS2.SDLL procedure offers good detection accuracy and fast execution times for signals with frequent changepoints. We demonstrate that in doing so, WBS2.SDLL significantly outperforms the state of the art for this signal class. Unlike many of its competitors, WBS2.SDLL does not require the choice of a penalty or the specification of the maximum number of changepoints, and is easy to calibrate to return zero estimated changepoints for constant signals with a required probability. Both the WBS2 and the SDLL components are conceptually simple and easy to code.
The paper is organised as follows. In Section 2, we illustrate the poor performance of the state of the art for signals with frequent changepoints, and we investigate the reasons for WBS’s poor performance. Section 3 motivates and proposes WBS2 and SDLL and investigates their theoretical properties. Our practical recommendations and a comparative simulation study are in Section 4. Section 5 describes the application of WBS2.SDLL to a London house price dataset. The proofs of our theoretical results are in the Appendix.
2 Motivation
2.1 Frequent changepoints: performance of the state of the art
To motivate the need for our new methodology, we start by illustrating the poor performance of WBS and other
stateoftheart multiple changepoint detection methods for signals with frequent changepoints, using
the example of the extreme.teeth
signal defined below.
We use this signal as a running example throughout the paper to motivate several of our developments,
which are then however shown to hold for a wide class of signals both in theory and in practice.
Throughout the paper, we work with methods available from the R
CRAN repository, and for each method tested, we use the default parameter settings unless stated
otherwise. For techniques that require the specification of the maximum number of changepoints, we set this to the
(rounded) length of the input signal divided by three; in all our examples, the true number of changepoints is (well) within
the interior of this range. Our R code enabling replication of the results from this paper is available from
https://github.com/pfryz/wildbinarysegmentation2.0. We now list the techniques tested
in addition to our new WBS2.SDLL methodology.

MOSUM: the multiscale MOSUM technique with localised pruning from the R package mosum (Meier et al., 2018). (We do not give results for the multiscale MOSUM technique with bottomup bandwidthbased merging from the same package as it frequently returned warnings in the examples we tested.)

S3IB: the technique from the R package Segmentor3IsBack (Cleynen et al., 2016). It implements a fast algorithm for minimising the leastsquares cost function for changepoint detection, as described in Rigaill (2015). The best model is selected via the “oracle” criterion of Lebarbier (2005), which uses the socalled slope heuristics.

WBSC1.0, WBSC1.3 and WBSBIC: the Wild Binary Segmentation methods with the model selection determined via thresholding with constants and and the BIC (respectively), as specified in Fryzlewicz (2014) and implemented in the R package breakfast (Fryzlewicz, 2017). See Section 2.2 for their definitions.
We do not include the (nonparametric) methodology implemented in the R package ecp (James and Matteson, 2014) as this procedure does not appear to be able to return zero estimated changepoints. We also do not provide results for the R package strucchange (Zeileis et al., 2002) because of the slow speed of the methodology implemented therein.
Our example in this section is the signal , referred to as extreme.teeth
and
defined as follows: if ; if .
We simulate 100 independent copies of in model (1) with being the extreme.teeth
signal and . Note that , the true number of changepoints in ,
equals 199. Figure 1 shows the first 100 observations of the noiseless and noisy .
Figure 2 provides evidence that most of the techniques listed above fail somewhat dramatically for this model, with all but three being able to estimate practically none or only a small proportion of the changepoints. The three best performers are DDSE, and two versions of our method, labelled WBS2.SDLL(0.9) and WBS2.SDLL(0.95) (we provide the details later). The DDSE, however, is over three times slower than WBS2.SDLL and estimates zero changepoints for two out of the 100 sample paths, which results in compared to . We return to DDSE and the other techniques in the comparative simulation study in Section 4.2. WBSC1.0, WBSC1.3, WBSBIC are among the failing techniques and we now examine the reasons for their failure.
2.2 Wild Binary Segmentation
We first review WBS as proposed in Fryzlewicz (2014). Throughout the paper, our main “locator” statistic for indicating the locations of changepoint candidates is the CUSUM statistic, defined, for the data portion , as
(2) 
where , with . It is used in different ways in WBS and WBS2. In both these algorithms, we use as the most likely location of a changepoint in . This is the same as the location indicated by fitting the best possible piecewiseconstant function with one changepoint to via least squares. Therefore, the use of the CUSUM statistic is equivalent to the use of maximum likelihood in model (1) in which i.i.d. .
Denote by a set of random intervals , , whose start and endpoints have been drawn (independently with replacement) uniformly from the set . How many intervals to draw is decided by the user (Fryzlewicz (2014) and version 1.0.0 of the R package breakfast (Fryzlewicz, 2017) recommend and , respectively, for signal lengths not exceeding a few thousand). An appropriate choice of is key: on the whole, the larger the value of , the more accurate but slower the procedure. Using pseudocode, the main function of the WBS algorithm, with model selection performed via thresholding with threshold , is defined in the following way.
The WBS procedure (with model selection via thresholding with threshold ) is launched by the call WBS(, , ). The WBSC1.0 and WBSC1.3 versions, singled out by Fryzlewicz (2014) and used in Section 2.1 above, use, respectively, and , where is the Median Absolute Deviation (MAD) estimator of , computed using on input. An alternative implementation of WBS with thresholding first computes all possible changepoint candidates via the call WBS(, , ) and then tests them, as is done in the definition of the routine WBS(, , ), against the threshold .
As an alternative to thresholding, Fryzlewicz (2014) also proposes model selection via the “strengthened Schwarz Information Criterion” (sSIC). The sSIC penalty is close to the standard SIC (BIC) penalty, and therefore we do not review it here in detail. If an estimated changepoint arises via the maximisation of over , we refer to this estimate as in this paragraph. The WBSBIC version of the procedure, used in Section 2.1, initially computes all possible changepoint candidates via the call WBS(, , ). It then sorts them in the order of decreasing magnitudes of . Let be the indices (i.e. values of the indexing variable ) corresponding to this sorted order. The WBSBIC returns as the estimated changepoints the set that yields the minimum of the BIC.
The key difference between the WBS and standard binary segmentation algorithms is that each changepoint candidate is found by WBS by examining all intervals contained within (rather than examining only itself, as in binary segmentation), and finding the overall argmax of the CUSUM statistics fitted on each such interval . The reason why this improves on the standard binary segmentation is that if is large enough, each true changepoint is guaranteed (with probability approaching one in ) to be contained within at least one of the intervals which contains only that single changepoint. This alone provides suitable guarantees for the size of the largest maximum of the CUSUM statistics, and is enough to yield substantial theoretical and practical advantages of WBS over binary segmentation, as similar guarantees are unavailable in the latter method. The collection of intervals is only drawn once at the start of the WBS procedure; therefore, previouslydrawn intervals are reused as the algorithm progresses.
2.3 Issues affecting Wild Binary Segmentation
2.3.1 Incompleteness of WBS
With the definition of the routine WBS(, , ) as in Section 2.2, the call WBS(, , ) returns all possible changepoint candidates in the sense that with , the changepoint candidates are not tested against a threshold before they enter the output. Any model selection procedure can then be applied to this output to identify the final set of estimated changepoints.
The call WBS(, , ) is defined recursively. The equivalent nonrecursive procedure for producing the same set of changepoint candidates can be described as follows. For each , let . Initially let and . While , repeat the following pair of operations.

(Identification of next changepoint candidate)

(Removal of all intervals containing this changepoint candidate)
At the end of this process, the set contains the same set of changepoint candidates as the output of WBS(, , ). To see the equivalence between these two observe that the two recursive steps WBS(, , ) and WBS(, , ) in the definition of the WBS routine are equivalent to: after the recording of each changepoint candidate , remove from consideration all those intervals that contain , and only operate on those intervals that fall entirely within and . In the above nonrecursive definition of , this removal is explictly performed in stage 2. The removal is essential for the procedure to avoid repeated estimation of the same changepoint multiple times.
Let (where returns the cardinality of its argument when it is a set). can be viewed as an ordered set in the following sense. Let be the order in which the 5tuples enter . The sequence is nonincreasing, as its each next element is a maximum over the progressively shrinking set . In the remainder of this paper, we treat as an ordered set with this particular ordering. Therefore, the ordered set can be viewed as a WBS “solution path”: the changepoint candidates contained in are arranged in order of decreasing importance (i.e. decreasing magnitude of ), and therefore any reasonable WBS model selection procedure would typically choose a certain number of the first components of as the final set of estimated changepoints.
As a result of the repeated removal of intervals from the set , the cardinality of the solution path is typically much smaller than the cardinality of the initial set . This is illustrated in the following example.
Example 2.1 Consider a sample path simulated from the model , , where
is the extreme.teeth
signal defined in Section 2.1 and and
has been simulated in R on setting the random seed to 1. We simulate (a value recommended
in Fryzlewicz (2014)) intervals . On completion of the computation of
as described in this section, we obtain . As the length of the solution path
is less than 199, the true number of changepoints in , no model selection procedure is able to recover
all the changepoints in this example: indeed, even if we were to accept all changepoint candidates in ,
we would only estimate 119 changepoints. Therefore, WBS as proposed in Fryzlewicz (2014) fails in this case. (End of example.)
The above example illustrates that too small a value of can prevent WBS from working well for signals with frequent changepoints, regardless even of the model selection procedure used. Increasing the default value of in the hope of obtaining a longer solution path is not a viable option as it can easily slow down the WBS procedure beyond what is acceptable (in particular, drawing all possible subintervals of leads to WBS having computational speed ) and is unnecessary for signals with no or few changepoints, which require far fewer intervals. In other words, WBS does not adapt to the data in terms of the number or the locations of the intervals it draws. From the user perspective, this lack of adaptivity creates two related major practical issues: (a) in the absence of prior information about the number and frequency of changepoints, it is unclear whether any particular value of is large enough in practice, and (b) setting to be very large to be “on the safe side” is not an option from the computational point of view.
We formalise the issue of the length of solution paths in the language of “completeness”, which we now introduce.
Definition 2.1
Let be a solution path algorithm in a multiple changepoint detection problem which produces a sequence of candidate models with estimated changepoints. We refer to as complete if , where is the length of the input data, and incomplete otherwise.
Note that means that every time point is a changepoint candidate, so that the candidate model with is maximal. The definition of completeness leads to the following proposition for WBS. The proof is straightforward, so we omit it.
Proposition 2.1
The solution path obtained via WBS as described in this section is guaranteed to be complete in the sense of Definition 2.1 if and only if represent all possible subintervals of , in which case . Furthermore, if , then the WBS solution path procedure is of computational order .
We argue that completeness is a desirable feature of the solution path component of any changepoint detection procedure if it is to be applied in frequent changepoint scenarios, as the presence of completeness guarantees that we avoid situations (such as the one illustrated in this section) in which the solution path is too short to enable estimation of all the changepoints. Completeness embodies the idea that no changepoint candidate should be dismissed as inadmissible at the “solution path” stage: this should only happen at the model selection stage. For both scalability and good performance in frequent changepoint scenarios, it is important for a solution path procedure to be both fast and complete, and we have shown in this section that WBS does not simultaneously possess these two features.
2.3.2 Shortcomings of thresholding and BIC as model selection for WBS
Fryzlewicz (2014) proposes two model selection criteria for WBS: one based on thresholding, and the other based on the sSIC, an information criterion for all practical purposes equivalent to the standard SIC (a.k.a. BIC). In this section, we argue that both can easily be inadequate in frequent changepoint scenarios (independently of the issue of completeness described in Section 2.3.1).
The success of thresholding as a model selection criterion for WBS critically relies on the user’s ability to accurately estimate . The following example illustrates that this task can be difficult to achieve in frequent changepoint models.
Example 2.2 Consider 100 sample paths simulated from the model , , where
is the extreme.teeth
signal defined in Section 2.1. and
. For each simulated sample path, we evaluate the Median Absolute Deviation (MAD) and InterQuartile Range (IQR) estimators (two commonly used robust estimators) of
calibrated for the Gaussian distribution, computed with
on input. Figure 3 shows that both estimators are heavily biased upwards on this example and overestimate by around 25%. The degree of this overestimation prevents thresholding from being an effective model selection tool in this model. (End of example.)In addition, there is ample empirical evidence that the BIC is a poor model selector in frequent changepoint scenarios, also
for search algorithms other than WBS. For instance, we showed in Section 2.1 that the PELT method (Killick et al., 2012)
equipped with the BIC stopping rule failed on the extreme.teeth
example. In brief, this is due to the fact that the
BIC term places too heavy a penalty on frequentchangepoint solutions.
It is tempting to ask at this point whether choosing a penalty adaptively from the data (so that possibly different penalties are applied in infrequent and frequent changepoint scenarios) may provide a successful remedy to this issue. Based on our empirical experience, the best such approach that we have come across is the “datadriven slope estimation” technique implemented in the R package capushe (Arlot et al., 2016) and described in Baudry et al. (2012). However, we show in Section 4.2 that this approach scales poorly to long signals and is comfortably outperformed by WBS2.SDLL in the frequent changepoint examples we tested.
3 Wild Binary Segmentation 2 and Steepest Drop to Low Levels
3.1 WBS2: dataadaptive computation of complete solution path
3.1.1 Computation of the WBS2 solution path
This section describes how the WBS2 solution path is computed. This is the first of the two ingredients of the WBS2.SDLL procedure; the other is the application of the new “Steepest Drop to Low Levels” model selection criterion, described in Section 3.2.
We start with a heuristic description, followed by a formal algorithmic definition. In the first pass through the data, WBS2 draws a small number of data subsamples, with start and endpoints chosen randomly, uniformly and independently with replacement from the set . WBS2 marks the first changepoint candidate as the argmax of the resulting absolute CUSUMs. It then uses this changepoint candidate to split its domain of operation into two, and again recursively draws subsamples to the left and to the right of this changepoint candidate, and so on. Eventually, on short subdomains, the number of all possible subsamples will be less than , in which case WBS2 will draw all possible subsamples of each such subdomain. WBS2 stops and takes no action on a given current subdomain only if its length is 1. Having completed this step, WBS2 then sorts all thusobtained changepoint candidates in the decreasing order of their corresponding absolute CUSUMs. The resulting object is referred to as the WBS2 solution path. Algorithmically, the main ingredient of the computation of the WBS2 solution path is the routine WBS2.Sol.Path, defined recursively as follows.
The WBS2 solution path is computed in two stages, defined below.

(Identification of changepoint candidates)
Initialise .
Add elements to the (unsorted) WBS2 solution path by executing the command WBS2.Sol.Path(, , ). We state in Section 3.1.2 that the length of the thusobtained sequence is .

(Sorting of the WBS2 solution path)
Rearrange the elements of so that the sequence of its 4th components, denoted by , is nonincreasing, breaking any ties at random. Denote the elements of the resulting (sorted) sequence by .
The philosophy of WBS2 is fundamentally different from WBS. In WBS, all intervals are predrawn before the launch of the procedure, and there needs to be a sufficient number of them for all changepoints to be detectable from this single draw. By contrast, in WBS2, with each draw of intervals, we only hope to detect one of the remaining undetected changepoints (if there are any). For this reason, will typically be much smaller than . We comment in Section 3.1.2 on the theoretical orders of magnitude of ; for now suffice it to say that we use in all numerical examples of this paper.
In comparison to WBS, the computation of the WBS2 solution path is dataadaptive. WBS2 does not use the data in drawing its intervals; by contrast, in WBS2, the location of the domain of each next batch of intervals is determined by the locations of the previously detected changepoint candidates. One consequence of this computational adaptivity of WBS2 is that on average, the intervals drawn by WBS2 rapidly become shorter as the procedure progresses, as the procedure operates on shorter and shorter subdomains of . This makes WBS2 fast; we comment on the theoretical computational complexity of the WBS2 solution path in Section 3.1.2 and actual computation times are given in Section 4.2.
Finally, we note that the computation of the WBS2 solution path requires no other parameters besides , and that it is natural for it to be defined (and indeed implemented) recursively.
3.1.2 Theoretical properties of the WBS2 solution path
We introduce the following technical assumption.
Assumption 3.1

The sequence is i.i.d. .

With the additional notation , , the minimum spacing between changepoint satisfies , where with .

for .

The magnitudes of the jumps satisfy .
These conditions are not strictly necessary for the procedure to work, and various extensions are possible, related to the (joint) distribution of
, the minimum distance between the changepoint locations or the minimum magnitude of the jumps. However, the setting of Assumption 3.1 has the benefit of simplifying the technical exposition as much as possible, while still enabling the demonstration of the full mechanics of the procedure.To facilitate the statement of the theorem below, we introduce the concept of ‘scale’ at which the WBS2 procedure operates as follows. Initially, the procedure operates on the domain , and we refer to this stage as scale 1. It then subdivides this domain into two and operates on the two subdomains; this is scale 2. The scale parameter then increases by one with each subdivision.
We have the following result.
Theorem 3.1
Let follow model (1) and let Assumption 3.1 hold. Let and denote, respectively, the number and the locations of the changepoints. Let be the solution path of the WBS2 algorithm, defined in Section 3.1.1. The following statements are true.

The length of is .

Let be the length of the longest subdomain on which the WBS2 solution path algorithm operates at scale . Let . The computational complexity of the WBS2 solution path algorithm is .

Let be the 3rd components of the elements of . Sort in increasing order, and denote the thusordered sequence by . Define the event
where are positive constants that only depend on the positive parameter below, and the symbol means “is of the order of magnitude of”. We have
(3) where
with
We now comment on several aspects of Theorem 3.1. Part (i) means that the WBS2 solution path procedure is complete in the sense of Definition 2.1. In part (ii), we investigate the magnitudes of and to gain a better idea of the computational complexity of the WBS2 solution path procedure. Beginning with , we now discuss the minimum number of draws required to ensure that in part (iii) converges to 1 suitably fast. To match the equivalent rate for WBS (see Theorem 3.2 in Fryzlewicz (2014)), we require
which is equivalent to
and therefore . The maximum magnitude of depends on the location of the splits within each each interval on which the routine WBS2.Sol.Path is executed. In particular, we have the following result (the proof is straightforward, so we omit it).
Proposition 3.1
If, in each execution of the routine WBS2.Sol.Path, the split location is such that
(4) 
for a certain fixed , then .
It is immediate that condition (4) is automatically satisfied if , i.e. for short intervals. From the proof of Theorem 3.1, it is also automatically satisfied (for small enough, with probability tending to 1 with ) for those intervals which contain previously undetected changepoints. Since the locations of the splits in those intervals that do not contain previously undetected changepoints is immaterial for the performance of the procedure, this in turn implies that an alternative WBS2 solution path procedure, in which the existing selection of were replaced by constrained maximisation of the form
would enjoy exactly the same properties as the original WBS2 solution path procedure, plus would come with a guarantee that . However, we refrain from discussing such a modified procedure further as it would require the choice of , an additional parameter.
Proposition 3.1 and the above discussion regarding the magnitude of imply the following corollary.
Corollary 3.1
Regarding part (iii) of Theorem 3.1, it is the statement of Lemma 1 in Yao (1988) that for all , at a rate that depends on the magnitude of . The uniform logarithmic bound on the magnitudes of the errors is the same as in WBS and is ‘nearoptimal’ in the sense described in Fryzlewicz (2014).
We emphasise that part (iii) does not yet constitute a complete consistency result for the WBS2.SDLL procedure, as it does not specify a model selection process; it merely says that the model with the true number of estimated changepoints is such that the locations of all changepoints are estimated nearoptimally. In Section 3.2, we will describe how to estimate consistently via SDLL and this will lead to a complete consistency result (for the number and locations of estimated changepoints) for the WBS2.SDLL method.
3.2 “Steepest Drop to Low Levels” model selection
3.2.1 Motivating example and methodology of SDLL model selection
This section introduces our new model selection criterion, labelled “Steepest Drop to Low Levels”, which will be used to select a model along the WBS2 solution path . Given , the model selection task is equivalent to indicating an estimate of (the number of changepoints), as the estimated changepoint locations will then be given by , if . This section will define our estimator . We start with a motivating example.
Example 3.1 As in Section 2.1 and Example 2.1, we consider the model ,
, where is the extreme.teeth
signal.
However, the setup in the current example is even more challenging in that we consider i.i.d.
as compared to in Section 2.1 and Example 2.1.
For a particular realisation of this model, we compute the WBS2 solution path with
and plot the resulting sequence (Figure 4, top plot). Recall that by the construction of
, this sequence is nonincreasing.
The true number of changepoints in is . A visual inspection of the top plot in Figure 4 reveals an increasing (negative) gradient in at around . This appears to align well with the assertion of part (iii) of Theorem 3.1, according to which, on a set of high probability, the elements of are of a higher order of magnitude () for than they are for (). This observation provides a basis for our procedure for estimating : our will be an estimator of the location of this change in orders of magnitude, or equivalently the location of this large negative gradient.
In order to perform the estimation of this location, it is natural to work on the log scale: we denote . The reason is that taking the differences leads to the cancellation of the terms for and the cancellation of the terms for (where is the largest for which for a certain constant ). Therefore, the terms for will be bounded in , whereas the single term will tend to infinity with at the speed of . As a consequence, it is tempting to consider setting to the value of for which is the largest. Our new model selection criterion does exactly that, but with a certain finitesample correction, which we motivate next.
The bottom plot of Figure 4 shows the sequence for . In light of the above discussion, it is unsurprising to see a prominent peak in around (in this particular example, the location of this local peak is ). Yet, setting would have failed in this particular case, as the global maximum of is achieved for . However, one way of checking that leads to an infeasible model in this example is to examine the size of , the smaller term entering the difference . If were to be the true model size, then by part (iii) of Theorem 3.1 we would expect . We use a threshold to decide whether is indeed of this magnitude, and the form of this “universal” threshold is (where is the MAD estimate of with on input); this formula will be motivated in Sections 3.2.2 and 4.1. In this particular example, we have , and therefore is rejected as a feasible model. Our procedure then goes on to examine the next largest , which in this case is ; this leads to examination of , whose magnitude is well under . As a consequence, the model with 202 changepoints is accepted and we set (overestimating the true number of changepoints by 3). (End of example.)
We now comment heuristically on some aspects of Example 3.1 before introducing a formal algorithmic description of our new “Steepest Drop to Low Levels” (SDLL) model selection procedure.
The name “Steepest Drop to Low Levels” has its origins in the fact that our model selection procedure does not merely look for the “steepest drop” (= largest negative gradient) in (this occurs at in the above example) but it searches for the steepest drop followed by “low levels” (= values of that fall under the threshold); this occurs at in the example.
We emphasise that the use of thresholding in the SDLL model selection is new and different from the classical use of thresholding in changepoint detection problems (as in e.g. Fryzlewicz (2014)). In classical thresholding, a threshold is used as the main model selection device in the sense that its value is used to separate significant from insignificant changepoint candidates. This happens in a “continuous” manner, in the sense that even small changes in the threshold value can lead to the exclusion or inclusion of additional candidate changepoints and therefore to changes in the selected model. A satisfactory threshold should in theory be linear with , so due to this sensitivity of the selected model to the threshold value, the user should have the ability to estimate accurately, which is not always possible in frequent changepoint settings, as illustrated earlier.
By contrast, SDLL uses thresholding only as a secondary model selection device, the main criterion being the size of the negative gradient in . In SDLL, thresholding is used “discretely” in the sense that the consecutive models being tested via thresholding are often very remote from each other in terms of their numbers of changepoints; in the example above, we first tested a model with 3 changepoints followed by a model with 202 changepoints. In other words, we used thresholding not so much to “choose” a model, but to test the feasibility of models preordered according to their respective negative gradients in (rather than according to their numbers of changepoints). The model with 3 changepoints was rejected as easily as the model with 202 changepoints was accepted. This suggests that SDLL does not require the threshold to be selected as precisely as is the case in classical thresholding. This is potentially a useful feature of SDLL as it facilitates accurate model selection in the presence of frequent changepoints, when the user is not always able to estimate (and hence the threshold) well.
We are now in a position to introduce a complete algorithm describing our SDLL model selection procedure. It is defined by the WBS2.SDLL routine below, which takes three parameters on input: is a WBS2 solution path, whose computation is outlined in Section 3.1.1; is a threshold; and is a constant. In the remainder of this paper, we use , and this choice will be motivated in Section 3.2.2; the choice of the constants and will be described in Section 4.1.
If the magnitudes of all fall under , the procedure returns zero changepoints. Otherwise, the procedure extracts all those indices for which ; this is to ensure that we only work with those for which as motivated in Example 3.1. Within this range of , it then looks for the maximum negative gradient in , subject to the constraint that , and assigns to be that location; if the constraint is never satisfied, this necessarily means that the end of the range has been reached, in which case .
We now comment on a few important aspects of the SDLL model selection. As described earlier in the discussion of Example 3.1, SDLL is not a classical thresholding model selection procedure, but also does not used a penaltybased approach. Its execution, not including the call to the Sort routine, is of computational order . Under the assumptions of Theorem 3.2, the execution of the Sort routine is of computational order with high probability. Unlike some penaltybased approached, SDLL does not need to know the maximum number of changepoints present in the data, in either theory or practice.
3.2.2 Theoretical properties of SDLL model selection
In this section, we formulate a complete consistency theorem for the WBS2 method equipped with the SDLL model selection criterion, as regards the estimation of both the number and the locations of the changepoints. We have the following result.
Theorem 3.2
Let follow model (1) and let Assumption 3.1 hold. Let and denote, respectively, the number and the locations of the changepoints. Let be the solution path of the WBS2 algorithm, defined in Section 3.1.1. Let and be the estimates of the number and the locations (respectively) of the changepoints, returned by the SDLL model selection criterion as defined by the WBS2.SDLL routine with , and on input, with and for a large enough constant . Define the events
where is a positive constant and
Comments
There are no comments yet.