1 Introduction and Motivation
Change point detection refers to the problem of estimating the location of abrupt structural changes for data that are ordered e.g. in time, space or by the genome sequence. One can distinguish between online (sequential) and offline (retrospective) change point detection. We are interested in the latter setup, where one has a set of ordered observations and the data in the segments between the change points (also called break points) are assumed to be coming from some common underlying distribution. The goal is to estimate both the location and the number of the change points. Applications cover areas in biology (e.g. copy number variation in Olshen et al. (2004); or ion channels in Hotz et al. (2013)), finance (Kim et al., 2005), environmental science (e.g. climate data in Reeves et al. (2007); or environmental monitoring systems in Londschien et al. (2019)) and many more.
There are two common approaches to tackle change point detection algorithmically: Modifications of dynamic programming (see e.g. Friedrich et al. (2008); Boysen et al. (2009); Killick et al. (2012)) aim to solve a global optimisation problem, which often lead to statistically efficient methods even in a multiscale manner (Frick et al., 2014)
; Greedy procedures provide a heuristic to those (see e.g.Fang et al. (2019)) and are often based on binary segmentation (Vostrikova, 1981). Plain binary segmentation is usually fast and easy to adapt to various change point detection scenarios, but it is not optimal in terms of statistical estimation. Fryzlewicz (2014) proposed wild binary segmentation and Baranowski et al. (2019) proposed the similar narrowest over threshold method which improve on statistical detection, but lose some of the computational efficiency of plain binary segmentation. Both approaches draw a high number of random search intervals and in each the best split point is found greedily, leading to a set of candidate split points. The two methods differ in the way the final change point estimates are selected from these candidates. The computational time for both essentially depends on the total length (and hence the number) of the drawn intervals. In frequent change point scenarios one should draw a higher number of random intervals both from a theoretical and practical perspective. The problems that arise are the following. In practice one may not know whether the number of drawn search intervals is sufficient and in the worst case, when essentially all possible segments are drawn, the total length of considered intervals will be cubic. Hence, as one considers all possible split points within the drawn intervals, going through all of them can already result in a computational cost that is cubic (e.g. in univariate Gaussian change in mean setups) or possibly even worse if some complicated model fits are used for the investigated problem.
We propose seeded binary segmentation as a generic approach for fast and flexible change point detection in large scale problems. Our approach is similar to wild binary segmentation and the narrowest over threshold method in the sense that for many different search intervals the best split point for each is first determined greedily, in order to form a preliminary set of candidates out of which the final selection can be made. However, we propose a deterministic construction of search intervals which can be pre-computed. The resulting seeded intervals allow computational efficiency, i.e. lead to a total length for the search intervals that is linear up to a logarithmic factor (and thus we will refer to it as near-linear), independent of the underlying number of change points, while keeping the improvements in terms of statistical estimation compared to standard binary segmentation. Figure 1 illustrates the statistical efficiency vs. computational efficiency of seeded binary segmentation compared to wild binary segmentation for a univariate signal. It is remarkable that the estimation performance of the default option suggested for seeded binary segmentation is very close to what seems to be possible in this specific setup. Hence, our approach not only uses available resources more efficiently, but final results are also much more robust across the number of generated seeded intervals.
Two very useful applications for seeded binary segmentation in connection with ever larger data sets are in the following setups. First, applications with very long univariate Gaussian or similar low-dimensional signals with possibly hundreds of thousands of observations led to the need and thus emergence of computationally more efficient approaches. Based on dynamic programming, Killick et al. (2012) proposed the pruned exact linear time method that is returning the optimal partitioning in linear time in frequent change point scenarios when pruning works well, but remains of quadratic complexity in the worst case (with a few change points only). Maidstone et al. (2017) proposed the functional pruning optimal partitioning method and demonstrated faster empirical performance compared to the pruned exact linear time method in many scenarios. Fryzlewicz (2018) recently proposed wild binary segmentation 2 for a univariate Gaussian change in mean setup, a hybrid between classical binary segmentation and wild binary segmentation aiming to improve the computational performance of the latter. In all these setups, seeded binary segmentation would lead to a competitive approach with guaranteed near-linear cost independent of the underlying number of change points as opposed to the previously mentioned methods (including even plain binary segmentation) which can all have a quadratic computational complexity in certain method-specific adverse scenarios. Note that dynamic programming would typically run fast in frequent change point scenarios while wild binary segmentation would run fast in the presence of only a few change points, but neither of them are guaranteed to be fast under all circumstances.
However, a second, and to us even more appealing applicability of seeded binary segmentation is in setups where model fits are expensive. Several examples, especially in high-dimensional change point detection settings, are discussed at the beginning of section 5. The commonality among these setups is that they usually require computationally costly single fits and thus it is essential for practical feasibility to keep the total length of search intervals as small as possible. We will discuss this in more detail and using a concrete example in the last section.
The structure of the paper is as follows. In the next section we introduce seeded binary segmentation in more detail, background and with more discussion (e.g. details regarding computational times). For the ease of explanation we do this in the classical setup of Gaussian variables with changing mean for which section 3 contains theoretical results regarding consistency and localisation error. Section 4 provides simulation results which demonstrate competitive performance compared to other state-of-the-art estimation methods. As noted previously, the applicability of the seeded binary segmentation methodology is very wide and we illustrate this with a high-dimensional example in section 5.
2 On the way to Seeded Binary Segmentation
A well-studied problem, which we will analyse in detail as an illustrative example of the methodology, is the detection of shifts in mean of independent univariate Gaussian variables with variance that stays constant over time. Recent proposals for this setup are for example that ofFrick et al. (2014), Fryzlewicz (2014), Du et al. (2016), Li et al. (2016) and Fang et al. (2019)
, references therein also provide an overview. The understanding for this piecewise constant Gaussian setup can often be generalised to other scenarios, e.g. for multivariate and dependent, for different parametric as well as non-parametric models, or even changes in piecewise linear structure. Hence, while focusing on this classical model for illustrative purposes, we would like to emphasise again the much wider potential applicability of our methodology (which we will come back to in the last section).
2.1 The Gaussian setup and notation
We consider the model
where is a fixed one-dimensional, piecewise constant signal with an unknown number of change points at unknown locations . The noise is assumed to be independent and identically distributed Gaussians with zero mean and variance
. Without loss of generality, the variance can be assumed to be 1. The test statistic used for detection by binary segmentation and related procedures is the CUSUM statistics(Page, 1954) defined for a split point within the segment as
with integers and . The CUSUM statistic is the generalised likelihood ratio test for a single change point at location in the interval against a constant signal. The value will be referred to as gain, because the square of it gives the reduction in squared errors in the segment when splitting at point and fitting a separate mean on the corresponding left and right segments. The location of the maximal absolute CUSUM statistics
is the best split point and thus a candidate when dividing the segment into two parts. Note that in scenarios other than the Gaussian setup from equation (1) one would only need to change the CUSUM statistic to one that is adapted to the problem at hand. Whether one declares to be a change point usually depends on the corresponding maximal gains . We treat the final selection of change points as a separate issue and will discuss specific ways to do so later on.
2.2 Related methods based on binary segmentation
Binary segmentation (Vostrikova, 1981) first finds the best split on all data and if is above a pre-defined threshold, then searches for other split points on the left and right segments and . The same iterative search is continued as long as the gains are still above the pre-defined threshold and there are sufficiently many observations in the resulting segments.
The wild binary segmentation method (Fryzlewicz, 2014) and the narrowest over threshold method (Baranowski et al., 2019) both generate random intervals , , where the starting and end points are drawn independently, uniformly and with replacement from , possibly subject to some minimal length constraint to be able to evaluate the chosen test statistic. One determines candidate splits and corresponding maximal gains for each of the random intervals via the CUSUM statistics defined in equation (2). The two methods are slightly different beyond this point, in the way to select the final change point estimates.
Wild binary segmentation orders the candidates according to the respective maximal gains (i.e. values of the test statistic) and selects greedily, i.e. the split point with highest value is the first change point estimate. Then all intervals are eliminated from the list which contain this found change point. From the remaining candidates again the one with maximal gain is selected, followed by the elimination of all intervals that contain this second one. The procedure is repeated as long as the split with highest gain remains above a certain pre-defined threshold. A suitable threshold can be derived from asymptotic theory for example. Out of several possible segmentations corresponding to different thresholds, one can also select the one optimising a chosen information criterion (e.g. proposed by Zhang and Siegmund (2007)).
The narrowest over threshold method selects change points amongst the candidates that have a maximal gain above a pre-defined threshold. Amongst these candidates the one having the shortest (i.e. narrowest) background interval is selected first. Then all intervals that contain this first estimate are eliminated and repeatedly from the left over ones above the pre-defined threshold the next one with the currently shortest background is selected. The procedure is repeated as long as no more maximal gain is above the pre-defined threshold. It is also possible to repeat this procedure with differing thresholds and then take the one optimising some information criterion. Note that the narrowest over threshold way of selecting the final estimates makes it applicable in more general scenarios, where the model is e.g. piecewise linear.
Elimination of the intervals in each step for both methods is necessary in order to avoid detecting the same change point at slightly different locations for several different background intervals. The two methods allow for detection in a larger class of signals compared to binary segmentation. The intuition behind this is that some of the random intervals will contain only a single change point sufficiently well separated from the boundaries and thus serve as good local backgrounds in which change points are easily detectable. As a comparison, binary segmentation is particularly bad in signals that exhibit many up and down jumps that cancel each other if a background contains several change points.
The main drawback of wild binary segmentation and the narrowest over threshold method is the increased computational cost. This grows linearly with the total length of used random intervals (and with expected length for each random interval). In frequent change point scenarios one should draw a higher number of random intervals in order to ensure that there are still sufficiently many intervals with a single change point in them. In the worst case, when drawing essentially all possible intervals, the complexity is cubic, which is clearly not satisfactory for large scale change point detection problems. As mentioned previously, it is also an issue that one might not know how many change points to expect and thus how many intervals to draw. Fryzlewicz (2018) summarises these issues as the lack of computational adaptivity of wild binary segmentation and proposes a modification, called wild binary segmentation 2, which is a hybrid between wild and classical binary segmentation in the following sense. In order to speed up, he suggests to draw a smaller number of random intervals, localise the first candidate greedily, and in the resulting left and right segments draw again random intervals, localise again, and repeat iteratively. From the resulting preliminary list of candidates the final estimates are again determined based on certain maximal gains. However, this proposal is fairly specific to the univariate Gaussian setup and might fail when adapting to high-dimensional or other more complicated scenarios such as a model that is piecewise linear. An interesting two-stage procedure also targeting computational issues for the univariate Gaussian case was proposed by Lu et al. (2017). They obtain pilot estimates of change points on a sparse subsample and then refine these estimates in a second step in the respective neighbourhoods using dense sampling. This can reduce the computational cost to sub-linear, and is thus useful for extremely long time series. However, note that computational issues in high-dimensional setups primarily arise from the fact that data sets are too ‘wide’ and not because of being ‘long’. Moreover, there is a price to pay in terms of detectable changes using this procedure, namely, being statistically suboptimal.
Next, we propose our conceptually simple seeded binary segmentation method that has a number of advantages. To name a few, it leads to a near-linear time algorithm under all circumstances, independent of the number of change points, it simplifies theoretical analysis and has an applicability wider than the Gaussian scenario.
2.3 Seeded Binary Segmentation
The question to start with regarding the randomly generated intervals above is, whether all of them are necessary? When the starting and end points are drawn uniformly at random, there will be many intervals with length of the order of the original series (see bottom of Fig. 2). While these are one of the main drivers of computational time, if there are many change points, such long intervals are often not really useful as they contain several change points. One should rather focus on shorter intervals. As an example, the total length of random intervals in the bottom of Fig. 2 is approximately the same as for the seeded intervals shown on the top of Fig. 2. While the shown random intervals only contain a few short intervals and also not covering short intervals uniformly, the seeded intervals achieve this much better. Formally, we propose the following deterministic construction guaranteeing a good coverage of single change points.
Definition 1 (Seeded intervals)
Let denote a given decay parameter. For (i.e. logarithm with base ) define the -th layer as the collection of intervals of initial length that are evenly shifted by the deterministic shift as follows:
where , and . The overall collection of seeded intervals is
Some technical remarks are due first. Due to rounding to integer grid values, the effective lengths and shifts are in practice slightly different from and . Moreover, some intervals might appear multiple times on small scales (depending on choice of ). Such redundant ones can be eliminated. Using some rough bounds, one easily obtains a near-linear bound for the total length of all the intervals contained in : . Thus, evaluating the CUSUM statistics for all of these intervals also has a computational cost of the order . Note that one can simply stop at an earlier layer than the -th one if a certain minimal segment length is desirable (e.g. in high-dimensional scenarios) or when prior knowledge on the true minimal segment length is available.
While the lengths of intervals decay rapidly over the layers, the number of intervals is rapidly increasing as the scales become smaller. In particular, at the lowest layer the total number of intervals is of the same order as the total number of observations, i.e. for , is of the order . This allows a good coverage even in frequent change point scenarios. Hence, in some sense, the number of intervals is automatically adapted to each scale. Figure 2 shows the first few layers of two collections of seeded intervals as well as some random intervals for a comparison. The key idea is best understood for (top of Fig. 2). Here one obtains an overlapping dyadic partition. Each interval from the previous layer is split into a left, a right and an overlapping middle interval, each of half the size of the previous layer. The middle plot of Fig. 2 shows seeded intervals of slower decay, namely . Here the solid lines in layers and show intervals that were also included in the interval system for (top), while the dashed lines are additionally obtained intervals. These intervals on the intermediate layers and are still evenly spread. In particular, a sufficiently big overlap amongst the consecutive intervals within the same layer is guaranteed. This is key, because one obtains for each change point background intervals with only that single change point in there and such that the change point is not too close to the boundaries, which facilitates detection. We recommend using as a decay in practice, but as indicated above, adding intermediate layers is very easy when trying to achieve better finite sample performance in very challenging scenarios.
Several related interval constructions were proposed by Guenther Walther together with various co-authors. First proposed for the detection of spatial clusters (Walther, 2010), as well as density estimation (Rufibach and Walther, 2010), these proposals were then gradually adapted towards more classical change point detection (or fairly related problems) by Rivera and Walther (2013); Chan and Walther (2013, 2015). The difference between these various intervals and seeded intervals mainly lies in the specific choices of intervals for each scale. Pein et al. (2017) also considered dyadic partitions, but as a major difference, without an overlap, moreover, with no flexibility in decay. Recently, yet another interval system was proposed by Chan and Chen (2017) in the context of bottom up approaches. However, these specific interval construction proposals appeared as tricks and side remarks rather than pronouncing the potential of a stand-alone flexible methodology. We believe our construction is more intuitive, in particular showing the trade-offs between computational time and statistical precision.
Algorithm 1 describes our seeded binary segmentation. The decay parameter governs the trade-off between computational time and expected estimation performance. Limiting the minimal number of observations is reasonable e.g. in high-dimensional scenarios or if prior information on the minimal true segment lengths is available. Choosing a larger value not only speeds up computations, but also reduces the risk of selecting false positives coming from very small intervals during the final selection. We find it important to disentangle the construction of search intervals from the final selection as various options with advantages and disadvantages arise. Some possible selection methods are discussed in section 2.4.
|a decay parameter , a minimal segment length and a selection method.|
|Create a collection of seeded intervals with decay from Definition 1, which cover observations.|
|Take the -th interval in and denote its boundaries with and .|
|Calculate the CUSUM statistics from equation (2) for .|
|Apply the chosen selection method to to output the final change point estimates.|
Overall, the key components for our seeded interval construction are 1) a sufficiently fast decay such that the total length of all intervals is near-linear, 2) choosing a larger number of intervals on the small scales and thus suitably adapting for the possibility that there can be many more small segments than large ones, and last, but not least 3) guaranteeing a sufficient overlap amongst intervals on the same layer. The choice of can be interpreted as a trade-off between computational time and expected performance.
2.4 Selection methods
While mentioned earlier, we recall some possible selection methods here. Greedy selection takes the location of the maximal CUSUM value () as the change point candidate, then eliminates from all intervals that contain this candidate, and repeats these two steps as long as there are some intervals left. From the list of these candidates all above a certain threshold (typically some constant times ) are kept. Alternatively, one can also investigate various thresholds, each leading to a different segmentation, and at the end select the segmentation that optimises some information criterion.
The narrowest over threshold selection takes all intervals for which is over a pre-defined threshold (typically some constant times ). Out of these it takes the shortest (narrowest) interval , i.e. the one for which for all intervals with maximal CUSUM value above the threshold. Then it eliminates all intervals that contain the found change point candidate , and repeats these two steps as long as there are no more intervals left with a maximal CUSUM statistics above the pre-defined threshold. While raising the threshold tends to result in fewer change points, due to the narrowest selection followed by the elimination step, it could also happen that one obtains more change points. Hence, in some scenarios this selection method can be somewhat unstable, as even similar thresholds can lead to segmentations differing both with respect to number and location of the boundaries. It is also possible to evaluate a set of different segmentations arising from different thresholds in terms of some information criterion and then pick the overall best.
Various other options, e.g. the steepest drop to low levels proposed by Fryzlewicz (2018) or selection thresholds guaranteeing false positive control either via theoretical approximation results (Fang et al., 2019) or via simulations (facilitated by the favourable scaling of the total length of seeded intervals of ) could be adapted as well. Besides the choice of , the overall computational costs also depend on how complicated the final model selection is. Greedy selection for example is very fast, while full solution paths (using all possible thresholds) of the narrowest over threshold method can be slow. Whether this becomes an issue depends on the problem at hand. In high-dimensional setups the expensive model fits would typically dominate the cost of model selection anyway. To make this more formal, consider the following theorems.
The algorithm consists of two steps: the calculation of the CUSUM statistics from (2) for all seeded intervals and the subsequent model selection. For the first step, the cost of calculating CUSUM statistics on all seeded intervals is proportional to their total length, i.e., .
In the second step, we only consider the candidate split points from seeded intervals with maximal CUSUM values above the given threshold, and sort them either according to their maximal gains (for the greedy selection) or their lengths (for the narrowest over threshold selection). Note that the number of intervals contained in a collection of seeded intervals is of the order . Hence, the sorting procedure involves computations in the worst case. Then we check for each seeded interval in the sorted list whether it contains any previously found change point. Such a check can be done in computations as follows. We store all previously found change points in a balanced binary tree structure (e.g., the AVL tree by Adelson-Velskiĭ and Landis, 1962), where both the search and the insertion of new elements have the worst case complexity , and the memory complexity is (given the maximum of total points that could potentially be declared as change points and thus need to be stored in the AVL tree). Thus, one needs time for checking whether the candidate split is suitable (i.e. the corresponding search interval does not contain any previously found change point, for which finding the nearest neighbouring points in the AVL tree to the currently considered candidate is sufficient), and time for updating the AVL tree supposing that the candidate is suitable and thus that it is declared/detected as a new change point. Alternatively, we can utilise the structure of seeded intervals by noticing that each potential split point is contained in at most intervals (as in each of the scales there is a constant number of intervals containing ). This allows us to mark all of the intervals that contain the found change point with a cost of such that these intervals can then be skipped in consecutive steps when adding the new change points. Overall, both approaches (AVL trees and the specific structure of seeded intervals) lead to a worst case complexity for the second step as we go through at most intervals (the total number of intervals in a collection of seeded intervals) and perform at most computations at each (checking whether the interval is suitable and an update of the AVL tree). Therefore, the total computational complexity is and it does not depend on the underlying number of true change points. The total memory complexity is composed of the storage of information (maximal gain, best split, as well as start and end point for each search interval) plus the AVL tree, all of which requires at most memory.
Assume model (1) and that in seeded binary segmentation (i.e., Algorithm 1) we determine a full solution path (i.e., calculate segmentations with all thresholds corresponding to maximal gains in any of the seeded intervals) and then select the final model out of the ones in the solution path based on an information criterion. Specifically, consider an information criterion with a penalty of the form
with , and . Assume further that given the computation of is , for every finite set of split points and any arbitrary additional split point . Then seeded binary segmentation has the memory complexity and the computational complexity of
when the segmentation with the best value of the information criterion (3) in the solution path is chosen.
We note that the information criterion (3) in Theorem 2 is fairly general, including e.g., the Akaike information criterion, the Bayes information criterion and its modifications (Zhang and Siegmund, 2007) as special cases. The upper bound for the narrowest over threshold selection in Theorem 2 is worse than that for the greedy selection, because, as mentioned previously, for the narrowest over threshold selection even similar thresholds sometimes lead to solutions differing both with respect to number and location of the change point estimates. Note that for the cases when close-by thresholds produce similar models, one could reuse parts of previous segmentations and thus improve on the worst case bound. We do not know whether any smart updating (potentially using the specific deterministic structure of seeded intervals) could lead to some provably better complexity than the crude bound stated. For this one would need to be able say something about the instability in the solution path (i.e., bound the number of cases when similar thresholds produce very different segmentations).
From a practical perspective, thresholds are of interests for some constant . Thus, instead of investigating all possible thresholds, picking some of this specific order could be sufficient to obtain good empirical results while maintaining fairly fast computations, i.e., . Alternatively, one could use a reasonable threshold (maybe even a few different ones) and then create a ‘solution path’ by deleting the found change points in a bottom up fashion based on local values of the test statistic (i.e., at each step one deletes the change point which has the smallest CUSUM value in the segment determined by its closest left and right neighbours, similar to the reverse segmentation proposal of Chan and Chen (2017)) to obtain a hierarchy of nested models for which the evaluation of the information is cheap (similar to greedy selection). Yet another option is to consider fewer layers in the collection of seeded intervals, leading to much fewer intervals in total (as most intervals are at the lowest layers covering small scales) and thus also much fewer possible thresholds to be used for the solution path. While the overall cost of the selection procedure could be reduced massively, the speedup would come at a price, namely, the risk of not detecting true change points with a small spacing in between. Last, but not least, computations in parallel are easy to implement both for the calculation of the test statistic in the various search intervals, as well as the model selection using different thresholds (in the narrowest over threshold selection).
In order to apply the information criterion (3), we need to compute the full solution path, namely, the solutions for all possible choices of thresholds. Consider first the greedy selection, which has the nice property that the obtained solutions are nested, i.e. adding another split point (equivalent to lowering the threshold) splits one of the existing intervals into two parts. Hence, to obtain the full solution path, one can just let the model selection run as in Theorem 1, but with a threshold set to , leading to at most computations. Further, the penalty can be evaluated in an incremental way, i.e., each time when adding a new change point, which by assumption needs computations per update. For the least squares part in (3), it is also possible to update incrementally with computations at each step. More precisely, we have to find the nearest left and right point (out of all previously found change points) to the newly added change point. Once the neighbours are found, one can just update the least squares of the specific segment that was freshly divided in criterion (3) in time by utilising cumulative sum type of information (all of which can be pre-computed once in the beginning in time, and stored in space). Such searches of nearest neighbours can be done in time, if we build and iteratively update a balanced binary tree (e.g., the AVL tree by Adelson-Velskiĭ and Landis, 1962) for the found change points as in the proof of Theorem 1. Note that such balanced trees can be updated incrementally also in time. Given that we go through at most intervals and perform at most computations at each, we thus obtain an upper bound on the worst case computational complexity of for the greedy selection.
Consider now the narrowest over threshold selection. Note that there are at most seeded search intervals and thus thresholds which can potentially lead to different solutions. For each threshold, we can compute the solution in time by Theorem 1 and evaluate the information criterion (3) in time (i.e., in an incremental way similarly to the greedy selection considered above). Thus, in total, we have an upper bound on the worst case complexity being .
Note that only the value of the information criterion (3) needs to be stored for every threshold that leads to a different solution, requiring memory. The final solution can be calculated again with time once the optimal threshold is selected. Thus, the memory complexity is for both selection methods. Of course, if one wants to record all possible segmentations from the solution path of the narrowest over threshold selection, the memory complexity could be .
3 Statistical theory
The introduction of seeded intervals (in Definition 1) not only accelerates the computation, but also simplifies the statistical analysis of the resulting segmentation method, i.e., the seeded binary segmentation (see Algorithm 1). Consistency results depend on the assumed model as well as the selection method. As an illustration, we consider again the Gaussian change in mean model (1) for which wild binary segmentation (Fryzlewicz, 2014) and the narrowest over threshold method (Baranowski et al., 2019) were claimed to be minimax optimal. However, the proof of the former seems to contain errors, e.g., some bounds for the single change point case were wrongly applied to the multiple change point cases. This has been noted by Wang et al. (2018) in the (sub)Gaussian setup. Contrary to what is claimed by Fryzlewicz (2014), the greedy selection utilised in wild binary segmentation does not lead to optimal detection (and the same is also true for wild binary segmentation 2 of Fryzlewicz (2018) as the therein utilised steepest drop to low levels selection is also a certain greedy selection method). Put differently, without additional assumptions on the generated intervals or modification of the selection method itself, consistency seems only achievable for a smaller class of signals than claimed originally. Albeit in a nonparametric setup, Padilla et al. (2019a) mention the errors of the original proofs of Fryzlewicz (2014) as well, along some discussion on an additional assumption that they impose under which nearly optimal results can be attained (see their second remark after Theorem 2 on page 8). However, their assumption is a restriction on the generated search intervals requiring knowledge of the minimal segment lengths. Depending on the investigated data, such a background knowledge may or may not be a realistic assumption. Note that from a practical perspective greedy selection still offers advantages (e.g. easy applicability and implementation in various scenarios, segmentations are nested and stable across thresholds, etc.) and in terms of empirical performance it is often at least as good as the narrowest over threshold method for example (see the simulations in section 4). As seeded binary segmentation would suffer similarly if the greedy selection method is chosen, we show the strong consistency using the narrowest over threshold selection method (Baranowski et al., 2019) as an illustration.
Assume model (1) and let with , and with and . Assume also that
Let and denote respectively the number and locations of estimated change points by the seeded binary segmentation in Algorithm 1 with the narrowest over threshold as the selection method. Then there exist constants , , independent of and (in Definition 1), such that, given the threshold for the selection method , as ,
The signal as well as , and is allowed to depend on The assumption for detection in (4) and the localisation rates in (5) are minimax optimal up to constants (see e.g. Chan and Walther, 2013; Frick et al., 2014; Chan and Chen, 2017; Wang et al., 2018; Li et al., 2019). That is, no method can detect all change points with weaker condition than (4), and no method can achieve a better rate than , in the asymptotic sense.
We consider interval for each change point . By construction of seeded intervals in Definition 1, we can find a seeded interval such that , and . Then the problem almost reduces to the detection of change point on the interval . The rest of the proof follows nearly the same as the proof of Theorem 1 in Baranowski et al. (2019), and is thus omitted.
We can further establish the consistency of the seeded binary segmentation using certain data adaptive selection rule (i.e., strengthened Schwarz information criterion).
Assume notation from Theorem 3, and that , and , , lie in for some constants and . Let and be respectively the number and locations of estimated change points by the seeded binary segmentation in Algorithm 1 with the narrowest over threshold, the threshold of which is determined by the strengthened Schwarz information criterion using . Then there is a constant independent of and (in Definition 1) such that, as ,
It follows exactly the same way as the proof of Theorem 3 in Baranowski et al. (2019).
In a nutshell, seeded binary segmentation combined with the narrowest over threshold selection has the same statistical guarantee as the original narrowest over threshold method of Baranowski et al. (2019), but requires far less computations, with the worst case computational complexity being nearly linear (for a single threshold based selection), see Theorem 1.
4 Empirical results
We consider the same simulations as presented by Fryzlewicz (2014) in connection with wild binary segmentation. For a description of the signals see Appendix B of Fryzlewicz (2014). We used implementations of the wild binary segmentation and the narrowest over threshold method from the R packages wbs and not from CRAN, respectively, with a small bug fixed in both packages concerning the generation of random intervals. Moreover, for our proposed seeded binary segmentation methodology we slightly modified the code in wbs to allow for seeded intervals as inputs. The code is available at https://github.com/kovacssolt/SeedBinSeg. Arbitrary intervals as inputs were already supported by the not package. Note however, that the latter implementation can be slow as all segmentations in the solution path resulting from all possible thresholds are recorded, already leading to a quadratic number of candidates being saved (see Theorem 2 and corresponding discussions for details).
Table 1 shows results based on averaged values over simulations for each setup. Reported are the mean squared errors (MSE), Hausdorff distances, V measures (Rosenberg and Hirschberg, 2007, more precisely, viewing the change point segmentation as a clustering problem by identifying each constant part as a cluster)
, errors in the estimates for the number of change points as well as the total length of the constructed search intervals. Parentheses show the corresponding standard deviations. For our seeded binary segmentation methodology we choseand . Exceptions are the rows marked with a star, where the decay was chosen to be , i.e. an increase in the total length of search intervals by a factor of roughly four. For both the wild binary segmentation and the narrowest over threshold methods we chose the recommended random intervals. Moreover, all methods selected the final model using the same information criterion.
Overall, seeded binary segmentation with greedy selection (g-SeedBS) performs similar to wild binary segmentation (WBS), while seeded binary segmentation with the narrowest over threshold selection (n-SeedBS) is similar to the original narrowest over threshold method (NOT). Most importantly, using seeded intervals we achieve competitive performance with total lengths of search intervals that are orders of magnitudes smaller than using the random intervals. In these scenarios both selection types perform similarly. We have the experience that increasing the noise level would lead to better results using the greedy selection. The performance of our default methodology (with ) is slightly worse for the blocks signal compared to random intervals based counterparts. However, when choosing , and thus accepting an increase in the total length of the search intervals by about a factor four, performances are very close again. Note that the total length of search intervals is still only a tenth of that of random intervals. For a better understanding, the simulations for the blocks signal are depicted in Fig. 3. In the top row (with ) one can see points lining up vertically when considering Hausdorff distances. This line is around , and indicates that one of the true change points (lying to this distance to its closest neighbour) is missed. However, when increasing computational effort (, bottom row), this vertical line basically disappears and for all three error measures many points line up exactly on the identity line. This indicates that found change points using seeded binary segmentation with greedy selection and wild binary segmentation exactly agree in many simulation runs.
|signal||method||MSE||Hausdorff dist.||V measure||tot. length|
|blocks||g-SeedBS||2.922 (1.077)||43.150 (31.009)||0.970 (0.013)||-0.610 (0.803)||95.3||(0)|
|g-SeedBS*||2.685 (0.828)||32.690 (21.037)||0.973 (0.009)||-0.520 (0.577)||329.7||(0)|
|WBS||2.627 (0.854)||31.520 (23.262)||0.973 (0.011)||-0.500 (0.577)||3419.1||(35.7)|
|n-SeedBS||2.942 (1.002)||42.630 (28.690)||0.970 (0.013)||-0.690 (0.787)||95.3||(0)|
|NOT||2.539 (0.874)||31.700 (22.778)||0.974 (0.011)||-0.590 (0.588)||3422.3||(32.6)|
|n-SeedBS*||2.675 (0.782)||33.790 (21.056)||0.973 (0.010)||-0.600 (0.532)||329.7||(0)|
|NOT||2.724 (1.021)||42.490 (56.636)||0.972 (0.013)||-0.610 (0.650)||3421.9||(34.8)|
|fms||g-SeedBS||0.005 (0.004)||15.810 (25.830)||0.955 (0.037)||-0.020 (0.492)||19.1||(0)|
|WBS||0.004 (0.003)||9.080 (14.975)||0.963 (0.031)||0.000 (0.284)||833.8||(7.9)|
|n-SeedBS||0.004 (0.003)||15.500 (25.853)||0.958 (0.035)||-0.040 (0.448)||19.1||(0)|
|NOT||0.004 (0.002)||8.640 (15.019)||0.967 (0.028)||-0.010 (0.266)||834.4||(7.4)|
|mix||g-SeedBS||1.598 (0.517)||86.870 (57.740)||0.908 (0.044)||-1.180 (1.048)||22.3||(0)|
|WBS||1.680 (0.728)||78.260 (58.431)||0.913 (0.041)||-1.090 (1.093)||939.1||(8.1)|
|n-SeedBS||1.759 (0.605)||96.870 (64.631)||0.897 (0.052)||-1.340 (1.199)||22.3||(0)|
|NOT||1.981 (0.914)||89.220 (64.899)||0.903 (0.049)||-1.260 (1.495)||939.2||(8.3)|
|teeth10||g-SeedBS||0.061 (0.040)||7.960 (20.229)||0.933 (0.130)||-0.190 (2.419)||4.4||(0)|
|WBS||0.058 (0.040)||7.380 (20.591)||0.936 (0.133)||-0.410 (2.170)||238.7||(2.2)|
|n-SeedBS||0.066 (0.050)||10.790 (27.521)||0.911 (0.186)||-0.860 (2.903)||4.4||(0)|
|NOT||0.062 (0.044)||9.150 (24.593)||0.923 (0.162)||-0.730 (2.522)||238.8||(2.3)|
|stairs10||g-SeedBS||0.023 (0.011)||2.130 (1.468)||0.981 (0.013)||0.470 (0.745)||4.8||(0)|
|WBS||0.024 (0.013)||1.900 (1.374)||0.980 (0.014)||0.450 (0.730)||255.4||(2.3)|
|n-SeedBS||0.021 (0.011)||1.340 (1.249)||0.984 (0.013)||0.100 (0.362)||4.8||(0)|
|NOT||0.021 (0.011)||1.400 (1.326)||0.984 (0.013)||0.140 (0.450)||255.5||(2.5)|
g-SeedBS, Seeded binary segmentation with greedy selection; n-SeedBS, Seeded binary segmentation with narrowest over threshold selection; WBS, Wild binary segmentation; NOT, Narrowest over threshold method; MSE, Mean squared error; Hausdorff dist., Hausdorff distance; tot. length, total length of search intervals (in thousand); signals are described in Appendix B of Fryzlewicz (2014).
Besides increasing the computational effort by the choice of the decay , there could be other options to improve empirical performance. For example, one could find preliminary change point estimates with the recommended choice of and include additional search intervals in a second step covering the ranges between previously found neighbouring change points. Should the best candidates from these additional intervals have test statistics above the pre-defined acceptance threshold, then one can also add them to the final set of change point estimates. Another option mentioned at the definition of seeded intervals is to limit the minimal number of observations (i.e. consider fewer layers) if prior information on the minimal true segment lengths is available, which reduces the risk of selecting false positives coming from very small intervals during the final model selection. One could also use more refined test statistics instead of the CUSUM statistics that are adapted to various scales, e.g. multiscale variants.
5 Beyond the univariate case: An example for high-dimensional models
As mentioned earlier, another appealing applicability of seeded binary segmentation beyond the extensively discussed univariate Gaussian setup is in scenarios where model fits are expensive, for example in the recently emerging area of high-dimensional change point detection (e.g. means in Soh and Chandrasekaran (2017); Wang and Samworth (2018); covariance and precision matrices/graphical models in Roy et al. (2017); Gibberd and Nelson (2017); Wang et al. (2017); Gibberd and Roy (2017); Bybee and Atchadé (2018); Avanesov and Buzun (2018); Londschien et al. (2019)
; linear regression inLeonardi and Bühlmann (2016); Wang et al. (2019)). Several of these proposals require the calculation of numerous lasso (Tibshirani, 1996), graphical lasso (Friedman et al., 2008) or similar rather costly estimators. Another example would be multivariate non-parametric change point detection for which Kovács et al. (2020)
use random forests(Breiman, 2001)
and other classifiers,Padilla et al. (2019b)
use kernel density estimation, whileMatteson and James (2014) utilise energy distances. Overall, in all approaches requiring costly single fits, it is essential to keep the total length of search intervals as small as possible to make computations feasible and thus to have practically usable methods.
We would like to highlight this in a specific example involving change point detection in (high-dimensional) Gaussian graphical models. We consider a setup and estimator as introduced by Londschien et al. (2019). Let
be a sequence of independent Gaussian random vectors with meansand covariance matrices such that is piecewise constant with an unknown number of change points at unknown locations and corresponding segment boundaries . For let denote the matrix of the observations , denote with their mean and let be the corresponding covariance matrix. Let be some required minimal relative segment length and be a regularisation parameter. According to the proposal of Londschien et al. (2019), for a segment with define the split point candidate as
where corresponds to a multivariate Gaussian log-likelihood based loss in the considered segment, i.e.
for an arbitrary segment . Moreover, is the graphical lasso precision matrix estimator (Friedman et al., 2008) for the corresponding segment with a specifically scaled regularisation parameter, i.e.
The goal here is to compare the estimation performance and computational times of the estimator (6) when applied to seeded as well as random intervals. We consider a specific chain network model (Fan et al., 2009, Example 4.1) with variables: with and . The resulting precision matrix is tridiagonal. We consider a modified version by replacing the top left part of by
, i.e. an identity matrix of dimension. Let and . We take segments of length each (i.e., ) and generate independent observations in the following way. The first hundred observations are drawn from , the following hundred from , then another hundred from etc. In total, we obtain observations. We simulated different data sets this way and to each of these applied seeded binary segmentation (greedy selection) with decays as well as random intervals for wild binary segmentation, leading to the red, respectively black point clouds in Fig. 4. To be precise, we set , hence, we left out observations at the boundaries of background intervals. This also implies that all intervals with less than observations were discarded for both interval construction types. We set and a threshold of as a necessary improvement in (6) in order to keep the proposed split point in the segment and declare it as a change point.
Note that with this specific choice of we do not have a truly high-dimensional scenario in any of the segments, but regularisation is still helpful especially when considering short background intervals (where the number of variables is of similar magnitude as the number of observations). The reasons why we kept the number of variables comparably low are on the one hand the easier choice of tuning parameters, and on the other hand we would quickly face computational bottlenecks with wild binary segmentation when increasing . In this -dimensional example, around seconds are needed for wild binary segmentation to obtain an estimation accuracy that is already achieved in about seconds using seeded intervals, see top of Fig. 4. The bottom of Fig. 4 shows computational gains of similar magnitude ( vs. seconds). The difference in computational times for the top and bottom of Fig. 4 arise from the different implementation. The top shows a naive implementation where for each split point both the input covariance matrices () as well as the corresponding graphical lasso fits () are computed from scratch. However, the input covariance matrices can be updated cheaply for neighbouring split points, i.e. once is available, with a cost of it can be updated to obtain . This better implementation, shown at the bottom part of Fig. 4, saves in computational time both for wild and seeded binary segmentation, but still preserves the massive computational gains of the seeded intervals. The remaining computational costs mainly come from the expensive graphical lasso fits. Without computing the graphical lasso fits, in our R implementation one would only need about seconds for the fastest version of seeded binary segmentation () and about seconds for and seconds for for wild binary segmentation. However, as mentioned previously, without the regularised fits of the graphical lasso, the estimation error would be clearly worse, and hence, this is not an option. Thus, the next best guess to speed up would be to use warm starts from neighbouring splits when updating the graphical lasso fits, similar to the covariance matrix updates described previously. Unfortunately, not all algorithms that have been developed to compute graphical lasso fits are guaranteed to converge with warm starts. Mazumder and Hastie (2012b) proposed the DP-GLASSO algorithm that can be started at any positive definite matrix. When using the corresponding dpglasso R package on CRAN, we obtained speedups of with warm starts, but even this way the computational times were roughly times worse than using the graphical lasso algorithm of Friedman et al. (2008) implemented in the glasso R package on CRAN. The latter one we used to obtain the results of Fig. 4. Admittedly, there are a lot of factors influencing computational times (e.g. convergence thresholds) and different algorithms/implementations might be better or worse in other scenarios with a higher number of variables and/or regularisation parameter .
It is hard to draw an overall conclusion on which combination of implementation and tricks would give the fastest results for changing Gaussian graphical models, but it is even harder to be fast without utilising the more efficient seeded interval construction for the following reason. In each background interval evaluating (6) at a single split point requires the calculation of graphical lasso fits. Hence, the total computational cost (for a small ) would be roughly for seeded binary segmentation versus for wild binary segmentation, where denotes the cost of computing a graphical lasso estimator for a -dimensional covariance matrix. There are various algorithms computing exact or approximate solutions for the graphical lasso estimator with differing computational cost, but scalings for of (or even worse) are common once the input covariance matrices are already available and unless some special structures can be exploited (e.g. block diagonal screening proposed by Witten et al. (2011) as well as Mazumder and Hastie (2012a)). The terms and come from the calculation of the input covariance matrices via updates. Summarising all observations from above, for typical algorithms for graphical lasso problems, the total costs would be roughly for seeded binary segmentation vs. for wild binary segmentation.
As in the univariate Gaussian change in mean case, needs to be much larger than if there are many change points. In the worst case is of order when evaluating essentially all possible background intervals. In practice, one often does not know which choice of is sufficient. This is a problem as the performance depends very much on the choice of (see Fig. 4), especially in case there are many change points. Another advantage of seeded intervals is, that results are not depending much on the choice of the decay for the seeded interval construction compared to the very strong dependence on the choice of (the number of random intervals) for wild binary segmentation.
Overall the aim of this example is to demonstrate that seeded binary segmentation is highly useful in scenarios beyond the univariate Gaussian change in mean case. While we precisely quantify the computational gains in the changing Gaussian graphical model scenario when using graphical lasso estimators, it is more difficult to quantify this for some other costly estimators such as the lasso (Tibshirani, 1996) or random forest (Breiman, 2001) for the following reasons. As a main difference, they have a cost that not only depends on the number of variables, but also on the number of observations, moreover, possibilities to update neighbouring fits would need to be considered carefully. Even in absence of exact quantification of the computational costs in these scenarios, once can expect the substantial computational gains when using seeded intervals to be crucial to keep computations feasible when model fits are very expensive.
Two popular algorithmic approaches tackling change point detection are modifications of dynamic programming and greedy procedures, mostly based on binary segmentation. While dynamic programming provides optimal partitionings, its worst case computational complexity is typically quadratic in the length of the time series. Moreover, dynamic programming often requires considerable work to be adapted to the problem at hand, in particular when aiming to do so in efficient ways, see e.g. Killick et al. (2012); Frick et al. (2014); Li et al. (2016); Pein et al. (2017); Haynes et al. (2017a, b); Celisse et al. (2018); Fearnhead et al. (2019) for specific examples. Plain binary segmentation on the other hand is easy to adapt to various change point detection scenarios while being fast at the same time. The price to pay is not being optimal in terms of statistical estimation. The recently proposed wild binary segmentation (Fryzlewicz, 2014) and related methods improve on statistical detection by means of considering random background intervals for the detection, but these methods lose the computational efficiency of plain binary segmentation.
In this paper we showed that this loss of computational efficiency can be avoided by means of a systematic construction of background intervals. The so called seeded intervals have a total length of order compared to for wild binary segmentation with random intervals. The choice of should depend on the assumed minimal segment length which is typically unknown prior to the analysis. In frequent change point scenarios, should be clearly larger than , in worst case is of order . The seeded interval construction is not only more efficient, but the total length of search intervals is independent of the number of change points, which even guarantees fast computations under all circumstances, unlike competing methods. Importantly, for this computational gain one does not need to sacrifice anything, i.e. the improved statistical estimation performance remains as demonstrated in various simulation results. We have proved minimax optimality of our proposed seeded binary segmentation methodology for the univariate Gaussian change in mean problem, and demonstrated massive computational gains for detecting change points in large-dimensional Gaussian graphical models.
Solt Kovács and Peter Bühlmann were partially supported by the European Research Commission grant 786461 CausalStats - ERC-2017-ADG. Housen Li and Axel Munk gratefully acknowledge the support of the DFG Cluster of Excellence Multiscale Bioimaging EXC 2067.
- Adelson-Velskiĭ and Landis (1962) Adelson-Velskiĭ, G. M. and Landis, E. M. (1962). An algorithm for organization of information. Doklady Akademii Nauk SSSR, 146:263–266.
- Avanesov and Buzun (2018) Avanesov, V. and Buzun, N. (2018). Change-point detection in high-dimensional covariance structure. Electronic Journal of Statistics, 12(2):3254–3294.
- Baranowski et al. (2019) Baranowski, R., Chen, Y., and Fryzlewicz, P. (2019). Narrowest-over-threshold detection of multiple change points and change-point-like features. Journal of the Royal Statistical Society, Series B, 81(3):649–672.
- Boysen et al. (2009) Boysen, L., Kempe, A., Liebscher, V., Munk, A., and Wittich, O. (2009). Consistencies and rates of convergence of jump-penalized least squares estimators. The Annals of Statistics, 37(1):157–183.
- Breiman (2001) Breiman, L. (2001). Random Forests. Machine Learning, 45(1):5–32.
- Bybee and Atchadé (2018) Bybee, L. and Atchadé, Y. (2018). Change-point computation for large graphical models: a scalable algorithm for Gaussian graphical models with change-points. Journal of Machine Learning Research, 19(1):440–477.
- Celisse et al. (2018) Celisse, A., Marot, G., Pierre-Jean, M., and Rigaill, G. (2018). New efficient algorithms for multiple change-point detection with reproducing kernels. Computational Statistics & Data Analysis, 128:200–220.
- Chan and Chen (2017) Chan, H. P. and Chen, H. (2017). Multi-sequence segmentation via score and higher-criticism tests. arXiv:1706.07586.
- Chan and Walther (2013) Chan, H. P. and Walther, G. (2013). Detection with the scan and the average likelihood ratio. Statistica Sinica, 23(1):409–428.
- Chan and Walther (2015) Chan, H. P. and Walther, G. (2015). Optimal detection of multi-sample aligned sparse signals. The Annals of Statistics, 43(5):1865–1895.
- Donoho and Johnstone (1994) Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455.
- Du et al. (2016) Du, C., Kao, C.-L. M., and Kou, S. C. (2016). Stepwise signal extraction via marginal likelihood. Journal of the American Statistical Association, 111(513):314–330.
- Fan et al. (2009) Fan, J., Feng, Y., and Wu, Y. (2009). Network exploration via the adaptive Lasso and SCAD penalties. The Annals of Applied Statistics, 3(2):521–541.
- Fang et al. (2019) Fang, X., Li, J., and Siegmund, D. (2019). Segmentation and estimation of change-point models: false positive control and confidence regions. The Annals of Statistics. To appear.
- Fearnhead et al. (2019) Fearnhead, P., Maidstone, R., and Letchford, A. (2019). Detecting changes in slope with an penalty. Journal of Computational and Graphical Statistics, 28(2):265–275.
- Frick et al. (2014) Frick, K., Munk, A., and Sieling, H. (2014). Multiscale change point inference. Journal of the Royal Statistical Society, Series B, 76:495–580.
- Friedman et al. (2008) Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical Lasso. Biostatistics, 9(3):432–441.
- Friedrich et al. (2008) Friedrich, F., Kempe, A., Liebscher, V., and Winkler, G. (2008). Complexity penalized -estimation: fast computation. Journal of Computational and Graphical Statistics, 17(1):201–224.
- Fryzlewicz (2014) Fryzlewicz, P. (2014). Wild binary segmentation for multiple change-point detection. The Annals of Statistics, 42(6):2243–2281.
- Fryzlewicz (2018) Fryzlewicz, P. (2018). Detecting possibly frequent change-points: wild binary segmentation 2 and steepest-drop model selection. arXiv:1812.06880.
- Gibberd and Nelson (2017) Gibberd, A. J. and Nelson, J. D. (2017). Regularized estimation of piecewise constant Gaussian graphical models: the group-fused graphical Lasso. Journal of Computational and Graphical Statistics, 26(3):623–634.
- Gibberd and Roy (2017) Gibberd, A. J. and Roy, S. (2017). Multiple changepoint estimation in high-dimensional Gaussian graphical models. arXiv:1712.05786.
- Haynes et al. (2017a) Haynes, K., Eckley, I. A., and Fearnhead, P. (2017a). Computationally efficient changepoint detection for a range of penalties. Journal of Computational and Graphical Statistics, 26(1):134–143.
- Haynes et al. (2017b) Haynes, K., Fearnhead, P., and Eckley, I. A. (2017b). A computationally efficient nonparametric approach for changepoint detection. Statistics and Computing, 27(5):1293–1305.
- Hotz et al. (2013) Hotz, T., Schütte, O. M., Sieling, H., Polupanow, T., Diederichsen, U., Steinem, C., and Munk, A. (2013). Idealizing ion channel recordings by a jump segmentation multiresolution filter. IEEE Transactions on NanoBioscience, 12(4):376–386.
- Killick et al. (2012) Killick, R., Fearnhead, P., and Eckley, I. A. (2012). Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107(500):1590–1598.
- Kim et al. (2005) Kim, C.-J., Morley, J. C., and Nelson, C. R. (2005). The structural break in the equity premium. Journal of Business & Economic Statistics, 23(2):181–191.
- Kovács et al. (2020) Kovács, S., Londschien, M., and Bühlmann, P. (2020). Random Forests and other nonparametric classifiers for multivariate change point detection. working paper.
- Leonardi and Bühlmann (2016) Leonardi, F. and Bühlmann, P. (2016). Computationally efficient change point detection for high-dimensional regression. arXiv:1601.03704.
- Li et al. (2019) Li, H., Guo, Q., and Munk, A. (2019). Multiscale change-point segmentation: beyond step functions. Electronic Journal of Statistics, 13(2):3254–3296.
- Li et al. (2016) Li, H., Munk, A., and Sieling, H. (2016). FDR-control in multiscale change-point segmentation. Electronic Journal of Statistics, 10(1):918–959.
- Londschien et al. (2019) Londschien, M., Kovács, S., and Bühlmann, P. (2019). Change point detection for graphical models in presence of missing values. arXiv:1907.05409.
- Lu et al. (2017) Lu, Z., Banerjee, M., and Michailidis, G. (2017). Intelligent sampling for multiple change-points in exceedingly long time series with rate guarantees. arXiv:1710.07420.
- Maidstone et al. (2017) Maidstone, R., Hocking, T., Rigaill, G., and Fearnhead, P. (2017). On optimal multiple changepoint algorithms for large data. Statistics and Computing, 27(2):519–533.
- Matteson and James (2014) Matteson, D. S. and James, N. A. (2014). A nonparametric approach for multiple change point analysis of multivariate data. Journal of the American Statistical Association, 109(505):334–345.
- Mazumder and Hastie (2012a) Mazumder, R. and Hastie, T. (2012a). Exact covariance thresholding into connected components for large-scale graphical Lasso. Journal of Machine Learning Research, 13(1):781–794.
- Mazumder and Hastie (2012b) Mazumder, R. and Hastie, T. (2012b). The graphical Lasso: new insights and alternatives. Electronic Journal of Statistics, 6:2125–2149.
- Olshen et al. (2004) Olshen, A. B., Venkatraman, E. S., Lucito, R., and Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 5(4):557–572.
- Padilla et al. (2019a) Padilla, O. H. M., Yu, Y., Wang, D., and Rinaldo, A. (2019a). Optimal nonparametric change point detection and localization. arXiv:1905.10019.
- Padilla et al. (2019b) Padilla, O. H. M., Yu, Y., Wang, D., and Rinaldo, A. (2019b). Optimal nonparametric multivariate change point detection and localization. arXiv:1910.13289.
- Page (1954) Page, E. S. (1954). Continuous inspection schemes. Biometrika, 41:100–115.
- Pein et al. (2017) Pein, F., Sieling, H., and Munk, A. (2017). Heterogeneous change point inference. Journal of the Royal Statistical Society, Series B, 79:1207–1227.
- Reeves et al. (2007) Reeves, J., Chen, J., Wang, X. L., Lund, R., and Lu, Q. Q. (2007). A review and comparison of changepoint detection techniques for climate data. Journal of Applied Meteorology and Climatology, 46:900–915.
- Rivera and Walther (2013) Rivera, C. and Walther, G. (2013). Optimal detection of a jump in the intensity of a Poisson process or in a density with likelihood ratio statistics. Scandinavian Journal of Statistics, 40(4):752–769.
Rosenberg and Hirschberg (2007)
Rosenberg, A. and Hirschberg, J. (2007).
V-measure: a conditional entropy-based external cluster evaluation
Proc. Conf. on Empirical Methods in Natural Language Processing, (June):410–420.
- Roy et al. (2017) Roy, S., Atchadé, Y., and Michailidis, G. (2017). Change point estimation in high dimensional Markov random-field models. Journal of the Royal Statistical Society, Series B, 79:1187–1206.
- Rufibach and Walther (2010) Rufibach, K. and Walther, G. (2010). The block criterion for multiscale inference about a density, with applications to other multiscale problems. Journal of Computational and Graphical Statistics, 19(1):175–190.
- Soh and Chandrasekaran (2017) Soh, Y. S. and Chandrasekaran, V. (2017). High-dimensional change-point estimation: combining filtering with convex optimization. Applied and Computational Harmonic Analysis, 43(1):122–147.
- Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288.
- Vostrikova (1981) Vostrikova, L. Y. (1981). Detecting ’disorder’ in multidimensional random processes. Soviet Mathematics Doklady, 24:55–59.
- Walther (2010) Walther, G. (2010). Optimal and fast detection of spatial clusters with scan statistics. The Annals of Statistics, 38(2):1010–1033.
- Wang et al. (2019) Wang, D., Lin, K., and Willett, R. (2019). Statistically and computationally efficient change point localization in regression settings. arXiv:1906.11364.
- Wang et al. (2017) Wang, D., Yu, Y., and Rinaldo, A. (2017). Optimal covariance change point localization in high dimension. arXiv:1712.09912.
- Wang et al. (2018) Wang, D., Yu, Y., and Rinaldo, A. (2018). Univariate mean change point detection: penalization, cusum and optimality. arXiv:1810.09498.
- Wang and Samworth (2018) Wang, T. and Samworth, R. J. (2018). High dimensional change point estimation via sparse projection. Journal of the Royal Statistical Society, Series B, 80(1):57–83.
- Witten et al. (2011) Witten, D. M., Friedman, J. H., and Simon, N. (2011). New insights and faster computations for the graphical Lasso. Journal of Computational and Graphical Statistics, 20(4):892–900.
- Zhang and Siegmund (2007) Zhang, N. R. and Siegmund, D. O. (2007). A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics, 63(1):22–32.