Parallelisation of a Common Changepoint Detection Method

10/08/2018 ∙ by S. O. Tickle, et al. ∙ Lancaster

In recent years, various means of efficiently detecting changepoints in the univariate setting have been proposed, with one popular approach involving minimising a penalised cost function using dynamic programming. In some situations, these algorithms can have an expected computational cost that is linear in the number of data points; however, the worst case cost remains quadratic. We introduce two means of improving the computational performance of these methods, both based on parallelising the dynamic programming approach. We establish that parallelisation can give substantial computational improvements: in some situations the computational cost decreases roughly quadratically in the number of cores used. These parallel implementations are no longer guaranteed to find the true minimum of the penalised cost; however, we show that they retain the same asymptotic guarantees in terms of their accuracy in estimating the number and location of the changes.

1 Introduction

The challenge of changepoint detection has received considerable interest in recent years (see, for example, Rigaill et al. (2012), Chen and Nkurunziza (2017) and Truong et al. (2018) and references therein). In particular, there has been a significant focus on the important issue of developing computationally efficient methods to detect multiple changes. This article makes a new contribution to this area by focusing on the problem of parallelising a penalised cost approach to provide a significant computational advantage without compromising on statistical efficiency.

The common changepoint problem setting considers the analysis of a data sequence, $y_1, \ldots, y_n$, which is ordered by some index, such as time or position along a chromosome. We use the notation $y_{s:t} = (y_s, \ldots, y_t)$ for $s \leq t$. Our interest is in segmenting the data into consecutive regions; such a segmentation can be defined by the changepoints, $0 = \tau_0 < \tau_1 < \cdots < \tau_m < \tau_{m+1} = n$, where throughout we take $m$ as fixed, but unknown. Thus the set of changepoints splits the data into $m+1$ segments, with the $j$th segment containing data-points $y_{\tau_{j-1}+1:\tau_j}$.

Several approaches can be used to identify the locations of these changes. Within this article, we focus on a class of methods which involve finding the set of changepoints that minimise a given cost. The cost associated with a specific segmentation consists of two important specifications. The first of these is $\mathcal{C}(\cdot)$, the cost incurred from a segment of the data. Common choices for $\mathcal{C}$ include quadratic error loss, Huber loss and the negative log-likelihood (for an appropriate within-segment model for the data); see Yao and Au (1989), Fearnhead and Rigaill (2017) and Chen and Gupta (2000) for further discussion. For example, using quadratic error loss gives:

$$\mathcal{C}(y_{s:t}) = \sum_{i=s}^{t} \left( y_i - \bar{y}_{s:t} \right)^2, \qquad \bar{y}_{s:t} = \frac{1}{t-s+1} \sum_{i=s}^{t} y_i. \tag{1}$$
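As an illustration of how the quadratic error loss can be evaluated in constant time per segment, the following is a minimal sketch based on prefix sums. The function name `make_quadratic_cost` and the indexing convention (`cost(s, t)` covers the Python slice `y[s:t]`, that is, observations $s+1$ to $t$ in 1-based notation) are assumptions made for this example only.

```python
from itertools import accumulate


def make_quadratic_cost(y):
    """Return cost(s, t): quadratic error loss of the slice y[s:t] (t exclusive)."""
    # Prefix sums with a leading zero, so that S[t] - S[s] == sum(y[s:t]).
    S = [0.0] + list(accumulate(float(v) for v in y))
    S2 = [0.0] + list(accumulate(float(v) ** 2 for v in y))

    def cost(s, t):
        seg_sum = S[t] - S[s]
        # sum (y_i - mean)^2 = sum y_i^2 - (sum y_i)^2 / segment length
        return (S2[t] - S2[s]) - seg_sum * seg_sum / (t - s)

    return cost


cost = make_quadratic_cost([1.0, 1.0, 3.0, 3.0])
print(cost(0, 4))               # whole series treated as one segment
print(cost(0, 2) + cost(2, 4))  # series split at the change in level
```

Precomputing the two prefix sums costs one pass over the data, after which any segment cost needs only a handful of arithmetic operations; this is what makes the dynamic programming recursions discussed later practical.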

Note that in the case of a piecewise constant signal observed with additive Gaussian noise of unit variance, (1) is equivalent, up to an additive constant, to twice the negative log-likelihood. The second specification is $\beta$, the penalty incurred when introducing a changepoint into the model. Common choices for $\beta$ include the Akaike Information Criterion, Schwarz Information Criterion and modified Bayesian Information Criterion; see Rigaill et al. (2013), Haynes et al. (2017) and Truong et al. (2017) and references therein for further discussion. Finally, it is assumed that the cost function is additive over segments. The objective is then to find the segmentation which minimises the cost. In other words, we wish to find:

$$\min_{m,\; \tau_1 < \cdots < \tau_m} \; \sum_{j=1}^{m+1} \left[ \mathcal{C}\left( y_{\tau_{j-1}+1:\tau_j} \right) + \beta \right], \qquad \tau_0 = 0, \quad \tau_{m+1} = n. \tag{2}$$
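To make the objective concrete, the sketch below evaluates the penalised cost of one given segmentation under quadratic error loss. It assumes that the penalty is charged once per segment (charging it once per changepoint instead only shifts every value by the same constant), and the names `quadratic_cost` and `penalised_cost` are illustrative rather than taken from any package.

```python
def quadratic_cost(segment):
    """Quadratic error loss of one segment around its own mean."""
    mean = sum(segment) / len(segment)
    return sum((v - mean) ** 2 for v in segment)


def penalised_cost(y, changepoints, beta):
    """Penalised cost of a segmentation.

    A changepoint at position tau (0 < tau < len(y)) means that one segment
    ends with the tau-th observation and the next begins immediately after.
    """
    bounds = [0] + sorted(changepoints) + [len(y)]
    total = 0.0
    for start, end in zip(bounds[:-1], bounds[1:]):
        total += quadratic_cost(y[start:end]) + beta  # assumed: penalty per segment
    return total


y = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
print(penalised_cost(y, [], beta=2.0))   # no changepoint
print(penalised_cost(y, [3], beta=2.0))  # one changepoint after the third point
```

In this toy series the single changepoint reduces the residual sum of squares by far more than the extra penalty it incurs, so the segmentation with one change has the lower penalised cost.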

Dynamic programming methods exist which are guaranteed to find the global minimum of (2). Optimal Partitioning, due to Jackson et al. (2005), uses dynamic programming to solve (2) exactly in a computation time of $\mathcal{O}(n^2)$. Killick et al. (2012) introduce the PELT algorithm, which also solves (2) exactly, and can have a substantially reduced computational cost. In situations where the number of changepoints increases linearly with $n$, Killick et al. (2012) show that PELT’s expected computational cost can be linear in $n$. However, the worst case cost is still $\mathcal{O}(n^2)$, suggesting that significant computational savings are still desirable in practice.
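For readers who prefer code to recursions, the following is a minimal sketch of a PELT-style dynamic programme for the change in mean problem with quadratic error loss. It is not the authors' implementation: the function name `pelt`, the starting value `F[0] = 0` (so that the penalty is counted once per segment) and the pruning constant of zero, which is valid for this loss, are assumptions of the example.

```python
from itertools import accumulate


def pelt(y, beta):
    """Return the sorted changepoints minimising quadratic loss plus beta per segment."""
    n = len(y)
    S = [0.0] + list(accumulate(float(v) for v in y))
    S2 = [0.0] + list(accumulate(float(v) ** 2 for v in y))

    def cost(s, t):
        # Quadratic error loss of the slice y[s:t].
        d = S[t] - S[s]
        return (S2[t] - S2[s]) - d * d / (t - s)

    F = [0.0] * (n + 1)     # F[t]: optimal penalised cost of y[0:t]
    last = [0] * (n + 1)    # last[t]: start of the final segment in that optimum
    candidates = [0]        # positions still eligible as the most recent change
    for t in range(1, n + 1):
        seg = [F[s] + cost(s, t) for s in candidates]
        i = min(range(len(seg)), key=seg.__getitem__)
        F[t] = seg[i] + beta
        last[t] = candidates[i]
        # Pruning step: s is dropped once F[s] + cost(s, t) exceeds F[t],
        # since it can then never be the optimal last change for a later time.
        candidates = [s for s, v in zip(candidates, seg) if v <= F[t]]
        candidates.append(t)

    # Backtrack through the stored final-segment starts.
    cps = []
    t = n
    while last[t] > 0:
        t = last[t]
        cps.append(t)
    return sorted(cps)


print(pelt([0.0] * 10 + [5.0] * 10, beta=3.0))
```

Removing the pruning line recovers Optimal Partitioning, whose inner loop always runs over every earlier position; the pruning is what allows the expected cost to become linear when changes are frequent, while leaving the quadratic worst case in place.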

Parallel computing techniques are an increasingly popular means of doing precisely this. The application of parallelisation is vast, with use in such areas as meta-heuristics, cloud computing and biomolecular simulation, as discussed in Alba (2005), Mezmaz et al. (2011), Schmid et al. (2012) and Wang and Dunson (2014) among many others. Some methods are more easily parallelisable in that it is plain how to split a search space or other task between different nodes. These problems are often described as ‘Embarrassingly Parallel’. For the changepoint detection problem, some existing methods may be described as such. These include Binary Segmentation, due to Scott and Knott (1974), and its related approaches, notably the Wild Binary Segmentation (WBS) method of Fryzlewicz (2014). However, it is not so straightforward to parallelise dynamic programming methods such as PELT. This shall be the focus of this paper.

One of our approaches to parallelising algorithms such as PELT will use the fact that (2) can still be solved exactly when we restrict the changepoints to an ordered subset . Let denote the minimum of (2) when we restrict changepoints to and consider data only up to time ; in addition let be an ordered set of (estimated) changepoints, so that, for :

Using the initial condition , this gives a means of recursively calculating .
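The restricted recursion described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it uses the unpruned Optimal Partitioning recursion instead of PELT, quadratic error loss, a starting value of zero (so the penalty is counted once per segment), and the hypothetical name `restricted_op`.

```python
from itertools import accumulate


def restricted_op(y, candidates, beta):
    """Minimise the penalised cost when changes may only occur at `candidates`.

    Returns (sorted changepoints, minimal penalised cost).
    """
    n = len(y)
    S = [0.0] + list(accumulate(float(v) for v in y))
    S2 = [0.0] + list(accumulate(float(v) ** 2 for v in y))

    def cost(s, t):
        # Quadratic error loss of the slice y[s:t].
        d = S[t] - S[s]
        return (S2[t] - S2[s]) - d * d / (t - s)

    # Segment end points: start of data, allowed changepoints, end of data.
    pts = [0] + sorted({c for c in candidates if 0 < c < n}) + [n]
    F = [0.0] * len(pts)    # F[k]: optimal cost of the data up to pts[k]
    prev = [0] * len(pts)   # prev[k]: index in pts of the preceding segment end
    for k in range(1, len(pts)):
        j = min(range(k), key=lambda i: F[i] + cost(pts[i], pts[k]))
        F[k] = F[j] + cost(pts[j], pts[k]) + beta
        prev[k] = j

    cps = []
    k = len(pts) - 1
    while prev[k] > 0:
        k = prev[k]
        cps.append(pts[k])
    return sorted(cps), F[-1]


y = [0.0] * 10 + [5.0] * 10
print(restricted_op(y, range(1, 20), beta=3.0))  # every position allowed
print(restricted_op(y, [5, 15], beta=3.0))       # only two positions allowed
```

The work done is quadratic in the number of allowed positions rather than in the length of the series, which is the property the parallel schemes exploit. The second call also shows that a restricted search can return changes on either side of a true change that it is not permitted to fit exactly.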

The general format of this paper is as follows: Section 2 introduces two means of parallelising dynamic programming methods for solving (2), which we refer to as Chunk and Deal. In each case, we provide a description of the proposed algorithm with practical suggestions for implementation, followed by a short discussion of the theoretical justifications behind these choices. We devote Section 3 to examining this latter aspect in detail. In particular, we establish the asymptotic consistency of Chunk and Deal in a specific case with recourse to the asymptotic consistency of the penalised cost function method. Section 4 compares the use of parallelisation to other common approaches in a number of scenarios involving changes in mean. We conclude with a short discussion in Section 5. The proofs of all results may be found in the appendices and supplementary materials.

2 Parallelisation of Dynamic Programming Methods

In this section, we introduce Chunk and Deal, two methods for parallelising dynamic programming procedures for changepoint detection. For convenience, we shall herein refer to this exclusively as the parallelisation of PELT.

We introduce the notation when referring to applying PELT to a dataset but only allowing candidate changepoints to be fitted from within the set . Note that we trivially require . The general setup for the parallelisation procedure then takes the following form:

  • (Split Phase) We divide the space into (not necessarily disjoint) subsets , where is the number of computer cores available;

  • Each of the cores then performs , returning a candidate set, , of changes, which are returned to the parent core;

  • (Merge Phase) The parent core then performs , and the method returns , the set of estimated changes found at this stage.

Note that in the above we require .
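The split and merge phases above can be sketched generically as follows. The function `split_merge`, its `solve` and `map_fn` arguments, and the toy solver used to exercise it are all hypothetical; in particular, this sketch hands the whole series to every worker and restricts only the candidate positions, whereas a scheme such as Chunk would also pass each worker only its own stretch of data.

```python
def split_merge(y, candidate_sets, solve, beta, map_fn=map):
    """Generic split/merge driver.

    solve(y, candidates, beta) must return the estimated changepoints chosen
    from `candidates`; map_fn applies a function to each candidate set and may
    be swapped for a parallel map.
    """
    def run_core(candidates):
        return solve(y, candidates, beta)

    # Split phase: one independent search per candidate set (one per core).
    per_core = list(map_fn(run_core, candidate_sets))
    # The parent pools and sorts everything the workers returned.
    pooled = sorted(set().union(*per_core))
    # Merge phase: a final search restricted to the pooled candidates.
    return solve(y, pooled, beta)


def toy_solve(y, candidates, beta):
    """Stand-in solver for illustration only: keep positions with a jump above beta."""
    return [c for c in candidates if abs(y[c] - y[c - 1]) > beta]


y = [0, 0, 0, 5, 5, 5, 1, 1]
print(split_merge(y, [[1, 2, 3], [4, 5, 6, 7]], toy_solve, beta=1.0))
```

Keeping the mapping function as a parameter separates the statistical logic from the parallel back end: the default runs serially, while the `map` method of a `concurrent.futures.ThreadPoolExecutor` can be passed in unchanged. A process pool would additionally require the worker function to be defined at module level so that it can be pickled.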

2.1 Chunk

The Chunk procedure consists of dividing the data into continuous segments and then handing each core a separate segment on which to search for changes. This splitting mechanism is shown in Figure 1. One problem with this division arises from changes which can be arbitrarily close to, or coincide with, the ‘boundary points’ of adjacent cores. This necessitates the use of an overlap - a set of points which are considered by both adjacent cores for potential changes, also shown in Figure 1. For a time series of length , we choose an overlap of size either side of the boundary for each core. The full procedure for Chunk is detailed in Algorithm 1.

Figure 1: The time series is split into continuous segments by the Chunk procedure, in this case with 5 cores (l). An overlap is specified between the segments such that points within are considered by both adjacent cores (r).
Data: A univariate dataset, .
Result: A set of estimated changepoint locations .
Step 1: Split the dataset into the subsets such that , , ; for  do
       On core , find ;
end for
Step 2: Sort into ascending order; Step 3: Calculate and return .
Algorithm 1 Chunk for the PELT procedure
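The split step of Algorithm 1 can be illustrated with a short sketch that computes the index range handled by each core. Equal-sized chunks, the 0-based half-open index ranges and the function name `chunk_split` are assumptions of this example; the overlap is left as a free parameter rather than fixed to any particular value.

```python
def chunk_split(n, n_cores, overlap):
    """Return one (start, end) pair per core, 0-based with end exclusive.

    Each chunk is extended by `overlap` points on both sides of its boundaries
    (clipped to the ends of the series), so that a change near a boundary is
    seen by both neighbouring cores.
    """
    bounds = [(k * n) // n_cores for k in range(n_cores + 1)]
    chunks = []
    for k in range(n_cores):
        start = max(0, bounds[k] - overlap)
        end = min(n, bounds[k + 1] + overlap)
        chunks.append((start, end))
    return chunks


print(chunk_split(100, 4, 5))
```

With 100 points, 4 cores and an overlap of 5, interior chunks contain 35 points and the two end chunks 30, so the extra work caused by the overlap is small whenever the overlap is short relative to the chunk length.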

Given that Algorithm 1 executes PELT multiple times, it is not immediate that Chunk represents a computational gain. We therefore briefly examine the speed of the procedure from an intuitive perspective. Taking the worst case computational cost of PELT to be , where is the candidate set of changepoints, then the worst case cost of the split phase will be . The cost of the merge phase is dependent on the total number of estimated changes generated in the split phase. If we can estimate changepoint locations to sufficient accuracy, then as each change appears in at most two of the ‘chunks’, the number of returned changes ought to be at most . Thus the merge phase has a cost that is . This intuition is confirmed later, in Corollary 3.3.1.

In order to guarantee that the method does not overestimate the number of changes, some knowledge of the location error inherent in the PELT procedure is needed. This motivates the results of Section 3, which in turn imply various practical choices for the length of the overlap region, . In particular, using will give an effective guarantee of the accuracy of the method. Other sensible choices for can be made based on the trade-off between accuracy and speed (see Section 3 for details).

2.2 Deal

The Deal procedure relies on distributing points to the computing cores in the same manner as a playing card dealer. Define as the largest integer such that . The split phase then partitions as follows:

This splitting mechanism is shown in Figure 2. On the core, the objective function to be minimised then becomes:

,

as discussed in Section 1. The full procedure for Deal is detailed in Algorithm 2.

Figure 2: Points coloured differently are processed by different cores. In the above case 5 cores are used. A core may fit changes only in locations of a given colour.
Data: A univariate dataset, .
Result: A set of estimated changepoint locations .
Step 1: Split the dataset into subsets such that ; for  do
       On core , find ;
end for
Step 2: Sort into ascending order; Step 3: Calculate and return .
Algorithm 2 Deal for the PELT procedure
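The dealing step of Algorithm 2 can be sketched in one line of Python. The name `deal_split` and the choice to deal out positions 1 to n-1 (so that the final time point is never a candidate) are assumptions made for this illustration.

```python
def deal_split(n, n_cores):
    """Deal candidate positions 1, ..., n-1 to the cores in round-robin order."""
    return [list(range(k, n, n_cores)) for k in range(1, n_cores + 1)]


for core, positions in enumerate(deal_split(12, 3), start=1):
    print(core, positions)
```

Every core still sees the full series, but may place changes only at its own positions, so each restricted search involves roughly n divided by the number of cores candidate locations.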

As for the Chunk procedure, the implementation of Deal leads to computational gains. By the previous section, the worst case computational time of the split phase of Deal will be . The speed of the merge phase is again dependent on the number of changes detected at the split phase. We demonstrate in the proof of Corollary 3.3.1 that the number of changes detected by each core is at most , meaning that the worst case performance of the merge phase is .

For both procedures, using the standard SIC penalty gives a maximum location error of in the asymptotic setting, see Theorems 3.2 and 3.3 for details. With the Deal procedure, however, an additional lower bound constraint is enforced on the number of cores required for this location error (see Theorem 3.3); we therefore recommend setting to be as large as the number of cores available in most practical settings.

We remark that while the Chunk and Deal procedures do not inherit the exactness of PELT in finding the optimal solution to (2), they nevertheless track the true optimum very closely, as seen by the empirical results in Section 4.

3 Consistency of Parallelised Approaches

As exactness with respect to minimising (2) cannot be assumed for the two methods, we must verify that they retain the desirable properties of PELT. To this end, we now turn to consider the consistency of the parallel procedures for a change in mean setting.

We now stipulate that a time series has changepoints corresponding to proportions , for some fixed , such that, for a given , the changepoints are defined as . For the asymptotic setting we consider, take to be fixed.

With this framework in place, we note that the consistency results for Chunk and Deal we develop in Section 3.1 require one particular result not provided by Killick et al. (2012), namely consistency of PELT for the change in mean setting.

Proposition 3.1.

We consider the change in mean setting for the univariate time series:

(3)

where , for and are a set of centered, independent and identically distributed Gaussian random variables. Take a series with changes and true changepoint locations (where ). Apply the PELT procedure, minimising squared error loss, with a penalty of , for any , to produce an estimated set of change locations , some number of estimated changes . Then, for any , as , where:

.

Proof: See Section C of the Supplementary Materials.

This result also extends naturally to the multivariate setting, with a penalty of (see Section C of the Supplementary Materials for details). For the univariate case, the proof of Proposition 3.1 follows a similar pattern to that of Yao (1988), though we relax Yao’s condition that an upper bound on the estimated number of changes is specified a priori.

3.1 Consistency and Computational Cost of Chunk and Deal

We now extend the consistency result in the unparallelised setting to obtain equivalent results for Chunk and Deal. These results not only give a bound on the maximum location error of an estimated changepoint, but also provide some insight into the best setting of, for instance, and , which we can use in turn to provide a theoretical result on the computational power of the new methods.

Theorem 3.2.

For the change in mean setting specified in (3), assume that for a data series of length we have cores across which to parallelise a changepoint detection procedure. Defining as for the previous results for any , then additionally assuming that with gives that, under the Chunk procedure for parallelising a procedure which minimises least squared error under a penalty of , as .

Proof: See Appendix.

Note that the definition of the event is as for Proposition 3.1, meaning that the maximum asymptotic location error under the Chunk procedure is . This error suggests that we can choose to be small compared to , the number of data points on a core. For instance, setting ensures accuracy in probability for large with negligible computational impact relative to .

In addition, we remark that the conditions on for Chunk are relatively weak, and are in place to avoid segments of fixed size for . For most practical values of , we advise setting such that the length of the segments is at least to avoid intersecting overlaps.

Theorem 3.3.

The same result as for Theorem 3.2 holds with the Deal parallelisation procedure, assuming in addition that .

Proof: See Appendix.

Note that the conditions on are stronger for Deal than for Chunk, with a lower bound corresponding with the maximum location error inherent in the event . We believe the constraint on is an artefact of the proof technique. Intuitively we would expect the statistical accuracy of Deal to be larger for smaller ; as, for example, corresponds to optimally minimising the cost. Practically, setting is unlikely to be problematic for typical values of , a notion which we confirm empirically in Section 4.

Finally, given these results, we are now in a position to give a formal statement on the worst case computational cost for both Chunk and Deal, when the computational cost of setting up a parallel environment is assumed to be negligible.

Corollary 3.3.1.

Under the change in mean setting outlined in Proposition 3.1, with probability tending to 1 as , the computational cost for Chunk when parallelising the PELT procedure using computer cores is , while for Deal the cost is , compared to a cost of for unparallelised PELT.

Proof: See Appendix.

We remark that setting in Corollary 3.3.1 guarantees a worst case computational cost of for both Chunk and Deal, no matter the performance of PELT. In addition, we note that we achieve a computational gain which is quadratic with in the best case. We emphasise again that this result ignores the cost of setting up a parallel environment, which can lead to PELT performing better computationally for small . Therefore, we now conduct a simulation study in order to understand the likely practical circumstances in which parallelisation is a more efficient option.

4 Simulations

We now turn to consider the performance of these parallelised approximate methods on simulated data.

While these suggested parallelisation techniques do speed up the implementation of the dynamic programming procedure underlying, say, PELT, the exactness of PELT in resolving (2) is no longer guaranteed. We therefore compare parallelised PELT with Wild Binary Segmentation (WBS), proposed by Fryzlewicz (2014), a non-exact changepoint method which has impressive computational speed.

Simulated time series with piecewise normal segments were generated. Five scenarios, with changes at particular proportions of the time series, were examined in detail in the study. For a time series length of 100000, these scenarios are shown in Figure 3.

Figure 3: Five scenarios under examination in the simulation study. Clockwise from top left are scenarios A, B, C, E and D with 2, 3, 6, 14 and 9 changes respectively.

Different lengths of series for each of the five scenarios, keeping the proportionate change sets the same, were used to examine the statistical power of PELT, Chunk, Deal and WBS under a number of replications for the error terms ( in all cases). In addition, four change magnitudes (0.25, 0.5, 1 and 2) were used to examine the behaviour of the algorithms in each of the scenarios as was increased.

The number of false positives (counted as the number of estimated changes more than points from the closest true change) and missed changes (the number of true changes with no estimated change within points), as well as the maximum observed location error and average location error across all repetitions, were measured. Finally, the average cost of the segmentations (using mean squared error) generated by the methods, relative to the optimal cost given by PELT, was recorded.
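The two counting measures just described can be sketched as below. The tolerance is passed as a parameter `tol` because its value is not fixed here, and the function names are illustrative.

```python
def false_alarms(estimated, true, tol):
    """Number of estimated changes lying more than `tol` points from every true change."""
    return sum(1 for e in estimated if all(abs(e - t) > tol for t in true))


def missed_changes(estimated, true, tol):
    """Number of true changes with no estimated change within `tol` points."""
    return sum(1 for t in true if all(abs(e - t) > tol for e in estimated))


true_cps = [100, 200]
estimated_cps = [98, 150, 260]
print(false_alarms(estimated_cps, true_cps, tol=10))
print(missed_changes(estimated_cps, true_cps, tol=10))
```

In this example the estimate at 98 matches the true change at 100, the estimates at 150 and 260 count as false alarms, and the true change at 200 is missed.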

As can be seen from Tables 1 - 3, Chunk and Deal closely mirror PELT in statistical performance in finding approximately the same number of changes in broadly similar locations. The performance of WBS was generally worse across these measures, although WBS did mirror PELT moderately closely and did occasionally perform better, particularly in certain scenarios for average location error. However, as the number of changes was increased, WBS was generally outperformed by both Chunk and Deal.

From Table 4, we note that, in practice, Deal often outperforms Chunk in terms of computational speed for a given number of cores. This is due to the fact that the Deal procedure will rarely perform at the worst case computational speed during the split phase (which typically dominates the computation time), as one of the candidates around a true change is very likely to be chosen as a candidate changepoint (see the proof of Theorem 3.3). This means that more candidates for the most recent changepoint are pruned than for Chunk. PELT was observed to be the fastest method for the smallest value of across all scenarios. It was at the larger values of where the quadratic gains in speed of Chunk and Deal became apparent, as can also be seen in Figure 4, which shows the relative computational gain for Scenario C when and across multiple different values of .

Finally, from Table 5, both Chunk and Deal are seen to track PELT very closely in terms of the final cost of the model. In particular, Deal seems to perform well for smaller values of , while Chunk in general appears to find solutions of a very similar global cost to the PELT algorithm for larger . Caution should be exercised, however, as only the value of was tested.

Average False Alarms Length Length Length
Scenario Method 0.25 0.5 1 2 0.25 0.5 1 2 0.25 0.5 1 2
A PELT 0.69 0.77 0.26 0.05 1.36 0.74 0.15 0.01 1.28 0.59 0.10 0.00
(2 changes) Chunk4 0.70 0.87 0.26 0.05 1.50 0.74 0.16 0.01 1.29 0.59 0.10 0.00
Deal4 0.68 0.73 0.26 0.05 1.37 0.74 0.15 0.01 1.37 0.74 0.15 0.01
WBS 0.54 0.66 0.29 0.08 1.20 0.66 0.16 0.00 1.26 0.59 0.10 0.00
B PELT 0.17 0.29 0.14 0.04 0.76 0.46 0.17 0.02 0.98 0.83 0.09 0.00
(3 changes) Chunk4 0.14 0.23 0.17 0.04 0.72 0.46 0.15 0.01 0.98 0.53 0.09 0.00
Deal4 0.16 0.27 0.16 0.02 0.77 0.46 0.17 0.01 0.77 0.46 0.17 0.01
WBS 0.15 0.25 0.19 0.07 0.55 0.45 0.12 0.02 0.97 0.93 0.24 0.10
C PELT 0.92 1.31 0.76 0.09 3.08 2.10 0.38 0.01 3.94 1.89 0.20 0.00
(6 changes) Chunk4 0.91 1.25 0.79 0.08 2.84 2.12 0.38 0.01 3.96 1.88 0.19 0.00
Deal4 0.88 1.29 0.74 0.08 3.07 2.17 0.39 0.01 3.07 2.17 0.39 0.01
WBS 0.86 1.23 1.07 0.23 2.73 2.40 0.66 0.08 4.11 2.17 0.53 0.11
D PELT 1.03 1.47 0.85 0.10 3.72 2.88 0.58 0.04 5.28 2.76 0.43 0.01
(9 changes) Chunk4 1.09 1.42 0.82 0.10 3.38 2.85 0.57 0.02 5.26 2.75 0.43 0.01
Deal4 1.03 1.41 0.87 0.08 3.73 2.89 0.60 0.03 3.73 2.89 0.60 0.03
WBS 0.97 1.27 1.01 0.17 3.20 3.10 0.90 0.20 5.42 3.26 0.79 0.17
E PELT 1.04 1.72 1.20 0.13 4.39 4.12 0.88 0.01 8.22 4.12 0.55 0.00
(14 changes) Chunk4 1.09 1.66 1.21 0.12 4.25 4.13 0.89 0.01 4.38 4.12 0.55 0.00
Deal4 1.04 1.66 1.20 0.10 4.32 4.07 0.86 0.01 4.32 4.18 0.86 0.01
WBS 1.01 1.67 1.24 0.24 3.86 4.23 1.24 0.18 8.14 4.50 1.08 0.18
Table 1: The average number of false alarms recorded across all 200 repetitions for each of the 5 scenarios A, B, C, D and E. A false alarm is defined as an estimated changepoint which is at least points from the closest true changepoint.
Average Num. Missed Length Length Length
Scenario Method 0.25 0.5 1 2 0.25 0.5 1 2 0.25 0.5 1 2
A PELT 1.77 1.10 0.22 0.01 1.38 0.72 0.14 0.00 1.28 0.59 0.10 0.00
(2 changes) Chunk4 1.94 1.30 0.21 0.01 1.55 0.72 0.15 0.00 1.29 0.59 0.10 0.00
Deal4 1.77 1.09 0.22 0.01 1.39 0.73 0.15 0.00 1.30 0.58 0.10 0.00
WBS 1.84 1.29 0.22 0.01 1.45 0.66 0.16 0.00 1.26 0.59 0.10 0.00
B PELT 2.62 2.04 1.17 1.00 2.47 1.94 1.06 0.00 2.45 0.86 0.09 0.00
(3 changes) Chunk4 2.65 2.12 1.20 1.01 2.48 1.95 1.06 0.00 2.45 0.86 0.09 0.00
Deal4 2.65 2.08 1.19 1.02 2.50 1.94 1.14 0.00 2.48 0.87 0.09 0.00
WBS 2.65 2.13 1.29 0.91 2.51 1.95 1.06 0.01 2.43 1.02 0.16 0.01
C PELT 5.53 4.55 0.74 0.04 4.79 2.08 0.37 0.00 3.94 1.89 0.20 0.00
(6 changes) Chunk4 5.65 4.71 0.83 0.04 4.91 2.10 0.37 0.00 3.96 1.88 0.19 0.00
Deal4 5.53 4.62 0.73 0.04 4.86 2.15 0.39 0.00 3.96 1.01 0.19 0.00
WBS 5.57 4.71 1.22 0.08 4.90 2.36 0.56 0.03 4.05 2.08 0.48 0.04
D PELT 8.20 6.63 2.13 0.81 7.45 4.32 0.72 0.02 6.38 2.75 0.43 0.00
(9 changes) Chunk4 8.36 6.74 2.20 0.83 7.62 4.29 0.71 0.01 6.37 2.75 0.43 0.00
Deal4 8.22 6.68 2.23 0.84 7.50 4.34 0.77 0.02 6.39 2.73 0.45 0.00
WBS 8.22 6.66 2.65 0.66 7.79 4.57 1.07 0.07 6.48 3.21 0.67 0.02
E PELT 13.0 11.0 5.18 1.20 12.0 6.70 2.07 0.00 9.85 4.26 0.55 0.00
(14 changes) Chunk4 13.2 11.2 5.25 1.24 12.2 6.79 2.10 0.00 9.90 4.25 0.55 0.00
Deal4 13.0 11.1 5.40 1.29 12.0 6.68 2.09 0.00 9.88 4.33 0.58 0.00
WBS 13.1 11.2 6.09 1.53 12.3 7.46 2.51 0.16 10.2 5.00 0.97 0.04
Table 2: The average number of missed changes across all 200 repetitions for each of the 5 scenarios A, B, C, D and E. A missed change is defined as a true changepoint for which no estimated change lies within points.
Average Location Error Length Length Length
Scenario Method 0.25 0.5 1 2 0.25 0.5 1 2 0.25 0.5 1 2
A PELT 64.1 21.8 7.43 5.96 70.0 24.7 16.1 14.4 46.0 11.7 3.21 1.26
(2 changes) Chunk4 48.3 21.9 7.63 5.67 89.5 12.4 3.63 1.24 47.4 11.7 3.21 1.26
Deal4 59.1 20.7 7.59 5.65 56.8 12.0 3.33 1.19 46.7 11.7 3.20 1.22
WBS 86.2 34.7 12.7 10.7 52.4 12.3 3.40 1.20 46.0 12.1 3.18 1.26
B PELT 69.7 31.2 17.8 15.7 75.9 42.8 26.4 17.1 47.5 12.1 3.00 1.27
(3 changes) Chunk4 76.5 37.1 18.3 15.7 72.4 41.6 25.3 16.3 47.0 10.8 3.00 1.27
Deal4 50.4 22.5 17.8 10.0 74.8 41.2 25.9 16.4 47.8 12.4 3.00 1.26
WBS 59.9 38.7 17.4 13.8 32.2 11.0 3.25 1.52 47.4 14.4 5.82 3.07
C PELT 29.1 17.2 5.42 3.22 71.1 16.4 7.19 5.14 50.3 12.5 3.04 1.23
(6 changes) Chunk4 31.3 18.5 4.94 2.61 64.1 16.4 6.64 4.59 50.7 12.4 3.01 1.24
Deal4 28.1 16.7 4.75 2.67 69.2 16.1 6.78 4.69 50.2 11.5 2.96 1.18
WBS 21.8 14.1 5.87 2.51 65.1 17.7 5.79 1.88 80.7 24.0 5.62 1.93
D PELT 19.3 12.2 4.37 2.80 57.3 15.0 7.98 3.20 85.5 11.8 3.33 1.26
(9 changes) Chunk4 22.3 12.6 4.67 2.48 59.5 14.6 7.68 2.77 86.2 11.7 3.32 1.26
Deal4 19.4 12.0 4.08 2.40 55.1 14.7 4.95 2.80 85.8 11.8 3.42 1.28
WBS 17.6 10.4 4.41 4.12 58.3 20.0 5.29 1.76 199 20.4 6.47 2.39
E PELT 14.1 9.72 3.83 2.09 52.0 12.5 4.06 1.71 51.1 12.2 3.29 1.27
(14 changes) Chunk4 15.3 9.36 3.96 1.97 63.4 13.0 4.07 1.71 53.6 12.2 3.29 1.29
Deal4 14.2 9.88 3.87 1.70 51.6 12.5 4.03 1.75 51.1 12.5 3.33 1.27
WBS 13.7 9.67 4.20 2.58 56.9 17.1 9.05 1.64 70.6 36.3 5.18 1.90
Table 3: The average location error between those true changes which were detected by the algorithms and the corresponding estimated change across all 200 repetitions for each of the 5 scenarios.
Mean Computational Gain Length Length Length
Scenario Method 0.25 0.5 1 2 0.25 0.5 1 2 0.25 0.5 1 2
A Chunk4 0.05 0.05 0.05 0.05 2.47 2.66 2.82 2.84 12.2 10.0 12.9 14.9
(2 changes) Deal4 0.05 0.05 0.05 0.05 2.97 3.09 3.25 3.30 21.4 24.2 21.4 22.7
B Chunk4 0.05 0.05 0.05 0.05 2.75 2.78 2.74 2.68 13.4 8.55 14.4 10.7
(3 changes) Deal4 0.05 0.05 0.05 0.05 2.98 3.20 3.09 3.11 14.2 14.0 34.8 36.3
C Chunk4 0.06 0.04 0.05 0.06 2.72 2.97 2.79 2.77 7.65 11.1 17.6 10.7
(6 changes) Deal4 0.06 0.05 0.05 0.05 3.05 3.33 3.32 3.16 26.0 32.0 14.8 10.7
D Chunk4 0.05 0.06 0.05 0.08 3.17 2.96 3.22 2.96 7.91 8.21 10.2 10.7
(9 changes) Deal4 0.05 0.06 0.06 0.08 3.59 3.95 3.56 3.87 24.3 31.5 26.8 33.2
E Chunk4 0.06 0.05 0.06 0.05 3.10 2.94 2.69 2.48 7.00 7.09 19.8 21.4
(14 changes) Deal4 0.06 0.05 0.07 0.05 3.75 3.61 3.22 3.22 14.1 17.0 12.6 16.4
Table 4: The average relative computational speed of the Chunk and Deal procedures compared to PELT using 4 cores.
Average Cost - Optimal Length Length Length
Scenario Method 0.25 0.5 1 2 0.25 0.5 1 2 0.25 0.5 1 2
A Chunk4 1.69 1.12 0.03 0.02 3.07 0.05 0.01 0.00 0.03 0.00 0.01 0.00
(2 changes) Deal4 0.10 0.21 0.29 0.19 0.10 0.18 0.34 0.41 0.07 0.14 0.24 0.19
WBS 1.70 1.76 0.63 0.65 1.63 0.09 0.08 0.06 0.00 0.10 0.10 0.10
B Chunk4 0.13 0.59 0.16 0.10 0.14 0.01 0.01 0.01 0.00 0.00 0.00 0.00
(3 changes) Deal4 0.10 0.23 0.53 2.35 0.13 0.23 0.62 2.52 0.09 0.28 0.49 0.52
WBS 1.04 1.60 1.67 2.18 1.23 1.01 2.09 1.93 2.00 2.00 2.80 2.80
C Chunk4 1.47 1.33 0.26 0.16 3.39 0.18 0.02 0.00 0.05 0.01 0.00 0.00
(6 changes) Deal4 0.19 0.49 1.40 4.73 0.22 0.58 1.25 4.48 0.26 0.46 0.64 0.66
WBS 2.12 3.79 4.14 3.72 7.01 3.51 4.18 3.97 4.30 4.80 4.30 4.50
D Chunk4 1.99 1.11 0.39 0.29 3.68 0.05 0.01 0.01 0.05 0.01 0.00 0.00
(9 changes) Deal4 0.22 0.54 1.32 2.98 0.31 0.77 1.69 4.97 0.37 0.76 1.82 5.42
WBS 2.09 5.01 5.38 5.26 9.03 6.12 5.94 5.94 6.60 6.60 6.30 5.10
E Chunk4 1.96 1.75 0.49 0.29 5.93 0.53 0.04 0.01 0.09 0.01 0.00 0.00
(14 changes) Deal4 0.21 0.72 2.08 4.79 0.42 1.16 3.19 12.0 0.58 1.16 2.79 7.49
WBS 3.06 6.38 8.95 8.04 9.76 9.84 7.52 9.37 11.8 11.0 11.2 9.00
Table 5: The average cost, calculated using the log likelihood of the segments, resulting from executing each procedure. This is adjusted according to the equivalent cost computed by PELT (which is optimal).
Figure 4: Relative computational gain for Chunk and Deal across a differing number of cores under Scenario C with and . The line is shown for comparison, demonstrating the super-linear computational gain.

5 Discussion

We have proposed two new methods for changepoint detection, Chunk and Deal, each based on parallelising an existing method, PELT. These methods represent a substantial computational gain in many cases, particularly for large . In addition, by establishing the asymptotic consistency of PELT, we have been able in turn to show the asymptotic consistency of the Chunk and Deal methods, such that the error inherent to all three is in terms of the maximum location error of an estimated change relative to the corresponding true change.

We have demonstrated empirically that an implication of this is that Chunk and Deal, while not inheriting the exactness of PELT, do perform well in finding changes in practice.

6 Acknowledgments

Tickle is grateful for the support of the EPSRC (grant number EP/L015692/1). The authors also acknowledge British Telecommunications plc (BT) for financial support, and are grateful to Kjeld Jensen and Dave Yearling in BT Research & Innovation for helpful discussions.

Appendix A Appendix

The following results will be stated with respect to a general . Theoretically, this means that any can be used in Algorithm 1 or Algorithm 2, however in the simulation study detailed in Section 4, was used as the overlap length (for Chunk), while the cutoff value for closeness detailed in the merge phase (Step 3) of both procedures was taken as .

Proof of Theorem 3.2: It is necessary to establish that:

  (I) Each change is detected by the core in which it is present.

  (II) At the merge phase, only one estimated change per core is kept.

Proof of (I): Taking , then as . Thus, the Chunk procedure will inherit the asymptotic consistency of the base procedure providing no change consistently falls arbitrarily close to the boundary between two cores. In which case, for , the segment length would be reduced from to (although the smallest possible segment length is ). As such a segment length violates the condition on fixed values for outlined in Section 3, it is therefore necessary to establish that a true change positioned at a point within of either the beginning or the end of the series will be detected with probability 1 as . It is sufficient to extend Corollary B.2.1 (see Section B of the Supplementary Materials) to the case where the first true change is at location .

However, as the minimum segment length is at least , then by the argument of the proof of Corollary B.2.1 each changepoint is detected by at least one core to which it is given in probability for increasing . Formally, one can consider the difference:

and it can be shown that . In particular, , so the Chunk procedure will detect a change at the boundary with a segment length of .

Proof of (II): We examine all segmentations, , such that , for fixed , . Then if :

where .

With recourse once more to the result of Laurent and Massart (2000) (see equation (6) in Section B of the Supplementary Materials for the general result), if then:

for any , providing that . As there are possibilities for the segmentation , then uniformly across all segmentations, taking for example, gives that the difference in the residual sum of squares is uniformly .

Now from the proof of Proposition 3.1 (see Section C of the Supplementary Materials) we know that in probability any segmentation with more than changes will have a greater cost than the true segmentation if a penalty of is used. Therefore, across all segmentations under consideration here, the cost is at least greater than the cost of the true segmentation if a penalty of is used.

Proof of Theorem 3.3: Recall that and . The idea will be to show that the core which is ‘dealt’ a particular true change, , will always return this true change as a candidate changepoint for the merge phase. By Yao (1988), letting be a set of estimated changes which miss the true change by at least , then again by the proof of Corollary B.2.1 the cost of this segmentation is strictly worse than the cost of also fitting changes at the points and . By then considering the difference:

in a similar fashion to the proof of Corollary B.2.1, it can be shown that in probability:

where again is the absolute change in mean at the changepoint .
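The notion of a core being 'dealt' a true change can be illustrated with a short Python sketch. It assumes that Deal hands out candidate changepoint positions to cores in round-robin order, as when dealing cards, while every core still sees the whole series; the function names deal_candidates and dealt_core are hypothetical and are not taken from the paper.

```python
def deal_candidates(n, n_cores):
    """Deal candidate positions 1..n-1 to cores in round-robin order.

    Core k receives positions k+1, k+1+n_cores, k+1+2*n_cores, ...
    This assignment rule is an assumption made for illustration.
    """
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return [list(range(1 + k, n, n_cores)) for k in range(n_cores)]


def dealt_core(tau, n_cores):
    """Index of the core that is dealt candidate position tau (tau >= 1)."""
    return (tau - 1) % n_cores


hands = deal_candidates(n=10, n_cores=3)
owner = dealt_core(5, n_cores=3)
```

Under this assignment each candidate position belongs to exactly one core, so a given true change is dealt to a single core, which is the core considered in the proof above.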

Proof of Corollary 3.3.1: It is sufficient to prove the following Claim regarding the number of candidate changes each core returns.

Claim: In probability, and for any candidate set given to the cores in accordance with the conditions of Theorem 3.2 and Theorem 3.3:

  1. under the Chunk procedure, the maximum number of points returned for the merge phase is bounded above by ,

  2. under Deal, the maximum number of points recorded as estimated changes is bounded above by for each core.

Proof of Claim:

Proof of (I): We note that when is constant, the result is immediate from the proof of Lemma 3.1.

When , it suffices to show that across all cores which are given no true changes, the probability of any of these cores returning a true change converges to 0. Given that the number of cores which are given a change is fixed (and bounded above at - as each change could fall inside an overlap), the result is then immediate from the proof of Theorem 3.2.

Considering a single core with no true changes, we adapt the argument from the proof of Proposition 3.1. For a quantity which is distributed according to a distribution, then by Laurent and Massart (2000):

Fitting changes across a core will give that the residual sum of squares relative to a fit of no changes across the same core follows a distribution. Therefore, following the application of a Bonferroni correction across all possible placings of changes gives that the difference between the null fit and the best possible fit of changes is then bounded in probability as:

In particular, setting and as before, gives that:

and so scaling this by :

Therefore, the computation time of the merge phase of Chunk is in the worst case, which along with the worst case cost from the split phase of gives the worst case computation time for the whole procedure.
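The tail bound of Laurent and Massart (2000) invoked in this proof can be checked numerically. The sketch below assumes the standard form of that bound for a chi-squared variable X with k degrees of freedom, namely P(X >= k + 2*sqrt(k*x) + 2*x) <= exp(-x) for any x > 0; the exact form and constants used by the authors may differ. The function names, the number of simulations and the seed are illustrative choices.

```python
import math
import random


def lm_upper_threshold(k, x):
    """Threshold k + 2*sqrt(k*x) + 2*x from the standard Laurent-Massart bound."""
    return k + 2.0 * math.sqrt(k * x) + 2.0 * x


def empirical_tail(k, x, n_sims=20000, seed=0):
    """Monte Carlo estimate of P(chi-squared_k >= threshold).

    A chi-squared_k draw is simulated as a sum of k squared standard normals.
    """
    rng = random.Random(seed)
    threshold = lm_upper_threshold(k, x)
    hits = 0
    for _ in range(n_sims):
        draw = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))
        if draw >= threshold:
            hits += 1
    return hits / n_sims


k, x = 5, 2.0
estimate = empirical_tail(k, x)
bound = math.exp(-x)  # the bound guarantees the true tail probability is below this
```

The simulated frequency should fall below exp(-x). Applying the bound across all possible placings of the changes, as the Bonferroni step in the proof does, multiplies exp(-x) by the number of placings, which is why x is chosen to grow with the series length.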

Proof of (II): Define, for a given core under the Deal procedure:

where is the final point given to the core which is strictly before , and is the first point given to the core which is after . In the same way as for the proof of Proposition 3.1, we examine the best possible segmentations which include as a subset of the estimated changepoints for a core, and show that all are rejected in favour of in probability. We then show that this is true across all cores in probability.

For a given core, suppose is a set of points estimated as changes under the Deal procedure such that . By construction of , all points in must lie in a region between two points of which also does not contain any true changes. We can therefore apply the same argument as for Proposition 3.1 to the difference:

where refers to any such region between two consecutive points of which contains a point found only in . Uniformly across such regions, and supposing such estimated changes are found within , it can be seen that the positive term in the expression of the difference above is distributed as . Thus letting and again with recourse to the Bonferroni correction argument as in Proposition 3.1, for a given :

Note that this argument does not consider segmentations which do not contain as a proper subset. In order to extend this argument, we define the following three sets of segmentations (with respect to a given core):

Note that and that the argument showing that any segmentation containing is rejected uniformly in favour of may be extended to any element of to show that any segmentation with more than estimated changes in total and which has at least two estimated changes between each true change is uniformly dominated by a corresponding element of .

In the same way, let us now consider extensions from a general element, , where here an extension is defined as a superset of which also contains additional estimated changes from regions between two estimated changes within not containing a true change. Letting, for example:

for some and . Then any extension of consists of placing any further estimated changes in any of the regions between the changes above with the exception of either (if ) the region or (if ) the region . Let be an arbitrary such extension, and again let be any region between two consecutive points of which contains a point found only in . As before, uniformly across such regions, and supposing again that such estimated changes are found within , letting:

then again Diff is distributed as . With recourse to the same argument as before (noting again that any such region will have at most candidate points for the extension - no matter which base element of we pick), and extending to other elements of , we conclude that any segmentation with more than estimated changes which places just one estimated change between two true changes in at least one case will be rejected uniformly (and for all cores) in favour of an element of .

Finally, we consider all segmentations with more than changes which place no estimated changes between two true changes in at least one case. We again compare with . Letting, for example:

for some . Then any extension of consists of placing any further estimated changes in any of the regions between the changes above with the exception of the region . Let be an arbitrary such extension, and again let be any region between two consecutive points of which contains a point found only in . Then again letting:

then for changes in the region , Diff is distributed as . We can again extend this argument to extensions of other elements of to conclude that segmentations with more than changes which have no estimated changepoints between two consecutive true changes in at least one case will be uniformly rejected in favour of an element of .

Therefore, as any segmentation with more than changes for any core is an extension of an element of , or (as such a segmentation must contain a region between two consecutive true changes with at least three estimated changes), then across all cores, a segmentation must be picked from within one of the classes , or in probability. Thus, the maximum number of estimated changepoints that a core can return in the Deal procedure is .

The number of candidates returned for the merge phase of the Deal procedure is therefore bounded in probability by , so that the maximum computation time of the merge phase is in the worst case, giving the total worst case computation time for the whole procedure.
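The quadratic dependence of the merge phase on the number of candidates can be seen in a minimal Python sketch of a penalised least-squares dynamic programme restricted to a candidate set. It assumes a change-in-mean model with a residual sum of squares cost and a penalty added per changepoint; the function name restricted_op and the toy data are hypothetical, and no pruning is applied, so this is not the authors' implementation.

```python
from itertools import accumulate


def restricted_op(y, candidates, penalty):
    """Minimise RSS + penalty * (number of changes) over subsets of candidates.

    A changepoint tau means one segment ends at y[tau - 1] and the next
    starts at y[tau]. Every pair of admissible endpoints is compared once,
    so the work is quadratic in the number of candidates.
    """
    n = len(y)
    s1 = [0.0] + list(accumulate(y))                 # cumulative sums
    s2 = [0.0] + list(accumulate(v * v for v in y))  # cumulative sums of squares

    def seg_cost(s, t):
        # Residual sum of squares of y[s:t] about its own mean.
        total = s1[t] - s1[s]
        return (s2[t] - s2[s]) - total * total / (t - s)

    pts = [0] + sorted({c for c in candidates if 0 < c < n}) + [n]
    best = [0.0] * len(pts)
    prev = [0] * len(pts)
    best[0] = -penalty  # offsets the penalty charged for the first segment
    for j in range(1, len(pts)):
        best[j], prev[j] = min(
            (best[i] + seg_cost(pts[i], pts[j]) + penalty, i) for i in range(j)
        )

    # Backtrack from the end of the series to recover the chosen changes.
    changes = []
    j = prev[-1]
    while j > 0:
        changes.append(pts[j])
        j = prev[j]
    return sorted(changes)


y = [0.0] * 10 + [5.0] * 10
merged = restricted_op(y, candidates=[3, 10, 15], penalty=2.0)
```

With c candidates the double loop evaluates on the order of c squared segment costs, so a bound on the number of candidates returned by the cores translates directly into a bound on the merge cost.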

References

  • Alba (2005) Alba, E. (2005). Parallel Metaheuristics. John Wiley & Sons, Inc., Hoboken, New Jersey, United States of America.
  • Chen and Nkurunziza (2017) Chen, F. and Nkurunziza, S. (2017). On estimation of the change points in multivariate regression models with structural changes. Communications in Statistics - Theory and Methods, 46(14):7157–7173.
  • Chen and Gupta (2000) Chen, J. and Gupta, A. K. (2000). Parametric Statistical Changepoint Analysis. Birkhäuser, Boston, Massachusetts, United States of America.
  • Fearnhead and Rigaill (2017) Fearnhead, P. and Rigaill, G. (2017). Changepoint detection in the presence of outliers. arXiv:1609.07363v2, pages 1–29.
  • Fryzlewicz (2014) Fryzlewicz, P. (2014). Wild binary segmentation for multiple change-point detection. The Annals of Statistics, 42(6):2243–2281.
  • Haynes et al. (2017) Haynes, K., Eckley, I., and Fearnhead, P. (2017). Computationally efficient changepoint detection for a range of penalties. Journal of Computational and Graphical Statistics, 26(1):134–143.
  • Jackson et al. (2005) Jackson, B., Scargle, J., Barnes, D., Arabhi, S., Alt, A., Gioumousis, P., Gwin, E., Sangtrakulcharoen, P., Tan, L., and Tsai, T. (2005). An algorithm for optimal partitioning of data on an interval. IEEE Signal Processing Letters, 12(2):105–108.
  • Killick et al. (2012) Killick, R., Fearnhead, P., and Eckley, I. (2012). Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107(500):1590–1598.
  • Laurent and Massart (2000) Laurent, B. and Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338.
  • Mezmaz et al. (2011) Mezmaz, M., Melab, M., Kessaci, Y., Lee, Y., Talbi, E.-G., Zomaya, A., and Tuyttens, D. (2011). A parallel bi-objective hybrid metaheuristic for energy-aware scheduling for cloud computing systems. Journal of Parallel and Distributed Computing, 71(11):1497–1508.
  • Rigaill et al. (2013) Rigaill, G., Hocking, T. D., Bach, F., and Vert, J. P. (2013). Learning sparse penalties for change-point detection using max margin interval regression. Proceedings of the 30th International Conference on Machine Learning (ICML-13).
  • Rigaill et al. (2012) Rigaill, G., Lebarbier, E., and Robin, S. (2012). Exact posterior distributions and model selection criteria for multiple change-point detection problems. Statistics and Computing, 22(4):917–929.
  • Schmid et al. (2012) Schmid, N., Christ, C., Christen, M., Eichenberger, A., and van Gunsteren, W. (2012). Architecture, implementation and parallelisation of the GROMOS software for biomolecular simulation. Computer Physics Communications, 183(4):890–903.
  • Scott and Knott (1974) Scott, A. and Knott, M. (1974). A cluster analysis method for grouping means in the analysis of variance. Biometrics, 30(3):507–512.
  • Truong et al. (2017) Truong, C., Oudre, L., and Vayatis, N. (2017). Penalty learning for changepoint detection. Proceedings of the 2017 25th European Signal Processing Conference (EUSIPCO) in Kos, Greece.
  • Truong et al. (2018) Truong, C., Oudre, L., and Vayatis, N. (2018). A review of changepoint detection methods. arXiv:1801.00718, pages 1–31.
  • Wang and Dunson (2014) Wang, X. and Dunson, D. B. (2014). Parallelizing MCMC via Weierstrass sampler. arXiv:1312.4605v2, pages 1–35.
  • Yao (1988) Yao, Y.-C. (1988). Estimating the number of change-points via Schwarz' criterion. Statistics & Probability Letters, 6(3):181–189.
  • Yao and Au (1989) Yao, Y.-C. and Au, S. T. (1989). Least-squares estimation of a step function. Sankhya: The Indian Journal of Statistics, Series A, 51(3):370–381.

Appendix B Yao’s Results and Extension

The following two lemmas are due to Yao (1988).

Lemma B.1.

Suppose . Then for any as :

(4)
Lemma B.2.

Let be an upper bound on the number of changes, and let be the set of estimated changes generated (by Yao’s procedure). For every s.t. and ,

as , where:

and for .

Corollary B.2.1.

Lemma B.2 can be extended to , for any .

Proof of Corollary B.2.1: The argument for the location accuracy being in Yao (1988) comes from showing that the residual sum of squares for a segmentation that misses a change by more than this amount can be reduced by an amount that is greater than with probability tending to as increases, by adding three changes at the changepoint plus or minus . Thus such a segmentation cannot be optimal as the penalised cost for the latter segmentation will be less than the original one. We therefore need only show that this argument holds if we replace an accuracy of with for any .

To do this it suffices to show that a segmentation which misses a particular change by at least