Ensemble Binary Segmentation for irregularly spaced data with change-points

03/07/2020
by   Karolos K. Korkas, et al.

We propose a new technique for consistent estimation of the number and locations of the change-points in the structure of an irregularly spaced time series. The core of the segmentation procedure is the Ensemble Binary Segmentation method (EBS), a technique in which a large number of multiple change-point detection tasks using the Binary Segmentation (BS) method are applied to sub-samples of the data of differing lengths, and then the results are combined to create an overall answer. We do not restrict the total number of change-points a time series can have; therefore, our proposed method works well when the spacings between change-points are short. Our main change-point detection statistic is based on the time-varying Autoregressive Conditional Duration model, to which we apply a transformation process in order to decorrelate it. To examine the performance of EBS we provide a simulation study for various types of scenarios. A proof of consistency is also provided. Our methodology is implemented in the R package eNchange, available to download from CRAN.

1 Introduction

Irregularly spaced time series, i.e. data recorded as and when they emerge, have attracted significant attention due to the development and increasing automation of information technology. This type of data, also known as high-frequency data, is not confined to financial markets, where continuous electronic trading generates vast numbers of recorded trade transactions. Examples of irregularly spaced time series can be found in many industrial and scientific domains: in natural disasters, where floods, earthquakes or volcanic eruptions typically occur at irregular time intervals; in clinical trials, where a patient’s state of health is typically observed at different points in time; or in data-collecting sensors (smart applications of the Internet of Things), which are activated only when an event happens in order to, e.g., preserve battery energy.

Modelling high-frequency data successfully requires that the dynamics in the time series are captured efficiently, and standard time series methods are not appropriate for these data. To address this, a class of models, the autoregressive conditional duration models (henceforth, ACD), was originally introduced by Engle and Russell (1998). In particular, the authors model the conditional mean as an autoregressive process which has a multiplicative relationship with positive-valued error terms. These specifications resemble the multiplicative structure of the GARCH model used for modelling an asset’s return volatility but, thanks to the positive-valued error terms, ACDs are more suitable for modelling the dynamics of continuous positive-valued random variables, such as durations (the time it takes between two trades occurring in a trading book), trading volume (the size of orders), or market depth (the available liquidity at any point of time). ACD models have become popular with many variations, owing partially to the popularity of the GARCH model and its numerous extensions. Without being exhaustive, we refer the reader to the Weibull ACD, gamma ACD, and generalized gamma ACD; to models with more flexible dynamic specifications such as the logarithmic ACD model of Bauwens and Giot (2000) and the Box-Cox ACD model of Dufour and Engle (2000); and to non-parametric or semi-parametric versions found in Cosma and Galli (2006) or Saart et al. (2015).

An alternative to the ACD’s autoregressive specification is to parameterize the intensity function, which assumes a self-exciting process. This class of stochastic processes originated with Hawkes (1971) and, hence, these processes are referred to as Hawkes processes. They serve as epidemic models: the occurrence of a number of events (e.g. seismic or buy/sell) increases the probability of further events. Their use in modelling financial high-frequency data has grown, with many applications and variations.

In practice, however, time series exhibit changes in their dependence structure, and the above-mentioned models are designed for stationary time series. Adopting them for non-stationary processes without at least a minimal adaptation of the model assumptions is a crude approximation, and doing so carries considerable risk in prediction and forecasting, as noted by Mercurio and Spokoiny (2004) for financial data. High-frequency financial data are no less affected. A representative example is the U-shape typically observed in a stock’s intraday trading activity: we cannot expect a market’s behaviour to be like for like at every point of time. Information flows almost continuously and, therefore, the intra-day dynamics vary significantly and cannot be ignored.

The simplest type of deviation from stationarity is, arguably, piecewise stationarity. This implies that the parameters of a stochastic process remain constant (hence, the process is stationary) for a certain period of time. In this paper, we focus on this type of non-stationarity and aim to identify the stationary segments by detecting the number and locations of the change-points. If, of course, we knew the segments a priori, we would fit a stationary model to each of them and proceed with the prediction or forecasting tasks, but this knowledge is typically absent.

Detecting change-points has attracted significant attention. One approach to this problem is to formulate it as an optimization task, i.e. minimising a multivariate cost function (or criterion) and adding a penalty when the number of change-points is unknown (see Yao, 1988). Depending on the model, a user can adopt certain cost functions: least squares for a change in the mean of a series (Yao and Au, 1989; Lavielle and Teyssiere, 2007), the Minimum Description Length (MDL) criterion for non-stationary time series (Davis et al., 1995), and the Gaussian log-likelihood function for changes in the volatility (Lavielle and Teyssiere, 2007) or in the covariance structure of a multivariate time series (Lavielle and Teyssiere, 2006).

Change-point detection methods that adopt a multivariate cost function often come with a high computational cost. Dynamic programming (Bellman and Dreyfus, 1966; Kay, 1998), Segment Neighbourhood (Auger and Lawrence, 1989) or Optimal Partitioning (Jackson et al., 2005) are used to solve change-point problems, but their computational complexity is at least quadratic in the sample size.

An alternative approach to the change-point estimation problem is to minimize a series of univariate cost functions in a ‘greedy’ manner, i.e. detect a change-point and then progressively move on to identify more. A popular representative of this category is the Binary Segmentation method (BS), and the reasons for its popularity are its low computational complexity and ease of implementation: after identifying a single change-point (through the use of a certain statistic such as the CUSUM), the search for further change-points continues to the left and to the right of the initial change-point until no further changes are detected.

The BS algorithm has been adopted to solve various types of problems, with Inclan and Tiao (1994) being, perhaps, the first to use it to detect breaks in the variance of a sequence of independent observations. Kim et al. (2000) extend the Inclan and Tiao (1994) method to a GARCH(1,1) model and Lee and Park (2001) extend the same method to linear processes. Fryzlewicz and Subba Rao (2011) use the BS method to test for multiple change-points in an ARCH process. Cho and Fryzlewicz (2012) apply the binary segmentation method to the wavelet periodograms in order to identify change-points in the second-order structure of a nonstationary process. Using the wavelet periodogram, Killick et al. (2013) propose a likelihood ratio test under the null and alternative hypotheses, but assume that the number of change-points is bounded from above. The BS method is also used for multivariate (possibly high-dimensional) time series segmentation in Cho and Fryzlewicz (2015), for detecting change-points in the second-order structure, and in Cho and Korkas (2018), for detecting change-points in high-dimensional multivariate GARCH processes. For irregularly spaced time series, at least in the context of ACD or Hawkes processes, no existing literature proposes a BS method (or an alternative change-point detection method) to detect change-points in a piecewise stationary ACD or Hawkes process. We note that Roueff et al. (2016) introduce the time-varying Hawkes process, which is locally stationary but not necessarily piecewise stationary and, hence, they do not deal with change-point detection.

The BS algorithm, under specific model specifications, may be inefficient in detecting the change-points. Fryzlewicz (2014) illustrates this with a simple signal+noise model having only three change-points, close to each other and in the middle of the series. At the beginning of the BS algorithm the CUSUM statistic does not clearly point to a change-point and, hence, the algorithm does not move on to search for more. This behaviour should not come as a surprise: BS is a greedy algorithm searching for a single change-point at each iteration, and failing to find one results in an unnecessarily early stop. The Wild Binary Segmentation algorithm (WBS; Fryzlewicz, 2014), a state-of-the-art change-point method, aims to overcome this limitation using a ‘certain random localization mechanism’. We discuss WBS later on, but we note here that this randomized algorithm has been extended to univariate time series (Korkas and Fryzlewicz, 2017) and high-dimensional panel data (Wang and Samworth, 2018).

In this work, we capitalize on BS’s popularity and propose a new randomised version of it, which we term Ensemble Binary Segmentation (EBS). In particular, we draw a number of random segments from a given univariate time series and apply the BS algorithm to each of these segments. The estimated change-points are then collected across the BS applications. Owing to this simple mechanism, the ways to combine the estimated change-points are numerous; for instance, the final set of change-points comprises those that appear frequently, or that appear more frequently relative to other estimated change-points. An extra feature is that the estimated change-points can be ranked based on their relative frequency, which is useful when post-processing change-points.

The reader might question why WBS is not preferred for detecting change-points in a piecewise stationary ACD or Hawkes process and why a new method is proposed instead. First, EBS exhibits better practical performance in terms of both computational speed and accuracy. The way that WBS works does not allow an efficient post-processing of the estimated change-points, resulting in sometimes significant oversegmentation (in the case of a few change-points) or undersegmentation (in the case of frequent change-points). The source of the problem is the spurious detection of change-points, which is a consequence of the distributional features of the ACD multiplicative form. Further, EBS is fast in practice because it does not search for a single change-point in every iteration; rather, it can locate all the change-points with even a single random draw with a small, albeit non-zero, probability. Second, WBS is a special case of BS, while the method we propose here can be extended to include other segmentation algorithms. In other words, when drawing a number of random segments we do not have to apply BS to each of these; any other method can potentially be used. Therefore, studying EBS is a crucial starting point for other ensemble-type change-point detection techniques.

Aggregating the outputs of a method applied to many randomized versions of the data is not new. In fact, it has been studied extensively in the statistics and machine learning literature. The random forest is a popular representative that enjoys good predictive performance. Briefly, a random forest works as follows: grow multiple regression (decision) trees on randomized parts of the training dataset, and then average the trees. Our proposed method works in a similar manner and provides a fresh solution to the change-point problem.

The paper is structured as follows. In Section 2 we present the time-varying ACD model (tvACD), the core of our detection algorithm, and in Section 3 the time-varying Hawkes process, which serves as an alternative to the tvACD. The EBS algorithm is presented in Section 4, including its theoretical consistency in detecting the number and locations of change-points. The same section covers the post-processing of the estimated change-points and reports simulation studies used to obtain universal algorithm parameters that largely remove the need for tuning by the user. Further, in Section 5 we conduct a simulation study to examine the practical performance of our proposed algorithm vis-à-vis the standard BS algorithm. In Section 6 we apply our method to financial high-frequency transaction data. Proofs of the results are in the Appendix. Our methodology is implemented in the R package eNchange, available to download from CRAN.

2 Time-varying ACD model

Let be the transaction time where and be the duration between trades/events. We consider the following time-varying ACD model

(1)
(2)

where is the conditional mean duration of the

-th event with parameter vector

and is a general distribution over with mean equal to 1 and parameter vector . In this work we assume that , even though extensions to other cases (e.g. the Weibull distribution) are possible but technically challenging.
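As a point of reference, the following is a minimal LaTeX sketch of the standard first-order ACD form we have in mind, assuming unit-mean exponential innovations; the notation (x_i, psi_i, omega, alpha, beta) is illustrative and may differ from the paper's exact parametrisation.

% Assumed tvACD(1,1) form; theta(i) = (omega(i), alpha(i), beta(i))' is piecewise constant in i
\begin{align}
  x_i    &= \psi_i \, \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} \mathrm{Exp}(1), \\
  \psi_i &= \omega(i) + \alpha(i)\, x_{i-1} + \beta(i)\, \psi_{i-1}.
\end{align}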

We denote the vector of parameters involved in the conditional mean by

Then, is piecewise constant in with change-points () i.e. at any we have that for all .

The number of change-points and their locations are assumed to be unknown and we aim to estimate them. The parameter values in are also unknown, but these can be estimated after the change-points have been detected and stationary segments are identified.

We further assume the following conditions on (1) and (2)

  (A1) For some , and all , we have

  (A2) For some and all , we have

Assumptions (A1)-(A2) guarantee that between any two consecutive change-points, admits a well-defined solution a.s. and is weakly stationary (see e.g., Theorem 4.35 of Douc et al. (2014)).

For a non-stationary stochastic process , its strong-mixing rate is defined as a sequence of coefficients i.e.

Proposition 1.

In Fryzlewicz and Subba Rao (2011) the mixing rate of univariate, time-varying ARCH processes was investigated, and in Cho and Korkas (2018) the mixing rate of any pair of time-varying GARCH processes was established. A tvACD process is also strong mixing at a geometric rate, under the following Lipschitz-type condition on the density of .

  (A3) The distribution of satisfies the following: for any , there exists fixed independent of such that

Fryzlewicz and Subba Rao (2011) show that (A3) is satisfied when certain conditions on a density function hold, i.e. the first derivative is bounded, after some finite point ; the derivative declines monotonically to zero; and . It is straightforward to show that satisfies these conditions and, hence, (A3) holds when .

3 Alternative model: time-varying Hawkes process

A different approach to modeling the arrival times is through the Hawkes process: a self-exciting process where the arrival of an event (or events) causes the conditional intensity function to increase. It is defined as follows:

Definition (Hawkes process). Let be a counting process on with associated history . The point process is said to be a Hawkes process if the conditional intensity function takes the following form

(3)

where is the initial intensity at time and is a memory kernel (also termed excitation function).

Kernel modulates the change that event has on the intensity function at time . In his original work, Hawkes (1971) assumes an exponential kernel

(4)

The parameters control the increase in the arrival density after each arrival in the system, while parameters control the decay rate of this arrival. The unconditional expected value is

and therefore, in order for stationarity to hold the following condition must be satisfied

(5)
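For reference, the standard Hawkes specification with an exponential kernel can be sketched in LaTeX as follows; the notation is ours and is only illustrative of the forms referenced in (3)-(5).

% Conditional intensity, exponential kernel and stationarity condition (illustrative notation)
\begin{align}
  \lambda(t) &= \lambda_0 + \sum_{t_i < t} g(t - t_i), \\
  g(u)       &= \alpha\, e^{-\beta u}, \quad u > 0, \\
  \mathbb{E}\,\lambda(t) &= \frac{\lambda_0}{1 - \alpha/\beta},
  \qquad \text{so stationarity requires } \alpha/\beta < 1.
\end{align}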

The integrated intensity function is defined as

and from the time-change property the transformed durations

are exponentially distributed with unit rate (or, equivalently, they follow a standard Poisson process). Under the model (3) the integrated intensity can be written as

where

This property is due to the exponential kernel appearing repeatedly and, hence, it can be computed recursively, reducing the calculation of the integrated intensity (and of the likelihood function) from quadratic to linear time. Ogata (1988) calls the resulting sequence the residual process and it serves as a diagnostic test for the goodness-of-fit of a point process on .
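The recursion alluded to here can be sketched as follows for the exponential kernel, in our own notation: the sum of kernel contributions at the i-th event can be updated from the previous event, which is what reduces the cost of evaluating the intensity (and the likelihood) from quadratic to linear time.

% Recursive evaluation of the intensity under the exponential kernel (assumed notation)
\begin{align}
  \lambda(t_i) &= \lambda_0 + \alpha A(i), \qquad
  A(i) = \sum_{j \,:\, t_j < t_i} e^{-\beta (t_i - t_j)}, \\
  A(i) &= e^{-\beta (t_i - t_{i-1})} \bigl( 1 + A(i-1) \bigr), \qquad A(1) = 0.
\end{align}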

We now consider the following time-varying Hawkes process (which we term tvHawkes)

(6)

We denote the vector of parameters involved in the conditional intensity by

Then, is piecewise constant in with change-points such that () i.e. at any we have that for all . Similar to the tvACD case, we do not know the number of change-points in or their locations, which we aim to estimate.

To prove consistency of EBS in detecting the number of change-points and their locations when the underlying model is a tvHawkes process we would need assumptions around non-negativity of and stationarity as in (5) in between any two consecutive change-points. However, to the best of our knowledge, no mixing rates have been established for a time-varying Hawkes process and, therefore, it is not trivial to estimate an upper bound for the cumulative error term. In the next section, we only focus on tvACD while we use tvHawkes as a benchmark (and an alternative) to tvACD.

4 Two-stage change-point detection methodology

4.1 Stage A: Transformation of the point processes

In the first stage of our proposed methodology we form the process with any fixed . This function is required to be bounded and Lipschitz continuous.

  (A4) The function satisfies and is Lipschitz continuous, i.e., .

Empirical residuals are widely adopted for detecting changes in the parameters of a stochastic model and, in this work, they form the basic statistic of our detection algorithm in the context of point processes.

For the tvACD model, following Fryzlewicz and Subba Rao (2014) we select such that

(7)

where the last term is added to ensure the boundedness of . This transformation decorrelates the original tvACD process and lightens its tails. Therefore, it serves as our main change-point detection statistic, by observing that when any parameter undergoes a change at some point , so does . In this work we favour where as in (7) to reduce the rightward skew observed in the distribution of . Our consistency result is based on , but the main algorithm uses the log-transformation purely because it gives better results.

It is important to note that any ‘diagonal’ transformation of such as or should not be used, because the squared process will be highly autocorrelated. This will distort the change-point detection, most likely by producing a false picture of the locations of the change-points. For a discussion on transformations we refer to the original work of Fryzlewicz and Subba Rao (2014).
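Because the exact form of (7) is not reproduced above, the following base-R sketch only illustrates the idea under our own assumptions: divide each duration by a fitted conditional mean (an ACD-type empirical residual), add a small constant for boundedness and take logs to reduce the rightward skew. The ACD(1,1)-type recursion and the constant c0 are illustrative choices, not the paper's.

# Rough sketch of an empirical-residual transformation (assumptions as stated above)
transform_durations <- function(x, omega, alpha, beta, c0 = 0.01) {
  n <- length(x)
  psi_hat <- numeric(n)
  psi_hat[1] <- mean(x)                      # initialise the fitted conditional mean
  for (i in 2:n) {                           # assumed ACD(1,1)-type recursion
    psi_hat[i] <- omega + alpha * x[i - 1] + beta * psi_hat[i - 1]
  }
  log(x / psi_hat + c0)                      # decorrelated, lighter-tailed series
}

# Example: on a hand-simulated stationary ACD(1,1), the output is close to white noise
set.seed(1)
n <- 1000; omega <- 0.1; alpha <- 0.2; beta <- 0.7
x <- psi <- numeric(n); psi[1] <- omega / (1 - alpha - beta); x[1] <- psi[1] * rexp(1)
for (i in 2:n) {
  psi[i] <- omega + alpha * x[i - 1] + beta * psi[i - 1]
  x[i] <- psi[i] * rexp(1)
}
u <- transform_durations(x, omega, alpha, beta)
acf(u, plot = FALSE)$acf[2]                  # first-order autocorrelation, close to zero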

We prepare the ground for a consistency result for our proposed method. Let denote a stationary ACD process with parameters , and the innovations coinciding with over the associated segment . For each we also define and denoting the index of the change-point strictly to the left and nearest to by with which and are defined.

Proposition 2.

Suppose that the assumptions above hold, and let . Then, we have the following decomposition

(8)
  • are piecewise constant as where . All change-points in belong to .

  • satisfies

Proof of Proposition 2 can be found in the Appendix. Unlike , we have , which is exactly constant between any two adjacent change-points without any boundary effects. By its construction, does not satisfy , but, thanks to the mixing properties established in Proposition 1, its scaled partial sums can be appropriately bounded. In the next section, we introduce the multiple change-point detection algorithm.

4.2 Stage B: The Segmentation algorithm

The Binary Segmentation algorithm

We first present the Binary Segmentation algorithm within the framework of the ACD model. In the next section, we explain how this algorithm can be enhanced by ensembling the detected change-points from multiple applications of the BS algorithm on smaller (and random) segments.

We consider the CUSUM-type statistic which has been widely adopted for change-point detection in both univariate and multivariate data

(9)

A large value of typically indicates the presence of a change-point in the level of in the vicinity of . In particular, if where

(10)

and is a threshold (the choice of which is discussed in Section 4.2), then the location of the change-point is estimated as

The BS algorithm is formulated in Algorithm 1. It starts by initialising with , and , and proceeds by recursively applying the CUSUM statistic (9) on and . The BS algorithm stops in each current interval when no further change-points are detected, i.e. when the obtained CUSUM values fall below the threshold .

Input: , , , ,
Step 1: compute for ; Step 2: set
Step 3: if  then
       BinSeg(, , , , ) BinSeg(, , , , )
end if
Output:
Algorithm 1 BinSeg (Binary Segmentation algorithm)
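A minimal base-R sketch of Algorithm 1 applied to a transformed series u is given below. It assumes that the CUSUM in (9) is the usual scaled difference of segment means and that the argument threshold plays the role of the bound discussed above; the minimum segment length is an illustrative guard rather than a quantity taken from the paper.

# CUSUM statistic over u[s..e]: scaled difference of left and right partial means
cusum_stat <- function(u, s, e) {
  n <- e - s + 1
  if (n < 2) return(numeric(0))
  x <- u[s:e]
  b <- 1:(n - 1)                                   # candidate split points (relative)
  left  <- cumsum(x)[b]
  right <- sum(x) - left
  sqrt((n - b) / (n * b)) * left - sqrt(b / (n * (n - b))) * right
}

# Binary segmentation: recurse to the left and right of each detected change-point
bin_seg <- function(u, s, e, threshold, min_len = 20) {
  if (e - s + 1 < 2 * min_len) return(integer(0))
  cs <- abs(cusum_stat(u, s, e))
  b  <- which.max(cs)
  if (cs[b] <= threshold) return(integer(0))
  cp <- s + b - 1                                  # absolute index of the change-point
  c(bin_seg(u, s, cp, threshold, min_len), cp,
    bin_seg(u, cp + 1, e, threshold, min_len))
}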

The Ensemble Binary Segmentation algorithm

Our recommended enhancement of the BS algorithm, which we argue to be a significant improvement, is based on the fact that the BS method can fit the wrong model when multiple change-points are present, as it searches the whole series. The application of the CUSUM statistic can result in spurious change-point detection when, e.g., the true change-points occur close to each other. In particular, Olshen et al. (2004) note that the BS method can fail to detect a small change in the middle of a large segment, as illustrated in Fryzlewicz (2014).

To solve this, Fryzlewicz (2014) proposes a randomised version of the binary segmentation method (termed Wild Binary Segmentation, WBS) where the search for change-points proceeds by calculating the CUSUM statistic on smaller segments whose lengths are random. By doing so, WBS aims to draw, with probability tending to one with the sample size, favourable intervals containing at most a single change-point.

However, WBS is tailored around BS and the CUSUM statistic, and it is not always straightforward to apply it to other segmentation methods. In addition, in order to adapt the WBS technique to detecting change-points in, for instance, the second-order structure of a time series as in Korkas and Fryzlewicz (2017), we require a number of new solutions. These include introducing the smallest possible random interval length, and limiting the permitted “unbalancedness” of the CUSUM statistics due to the fact that many of the intervals considered are short, which typically causes spurious behaviour of the corresponding CUSUM statistics. Even with these new solutions in hand, a post-processing step is still needed in order to control the total number of detected change-points. However, we observe that the post-processing does not contain information about the importance of the estimated change-points. In simple words, a spurious change-point is likely deemed as important as a real change-point, which often results in real change-points being removed after post-processing is applied. We note that this phenomenon does not arise in the signal+iid Gaussian noise setting (Fryzlewicz, 2014), and it is entirely due to the distributional features of the multiplicative setting.

We propose a randomised algorithm but, instead of drawing random intervals, calculating the CUSUM statistic in each of these and proceeding in a “to-the-left-and-to-the-right” manner, we run multiple BS applications on random segments of the underlying univariate series. We then collect all the obtained change-points and calculate the frequency of their occurrences by simply counting the number of times a certain change-point appears over the draws of the BS algorithm. This results in a better performance compared with BS, as it is more likely to draw favourable intervals which can contain a single change-point (as in the WBS methodology) or more than one (as in the BS methodology). By doing so we aim to combine the benefits of both worlds. The action of ensembling the change-points (hence, borrowing from the machine learning literature, we term our methodology Ensemble BS or EBS) allows us to rank a change-point based on its importance.

In the last stage, we have the option to either (i) inspect the histogram of the estimated change-points; (ii) keep only the change-points that appear more than times; or (iii) apply (ii) and then post-process the detected change-points by removing change-points that are ranked lower and/or are ‘close’ to highly ranked change-points.

As an illustration, we simulated a tvACD model of the following form

(11)

with sample size and at .

We chose this setup because change-points in the middle of a long time series are challenging for BS to detect. The CUSUM statistic (in absolute value) will fail to exceed the threshold and the BS algorithm will stop; see Figure 1. On the contrary, when taking random intervals and then calculating the CUSUM statistic, it is more likely for it to exceed the threshold; see the top right plot in the same figure. What is important to observe is that only has to be as close to the first change-point as possible in order for the CUSUM statistic to surpass the threshold, while the end point can be much further to the right. This observation holds symmetrically, i.e. if is close to the last (fourth) change-point then can freely take any value in the interval . In both of these favourable scenarios the BS algorithm will proceed to identify the remaining change-points consistently.

It is interesting to note that the number of draws did not alter the change-point detection performance. From the bottom two plots in Figure 1 one can see that the shapes of the empirical distributions look identical, even though the number of draws in the second case was 10-fold that of the first.

Figure 1: A simulated tvACD process from the model (11) (top left). Various CUSUM statistics calculated on the transformed series where the start and end points differ. In blue and ; in red and ; in green and (top right). The empirical histogram of the estimated change-points from draws (bottom left). The empirical histogram of the estimated change-points from draws (bottom right). The four vertical bars in red are the real locations of the change-points.

We now describe the EBS algorithm in more detail. First, denote by for the set of all the change-points detected by the BS algorithm on a random interval . Let , the ensemble collection of all the detected change-points from the applications of BS, and its cardinality. Evaluate each change-point by counting the number of times (frequency) it appears in the ensemble , i.e.

(12)

where is the indicator function. In the machine learning literature, formula (12) is referred to as majority voting. We can also obtain the relative frequency of a change-point using , to create a relative importance plot (histogram).

Obviously, a change-point that ranks high should be preferred over other change-points. On the other hand, to decide how many change-points are finally selected, one way is to select those change-points from the ensemble such that

(13)

We emphasize that other types of thresholding can be utilized, for instance, a change-point is selected when its rank is above the average rank, i.e.

Input: , , ,
Step 1: for  do
       BinSeg(, , )
end for
Step 2: for  do
       if  then
            
       else
            
       end if
      
end for
Step 3: for  do
       if  then
            
      
end for
Step 4 (Optional for post-processing): Rank by for Output:
Algorithm 2 EnBinSeg (Ensemble Binary Segmentation algorithm)
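A sketch of the ensemble step of Algorithm 2, reusing the bin_seg() sketch above: draw M random sub-intervals, run binary segmentation on each, pool the detections and keep those whose relative frequency exceeds the threshold in (13). The interval-drawing scheme and the numerical defaults are illustrative assumptions, not the paper's.

# Ensemble Binary Segmentation sketch: majority voting over M random intervals
en_bin_seg <- function(u, threshold, M = 100, xi = 0.4, min_len = 20) {
  n <- length(u)
  pool <- integer(0)
  for (m in 1:M) {
    se <- sort(sample.int(n, 2))                   # random start and end points
    if (se[2] - se[1] + 1 < 2 * min_len) next
    pool <- c(pool, bin_seg(u, se[1], se[2], threshold, min_len))
  }
  if (length(pool) == 0) return(list(cp = integer(0), freq = numeric(0)))
  freq <- table(pool) / M                          # relative frequency of each candidate
  keep <- freq > xi                                # thresholding rule, cf. (13)
  list(cp = as.integer(names(freq))[keep], freq = as.numeric(freq)[keep])
}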

The relative importance plot also provides a type of ‘scree plot’, whereby an elbow (kink) indicates the right number of change-points to select. This type of scree plot is common in, for example, Principal Component Analysis, for the selection of the number of principal components.

It is not hard to see that EBS and WBS are similar in the sense that, for a sufficiently large , EBS will select a significant number of favourable draws each containing a single change-point; applying the BS method on each of these intervals is then equivalent to applying the CUSUM statistic once. In practice, however, we noticed that the two methods do not behave identically, and the added feature of EBS of ranking the estimated change-points resulted in a significant improvement in performance.

Theoretical properties of EBS

In this section we present the consistency theorem for the EBS algorithm for the total number and locations of the change-points with and . To achieve this, we impose the following conditions in addition to the (A) assumptions above, mainly to control the detectability of each change-point.

  (B1) The number of change-points in (1)-(2) is unknown and allowed to increase with ; only the minimum distance between the change-points restricts the maximum number of .

  (B2) There exists a fixed constant such that .

  (B3) The distance between any two adjacent change-points satisfies , where for a large enough .

  (B4) The magnitudes of the change-points satisfy where .

Theorem 1 Suppose that Assumptions (A1)-(A5) and (B1)-(B4) hold. With the number of change-points as and the locations of those change-points as , let and be the number and locations of the change-points (in ascending order) estimated by the Ensemble Binary Segmentation algorithm. There exist constants and such that if , then , where

for certain and for certain and . The guaranteed speed of convergence of to is no faster than where is the number of random draws.

The rate of convergence for the estimated change-points obtained for the BS method by Fryzlewicz and Subba Rao (2014) is where when is . In the EBS setting, the rate is square logarithmic when is of order , which represents an improvement.

We now elaborate on the minimum number of random draws required to ensure that the bound on the speed of convergence of to 1 in Theorem 1 is suitably small. Suppose that we wish to ensure that

This is equivalent to

by noting that around .

Let us consider the “easiest” case, i.e. . This results in a logarithmic number of draws, which leads to particularly low computational complexity, and it also has the same complexity as the WBS case. When , the required number of draws increases almost linearly, but the computational complexity remains lower than that of WBS, which also explains why EBS is generally faster than WBS.

Finally, we discuss which appears in (13), and it acts as a decision rule when aggregating estimated change-points across applications of the BS algorithm. Theoretically, tends to when and for a sufficiently chosen . In practice, EBS tended to return spurious change-points due to the distributional features of the multiplicative setting which require us to control the partial sums of in (ii) of Proposition 2. Our recommendations for the choice of along with the choice of are discussed in Section 4.2.

Post-processing

In practice, the real number of change-points is not known to us and, to reduce the risk of over-segmentation, we propose a post-processing method. The need to post-process the estimated change-points from a detection routine is common within the context of multiplicative models. We refer the reader to Inclan and Tiao (1994), Cho and Fryzlewicz (2012) and Korkas and Fryzlewicz (2017). In these works, the post-processing method compares every change-point detected by the main detection method against the adjacent ones, (re)using the CUSUM statistic. Even though the post-processing implementations vary between them, their common drawback is the lack of information about the importance of the detected change-points.

Our proposal for filtering the estimated change-points is simple: starting with the highest-ranked change-point based on its , we remove from the set any change-point that lies within a distance of it; if a change-point is not within this distance, we keep it. The process is repeated until all retained change-points are separated by at least .
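A small sketch of this pruning rule, taking as input the candidate change-points and their relative frequencies from the ensemble step sketched earlier; the minimum distance is a user choice and is not fixed by the text above.

# Rank-based pruning: keep the highest-ranked change-point in every neighbourhood
prune_changepoints <- function(cp, freq, min_dist) {
  ord  <- order(freq, decreasing = TRUE)           # highest relative frequency first
  kept <- integer(0)
  for (candidate in cp[ord]) {
    if (all(abs(candidate - kept) >= min_dist)) {
      kept <- c(kept, candidate)                   # accept: far from all accepted ones
    }
  }
  sort(kept)
}
# e.g. prune_changepoints(res$cp, res$freq, min_dist = 50) for res returned by en_bin_seg()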

Choice of parameters for transformation

The right choice of the transformation function will influence the empirical performance of our methodology and its power in particular. This choice boils down to determining the coefficients .

The BASTA–res algorithm of Fryzlewicz and Subba Rao (2014) performs change-point detection in the univariate ARCH process (which, as already seen, is similar to the ACD process) by analysing a transformation of the input time series obtained similarly to . They recommend the use of ‘dampened’ versions of the GARCH parameter estimates, which we also adopt here for the ACD process. This leads to the choice of , and , with within-series dampening parameters .

Empirically, the motivation behind the introduction of is as follows. For with time-varying parameters, we often observe that and are over-estimated, in the sense that is close to one, especially when dealing with real data. There is evidence in the literature that change-points may inflate persistence estimates in volatility models (e.g. Francq et al. (2001)). Mikosch and Stărică (2004), among others, show that an estimated persistence close to unity in GARCH models is likely spurious and confounded by neglected change-points. We observed the same phenomenon when trying to fit an ACD process to simulated and real data. Using the raw estimates in place of the ’s in (7), therefore, is not the best approach.

Cho and Korkas (2018) propose to choose the dampening factor as

By construction, is bounded as and approximately brings and to the same scale. The selection of the order can be done by means of an information criterion; however, taking into consideration the possibility that the estimated persistence will be close to unity, we conducted a simulation experiment to establish the right choice of order.

First, we considered a tvACD process with sample size and four change-points introduced to the tvACD process parameters in (1)-(2) at . The parameters change at as follows:

We varied in the above from 0.05 to 0.85, in practice giving a tvACD process with minor to strong persistence in the second segment. For the dampening factor we chose to vary it from 1 to 10 (instead of using the formula of Cho and Korkas (2018)) to understand its impact on change-point detection performance. For every pair we simulated a tvACD process and performed two types of transformation. In the first case, the unobservable in the transformation was replaced with the empirical estimates

(14)

obtained with the MLE estimates of the ACD parameters. In the second case, we performed the same transformation, but assuming in (14). In both cases, we then applied the EBS methodology using the parameters , , (the choice of these default values is discussed in the next section).

The results indicate that the transformation involving the empirical estimates from an ACD(0,1) model had superior performance vis-à-vis the ACD(1,1). The dampening factor did not have as strong an impact and, generally, the performance remained unchanged for increasing . In a separate experiment, not shown here, the dynamic selection of worked better and is, hence, recommended.

Choice of threshold and parameters

In this section we present choices of the parameters involved in the theorems. In particular, the threshold , in both theorems, includes the constant . To approximate the distribution of the CUSUM statistic in the absence of change-points, one approach is to use a parametric resampling procedure such as that described in Cho and Korkas (2018), which provides good results. A similar approach has been widely adopted in the change-point literature, including Kokoszka and Teyssière (2002), who test for the presence of a change-point in the parameters of univariate GARCH models.

However, this resampling procedure adds computational time and it does not fully take advantage of the dampening transformation, which aims to bring a series closer to an iid exponential distribution. For that reason we conduct experiments to establish a universal value of the threshold parameter under the null hypothesis of no change-points.

In particular, we generate stationary ACD processes of size , varying from to with a step of 50. The exact choice of the ACD model parameters (i.e. ) did not alter the results and is, therefore, not reported here. Then we find that maximises (9). The ratio

gives us an insight into the magnitude of the parameter . We repeated this experiment 100 times for different values of and we selected the %-percentile for each instance of . Our results indicate that tends to decrease as we increase the sample size and remains unchanged after a certain point (see Figure 2).
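The following base-R snippet sketches this calibration under the simplifying assumption that, after the transformation, a null series behaves approximately like iid exponential noise (it reuses the cusum_stat() sketch from above); the grid of sample sizes is illustrative and the snippet reports only the raw upper percentiles of the null maxima, not the final threshold constant.

# Null calibration sketch: maximum |CUSUM| on series with no change-points
set.seed(2)
sizes <- seq(500, 3000, by = 500)                  # illustrative grid of sample sizes
pct95 <- sapply(sizes, function(n) {
  maxima <- replicate(100, {
    u <- log(rexp(n))                              # proxy for the transformed series
    max(abs(cusum_stat(u, 1, n)))
  })
  quantile(maxima, 0.95)                           # 95%-percentile of the null maxima
})
round(cbind(size = sizes, q95 = pct95), 3)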

To propose a general rule that will apply in most cases we fitted the regression

Having estimated the values for we were able to use fitted values for any sample size . For samples larger than , we used the same values as for .

Figure 2: Estimated ratio over 100 simulations plotted against sample size. Left (right) is the 95% (99%) percentile over 100 simulations for each size.

We turn to the choice of , the number of ensembles to run, and of , the relative frequency threshold. In Section 4.2 we discussed that the minimum number of is in the range of a few thousand when the distance between any two adjacent change-points is of . However, in practice, and thanks to the randomised ensemble mechanism of EBS, can be much smaller. As for , the theoretical choice of 0 is adequate, albeit, owing to the distributional features of the multiplicative setting we consider here, a non-zero positive value can work better in practice. Similar to the procedure for the choice of above, we simulate stationary ACD processes and apply the EBS algorithm for every triplet , keeping everything else the same. We vary from to with a step of 50; from to with a step of 0.01; and from to with a step of . Finally, we repeat the experiment but, instead of a stationary process, we simulate tvACD processes to establish the right choice of in the presence of multiple change-points. In particular, five change-points are introduced to the tvACD process parameters in (1)-(2) at . The parameters change at as follows:

To assess the detection accuracy of our proposed methodology we calculate the ratio of the number of change-points detected within 1% of the real change-points over the number of real change-points. For the model with no change-points, we define detection accuracy as being one when no change-points were detected by EBS; otherwise, we calculate it as , where is the estimated number of change-points returned by EBS.

By inspecting Figure 3, we can see that the choice of results in a significant improvement in the detection accuracy compared with lower values. For larger values, we do not see any further improvement, even when the sample size increases.

The most important output of this exercise is that, conditioning on , a high number of draws did not result in a considerable improvement in accuracy. On the contrary, a low number in the range of hundreds gave results similar to , even for larger samples.

Figure 3: Heatmaps of the detected numbers of change-points (measured by the ratio ) in EBS, depending on and , for different sample sizes. A ratio value close to 1 indicates that the exact number of change-points was detected, and within distance of the real ones.

For the simulation study and the real application we set and which are also the default values in the R package.

5 Simulation study

5.1 Models with no change-points

We simulated stationary time series with innovations for ACD and sample sizes for different specifications. For the Hawkes process we chose the time horizon , whereby the sample size will depend on the choice of the parameters. Roughly speaking, for , and the sample size . We report the number of occasions (out of ) the methods incorrectly rejected the null hypothesis of no change-points for each case.

Model BS EBS
S1: iid standard Poisson with () and 4 2
S2: Hawkes process with parameter , and () 0 1
S3: Hawkes process with parameter , and () 0 0
S4: ACD process with parameter , and and 0 0
S5: ACD process with parameter , and and 0 0
Table 1: Stationary processes results. For each of the Hawkes processes the average sample size is given in brackets. Figures show the number of occasions the methods incorrectly detected change-points.

From Table 1, we can see that both BS and EBS performed well, meaning that the risk of segmenting a stationary process is limited. It should be mentioned, though, that when the effective size of a stationary Hawkes process is small and the ratio approaches 1, both EBS and BS tend to incorrectly reject the null hypothesis of no change-points more often. This was due to the Hawkes estimation optimization routine itself, which became erratic and often did not converge. To avoid this, we had to re-run the optimizer multiple times, using a different set of initial parameters each time, to ensure that the optimizer had converged.

5.2 Models with change-points

We now examine the detection performance of our method for a set of non-stationary models, both ACD and Hawkes processes. We consider various test models and examine the BS and EBS performance over 100 simulations for each of the test models. To compare the performance of the two detection methods we calculate the following error measures: , , . In addition, the EBS method has improved rates of convergence compared with BS; hence, we also examine the total number of change-points identified within of the real ones in order to assess how close the estimated change-points are to the real ones.

Since the purpose of EBS is to also assist in the post-processing stage, its accuracy should be judged in parallel with the total number of change-points identified. We use a test from Korkas and Fryzlewicz (2017) that tries to accomplish this. Assuming that the maximum distance from a real change-point is denoted by , an estimated change-point is correctly identified if (here within of the sample size). In the case where two (or more) estimated change-points are within this distance, only the change-point closest to the real change-point is classified as correct and the rest are deemed false, unless any of these are close to another real change-point. An estimator performs well when the hit ratio

is close to 1. According to the authors, the term penalises cases where, for example, the estimator correctly identifies a certain number of change-points all within the distance , but . It also penalises the estimator when it overestimates the total number of change-points, even if all detected change-points are within distance of the true ones.
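A base-R sketch of this scoring rule as we read it; the one-to-one matching and the use of the larger of the two counts in the denominator are our own interpretation of the penalisation described above.

# Hit-ratio sketch: match each true change-point to at most one estimate within distance h
hit_ratio <- function(true_cp, est_cp, h) {
  correct <- 0L
  used <- rep(FALSE, length(est_cp))
  for (cp in true_cp) {
    d <- abs(est_cp - cp)
    d[used] <- Inf                                 # each estimate can be matched only once
    j <- which.min(d)
    if (length(j) > 0 && d[j] <= h) {
      correct <- correct + 1L
      used[j] <- TRUE
    }
  }
  correct / max(length(true_cp), length(est_cp))   # penalise over- and under-segmentation
}
# e.g. hit_ratio(true_cp = c(250, 500, 750), est_cp = c(248, 502, 640, 755), h = 10)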

  • tvHawkes processes with a single change-point A change-point is introduced to the Hawkes process parameters in (6) at where . Parameters and change at as , and . We choose , , , and .

    Considering this model allows us to assess the detection performance of our proposed methodology when a change-point occurs at the end of a time series, which is of interest when data are updated continuously (on-line).

    Both methods performed well by identifying the single change-point within a small distance from the real change-point, and it is worth mentioning that their performance was almost identical. This is an indicative example where selecting EBS over BS is safe even though BS is also expected to be optimal in detecting a single change-point.

  • tvHawkes processes with two change-points This is a similar model to M1 with the difference that, in addition to the first change-point that occurs at , a second change-point occurs at . Further, only the parameters change. Hence, we choose , , , and .

    The motivation for examining this non-stationary model derives from the seasonality observed in irregularly spaced time series. A typical example is the intraday trading activity of a stock, where the majority of trades occur either in the open or the close sessions.

    In this scenario, EBS performed better, mainly owing to its higher accuracy in detecting the two change-points. In particular, EBS achieved a hit ratio of 0.93, resulting from identifying the two change-points within 1% in 90% of the cases. This is a good example where EBS is able to improve accuracy without the risk of oversegmenting a series.

  • tvHawkes processes with three change-points only in the parameter Three change-points are introduced to the Hawkes process parameters in (6) at , and , respectively where . The parameters and remain constant and equal to 2 and 0.7 respectively, while change at as . We set and .

    In this scenario, we aim to challenge the detection methodologies by considering changes only in one parameter. Both BS and EBS performed well (EBS did marginally better) detecting the three change-points accurately in many instances.

  • tvHawkes processes with four change-points only in the parameter Four change-points are introduced to the Hawkes process parameters in (6) at , , and , respectively where . The parameters and remain constant and equal to 0.2 and 0.5 respectively, while change at as