Scalable Multiple Changepoint Detection for Functional Data Sequences

Trevor Harris et al. · 08/05/2020

We propose the Multiple Changepoint Isolation (MCI) method for detecting multiple changes in the mean and covariance of a functional process. We first introduce a pair of projections to represent the high and low frequency features of the data. We then apply total variation denoising and introduce a new regionalization procedure to split the projections into multiple regions. Denoising and regionalizing act to isolate each changepoint into its own region, so that the classical univariate CUSUM statistic can be applied region-wise to find all changepoints. Simulations show that our method accurately detects the number and locations of changepoints under many different scenarios. These include light and heavy tailed data, data with symmetric and skewed distributions, sparsely and densely sampled changepoints, and both mean and covariance changes. We show that our method outperforms a recent multiple functional changepoint detector and several univariate changepoint detectors applied to our proposed projections. We also show that the MCI is more robust than existing approaches, and scales linearly with sample size. Finally, we demonstrate our method on a large time series of water vapor mixing ratio profiles from atmospheric emitted radiance interferometer measurements.


1 Introduction

The statistical analysis of functional time series has become increasingly important to many scientific fields including Climatology (Shang and Hyndman, 2011), Finance (Kokoszka and Zhang, 2012), Geophysics (Hörmann and Kokoszka, 2012), Demography (Hyndman and Booth, 2008), and Manufacturing (Woodall, 2007). A functional time series is a sequence of infinite dimensional objects, such as curves and surfaces, observed over time. A functional time series is analogous to finite dimensional univariate or multivariate time series, except that entire functions are observed at each time point. Just as in univariate and multivariate time series data, a functional time series may also experience abrupt changes in its generating process. These abrupt changes, or changepoints, can complicate statistical analysis by invalidating stationarity assumptions. They can also be interesting in their own right because they reveal unexpected heterogeneous patterns. Identifying changepoints in functional time series has, therefore, become an important issue and has started to gain increasing interest due to the surge in available functional data.

Within the functional data analysis (FDA) literature, changepoint detection has largely focused on the At Most One Change (AMOC) problem. In Berkes et al. (2009) a Cumulative Sum (CUSUM) test was proposed for independent functional data, which was further studied in Aue et al. (2009), where its asymptotic properties were developed. This test was then extended to weakly dependent functional data by Hörmann et al. (2010) and epidemic changes by Aston and Kirch (2012). Zhang et al. (2011) introduced a test for changes in the mean of weakly dependent functional data using self-normalization to alleviate the use of asymptotic control. Later, Sharipov et al. (2016) similarly developed a sequential block bootstrap procedure for these methods. More recently, Gromenko et al. (2017) considered changes in spatially correlated functional data, and Aue et al. (2018) proposed a fully functional method for finding a change in the mean without losing information due to dimension reduction.

Detecting multiple changepoints in a functional time series has received relatively scant attention compared to the AMOC problem. Recently, Li and Ghosal (2018) proposed a Bayesian method for directly identifying multiple changepoints in the mean, by transforming the functional data into wavelets and identifying changes in the wavelet coefficient processes. In Chiou et al. (2019), a dynamic segmentation method for finding multiple changepoints in the mean was proposed, which used dynamic programming and backward elimination to find an optimal set of changepoints. Alternatively, multiple changepoints have also been identified by adapting AMOC methods with a binary segmentation algorithm to recursively partition the functional time series (Berkes et al., 2009; Aue et al., 2018). The consistency of binary segmentation approaches was recently shown in Rice and Zhang (2019).

Despite these advances, the computational complexity of these multiple changepoint detection methods has been an obstacle to their wide application to large functional data. Bayesian methods that rely on Markov Chain Monte Carlo sampling are intrinsically burdensome because they typically require an enormous number of samples to reach convergence. Ordinary dynamic programming and binary segmentation algorithms scale quadratically and log-linearly, respectively, with the data’s sample size. As larger and larger functional time series data sets are curated, methods that scale linearly with sample size are called for to meet the computational demand. In the univariate and multivariate changepoint detection literature, optimal linear-time methods for multiple changepoint detection have already emerged. These include the Pruned Exact Linear Time (PELT) algorithm (Killick et al., 2012), the Functional Pruning Optimal Partitioning (FPOP) algorithm (Maidstone et al., 2017), and the robust Functional Pruning Optimal Partitioning (r-FPOP) algorithm (Fearnhead and Rigaill, 2019).

Another limitation of the existing functional changepoint detection methods is that most can only detect changes in the functional process’s mean. While mean changes are the most conspicuous, covariance changes are equally important and may also occur. The need for detecting covariance changes has already been noticed and tackled in the univariate time series literature, where methods targeting both mean and variance changes, such as PELT and FPOP, have been developed. However, methods that target both mean and covariance changes in a functional process are still not available. This motivates us to develop a functional changepoint approach that can take the covariance into account. Lastly, most previous methods were developed under the assumption that the data follows a Gaussian process. Their performance on non-Gaussian, skewed, or heavy tailed data, which is often encountered in practice, is not well studied and could be suboptimal.

For example, remote sounding instruments such as Atmospheric Emitted Radiance Interferometers (AERI) are currently providing a wealth of functional time series data (see Figure 1), while also posing many challenges to existing changepoint detectors (Kulla and Ritter, 2019; Sakai et al., 2019). For instance, the AERI data in Figure 1 exhibit heavy tails and positive kurtosis due to being a ratio of densities. AERIs also monitor the atmosphere continuously, which leads to a long time series of functional data, often on the order of hundreds of thousands to millions of functions. Finding changepoints in the AERI data is a daunting task for existing detectors due to their non-Gaussianity and immense sample sizes. However, identifying changepoints is crucial for detecting atmospheric events such as air parcel mixing, rapid fluxes in aerosols, and evaporation or precipitation events. Therefore, it is necessary to develop changepoint detection methods that are robust to violations of Normality and that scale to millions of observations.

Figure 1: AERI water vapor mixing ratio profiles from August 9th to September 6th, 2008. The x-axis is time, the y-axis is altitude, and the color values indicate the density of water vapor at each altitude and time. Each vertical bar is an entire function.

We propose the Multiple Changepoint Isolation (MCI) method to detect multiple changepoints in a functional time series’s mean and covariance function. Our method is based on a combination of two univariate projections of the functional data representing the high and low frequency features of the data, an adaptive linear time regionalization procedure to isolate changepoints into regions, and the univariate CUSUM statistic for AMOC detection within each region. Our method remedies three outstanding issues in the functional changepoint detection literature by (1) having linear time computational complexity, (2) taking broader types of changepoints into account, and (3) being robust to heavier-tailed or asymmetric distributions than the Gaussian. We show in Section 4 that our method is consistently accurate and powerful under an array of settings, and that our new detection procedure is better adapted to our projections than existing univariate changepoint detectors.

2 Model for Functional Data with Changepoints

Let $Y_1, \dots, Y_n$ be a sequence of continuous functions in $L^2[0, 1]$, the Hilbert space of square integrable functions $f$ with domain $[0, 1]$ satisfying $\int_0^1 f(t)^2 \, dt < \infty$. The interval $[0, 1]$ is assumed to be the domain without loss of generality, and we further assume that each function is observed on the same finite grid of points $0 \le t_1 < \dots < t_p \le 1$.

Suppose that the observations follow the process

$$Y_i \sim FP(\mu_i, C_i), \quad i = 1, \dots, n, \quad (1)$$

where $FP(\mu_i, C_i)$ refers to a functional process with mean function $\mu_i$ and covariance function $C_i$. We assume that the process is piecewise weakly stationary, that is, $\mu_i = \mu_{i+1}$ and $C_i = C_{i+1}$ for all $i$ except when $i$ is a changepoint. To allow for changepoints, we assume there are $K_\mu$ and $K_C$ time points where $\mu_i \neq \mu_{i+1}$ and $C_i \neq C_{i+1}$, respectively. $K_\mu$ and $K_C$ are both assumed to be unknown non-negative integers.

Let $\Theta_\mu$ with $|\Theta_\mu| = K_\mu$ and $\Theta_C$ with $|\Theta_C| = K_C$ denote the times where $\mu_i$ and $C_i$ “break” or “change”, i.e., where $\mu_\theta \neq \mu_{\theta+1}$ and $C_\theta \neq C_{\theta+1}$. For simplicity and ease of notation, we union the two sets of changepoints into a combined sequence $\Theta = \Theta_\mu \cup \Theta_C = \{\theta_1, \dots, \theta_K\}$, where $K \leq K_\mu + K_C$. $K$ is potentially smaller than $K_\mu + K_C$ because there can be overlap between $\Theta_\mu$ and $\Theta_C$. The goal of our method is to estimate all changepoint locations $\Theta$ from the sequence $Y_1, \dots, Y_n$.

3 Multiple Changepoint Isolation

Given a functional sequence $Y_1, \dots, Y_n$, we propose the Multiple Changepoint Isolation (MCI) method to identify all changepoints in model (1). Our procedure consists of four major steps.

  1. Project the functional data onto the real line to reduce the functional data into finite dimensions. Two projections are proposed to obtain two separate univariate time series (Section 3.1).

  2. Use total variation denoising (TVD) (Rudin et al., 1992) to identify a set of potential changepoint locations in each projection (Section 3.2).

  3. Split each projection into multiple regions based on the locations found in step two and a new “changeset regionalization” procedure. This procedure isolates each changepoint to its own region, which reduces the multiple changepoint problem into a sequence of single changepoint problems, one for each region (Section 3.3).

  4. Apply the univariate CUSUM statistic (Page, 1954) to each region in each projection to find all changepoints in the functional sequences. We account for multiple testing with the Benjamini-Hochberg false discovery rate adjustment (Benjamini and Hochberg, 1995) (Section 3.4).

3.1 Univariate projections

The first step is to project the functional data onto the real line. Projection is a classic technique of functional data analysis because it reduces the infinite dimensions of functional data to a finite dimensional space. In doing so, projections can often act as feature extractors that reveal major patterns and trends in the data.

The predominant technique for projecting functional data is the Karhunen-Loève (KL) expansion (Ramsay, 2004). The KL expansion states that any $Y_i \in L^2[0, 1]$ can be represented by a linear combination of orthogonal basis functions:

$$Y_i(t) = \mu(t) + \sum_{k=1}^{\infty} \xi_{ik} \phi_k(t), \quad (2)$$

where $\xi_{ik} = \langle Y_i - \mu, \phi_k \rangle$, $\mu$ is the mean function of $Y_i$, and $\phi_1, \phi_2, \dots$ are the orthogonal eigenfunctions of the covariance operator. In practice, this representation is truncated after $d$ basis functions to obtain a finite approximation. Because all functions share the same orthogonal basis, we can represent each function with just the set of coefficients $\xi_i = (\xi_{i1}, \dots, \xi_{id})$, also known as the functional principal components (FPC). This could allow us to reduce the problem of finding changepoints in the functional sequence to finding changepoints in the multivariate sequence $\xi_1, \dots, \xi_n$. Some variations on FPC can be found in Tucker et al. (2013) and Lee and Jung (2016), for when the data are not aligned.

The distinguishing feature of FPC analysis is that the eigenfunctions are sorted in order of descending eigenvalues. Consequently, $\phi_1$ represents the primary direction of variation, $\phi_2$ represents the next primary direction of variation orthogonal to $\phi_1$, and so on. Generally, only a small number ($d$) of eigenfunctions are required to represent the majority of the variation in the data. However, selecting $d$ can require a potentially expensive cross-validation step. For changepoint detection, we argue that only the first FPC will be necessary and that the information in the trailing FPCs can be expressed through a single alternative projection.

The first FPC represents the greatest variation in a functional time series, so all trailing FPCs beyond the first only capture smaller scale variation. The trailing FPCs will, therefore, naturally have a lower signal-to-noise ratio for identifying changepoints, which leads to two adverse effects: increased estimation error and reduced test power. Estimation error increases because changepoints will often manifest in multiple FPCs simultaneously, meaning changepoints may get repeatedly estimated on progressively noisier data. This leads to many spurious changepoint estimates near the true ones. Test power decreases because more FPCs mean more tests will be performed, so the necessary false discovery rate correction will have to control for a higher number of tests. This directly leads to a power loss for each individual test.

For these reasons, we choose to employ only the first FPC in the projection. However, this does not mean we simply ignore the information in the trailing FPCs. Instead, we employ an alternative projection that can more powerfully capture the smaller scale variation. To do this, we propose using the arc length (Rudin et al., 1964) of each function as a projection. The arc length of a differentiable function $f$ is defined as

$$AL(f) = \int_0^1 \sqrt{1 + (Df(t))^2} \, dt, \quad (3)$$

where $D$ is the differential operator. On a real valued differentiable function, the arc length is equivalent to the Total Variation norm of the function, which measures the entire spectrum of variability in the function. The arc length can be highly sensitive to changes in the covariance, or “high frequency” features, of the data because the differencing operator tends to exaggerate high frequency features. The arc length can, therefore, act as a compressed view of the higher order KL coefficients. In contrast, the first FPC is sensitive to low frequency features because it only captures the dominant mode of variation in the data.

The arc length, as a total variation measure, has frequently appeared in the signal processing and numerical methods literature as part of the total variation denoising (Rudin et al., 1992) and total variation diminishing procedures (Versteeg and Malalasekera, 2007). To our knowledge, our work is the first use of arc length in an FDA context for measuring the high frequency variability in curves.

We summarize our dimension reduction scheme as follows. First, project each $Y_i$ onto its first KL coefficient, which we denote by the sequence $\xi_{11}, \dots, \xi_{n1}$. Next, compute the arc length of each $Y_i$ and denote this sequence as $A_1, \dots, A_n$. The two univariate sequences respectively represent the low and high frequency features of the functional sequence. These projections reduce the functional changepoint problem to finding and synthesizing changepoints in two univariate sequences, as sketched below.
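To make the two projections concrete, the following Python sketch computes both for curves observed on a shared grid. The SVD-based FPC scores are an illustrative stand-in for the FACE algorithm used later in the paper, and the function names and toy data are our own.

```python
import numpy as np

def project_fpc1(Y):
    """First FPC scores of discretized curves (rows of Y).

    SVD-based sketch; the paper uses the FACE algorithm instead.
    """
    Yc = Y - Y.mean(axis=0)                  # center at the sample mean function
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Yc @ Vt[0]                        # scores on the leading eigenfunction

def project_arclength(Y, t):
    """Arc length of each curve (Eqn. 3), via summed segment lengths."""
    dt = np.diff(t)                          # grid spacing
    df = np.diff(Y, axis=1)                  # curve increments on the grid
    return np.sqrt(dt**2 + df**2).sum(axis=1)

# toy usage on simulated curves
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 40)
Y = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal((200, 40))
low_freq, high_freq = project_fpc1(Y), project_arclength(Y, t)
```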

3.2 Total variation based segmentation

The second step is to primitively segment the projected time series into regions of at most one change (AMOC) using total variation denoising (TVD). TVD has been frequently used for detecting changepoints in sequences of univariate random variables (Rojas and Wahlberg, 2014; Chan et al., 2014; Hyun et al., 2016; Lin et al., 2016). This is because TVD estimates a piecewise constant fit with the estimated jumps often occurring at or near the true changepoints.

Let $Z = (Z_1, \dots, Z_n)$ generically denote a univariate projection of the functional sequence $Y_1, \dots, Y_n$; i.e., $Z$ could be the FPC sequence or the arc length sequence. We assume the following signal-plus-noise model for $Z$:

$$Z = \beta + \epsilon, \quad (4)$$

where $Z$ is an $n \times 1$ vector with entries $Z_i$, $\beta$ is a piecewise constant vector with entries $\beta_i$, and the error term $\epsilon$ has a finite second moment. Because $\beta$ is piecewise constant, the changepoints in $Z$ are reflected in the jumps in $\beta$. We use TVD to estimate $\beta$ as follows:

$$\hat{\beta} = \underset{\beta}{\arg\min} \; \frac{1}{2} \| Z - \beta \|_2^2 + \lambda \| \beta \|_{TV}, \quad (5)$$
$$\| \beta \|_{TV} = \sum_{i=2}^{n} | \beta_i - \beta_{i-1} |, \quad (6)$$

where $\| \cdot \|_2$ is the Euclidean norm, $\| \cdot \|_{TV}$ is the Total Variation norm, and $\lambda > 0$ is a tuning parameter that controls the degree of regularization. We adopt a Bayesian Information Criterion (BIC) (Yao, 1988) minimization approach to choose an appropriate $\lambda$; see Alg. 3. The estimate $\hat{\beta}$ will necessarily be a piecewise constant vector, because the penalty term induces sparsity on the differences of $\beta$ (Tibshirani et al., 2005). This gives rise to TVD’s alternative formulation as the fused lasso, a special case of the generalized lasso (Tibshirani et al., 2011).
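As an illustration of (5), here is a minimal TVD sketch using generic convex optimization (cvxpy); the paper instead uses Condat’s direct linear-time solver, and the jump-detection tolerance below is an arbitrary numerical choice.

```python
import cvxpy as cp
import numpy as np

def tvd(z, lam):
    """Total variation denoising (Eqn. 5) as a fused-lasso problem."""
    beta = cp.Variable(len(z))
    obj = 0.5 * cp.sum_squares(z - beta) + lam * cp.norm1(cp.diff(beta))
    cp.Problem(cp.Minimize(obj)).solve()
    return beta.value

# piecewise constant signal plus noise
rng = np.random.default_rng(0)
z = np.r_[np.zeros(50), 2 * np.ones(50)] + 0.3 * rng.standard_normal(100)
beta_hat = tvd(z, lam=5.0)
jumps = np.flatnonzero(np.abs(np.diff(beta_hat)) > 1e-6) + 1  # candidate jump set
```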

The TVD estimator is minimax rate optimal when the true signal has bounded total variation (Mammen et al., 1997; Donoho et al., 1998). In the one-dimensional case considered here, the TVD estimator is also optimally adaptive to the piecewise constant signal (Guntuboyina et al., 2017). Optimal adaptivity means that if the true signal is piecewise constant with $k$ steps, then the estimator will approach the oracle estimator, which a priori knows the true locations of the jumps, at a rate established in Lin et al. (2016) and Dalalyan et al. (2017). TVD estimation is also extremely computationally efficient since direct TVD solvers with linear time complexity exist (Condat, 2013). Together, these results indicate that the TVD estimator is nearly optimal for piecewise signal recovery, both theoretically and computationally, and is therefore appealing for prescreening changepoints.

We say TVD is an optimal changepoint pre-screener, but not an optimal changepoint detector, due to the suboptimal changepoint detection behavior noted by Fryzlewicz et al. (2014). The two main issues with TVD are that it generally overestimates the number of changepoints and that it lacks exact support recovery: TVD is only sign consistent up to a radius around each break (Rojas and Wahlberg, 2014); that is, the true break locations will only be contained within some radius of the estimated breaks. We mitigate the first issue in Section 3.3 by introducing a refinement procedure, as was done in Chan et al. (2014) and Hyun et al. (2016), to group together “nearby” changepoints. Non-exact support recovery is corrected in Section 3.4 by applying CUSUM to the regions surrounding the grouped changepoints.

3.3 Changeset regionalization

Instead of using TVD directly to estimate changepoint locations, we use TVD jumps as a guide for finding regions of the time series that are likely to contain AMOC. Once these regions are specified, the standard CUSUM statistic can be applied within each region to test which regions contain changepoints and estimate their locations. This approach is similar to the method in Lin et al. (2016), which uses TVD to prescreen for potential changepoints, and then refines the screened changepoints with a post-processing procedure. The difference between our method and theirs is that we introduce a new refinement procedure for the TVD estimate, called “changeset regionalization”, and we base our post-processing procedure on the CUSUM statistic rather than filtering and thresholding. The latter frees us from resorting to permutations to estimate the threshold, allowing us to enjoy a significant computational advantage.

Recall $Z$ denotes a univariate process and let $\hat{\beta}$ denote its TVD estimate as in (5). We define the set $B = \{ i : \hat{\beta}_i \neq \hat{\beta}_{i+1} \}$ as the set of jumps in $\hat{\beta}$. The set $B$ is sub-optimal for partitioning $Z$ into regions because TVD tends to split single large breaks into multiple, tightly grouped, smaller breaks (Rojas and Wahlberg, 2014). To remedy this, we introduce the set $\hat{C}$.

Definition 3.1.

Let $c > 0$ and let $\mathcal{P}(B)$ denote the power set of $B$. We define the set $\hat{C} \subset \mathcal{P}(B)$ as the set meeting the following two conditions:

  1. $d_H(C_j, C_k) > c$ for any $C_j, C_k \in \hat{C}$ with $j \neq k$, where $d_H$ denotes the Hausdorff distance.

  2. $|\hat{C}| \leq |\hat{C}'|$ for all $\hat{C}'$ meeting condition 1. Here $|\cdot|$ denotes the cardinality of the set.

Given a value $c$, each element of the set $\hat{C}$ is constructed by merging nearby changepoints into one set of changepoints, called a “changeset”. This procedure is specified in Algorithm 2 as well as illustrated in Figure 2. The value $c$ can either be specified directly or estimated from the data as part of Algorithm 3. We take the latter approach in this paper. Each changeset in $\hat{C} = \{C_1, \dots, C_K\}$ is assumed to have AMOC and to be in the correct temporal order without loss of generality. After that, we define regions based on the changesets. If eventually $K$ changesets are formed, we will accordingly define $K$ regions along the univariate process with each region containing only a single changeset. This indicates that each region contains AMOC, i.e. each region isolates a changepoint away from the others. The region $R_k$ containing $C_k$ is defined as the following interval:

$$R_k = [\, M_{k-1} + 1, \; m_{k+1} - 1 \,], \quad (7)$$

where $m_k = \min(C_k)$, $M_k = \max(C_k)$, $M_0 = 0$, $m_{K+1} = n + 1$, and $K$ is the cardinality of $\hat{C}$. Each interval $R_k$ is the largest possible interval within $\{1, \dots, n\}$ that contains $C_k$ but no other elements of $\hat{C}$. Figure 2 diagrams the process of changeset regionalization.

Figure 2: Panel A: The polished TVD fit (red line) over a time series (grey), with breaks in the TVD demarcated with black dashes along the bottom axis. Each black dash is an element of . Panel B: Black dashes (changepoints) have been grouped together into three “changesets” according to Definition 3.1, which leads to three distinct regions under (7). Changesets (and regional bounding boxes) are color coordinated and each set of same colored changepoints is one element of . Note that regions do not contain neighboring changesets, which is why we say the regions “isolate” the changepoints. Panel C: Close up view of each of the three regions. The standard CUSUM will be applied to regions 1, 2, and 3 to find all changepoints.

Our definition of $R_k$ naturally has overlapping adjacent regions, as can be seen in Figure 2. We find that overlapping regions, as opposed to nonoverlapping partitions, are quite beneficial for changepoint analysis. The overlap gives each region extra data to improve the power and accuracy of its CUSUM estimate. This is particularly helpful when changepoints are close to one another. The sketch below illustrates the merging and regionalization on a toy set of TVD jumps.
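A minimal Python sketch of the two steps, assuming the greedy single-pass merge of Algorithm 2 and the region construction of (7); names and the toy jump set are illustrative.

```python
def changesets(jumps, c):
    """Merge sorted TVD jump locations closer than the link radius c
    into changesets (a sketch of Alg. 2)."""
    sets = [[jumps[0]]]
    for j in jumps[1:]:
        if j - sets[-1][-1] <= c:
            sets[-1].append(j)       # extend the current changeset
        else:
            sets.append([j])         # start a new changeset
    return sets

def regions(sets, n):
    """Overlapping AMOC regions around each changeset (Eqn. 7 sketch)."""
    out = []
    for k, C in enumerate(sets):
        lo = sets[k - 1][-1] + 1 if k > 0 else 1
        hi = sets[k + 1][0] - 1 if k < len(sets) - 1 else n
        out.append((lo, hi))         # widest interval avoiding other changesets
    return out

sets = changesets([20, 22, 60, 95, 97, 99], c=5)  # -> [[20, 22], [60], [95, 97, 99]]
print(regions(sets, n=120))                       # overlapping intervals
```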

3.4 Changepoint detection

The final step of our procedure is to detect changepoints contained within the regions obtained in Section 3.3. A suite of classical and well studied methods for estimating AMOC is based on the CUSUM statistic (Page, 1954). The basic CUSUM statistic can be used to first evaluate the existence of a single change in the mean of a univariate process through the following hypothesis test:

$$H_0: \mu_1 = \dots = \mu_n \quad \text{vs.} \quad H_A: \mu_1 = \dots = \mu_\theta \neq \mu_{\theta+1} = \dots = \mu_n,$$

where $\theta$ is unknown. A CUSUM process is defined as

$$S_n(u) = \frac{1}{\sqrt{n}} \left( \sum_{i=1}^{\lfloor nu \rfloor} Z_i - \frac{\lfloor nu \rfloor}{n} \sum_{i=1}^{n} Z_i \right),$$

where $u \in [0, 1]$. The CUSUM statistic is defined as the supremum of the scaled CUSUM process

$$T_n = \sup_{u \in [0, 1]} \frac{| S_n(u) |}{\hat{\sigma}}, \quad (8)$$

where $\hat{\sigma}^2$ is a consistent estimator of the long run variance $\sigma^2$. We approximate $\sigma^2$ with the variance of $Z - \hat{\beta}$, where $\hat{\beta}$ is the TVD estimate of $Z$. Under $H_0$,

$$S_n(u) / \hat{\sigma} \;\Rightarrow\; B(u) = W(u) - u W(1),$$

where $W$ is a Brownian motion, so $B$ is a Brownian bridge on $[0, 1]$. Critical values for $T_n$ can be computed using the well known Kolmogorov distribution (Shao and Zhang, 2010):

$$P\left( \sup_{u \in [0, 1]} | B(u) | \leq x \right) = 1 - 2 \sum_{k=1}^{\infty} (-1)^{k+1} e^{-2 k^2 x^2}. \quad (9)$$

If the test rejects $H_0$, then the estimated changepoint location is $\hat{\theta} = \underset{1 \le i \le n}{\arg\max} \; | S_n(i / n) |$.
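A self-contained sketch of the region-wise CUSUM test follows. It uses the series’ own standard deviation as a naive plug-in for the long run variance (the paper estimates it from the TVD residuals instead) and truncates the Kolmogorov series (9) for the p-value.

```python
import numpy as np

def cusum_test(z, sigma=None, terms=100):
    """CUSUM AMOC test on one region (Eqns. 8-9 sketch).

    Returns the statistic, an asymptotic p-value, and the estimated
    change location.
    """
    n = len(z)
    k = np.arange(1, n + 1)
    s = np.cumsum(z) - k / n * z.sum()          # unscaled CUSUM process at u = k/n
    sigma = np.std(z, ddof=1) if sigma is None else sigma
    t_stat = np.abs(s).max() / (sigma * np.sqrt(n))
    j = np.arange(1, terms + 1)
    p_value = 2 * np.sum((-1) ** (j + 1) * np.exp(-2 * j**2 * t_stat**2))
    theta_hat = int(np.argmax(np.abs(s))) + 1   # estimated changepoint
    return t_stat, p_value, theta_hat

rng = np.random.default_rng(2)
z = np.r_[rng.normal(0, 1, 80), rng.normal(1.5, 1, 70)]
print(cusum_test(z))   # small p-value, change near index 80
```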

We apply CUSUM independently to each of the regions found via the changeset regionalization procedure described in Section 3.3. Each region, therefore, has an associated p-value and a potential estimated changepoint location. Due to the issue of multiple testing, we adjust the p-values using the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995) to control the CUSUM’s False Discovery Rate (FDR) at a pre-specified level. We retain all estimated changepoints with an adjusted p-value below the pre-specified level.
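For completeness, a minimal sketch of the Benjamini-Hochberg step-up adjustment applied to the per-region p-values; it matches what standard statistical libraries provide.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure)."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    ranked = p[order] * len(p) / np.arange(1, len(p) + 1)   # p_(i) * m / i
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1].clip(max=1.0)
    out = np.empty_like(adjusted)
    out[order] = adjusted
    return out

# retain regions whose adjusted p-value falls below the FDR level
pvals = [0.001, 0.04, 0.20, 0.003]
keep = benjamini_hochberg(pvals) < 0.05
```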

We summarize all four steps in this section in the following algorithm (Alg. 1).

Input: Functions $Y_1, \dots, Y_n$, significance level $\alpha$
Output: Changepoint locations
1 for each projection (first FPC, arc length) do
2     Project $Y_1, \dots, Y_n$ onto the real line (Sec. 3.1)
      Estimate free parameters $\lambda$ and $c$ (Alg. 3)
      Coarsely segment projected data with TVD with $\lambda$ (Eqn. 5)
      Combine segments into regions using changeset regionalization with $c$ (Alg. 2)
      Compute CUSUM statistic on each region to get potential changepoints (Eqn. 8)
      Adjust p-values with an FDR correction; retain changepoints with p-values less than $\alpha$
3 end for
Concatenate and sort all projection based changepoints together
Average together any changepoints in the concatenated set within a merging radius of each other
Return all remaining changepoints
Algorithm 1 Multiple Changepoint Isolation

The parameter $\lambda$ (line 4) is chosen to scale with the sample size, as was done in Lin et al. (2016) (see Remark 5), to ensure the TVD is well estimated while still filtering out a majority of the extraneous changepoints. Rojas and Wahlberg (2014) derived how $c$ (line 5) should depend on the sample size. However, directly using their rate made the numerical estimation challenging due to the small magnitudes involved. Instead, we let $c$ depend on the sample size through a simpler relationship and find that this works well for all of the scenarios we considered. The merging radius (line 10) was found to work well for reducing the number of redundant changepoints between the arclength and FPC projections in all scenarios.

3.5 Computational complexity

Our method’s computational complexity grows linearly with sample size because each sub-step of the method grows linearly in time, and the number of sub-steps does not grow with sample size. For the FPC decomposition, we use the FACE algorithm (Xiao et al., 2016), which is linear in sample size and function length. Computing the arclength of each function requires only a single pass over the data, so the arclength projection is linear as well. For TVD estimation, we use the Condat algorithm (Condat, 2013), which is linear in time. Refining the segmentation over the data with changeset regionalization can be done with a single iteration using Algorithm 2. Finally, the CUSUM is computable in linear time, and the overlapping regions only cause CUSUM to be run (nearly) twice over the data set. The optimization procedure for $\lambda$ and $c$ uses a grid search over a fixed sequence of $\lambda$ and $c$ values, ensuring that the optimization step is also linear in sample size.

4 Simulations

We investigate MCI’s empirical performance in detecting changes in the mean and covariance of a functional process. We consider the case when there are no changepoints, when there are only a few changepoints (“sparse” setting), and when there are many changepoints (“dense” setting). We also consider data with light and heavy tails, and with symmetric and skewed distributions.

4.1 Simulation setup

We start by simulating symmetric functional data. We use a Gaussian process (GP) model and a $t$-process (TP) model to simulate symmetric light-tailed and heavy-tailed functional data, respectively. Let $X_i$ denote the symmetric process. We have

$$X_i(t) = \mu(t) + \epsilon_i(t),$$

where $\mu$ is the mean and $\epsilon_i$ follows a zero-mean GP or TP with Matérn covariance function (Stein, 2012),

$$C(d) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, d}{r} \right)^{\nu} K_{\nu}\!\left( \frac{\sqrt{2\nu}\, d}{r} \right),$$

with variance parameter $\sigma^2$, range parameter $r$, smoothness parameter $\nu$, and $K_\nu$ a modified Bessel function of the 2nd kind of order $\nu$. If $\epsilon_i$ follows a TP, then an additional degrees of freedom parameter is required; we fix it in all simulations. We set the smoothness parameter to ensure mean square differentiability of the sample paths so that the arc length projections are meaningful.

To generate a skewed functional process on the domain $[0, 1]$, we apply a log-sum transformation to $X_i$. We call the transformed Gaussian process and transformed $t$-process a log-sum Gaussian process (LS-GP) and a log-sum $t$-process (LS-TP), respectively. All results in Section 4.3 are based on LS-GP and LS-TP simulated data, because many real datasets, including the data we analyze here, are skewed. Results under a GP or TP show similar patterns as the LS-GP and LS-TP simulations, although results improve overall because skewed distributions are more challenging for changepoint detectors. We defer the GP and TP results to Section B.1 in the appendix.

We vary the mean $\mu$, the variance $\sigma^2$, and the range $r$ in the simulation over the values listed in Table 1. The smoothness parameter $\nu$ is not varied because it acts on the process in nearly the same way as $r$. We specify five possible mean functions for the LS-GP or LS-TP following the simulations in Chiou et al. (2019).

Differences between the mean functions, as measured by the $L^2$ norm, yield different scales of change, ranging from large to moderate to small. Additionally, these functions represent changes in both the magnitude and the shape of the functional process, as shown in Figure 3.

Figure 3: Five possible latent mean functions for the LS-GP and LS-TP. Differences between functions yield different magnitudes of changepoints, ranging from large, easily detectable changes to much smaller changes that are harder to detect.

To generate randomly spaced changepoints in either $\mu$, $\sigma^2$, or $r$ of an iid LS-GP (or LS-TP) sequence, we first randomly sample parameter values and segment lengths (i.e. segment sample sizes). We generically denote the sequence of parameter values as $v_1, v_2, \dots$ and the sequence of segment lengths as $n_1, n_2, \dots$. Only one parameter is varied at a time, while the others are fixed at prespecified values. To generate the functional time series, we sample $n_1$ LS-GPs (or LS-TPs) with parameter $v_1$, then $n_2$ LS-GPs (or LS-TPs) with parameter $v_2$, and so on for all segments. The segments are concatenated together so that the boundaries between segments represent changepoints in the functional time series. Parameter values are sampled from Table 1, and sampling is done so that consecutive values are not the same. Segment lengths are sampled from one of two distributions, for the “dense” or the “sparse” setting, respectively. A sketch of the underlying GP simulation follows.
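The following sketch simulates Matérn GP paths on a grid, with parameter values echoing the ranges in Table 1 below; the log-sum transformation and the $t$-process variant are omitted, and the helper names are our own.

```python
import numpy as np
from scipy.special import gamma, kv

def matern_cov(t, sigma2=1.0, r=0.3, nu=1.0):
    """Matérn covariance matrix on a grid t (parameter names as in Table 1)."""
    d = np.abs(t[:, None] - t[None, :])
    a = np.sqrt(2 * nu) * d / r
    with np.errstate(invalid="ignore", divide="ignore"):
        C = sigma2 * (2 ** (1 - nu) / gamma(nu)) * a**nu * kv(nu, a)
    C[d == 0] = sigma2                       # kv is undefined at 0; the limit is sigma2
    return C

def sample_gp(n, t, mu, **kw):
    """n iid mean-mu Gaussian process paths with Matérn covariance."""
    L = np.linalg.cholesky(matern_cov(t, **kw) + 1e-8 * np.eye(len(t)))
    z = np.random.default_rng(3).standard_normal((n, len(t)))
    return mu + z @ L.T

t = np.linspace(0, 1, 40)
X = sample_gp(200, t, mu=np.sin(2 * np.pi * t), sigma2=1.0, r=0.3, nu=1.0)
```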

Parameter    Possible Values
$\mu$        $\mu_1$, $\mu_2$, $\mu_3$, $\mu_4$, $\mu_5$
$\sigma^2$   0.50, 0.66, 0.83, 1.00, 1.16, 1.33, 1.50, 1.66, 1.83, 2.00
$r$          0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0
$\nu$        1.0
Table 1: Possible values for the LS-GP (LS-TP) parameters. $\mu$ indicates the mean function, while $\sigma^2$, $r$, and $\nu$ are scalar valued parameters in the Matérn covariance function.

4.2 Assessment criteria

We ran 500 simulations for each parameter setting and calculated the Annotation error (Truong et al., 2019) and the Energy distance (Székely, 2003) between the true changepoints and the estimated changepoints, which we call the “Energy error”. The Annotation error, widely used for assessing changepoint detection, measures the difference in the number of detected changepoints and the number of true changepoints.

Definition 4.1.

Let $S_1$ and $S_2$ be two sets, then the Annotation distance between $S_1$ and $S_2$ is

$$A(S_1, S_2) = \big| \, |S_1| - |S_2| \, \big|. \quad (10)$$

A low Annotation error means that the algorithm consistently estimates the number of changepoints correctly.

The Energy error is non-standard compared to the usual Hausdorff metric for changepoint comparison (Truong et al., 2019), but we find that it offers a more comprehensive depiction of the differences between the real and estimated changepoints when there are multiple changepoints. This is because the Hausdorff metric tends to overly weight single changepoint errors rather than assess the overall performance. We provide details on the difference between the Hausdorff metric and the Energy distance in Appendix B.2. We also provide simulation results under the Hausdorff metric in the Supplement. The Energy distance between two sets is defined as follows.

Definition 4.2.

Let $S_1$ and $S_2$ be two sets, then the Energy distance between $S_1$ and $S_2$ is

$$E(S_1, S_2) = \frac{2}{|S_1| |S_2|} \sum_{x \in S_1} \sum_{y \in S_2} |x - y| \;-\; \frac{1}{|S_1|^2} \sum_{x, x' \in S_1} |x - x'| \;-\; \frac{1}{|S_2|^2} \sum_{y, y' \in S_2} |y - y'|. \quad (11)$$

A low Energy distance between the estimated and actual changepoints, i.e., a low Energy error, means that the estimated changepoints are very similar to the true changepoints. In all simulation settings we compare the different detectors on their ability to achieve low Annotation and low Energy errors.
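A short sketch of both error metrics under Definitions 4.1 and 4.2, treating each changepoint set as an empirical distribution of locations; function names are our own.

```python
import numpy as np

def annotation_error(true_cps, est_cps):
    """Annotation error (Def. 4.1): difference in changepoint counts."""
    return abs(len(true_cps) - len(est_cps))

def energy_error(true_cps, est_cps):
    """Energy distance (Def. 4.2) between two changepoint sets."""
    x, y = np.asarray(true_cps, float), np.asarray(est_cps, float)
    cross = np.abs(x[:, None] - y[None, :]).mean()     # between-set distances
    within_x = np.abs(x[:, None] - x[None, :]).mean()  # within-set distances
    within_y = np.abs(y[:, None] - y[None, :]).mean()
    return 2 * cross - within_x - within_y

print(annotation_error([100, 250], [101, 248, 300]))   # -> 1
print(energy_error([100, 250], [101, 248]))            # small when sets agree
```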

Because the DSBE method proposed by Chiou et al. (2019) is the only multiple changepoint detection method that is computationally comparable to MCI, we compare our method to DSBE. Additionally, we include comparisons against three univariate, linear time changepoint detectors applied to the two projections: the first functional principal component and the arc length. This will allow us to learn the advantages of our detection method on top of the projections. The three methods are the PELT algorithm (Killick et al., 2012), the r-FPOP algorithm (Fearnhead and Rigaill, 2019), and the WBS procedure (Fryzlewicz et al., 2014). For PELT, we used the cpt.meanvar() function in the changepoint package with default settings and method = “PELT”. For r-FPOP, we used the Rob_seg.std() function from the robseg package with a robust loss and tuning parameter $\lambda = 5 \log(n)$; the authors recommended that the tuning parameter scale with the log of the sample size, and the multiplier 5 was found to have the overall best results. For WBS, we used the wbs() function, from the wbs package, with default settings and changepoints found via ssic.penalty minimization. For DSBE, we used the author’s provided code with its settings for the number of changepoint candidates, minimum segment length, and significance threshold. DSBE is only applied to the mean change simulations because DSBE was designed to only detect mean changes.

4.3 Assessment results

4.3.1 No changepoints

We first consider the situation where there are no changepoints, i.e. $K = 0$. Functional observations are generated from either an LS-GP or LS-TP with a constant mean function and fixed Matérn covariance parameters. Figure 4 shows the distribution of Annotation errors under the LS-GP and LS-TP for each method.

Figure 4: Annotation error when no changepoints are present in the data. MCI and DSBE nearly always detect 0 changepoints in both the light tail (LS-GP) and the heavy tail (LS-TP) setting. PELT and WBS detect an average of 55 and 25 changepoints, respectively, when no changepoints exist and the functional process is heavy tailed. FPOP performs better than PELT and WBS, but still detects around six false changepoints on average. Error values are plotted on a log scale.

All methods have an almost uniformly zero Annotation error under the LS-GP, meaning that all methods have an essentially zero false positive rate. However, under the LS-TP, only MCI and DSBE maintain their almost uniformly near-zero error rates, while the Annotation errors for FPOP, PELT, and WBS all increase dramatically. PELT and WBS have particularly high Annotation error rates, indicating their sensitivity to the larger random fluctuations caused by the heavy tailed generating process. FPOP is more robust than PELT and WBS, due to FPOP using a robust loss function, but it is still less robust than either DSBE or MCI.

4.3.2 Sparse changepoints

We next consider the situation when the changepoints are relatively far apart, i.e., the changepoints are sparse. We simulate data as described in Section 4.1, with changepoints and segment lengths sampled from the sparse-setting distribution. To study changepoints in the mean, we fix the covariance parameters; for changepoints in the variance, we fix the mean and the remaining covariance parameters; and for changepoints in the range, we likewise fix the mean and the remaining covariance parameters.

Figure 5: Annotation and Energy error rates for sparse changes in the mean function. Under a log-sum Gaussian process (LS-GP), error rates are comparable between MCI, PELT, and WBS. Under a log-sum $t$-process (LS-TP), only MCI is able to maintain a small error rate. Error values are plotted on a log scale.

Figure 5 summarizes each method’s Annotation error and Energy error in detecting changes in the mean. Our MCI method maintains the overall lowest Annotation error and the overall lowest Energy error. Together, this shows that MCI is accurately estimating the number and location of the changepoints, whether the error process is light or heavy tailed. The univariate methods applied to our projections have reasonably low Annotation error rates on the LS-GP, but show extremely high Annotation error rates on the LS-TP due to overestimation, while their Energy error rates are high on both. DSBE sees the least deterioration from LS-GP to LS-TP compared to the alternative methods, although its Energy error rates were already very high under LS-GP.

Figure 6: Annotation and Energy error rates for sparse changes in the covariance parameters of an LS-GP and an LS-TP. MCI performs the best in all cases. Error values are plotted on a log scale.

We then investigated changes in the variance and the range of the Matérn covariance function for the LS-GP and LS-TP. The results are summarized in Figure 6. DSBE was not included in these simulations because it was not designed to detect changes in covariance. MCI again achieves the lowest overall error rates among all methods and for both parameters. Even with the heavy tailed LS-TP data, MCI maintains a remarkably near-zero Annotation error. Although MCI’s Energy error increases with the LS-TP data, it is still several times smaller than that of the other methods in detecting range changes. Variance changes were difficult for all methods across the board.

4.3.3 Dense changepoints

Finally, we consider the situation when the changepoints are relatively close to each other, i.e., when changepoints are dense. To simulate data with dense changepoints, we sample segment lengths from a distribution with one tenth the mean segment length of the sparse setting. This design results in changepoints that are, on average, ten times as dense as the sparse changepoints in Section 4.3.2. The parameter settings for studying changes in the mean, variance, and range are the same as in Section 4.3.2.

Figure 7: Annotation and Energy error rates for dense changepoints. Under both the LS-GP and the LS-TP, MCI attains the lowest error rates. Error values are plotted on a log scale.

We summarize each method’s ability to detect dense mean changes in Figure 7. MCI again attains the lowest overall Annotation and Energy error rates. However, we find that the gap between MCI and the other methods is smaller than when the changepoints were sparse. This is because, on the one hand, MCI’s error rates increase due to the CUSUM having less data per segment to estimate the location of changepoints. On the other hand, the competing methods see a decrease in their error rates for two reasons. One is that PELT and FPOP tend to estimate many changepoints, so when the true number of changepoints is also high, their Annotation and Energy errors will naturally drop. The other reason is that WBS uses randomly sized intervals and a binary segmentation algorithm to find changepoints, both of which make the method less sensitive to changepoint density.

Figure 8: Annotation and Energy error rates for dense changepoints in the covariance parameters. MCI performs the best for detecting variance and range changes in the LS-GP and for detecting range changes in the LS-TP. No method dominates for detecting variance changes in the LS-TP. Error values are plotted on a log scale.

We then evaluate the changepoint detection in covariance parameters, and the results are summarized in Figure 8. MCI has the lowest average error rates across all simulations. Again, MCI’s advantage over the other methods for dense changepoints is less striking than for sparse changepoints, for the same reasons mentioned above. The pattern of the relative performance of the different methods is similar to that for mean changes in Figure 7. Compared to detecting changes in the mean, detecting changes in the covariance parameters seems more challenging, as evidenced by larger Energy error rates. Variance changes under the LS-TP were again challenging across the board, with no method clearly dominating the others.

In sum, Figures 5, 6, 7, and 8 serve to show that MCI is generally more accurate and more skillful across a broader range of changepoint types than the other approaches. This is because MCI effectively combines two essential steps: the first prescreens extraneous changepoints through TVD and changeset regionalization, and the second implements the powerful CUSUM statistic on the filtered candidates. Under the light tailed LS-GP data, MCI has the lowest Annotation and Energy error rates across all changepoint types and densities. In contrast, the three other algorithms’ performance varies greatly depending on the type and density of the changepoints. Furthermore, MCI is more robust to the distribution of the functional data. Although MCI’s error rates increase across all changepoint types with LS-TP data compared to LS-GP data, the other methods show erratic and greatly deteriorated skill for heavy-tailed data.

4.4 Linear computation

To demonstrate MCI’s linear computation time, we conduct a simulation study to empirically assess the run time growth of MCI over an increasing sequence of sample sizes. We take the simulation with a mean zero GP and fixed covariance parameters as an example. We consider sample sizes from 10,000 to 70,000 in increments of 10,000, and run the simulation 1000 times for each sample size. We show the run time of MCI under each of the seven sample sizes in Figure 9. The run time of MCI exhibits a clear linear trend with the sample size. The benchmarking scheme is sketched below.
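A generic benchmarking harness of the kind used here, shown with a linear-time stand-in detector since the full MCI pipeline is not reproduced; the sizes mirror the seven settings in Figure 9.

```python
import time
import numpy as np

def benchmark(detector, sizes, reps=20, seed=4):
    """Median wall-clock run time of a changepoint detector vs. sample size."""
    rng = np.random.default_rng(seed)
    medians = []
    for n in sizes:
        times = []
        for _ in range(reps):
            z = rng.standard_normal(n)
            start = time.perf_counter()
            detector(z)
            times.append(time.perf_counter() - start)
        medians.append(np.median(times))
    return np.array(medians)

sizes = [10_000, 20_000, 30_000, 40_000, 50_000, 60_000, 70_000]
med = benchmark(lambda z: np.cumsum(z), sizes)   # O(n) stand-in detector
slope = np.polyfit(sizes, med, 1)[0]             # check the linear trend
```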

Figure 9: Median run time of MCI (black points) under seven sample size settings (10,000-70,000 observations). The trend line (red) shows the claimed linear relationship between sample size and computational time. Simulations were performed on a single core of a 2017 MacBook Pro with a 2.33GHz i5 processor.

5 Application to Profiles of Water Vapor

We now revisit the motivating data example illustrated in the Introduction and apply our MCI method to this data.

5.1 Data

The water vapor mixing ratio is the water vapor density over the dry air density in a given atmospheric unit. It is an important variable in Meteorology for distinguishing individual air masses, monitoring the effects of soil evapotranspiration and large water body evaporation (North et al., 2014), and for the early detection of heavy precipitation events (Sakai et al., 2019). Changepoint detection is useful for identifying sudden changes in an air parcel’s water vapor content due to precipitation events and air parcels mixing. Retrospectively identifying sudden changes in the water vapor profiles’ structure is often a necessary first step before constructing statistical models for identifying precipitation events.

We apply our functional changepoint detector to the water vapor mixing ratio profiles collected from the Atmospheric Emitted Radiance Interferometer (AERI) instrument at the Lamont, Oklahoma Facility. The AERI instruments are maintained by the U.S. Department of Energy’s (DOE) Atmospheric Radiation Measurement (ARM) Program to collect high-resolution atmospheric profile data (Stokes and Schwartz, 1994). In this dataset, each profile consists of 58 measurements of the water vapor mixing ratio along a single atmospheric column from 0 to 44,000 meters above ground level in Lamont, Oklahoma. Whole profiles were collected every 8 minutes, thus providing near-continuous monitoring of atmospheric conditions. We removed the top 18 altitude points, representing the 11,000 to 44,000-meter range, due to the extremely high rate of measurement errors in this range. Therefore, we only consider the 40 measurements from 0 to 10,000 meters.

Figure 10: Marginal distributions at two altitude bands in the AERI profile data (450 meters and 1850 meters). Non-Gaussian behavior such as skewness and heavy tails is observed.

For our analysis, we consider the entire time series of water vapor profiles from January 4th, 2007 to March 10th, 2014. This period corresponds to 234,062 profiles, each sampled at the same 40 altitudes. To illustrate the data, we plotted the water vapor profiles from August 9th through September 6th, 2008 in Figure 1. Each vertical line represents an individual profile, with colors indicating the value of the profile at each altitude. Abrupt increases and decreases in water vapor along time are visible, indicating rapid changes from high density to low density and vice versa. The changes could be caused by sudden precipitation events and air mass mixing.

In Figure 10, we show the marginal distribution at two altitude bands (450 and 1850 meters) across all profiles. For this plot, we standardized each profile by removing the profile mean and dividing by the profile standard deviation, so that mean and variance changes would not obscure the marginal distribution; a sketch of this step follows. Even after standardization, the marginals show non-Gaussian behavior, including heavy tails and kurtosis.
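A minimal sketch of that per-profile standardization, assuming profiles are stored as rows of an array:

```python
import numpy as np

def standardize_profiles(Y):
    """Standardize each profile (row) by its own mean and standard deviation,
    so that mean and variance changes do not obscure the marginals."""
    mu = Y.mean(axis=1, keepdims=True)
    sd = Y.std(axis=1, ddof=1, keepdims=True)
    return (Y - mu) / sd
```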

5.2 Identification of changepoints

Figure 11: Four examples of the changepoints detected by MCI in the first FPC of the AERI profiles. The black curve is the projected values, the red vertical lines mark the estimated changepoints, and the red horizontal lines indicate the segment means between changepoints.
Figure 12: Four examples of the changepoints detected by MCI in the arclength of the AERI profiles. The black curve is the projected values, the red vertical lines mark the estimated changepoints, and the red horizontal lines indicate the segment means between changepoints.

We applied our MCI method to the water vapor mixing ratio profiles and found 210 changepoints. Figure 11 shows four examples of the changepoints identified in the first functional principal component, and Figure 12 shows four examples of changepoints identified in the arclength projection.

A number of commonalities between the two plots are apparent. The first is that panel A (top left) and panel D (bottom right) are highly similar in appearance, and the changepoints identified in these regions are similar between the two projections. This happens because the first FPC and the arclengths are not necessarily orthogonal to each other, and changepoints may manifest in both the low and high frequency spectrum. Another common feature is that MCI is robust to independence violations and heavy tails in both the arclength and the first FPC. This can be seen in the right half of panel C (bottom left) in both Figures 11 and 12, where both time series exhibit an autoregressive structure, yet MCI does not detect extra changepoints due to the spurious mean changes induced by the correlations. Heavy tailed behavior can be observed in panels A, B, and D, where large “spikes” in the time series can be observed. MCI does not detect these anomalies as changepoints in either projection.

Figures 11 and 12 also show that there are many differences between the two projections, meaning that the first FPC and the arc length measure very different aspects of the data. Panels B and C are almost completely different in the two plots, including the locations of the detected changepoints. The MCI was, therefore, able to pick up on a broader range of changepoints than the first FPC would allow, because of the arclength projection.

6 Discussion

We propose the Multiple Changepoint Isolation method for detecting multiple changepoints of a functional time series. The changes can be either in the mean or in the dependence structure of the functional data. Our method first introduces a pair of projections to extract the functional process’s high frequency and low frequency features. We then use TVD and a new changeset regionalization procedure to identify regions of the time series that likely contain at most one changepoint. CUSUM is then applied to each region individually to identify the potential changepoint. Our extensive simulations show that the MCI method is accurate, computationally linear time, and robust to data distributions. We finally demonstrate MCI on water vapor mixing ratios over time.

Using the leading FPC coefficient and the arc length as the two projections effectively preserves the major low frequency and high frequency variability in the functional data sequence. Our method is, therefore, able to detect a broad range of changepoints. In contrast, entirely functional metric based approaches are, in general, only powerful against changes in the mean (Aue et al., 2018). Our effective data reduction, compared to using several FPCs, also reduces the number of tests in the multiple testing context and thus helps retain high power in the changepoint detection; more projections would lead to more tests and less power.

Our changepoint detection differs from the existing TVD (or fused lasso) based approaches in two significant ways. First, we used TVD only as a screening method to eliminate locations that are very unlikely to be changepoints. Second, we proposed a changeset regionalization procedure to further filter out extraneous changepoints and identify proper regions in which to apply the CUSUM statistic. This strategy is more efficient and precise than directly using TVD for changepoint detection, since TVD is only shown to recover break locations approximately (Rojas and Wahlberg, 2014).

In future work, we would like to consider the theoretical underpinnings of the MCI method more rigorously. For instance, it remains to be shown whether MCI is a consistent estimator of the changepoints or whether MCI is asymptotically powerful, though our simulations seem to imply those properties. We could further study MCI’s robustness and compare its current form with alternatives using robust changepoint statistics. The procedure to optimize the tuning parameters $\lambda$ and $c$ (Alg. 3) is also heuristic and not guaranteed to find globally optimal parameters. A better algorithm, with optimality guarantees, may be possible. However, optimal parameters are not strictly necessary for optimal changepoint detection, so studying the trade-off between parameter optimization and changepoint detection may also be interesting.

References

  • J. A. Aston and C. Kirch (2012) Detecting and estimating changes in dependent functional data. Journal of Multivariate Analysis 109, pp. 204–220. Cited by: §1.
  • A. Aue, R. Gabrys, L. Horváth, and P. Kokoszka (2009) Estimation of a change-point in the mean function of functional data. Journal of Multivariate Analysis 100 (10), pp. 2254–2269. Cited by: §1.
  • A. Aue, G. Rice, and O. Sönmez (2018) Detecting and dating structural breaks in functional data without dimension reduction. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80 (3), pp. 509–529. Cited by: §1, §1, §6.
  • Y. Benjamini and Y. Hochberg (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological) 57 (1), pp. 289–300. Cited by: item 4, §3.4.
  • I. Berkes, R. Gabrys, L. Horváth, and P. Kokoszka (2009) Detecting changes in the mean of functional observations. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (5), pp. 927–946. Cited by: §1, §1.
  • N. H. Chan, C. Y. Yau, and R. Zhang (2014) Group lasso for structural break time series. Journal of the American Statistical Association 109 (506), pp. 590–599. Cited by: §3.2, §3.2.
  • J. Chiou, Y. Chen, T. Hsing, et al. (2019) Identifying multiple changes for a functional data sequence with application to freeway traffic segmentation. The Annals of Applied Statistics 13 (3), pp. 1430–1463. Cited by: §1, §4.1, §4.2.
  • L. Condat (2013) A direct algorithm for 1-d total variation denoising. IEEE Signal Processing Letters 20 (11), pp. 1054–1057. Cited by: §3.2, §3.5.
  • A. S. Dalalyan, M. Hebiri, J. Lederer, et al. (2017) On the prediction performance of the lasso. Bernoulli 23 (1), pp. 552–581. Cited by: §3.2.
  • D. L. Donoho, I. M. Johnstone, et al. (1998) Minimax estimation via wavelet shrinkage. The annals of Statistics 26 (3), pp. 879–921. Cited by: §3.2.
  • P. Fearnhead and G. Rigaill (2019) Changepoint detection in the presence of outliers. Journal of the American Statistical Association 114 (525), pp. 169–183. Cited by: §1, §4.2.
  • P. Fryzlewicz et al. (2014) Wild binary segmentation for multiple change-point detection. The Annals of Statistics 42 (6), pp. 2243–2281. Cited by: §3.2, §4.2.
  • O. Gromenko, P. Kokoszka, and M. Reimherr (2017) Detection of change in the spatiotemporal mean function. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79 (1), pp. 29–50. Cited by: §1.
  • A. Guntuboyina, D. Lieu, S. Chatterjee, and B. Sen (2017) Adaptive risk bounds in univariate total variation denoising and trend filtering. arXiv preprint arXiv:1702.05113. Cited by: §3.2.
  • S. Hörmann, P. Kokoszka, et al. (2010) Weakly dependent functional data. The Annals of Statistics 38 (3), pp. 1845–1884. Cited by: §1.
  • S. Hörmann and P. Kokoszka (2012) Functional time series. In Handbook of statistics, Vol. 30, pp. 157–186. Cited by: §1.
  • R. J. Hyndman and H. Booth (2008) Stochastic population forecasts using functional data models for mortality, fertility and migration. International Journal of Forecasting 24 (3), pp. 323–342. Cited by: §1.
  • S. Hyun, M. G’Sell, and R. J. Tibshirani (2016) Exact post-selection inference for changepoint detection and other generalized lasso problems. arXiv preprint arXiv:1606.03552. Cited by: §3.2, §3.2.
  • R. Killick, P. Fearnhead, and I. A. Eckley (2012) Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association 107 (500), pp. 1590–1598. Cited by: §1, §4.2.
  • P. Kokoszka and X. Zhang (2012) Functional prediction of intraday cumulative returns. Statistical Modelling 12 (4), pp. 377–398. Cited by: §1.
  • B. S. Kulla and C. Ritter (2019) Water vapor calibration: using a raman lidar and radiosoundings to obtain highly resolved water vapor profiles. Remote Sensing 11 (6), pp. 616. Cited by: §1.
  • S. Lee and S. Jung (2016) Combined analysis of amplitude and phase variations in functional data. arXiv preprint arXiv:1603.01775. Cited by: §3.1.
  • X. Li and S. Ghosal (2018) Bayesian change point detection for functional data. arXiv preprint arXiv:1808.01236. Cited by: §1.
  • K. Lin, J. Sharpnack, A. Rinaldo, and R. J. Tibshirani (2016) Approximate recovery in changepoint problems, from estimation error rates. arXiv preprint arXiv:1606.06746. Cited by: §3.2, §3.2, §3.3, §3.4.
  • R. Maidstone, T. Hocking, G. Rigaill, and P. Fearnhead (2017) On optimal multiple changepoint algorithms for large data. Statistics and Computing 27 (2), pp. 519–533. Cited by: §1.
  • E. Mammen, S. van de Geer, et al. (1997) Locally adaptive regression splines. The Annals of Statistics 25 (1), pp. 387–413. Cited by: §3.2.
  • G. R. North, J. A. Pyle, and F. Zhang (2014) Encyclopedia of atmospheric sciences. Vol. 1, Elsevier. Cited by: §5.1.
  • E. S. Page (1954) Continuous inspection schemes. Biometrika 41 (1/2), pp. 100–115. Cited by: item 4, §3.4.
  • J. O. Ramsay (2004) Functional data analysis. Encyclopedia of Statistical Sciences 4. Cited by: §3.1.
  • G. Rice and C. Zhang (2019) Consistency of binary segmentation for multiple change-points estimation with functional data. arXiv preprint arXiv:2001.00093. Cited by: §1.
  • C. R. Rojas and B. Wahlberg (2014) On change point detection using the fused lasso method. arXiv preprint arXiv:1401.5408. Cited by: §3.2, §3.2, §3.3, §3.4, §6.
  • L. I. Rudin, S. Osher, and E. Fatemi (1992) Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena 60 (1-4), pp. 259–268. Cited by: item 2, §3.1.
  • W. Rudin et al. (1964) Principles of mathematical analysis. Vol. 3, McGraw-Hill, New York. Cited by: §3.1.
  • T. Sakai, T. Nagai, T. Izumi, S. Yoshida, and Y. Shoji (2019) Automated compact mobile raman lidar for water vapor measurement: instrument description and validation by comparison with radiosonde, gnss, and high-resolution objective analysis. Atmospheric Measurement Techniques 12 (1). Cited by: §1, §5.1.
  • H. L. Shang and R. J. Hyndman (2011) Nonparametric time series forecasting with dynamic updating. Mathematics and Computers in Simulation 81 (7), pp. 1310–1324. Cited by: §1.
  • X. Shao and X. Zhang (2010) Testing for change points in time series. Journal of the American Statistical Association 105 (491), pp. 1228–1240. Cited by: §3.4.
  • O. Sharipov, J. Tewes, and M. Wendler (2016) Sequential block bootstrap in a hilbert space with application to change point analysis. Canadian Journal of Statistics 44 (3), pp. 300–322. Cited by: §1.
  • M. L. Stein (2012) Interpolation of spatial data: some theory for kriging. Springer Science & Business Media. Cited by: §4.1.
  • G. M. Stokes and S. E. Schwartz (1994) The atmospheric radiation measurement (arm) program: programmatic background and design of the cloud and radiation test bed. Bulletin of the American Meteorological Society 75 (7), pp. 1201–1222. Cited by: §5.1.
  • G. J. Székely (2003) E-statistics: the energy of statistical samples. Bowling Green State University, Department of Mathematics and Statistics Technical Report 3 (05), pp. 1–18. Cited by: §4.2.
  • R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight (2005) Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (1), pp. 91–108. Cited by: §3.2.
  • R. J. Tibshirani and J. Taylor (2011) The solution path of the generalized lasso. The Annals of Statistics 39 (3), pp. 1335–1371. Cited by: §3.2.
  • C. Truong, L. Oudre, and N. Vayatis (2019) Selective review of offline change point detection methods. Signal Processing, pp. 107299. Cited by: §4.2, §4.2.
  • J. D. Tucker, W. Wu, and A. Srivastava (2013) Generative models for functional data using phase and amplitude separation. Computational Statistics & Data Analysis 61, pp. 50–66. Cited by: §3.1.
  • H. K. Versteeg and W. Malalasekera (2007) An introduction to computational fluid dynamics: the finite volume method. Pearson education. Cited by: §3.1.
  • W. H. Woodall (2007) Current research on profile monitoring. Production 17 (3), pp. 420–425. Cited by: §1.
  • L. Xiao, V. Zipunnikov, D. Ruppert, and C. Crainiceanu (2016) Fast covariance estimation for high-dimensional functional data. Statistics and Computing 26 (1-2), pp. 409–421. Cited by: §3.5.
  • Y. Yao (1988) Estimating the number of change-points via Schwarz' criterion. Statistics & Probability Letters 6 (3), pp. 181–189. Cited by: §3.2.
  • X. Zhang, X. Shao, K. Hayhoe, and D. J. Wuebbles (2011) Testing the structural stability of temporally dependent functional observations and application to climate projections. Electronic Journal of Statistics 5, pp. 1765–1796. Cited by: §1.

Appendix A Additional Algorithms

We now present the two algorithms referenced in the body of the manuscript. The first is the changeset regionalization algorithm (Algorithm 2), which merges “nearby” changepoints into groups called changesets.

Input : TVD changepoints t_1 < t_2 < ... < t_m and link radius c.
Output : Changesets S_1, ..., S_J
1 Initialize j = 1
2 Initialize S_1 = {t_1}
3 for i = 2 to m do
4       if t_i - t_(i-1) <= c then
5             S_j = S_j ∪ {t_i}
6       else
7             Increment j = j + 1 and initialize S_j = {t_i}
8       end if
9 end for
Return S_1, ..., S_J
Algorithm 2 Changeset regionalization

Algorithm 2 starts with the singleton set containing the first changepoint detected by TVD. It then checks whether the next changepoint is within the link radius c of the previous one, and either adds it to the current set or starts a new set containing only that changepoint. It proceeds down the list until every changepoint belongs to a set. Since the changepoints are assumed to be in order, this requires only a single pass over the list.
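For concreteness, the following is a minimal Python sketch of this single-pass merge; the function and variable names are ours, not the paper's notation.

import numpy as np  # not required here; shown for consistency with later sketches

def regionalize(changepoints, c):
    """Merge sorted TVD changepoints into changesets (Algorithm 2).

    Consecutive changepoints within the link radius c of one another
    are grouped into the same changeset.
    """
    changesets = [[changepoints[0]]]      # singleton set with the first changepoint
    for prev, curr in zip(changepoints, changepoints[1:]):
        if curr - prev <= c:
            changesets[-1].append(curr)   # within radius: extend the current set
        else:
            changesets.append([curr])     # otherwise: start a new set
    return changesets

# Example: with link radius 5, five changepoints split into three changesets.
print(regionalize([10, 12, 30, 33, 60], c=5))   # [[10, 12], [30, 33], [60]]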

The second algorithm, Alg. 3, tunes the hyperparameters of TVD and Alg. 2. That is, we write the regularization parameter of TVD as λ and the linkage parameter of Alg. 2 as c, and try to find optimal values for both. Alg. 3 essentially performs coordinate descent on λ and c to find the pair that minimizes the BIC of a step-wise function with jumps at the detected changepoints; a Python sketch of this loop follows Algorithm 3 below.

Input : Projections and significance level α
Output : Free parameters λ* and c*
1 Initialize c to a default value c_0
2 for each candidate λ in the grid Λ do
3       Segment the projections with TVD with regularization λ
4       Regionalize the TVD segmentation with Alg. 2 with link radius c_0
5       Compute CUSUM on each region and adjust p-values with an FDR correction
6       Denote by T the changepoints found via CUSUM with adjusted p-values below α
7       Fit a step-wise function with steps at T
8       Compute and save the BIC of the fitted function
9 end for
10 Let λ* be the λ value with the minimal BIC
11 for each candidate c in the grid C do
12       Segment the projections with TVD with regularization λ*
13       Regionalize the TVD segmentation with Alg. 2 with link radius c
14       Compute CUSUM on each region and adjust p-values with an FDR correction
15       Denote by T the changepoints found via CUSUM with adjusted p-values below α
16       Fit a step-wise function with steps at T
17       Compute and save the BIC of the fitted function
18 end for
19 Let c* be the c value with the minimal BIC
Return (λ*, c*)
Algorithm 3 Optimize parameters
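The following is a hedged Python sketch of this tuning loop. The helper detect(x, lam, c) is a hypothetical stand-in for the full TVD, Alg. 2, and region-wise CUSUM pipeline, and the BIC uses the usual Gaussian form with one mean parameter per segment (in the spirit of Yao, 1988); both choices are illustrative assumptions rather than the paper's exact implementation.

import numpy as np

def stepwise_bic(x, changepoints):
    """BIC of a step function with jumps at the given changepoints.

    Each segment is fit by its sample mean; the penalty counts one
    parameter per segment (Gaussian likelihood, Schwarz criterion).
    """
    n = len(x)
    bounds = [0] + sorted(changepoints) + [n]
    rss = sum(np.sum((x[a:b] - x[a:b].mean()) ** 2)
              for a, b in zip(bounds, bounds[1:]) if b > a)
    k = len(bounds) - 1                     # number of segments
    return n * np.log(rss / n) + k * np.log(n)

def tune(x, lam_grid, c_grid, detect, c0):
    """Coordinate descent over (lambda, c) as in Algorithm 3.

    First sweep lambda with c fixed at c0, then sweep c with the
    best lambda; each candidate is scored by the BIC above.
    """
    lam_star = min(lam_grid, key=lambda lam: stepwise_bic(x, detect(x, lam, c0)))
    c_star = min(c_grid, key=lambda c: stepwise_bic(x, detect(x, lam_star, c)))
    return lam_star, c_star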

Appendix B Additional Simulations

We investigate MCI’s empirical performance in detecting changes in the mean and covariance of a symmetric functional process. We again consider the case where there are no changepoints, where there are only a few changepoints (“sparse” setting), and where there are many changepoints (“dense” setting). The generation process for each case is exactly the same as in the manuscript, except that the log-sum transformation is omitted.

B.1 Assessment results

B.1.1 No changepoints

We first consider the situation where there are no changepoints. Functional observations are generated from either a GP or a TP with a constant mean function and a Matérn covariance with fixed variance, range, and smoothness parameters. Figure 13 shows the results for each process.
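To make the generating mechanism concrete, the sketch below draws curves from a Gaussian process with a Matérn (smoothness 3/2) covariance and, by rescaling with an inverse chi-square factor, from a heavy tailed t-process. The parameter values here are placeholders, not the settings used in the simulations.

import numpy as np

def matern32(s, sigma2=1.0, rho=0.2):
    """Matérn covariance with smoothness 3/2 on the grid s."""
    d = np.abs(s[:, None] - s[None, :])
    a = np.sqrt(3.0) * d / rho
    return sigma2 * (1.0 + a) * np.exp(-a)

def sample_curves(n, m=50, nu=3.0, heavy_tailed=False, seed=0):
    """Draw n curves on m grid points from a GP, or from a t-process
    obtained by scaling GP draws with nu / chi-square(nu) weights."""
    rng = np.random.default_rng(seed)
    s = np.linspace(0.0, 1.0, m)
    L = np.linalg.cholesky(matern32(s) + 1e-8 * np.eye(m))  # jitter for stability
    z = rng.standard_normal((n, m)) @ L.T                   # GP draws
    if heavy_tailed:
        w = nu / rng.chisquare(nu, size=(n, 1))             # t-process scale mixture
        z = z * np.sqrt(w)
    return s, z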

Figure 13: Annotation error when no changepoints are present in the data. MCI and DSBE nearly always detect 0 changepoints in both the light tailed (GP) and the heavy tailed (TP) settings. PELT and WBS detect an average of 20 and 22 changepoints, respectively, when no changepoints exist and the functional process is heavy tailed. FPOP performs better than PELT and WBS, but still detects around ten false changepoints on average.

All methods have almost uniformly zero Annotation error under the Gaussian process, meaning that all methods have an essentially zero false positive rate. Under a t-process, however, only our MCI method and DSBE maintain their near-zero error rates, while the Annotation errors for FPOP, PELT, and WBS all increase dramatically. PELT and WBS have particularly high Annotation error rates, indicating their sensitivity to the larger random fluctuations caused by the heavy tailed generating process. FPOP is more robust than PELT and WBS, owing to its use of a robust loss function, but it is still far less robust than either DSBE or MCI.

B.1.2 Sparse changepoints

We next consider the situation where the changepoints are relatively far apart, i.e., the changepoints are sparse. We simulate data as described in Section 4.1, with a small number of changepoints and comparatively long segment lengths. To study changes in the mean, we fix the Matérn covariance parameters; to study changes in the variance, we fix the other parameters and change the variance across segments; and to study changes in the range, we fix the other parameters and change the range.

Figure 14: Annotation and Energy error rates for sparse changes in the mean function. Under a Gaussian process (GP), error rates are comparable between MCI, PELT, and WBS. Under a t-process (TP), only MCI is able to maintain a small error rate.

In Figure 14, we summarize each method’s Annotation error and Energy error in detecting changes in the mean. Our MCI method attains the lowest overall Annotation error and the lowest overall Energy error. Together, these plots show that MCI accurately estimates both the number and the locations of the changepoints, whether the error process is light or heavy tailed. The other methods perform similarly to MCI on the GP data, but experience massive increases in both Annotation and Energy error rates on the TP data.

Figure 15: Annotation and Energy error rates for sparse changes in the covariance parameters of a GP and a TP.

We then consider changes in the variance and in the range of the Matérn covariance function; the results are summarized in Figure 15. DSBE was not included in these simulations because it was not designed to detect changes in covariance. MCI again achieves the lowest overall error rates among all methods, for both parameters. Even on the heavy tailed TP data, MCI maintains a remarkable, nearly zero Annotation error. Although MCI’s Energy error increases on the TP data, it remains several times smaller than that of the other methods. It is worth noting that MCI is a powerful detector for changes in either the range or the variance, whereas the other methods exhibit wildly different skill on these two parameters even when the data are Gaussian.

B.1.3 Dense changepoints

Finally, we consider the situation where changepoints are relatively close to each other, i.e., changepoints are dense. To simulate data with dense changepoints, we increase the number of changepoints and sample correspondingly shorter segment lengths, producing changepoints that are ten times as dense as those in Section B.1.2. The parameter settings for studying changes in the mean, variance, and range are the same as in Section 4.3.2.

Figure 16: Annotation and Energy error rates for dense changepoints. Under both the GP and the TP, MCI attains the lowest error rates.

We summarize each method’s ability to detect dense mean changes in Figure 16. MCI again attains the lowest overall Annotation and Energy error rates, though the gap between MCI and the other methods is smaller than in the sparse setting. On the one hand, MCI’s error increases slightly relative to the sparse setting: high changepoint density reduces CUSUM’s power by shortening the regions and hence the amount of data per region. On the other hand, the error rates of FPOP, PELT, and WBS all decrease as the number of changepoints increases. FPOP and PELT appear to overestimate the number of changepoints when the true number is low (Figure 13), but they produce fewer false changepoints when the true number is high. WBS uses randomly sized intervals and a binary segmentation algorithm to find changepoints, so densely packed changepoints have less effect on its accuracy.

Figure 17: Annotation and Energy error rates for dense changepoints in the covariance parameters.

We then evaluate changepoint detection in the covariance parameters; the results are summarized in Figure 17. MCI has the lowest average error rates across all simulations. Again, MCI’s advantage over the other methods is less striking for dense changepoints than for sparse changepoints, for the same reasons mentioned above. The pattern of relative performance across methods is similar to that for mean changes in Figure 16. Compared to detecting changes in the mean, detecting changes in the covariance parameters appears more challenging, as evidenced by the larger Energy error rates.

In sum, Figures 14, 15, 16, and 17 show that MCI is generally more accurate, and more skillful across a broader range of changepoint types, than the other approaches. This is because MCI effectively combines two essential steps: first, prescreening extraneous changepoints through TVD and changeset regionalization, and second, applying the powerful CUSUM statistic to the filtered candidates. Under the light tailed GP data, MCI has almost uniformly zero Annotation and Energy error across all changepoint types and densities, whereas the three other algorithms’ performance varies greatly with the type and density of the changepoints. Furthermore, MCI is more robust to the distribution of the functional data: although MCI’s error rates increase on TP data relative to GP data across all changepoint types, the other methods show erratic and greatly deteriorated skill on heavy-tailed data.

B.2 Hausdorff vs Energy error

The Hausdorff metric only measures the “worst-case” estimation error, so it says little about the average estimation error. If a changepoint detector perfectly detects every changepoint except one, which it misses by a wide margin, it will have a high Hausdorff error. Likewise, if a different detector missed all changepoints by that same wide margin, it would have just as high an error rate as the first detector. This contradicts our intuition that the first detector is superior; the Energy distance, by contrast, would show that the first detector is much better. The Hausdorff error also does not penalize detectors that greatly overestimate the number of changepoints, as long as those estimates are close to a real changepoint, whereas the Energy distance would penalize this “shotgun” approach to changepoint detection.
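As a hedged illustration, the following example uses standard definitions of the two distances (which may differ in detail from our implementation) to show two detectors with identical worst-case error but very different average behavior.

import numpy as np

def hausdorff(A, B):
    """Worst-case distance: the largest gap between a point in one
    set and its nearest neighbor in the other."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    D = np.abs(A[:, None] - B[None, :])
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def energy(A, B):
    """Energy distance (Szekely, 2003) between two point sets;
    sensitive to the average error, not just the worst case."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    d = lambda X, Y: np.abs(X[:, None] - Y[None, :]).mean()
    return 2.0 * d(A, B) - d(A, A) - d(B, B)

truth = [100, 200, 300, 400]
est1 = [100, 200, 300, 450]    # perfect except one point missed by 50
est2 = [150, 250, 350, 450]    # every point missed by 50
print(hausdorff(est1, truth), hausdorff(est2, truth))   # both equal 50
print(energy(est1, truth), energy(est2, truth))         # 6.25 vs 25.0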

We provide here plots (Figures 18, 19, 20, and 21) using the Hausdorff error metric as counterparts to Figures 14, 15, 16, and 17 from Section B.1. The results are largely similar to those based on the Energy distance, although the distinctions between the methods become less clear.

Figure 18: Annotation and Hausdorff error rates for sparse changes in the mean function. Under a Gaussian process (GP), error rates are comparable between MCI, PELT, and WBS. Under a t-process (TP), only MCI is able to maintain a small error rate.
Figure 19: Annotation and Hausdorff error rates for sparse changes in the covariance parameters of a GP and a TP.
Figure 20: Annotation and Hausdorff error rates for dense changepoints. Under both the GP and the TP, MCI attains the lowest error rates.
Figure 21: Annotation and Hausdorff error rates for dense changepoints in the covariance parameters.