On Matched Filtering for Statistical Change Point Detection

06/09/2020 ∙ by Kevin C. Cheng, et al. ∙ 0

Non-parametric and distribution-free two-sample tests have been the foundation of many change point detection algorithms. However, randomness in the test statistic as a function of time makes them susceptible to false positives and localization ambiguity. We address these issues by deriving and applying filters matched to the expected temporal signatures of a change for various sliding window, two-sample tests under IID assumptions on the data. These filters are derived asymptotically with respect to the window size for the Wasserstein quantile test, the Wasserstein-1 distance test, Maximum Mean Discrepancy squared (MMD^2), and the Kolmogorov-Smirnov (KS) test. The matched filters are shown to have two important properties. First, they are distribution-free, and thus can be applied without prior knowledge of the underlying data distributions. Second, they are peak-preserving, which allows the filtered signal produced by our methods to maintain expected statistical significance. Through experiments on synthetic data as well as activity recognition benchmarks, we demonstrate the utility of this approach for mitigating false positives and improving the test precision. Our method allows for the localization of change points without the use of ad-hoc post-processing to remove redundant detections common to current methods. We further highlight the performance of statistical tests based on the Quantile-Quantile (Q-Q) function and show how the invariance property of the Q-Q function to order-preserving transformations allows these tests to detect change points of different scales with a single threshold within the same dataset.

I Introduction

Given a time-varying signal, the problem of change point detection (CPD) is to identify specific points in time where the signal exhibits a significant change either in its deterministic content or underlying stochastic distribution. Having been studied for nearly a century by researchers, CPD was originally motivated in problems of fault detection and quality control [shewhart_economic_1931]. Since then, a wide range of CPD methods has been developed and applied across a diverse set of applications including finance [adams_bayesian_2007], human activity analysis [liu_change-point_2013], ECG and EEG processing [qi_novel_2014, dizaji_change-point_2017], speech [li_m-statistic_2015], and climate change [reeves_review_2007].

Change point detection methods are classified along a number of different dimensions [truong_selective_2020, aminikhanghahi_survey_2017], such as whether labels are available at training (supervised or unsupervised), the detection setting (offline or online), the number of change points assumed, the dimensionality of the signal, and the existence of modeling assumptions on the data (model-based or similarity-based). Supervised CPD methods require an annotated training corpus where each time-varying signal has associated, labeled change points. Once the method is “trained” on these examples, it can be used to process new signals whose change points are unknown. In contrast, unsupervised CPD methods require only a set of signals to make detections. Online CPD methods make detections using only historical data and thus are useful in streaming or real-time settings. Offline methods process an entire sequence retrospectively. Single change point problems terminate after one change is detected, while multiple CPD methods focus on time series where many changes may occur and are further divided based on whether or not the number of change points is known a priori. While recent benchmarks [burg_evaluation_2020] indicate that many competitive CPD methods are only capable of processing univariate signals, a growing number can handle multivariate signals.

CPD methods can also be classified as either primarily model-based or similarity-based. We define model-based techniques as those that make specific probabilistic assumptions about underlying distributions of data, such as by characterizing changes in the values of known [chamroukhi_joint_2013] or learned [lee_time_2018] parameters. Model-based efforts originate from the landmark works by [wald_sequential_1947, page_continuous_1954, shewhart_economic_1931]. The pioneering use of state-space dynamical models [willsky_generalized_1976] formed the basis for work in hierarchical Bayesian models such as switching linear-dynamical systems (SLDS) [murphy_switching_1998]. A competitive model-based method in recent benchmarks [burg_evaluation_2020] is “Bayesian Online Change Point Detection” (BOCPD) [adams_bayesian_2007], which estimates the probability of change at each time via a switching state model that assumes within a segment that data arises IID from an exponential family likelihood. Other model-based efforts have pursued Bayesian nonparametric approaches using Gaussian processes [saatci_gaussian_2010] to avoid this restrictive IID assumption and achieve more flexible within-segment data models.

Generally, model-based methods for CPD are effective when the modelling assumptions hold and are able to capture the key characteristics of the signal change. In contrast, similarity-based methods employ a test statistic derived directly from computing some suitable similarity or “distance” between windows of data samples without underlying assumptions concerning the specific generating distribution of the data. Therefore, similarity-based methods can be applied even if the proper assumptions about transitions between segments or the data distribution within a single segment are unknown or not easily expressed in a tractable probabilistic model.

In this work, we pursue a similarity-based method capable of handling multivariate signals for unsupervised, multiple CPD where the number of changes is not assumed known. This setting is motivated by intended applications in human activity analysis in the modern era, where large volumes of highly sampled sensor data are affordable but annotation is prohibitively expensive and suitable probabilistic assumptions for this data are challenging. To address CPD in these contexts we propose a simple but effective method that has few hyperparameters and makes minimal assumptions on the data. In addition, though strictly non-causal, the proposed approach is appropriate for offline or online applications where some delay can be tolerated.

A common framework for similarity-based CPD is to compute a univariate statistic measuring the similarity between two windows of data to either side of a purported change point. Methods reliant on this framework process an entire signal by iteratively “sliding” the pair of adjacent windows forward in time over the data, computing a test statistic typically derived from the Empirical Distribution Functions (EDFs), Quantile Functions (QFs) or Quantile-Quantile (Q-Q) functions associated with the data in a given pair of windows. We can then apply hypothesis testing to this statistic where the null hypothesis posits that no change point exists between the current two windows and thus the empirical distributions of the adjacent windows are drawn from the same distribution. If the null is true, the derived statistic should be small in some sense. At a change point, the data in the two windows come from different distributions and thus the expected value of the derived statistic peaks. Therefore, it is common under this framework to only consider local maxima as candidate change points for hypothesis testing. If the distribution of the statistic’s value under the null hypothesis is known, the null can be rejected and a change point declared with a certain confidence should the peak value exceed some corresponding threshold. In contexts where the distribution under the null cannot be attained, this peak thresholding method can still be used to make detections but lacks formal guarantees.

In this similarity-based context, three broad classes of tests have been applied for CPD. The first class is likelihood ratio tests based on the KL-divergence [sugiyama_direct_2008, liu_change-point_2013]. The second class is non-parametric statistical tests, such as the Kolmogorov-Smirnov (KS), Cramer-von Mises, and Mann-Whitney statistics [hawkins_nonparametric_2010, ross_two_2012]. The last class is tests based on distance metrics between empirical probability distributions. Prominent among them is the family of integral probability metrics [sriperumbudur_empirical_2012], which includes the Maximum Mean Discrepancy (MMD) [gretton_kernel_2012], and the Wasserstein distance [sommerfeld_inference_2016], which has a Q-Q variant, the Wasserstein Quantile Test (WQT) [ramdas_wasserstein_2015, cheng_optimal_2020].

Two concepts central to the ideas in this paper are the notion of a distribution-free test and a peak-preserving transformation. Non-parametric tests that are distribution free have their probabilistic distribution of the test statistic under the null hypothesis independent of the distribution generating the data. Therefore, if a test is distribution-free, threshold values that correspond to rejection of the null at a fixed false-alarm rate can be applied regardless of the change in distribution to be detected; thus, one does not require methods such as density estimation to derive the distribution under the null. Peak-preservation applies to any method that preserves the expected value of the test statistic at a candidate change point. Since the change point statistic is expected to peak at a true change point, any peak-preserving post-processing method will maintain the statistical significance of the test statistic at the change points.

One issue that is not well studied in multiple change-point detection is the exact determination of change points once the appropriate statistic is computed. Because the test statistic time series is itself random (since it is a function of the underlying random observations), one tends to see multiple local maxima in the vicinity of a change resulting in either a large number of false alarms or the need for ad-hoc post processing to identify true change points. Current state-of-the-art methods simply consider local maxima above a specified threshold as change points [truong_selective_2020], [li_m-statistic_2015], and remove duplicate detections within a specified window [sugiyama_direct_2008]. However, it is clear that the sliding window methods produce a correlated test statistic where the effects of the change at a given point in time are spread over an interval which contains the change point.

Motivated by this fact, we draw on the classical signal processing idea of a matched filter [oppenheim_signals_2016] as a tool to better localize change points. More specifically, the asymptotic (as window length goes to infinity) forms of the expected “signatures” produced by sliding window methods are derived for the Wasserstein-1 Distance (W1-DT), Wasserstein Quantile Test (WQT), Sliced Wasserstein Quantile Test (SWQT), Maximum Mean Discrepancy squared (MMD), and Kolmogorov-Smirnov (KS) distance. Under a commonly-used assumption that the data in each segment is independent and identically distributed (IID) [adams_bayesian_2007, truong_selective_2020], we prove that the expected signature of each statistical test converges to a function that is independent (up to a scale factor) of the data distribution prior to and following the change point. Thus, distinct from the tests themselves being distribution-free, these filters are shown to be distribution-free in that they can be applied without knowing the distribution of the data. Using the asymptotically derived signature, we construct finite length filters in a manner that is peak-preserving. In summary of our main findings, the filters for the KS and W1-DT are piecewise linear while those of the WQT and MMD, which are based on a square distance, are quadratic.

Matched filters are generally known to be time-reversed versions of the signal to be detected, and are considered optimal with respect to signal-to-noise ratio when, for example, the noise is additive, white, and Gaussian. We do not claim that the stochasticity in the test statistic for these sliding window methods is of the form for which the matched filtering process is in any sense optimal. However, despite the formal lack of optimality and the fact that for real time series the data are not IID, we demonstrate empirically that the use of the matched filter simplifies the peak-detection process and improves CPD performance on simulated and real-world activity data.

Finally, exploration of the performance of these filters across the five non-parametric test statistics identified above brings to light interesting properties of tests based on the Q-Q function for the multiple change point problem. Specifically, it is well known that the Q-Q function is invariant to order-preserving transformations of the data [wasserman_all_2006], and we show that it is also highly sensitive to small changes in the support of the data. In the context of CPD, these properties allow Q-Q tests to detect relatively small changes in a manner that is practically independent of the overall scale of the data. Whether this characteristic is useful or a source of false alarms depends heavily on the underlying application, an issue that is examined in this work using both simulated and real-world data.

In summary, the main contributions of this paper are as follows:

  • We develop a principled methodology for deriving and applying matched filters for similarity-based change point detection. We prove that our proposed matched filters are distribution-free and peak-preserving, which preserves the guarantees provided by standard hypothesis tests when used to detect change points given the filtered signal.

  • We offer formal proofs deriving the asymptotically matched filter for four common statistical tests: the Kolmogorov-Smirnov test (KS), Maximum Mean Discrepancy squared (MMD), Wasserstein-1 distance (W1-DT), and Wasserstein quantile test (WQT). We also propose the sliced Wasserstein quantile test (SWQT) as a multivariate extension of the WQT.

  • We demonstrate empirical benefits for the above theoretical contribution in both simulation studies and real-world multivariate change point benchmarks. We show how matched filters with suitable finite-length approximations can deliver improved precision in change point localization as well as reductions in false positives, and importantly, remove any need for additional post-processing for duplicate detection removal common to other methods.

  • We provide insight into how the choice of test statistic impacts empirical performance, specifically highlighting differences in sensitivity between statistics based on the Quantile-Quantile (Q-Q) function and statistics derived from the Empirical Distribution Function (EDF) or Quantile Function (QF). These insights are justified through theory and demonstrated in simulated and real-world human activity datasets.

The remainder of this paper is organized as follows. Sec. II introduces the CPD problem and sets up the basic framework for CPD with statistical tests on sliding windows. In addition, we outline the main properties that differentiate Q-Q tests from other statistical tests. Sec. III-A motivates and outlines the simple approach of deriving and applying matched filters and states the main theorems for the asymptotically matched filters for the WQT, MMD, KS, and W1-DT tests. Details of each proof are left to the appendix. Sec. IV shows that the empirically computed matched filters and properties of Q-Q based tests match our theoretical results. Furthermore, we demonstrate the improvement in detection as shown through false alarm rates and the related metrics of precision and recall, evaluating on simulated data as well as real-world benchmarks based on human activity (HASC [ichino_hasc-pac2016:_2016], MASTRE [hussey_monitoring_2020]) and honeybee activity (Beedance) [oh_learning_2008].

Fig. 1: Result of change point detection using the WQT on sliding windows over the simulated data outlined in Sec. IV. The noisy unfiltered change point statistic can cause false detections and complicate localization of change points. The data alternates between two normal distributions. The detection threshold corresponds to rejection of the null with 95% confidence.

II Change Point Detection

II-A Problem Statement

Assume a time series $\{x_t\}$, where $x_t \in \mathcal{X}$ and $\mathcal{X}$ represents a compact set, is constructed with the following model.

  1. The data consists of distinct time segments $[\tau_k, \tau_{k+1})$ with $\tau_k < \tau_{k+1}$, such that within each time segment, the $x_t$ are IID samples from a fixed but unknown distribution.

  2. The distributions in successive time segments are different, but in general two non-adjacent segments can have samples from the same distribution.

The time points $\{\tau_k\}$ are referred to as the change points. Given these conditions, the problem of unsupervised Change Point Detection (CPD) is to estimate the (possibly empty) set of change points purely from the provided data without any information or assumptions about the number or location of change points.

II-B Notation and Background

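A set of $n$ total samples, $\{x_i\}_{i=1}^{n}$, each drawn IID from some distribution $P$, has an associated Empirical Distribution Function (EDF), $F_n$, and Quantile Function (QF), $F_n^{-1}$, defined as,

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(x_i \le x), \qquad F_n^{-1}(u) = \inf\{x : F_n(x) \ge u\}, \quad u \in [0,1], \tag{1}$$

where the indicator function is,

$$\mathbb{1}(x_i \le x) = \begin{cases} 1 & x_i \le x \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

Given another set of $m$ IID samples, $\{y_j\}_{j=1}^{m}$, drawn from distribution $Q$ and with EDF $G_m$, the Quantile-Quantile (Q-Q) function is as follows,

$$u \mapsto G_m\big(F_n^{-1}(u)\big), \quad u \in [0,1]. \tag{3}$$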

The EDF, QF and Q-Q functions represent stochastic processes on their respective domains [billingsley_convergence_1999]. Thus in this work, almost sure convergence and weak convergence (convergence in distribution) refer to convergence of stochastic processes. While the relevant background is covered in the appendix, we point the readers to additional works [vaart_asymptotic_1998, van_der_vaart_weak_1996, billingsley_convergence_1999, pollard_convergence_2012] for more comprehensive coverage in this area.
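The three objects defined in this subsection can be made concrete with a few lines of code. The following is a minimal sketch using only the Python standard library; the function names `edf`, `quantile_function`, and `qq_function` are hypothetical, and the Q-Q convention used (the EDF of the second sample evaluated at the quantile function of the first) is an assumption, since either ordering appears in the literature.

```python
from bisect import bisect_right
import math


def edf(samples):
    """Return the EDF x -> (1/n) * #{x_i <= x} of the given samples."""
    xs = sorted(samples)
    n = len(xs)
    return lambda x: bisect_right(xs, x) / n


def quantile_function(samples):
    """Return the empirical QF u -> inf{x : F_n(x) >= u} for u in [0, 1]."""
    xs = sorted(samples)
    n = len(xs)

    def qf(u):
        # Smallest order statistic whose EDF value reaches u.
        k = max(1, math.ceil(u * n))
        return xs[k - 1]

    return qf


def qq_function(x_samples, y_samples):
    """Return the Q-Q function u -> G_m(F_n^{-1}(u)) (assumed convention)."""
    f_inv = quantile_function(x_samples)
    g = edf(y_samples)
    return lambda u: g(f_inv(u))
```

When both samples coincide, the Q-Q function evaluated at a grid point `k/n` returns `k/n`, which is the sample analogue of matching the uniform distribution on [0, 1]; when the first sample lies entirely below the second, it is identically zero.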

II-C Quantile-Quantile Tests

On real-valued data, many of the two-sample tests discussed in this paper directly compare EDFs or QFs (Tab. I). In the Q-Q case, the comparison is made to the uniform distribution. Indeed, when two distributions are equal, as under the CPD null, the resulting Q-Q function matches the uniform distribution on $[0,1]$. Thus, we designate statistics of this form as Q-Q tests.

Since the Q-Q function is non-decreasing from $[0,1]$ to $[0,1]$, Q-Q tests are inherently bounded. The maximum is achieved when the open intervals covering the respective supports of the two distributions are disjoint, as shown in Appx. -G for the WQT. Q-Q functions are also known to be invariant to any transformation on the data that is order-preserving, or in other words, monotonically increasing [wasserman_all_2006], such as positive affine transforms.

The bounded property of the Q-Q function makes Q-Q based tests particularly sensitive to small shifts in support. In addition, the invariance of Q-Q functions to order-preserving transformations makes Q-Q tests insensitive to such transformations of the data. More precisely, the Q-Q test statistic will be identical for a given time series $\{x_t\}$ and its transformed version $\{\phi(x_t)\}$, where $\phi$ is an order-preserving function. For Q-Q tests applied to multiple change point detection problems, these properties imply that a single threshold can be used to detect changes of widely varying magnitudes, at least those changes induced by shifts in support or order-preserving transformations. We verify this claim using simulated data in Section IV-B. In the case of real-world data, we discuss how the empirical results in Section IV-C may be interpreted in light of the properties of Q-Q tests.
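Both properties can be checked numerically on a toy example. The sketch below uses a simple discretized squared Q-Q discrepancy, `qq_statistic`, which is a hypothetical stand-in written in the spirit of the WQT and is not the paper's exact statistic or scaling. Because it depends on the data only through the relative ordering of the two samples, it is unchanged by any strictly increasing transformation, and it saturates once the supports are disjoint.

```python
from bisect import bisect_right
import math


def qq_statistic(x, y):
    """Mean squared gap between G_m(x_(i)) and i/n over the order statistics of x.

    Illustrative stand-in for a Q-Q based test (assumption, not the paper's WQT).
    """
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    total = 0.0
    for i, v in enumerate(xs, start=1):
        g = bisect_right(ys, v) / m  # EDF of y evaluated at the i-th order statistic of x
        total += (g - i / n) ** 2
    return total / n


x = [0.1, 0.5, 0.9, 1.3]
y = [0.3, 0.7, 1.1, 2.0]

# Invariance: a strictly increasing map applied to both samples leaves the value unchanged.
same_under_exp = qq_statistic(x, y) == qq_statistic(
    [math.exp(v) for v in x], [math.exp(v) for v in y]
)

# Boundedness: once the supports are disjoint, moving them further apart changes nothing.
near = qq_statistic([1, 2, 3, 4], [10, 11, 12, 13])
far = qq_statistic([1, 2, 3, 4], [1000, 1001, 1002, 1003])
```

Here `same_under_exp` is `True` and `near == far`, which illustrates why a single threshold can respond to changes of very different magnitudes.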

Fig. 2: Diagram of the assumed modeling setting for the test statistic response as a function of time around a true change point. Top: The observed signal is assumed to arise from two distinct distributions, $P$ and $Q$, with change point at time $\tau$. Before $\tau$, data is sampled IID from $P$. After $\tau$, data is IID from $Q$. Bottom: Our sliding window framework computes a test statistic between adjacent windows of size $n$ at each time $t$. We consider how the expected value of the statistic changes as $t$ moves right from the true change point $\tau$ to $\tau + n$ (without loss of generality; moving left is the same as moving right assuming the proposed test is symmetric). The left window represents a mixture of samples from $P$ and $Q$ with mixture proportion $\alpha$. The right window will be purely sampled from $Q$. Takeaway: Formally characterizing the test statistic response as a function of $\alpha$ is the goal of our asymptotically matched filter.

II-D Statistical Tests for Change Point Detection

Our general framework for CPD generates a test statistic by sliding adjacent windows of a constant size and computing a two-sample test from samples within each window (Fig. 2).

At each time $t$, we define two windows of samples of size $n$: one from the samples to the left of $t$ and the other from the samples to the right of $t$, each with its own empirical distribution. A statistical test is then applied to these two windows. We use the notation $T$ to represent a general statistical test. This can be substituted for the various specific statistical tests defined in Sec. III.
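The sliding window framework described above can be sketched as follows, using the two-sample KS distance as the example test. This is a minimal illustration with the standard library only; the names `ks_statistic` and `sliding_statistic` are hypothetical, and leaving the statistic at zero where a full pair of windows is unavailable is an assumption made for simplicity.

```python
from bisect import bisect_right


def ks_statistic(x, y):
    """Two-sample KS distance: max |F_n(v) - G_m(v)| over the pooled sample points."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    return max(
        abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m) for v in xs + ys
    )


def sliding_statistic(data, n, test=ks_statistic):
    """stat[t] compares the left window data[t-n:t] with the right window data[t:t+n].

    Entries without two full windows are left at 0.0 (assumption).
    """
    N = len(data)
    stat = [0.0] * N
    for t in range(n, N - n + 1):
        stat[t] = test(data[t - n:t], data[t:t + n])
    return stat


# Noise-free toy signal with a single change at index 10.
data = [0.0] * 10 + [1.0] * 10
stat = sliding_statistic(data, n=5)
```

On this toy signal the statistic equals 1 at the change point and falls off linearly on either side (for example 0.6 two steps away with `n = 5`), which is the kind of temporal signature the matched filter is designed around.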

The nominal approach for identifying change points given a test statistic is to label local maxima of a computed statistic above some threshold parameter [li_m-statistic_2015]. However, as is evident in Fig. 1, randomness in the statistic can adversely affect CPD methods that use thresholds for detection. Furthermore, in the presence of change points, multiple peaks complicate the exact localization of change points. These problems, along with the fact that sliding windows produce a correlated statistic, motivate the application of matched filtering of the test statistic for CPD.

III Matched Filters for Statistical Tests

III-A Asymptotically Matched Filters

As shown in Fig. 2, for the sliding window framework with constant window size $n$, the effect of a change point located at $\tau$ will be reflected in the test statistic on the interval $[\tau - n, \tau + n]$. Consequently, the matched filter is derived from the expected response of the test statistic on this interval.

In this change point scenario, samples are assumed to be drawn IID from a distribution $P$ for $t < \tau$, and from another distribution $Q$ for $t \ge \tau$. We then generate the EDFs from adjacent sliding windows and the test statistic as described in Section II-D. Data in the windows that span the change point can be modeled as coming from a mixture distribution between $P$ and $Q$.

Without loss of generality (as long as the test statistic $T$ is shown to be symmetric), we consider the case starting at $t = \tau$ and sliding the window to the right such that the change point is located in the left set of samples. In this setup, samples from the left window can be modeled as IID samples from a mixture of the pre-change distribution $P$ and the post-change distribution $Q$, while the distribution of the samples in the right window remains constant, $Q$. We redefine these distributions in terms of a mixture parameter $\alpha \in [0,1]$, where $\alpha = 0$ corresponds to $t = \tau$. In this mixture view, our left window and right window distributions are,

$$P_\alpha = (1-\alpha)\,P + \alpha\,Q, \qquad Q_\alpha = Q. \tag{4}$$

This reformulation allows us to analyze the expected response of the statistic asymptotically, as the window size $n$ goes to infinity. In this case, the possible values of $\alpha$ become countably infinite on the interval $[0,1]$.

Given that the EDFs $F_{\alpha,n}$ and $G_n$ are derived from $n$ samples drawn IID from the left and right window distributions defined in (4), with mixture parameter $\alpha$, we prove that for each statistic $T$, as $n \to \infty$ for each $\alpha \in [0,1]$, the expected value of the statistic converges to a deterministic function of the following form,

$$\mathbb{E}\big[T(F_{\alpha,n}, G_n)\big] \to C(P,Q)\,h(\alpha). \tag{5}$$

Here, $C(P,Q)$ is a constant for a given pre-change distribution $P$ and post-change distribution $Q$, while $h(\alpha)$ is only a function of $\alpha$. From (5) we can conclude two main properties. First, the time-dependent component of the response, that is $h(\alpha)$, is independent of the distributions $P$ and $Q$. Second, $P$ and $Q$ impact the response only through a constant scale factor. These two properties allow us to construct a matched filter whose shape is distribution-free and peak-preserving, which we precisely define in Sec. III-B.

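Moving back to the case where the window size $n$ is finite, and writing $T_t$ as shorthand for the change point statistic at time $t$, (5) suggests that the expected value of the test statistic around a change point at $\tau$ for $t \in [\tau, \tau + n]$ can be approximated as,

$$\mathbb{E}[T_t] \approx C(P,Q)\,h\!\left(\frac{t-\tau}{n}\right). \tag{6}$$

Since the test statistic is symmetric in its arguments, (5) holds when the two distributions are reversed, which corresponds to the analogous setup with the windows slid to the left. Therefore, the response of the statistic is mirrored about $\tau$.

Therefore, with a slight abuse of notation, we can define the matched filter $h[k]$,

$$h[k] = h\!\left(\frac{|k|}{n}\right), \qquad k = -n, \dots, n. \tag{7}$$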

| Test | WQT | SWQT | W1-DT | KS | MMD |
| Type of Test | Q-Q | Q-Q | QF, EDF | EDF | N/A |
| Distribution Under Null | DF [ramdas_wasserstein_2015] | DF | not DF [sommerfeld_inference_2016] | DF [vaart_asymptotic_1998] | not DF [gretton_kernel_2012] |
| Data Dimension | | | | | |
| Matched filter (our contribution) | | | | | |

TABLE I: Summary table comparing the two-sample tests discussed in this paper. EDF: empirical distribution function, QF: quantile function, Q-Q: quantile-quantile, DF: distribution-free.

III-B Peak-Preserving, Distribution-Free Matched Filters

The matched filter $h$ defined above can be applied to the test statistic signal $T_t$ to produce a peak-preserving filtered signal $\tilde{T}_t$ suitable for CPD:

$$\tilde{T}_t = c\,(h * T)_t. \tag{8}$$

Here, $c$ is a scalar factor ensuring that the filter is in fact peak-preserving and $*$ denotes the convolution operation, with boundary conditions handled through zero-padding.

We note that this filtering process is distribution-free since (8) does not depend on the distributions before or after the change. This means that the same matched filter applies regardless of the distributions defining the signal change and can be applied globally without any knowledge of a probabilistic model of the change point.

It also follows from (8) that, since the expected value of the test statistic in the region of a change point reflects (5), the expected value of the filtered statistic at the change point equals that of the unfiltered statistic. Thus the resulting peak value of the statistic at the change point is preserved in expectation through the filtering process. This peak-preservation property is important for statistical tests where the resulting values are compared to a threshold in order to reject the null hypothesis with a certain statistical confidence.

With the application of the matched filter, change points are detected at local maxima above a threshold of the filtered statistic, where no further post-processing of the local maxima is required. The proposed algorithm is detailed in full in Alg. 1.

In the next few sections, we state the main theorem regarding the asymptotic expected value around a change point for each of the statistical tests of concern in this paper. Proofs of these theorems are provided in Appxs. -B-F.

Input :  data,
window size
detection threshold
statistical test, with
corresponding matched filter
Output : : change point statistic
: change points
for  do
      
      
      
end for
Algorithm 1 Matched Filtered Statistical CPD
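The filtering and detection steps summarized in Alg. 1 can be sketched as below. This is an illustrative reading rather than the authors' implementation: the filter shape `(1 - |k|/n) ** power` follows the description of piecewise linear filters for KS and W1-DT and quadratic filters for WQT and MMD, but its exact form is an assumption, as is the scale `1 / sum(h^2)`, which is simply the factor that returns the original peak when the input is an exact scaled copy of the filter. The function names are hypothetical.

```python
def matched_filter(n, power=1):
    """Finite-length filter h[k] = (1 - |k|/n) ** power for k = -n..n (assumed form).

    power=1 gives a piecewise linear filter, power=2 a quadratic one.
    """
    return [(1 - abs(k) / n) ** power for k in range(-n, n + 1)]


def apply_matched_filter(stat, h):
    """Zero-padded filtering of stat with the symmetric filter h.

    The result is scaled by 1 / sum(h^2) so that an ideal signature C * h
    centred at a change point is mapped back to the value C there.
    """
    n = (len(h) - 1) // 2
    scale = 1.0 / sum(v * v for v in h)
    out = []
    for t in range(len(stat)):
        acc = 0.0
        for k in range(-n, n + 1):
            if 0 <= t + k < len(stat):  # samples outside the signal count as zero
                acc += h[k + n] * stat[t + k]
        out.append(scale * acc)
    return out


def detect_change_points(filtered, threshold):
    """Indices of local maxima of the filtered statistic that exceed the threshold."""
    return [
        t
        for t in range(1, len(filtered) - 1)
        if filtered[t] > threshold
        and filtered[t] > filtered[t - 1]
        and filtered[t] >= filtered[t + 1]
    ]


# Ideal, noise-free signature of height 0.8 centred at index 15.
n = 5
h = matched_filter(n, power=1)
stat = [0.0] * 31
for k in range(-n, n + 1):
    stat[15 + k] = 0.8 * h[k + n]

filtered = apply_matched_filter(stat, h)
change_points = detect_change_points(filtered, threshold=0.4)
```

In this idealized case the filtered value at index 15 is again 0.8 and it is the only detection, so no duplicate-removal step is needed afterwards.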

III-C Wasserstein-1 Distance Test (W1-DT)

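Given two probability distributions $P, Q$ on $\mathcal{X}$, the Wasserstein-p distance is defined as,

$$W_p(P,Q) = \left( \inf_{\pi \in \Pi(P,Q)} \int_{\mathcal{X} \times \mathcal{X}} \|x - y\|^p \, d\pi(x,y) \right)^{1/p}, \tag{9}$$

where $\Pi(P,Q)$ denotes the set of all joint distributions with marginals $P, Q$. For real-valued data and $p = 1$, the Wasserstein-1 distance has the closed form [santambrogio_optimal_2015, peyre_computational_2018],

$$W_1(P,Q) = \int_{\mathbb{R}} |F(x) - G(x)| \, dx = \int_0^1 |F^{-1}(u) - G^{-1}(u)| \, du, \tag{10}$$

where $F$ and $G$ are the distribution functions of $P$ and $Q$.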

We note the following theorem.

Theorem.

1: Let , be derived from IID samples drawn from , and respectively, where are continuous on a compact domain, and constant , then

(11)

Assuming that the samples live on a compact set, the test statistic is bounded. Thus it follows from the Portmanteau theorem [billingsley_convergence_1999] that , which has the form of (5). Thus by the process described in Sec. III-A the matched filter for the operational case when is sufficiently large (but finite) is a piecewise linear function of , .
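As an informal, population-level illustration of why a linear signature is plausible for the W1-DT, consider the one-dimensional closed form of the Wasserstein-1 distance as the integral of the absolute difference of distribution functions. This is not the paper's proof, which treats the empirical statistic asymptotically. The symbols $F$, $G$ (distribution functions before and after the change) and the mixture $F_\alpha = (1-\alpha)F + \alpha G$ for the left window are assumptions introduced here for illustration.

```latex
\begin{aligned}
W_1(F_\alpha, G)
  &= \int_{\mathbb{R}} \bigl| (1-\alpha)F(x) + \alpha G(x) - G(x) \bigr| \, dx \\
  &= (1-\alpha) \int_{\mathbb{R}} \bigl| F(x) - G(x) \bigr| \, dx \\
  &= (1-\alpha)\, W_1(F, G), \qquad \alpha \in [0,1].
\end{aligned}
```

The distribution-dependent part, $W_1(F, G)$, enters only as a scale factor, while the dependence on $\alpha$ is linear. The same factorization holds for the supremum in the KS distance, which is consistent with the piecewise linear filters reported for both tests.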

III-D Wasserstein Quantile Test (WQT)

The WQT is a distribution-free variant of the Wasserstein distance that measures the Wasserstein distance of the Q-Q function to the uniform measure [ramdas_wasserstein_2015],

(12)

For the WQT we note the following theorem.

Theorem.

2: Let , be derived from IID samples drawn from , and respectively, where are continuous on a compact domain, and , then

(13)

Analogous to Sec. III-C, assuming that the samples live on a compact set, the test statistic is bounded. Thus it follows from the Portmanteau theorem that , which has the form of (5). Then, by process described in Sec. III-A, the matched filter for the WQT in the operational case when is finite is .

Thm. 2  adds a factor to the definition of the WQT in (III-D) that removes a distribution dependent term in the WQT. In the operational case, we approximate this term by considering the case where , where the term would have the most impact. This acts as a constant bias term that is removed from the signal prior to matched filter convolution when using the WQT. Details of this can be found in Appx. -C.

III-E Sliced Wasserstein Quantile Test (SWQT)

Since the WQT is only defined in one dimension, the naive approach for an extension to multiple dimensions is to average the WQT across each dimension independently. Alternatively, we propose the sliced Wasserstein quantile test (SWQT), which, similarly to the sliced Wasserstein distance [bonneel_sliced_2015], averages the WQT over one-dimensional projections of the data. Given two sets of samples, the SWQT is,

(14)

where is the uniform measure on - the unit sphere in , and ,

are the respective EDFs computed from the projections of the samples on the unit vector

.

With this definition we state the following,

Theorem.

3: Let sets each consisting of samples drawn IID from and respectively, where are continuous on a compact domain, and . Then,

(15)

Assuming that the samples live on a compact set, the test statistic is bounded. Thus it follows from the Portmanteau theorem that , which has the form of (5). Then, by process described in Sec. III-A, the matched filter for the SWQT in the operational case when is finite is .

As with the WQT in Sec. III-D, Thm. 3  is stated with a factor that removes an term. This term is approximated to by considering the case of where . This acts as a constant bias (and is identical to the bias of the WQT) which is removed from the signal prior to matched filter convolution. Details of this can be found in Appx. -D.

III-F Kolmogorov-Smirnov (KS)

The two-sample KS test [massey_kolmogorov-smirnov_1951] computes the maximum deviation between the respective empirical distribution functions $F_n$ and $G_n$ of the two samples,

$$KS(F_n, G_n) = \sup_{x} |F_n(x) - G_n(x)|. \tag{16}$$

Under continuity assumptions on the underlying distributions, it is known to be distribution-free under the null hypothesis [pratt_concepts_1981].

We note the following theorem for the KS test.

Theorem.

4: Let , be derived from IID samples drawn from , and respectively, where are continuous on a compact domain, and . Then,

(17)

Assuming that the samples live on a compact set, the test statistic is bounded. Thus it follows from the Portmanteau theorem that , which has the form of (5). Then, by process described in Sec. III-A, the matched filter for the KS test in the operational case when is finite is .

III-G Maximum Mean Discrepancy Squared (MMD)

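The MMD between two distributions $P$, $Q$ represents the largest difference in expectations over functions in the unit ball of a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ with kernel $k$,

$$MMD(P,Q) = \sup_{\|f\|_{\mathcal{H}} \le 1} \big( \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)] \big).$$

We will assume that the RKHS is universal. In this case the MMD is a metric [gretton_kernel_2012].

In this work we will consider the squared MMD statistic for CPD. Given two sets of $n$ samples, $\{x_i\}_{i=1}^{n}$ and $\{y_i\}_{i=1}^{n}$, sampled IID from $P$ and $Q$ respectively, the squared MMD has an unbiased estimator [gretton_kernel_2012] given by,

$$\frac{1}{n(n-1)} \sum_{i \ne j} \big[ k(x_i, x_j) + k(y_i, y_j) \big] - \frac{2}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} k(x_i, y_j). \tag{18}$$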
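For concreteness, an unbiased estimator of the squared MMD can be sketched as follows for scalar samples. The Gaussian kernel, its `bandwidth` parameter, and the function names are assumptions made for illustration, and the particular unbiased form used here (within-sample sums over distinct pairs, cross-sample sum over all pairs) is one standard variant that may differ from the exact expression used in the paper.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars; the bandwidth value is an assumption."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd2_unbiased(x, y, kernel=gaussian_kernel):
    """Unbiased estimate of the squared MMD between samples x and y.

    Within-sample terms exclude i == j; the cross term uses all pairs.
    """
    n, m = len(x), len(y)
    kxx = sum(kernel(x[i], x[j]) for i in range(n) for j in range(n) if i != j)
    kyy = sum(kernel(y[i], y[j]) for i in range(m) for j in range(m) if i != j)
    kxy = sum(kernel(a, b) for a in x for b in y)
    return kxx / (n * (n - 1)) + kyy / (m * (m - 1)) - 2.0 * kxy / (n * m)


separated = mmd2_unbiased([0.0, 0.1, 0.2], [10.0, 10.1, 10.2])
identical = mmd2_unbiased([0.0, 1.0], [0.0, 1.0])
```

For well-separated samples the estimate is close to its maximum of 2 for this kernel, while for identical samples it can be negative, a known consequence of unbiasedness rather than a bug.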

Note that under the null hypothesis, the (appropriately scaled) limiting distribution of this unbiased estimator is not distribution free [gretton_kernel_2012]. We have the following theorem.

Theorem.

5: Let sets each consisting of samples drawn IID from and respectively, where are continuous on a compact domain, and where and . Then,

(19)

Since (19) has the form of (5), by the process described in Sec. III-A, the matched filter for the MMD distance in the operational case when is finite is .
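The filtering step common to all of the tests above can be sketched as a correlation of the test statistic sequence with a template signature. The triangular template and the energy normalisation below are assumptions made purely for illustration; they are not the filters derived in the theorems, whose exact forms depend on the test.

```python
def triangular_template(n):
    """Hypothetical signature: rises linearly to 1 over n steps, then falls."""
    return [1.0 - abs(k - n) / n for k in range(2 * n + 1)]


def matched_filter(stat, template):
    """Correlate stat with template, normalised by the template energy.

    With this normalisation, a noiseless copy of the template scaled by A
    produces an output peak of exactly A at the centre of the signature.
    """
    energy = sum(h * h for h in template)
    half = len(template) // 2
    out = []
    for t in range(len(stat)):
        acc = 0.0
        for k, h in enumerate(template):
            idx = t + k - half
            if 0 <= idx < len(stat):  # treat samples outside the series as zero
                acc += stat[idx] * h
        out.append(acc / energy)
    return out
```

Normalising by the template energy is one simple way to keep the filtered peak on the same scale as the unfiltered statistic, which is the behaviour the peak-preserving property asks for.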

Fig. 3: Empirical results of simulating the filter for the W1-DT (top left), WQT scaled by (top middle), SWQT scaled by (top right), KS (bottom left) and MMD (bottom right) test statistics as a function of mixture parameter for various window sizes , along with expected asymptotic result (black).

IV Evaluation

The only algorithmic hyperparameters required for the non-parametric statistical change point methods described in this paper are the window size and the detection threshold parameter . For each experiment, we compare the filtered and unfiltered versions of each statistical test under the same window parameter. For all real data experiments, we use domain knowledge of the frequency at which change points should be detected to set the window size. Since not all tests have the same statistical guarantees, we evaluate using metrics that vary the threshold parameter over all possible values.

Early applications of change point detection in failure detection focused on average run length and detection delay as the key metrics of CPD [tartakovsky_asymptotically_2018, veeravalli_quickest_2012]. Recent applications frame the problem of multiple CPD as a classification problem at each time step (change point vs. no change point) with a severe class imbalance, as only a small fraction of time steps are regarded as true change points. Here we follow the work of [burg_evaluation_2020] in moving toward metrics based on the confusion matrix. We also consider performance over a range of thresholds to allow the end-user maximum control of these trade-offs. To these ends, when possible, we report the full Precision-Recall (PR) curve, as well as the Area Under the PR Curve (AU-PRC) and the best-F1 score (harmonic mean of precision and recall) across all threshold values.

One motivation for the matched filter approach is to disambiguate multiple local maxima near potential change points. Prior CPD works address this issue by removing from consideration duplicate peaks that are within a pre-specified distance of one another [liu_change-point_2013], keeping only the highest peak. For a fair comparison to unfiltered methods including prior work, we apply this post-processing to all unfiltered methods, but not to the matched filtered methods. For clarity, all methods where duplicate local maxima are removed are labeled with (-), and all matched filtered methods where no such post-processing is applied are labeled as (F-).
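The duplicate-removal post-processing applied to the unfiltered methods can be sketched as follows. This is a greedy variant written for illustration; the exact procedure used in prior work may differ, and the names `local_maxima`, `suppress_duplicates` and `min_dist` are our own.

```python
def local_maxima(stat, threshold):
    """Indices of interior local maxima of stat that exceed threshold."""
    return [t for t in range(1, len(stat) - 1)
            if stat[t] > threshold
            and stat[t] >= stat[t - 1] and stat[t] > stat[t + 1]]


def suppress_duplicates(stat, peaks, min_dist):
    """Keep the highest peaks; drop any peak within min_dist of a kept one."""
    kept = []
    # Visit peaks from highest to lowest so a higher peak always wins.
    for p in sorted(peaks, key=lambda t: stat[t], reverse=True):
        if all(abs(p - q) > min_dist for q in kept):
            kept.append(p)
    return sorted(kept)
```

For example, with peaks of heights 1, 3, 2 and 1.5 at indices 1, 3, 5 and 10 and `min_dist=2`, only the peaks at indices 3 and 10 survive.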

Thus, for evaluation purposes, we include two additional parameters; is the minimum distance between detected change points applied to the unfiltered methods, and defines the tolerated distance for scoring detected change points. Specifically, a detected change point is a True Positive (TP) if there exists a true change point within samples, otherwise it is considered a False Positive (FP). False Negatives (FN) are true change points that do not contain any detected change point within samples. Precision (P), Recall (R), and F1 are then defined as,

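$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad \mathrm{F1} = \frac{2 P R}{P + R} \tag{20}$$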
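The scoring rule described above can be written as a short sketch. It follows the text literally: detections are counted as TP or FP, and true change points with no nearby detection are counted as FN. Treating "within" as an inclusive bound and returning zero for empty cases are our assumptions, as are the names `score_detections` and `tol`.

```python
def score_detections(detected, truth, tol):
    """Precision, recall and F1 for detected change points.

    A detection is a TP if some true change point lies within tol samples,
    otherwise an FP. A true change point with no detection within tol
    samples is an FN.
    """
    tp = sum(1 for d in detected if any(abs(d - c) <= tol for c in truth))
    fp = len(detected) - tp
    fn = sum(1 for c in truth
             if not any(abs(d - c) <= tol for d in detected))
    precision = tp / (tp + fp) if detected else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Sweeping the detection threshold and recomputing these quantities at each value traces out the PR curve from which AU-PRC and best-F1 are obtained.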

IV-A Simulation Data

First, we verify our proposed matched filters with simulated data. Given two known distributions and , the exact mixture scenario in Fig. 2 is simulated where and . The test statistic is computed for all values of the mixture parameter and for various window sizes . Each test was averaged over 100 repetitions using different random seeds.

Next, we validate the benefits of matched filters for change point detection on simulated sequences using AU-PRC and best-F1 metrics. Separate evaluations are performed for tests defined for scalar and vector valued data. For the scalar case, we generate 40 IID data sequences of length 800 with a single change point uniformly distributed between 300 and 500. Samples prior to the change point are drawn from distribution , whereas samples after the change point are drawn from .
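A minimal sketch of the scalar simulation and of a sliding-window two-sample statistic is given below. The two Gaussian distributions are placeholders chosen by us, since any pair of distinct distributions fits the setup, and the statistic shown is the plain empirical Wasserstein-1 distance for equally sized samples, which may differ in scaling from the W1-DT of Sec. III-C.

```python
import random


def w1_equal_size(a, b):
    """Empirical Wasserstein-1 distance between equally sized scalar samples."""
    return sum(abs(u - v) for u, v in zip(sorted(a), sorted(b))) / len(a)


def sliding_statistic(x, n, stat=w1_equal_size):
    """Compare the n samples before each time t with the n samples from t on."""
    return {t: stat(x[t - n:t], x[t:t + n])
            for t in range(n, len(x) - n + 1)}


def simulate_sequence(length=800, lo=300, hi=500, seed=0):
    """IID sequence with one change point drawn uniformly from [lo, hi]."""
    rng = random.Random(seed)
    cp = rng.randint(lo, hi)
    # Placeholder distributions: N(0, 1) before the change, N(1, 1) after.
    x = [rng.gauss(0.0, 1.0) if t < cp else rng.gauss(1.0, 1.0)
         for t in range(length)]
    return x, cp
```

As the two windows slide across the change point, one of them contains a mixture of the two distributions, which is what produces the temporal signature that the matched filter targets.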

For the multivariate case, an identical simulation setup is used but and are defined with a shared covariance and a difference in mean:

The SWQT is computed via Monte Carlo simulations by randomly sampling vectors , and averaging the results over each linear projection. The Gaussian kernel with unit variance is used for computation of the MMD. With these datasets, we compare the performance between the filtered and unfiltered test statistics for various window lengths and evaluate with parameters .
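The Monte Carlo projection step can be sketched as below for an arbitrary univariate two-sample statistic. Drawing the projection directions uniformly from the unit sphere is an assumption on our part, as are the names `sliced_statistic`, `n_proj` and the toy statistic `mean_gap`.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Direction drawn uniformly from the unit sphere (normalised Gaussian)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sliced_statistic(X, Y, univariate_stat, n_proj=50, seed=0):
    """Average a univariate statistic over random one-dimensional projections."""
    rng = random.Random(seed)
    dim = len(X[0])
    total = 0.0
    for _ in range(n_proj):
        theta = random_unit_vector(dim, rng)
        px = [sum(t * c for t, c in zip(theta, x)) for x in X]
        py = [sum(t * c for t, c in zip(theta, y)) for y in Y]
        total += univariate_stat(px, py)
    return total / n_proj


def mean_gap(a, b):
    """Toy univariate statistic used only to exercise the sketch."""
    return abs(sum(a) / len(a) - sum(b) / len(b))
```

Any of the scalar tests in Sec. III could be passed as `univariate_stat`; the number of projections trades computation for variance in the averaged statistic.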

Finally, the difference in the regions of sensitivity of the Q-Q tests versus the scalability of non-Q-Q tests is illustrated. First we compute the WQT and W1-DT for two uniform distributions, and for , modeling the behavior of these tests for distributions with shifting supports. Then, we verify the invariance of the WQT to order-preserving transformations by simulating a time series with a sequence of 4 distributions, , , , each with 500 samples. We note that by construction, the relative scale at each change point is equal, where each successive segment models the data being scaled by a factor of 2. Thus, if represents a random variable in the first segment, a random variable in the following three segments would be respectively. Two additional scenarios are considered; one leaves the data as is, and the other transforms the data by a cubic , which is a monotonically increasing, order-preserving function. We then compute the filtered change point statistic using a window of samples comparing the WQT and the W1-DT. The test results were aggregated over 10 independent iterations using AU-PRC as a measure of change point performance.
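The effect of an order-preserving transformation can be illustrated with a small sketch. A monotonically increasing map such as the cubic leaves the ranks of the pooled sample unchanged, so any statistic that depends on the data only through those ranks is unaffected, whereas a distance computed on the raw values is not. Using pooled ranks as a stand-in for the Q-Q function is our simplification, and the sample values are arbitrary.

```python
def pooled_ranks(a, b):
    """Rank of each sample within the pooled sorted sample (no ties assumed)."""
    pooled = sorted(a + b)
    return [pooled.index(v) for v in a], [pooled.index(v) for v in b]


def w1_equal_size(a, b):
    """Empirical Wasserstein-1 distance between equally sized scalar samples."""
    return sum(abs(u - v) for u, v in zip(sorted(a), sorted(b))) / len(a)


def cube(v):
    """Monotonically increasing, order-preserving transformation."""
    return v ** 3


a = [0.5, 1.0, 1.5, 2.0]
b = [1.2, 2.4, 3.6, 4.8]
a3, b3 = [cube(v) for v in a], [cube(v) for v in b]

ranks_raw = pooled_ranks(a, b)
ranks_cubed = pooled_ranks(a3, b3)   # identical to ranks_raw
w1_raw = w1_equal_size(a, b)         # distance on the original scale
w1_cubed = w1_equal_size(a3, b3)     # much larger after the cubic map
```

The ranks agree before and after the transformation while the Wasserstein-1 value changes by more than an order of magnitude, which mirrors the contrast between the two families of tests.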

IV-B Simulated Results

The plots in Fig. 3 confirm the results from our theorems and show the convergence of the signature for each of the statistical tests to the expected functional form. Generally speaking, even sample sizes on the order of show convergence toward the expected signature.

The simulated change point tests on (Tab. II) and (Tab. III) show that applying our proposed matched filters to the corresponding test statistic improves the AU-PRC in every configuration and the best-F1 in nearly every configuration, thus resulting in an improved true positive to false positive ratio. As expected, when window size increases, overall detection performance also increases. In this controlled setting, the performance across all four statistical tests is comparable in the univariate case. However, in the two-dimensional case the SWQT has a better overall performance compared to the MMD across all window sizes.

Since the data matches all of our assumptions, we are also able to empirically verify the peak-preserving property of the matched filter as shown in Fig. 1, where the data is generated by sampling IID alternating between two normal distributions with a window size of .

The differences between the WQT (Q-Q based) and the W1-DT (EDF/QF based) for CPD are highlighted using simulated data in Fig. 4. In the simulated time series, at each successive segment where the mean and standard deviation of the data are doubled, the WQT detects each change with essentially the same magnitude; thus a single threshold suffices to detect these three change points, as the relative change is constant. Conversely, the change point response of the W1-DT scales with the absolute magnitude of the change in the signal. Furthermore, since the Q-Q function is invariant to order-preserving transformations, the change point statistic of the WQT under the cubic transform on the data remains identical, but the W1-DT statistic shifts drastically.

| Test | AU-PRC (n=50) | AU-PRC (n=100) | AU-PRC (n=150) | Best-F1 (n=50) | Best-F1 (n=100) | Best-F1 (n=150) |
|---|---|---|---|---|---|---|
| -WQT | 0.52 | 0.76 | 0.90 | 0.46 | 0.69 | 0.82 |
| F-WQT | 0.54 | 0.80 | 0.93 | 0.49 | 0.73 | 0.87 |
| -MMD | 0.47 | 0.75 | 0.88 | 0.45 | 0.67 | 0.83 |
| F-MMD | 0.53 | 0.78 | 0.89 | 0.50 | 0.70 | 0.84 |
| -MW1 | 0.51 | 0.78 | 0.89 | 0.49 | 0.70 | 0.83 |
| F-MW1 | 0.54 | 0.89 | 0.94 | 0.46 | 0.75 | 0.84 |
| -KS | 0.53 | 0.70 | 0.86 | 0.46 | 0.66 | 0.79 |
| F-KS | 0.54 | 0.88 | 0.98 | 0.46 | 0.72 | 1.0 |

TABLE II: Simulated matched filter results for statistical tests on for filtered (denoted by F-) and unfiltered statistics on a series of single change point simulated time series as described in Sec. IV-A. AU-PRC and best-F1 scores generally increase with the inclusion of the matched filter. As expected, performance also improves with increased window length .

| Test | AU-PRC (n=50) | AU-PRC (n=100) | AU-PRC (n=150) | Best-F1 (n=50) | Best-F1 (n=100) | Best-F1 (n=150) |
|---|---|---|---|---|---|---|
| -MMD | 0.19 | 0.67 | 0.85 | 0.36 | 0.65 | 0.88 |
| F-MMD | 0.27 | 0.85 | 1.0 | 0.48 | 0.86 | 1.0 |
| -SWQT | 0.52 | 0.95 | 0.97 | 0.56 | 0.95 | 1.0 |
| F-SWQT | 0.73 | 1.0 | 1.0 | 0.72 | 1.0 | 1.0 |

TABLE III: Simulated performance of matched filter on given the experimental setup described in Sec. IV-A. Both AU-PRC and best-F1 scores increase with the inclusion of matched filtering. SWQT generally outperforms MMD for this simulated example.

When comparing the WQT and W1-DT on uniform distributions of shifting supports, Fig. 4 shows both the bounded property of the WQT, saturating when the supports of the two distributions become disjoint (), and the difference in sensitivity (slope of the response) of the two tests. Straightforward calculations show that the W1-DT is equally sensitive (constant slope) regardless of the shift in . In contrast, the WQT shows different regions of sensitivity. Specifically, the WQT is more sensitive to small changes in support but is insensitive to any additional change once the supports are disjoint. While the scales of these two tests differ, as is evident from the left and right axis labels, all evaluation in this paper is based on local maxima and precision-recall curves, the structure of which is independent of the absolute amplitude of the test statistic. We return to this point in the discussion of the real-world data below.
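The constant-slope behaviour of the Wasserstein-1 distance under a shift in support can be checked numerically. The sketch below assumes two unit-width uniform distributions, one on [0, 1] and one shifted by `d`, and evaluates the distance as the average absolute difference of their quantile functions on a midpoint grid; the W1-DT of Sec. III-C may include an additional scaling that is not reproduced here.

```python
def uniform_quantiles(lo, hi, n=1000):
    """Quantile function of U(lo, hi) evaluated on a midpoint grid of size n."""
    return [lo + (hi - lo) * (i + 0.5) / n for i in range(n)]


def w1_from_quantiles(qa, qb):
    """Wasserstein-1 distance as the mean absolute quantile difference."""
    return sum(abs(u - v) for u, v in zip(qa, qb)) / len(qa)


shifts = [0.25, 0.5, 1.0, 2.0]
w1_values = [w1_from_quantiles(uniform_quantiles(0.0, 1.0),
                               uniform_quantiles(d, 1.0 + d))
             for d in shifts]
# Each entry equals its shift d (up to rounding), so the response is linear
# in the shift, including beyond d = 1 where the supports are disjoint.
```

The distance grows linearly with the shift and does not saturate once the supports separate, in contrast to the bounded response described for the WQT.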


AU-PRC F-WQT 0.865 0.846 F-W1DT 0.349 0.254

Fig. 4: Simulated CPD comparing the W1-DT (EDF/QF test) with the WQT (Q-Q test). Sample output (top) of matched filtered WQT and W1-DT output for simulated data with and without cubic transformation. We note the two traces for the F-WQT are coincident. AU-PRC values (bottom left) averaged over 10 runs. Clearly, as all 3 change points are labeled as true changes, the WQT vastly out-performs the W1-DT. Comparison of WQT and W1-DT (bottom right) on uniform distributions with shifting supports computed based on their definitions in Sec. III-C, Sec. III-D exhibiting the differences in sensitivity of the two tests in different regions.

In summary, two properties of Q-Q tests for CPD are (1) the ability to use a single threshold to detect changes at different scales, and (2) the high sensitivity of these tests to small changes in data support. In applications where these characteristics are desirable, change point methods built on Q-Q tests will provide better results, as seen from the clear benefits in AU-PRC in Fig. 4. However, in cases where the absolute magnitude of the change is significant, tests based on the EDF or QF would produce better results. For comparison, the reported results in the table of Fig. 4 use all three change points as true change points. If only the change with the greatest magnitude (that is, the third one) were considered a “true” change, the performance would be reversed, with an AU-PRC of 0.256 for the F-WQT and 0.687 for the F-W1DT.

Fig. 5: Precision vs recall evaluated on HASC (left column, window size ), MASTRE (middle column, ), and Beedance (right column, ) benchmark tasks. For each dataset we compare (top row) unfiltered (dashed) and matched filtered (solid) statistics using the SWQT, W1-DT, MMD, and KS statistical tests. We also compare (bottom row) the matched filtered statistic with results from using the M-Statistic, RuLSIF, BOCPD-MS, and KL-CPD (supervised). Our proposed matched filtered versions (denoted with prefix “F-”) do not require duplicate detection post-processing. The unfiltered versions (denoted with prefix “”) are post-processed to remove any detections within samples (HASC , MASTRE , Beedance ).

IV-C Real World Data

We compare the filtered and unfiltered versions of the statistical tests described in this paper with prior work using identical windowing parameters where applicable. The M-statistic [liu_change-point_2013] is a sliding window CPD algorithm based on the MMD. BOCPD-MS [knoblauch_spatio-temporal_nodate] is a parametric Bayesian method that extends [adams_bayesian_2007] through model selection. Here we utilize the algorithm's default parameters (code from https://github.com/alan-turing-institute/bocpdms) and set the change point statistic as the log probability that the run length equals zero. RuLSIF [liu_change-point_2013] uses direct density ratio estimates between sliding windows (code from https://riken-yamada.github.io/RuLSIF.html). KL-CPD [chang_kernel_2019] applies the MMD on sliding windows with a kernel trained as a neural network in a supervised setting. We note that KL-CPD is the only supervised method included in our evaluation. In this supervised setting, since data is required for training and validation, KL-CPD is tested on a subset of the available evaluation data. The default setup (code from https://github.com/OctoberChang/klcpd_code) is used, where 60% of each sequence is used for training, 20% for validation, and 20% for testing. Therefore, comparison of KL-CPD to all other unsupervised methods should be considered carefully. For each of these methods, we remove duplicate peaks within samples of each other.

To provide a detailed analysis of performance, we plot the PR curve for each method evaluated on the following datasets:

HASC-PAC2016 [ichino_hasc-pac2016:_2016]: A raw dataset that consists of over 700 three-axis accelerometer sequences sampled at 100 Hz of subjects performing six actions: 'stay', 'walk', 'jog', 'skip', 'stairs up', and 'stairs down'. The 92 longest sequences in which each of the six actions is represented are used for evaluation. The time series have an average length of 17,775 samples and 15.2 change points.

MASTRE [hussey_monitoring_2020]: In this proprietary dataset, soldiers move between a series of stations to perform various physical tasks. The nature of the tasks varies from marksmanship to aerobic exercises. Subjects are instrumented with a three-axis accelerometer sampled at 100 Hz, and change points are labeled from video as the subject transitions into and out of tasks. A total of three time sequences were evaluated, with an average length of 92,097 samples and 65.3 change points.

Beedance [oh_learning_2008]: A dataset containing movements of dancing honeybees, who communicate through three actions: 'turn left', 'turn right', and 'waggle'. A total of 6 time sequences are evaluated, each one containing a 3-dimensional signal of the X,Y location and heading angle of the bee as captured in an overhead image. The time series have on average a length of 787 samples and 18.8 change points. We obtained the positions and angles from the original data release.

For datasets with multiple dimensions, methods inherently defined for univariate signals () are extended to higher dimensions by averaging their respective test statistic over each dimension. This applies to WQT, W1-DT, KS, and RuLSIF.

Fig. 6: Sample output for HASC-PAC2016 (left, middle) and MASTRE (right) human activity accelerometer data sequence (grey) and the ground truth change points (yellow), with the filtered (solid) and unfiltered (dashed) SWQT (blue) and W1-DT (purple). For comparison, each statistic is normalized by the maximum value of its respective unfiltered statistic over the entire sequence. While it appears that the SWQT false alarms frequently (left), zooming into one such region (middle) shows a small shift in support, resulting in a change in the data at a smaller scale which HASC does not consider a true change point. The example sequence from the MASTRE data (right) shows a similar small shift that is a true change point and that is detected by the SWQT but not the W1-DT.

IV-D Real World Results

For the HASC dataset, where the window size is large (), the left column of Fig. 5 shows the improvement in performance provided by the matched filter. At each fixed recall, there is consistently about a 5% increase in precision with the matched filter applied, up to a recall of about , indicating that in this region the matched filter decreases false positives without increasing false negatives. However, recall values past a certain point are not achievable by the matched filtered methods under the given algorithm parameter and evaluation parameter . This is especially true as the threshold becomes small. In this regime, the stochastic nature of the unfiltered test statistics often produces what are essentially spurious local maxima somewhere within of a true change point. The matched filter, however, serves as a low-pass filter, removing these peaks and resulting in recall values less than unity. In these cases, the higher achievable recall of the unfiltered statistical tests is attributed to the prevalence of local maxima due to randomness rather than to the inherent properties of the test. Interestingly, the best-F1 scores are comparable between the filtered and unfiltered methods in the HASC dataset, and are achieved around where the two curves cross. Thus, up to a certain recall, the application of matched filtering improves detection precision and F1 score. Past this recall, which is only achievable by unfiltered methods, the use of matched filtering does not provide benefits in terms of F1 score.

For MASTRE, the performance of matched filtered statistics in Fig. 5 (middle column) compared to unfiltered methods shows similar trends to HASC: for lower recall values, matched filtered statistics consistently produce higher precision values. However, unlike HASC, the recall value at which the filtered methods fall off differs significantly between the test statistics. This discrepancy is due to the difference in how true change points are labeled between the two datasets, discussed in depth below.

Referring to the right column in Fig. 5, for the Beedance dataset, the matched filter does not seem to offer clear benefits compared to the baseline test statistic signals. In fact, we might expect this because the small window size required for this dataset () likely means we are far from the asymptotic regime of in which the matched filters are derived. We thus include this result as a known limitation as we expect matched filters to show benefits only for large window sizes.

As seen in the bottom row of Fig. 5, compared to prior work, the relative performance of the sliding window statistical tests discussed in this work varies depending on the dataset. For the HASC data, all statistical tests show better results compared to all other evaluated methods. In the case of MASTRE, RuLSIF shows the overall best results. Notably, the M-statistic performs comparably to the unfiltered SWQT for both HASC and MASTRE. For Beedance, KL-CPD (supervised) shows the best results. Of the prior methods evaluated, RuLSIF performs most consistently over the three datasets.

Furthermore, we note that for the MASTRE data set, there is a significant improvement in the SWQT compared to the WQT averaged across each dimension whereas for HASC, there is only a slight improvement. From this we can deduce that the change points of HASC can be observed through the analysis of each dimension independently, whereas for MASTRE performance is improved by considering vector valued methods.

The HASC and MASTRE datasets represent human activity measured through accelerometry under different contexts. The HASC experiment measures human activity in a controlled setting where subjects are instructed to hold an action until switching to another one of the six allowed actions. On the other hand, MASTRE is collected in a task-oriented setting where transitions between individual actions are more fluid, which is one reason why overall performance on MASTRE is lower. Furthermore, while there are running and walking tasks similar to that of HASC, the MASTRE data also encompass changes where the individual is not moving their feet, such as standing-to-kneeling posture changes.

The characteristics of the HASC and MASTRE datasets noted in the previous paragraph lead to differences in performance among the tests. In the HASC precision-recall curve, the performance of the statistical tests varies. Given a certain recall value, the W1-DT and MMD have the highest precision, the KS is in the middle, and the WQT and SWQT generally have the lowest precision. The WQT and SWQT results arise from the fact that the Q-Q tests on the HASC data false alarm more often compared to the other tests. Although the WQT and SWQT achieve a slightly higher recall (for example at a precision level of 0.45), this benefit does not outweigh the loss in precision. For the MASTRE PR curves, however, the story is different. For low recall values (0.2-0.6) there is a similar trend where the MMD and KS have higher precision values than the SWQT. However, the discrepancy between the recall values (for example at a precision level of 0.3) is much more pronounced: the SWQT achieves a recall significantly higher than that of the other statistical tests, especially the W1-DT, which overall performs very poorly on this dataset. These differences in performance of the W1-DT and the WQT/SWQT between HASC and MASTRE can be explained by two factors: (1) the properties of Q-Q tests as discussed in the context of simulated data and (2) how true labels are assigned in the respective datasets.

As seen from the sample HASC time series (Fig. 6, left), the WQT generally has peaks of equal height whereas the peaks of the W1-DT scale with the observed magnitude of changes. This behavior is consistent with the discussion surrounding Fig. 4 concerning the manner in which these tests respond to changes that vary in scale. Furthermore, while it appears that the WQT false alarms in the stationary regions where the subject is motionless, closer inspection of one of these regions (Fig. 6, middle) shows that the signal has a slight shift in mean, perhaps due to a shift in posture. As discussed above (Fig. 4), the WQT is highly sensitive to these small changes in the support of the data, resulting in a clear peak in the Q-Q test statistic in Fig. 6. This change in the data is small relative to others across the full time series. Thus, consistent with the relative insensitivity of EDF/QF tests, it is not reflected in the W1-DT. The ground truth change point labels of HASC focus on the large-scale activity changes; change points are therefore not labeled for these small changes, which contributes to the poor precision of the Q-Q tests.

Change points in the MASTRE dataset correspond to entering and exiting stations where tasks are performed, and are not necessarily based on the specific action the subject is taking. Therefore, true change point labels correspond to both large changes in action and subtle changes in posture (Fig. 6, right), the latter of which would not be labeled as a true change point in the HASC setting. As shown earlier and seen in Fig. 4, the WQT is sensitive to both of these changes, and therefore achieves a higher precision and recall.

These two human activity datasets provide an example of how the application dictates the suitability of Q-Q tests. We have shown that the WQT is particularly sensitive to small changes in support, and that Q-Q tests detect changes at different scales equally. In applications where these properties reflect true change points, as is the case for MASTRE, Q-Q tests will yield better results. However, in applications similar to the HASC dataset, where subtle shifts in posture are considered false alarms, the W1-DT would be preferred, as these shifts would be dwarfed by the larger changes in the time series.

V Conclusion and Future Work

While many methods of change point detection have been proposed over the years, the issue of change point localization for a noisy distribution-free statistic has not been thoroughly considered. To address this issue, we introduce asymptotically matched filters. For various non-parametric tests that have been used as the foundation of multiple CPD algorithms, we derive these filters under the simple observation that sliding windows over a change point will cause samples from one window to be drawn from a mixture distribution. With asymptotic analysis, we are able to derive the expected response of the test statistic in the region of a change point which is then used to compute the matched filter in the operational (non-asymptotic) case. While in this paper we only consider a subset of tests, the proposed analysis methodology for deriving matched filters can be applied to other methods in change point detection.

The discussed framework for change point detection through the use of a two-sample test of sliding windows is both simple and easily deployed in practice. Once a test statistic is chosen, the only hyperparameters required are the window size and detection threshold. We build on this methodology by applying matched filtering, which results in improved change point precision and also simplifies the process of identifying change points given a statistic, removing any need for ad-hoc processing to remove duplicate peaks. Furthermore, if the statistical test is distribution-free, the peak-preserving property of the matched filter ensures that statistical guarantees are preserved with a constant threshold. While simple, this method of detecting change points by testing for changes in distribution through two-sample tests demonstrates competitive performance with other state-of-the-art approaches.

In understanding the trade-offs between various CPD methods, we build on two properties of Q-Q based statistical tests; namely their invariance to order-preserving transformations and their sensitivity especially to small changes in support of the data (or equivalently, small changes in the mean). For CPD applications these properties result in differences in response. Specifically, Q-Q tests can detect changes at different scales of the data using a single threshold while tests based on quantile functions or empirical distribution functions tend to be “tuned” to changes of a specific magnitude. As evidenced by our real-world data examples, these differences can be leveraged to properly select the appropriate test for an application, and certainly motivate further rigorous investigation.

Despite the fact that the derivations for the filters in this paper assume that the data is IID, based on real-world results, we see that the benefits still hold on non-IID data when the window size is sufficiently large. Nonetheless, in future work, we hope to consider analysis under non-IID conditions and evaluate if matched filters can be applied to other change point methods.

References