I. Introduction
Given a time-varying signal, the problem of change point detection (CPD) is to identify points in time where the signal exhibits a significant change in either its deterministic content or its underlying stochastic distribution. CPD has been studied for nearly a century and was originally motivated by problems of fault detection and quality control [shewhart_economic_1931]. Since then, a wide range of CPD methods has been developed and applied across a diverse set of applications including finance [adams_bayesian_2007], human activity analysis [liu_changepoint_2013], ECG and EEG processing [qi_novel_2014, dizaji_changepoint_2017], speech [li_mstatistic_2015], and climate change [reeves_review_2007].
Change point detection methods are classified along a number of dimensions [truong_selective_2020, aminikhanghahi_survey_2017], such as whether labels are available at training time (supervised or unsupervised), the detection setting (offline or online), the number of change points assumed, the dimensionality of the signal, and the existence of modeling assumptions on the data (model-based or similarity-based). Supervised CPD methods require an annotated training corpus in which each time-varying signal has associated, labeled change points. Once the method is "trained" on these examples, it can be used to process new signals whose change points are unknown. In contrast, unsupervised CPD methods require only the signals themselves to make detections. Online CPD methods make detections using only historical data and thus are useful in streaming or real-time settings, whereas offline methods process an entire sequence retrospectively. Single change point problems terminate after one change is detected, while multiple CPD methods address time series in which many changes may occur and are further divided based on whether or not the number of change points is known a priori. While recent benchmarks [burg_evaluation_2020] indicate that many competitive CPD methods can only process univariate signals, a growing number can handle multivariate signals.

CPD methods can also be classified as either primarily model-based or similarity-based. We define model-based techniques as those that make specific probabilistic assumptions about the underlying distributions of the data, such as by characterizing changes in the values of known [chamroukhi_joint_2013] or learned [lee_time_2018] parameters. Model-based efforts originate from the landmark works of [wald_sequential_1947, page_continuous_1954, shewhart_economic_1931]. The pioneering use of state-space dynamical models [willsky_generalized_1976] formed the basis for work in hierarchical Bayesian models such as switching linear dynamical systems (SLDS) [murphy_switching_1998].
A model-based method competitive in recent benchmarks [burg_evaluation_2020] is "Bayesian Online Change Point Detection" (BOCPD) [adams_bayesian_2007], which estimates the probability of change at each time via a switching state model that assumes data within a segment arise IID from an exponential family likelihood. Other model-based efforts have pursued Bayesian nonparametric approaches using Gaussian processes [saatci_gaussian_2010] to avoid this restrictive IID assumption and achieve more flexible within-segment data models. Generally, model-based methods for CPD are effective when the modeling assumptions hold and capture the key characteristics of the signal change. In contrast, similarity-based methods employ a test statistic derived directly from some suitable similarity or "distance" between windows of data samples, without assumptions concerning the specific generating distribution of the data. Therefore, similarity-based methods can be applied even when the proper assumptions about transitions between segments, or about the data distribution within a single segment, are unknown or not easily expressed in a tractable probabilistic model.
In this work, we pursue a similarity-based method capable of handling multivariate signals for unsupervised, multiple CPD where the number of changes is not assumed known. This setting is motivated by intended applications in modern human activity analysis, where large volumes of highly sampled sensor data are affordable but annotation is prohibitively expensive and suitable probabilistic assumptions are difficult to formulate. To address CPD in these contexts we propose a simple but effective method that has few hyperparameters and makes minimal assumptions on the data. In addition, though strictly non-causal, the proposed approach is appropriate for offline applications or for online applications where some delay can be tolerated.
A common framework for similarity-based CPD is to compute a univariate statistic measuring the similarity between two windows of data on either side of a purported change point. Methods using this framework process an entire signal by iteratively "sliding" the pair of adjacent windows forward in time over the data, computing a test statistic typically derived from the Empirical Distribution Functions (EDFs), Quantile Functions (QFs), or Quantile-Quantile (QQ) functions associated with the data in a given pair of windows. We can then apply hypothesis testing to this statistic, where the null hypothesis posits that no change point exists between the current two windows and thus the data in the adjacent windows are drawn from the same distribution. If the null is true, the derived statistic should be small in some sense. At a change point, the data in the two windows come from different distributions and thus the expected value of the derived statistic peaks. Therefore, it is common under this framework to consider only local maxima as candidate change points for hypothesis testing. If the distribution of the statistic's value under the null hypothesis is known, the null can be rejected, and a change point declared with a certain confidence, should the peak value exceed a corresponding threshold. In contexts where the distribution under the null cannot be attained, this peak-thresholding method can still be used to make detections but lacks formal guarantees.
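The sliding-window framework above can be sketched in a few lines. Here the Kolmogorov-Smirnov statistic stands in for the generic two-sample test; function and variable names are illustrative and not taken from any reference implementation:

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: maximum gap between EDFs."""
    grid = np.sort(np.concatenate([x, y]))
    Fx = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    Fy = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return np.max(np.abs(Fx - Fy))

def sliding_statistic(signal, N, test=ks_statistic):
    """Slide two adjacent length-N windows over `signal`; the value at time t
    compares samples [t - N, t) against [t, t + N)."""
    T = len(signal)
    stat = np.zeros(T)
    for t in range(N, T - N + 1):
        stat[t] = test(signal[t - N:t], signal[t:t + N])
    return stat
```

On a signal with a single abrupt change, the resulting statistic peaks near the true change point, with a correlated "ramp" of width roughly twice the window size on either side.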
In this similarity-based context, three broad classes of tests have been applied to CPD. The first class is likelihood ratio tests based on the KL-divergence [sugiyama_direct_2008, liu_changepoint_2013]. The second class is nonparametric statistical tests, such as the Kolmogorov-Smirnov (KS), Cramér-von Mises, and Mann-Whitney statistics [hawkins_nonparametric_2010, ross_two_2012]. The last class is tests based on distance metrics between empirical probability distributions. Prominent among them is the family of integral probability metrics [sriperumbudur_empirical_2012], which includes the Maximum Mean Discrepancy (MMD) [gretton_kernel_2012] and the Wasserstein distance [sommerfeld_inference_2016]; the latter has a QQ variant, the Wasserstein Quantile Test (WQT) [ramdas_wasserstein_2015, cheng_optimal_2020].

Two concepts central to the ideas in this paper are the notion of a distribution-free test and a peak-preserving transformation. A nonparametric test is distribution-free when the probability distribution of its test statistic under the null hypothesis is independent of the distribution generating the data. Therefore, if a test is distribution-free, threshold values that correspond to rejection of the null at a fixed false-alarm rate can be applied regardless of the change in distribution to be detected; thus, one does not require methods such as density estimation to derive the distribution under the null. Peak-preservation applies to any method that preserves the expected value of the test statistic at a candidate change point. Since the change point statistic is expected to peak at a true change point, any peak-preserving post-processing method maintains the statistical significance of the test statistic at the change points.
One issue that is not well studied in multiple change point detection is the exact determination of change points once the appropriate statistic is computed. Because the test statistic time series is itself random (it is a function of the underlying random observations), one tends to see multiple local maxima in the vicinity of a change, resulting in either a large number of false alarms or the need for ad hoc post-processing to identify true change points. Current state-of-the-art methods simply consider local maxima above a specified threshold as change points [truong_selective_2020, li_mstatistic_2015], and remove duplicate detections within a specified window [sugiyama_direct_2008]. However, sliding window methods clearly produce a correlated test statistic in which the effect of a change at a given point in time is spread over an interval containing the change point.
Motivated by this fact, we draw on the classical signal processing idea of a matched filter [oppenheim_signals_2016] as a tool to better localize change points. More specifically, we derive the asymptotic (as window length goes to infinity) forms of the expected "signatures" produced by sliding window methods for the Wasserstein-1 distance (W1DT), Wasserstein Quantile Test (WQT), Sliced Wasserstein Quantile Test (SWQT), Maximum Mean Discrepancy squared (MMD), and Kolmogorov-Smirnov (KS) distance. Under the commonly used assumption that the data in each segment are independent and identically distributed (IID) [adams_bayesian_2007, truong_selective_2020], we prove that the expected signature of each statistical test converges to a function that is independent (up to a scale factor) of the data distribution prior to and following the change point. Thus, distinct from the tests themselves being distribution-free, these filters are distribution-free in that they can be applied without knowing the distribution of the data. Using the asymptotically derived signature, we construct finite-length filters in a manner that is peak-preserving. In summary of our main findings, the filters for the KS and W1DT are piecewise linear, while those of the WQT and MMD, which are based on a squared distance, are quadratic.
Matched filters are generally known to be time-reversed versions of the signal to be detected, and are optimal with respect to signal-to-noise ratio when, for example, the noise is additive, white, and Gaussian. We do not claim that the stochasticity in the test statistic for these sliding window methods is of a form for which the matched filtering process is in any sense optimal. However, despite this formal lack of optimality, and the fact that for real time series the data are not IID, we demonstrate empirically that the use of the matched filter simplifies the peak-detection process and improves CPD performance on simulated and real-world activity data.
Finally, exploring the performance of these filters across the five nonparametric test statistics identified above brings to light interesting properties of tests based on the QQ function for the multiple change point problem. Specifically, it is well known that the QQ function is invariant to order-preserving transformations of the data [wasserman_all_2006], and we show that it is also highly sensitive to small changes in the support of the data. In the context of CPD, these properties allow QQ tests to detect relatively small changes in a manner that is practically independent of the overall scale of the data. Whether this characteristic is useful or a source of false alarms depends heavily on the underlying application, an issue that we examine using both simulated and real-world data.
In summary, the main contributions of this paper are as follows:

We develop a principled methodology for deriving and applying matched filters for similarity-based change point detection. We prove that our proposed matched filters are distribution-free and peak-preserving, which preserves the guarantees provided by standard hypothesis tests when they are used to detect change points from the filtered signal.

We offer formal proofs deriving the asymptotically matched filter for four common statistical tests: the Kolmogorov-Smirnov test (KS), Maximum Mean Discrepancy squared (MMD), Wasserstein-1 distance (W1DT), and Wasserstein quantile test (WQT), and we propose the sliced Wasserstein quantile test (SWQT) as a multivariate extension of the WQT.

We demonstrate empirical benefits of the above theoretical contributions in both simulation studies and real-world multivariate change point benchmarks. We show how matched filters with suitable finite-length approximations deliver improved precision in change point localization as well as reductions in false positives and, importantly, remove any need for the additional duplicate-removal post-processing common to other methods.

We provide insight into how the choice of test statistic impacts empirical performance, specifically highlighting differences in sensitivity between statistics based on the Quantile-Quantile (QQ) function and statistics derived from the Empirical Distribution Function (EDF) or Quantile Function (QF). These insights are justified through theory and demonstrated on simulated and real-world human activity datasets.
The remainder of this paper is organized as follows. Sec. II introduces the CPD problem and sets up the basic framework for CPD with statistical tests on sliding windows. In addition, we outline the main properties that differentiate QQ tests from other statistical tests. Sec. III-A motivates and outlines the simple approach of deriving and applying matched filters and states the main theorems for the asymptotically matched filters for the WQT, MMD, KS, and W1DT tests. Details of each proof are left to the appendix. Sec. IV shows that the empirically computed matched filters and the properties of QQ-based tests match our theoretical results. Furthermore, we demonstrate the improvement in detection through false alarm rates and the related metrics of precision and recall, evaluating on simulated data as well as real-world benchmarks based on human activity (HASC [ichino_hascpac2016:_2016], MASTRE [hussey_monitoring_2020]) and honeybee activity (Beedance) [oh_learning_2008].

II. Change Point Detection
II-A Problem Statement
Assume a time series $\{x_t\}_{t=1}^{T}$, $x_t \in \mathcal{X}$, where $\mathcal{X} \subset \mathbb{R}^d$ represents a compact set, is constructed with the following model.

The data consists of $K$ distinct time segments with boundaries $0 = t_0 < t_1 < \dots < t_K = T$, such that within each time segment the samples $x_t$ are IID draws from a fixed but unknown distribution.

The distributions in successive time segments are different, but in general two non-adjacent segments can have samples from the same distribution.

The set of time points $\{t_1, \dots, t_{K-1}\}$ are referred to as the change points. Given these conditions, the problem of unsupervised Change Point Detection (CPD) is to estimate the (possibly empty) set of change points purely from the provided data, without any information or assumptions about the number or location of change points.
II-B Notation and Background
A set of $N$ total samples, $\{x_i\}_{i=1}^{N}$, each drawn IID from some distribution $P$, has an associated Empirical Distribution Function (EDF) and Quantile Function (QF) defined as,

$$\hat{F}(x) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\{x_i \le x\}, \qquad \hat{Q}(p) = \inf\{x : \hat{F}(x) \ge p\}, \quad (1)$$

where the indicator function is,

$$\mathbb{1}\{x_i \le x\} = \begin{cases} 1, & x_i \le x \\ 0, & \text{otherwise.} \end{cases} \quad (2)$$

Given another set of $N$ IID samples $\{y_i\}_{i=1}^{N}$ drawn from distribution $Q$, the Quantile-Quantile (QQ) function is as follows,

$$\hat{R}(p) = \hat{F}_Y\big(\hat{Q}_X(p)\big), \quad p \in [0, 1]. \quad (3)$$
The EDF, QF and QQ functions represent stochastic processes on their respective domains [billingsley_convergence_1999]. Thus in this work, almost sure convergence ($\xrightarrow{\text{a.s.}}$) and weak convergence, or convergence in distribution ($\Rightarrow$), refer to convergence of stochastic processes. While the relevant background is covered in the appendix, we point readers to additional works [vaart_asymptotic_1998, van_der_vaart_weak_1996, billingsley_convergence_1999, pollard_convergence_2012] for more comprehensive coverage of this area.
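The empirical quantities in (1)-(3) can be computed directly from samples; a minimal sketch with illustrative helper names (the test below also checks the QQ function's invariance to an order-preserving transformation, discussed in the next subsection):

```python
import numpy as np

def edf(samples):
    """EDF: x -> (1/N) * #{x_i <= x}, as a callable."""
    xs = np.sort(np.asarray(samples))
    return lambda x: np.searchsorted(xs, x, side="right") / len(xs)

def quantile_fn(samples):
    """QF: generalized inverse of the EDF, Q(p) = inf{x : F(x) >= p}."""
    xs = np.sort(np.asarray(samples))
    n = len(xs)
    def Q(p):
        idx = np.ceil(np.asarray(p, dtype=float) * n).astype(int) - 1
        return xs[np.clip(idx, 0, n - 1)]
    return Q

def qq_fn(x_samples, y_samples):
    """QQ function p -> F_Y(Q_X(p)); close to the identity on [0, 1]
    when both sample sets come from the same distribution."""
    Q, F = quantile_fn(x_samples), edf(y_samples)
    return lambda p: F(Q(p))
```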
II-C Quantile-Quantile Tests
On real-valued data, many of the two-sample tests discussed in this paper directly compare EDFs or QFs (Tab. I). In the QQ case, the comparison is made to the uniform distribution: when two distributions are equal, as under the CPD null, the resulting QQ function matches the uniform distribution on $[0, 1]$. Thus, we designate statistics of this form as QQ tests.

Since the QQ function is non-decreasing from $[0, 1]$ to $[0, 1]$, QQ tests are inherently bounded. The maximum is achieved when the open intervals covering the respective supports of the two distributions are disjoint, as shown in Appx. G for the WQT. QQ functions are also known to be invariant to any transformation of the data that is order-preserving, that is, monotonically increasing [wasserman_all_2006], such as positive affine transforms.
The bounded property of the QQ function makes QQ-based tests particularly sensitive to small shifts in support. In addition, the invariance of QQ functions to order-preserving transformations makes QQ tests insensitive to such transformations of the data. More precisely, the QQ test statistic will be identical for a given time series $\{x_t\}$ and for $\{g(x_t)\}$, where $g$ is an order-preserving function. For QQ tests applied to multiple change point detection problems, these properties imply that a single threshold can be used to detect changes of widely varying magnitudes, at least those changes induced by shifts in support or order-preserving transformations. We verify this claim using simulated data in Section IV-B. In the case of real-world data, we discuss how the empirical results in Section IV-C may be interpreted in light of these properties of QQ tests.
II-D Statistical Tests for Change Point Detection
Our general framework for CPD generates a test statistic by sliding adjacent windows of a constant size and computing a two-sample test from the samples within each window (Fig. 2). At each time $t$, we define two windows of $N$ samples each: one to the left of $t$, with distribution $P$, and the other from the $N$ samples to the right of $t$, with distribution $Q$. A statistical test is then applied to these two windows. We use the notation $D(\cdot, \cdot)$ to represent a general statistical test, which can be substituted with the various specific statistical tests defined in Sec. III.
The nominal approach for identifying change points given a test statistic is to label local maxima of a computed statistic above some threshold parameter [li_mstatistic_2015]. However, as is evident in Fig. 1, randomness in the statistic can adversely affect CPD methods that use thresholds for detection. Furthermore, in the presence of change points, multiple peaks complicate the exact localization of change points. These problems, along with the fact that sliding windows produce a correlated statistic, motivate the application of matched filtering of the test statistic for CPD.
III. Matched Filters for Statistical Tests
III-A Asymptotically Matched Filters
As shown in Fig. 2, for the sliding window framework with constant window size $N$, the effect of a change point located at $t_c$ will be reflected in the test statistic on the interval $[t_c - N, t_c + N]$. Consequently, the matched filter is derived from the expected response of the test statistic on this interval.
In this change point scenario, samples are assumed to be drawn IID from a distribution $P$ for $t < t_c$, and from another distribution $Q$ for $t \ge t_c$. We then generate the EDFs from adjacent sliding windows and the test statistic as described in Sec. II-D. Data in windows that span the change point can be modeled as coming from a mixture distribution between $P$ and $Q$.

Without loss of generality (as long as $D$ is shown to be symmetric), we consider the case starting at $t = t_c$ and sliding the window to the right such that the change point is located in the left set of samples. In this setup, samples from the left window can be modeled as IID samples from the mixture, while the distribution of the samples in the right window remains constant, $Q$. We redefine these distributions in terms of a mixture parameter $\alpha \in [0, 1]$, where $\alpha = 1$ corresponds to $t = t_c$. In this mixture view, our left-window and right-window distributions are,

$$F_L = \alpha P + (1 - \alpha) Q, \qquad F_R = Q. \quad (4)$$

This reformulation allows us to analyze the expected response of the statistic asymptotically, as $N$ goes to infinity. In this case, the possible values of $\alpha$ become countably infinite on the interval $[0, 1]$.
Given that $\hat{F}_L$ and $\hat{F}_R$ are derived from samples drawn IID from the distributions defined in (4), we prove that for each statistic, as $N \to \infty$ for each $\alpha$, the expected value of the statistic converges to a deterministic function of the following form,

$$\lim_{N \to \infty} \mathbb{E}\big[D(\hat{F}_L, \hat{F}_R)\big] = C(P, Q)\, h(\alpha). \quad (5)$$

Here, $C(P, Q)$ is a constant for a given $D$, $P$, and $Q$, while $h(\alpha)$ is only a function of $\alpha$. From (5) we can conclude two main properties. First, the time-dependent component of the response, that is $h(\alpha)$, is independent of the distributions $P$ and $Q$. Second, $P$ and $Q$ impact the response only through a constant scale factor. These two properties allow us to construct a matched filter whose shape is distribution-free and peak-preserving, which we precisely define in Sec. III-B.
Moving back to the case where $N$ is finite, (5) suggests that the expected value of the test statistic around a change point at $t_c$, for $t \in [t_c, t_c + N]$, can be approximated as,

$$\mathbb{E}\big[D_t\big] \approx C(P, Q)\, h\!\left(1 - \frac{t - t_c}{N}\right). \quad (6)$$

Since $D$ is symmetric in its arguments, (5) holds when $P$ and $Q$ are reversed, which corresponds to the analogous setup with the windows slid to the left. Therefore, the response of the statistic is mirrored about $t_c$. For shorthand, the change point statistic at time $t$ is denoted as $D_t$.

Therefore, with a slight abuse of notation, we can define the matched filter $g[n]$,

$$g[n] = h\!\left(1 - \frac{|n|}{N}\right), \quad n \in \{-N, \dots, N\}. \quad (7)$$
TABLE I

Test              | WQT                          | SWQT         | W1DT                               | KS                         | MMD
Type of Test      | QQ                           | QQ           | QF, EDF                            | EDF                        | N/A
Distribution-Free | DF [ramdas_wasserstein_2015] | DF           | not DF [sommerfeld_inference_2016] | DF [vaart_asymptotic_1998] | not DF [gretton_kernel_2012]
Data Dimension    | univariate                   | multivariate | univariate                         | univariate                 | multivariate

III-B Peak-Preserving, Distribution-Free Matched Filters
The matched filter defined above can be applied to the test statistic signal to produce a peak-preserving filtered signal suitable for CPD:

$$\tilde{D}_t = \frac{1}{\kappa}\,(g * D)_t. \quad (8)$$

Here, $\kappa$ is a scalar factor ensuring that the filter is in fact peak-preserving, and $*$ denotes the convolution operation (boundary conditions are handled through zero-padding). We note that this filtering process is distribution-free since (8) does not depend on $P$ or $Q$. This means that $g$ is the matched filter regardless of the distributions defining the signal change and can be applied globally without any knowledge of the probabilistic model of the change point. It also follows from (8) that since the expected value of the test statistic in the region of a change point at $t_c$ reflects (5), then $\mathbb{E}[\tilde{D}_{t_c}] = \mathbb{E}[D_{t_c}]$. Thus the peak value of the statistic at the change point is preserved in expectation through the filtering process. This peak-preservation property is important for statistical tests where the resulting values are compared to a threshold in order to reject the null hypothesis with a certain statistical confidence.
With the application of the matched filter, change points are detected at local maxima above a threshold of the filtered statistic, where no further postprocessing of the local maxima is required. The proposed algorithm is detailed in full in Alg. 1.
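A sketch of this filter-then-threshold procedure, assuming a triangular (piecewise-linear) filter shape as arises for the KS and W1DT statistics; the normalization constant and the simple peak-detection rule here are illustrative simplifications, not a transcription of Alg. 1:

```python
import numpy as np

def matched_filter(N):
    """Finite-length filter matching a triangular (piecewise-linear) signature,
    normalized so filtering the ideal signature leaves its peak value unchanged."""
    n = np.arange(-N, N + 1)
    g = 1.0 - np.abs(n) / N        # signature shape h(1 - |n|/N) with h(a) = a
    return g / np.sum(g * g)       # peak-preserving scale factor

def detect_change_points(stat, N, threshold):
    """Convolve the raw statistic with the matched filter (zero-padded
    boundaries) and return local maxima above `threshold`."""
    filtered = np.convolve(stat, matched_filter(N), mode="same")
    cps = [t for t in range(1, len(filtered) - 1)
           if filtered[t] > threshold
           and filtered[t] >= filtered[t - 1]
           and filtered[t] > filtered[t + 1]]
    return cps, filtered
```

Applied to an ideal noiseless signature, the filtered signal retains the peak value at the change point while the surrounding local maxima are smoothed away, so no duplicate-removal post-processing is needed.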
III-C Wasserstein-1 Distance Test (W1DT)
Given two probability distributions $P, Q$ on $\mathbb{R}$, the Wasserstein-$p$ distance is defined as,

$$W_p(P, Q) = \left( \inf_{\gamma \in \Gamma(P, Q)} \mathbb{E}_{(x, y) \sim \gamma}\big[|x - y|^p\big] \right)^{1/p}, \quad (9)$$

where $\Gamma(P, Q)$ denotes the set of all joint distributions with marginals $P$ and $Q$. For $p = 1$, the Wasserstein-1 distance has the closed form [santambrogio_optimal_2015, peyre_computational_2018],

$$W_1(P, Q) = \int_{\mathbb{R}} \big|F_P(x) - F_Q(x)\big|\, dx. \quad (10)$$
We note the following theorem.

Theorem 1: Let $\hat{F}_N$, $\hat{G}_N$ be derived from $N$ IID samples drawn from $\alpha P + (1 - \alpha) Q$ and $Q$ respectively, where $P, Q$ are continuous on a compact domain, and $\alpha \in [0, 1]$ is constant; then

$$W_1(\hat{F}_N, \hat{G}_N) \xrightarrow{\text{a.s.}} \alpha\, W_1(P, Q). \quad (11)$$
Assuming that the samples live on a compact set, the test statistic is bounded. Thus it follows from the Portmanteau theorem [billingsley_convergence_1999] that the expected value of the statistic converges to the same limit, which has the form of (5). Thus, by the process described in Sec. III-A, the matched filter for the operational case when $N$ is sufficiently large (but finite) is the piecewise linear function $g[n] = 1 - |n|/N$.
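For empirical distributions, the closed form (10) reduces to a finite sum over the pooled sample support, since the EDFs are piecewise constant; a minimal sketch (`w1_distance` is an illustrative name):

```python
import numpy as np

def w1_distance(x, y):
    """Wasserstein-1 distance between two empirical distributions via the
    closed form: the integral of |F_X - F_Y| over the pooled support."""
    grid = np.sort(np.concatenate([x, y]))
    Fx = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    Fy = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    # The EDFs are constant between consecutive grid points
    return float(np.sum(np.abs(Fx[:-1] - Fy[:-1]) * np.diff(grid)))
```

For example, shifting an empirical distribution by a constant $c$ yields a distance of exactly $c$.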
III-D Wasserstein Quantile Test (WQT)
The WQT is a distribution-free variant of the Wasserstein distance that measures the Wasserstein distance of the QQ function to the uniform measure [ramdas_wasserstein_2015],

$$\mathrm{WQT}(\hat{F}_X, \hat{F}_Y) = \int_0^1 \big(\hat{F}_Y(\hat{Q}_X(p)) - p\big)^2\, dp. \quad (12)$$
For the WQT we note the following theorem.
Theorem 2: Let $\hat{F}_N$, $\hat{G}_N$ be derived from $N$ IID samples drawn from $\alpha P + (1 - \alpha) Q$ and $Q$ respectively, where $P, Q$ are continuous on a compact domain, and $\alpha \in [0, 1]$; then
(13) 
Analogous to Sec. III-C, assuming that the samples live on a compact set, the test statistic is bounded. Thus it follows from the Portmanteau theorem that the expected value of the statistic converges to a limit of the form of (5). Then, by the process described in Sec. III-A, the matched filter for the WQT in the operational case when $N$ is finite is the quadratic $g[n] = (1 - |n|/N)^2$.
Thm. 2 adds a factor to the definition of the WQT in (12) that removes a distribution-dependent term. In the operational case, we approximate this term by considering the case where it would have the most impact. This acts as a constant bias term that is removed from the signal prior to matched filter convolution when using the WQT. Details can be found in Appx. C.
III-E Sliced Wasserstein Quantile Test (SWQT)
Since the WQT is defined only in one dimension, the naive approach to extending it to multiple dimensions is to average the WQT across each dimension independently. Alternatively, we propose the sliced Wasserstein quantile test (SWQT), which follows an approach similar to the sliced Wasserstein distance [bonneel_sliced_2015] and averages the WQT over one-dimensional projections of the data. Given two sets of samples in $\mathbb{R}^d$, the SWQT is,

$$\mathrm{SWQT}(\hat{F}_X, \hat{F}_Y) = \int_{S^{d-1}} \mathrm{WQT}\big(\hat{F}_{X,\theta}, \hat{F}_{Y,\theta}\big)\, d\mu(\theta), \quad (14)$$

where $\mu$ is the uniform measure on $S^{d-1}$, the unit sphere in $\mathbb{R}^d$, and $\hat{F}_{X,\theta}$, $\hat{F}_{Y,\theta}$ are the respective EDFs computed from the projections of the samples onto the unit vector $\theta$. With this definition we state the following theorem.
Theorem 3: Let two sets, each consisting of $N$ samples, be drawn IID from $\alpha P + (1 - \alpha) Q$ and $Q$ respectively, where $P, Q$ are continuous on a compact domain, and $\alpha \in [0, 1]$. Then,
(15) 
Assuming that the samples live on a compact set, the test statistic is bounded. Thus it follows from the Portmanteau theorem that the expected value of the statistic converges to a limit of the form of (5). Then, by the process described in Sec. III-A, the matched filter for the SWQT in the operational case when $N$ is finite is the quadratic $g[n] = (1 - |n|/N)^2$.
As with the WQT in Sec. III-D, Thm. 3 is stated with a factor that removes a distribution-dependent term. This term is approximated by considering the case where it would have the most impact, and acts as a constant bias (identical to the bias of the WQT) that is removed from the signal prior to matched filter convolution. Details can be found in Appx. D.
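The slicing construction can be approximated by Monte Carlo averaging over random projection directions. The sketch below leaves the one-dimensional two-sample test as a pluggable argument (function names and the projection count are illustrative):

```python
import numpy as np

def sliced_statistic(x, y, one_d_test, n_projections=50, rng=None):
    """Monte Carlo sliced two-sample statistic: average a one-dimensional
    test over random unit-vector projections of d-dimensional samples."""
    rng = np.random.default_rng(rng)
    d = x.shape[1]
    total = 0.0
    for _ in range(n_projections):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)   # uniform direction on the unit sphere
        total += one_d_test(x @ theta, y @ theta)
    return total / n_projections
```

Any of the one-dimensional statistics in this paper can be supplied as `one_d_test`; normalized Gaussian vectors give directions distributed uniformly on $S^{d-1}$.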
III-F Kolmogorov-Smirnov (KS)
The two-sample KS test [massey_kolmogorovsmirnov_1951] computes the maximum deviation between the respective empirical distribution functions,

$$\mathrm{KS}(\hat{F}_X, \hat{F}_Y) = \sup_{x} \big|\hat{F}_X(x) - \hat{F}_Y(x)\big|. \quad (16)$$
Under continuity assumptions on the distributions, the KS statistic is known to be distribution-free under the null hypothesis [pratt_concepts_1981].
We note the following theorem for the KS test.
Theorem 4: Let $\hat{F}_N$, $\hat{G}_N$ be derived from $N$ IID samples drawn from $\alpha P + (1 - \alpha) Q$ and $Q$ respectively, where $P, Q$ are continuous on a compact domain, and $\alpha \in [0, 1]$. Then,

$$\sup_x \big|\hat{F}_N(x) - \hat{G}_N(x)\big| \xrightarrow{\text{a.s.}} \alpha\, \sup_x \big|F_P(x) - F_Q(x)\big|. \quad (17)$$
III-G Maximum Mean Discrepancy Squared (MMD)
The MMD between two distributions $P$, $Q$ represents the largest difference in expectations over functions in the unit ball of a Reproducing Kernel Hilbert Space (RKHS) with kernel $k$. (We assume the RKHS is universal, in which case the MMD is a metric [gretton_kernel_2012].) In this work we consider the squared MMD statistic for CPD. Given two sets of $N$ samples, $\{x_i\}$ and $\{y_i\}$, sampled IID from $P$ and $Q$ respectively, the squared MMD has an unbiased estimator [gretton_kernel_2012] given by,

$$\widehat{\mathrm{MMD}}^2 = \frac{1}{N(N-1)}\sum_{i \ne j} k(x_i, x_j) + \frac{1}{N(N-1)}\sum_{i \ne j} k(y_i, y_j) - \frac{2}{N^2}\sum_{i, j} k(x_i, y_j). \quad (18)$$
Note that under the null hypothesis, the (appropriately scaled) limiting distribution of this unbiased estimator is not distribution-free [gretton_kernel_2012]. We have the following theorem.
Theorem 5: Let two sets, each consisting of $N$ samples, be drawn IID from $\alpha P + (1 - \alpha) Q$ and $Q$ respectively, where $P, Q$ are continuous on a compact domain, $\alpha \in [0, 1]$, and the kernel $k$ is bounded. Then,

$$\widehat{\mathrm{MMD}}^2 \xrightarrow{\text{a.s.}} \alpha^2\, \mathrm{MMD}^2(P, Q). \quad (19)$$
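The unbiased estimator in (18) is straightforward to compute; a sketch using a Gaussian RBF kernel (the kernel choice and bandwidth here are illustrative, not prescribed by the estimator):

```python
import numpy as np

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased estimator of squared MMD with a Gaussian RBF kernel;
    x and y are (n, d) arrays of samples."""
    def k(a, b):
        sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2 * sigma ** 2))
    n, m = len(x), len(y)
    Kxx, Kyy, Kxy = k(x, x), k(y, y), k(x, y)
    # Diagonal (i = j) terms are excluded from the within-set sums
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * Kxy.mean()
```

Because the estimator is unbiased, its value can be slightly negative when the two sample sets come from the same distribution.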
IV. Evaluation
The only algorithmic hyperparameters required for the nonparametric statistical change point methods described in this paper are the window size $N$ and the detection threshold. For each experiment, we compare filtered and unfiltered versions of each statistical test under the same window parameter. For all real data experiments, we use domain knowledge of the frequency at which change points should be detected to set the window size. Since not all tests have the same statistical guarantees, we evaluate using metrics that vary the threshold parameter over all possible values.
Early applications of change point detection in failure detection focused on metrics such as average run length and detection delay [tartakovsky_asymptotically_2018, veeravalli_quickest_2012]. Recent applications frame multiple CPD as a classification problem at each time step (change point vs. no change point) with severe class imbalance, since only a small fraction of time steps are true change points. Here we follow the work of [burg_evaluation_2020] in moving toward metrics based on the confusion matrix. We also consider performance over a range of thresholds to allow the end-user maximum control of these trade-offs. To these ends, when possible, we report the full Precision-Recall (PR) curve, as well as the Area Under the PR Curve (AUPRC) and best-F1 score (harmonic mean of precision and recall) across all threshold values.
One motivation for the matched filter approach is to disambiguate multiple local maxima near potential change points. Prior CPD works address this issue by removing from consideration duplicate peaks within a pre-specified distance of one another [liu_changepoint_2013], keeping only the highest peak. For a fair comparison to unfiltered methods, including prior work, we apply this post-processing to all unfiltered methods but not to the matched filtered methods. For clarity, the results distinguish methods where duplicate local maxima are removed from the matched filtered methods, where no such post-processing is applied.
Thus, for evaluation purposes, we include two additional parameters: the minimum distance between detected change points applied to the unfiltered methods, and the tolerated distance $M$ for scoring detected change points. Specifically, a detected change point is a True Positive (TP) if there exists a true change point within $M$ samples; otherwise it is considered a False Positive (FP). False Negatives (FN) are true change points without any detected change point within $M$ samples. Precision (P), Recall (R), and F1 are then defined as,

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2PR}{P + R}. \quad (20)$$
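The scoring rule above can be sketched as follows; the greedy one-to-one matching of detections to true change points is our simplifying assumption, and the function name is illustrative:

```python
def score_detections(detected, true_cps, margin):
    """Greedily match each detection to at most one true change point within
    `margin` samples; returns (precision, recall, F1)."""
    unmatched = sorted(true_cps)
    tp = fp = 0
    for d in sorted(detected):
        hit = next((t for t in unmatched if abs(d - t) <= margin), None)
        if hit is None:
            fp += 1            # no true change point close enough
        else:
            tp += 1
            unmatched.remove(hit)  # each truth can be matched only once
    fn = len(unmatched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```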
IV-A Simulation Data
First, we verify our proposed matched filters with simulated data. Given two known distributions $P$ and $Q$, the exact mixture scenario in Fig. 2 is simulated. The test statistic is computed for all values of the mixture parameter $\alpha$ and for various window sizes $N$. Each test was averaged over 100 repetitions using different random seeds.
Next, we validate the benefits of matched filters for change point detection on simulated sequences using the AUPRC and best-F1 metrics. Separate evaluations are performed for tests defined on scalar- and vector-valued data. For the scalar case, we generate 40 IID data sequences of length 800 with a single change point uniformly distributed between samples 300 and 500. Samples prior to the change point are drawn from distribution $P$, whereas samples after the change point are drawn from $Q$.
For the multivariate case, an identical simulation setup is used, but $P$ and $Q$ are defined with a shared covariance and a difference in mean. The SWQT is computed via Monte Carlo simulation by randomly sampling unit vectors and averaging the results over each linear projection. A Gaussian kernel with unit variance is used for computation of the MMD. With these datasets, we compare the performance of the filtered and unfiltered test statistics for various window lengths and evaluate with the tolerance parameters described above.
Finally, we illustrate the difference between the regions of sensitivity of the QQ tests and the scale dependence of non-QQ tests. First we compute the WQT and W1DT for two uniform distributions with shifting supports, modeling the behavior of these tests as the supports separate. Then, we verify the invariance of the WQT to order-preserving transformations by simulating a time series with a sequence of four distributions, each with 500 samples, where each successive segment models the data being scaled by a factor of 2; by construction, the relative scale at each change point is equal. Two additional scenarios are considered: one leaves the data as is, and the other transforms the data by a cubic, which is a monotonically increasing, order-preserving function. We then compute the filtered change point statistic with a fixed window size, comparing the WQT and the W1DT. The test results were aggregated over 10 independent iterations using AUPRC as a measure of change point performance.

IV-B Simulated Results
The plots in Fig. 3 confirm the results of our theorems and show the convergence of the signature for each statistical test to the expected functional form. Even modest sample sizes show convergence toward the expected signature.
The simulated change point tests (Tab. II for the scalar case, Tab. III for the multivariate case) show that applying our proposed matched filters to the corresponding test statistics yields consistent improvement in the AUPRC and best-F1 metrics, resulting in an improved true positive to false positive ratio. As expected, as the window size increases, overall detection performance also increases. In this controlled setting, the performance across all four statistical tests is comparable in the univariate case. However, in the two-dimensional case the SWQT has better overall performance than the MMD across all window sizes.
Since the data matches all of our assumptions, we are also able to empirically verify the peak-preserving property of the matched filter, as shown in Fig. 1, where the data is generated by IID sampling, alternating between two normal distributions.
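A minimal sketch of how such a matched filter can be applied in practice: the noisy sliding-window statistic is correlated with the expected signature of a change point, normalized so that a statistic exactly matching the signature keeps its peak height (the peak-preserving property). The triangular signature below is only a stand-in for the asymptotic functional forms derived in the paper, and is assumed to be normalized to unit peak height.

```python
import numpy as np

def matched_filter(stat, signature):
    """Correlate a test statistic with an expected change point signature.

    `signature` is assumed to have unit peak height; the normalization
    below then preserves the height of a perfectly shaped peak.
    """
    template = np.asarray(signature, dtype=float)
    template = template / np.sum(template**2)  # peak-preserving normalization
    # 'same'-mode correlation keeps the filtered statistic time-aligned
    return np.correlate(stat, template, mode="same")
```

For instance, if the statistic is exactly a scaled copy of the signature, the filtered output attains the same peak value at the same location.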
The differences between the WQT (Q-Q based) and the W1DT (EDF/QF based) for CPD are highlighted using simulated data in Fig. 4. In the simulated time series, where the mean and standard deviation of the data are doubled at each successive segment, the WQT detects each change with essentially the same magnitude; thus a single threshold suffices to detect these three change points, as the relative change is constant. Conversely, the change point response of the W1DT scales with the absolute magnitude of the change in the signal. Furthermore, since the Q-Q plot is invariant to order-preserving transformations, the change point statistic of the WQT under the cubic transform of the data remains identical, while the W1DT statistic shifts drastically.
Tab. II. Simulated univariate CPD results (AUPRC and best-F1 for window sizes n = 50, 100, 150).

          AUPRC               Best-F1
        n=50  100   150     n=50  100   150
WQT     0.52  0.76  0.90    0.46  0.69  0.82
FWQT    0.54  0.80  0.93    0.49  0.73  0.87
MMD     0.47  0.75  0.88    0.45  0.67  0.83
FMMD    0.53  0.78  0.89    0.50  0.70  0.84
MW1     0.51  0.78  0.89    0.49  0.70  0.83
FMW1    0.54  0.89  0.94    0.46  0.75  0.84
KS      0.53  0.70  0.86    0.46  0.66  0.79
FKS     0.54  0.88  0.98    0.46  0.72  1.0
Tab. III. Simulated multivariate CPD results (AUPRC and best-F1 for window sizes n = 50, 100, 150).

          AUPRC               Best-F1
        n=50  100   150     n=50  100   150
MMD     0.19  0.67  0.85    0.36  0.65  0.88
FMMD    0.27  0.85  1.0     0.48  0.86  1.0
SWQT    0.52  0.95  0.97    0.56  0.95  1.0
FSWQT   0.73  1.0   1.0     0.72  1.0   1.0
When comparing the WQT and W1DT on uniform distributions with shifting supports, Fig. 4 shows both the bounded property of the WQT, which saturates when the supports of the two distributions become disjoint, and the difference in sensitivity (slope of the response) of the two tests. Straightforward calculations show that the W1DT is equally sensitive (constant slope) regardless of the shift. In contrast, the WQT shows different regions of sensitivity: it is more sensitive to small changes in support but is insensitive to any additional change once the supports are disjoint. While the scales of these two tests differ, as evident from the left and right axis labels, all evaluation in this paper is based on local maxima and precision-recall curves, the structure of which is independent of the absolute amplitude of the test statistic. We return to this point in the discussion of the real-world data below.
In summary, two properties of Q-Q tests for CPD are (1) the ability to use a single threshold to detect changes at different scales, and (2) high sensitivity to small changes in the support of the data. In applications where these characteristics are desirable, change point methods built on Q-Q tests will provide better results, as seen from the clear benefits in AUPRC in Fig. 4. However, in cases where the absolute magnitude of the change is significant, tests based on the EDF or QF will produce better results. For comparison, the results reported in the table of Fig. 4 use all three change points as true change points. If only the change with the greatest magnitude (that is, the third one) were considered a “true” change, the performance would be reversed, with an AUPRC of 0.256 for the FWQT and 0.687 for the FW1DT.
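The invariance property underlying point (1) can be demonstrated with a small, self-contained experiment that does not use the paper's tests directly: the rank-based Mann-Whitney statistic stands in for a Q-Q-style test, and the 1-Wasserstein distance stands in for an EDF/QF-style test such as the W1DT.

```python
import numpy as np
from scipy.stats import mannwhitneyu, wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 500)
y = rng.normal(0.0, 2.0, 500)

# A cubic map is strictly increasing on the whole real line, so it
# preserves the ordering of the pooled samples: any rank-based statistic
# is exactly unchanged by the transform.
u_raw = mannwhitneyu(x, y).statistic
u_cubed = mannwhitneyu(x**3, y**3).statistic

# A distance computed on the sample values themselves is not invariant:
# cubing rescales the data nonuniformly and changes the statistic.
w_raw = wasserstein_distance(x, y)
w_cubed = wasserstein_distance(x**3, y**3)
```

Here `u_raw == u_cubed` holds exactly, while `w_raw` and `w_cubed` differ substantially, mirroring the WQT/W1DT contrast in Fig. 4.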
IV-C Real-World Data
We compare the filtered and unfiltered versions of the statistical tests described in this paper with prior work, using identical windowing parameters where applicable. The M-statistic [li_mstatistic_2015] is a sliding window CPD algorithm based on the MMD. BOCPDMS [knoblauch_spatiotemporal_nodate] is a parametric Bayesian method that extends [adams_bayesian_2007] through model selection. Here we use the algorithm's default parameters (code from https://github.com/alan-turing-institute/bocpdms) and set the change point statistic as the log probability that the run length equals zero. RuLSIF [liu_changepoint_2013] uses direct density ratio estimates between sliding windows (code from https://riken-yamada.github.io/RuLSIF.html). KLCPD [chang_kernel_2019]
applies the MMD on sliding windows with a kernel trained as a neural network in a supervised setting. We note that KLCPD is the only supervised method included in our evaluation. Since data is required for training and validation, KLCPD is tested on a subset of the available evaluation data: the default setup (code from https://github.com/OctoberChang/klcpd_code) is used, in which 60% of each sequence is used for training, 20% for validation, and 20% for testing. Comparisons of KLCPD to the unsupervised methods should therefore be interpreted carefully. For each of these methods, we remove duplicate peaks that fall within the minimum distance of each other. To provide a detailed analysis of performance, we plot the PR curve for each method evaluated on the following datasets:
HASC-PAC2016 [ichino_hascpac2016:_2016]: A raw dataset consisting of over 700 three-axis accelerometer sequences, sampled at 100 Hz, of subjects performing six actions: “stay”, “walk”, “jog”, “skip”, “stairs up”, and “stairs down”. The 92 longest sequences in which all six actions are represented are used for evaluation. The time series have an average length of 17,775 samples and 15.2 change points.
MASTRE [hussey_monitoring_2020]: In this proprietary dataset, soldiers move between a series of stations to perform various physical tasks, ranging from marksmanship to aerobic exercise. Subjects are instrumented with a three-axis accelerometer sampled at 100 Hz, and change points are labeled from video as the subject transitions into and out of tasks. A total of three time sequences were evaluated, with an average length of 92,097 samples and 65.3 change points.
Beedance [oh_learning_2008]: A dataset containing the movements of dancing honeybees, which communicate through three actions: “turn left”, “turn right”, and “waggle”. A total of six time sequences are evaluated, each containing a 3-dimensional signal of the X, Y location and heading angle of the bee as captured in an overhead image. The time series have an average length of 787 samples and 18.8 change points. We obtained the positions and angles from the original data release.
For datasets with multiple dimensions, methods inherently defined for univariate signals are extended to higher dimensions by averaging their respective test statistics over each dimension. This applies to the WQT, W1DT, KS, and RuLSIF.
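This averaging extension can be sketched in a few lines; `univariate_stat` is a placeholder for any of the per-dimension tests named above, assumed to map a length-T series to a length-T statistic.

```python
import numpy as np

def multivariate_stat(x, univariate_stat):
    """Extend a univariate test statistic to multivariate data.

    x: array of shape (T, d). The univariate statistic is computed on
    each coordinate independently and the results are averaged, as
    described in the text for the WQT, W1DT, KS, and RuLSIF.
    """
    per_dim = [univariate_stat(x[:, j]) for j in range(x.shape[1])]
    return np.mean(per_dim, axis=0)
```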
IV-D Real-World Results
For the HASC dataset, where the window size is large, the left column of Fig. 5 shows the improvement in performance provided by the matched filter. At each fixed recall, there is consistently about a 5% increase in precision with the matched filter applied, up to a certain recall, indicating that in this region the matched filter decreases false positives without increasing false negatives. However, recall values past this point are not achievable by the matched filtered methods under the given algorithm and evaluation parameters. This is especially true as the threshold becomes small. In this regime, the stochastic nature of the unfiltered test statistics often produces essentially spurious local maxima somewhere within the tolerance of a true change point. The matched filter, however, serves as a low-pass filter, removing these peaks and resulting in recall values less than unity. In these cases, the higher achievable recall of the unfiltered statistical tests is attributed to the prevalence of local maxima due to randomness rather than to the inherent properties of the test. Interestingly, the best-F1 scores are comparable between the filtered and unfiltered methods on the HASC dataset, and are achieved near where the two curves cross. Thus, up to a certain recall, the application of matched filtering improves detection precision and F1 score; past this recall, achievable only by unfiltered methods, matched filtering does not provide benefits in terms of F1 score.
For MASTRE (Fig. 5, middle column), the performance of the matched filtered statistics compared to the unfiltered methods shows trends similar to HASC: for lower recall values, matched filtered statistics consistently produce higher precision. However, unlike HASC, the recall value at which the filtered methods fall off differs significantly among the test statistics. This discrepancy is due to the difference in how true change points are labeled between the two datasets, discussed in depth below.
Referring to the right column of Fig. 5, for the Beedance dataset the matched filter does not offer clear benefits over the baseline test statistics. We might expect this because the small window size required for this dataset likely means we are far from the asymptotic regime in which the matched filters are derived. We thus include this result as a known limitation: we expect matched filters to show benefits only for large window sizes.
As seen in the bottom row of Fig. 5, the performance of the sliding window statistical tests discussed in this work relative to prior methods varies by dataset. For the HASC data, all of the statistical tests show better results than the other evaluated methods. For MASTRE, RuLSIF shows the best overall results. Notably, the M-statistic performs comparably to the unfiltered SWQT on both HASC and MASTRE. For Beedance, KLCPD (supervised) shows the best results. Of the prior methods evaluated, RuLSIF performs most consistently over the three datasets.
Furthermore, we note that for the MASTRE dataset there is a significant improvement of the SWQT over the WQT averaged across dimensions, whereas for HASC there is only a slight improvement. From this we deduce that the change points of HASC can be observed by analyzing each dimension independently, whereas for MASTRE performance is improved by considering vector-valued methods.
The HASC and MASTRE datasets both capture human activity measured through accelerometry, but under different contexts. The HASC experiment measures human activity in a controlled setting where subjects are instructed to hold an action until switching to another of the six allowed actions. MASTRE, on the other hand, is collected in a task-oriented setting where transitions between individual actions are more fluid, which is one reason overall performance on MASTRE is lower. Furthermore, while there are running and walking tasks similar to those of HASC, the MASTRE data also encompass changes where the individual is not moving their feet, such as standing-to-kneeling posture changes.
The characteristics of the HASC and MASTRE datasets noted in the previous paragraph lead to differences in performance among the tests. In the HASC precision-recall curve, the performance of the statistical tests varies: at a given recall value, the W1DT and MMD have the highest precision, the KS is in the middle, and the WQT and SWQT generally have the lowest precision. The WQT and SWQT results arise from the fact that the Q-Q tests false alarm more often on the HASC data than the other tests. Although the WQT and SWQT achieve a slightly higher recall (for example, at a precision level of 0.45), this benefit does not outweigh the loss in precision. The MASTRE PR curves tell a different story. For low recall values (0.2 to 0.6) there is a similar trend in which the MMD and KS have higher precision than the SWQT. However, the discrepancy between the recall values (for example, at a precision level of 0.3) is much more pronounced: the SWQT achieves a recall significantly higher than that of the other statistical tests, especially the W1DT, which performs very poorly on this dataset overall. These differences in the performance of the W1DT and the WQT/SWQT between HASC and MASTRE can be explained by two factors: (1) the properties of Q-Q tests discussed in the context of the simulated data, and (2) how true labels are assigned in the respective datasets.
As seen in the sample HASC time series (Fig. 6, left), the WQT generally has peaks of equal height, whereas the peaks of the W1DT scale with the observed magnitude of the changes. This behavior is consistent with the discussion surrounding Fig. 4 concerning the manner in which these tests respond to changes of varying scale. Furthermore, while the WQT appears to false alarm in the stationary regions where the subject is motionless, closer inspection of one of these regions (Fig. 6, middle) shows that the signal has a slight shift in mean, perhaps due to a shift in posture. As discussed above (Fig. 4), the WQT is highly sensitive to these small changes in the support of the data, resulting in a clear peak in the Q-Q test statistic in Fig. 6. This change in the data is small relative to others across the full time series; thus, consistent with the relative insensitivity of EDF/QF tests, it is not reflected in the W1DT. The ground truth change point labels of HASC focus on large-scale activity changes, so change points are not labeled for these small changes, which contributes to the poor precision of the Q-Q tests.
Change points in the MASTRE dataset correspond to entering and exiting the stations where tasks are performed, not necessarily to the specific action the subject is taking. Therefore, true change point labels correspond both to large changes in action and to subtle changes in posture (Fig. 6, right); the latter would not be labeled as true change points in the HASC setting. As shown earlier and seen in Fig. 4, the WQT is sensitive to both of these changes, and therefore achieves higher precision and recall.
These two human activity datasets provide an example of how the application dictates the suitability of Q-Q tests. We have shown that the WQT is particularly sensitive to small changes in support, and that Q-Q tests detect changes at different scales equally. In applications where these properties reflect true change points, as in the case of MASTRE, Q-Q tests will yield better results. However, in applications similar to the HASC dataset, where subtle shifts in posture are considered false alarms, the W1DT would be preferred, as such shifts would be dwarfed by the larger changes in the time series.
V Conclusion and Future Work
While many methods of change point detection have been proposed over the years, the issue of change point localization for a noisy, distribution-free statistic has not been thoroughly considered. To address this issue, we introduce asymptotically matched filters. For various nonparametric tests that have served as the foundation of multiple CPD algorithms, we derive these filters from the simple observation that sliding windows over a change point cause samples in one window to be drawn from a mixture distribution. Through asymptotic analysis, we derive the expected response of the test statistic in the region of a change point, which is then used to compute the matched filter in the operational (non-asymptotic) case. While in this paper we consider only a subset of tests, the proposed analysis methodology for deriving matched filters can be applied to other change point detection methods.
The discussed framework for change point detection via a two-sample test over sliding windows is both simple and easily deployed in practice. Once a test statistic is chosen, the only hyperparameters required are the window size and the detection threshold. We build on this methodology by applying matched filtering, which improves change point precision and also simplifies the process of identifying change points from a statistic, removing the need for ad hoc processing to eliminate duplicate peaks. Furthermore, if the statistical test is distribution-free, the peak-preserving property of the matched filter ensures that statistical guarantees are preserved with a constant threshold. While simple, this method of detecting change points by testing for changes in distribution through two-sample tests demonstrates performance competitive with other state-of-the-art approaches.
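The simplicity claimed above can be made concrete with a short end-to-end sketch: slide two adjacent windows of size n over the series, compute a two-sample statistic between them (the Kolmogorov-Smirnov test here, as one of the distribution-free options discussed in this paper), and report peaks exceeding a threshold. As stated in the text, the window size and threshold are the only hyperparameters; the minimum peak separation shown here is one simple stand-in for the duplicate-peak handling described earlier, not the authors' exact procedure.

```python
import numpy as np
from scipy.stats import ks_2samp
from scipy.signal import find_peaks

def sliding_window_cpd(x, n, threshold):
    """Detect change points via a sliding two-sample KS test.

    Adjacent windows x[t-n:t] and x[t:t+n] are compared at each t;
    local maxima of the statistic above `threshold`, separated by at
    least n samples, are reported as change points.
    """
    t_idx = range(n, len(x) - n)
    stat = np.array([ks_2samp(x[t - n:t], x[t:t + n]).statistic
                     for t in t_idx])
    peaks, _ = find_peaks(stat, height=threshold, distance=n)
    return peaks + n  # shift back to indices in the original series
```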
In understanding the tradeoffs between various CPD methods, we build on two properties of Q-Q-based statistical tests: their invariance to order-preserving transformations and their sensitivity to small changes in the support of the data (or, equivalently, small changes in the mean). For CPD applications, these properties lead to differences in response. Specifically, Q-Q tests can detect changes at different scales of the data using a single threshold, while tests based on quantile functions or empirical distribution functions tend to be “tuned” to changes of a specific magnitude. As evidenced by our real-world data examples, these differences can be leveraged to select the appropriate test for an application, and they certainly motivate further rigorous investigation.
Although the derivations of the filters in this paper assume that the data is IID, the real-world results show that the benefits still hold on non-IID data when the window size is sufficiently large. Nonetheless, in future work we hope to extend the analysis to non-IID conditions and to evaluate whether matched filters can be applied to other change point methods.