Linear-time Detection of Non-linear Changes in Massively High Dimensional Time Series

10/28/2015
by   Hoang-Vu Nguyen, et al.
Max Planck Society

Change detection in multivariate time series has applications in many domains, including health care and network monitoring. A common approach to detect changes is to compare the divergence between the distributions of a reference window and a test window. When the number of dimensions is very large, however, the naive approach has both quality and efficiency issues: to ensure robustness the window size needs to be large, which not only leads to missed alarms but also increases runtime. To this end, we propose LIGHT, a linear-time algorithm for robustly detecting non-linear changes in massively high dimensional time series. Importantly, LIGHT provides high flexibility in choosing the window size, allowing the domain expert to match the level of detail required. To achieve this, we 1) perform scalable PCA to reduce dimensionality, 2) perform scalable factorization of the joint distribution, and 3) scalably compute divergences between these lower dimensional distributions. Extensive empirical evaluation on both synthetic and real-world data shows that LIGHT outperforms the state of the art with up to 100% improvement in both quality and efficiency.


1 Introduction

Change detection in time series is an important task in data mining. It has applications in many areas, including health care and network monitoring [1, 13, 18]. In principle, it is concerned with detecting time points at which important statistical properties of the time series change. To do so, a common approach is to compare the divergence between data distributions of a reference window and a test window. Traditionally, this works well for univariate time series [13, 3, 11]. Time series nowadays, however, are massively high dimensional. For instance, time series from human physical activities have more than 5 000 dimensions [25], while those from online URLs have more than 50 000 dimensions [19]. In these very high dimensional settings, existing work has both quality and efficiency issues.

In particular, techniques working directly with distributions over all dimensions, such as [6, 10, 12], are prone to the curse of dimensionality. Naïvely, this issue seems to be resolved by using a large window size. A large window, however, in general increases runtime, causes high delay, and misses alarms [1]. More recent work [14, 24] alleviates the issue by principal component analysis (pca) [23]. Using pca, these methods map the data in the reference and test windows to a lower dimensional space, on which divergence assessment is carried out per individual eigenvector. While this in theory can avoid high dimensionality, it has three pitfalls. First, pca is cubic in the number of dimensions. Second, to ensure the stability of the covariance matrices that pca uses, these methods need a large window size [15]. Third, by assuming statistical independence in the pca space, they may miss complex changes over correlations of dimensions. Thus, detecting changes in massively high dimensional time series is still an open problem.

In this paper, we aim at addressing this issue. We do so by proposing Light, for linear-time change detection in high dimensional time series. In short, Light overcomes the drawbacks of existing techniques by a three-step approach. First, we perform scalable pca mapping of the two windows using matrix sampling [7]. With matrix sampling, we allow for medium to small window sizes. After pca mapping, we obtain data embedded in a much lower-dimensional space. To further increase robustness of divergence assessment and flexibility in choosing the window size, we factorize the joint distribution in the transformed space to lower dimensional distributions. Finally, we propose a divergence measure that computes divergence score using these distributions. Importantly, our measure has the non-negativity property of well-known divergence measures. Further, it allows non-parametric computation on empirical data in closed form. It also permits incremental calculation and hence is suited to the setting of time series. Our analysis and extensive experiments on both synthetic and real-world high dimensional time series show that Light scales linearly in the number of dimensions while achieving high quality.

The road map is as follows. In Sect. 2, we give an overview of Light. In Sect. 3, we provide its details and analyze its scalability. In Sect. 4, we review related work. In Sect. 5, we report our empirical study. We round up with a discussion in Sect. 6 and conclude in Sect. 7. For readability, we put all proofs in the appendix.

2 Overview

Consider an -dimensional real-valued time series . At each time instant the sample consists of values where corresponds to dimension (). We assume that for all . Note that when the dimension scales are very different, we can normalize to bring them to comparable domains. Here we use different scales to provide more flexibility.

An overview of Light is given in Algorithm 1. It works as follows. First, we form a reference window with size (Line 2). Next, we transform to of lower dimensionality and extract structure from (Line 3). We will explain the rationale of this step shortly. For each new sample of the time series, we form a test window (Lines 4 and 7). Next, we apply to the transformation that we perform on to obtain ; we also extract structure based on . Both steps are described in Lines 5 and 7. We define the change score as the divergence between and (Line 8). When a change indeed happens (Line 9), we report it (Line 10) and restart the process. Note that in this work, we do not focus on handling time-dependent information, e.g. auto-correlation. Such information can be captured by patching multiple windows as done in [12, 18].

In Light, we transform the reference window by mapping it to a space of lower dimensionality. We achieve this with scalable pca using matrix sampling [7]. The purpose of this step is threefold. First, in many domains the default dimensions may not reflect the true physical dimensions [16]; hence a transformation is done to alleviate this issue. Second, when the dimensionality is very large we need to choose a large window size to ensure the robustness of divergence computation. A large window however causes high delay and misses alarms [1]. Hence, dimension reduction is necessary to ensure quality. Third, by reducing the number of dimensions and allowing a smaller window size we are able to boost efficiency.

The usual next step would be to perform divergence assessment on the transformed data. Working with the full joint distribution of the transformed space however may still expose us to the curse of dimensionality [22]: even though the reduced dimensionality is much smaller than the original one, effectively processing the joint distribution over all reduced dimensions may still require a moderate window size. To tackle this we propose to factorize the joint distribution of the transformed data into a combination of selected lower dimensional distributions, preserving non-linear structures, e.g. non-linear correlations among the transformed dimensions. This set of lower-dimensional distributions makes up the structure used in our algorithm, and the change score at each time instant is computed based on these low dimensional distributions. Hence, we achieve three goals at the same time: (1) more flexibility in setting the window size, (2) maintaining robustness of divergence assessment, and (3) detecting changes in complex non-linear structures. Note that the linear transformation with scalable pca and the construction of the structure can be perceived as compression of the data preserving first-order and higher-order dependencies, respectively.

For divergence assessment, we propose a new divergence measure that judiciously combines the component lower dimensional distributions to produce the change score for the original distributions. Our measure has the important non-negativity property of popular divergence measures. In addition, it allows computation on empirical data in closed form. It also facilitates incremental computation and hence is suitable for the time series setting. Through the development of our measure, we show that when the dimensions in the pca space are indeed statistically independent, Light reduces to existing pca-based change detection methods [14, 24], which focus exclusively on linear relationships. Thus, Light is more general, i.e. it can uncover different types of change, be they linear or non-linear.

Having given an overview of Light, in the next section we provide its details and explain why it is a highly scalable solution for very high dimensional time series.

1:  Set
2:  Set
3:  Set
4:  Set
5:  Set
6:  for each new sample of the time series do
7:     Update , , and by replacing by
8:     Set
9:     if  then
10:        Report a change at time instant and set
11:        Repeat from step 2
12:     end if
13:  end for
Algorithm 1 Light
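
The listing above is schematic; the following Python sketch restates its control flow at a high level. The component names (fit_reference, change_score, detector, window size m) are illustrative choices standing for the steps of Sections 3.1-3.4, not the paper's notation, and the initialization of the test window is an assumption.

from itertools import islice
from collections import deque

def light(stream, m, fit_reference, change_score, detector):
    """Schematic sketch of the main loop of Light; component names are illustrative."""
    t = 0
    while True:
        ref = list(islice(stream, m))            # reference window of size m
        if len(ref) < m:
            return                               # stream exhausted
        t += m
        model = fit_reference(ref)               # scalable pca mapping + distribution factorization
        # Test window: initialized from the reference here and then slid one sample at a time.
        test = deque(ref, maxlen=m)
        detector.reset()
        changed = False
        for x in stream:
            t += 1
            test.append(x)                       # replace the oldest sample
            score = change_score(model, test)    # divergence between the two windows
            if detector.update(score):
                print("change detected at time instant", t)
                changed = True
                break                            # restart with a fresh reference window
        if not changed:
            return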

3 Details of Light

In this section we provide the details of Light, namely the data transformation, the distribution factorization, the divergence measure, and the change score thresholding.

3.1 Data Transformation

Following the previous section, we perform pca on the reference window. For simplicity we use A to denote its data matrix, whose rows are the window's points and whose columns are the dimensions. W.l.o.g., we assume that A is centered. The usual procedure would be to (1) solve the eigendecomposition problem on the covariance matrix of A, and (2) use the top eigenvectors corresponding to the top eigenvalues to transform A. This is cubic in the number of dimensions, which is prohibitive for large scale processing when the dimensionality is very large. Further, a large number of dimensions forces us to choose a large window size; otherwise the covariance matrix becomes very unstable, making the pca results unreliable [15].

We overcome this by matrix sampling. In short, matrix sampling is concerned with sampling rows/columns of matrices to reduce the computational cost of common procedures, e.g. matrix multiplication and svd. These approximations come with guaranteed error bounds. Further, as the sampled submatrix has fewer columns the results are also more stable [7]. Matrix sampling has recently been applied in the context of data mining [17]. Here, we employ the technique in [7] as it allows us to avoid fixing a priori the number of dimensions to be kept. This method essentially performs approximate singular value decomposition (svd). Recall that svd finds matrices U, Σ, and V such that A = UΣV^T. Here, U and V contain the eigenvectors of AA^T and A^TA, respectively, and Σ is a diagonal matrix holding the singular values of A, which are the non-negative square roots of the eigenvalues of both AA^T and A^TA. Finding the exact solution of svd is, like pca, expensive for very high dimensional data. With matrix sampling we obtain a high quality approximate solution of svd in much less time. The details are as follows.

We write the i-th column of A as a_i. First, we sample with replacement a number of columns of A according to their relative variance ‖a_i‖² / ‖A‖²_F, where ‖a_i‖² is the squared norm of vector a_i and ‖A‖_F is the Frobenius norm. This forms a matrix C with far fewer columns than A. Second, we perform pca on C to extract its top eigenvectors corresponding to its top eigenvalues; the number of eigenvectors kept is determined by how much variance of C to preserve, and usually 90% to 99% suffices [16, 24]. Next, from C and these eigenvectors and eigenvalues we form a matrix H whose columns approximate the top left singular vectors of A. According to [7], H H^T A approximates A nearly as well as A_k, the best rank-k approximation of A, w.r.t. both the Frobenius norm and the spectral norm. By definition, the pca-transformed data is obtained by projecting A onto the resulting approximate principal directions; we also need this projection to transform the test window later, and it can be recovered from A, H, and the top eigenvalues. The original algorithm is stochastic. We can make it deterministic by picking the columns of A with the largest relative variance; to break ties we use a canonical order.

The cost of this procedure is dominated by computing the sampling probabilities, the eigendecomposition of the small sampled matrix, and the projection of the data; for a fixed number of sampled columns it is therefore linear in the original number of dimensions. With a reasonable number of sampled columns and a small window size the accuracy is increased [7], i.e. a small window size is favored. The complexity of this sampling algorithm is on par with other sampling methods, such as [26, 17]. While these techniques require us to provide the final dimensionality up front, our method permits it to adapt to the current reference window. Overall, by matrix sampling we are able to boost efficiency and gain stability, as we perform pca on a much smaller sampled matrix instead of the full one. In other words, besides guaranteed error bounds the approximate solutions also tend to be more stable to high dimensionality.
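
As a concrete illustration of the transformation step, the sketch below implements column sampling in the spirit of [7] with numpy. It is a minimal sketch under our reading of the description above: the sampling probabilities, the eigendecomposition of the small sampled matrix, and the construction of the projection follow the standard algorithm of [7], while the variable names and the exact way the variance threshold picks the reduced dimensionality are our own choices, not the paper's.

import numpy as np

def approx_pca_projection(A, c, var_keep=0.95, seed=0):
    """Approximate pca of the reference window via column sampling (cf. [7]).
    A: (m, n) centered reference window, rows are samples.
    c: number of columns (dimensions) to sample, c << n.
    Returns P of shape (n, k); A @ P is the k-dimensional embedding, and the
    same P is reused to transform test windows."""
    rng = np.random.default_rng(seed)
    m, n = A.shape

    # Relative variance of each dimension: squared column norm / squared Frobenius norm.
    col_sq = (A ** 2).sum(axis=0)
    probs = col_sq / col_sq.sum()
    idx = rng.choice(n, size=c, replace=True, p=probs)

    # Sampled, rescaled columns: C C^T approximates A A^T in expectation.
    C = A[:, idx] / np.sqrt(c * probs[idx])

    # Eigendecomposition of the small c x c matrix C^T C.
    evals, Y = np.linalg.eigh(C.T @ C)
    order = np.argsort(evals)[::-1]
    evals, Y = np.clip(evals[order], 0.0, None), Y[:, order]

    # Keep the smallest k that preserves var_keep of the sampled variance.
    k = int(np.searchsorted(np.cumsum(evals) / evals.sum(), var_keep)) + 1

    # Approximate top-k left singular vectors of A and the induced projection.
    sigma = np.sqrt(evals[:k])
    H = C @ Y[:, :k] / sigma                     # (m, k)
    return A.T @ H / sigma                       # (n, k) projection matrix

For the deterministic variant described above, one would instead pick the c columns with the largest relative variance (breaking ties in a canonical order) and skip the rescaling.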

3.2 Distribution Factorization

To further increase robustness of divergence assessment and flexibility in choosing the window size, we factorize the joint distribution in the transformed space into low dimensional distributions. We accomplish this with graphical modeling [30, 31], a powerful approach for this purpose. We denote the dimensions of the transformed space as $X_1, \ldots, X_k$. Assume that we have a graph $G = (V, E)$ with $V = \{X_1, \ldots, X_k\}$; two dimensions not connected by an edge are regarded as conditionally independent given all other dimensions. Given such a graph $G$, one can factorize the joint distribution by first finding special structures (e.g. cliques or connected components) of $G$, and then using the distributions of these structures to estimate it. Of course, the more complex the structures the better the estimation. Nevertheless, complex structures tend to contain many dimensions, causing quality and efficiency issues for divergence computation. We achieve a more balanced solution by using the edges of $G$, which is also a good alternative for factorization [30, 31]. Under this model

$p(X_1, \ldots, X_k) = \frac{\prod_{(X_i, X_j) \in E} p(X_i, X_j)}{\prod_{X_i \in V} p(X_i)^{\deg(X_i) - 1}},$

where $\deg(X_i)$ is the degree of $X_i$ in $G$. One can see that the joint distribution is now represented by 1-D and 2-D distributions, further easing the ‘pressure’ on picking the window size. So far, we have assumed that $G$ is available. Now we describe how to obtain it from the transformed data.

Our solution consists of three steps: 1) computing the pairwise correlation score of every two dimensions – the higher the score the more correlated, 2) initializing G as the complete graph over all dimensions, with the weight of each edge being the correlation score of its endpoints, and 3) simplifying G to one of its maximum spanning trees. Note that the simplified G has one edge fewer than the number of transformed dimensions.

Recall that our goal is to detect non-linear changes. To this end, for the first step we choose the quadratic measure of dependency [28], as it captures non-linear correlations, is non-parametric, and permits computation on empirical data in closed form. In short, under this measure the correlation between two dimensions is expressed in terms of their cumulative distribution functions (cdfs). Computing this measure exactly for all dimension pairs is relatively expensive. To boost efficiency, similar to [20] we apply the AMS Sketch [2]. In short, to compute the score of a dimension pair we pre-compute the sketches of both dimensions by projecting their realizations onto multiple random vectors, and then utilize the sketch values to estimate the score. The estimate is unbiased and its error is bounded [2]. The more random vectors used the better the estimate; we find that a moderate number of vectors suffices. The time complexity of computing the pairwise correlation scores is thus greatly reduced [20]. To make our method deterministic we generate the random vectors in advance and reuse them whenever needed.
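
To make the sketching idea concrete, here is a minimal AMS-style sketch in numpy: each dimension's realizations are projected onto shared random ±1 vectors, and the average product of two dimensions' sketches is an unbiased estimate of their inner product; inner-product-like sums of this kind underlie the closed-form estimate of the dependency measure. The number of vectors and the interface are illustrative, not the paper's setting.

import numpy as np

def make_sketcher(m, num_vectors=512, seed=0):
    """Pre-draw random +/-1 vectors once (kept fixed so the method stays
    deterministic); m is the window size, i.e. the length of each dimension's
    realization vector."""
    rng = np.random.default_rng(seed)
    signs = rng.choice(np.array([-1.0, 1.0]), size=(num_vectors, m))
    return lambda x: signs @ x                   # AMS sketch of a length-m vector

def inner_product_estimate(sx, sy):
    """Unbiased estimate of <x, y> from the sketches of x and y [2];
    the variance shrinks as more random vectors are used."""
    return float(np.mean(sx * sy))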

For the third step, as G is initially dense we employ the method proposed in [8]. It is deterministic, and with Fibonacci heaps its cost on the dense graph G is quadratic in the number of nodes. The outcome of this step is the set of 1-D and 2-D distributions taken from the maximum spanning tree. These distributions constitute the structure used by Light.

The overall complexity of distribution factorization is thus linear in the window size and independent of the original dimensionality.
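
For illustration, the following sketch builds the maximum spanning tree over the pairwise correlation scores with a dense-graph Prim's algorithm (a simple deterministic alternative to the Fibonacci-heap method of [8]); the resulting edges, together with the single dimensions, identify the 1-D and 2-D distributions of the factorization. The interface is an assumption made for this sketch.

import numpy as np

def maximum_spanning_tree(score):
    """score: symmetric (k, k) matrix of pairwise correlation scores between
    the pca-space dimensions. Returns the k-1 edges (i, j) of a maximum
    spanning tree."""
    k = score.shape[0]
    in_tree = np.zeros(k, dtype=bool)
    in_tree[0] = True
    best = score[0].astype(float).copy()   # strongest known link of each node to the tree
    parent = np.zeros(k, dtype=int)        # tree node realizing that link
    edges = []
    for _ in range(k - 1):
        j = int(np.argmax(np.where(in_tree, -np.inf, best)))   # best node outside the tree
        edges.append((int(parent[j]), j))
        in_tree[j] = True
        improved = (score[j] > best) & ~in_tree
        best = np.where(improved, score[j], best)
        parent = np.where(improved, j, parent)
    return edges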

3.3 Divergence Computation

For each test window we first map it to the pca space using the projection obtained from the reference window (see Section 3.1). Then, we use the tree G to extract the corresponding set of 1-D and 2-D distributions of the test window. Our goal now is to compute the divergence score between the joint distributions of the reference and test windows. One important question to address here is: how can we do this using the two sets of lower dimensional distributions? We answer this question based on the following observations.

Lemma 3.3. Let $D_{KL}(p \,\|\, q)$ denote the Kullback-Leibler divergence between the joint distributions $p$ and $q$ of the transformed reference and test windows. Using the edge-based factorizations of $p$ and $q$ over $G$ (cf. Section 3.2), we have

$D_{KL}(p \,\|\, q) = \sum_{(X_i, X_j) \in E} D_{KL}\big(p(X_i, X_j) \,\|\, q(X_i, X_j)\big) - \sum_{X_i \in V} \big(\deg(X_i) - 1\big)\, D_{KL}\big(p(X_i) \,\|\, q(X_i)\big).$
We postpone the proof to Appendix A.

Lemma 3.3 tells us that, w.r.t. the KL measure, the divergence of the joint distributions is equal to the sum of the divergence scores of the 2-D distributions (called 2-D divergence scores) offset by those of the 1-D distributions (called 1-D divergence scores). Each 2-D divergence score stands for the magnitude of changes in the two dimensions involved and in their dependency. Thus, the sum of 2-D divergence scores stands for the magnitude of changes in the involved dimensions and their dependencies. The subtraction of 1-D divergence scores reduces the impact of dimensions contributing to more than one 2-D score term.
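
As a sketch of how the decomposition in Lemma 3.3 turns per-marginal divergences into a single change score, the snippet below sums the 2-D divergences over the tree edges and subtracts the degree-weighted 1-D divergences. The per-edge and per-dimension divergence estimates (pair_div, node_div) are assumed to be precomputed with whatever measure is plugged in; the regularization factor of the generalized score discussed next is not reproduced here.

from collections import Counter

def combine_scores(edges, pair_div, node_div):
    """edges: (i, j) pairs of the factorization tree.
    pair_div[(i, j)]: divergence between reference and test 2-D marginals.
    node_div[i]: divergence between reference and test 1-D marginals.
    Returns the KL-style combination of Lemma 3.3."""
    degree = Counter()
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    two_d = sum(pair_div[(i, j)] for i, j in edges)
    one_d = sum((degree[i] - 1) * node_div[i] for i in degree)
    return two_d - one_d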

Though KL divergence allows such a computation of the change score based on 1-D and 2-D distributions, these distributions need to be estimated, e.g. parametrically or by kernel density estimation [27]. Here, we aim at a purely non-parametric approach to maintain the exploratory nature of Light. Thus, we propose to generalize the result in Lemma 3.3 to divergence measures other than KL. In particular, for any measure we propose to combine the 2-D and 1-D divergence scores in the same fashion, together with a regularization factor that guarantees the resulting score is non-negative. With our generalization we give way to applying other non-parametric instantiations, e.g. the one in [21] with empirical computation in closed form, while still upholding the non-negativity property of KL divergence. An additional benefit of the regularization factor is that it can make the influence of dimensions with high degree more prominent, i.e. more impact is given to the dimensions correlated to multiple other dimensions.

Before introducing the setting of the regularization factor, we show that pca-based change detection methods [14, 24] – using the previous two steps of our framework – also generalize from Lemma 3.3, yet in a more restrictive manner. In particular, these methods estimate the overall divergence by aggregating divergence scores computed on each pca dimension individually; note that their two aggregation forms are similar. From Lemma 3.3 we see that if the dimensions are statistically independent under both windows, then the KL divergence of the joint distributions is simply the sum of the 1-D divergence scores. Thus, pca-based methods [14, 24] also generalize from KL divergence; however, they impose the additional assumption that the dimensions in the pca space are statistically independent under both the data of the reference and the test window. We in turn do not impose this restriction and can capture correlation in the pca space, i.e. we provide a more general solution.

Choosing the divergence measure. We use the quadratic measure of distribution divergence [21]. It is purely non-parametric and its empirical computation is in closed form. Under this measure the divergence between two (1-D or 2-D) distributions can be computed directly from the empirical data; the 2-D and 1-D divergence scores are defined analogously. We set the regularization factor following the lemma below.

Setting the divergence measure to the one in [21] and the regularization factor as above, the resulting divergence score is non-negative, with equality to zero if and only if the distributions of the reference and test windows under the factorization model (cf. Section 3.2) are equal.

We postpone the proof to Appendix A.

Complexity analysis. The factorization and the divergence terms are computed from scratch only for the initial reference window and after every detected change. The cost of computing the divergence score for each new sample of the time series is much lower, as the score can be maintained incrementally; more details are in Appendix B. In case a large window size is required, e.g. due to the application scenario, we can further boost the efficiency of our method by sampling with replacement from the data of the reference and test windows. The details are in Appendix C.

3.4 Change Score Thresholding

We perform an adaptive thresholding scheme to decide when a change indeed happens. The scheme we consider is the Page-Hinkley test, which has been used in [24]. Under this test we keep track of the past change scores corresponding to time instants without change. For a new time instant, we assess how much its change score deviates from these historical scores and raise the flag when the deviation is large (w.r.t. an easy-to-adapt threshold). More details are in [24].
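
For concreteness, a generic Page-Hinkley test over the stream of change scores might look as follows; the parameters delta (tolerated drift) and lam (alarm threshold) are illustrative defaults, not the configuration of [24].

class PageHinkley:
    """Generic Page-Hinkley test over the stream of change scores."""
    def __init__(self, delta=0.005, lam=50.0):
        self.delta, self.lam = delta, lam
        self.reset()

    def reset(self):
        self.n = 0
        self.mean = 0.0          # running mean of past change scores
        self.cum = 0.0           # cumulative deviation from the running mean
        self.cum_min = 0.0       # smallest cumulative deviation seen so far

    def update(self, score):
        self.n += 1
        self.mean += (score - self.mean) / self.n
        self.cum += score - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return (self.cum - self.cum_min) > self.lam   # True: flag a change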

3.5 Summing Up

On a time series with changes, the total cost of Light is the sum of the per-window costs given above. With the parameter settings used in our experiments, this simplifies to a complexity that is linear in the window size, the original number of dimensions, and the length of the time series. Being linear in the number of dimensions, Light is suited to very high dimensional time series. Further, it can avoid large window sizes while still achieving high quality, which is verified by our experiments in Section 5.

4 Related Work

pca-based change detection is studied in [14, 24]. Kuncheva and Faithfull [14] propose to use eigenvectors with small eigenvalues for transformation. Qahtan et al. [24] in turn show theoretically and empirically that eigenvectors corresponding to large eigenvalues instead are more relevant. Both methods apply pca on original reference windows and have cubic runtime in the number of dimensions . Further, they may miss complex changes due to the assumption that dimensions in the transformed space are independent, i.e. dimensions in the original space have linear correlations only.

Change detection based on estimating the ratio between the distributions of the reference and test windows is introduced in [12, 18]. The main idea is to directly approximate this ratio by kernel models without approximating the two distributions themselves. This procedure implicitly assumes that the data is uniformly distributed in the original high dimensional space. When the dimensionality is large, we need a large window size to fill this space and uphold the assumption. Computing the distribution ratio analytically is expensive for large windows, so for efficiency purposes the window size must be kept small.

Song et al. [29] propose a divergence test based on kernel density estimation for change detection. By performing density estimation in the original full dimensional space, this test is susceptible to the curse of dimensionality [27]. While one could apply this test after distribution factorization, proving the non-negativity of the resulting score is non-trivial. Other change detection techniques for multivariate time series [1, 6, 10, 5] also work in the original full dimensional space, and hence are also prone to the curse of dimensionality.

Our method in contrast alleviates the high dimensionality issue by scalable pca mapping using matrix sampling. Furthermore, it makes no assumption on the data distributions, nor that dimensions are only linearly correlated. Lastly, it permits the window size to be set to match the level of detail required by the domain expert.

5 Experiments

In this section, we empirically evaluate Light. In particular, we study its effectiveness in detecting known change points on both synthetic and real-world data sets. We implemented Light in Java, and make our code available for research purposes.111http://eda.mmci.uni-saarland.de/light/ All experiments were performed single-threaded on an Intel(R) Core(TM) i7-4600U CPU with 16GB RAM. We report wall-clock running times.

We compare to pind [24] and spll [14], two state of the art methods for pca-based change detection. Both apply traditional pca, assuming that dimensions in the pca space are statistically independent. In addition, we consider rsif [18], which measures divergence scores by directly approximating the density ratio of distributions. For each competitor, we optimize parameter settings according to their respective papers. Light has four parameters: the number of sampled dimensions and the percentage of variance preserved, both used in the scalable pca mapping; and the number of sketches and the number of averaged sketch values, both used in distribution factorization. Note that the last two parameters apply to any method using the AMS Sketch [2]. Throughout, we use a fixed default setting for these parameters.

We experiment on both synthetic and real data sets. For the latter, we draw 7 data sets from the UCI Machine Learning Repository: Amazon, EMG Actions 1, EMG Actions 2, Human Activities, Human Postural Transitions, Sport Activities, and Youtube Video Games. All are high dimensional time series. As change points, we use existing class labels – a common practice in change detection [12, 29]. Table 1 summarizes the characteristics of these data sets.

Data                          Length    Dimensions
Amazon                         30 000      20 000
EMG Actions 1                   1 800       3 000
EMG Actions 2                   3 600       2 500
Human Activities               10 299         561
Human Postural Transitions     10 929         561
Sport Activities                9 120       5 625
Youtube Video Games           120 000      50 000
Table 1: Characteristics of the real data sets: the length of each time series and its number of dimensions.

5.1 Synthetic Data Generation

Each synthetic time series we generate contains 100 segments, each having 2000 time steps. We create the change point between every two consecutive segments by varying either the distributions or the correlation patterns of some dimensions. This means that each time series has 99 change points. We evaluate change detection techniques based on how well they retrieve these known change points. In line with previous work [14, 24], a change point detected by a method is considered to be a true positive if it is flagged correctly before points from the new distribution arrive. As performance metric, we use the F1 measure, i.e. the harmonic mean of precision and recall; the higher the F1 score the better. Below we describe three different ways to create change points.

Gaussian Distributions. In this setting, a subset of the dimensions follows a multivariate Gaussian distribution whose mean vector and covariance matrix are initialized randomly. We consider three types of change: 1) change in the mean vector, 2) change in individual variance terms, and 3) change in covariance terms. When creating segments, we switch among these three types of change in a round robin fashion. Each remaining dimension in turn has its distribution fixed.

Linear Correlations. We embed linear correlations among a subset of the dimensions. To model correlation in each segment, we first generate the values of one set of dimensions, and then generate a second, disjoint set of dimensions as a noisy linear function of the first. We use two different linear instantiations of this function. When creating time series segments, we alternately pick one of the two instantiations. In this way, in every two consecutive segments the correlations between the two sets of dimensions are different. Each remaining dimension is drawn independently from a fixed distribution.

Non-linear Correlations. To embed non-linear correlations among a subset of the dimensions, we follow the same procedure as for linear correlations, except that here we use four non-linear and complex instantiations of the function. When creating time series segments, we switch among these four instantiations in a round robin fashion. Each remaining dimension is drawn independently from a fixed distribution.

5.2 Quantitative Results on Synthetic Data

We assess the quality of the methods tested under different values of dimensionality and window size; when varying one, we fix the other. The results are in Figures 1(a)–1(f).

We find that Light consistently achieves the best performance across different values of dimensionality and window size, and different types of change – from linear to highly non-linear. Its quality improvement over pind and spll is up to 100%.

pind and spll in turn do not perform well, most likely because they use unstable covariance matrices for the pca transformation (the window size is smaller than the dimensionality in all cases tested). At the highest dimensionality, we have to stop the execution of pind and spll due to excessive runtime (more than 12 hours).

rsif does not perform so well (especially on non-linear correlations), most likely as it assumes that data is uniformly distributed in high dimensional spaces.

Light in contrast reliably detects changes in high dimensional time series with small window sizes. By not making assumptions on data distributions nor that dimensions are linearly correlated, it copes well with different types of change and yields high quality results.

Figure 1: [Higher is better] Comparison with competitors: F1 scores on synthetic data sets (panels for Gaussian distributions, linear correlations, and non-linear correlations, plotted against dimensionality). Overall, Light yields the best quality across different types of change, values of dimensionality, and window sizes.

5.3 Efficiency Results using Synthetic Data

Here we test efficiency against window size and dimensionality . The setup is as above. We show representative results on data sets with non-linear correlations in Figures 2(a) and 2(b). Results on other types of data sets are similar and hence skipped for brevity.

We see that in both cases, Light is more efficient than all competitors. The performance improvement over pind and spll is more than 100% for high dimensionalities. Further, the experiments show that Light scales linearly with both the dimensionality and the window size, which corroborates our analysis in Section 3.5.

Figure 2: [Lower is better] Comparison with competitors: runtime on synthetic data sets with non-linear correlations, plotted against dimensionality and window size. Overall, Light has the best scalability across different values of dimensionality and window sizes.

5.4 Parameter Sensitivity

To assess the sensitivity of Light to its parameters, we re-use the synthetic data sets above, fixing the dimensionality and the window size. When testing against one parameter, we use the default setting for the other parameters.

The representative results on data sets with non-linear correlations are in Figures 3(a)–3(d). We see that Light is very stable across different values of its parameters, which facilitates easy parameterization. The results also suggest that our default setting is a reasonable choice in terms of both quality and efficiency.

Figure 3: Sensitivity of Light to its parameter settings (number of sampled dimensions, percentage of variance preserved, and the two sketch parameters) on synthetic data sets with non-linear correlations: F1 and runtime vs. each parameter. Overall, in terms of quality Light is very stable to its parameter setting.

5.5 Ablation Study

To study the impact of our design choices, we consider three variants of Light, each of which we create by switching off one of its components. The first variant, Light with the independence assumption, applies our scalable pca mapping (see Section 3.1) but assumes that dimensions in the pca space are statistically independent; it could be seen as an extension of pind. The second variant, Light without factorization, applies our scalable pca mapping and then computes change scores using the joint distributions in the full pca space; for this variant we use the quadratic measure of divergence [21] with closed-form computation on empirical data. The third variant, Light without pca mapping, performs the factorization directly on the original full dimensional space.

We show representative results on data sets with non-linear correlations in Figures 4(a) and 4(b). We can see that Light outperforms all of its variants. That Light is better than the independence variant highlights the importance of taking into account correlations in the pca space. That Light outperforms the no-factorization variant shows the effectiveness of our factorization step. Finally, the variant without pca mapping tries to approximate very high dimensional distributions, which is much harder than approximating the lower dimensional ones that Light works with; this explains why Light achieves better performance than this variant. In terms of runtime, the independence and no-factorization variants are faster than Light since they do not spend time factorizing the joint distribution of each pca space; the difference however is negligible.

Figure 4: [Higher is better] Comparison with variants: F1 scores vs. dimensionality on synthetic data sets with non-linear correlations. Overall, Light outperforms all of its variants.

5.6 Results on Real Data

We now study the performance of Light on real data. We give an overview of the data sets in Table 1. Each of them is a labeled time series; as change points, we use the class labels, a common practice in the change detection literature [12, 29]. For Human Activities (HAR) and Human Postural Transitions (HAPT), whose numbers of dimensions are only in the hundreds, we adapt Light's parameter setting accordingly; for the other time series, we use the default setting. We evaluate quality using the F1 measure.

The results are in Tables 2 and 3. Overall, we can see that Light consistently yields the best quality and is the most efficient across all time series. Notice that EMG1, EMG2, and Sport are relatively short in length while having many dimensions. For effective change detection on such time series, it is necessary to use small window sizes. This is an issue for pind and spll: they have to perform the pca transformation on very unstable covariance matrices, which could explain why they do not perform well on EMG1, EMG2, and Sport. For the Amazon and Youtube data, the runtime of pind and spll exceeds 12 hours. Light in contrast achieves high accuracy on all high dimensional time series tested. Further, it finishes within 1.5 hours even on the 20 000 dimensional Amazon and 50 000 dimensional Youtube data, with very high accuracy on both.

Data      Light   pind    spll    rsif
Amazon    0.91    -       -       0.64
EMG1      0.77    0.48    0.45    0.72
EMG2      0.84    0.41    0.44    0.67
HAR       0.83    0.62    0.55    0.70
HAPT      0.85    0.68    0.62    0.71
Sport     0.94    0.51    0.46    0.84
Youtube   0.93    -       -       0.76
Average   0.87    0.54    0.50    0.72
Table 2: [Higher is better] F1 scores on real data sets. Best values are in bold. '-' means excessive runtime (more than 12 hours). Overall, Light consistently yields the best quality across all data sets.
Data      Light    pind     spll     rsif
Amazon    1273.6   -        -        1944.5
EMG1      1.2      92.6     98.1     3.1
EMG2      1.1      345.7    341.5    2.3
HAR       2.9      11.2     12.8     3.3
HAPT      2.4      12.5     12.4     3.3
Sport     5.6      1295.7   1280.4   11.9
Youtube   4863.5   -        -        7338.4
Average   878.6                      1329.5
Table 3: [Lower is better] Runtime (in seconds) on real data sets. Best values are in bold. '-' means excessive runtime (more than 12 hours). Overall, Light is consistently the most efficient across all data sets.

6 Discussion

The experiments show that Light is both very efficient and yields high quality for change detection in very high dimensional time series containing different types of change, be they linear or non-linear. Furthermore, it allows small window sizes to be used. This makes it applicable to different types of time series, e.g. those where the dimensionality is even larger than the series length, such as EMG1. Its good performance over the competition suggests the following. First, our scalable pca transformation is much more effective than traditional pca mapping when it comes to high dimensionality; the benefits of scalable pca mapping here lie in both quality and efficiency. Second, our distribution factorization yields better quality than using joint distributions or assuming statistical independence in pca spaces. That Light outperforms its competitors and its variants could also be attributed to our new divergence measure, which can capture changes in both linear and non-linear structures.

Yet, there is room for alternative methods as well as further improvements. For instance, in this paper we use pca for dimension reduction; other methods, e.g. canonical correlation analysis [4], could also be used. Besides the matrix sampling method we employ, it is also interesting to explore other related techniques, such as [26, 9]. The details, however, are beyond the scope of this work. In addition, we here pursue the non-parametric setting. When knowledge of the data distributions is available, one can resort to parametric methods to compute other divergence measures, e.g. the Kullback-Leibler divergence or the Jensen-Shannon divergence. As future work, we plan to extend Light to time series with mixed data types, e.g. those with numeric and categorical dimensions. This will enrich the capability of Light in real-world applications.

7 Conclusion

In this paper, we studied the problem of change detection on very high dimensional time series. This setting poses both efficiency and quality issues. To address these, we proposed Light. In short, it works in three steps: 1) scalable pca mapping to reduce dimensionality, 2) scalable factorization of joint distributions in pca spaces to increase robustness, and 3) scalable computation of divergence scores on the factorized distributions. Experiments on both synthetic and real-world data show that Light outperforms the state of the art with up to 100% improvement in both quality and efficiency.

Acknowledgements

The authors are supported by the Cluster of Excellence “Multimodal Computing and Interaction” within the Excellence Initiative of the German Federal Government.

References

  • [1] C. C. Aggarwal. A framework for change diagnosis of data streams. In SIGMOD Conference, pages 575–586, 2003.
  • [2] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In STOC, pages 20–29, 1996.
  • [3] A. Bifet and R. Gavaldà. Learning from time-changing data with adaptive windowing. In SDM, pages 443–448, 2007.
  • [4] B. Chang, U. Krüger, R. Kustra, and J. Zhang. Canonical correlation analysis based on hilbert-schmidt independence criterion and centered kernel target alignment. In ICML (2), pages 316–324, 2013.
  • [5] T. Dasu, S. Krishnan, D. Lin, S. Venkatasubramanian, and K. Yi. Change (detection) you can believe in: Finding distributional shifts in data streams. In IDA, pages 21–34, 2009.
  • [6] F. Desobry, M. Davy, and C. Doncarli. An online kernel change detection algorithm. IEEE Trans. Signal Process., 53(8):2961–2974, 2005.
  • [7] P. Drineas, R. Kannan, and M. W. Mahoney. Fast monte carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM J. Comput., 36(1):158–183, 2006.
  • [8] M. L. Fredman and R. E. Tarjan. Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the ACM, 34(3):596–615, 1987.
  • [9] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
  • [10] Z. Harchaoui, F. R. Bach, and E. Moulines. Kernel change-point analysis. In NIPS, pages 609–616, 2008.
  • [11] J.-i. Takeuchi and K. Yamanishi. A unifying framework for detecting outliers and change points from time series. IEEE Trans. Knowl. Data Eng., 18(4):482–492, 2006.
  • [12] Y. Kawahara and M. Sugiyama. Change-point detection in time-series data by direct density-ratio estimation. In SDM, pages 389–400, 2009.
  • [13] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In VLDB, pages 180–191, 2004.
  • [14] L. I. Kuncheva and W. J. Faithfull. PCA feature extraction for change detection in multidimensional unlabeled data. IEEE Trans. Neural Netw. Learning Syst., 25(1):69–80, 2014.
  • [15] O. Ledoit and M. Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411, 2004.
  • [16] J. A. Lee and M. Verleysen. Nonlinear Dimensionality Reduction. Springer, New York, 2007.
  • [17] E. Liberty. Simple and deterministic matrix sketching. In KDD, pages 581–588, 2013.
  • [18] S. Liu, M. Yamada, N. Collier, and M. Sugiyama. Change-point detection in time-series data by relative density-ratio estimation. Neural Networks, 43:72–83, 2013.
  • [19] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker. Identifying suspicious urls: an application of large-scale online learning. In ICML, pages 681–688, 2009.
  • [20] H. V. Nguyen, E. Müller, and K. Böhm. 4S: Scalable subspace search scheme overcoming traditional apriori processing. In BigData Conference, pages 359–367, 2013.
  • [21] H. V. Nguyen, E. Müller, J. Vreeken, and K. Böhm. Unsupervised interaction-preserving discretization of multivariate data. Data Min. Knowl. Discov., 28(5-6):1366–1397, 2014.
  • [22] H. V. Nguyen and J. Vreeken. Non-parametric jensen-shannon divergence. In ECML/PKDD, pages 173–189, 2015.
  • [23] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11):559–572, 1901.
  • [24] A. A. Qahtan, B. Harbi, S. Wang, and X. Zhang. A pca-based change detection framework for multidimensional data streams. In KDD, pages 935–944, 2015.
  • [25] J. L. Reyes-Ortiz, L. Oneto, A. Ghio, A. Samá, D. Anguita, and X. Parra. Human activity recognition on smartphones with awareness of basic activities and postural transitions. In ICANN, pages 177–184, 2014.
  • [26] T. Sarlós. Improved approximation algorithms for large matrices via random projections. In FOCS, pages 143–152, 2006.
  • [27] D. W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons Inc, New York, 1992.
  • [28] S. Seth, M. Rao, I. Park, and J. C. Príncipe. A unified framework for quadratic measures of independence. IEEE Transactions on Signal Processing, 59(8):3624–3635, 2011.
  • [29] X. Song, M. Wu, C. M. Jermaine, and S. Ranka. Statistical change detection for multi-dimensional data. In KDD, pages 667–676, 2007.
  • [30] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
  • [31] J. Whittaker. Graphical Models in Applied Multivariate Statistics. John Wiley & Sons, 1990.

Appendix A Proofs

[Lemma 3.3] By definition we have that

From Section 3.2 and based on the conventions of our framework, we have the edge-based factorizations of both distributions. Using this information we deduce that

Thus, we arrive at the result.

[Lemma 3.3] First, we prove that

In particular we have that and similarly for . Thus,

The inequality is in fact the Cauchy-Schwarz inequality. Hence,

The right hand side of the above inequality is indeed .

As s are obtained by pca it holds that

We now need to prove that for all , we can choose for each a set of terms of the form such that and different s will not share any common term. First, as the number of edges of the maximum spanning tree is , we have that

Now we pick the terms for all as follows. We consider edges whose one end-point is a leaf. When it must be the case that either or . Assuming that the former holds we remove edge and assign term to . We carry on until all edges are removed. Note that under this procedure a node is not removed from the tree until becomes 1. This also means that when is removed there have been terms assigned to it. This completes the second part of the proof.

With the results in the first part and the second part, we arrive at . Equality happens when for and for . This means that and under the factorization model are equal.

When and are equal, we have that for and for . Thus, . We complete our proof.

Appendix B Incremental Computation of Divergence Score

We illustrate the idea on incrementally computing the 2-D divergence terms; incrementally computing the 1-D terms follows straightforwardly.

Assume that the empirical data forming the reference marginal consists of a set of data points, and analogously for the test marginal. According to [21], the empirical divergence can be factorized per data point. By storing the contribution of each point, we can incrementally update this score when the test window slides by one sample. Since there are only a linear number of 1-D and 2-D score terms, the cost to compute the divergence score for each new sample of the time series remains small.
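
The bookkeeping can be sketched generically: if the empirical score is a double sum of a symmetric per-pair term f over the window's points (as closed-form estimators of this kind are), then replacing one point only requires touching the terms that involve the departing and the arriving point. The function below is such a sketch; f is a stand-in for the actual per-pair terms of [21], which are not reproduced here.

def slide_pairwise_sum(S, window, old, new, f):
    """Update S = sum_{a in window} sum_{b in window} f(a, b) when `old`
    leaves and `new` enters the window (f assumed symmetric). `window` is
    the content before the replacement; runs in time linear in the window size."""
    others = [b for b in window if b is not old]
    S -= 2.0 * sum(f(old, b) for b in others) + f(old, old)
    S += 2.0 * sum(f(new, b) for b in others) + f(new, new)
    return S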

Appendix C Scaling to Large Window Sizes

The idea is that the divergence score is computed from terms estimated from the data of the reference and test windows. An implicit assumption here is that the data of each window are i.i.d. samples of the corresponding distribution. By definition, i.i.d. samples are obtained by randomly sampling from an infinite population or by randomly sampling with replacement from a finite population. In both cases, the distribution of the i.i.d. samples is assumed to be identical to the distribution of the population; this holds especially well when the sample size is very large [27]. Thus, when the window size is very large, the empirical distribution formed by the window approaches the true distribution. Assume now that we randomly draw with replacement a smaller number of samples from the window. As mentioned above, these subsamples are again i.i.d. samples of the same distribution. As with any set of i.i.d. samples of reasonable size, we can assume that the empirical distribution formed by the subsamples is identical to that of the window, and thus use them in its place. In that case, the complexity of computing the divergence terms for the initial reference window, or after every change, is reduced accordingly. When subsampling is needed we only use it once for each reference window.
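
In code, this subsampling is a one-liner; the subsample size s below is an illustrative parameter, not a value prescribed by the paper.

import numpy as np

def subsample(window, s, seed=0):
    """Draw s points i.i.d. (with replacement) from a window and use them
    in place of the full window when estimating the marginals."""
    idx = np.random.default_rng(seed).integers(0, len(window), size=s)
    return np.asarray(window)[idx]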