1 Introduction
Change detection in time series is an important task in data mining. It has applications in many areas, including health care and network monitoring [1, 13, 18]. In principle, it is concerned with detecting time points at which important statistical properties of the time series change. To do so, a common approach is to compare the divergence between data distributions of a reference window and a test window. Traditionally, this works well for univariate time series [13, 3, 11]. Time series nowadays, however, are massively high dimensional. For instance, time series from human physical activities have more than 5 000 dimensions [25], while those from online URLs have more than 50 000 dimensions [19]. In these very high dimensional settings, existing work has both quality and efficiency issues.
In particular, techniques working directly with distributions over all dimensions, such as [6, 10, 12], are prone to the curse of dimensionality. Naïvely, this issue seems resolvable by using a large window size. A large window, however, generally increases runtime, causes high delay, and misses alarms [1]. More recent work [14, 24] alleviates the issue by principal component analysis (pca) [23]. Using pca, these methods map the data in the reference and test windows to a lower dimensional space, on which divergence assessment is carried out per individual eigenvector. While this in theory can avoid high dimensionality, it has three pitfalls. First, pca is cubic in the number of dimensions. Second, to ensure the stability of the covariance matrices that pca uses, these methods need a large window size [15]. Third, by assuming statistical independence in the pca space, they may miss complex changes over correlations of dimensions. Thus, detecting changes in massively high dimensional time series is still an open problem.

In this paper, we aim at addressing this issue. We do so by proposing Light, for linear-time change detection in high dimensional time series. In short, Light overcomes the drawbacks of existing techniques by a three-step approach. First, we perform scalable pca mapping of the two windows using matrix sampling [7]. With matrix sampling, we allow for medium to small window sizes. After pca mapping, we obtain data embedded in a much lower dimensional space. To further increase the robustness of divergence assessment and the flexibility in choosing the window size, we factorize the joint distribution in the transformed space into lower dimensional distributions. Finally, we propose a divergence measure that computes the change score using these distributions. Importantly, our measure has the non-negativity property of well-known divergence measures. Further, it allows nonparametric computation on empirical data in closed form. It also permits incremental calculation and hence is suited to the time series setting. Our analysis and extensive experiments on both synthetic and real-world high dimensional time series show that Light scales linearly in the number of dimensions while achieving high quality.
The road map is as follows. In Sect. 2, we give an overview of Light. In Sect. 3, we provide its details and analyze its scalability. In Sect. 4, we review related work. In Sect. 5, we report on our empirical study. We round up with a discussion in Sect. 6 and conclude in Sect. 7. For readability, we put all proofs in the appendix.
2 Overview
Consider an n-dimensional real-valued time series. At each time instant t, the sample S(t) consists of n values (S_1(t), …, S_n(t)), where S_i(t) corresponds to dimension i (1 ≤ i ≤ n). We assume that each S_i(t) lies in a bounded domain for all t. Note that when the dimension scales are very different, we can normalize to bring them to comparable domains. Here we allow different scales to provide more flexibility.
An overview of Light is given in Algorithm 1. It works as follows. First, we form a reference window of a chosen size (Line 2). Next, we transform the reference window to a representation of lower dimensionality and extract its distribution structure (Line 3). We will explain the rationale of this step shortly. For each new sample of the time series, we form a test window (Lines 4 and 7). Next, we apply to the test window the transformation computed on the reference window and extract the test window's structure accordingly; both steps are described in Lines 5 and 7. We define the change score as the divergence between the two structures (Line 8). When a change indeed happens (Line 9), we report it (Line 10) and restart the process. Note that in this work, we do not focus on handling time-dependent information, e.g. autocorrelation. Such information can be captured by patching multiple windows, as done in [12, 18].
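The main loop just described can be sketched as follows. This is only an illustration of the control flow; `fit_ref` and `score_test` are placeholder names standing in for the transformation, factorization, and divergence steps detailed in Section 3, and the simple restart policy is our assumption.

```python
import numpy as np

def detect_changes(series, w, fit_ref, score_test, threshold):
    """Sketch of the Light main loop (Algorithm 1). fit_ref builds the
    transformed/factorized model of a reference window; score_test returns
    the divergence of a test window against that model."""
    changes = []
    t = 0
    while t + 2 * w <= len(series):
        model = fit_ref(series[t:t + w])           # reference window
        restarted = False
        for s in range(t + w, len(series) - w + 1):
            if score_test(model, series[s:s + w]) > threshold:
                changes.append(s)                  # report change, restart after it
                t = s + w
                restarted = True
                break
        if not restarted:
            break
    return changes
```

With a trivial model (window mean) and absolute mean difference as divergence, the loop flags a mean shift shortly after it enters the test window.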
In Light, we transform the reference window by mapping it to another space of fewer dimensions. We achieve this with scalable pca using matrix sampling [7]. The purpose of this step is multifold. First, in many domains the default dimensions may not reflect the true physical dimensions [16]; a transformation alleviates this issue. Second, when the dimensionality is very large, we would need a large window size to ensure the robustness of divergence computation; a large window, however, causes high delay and missed alarms [1]. Hence, dimension reduction is necessary to ensure quality. Third, by reducing the number of dimensions and allowing a smaller window size, we are able to boost efficiency.
The usual next step would be to perform divergence assessment on the transformed windows. Working with the full joint distribution over all transformed dimensions, however, may still expose us to the curse of dimensionality [22]: even though the transformed dimensionality is much smaller than the original one, effectively processing the joint distribution may still require a moderately large window size. To tackle this, we propose to factorize the joint distribution of the transformed data into a combination of selected lower dimensional distributions, preserving nonlinear structures, e.g. nonlinear correlations among dimensions. This set of lower dimensional distributions makes up the structure used in our algorithm. The change score at each time instant is computed based on these low dimensional distributions. Hence, we achieve three goals at the same time: (1) more flexibility in setting the window size, (2) maintaining robustness of divergence assessment, and (3) detecting changes in complex nonlinear structures. Note that the linear transformation with scalable pca and the construction of the factorized structure could be perceived as compression of the data preserving first-order and higher-order dependencies, respectively.

For divergence assessment, we propose a new divergence measure that judiciously combines the component lower dimensional distributions to produce the change score for the original distributions. Our measure has the important non-negativity property of popular divergence measures. In addition, it allows computation on empirical data in closed form. It also facilitates incremental computation and hence is suitable for time series applications. Through the development of our measure, we show that when the dimensions in the transformed space are indeed statistically independent, Light reduces to existing pca-based change detection methods [14, 24], which focus exclusively on linear relationships. Thus, Light is more general, i.e. it can uncover different types of change, be they linear or nonlinear.
Having given an overview of Light, in the next section we provide its details and explain why it is a highly scalable solution for very high dimensional time series.
3 Details of Light
In this section we provide the details of Light, namely the data transformation, the distribution factorization, the divergence measure, and the change score thresholding.
3.1 Data Transformation
Following the previous section, we perform pca on the reference window, which contains m points over n dimensions. For simplicity, we use A to denote the data of the reference window; it has m rows (one per point) and n columns (one per dimension). W.l.o.g., we assume that A is centered. The usual procedure would be to (1) solve the eigendecomposition problem on the n × n matrix A^T A, and (2) use the top eigenvectors corresponding to the top eigenvalues to transform A. This costs O(n^3). When n is very large this complexity is prohibitive for large scale processing. Further, when n is large we need to choose a large window size m; otherwise A^T A becomes very unstable, making the pca results unreliable [15].

We overcome this by matrix sampling. In short, matrix sampling is concerned with sampling rows/columns of matrices to reduce the computational cost of common procedures, e.g. matrix multiplication and svd. These approximations come with guaranteed error bounds. Further, as the sampled sub-matrix has fewer columns, the results are also more stable [7]. Matrix sampling has recently been applied in the context of data mining [17]. Here, we employ the technique in [7], as it allows us to avoid fixing a priori the number k of dimensions to be kept. This method essentially performs approximate singular value decomposition (svd). Recall that svd finds matrices U, Σ, and V such that A = U Σ V^T. Here, U and V contain the eigenvectors of A A^T and A^T A, respectively, and Σ is a diagonal matrix whose entries σ_1 ≥ σ_2 ≥ … are the singular values of A, i.e. the non-negative square roots of the eigenvalues of both A A^T and A^T A. Finding the exact solution of svd, like pca, costs O(n^3). With matrix sampling we obtain a high quality approximate solution of svd in much less time. The details are as follows.

We write the i-th column of A as A^(i) for i = 1, …, n. First, we sample with replacement c columns of A according to their relative variance |A^(i)|^2 / ||A||_F^2, where |A^(i)|^2 is the squared norm of column A^(i) and ||A||_F is the Frobenius norm. This forms an m × c matrix C. Second, we perform pca on C to extract the top k eigenvectors of C^T C corresponding to its top k eigenvalues. The value of k is determined by how much variance to preserve; usually 90% to 99% suffices [16, 24]. Next, from these eigenvectors and eigenvalues we form the m × k matrix H, whose columns approximate the top k left singular vectors of A. According to [7], H H^T A is close to A_k, the best rank-k approximation of A, w.r.t. both the Frobenius norm and the spectral norm. The pca-transformed data of the reference window is then obtained by projecting A onto these components; we also keep the corresponding projection matrix, obtained from A, H, and the eigenvalues, to transform the test window later. The original algorithm is stochastic. We can make it deterministic by picking the c columns of A with the largest relative variance; to break ties we use a canonical order.

The cost of approximating the top k eigenvectors of C^T C is O(c^2 m). The cost of computing H is O(mck), and the cost of projecting the data is O(nmk). The total complexity is thus O(c^2 m + nmk), i.e. linear in n. The complexity of this sampling algorithm is on par with other sampling methods, such as [26, 17]. While those techniques require us to provide the final dimensionality k, our method permits k to adapt to the current reference window. Overall, by matrix sampling we are able to boost efficiency and gain stability, as we perform pca on the m × c matrix C with c ≪ n instead of on A itself. In other words, besides guaranteed error bounds, the approximate solutions tend to be more stable under high dimensionality.
3.2 Distribution Factorization
To further increase the robustness of divergence assessment and the flexibility in choosing the window size, we factorize the joint distribution in the transformed space into low dimensional distributions. We accomplish this with graphical modeling [30, 31], a powerful approach for this purpose. We denote the k dimensions of the transformed space as X_1, …, X_k. Assume that we have a graph G = (V, E) with V = {X_1, …, X_k}; two dimensions X_i and X_j not connected by an edge are regarded as conditionally independent given all other dimensions. Given such a graph G, one can factorize the joint distribution by first finding special structures (e.g. cliques or connected components) of G, and then using the distributions of these structures to estimate the joint distribution. Of course, the more complex the structures, the better the estimation. Nevertheless, complex structures tend to contain many dimensions, causing quality and efficiency issues for divergence computation. We achieve a more balanced solution by using the edges of G, which is also a good alternative for factorization [30, 31]. Under this model

p(X_1, …, X_k) = ∏_{(X_i, X_j) ∈ E} p(X_i, X_j) / ∏_{X_i ∈ V} p(X_i)^{deg(X_i) − 1},

where deg(X_i) is the degree of X_i in G. One can see that the joint distribution is now represented by 1D and 2D distributions, further easing the 'pressure' on picking the window size. So far, we have assumed that G is available. Now we describe how to obtain it from the data.
Our solution consists of three steps: (1) computing the pairwise correlation score of X_i and X_j for i ≠ j – the higher the score, the more correlated; (2) initializing G as the complete graph over all dimensions, where the weight of each edge is the correlation score of its endpoints; and (3) simplifying G to one of its maximum spanning trees. Note that the simplified G has k − 1 edges.
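The maximum spanning tree step can be sketched with Prim's algorithm. The paper relies on the Fibonacci-heap method of [8]; this simpler quadratic version computes the same tree and is shown only to make the construction concrete.

```python
import numpy as np

def factorization_tree(corr):
    """Maximum spanning tree over dimensions, given a symmetric pairwise
    correlation-score matrix with zero diagonal (Prim's algorithm).
    Returns k-1 edges; each edge keeps a 2D distribution in the factorization."""
    k = corr.shape[0]
    in_tree = [0]
    best = corr[0].copy()               # best link weight into the tree so far
    parent = np.zeros(k, dtype=int)
    edges = []
    for _ in range(k - 1):
        cand = [j for j in range(k) if j not in in_tree]
        j = max(cand, key=lambda j: best[j])
        edges.append((int(parent[j]), j))
        in_tree.append(j)
        for u in cand:                  # relax remaining vertices through j
            if corr[j, u] > best[u]:
                best[u], parent[u] = corr[j, u], j
    return edges
```

On a toy 4-dimension score matrix, the heaviest k − 1 = 3 edges forming a tree are returned.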
Recall that our goal is to detect nonlinear changes. To this end, for the first step we choose the quadratic measure of dependency [28], as it captures nonlinear correlations, is nonparametric, and permits computation on empirical data in closed form. In short, under this measure the correlation between X_i and X_j is given by the integrated squared difference between their joint cdf and the product of their marginal cdfs, where cdf denotes the cumulative distribution function. Computing this measure for all dimension pairs on a window of m points takes time quadratic in m per pair. To boost efficiency, similar to [20] we apply the AMS Sketch [2]. In short, we precompute the sketches of both X_i and X_j by projecting their realizations onto multiple random ±1 vectors, and then use the sketch values to estimate the correlation score. The estimate is unbiased and its error is bounded [2]; the more random vectors used, the better the estimate, and we find that a modest number of vectors suffices. The time complexity of computing all pairwise correlation scores is thereby reduced considerably [20]. To make our method deterministic, we generate the random vectors in advance and reuse them whenever needed.

For the third step, as G is initially dense, we employ the maximum spanning tree method proposed in [8]. It is deterministic and costs O(k^2). The outcome of this step is the set of 1D and 2D distributions taken from the maximum spanning tree. These distributions constitute the extracted structure.
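The primitive underlying the AMS speed-up is unbiased inner-product estimation from random ±1 projections; how the dependency measure reduces to such inner products follows [20] and is omitted here. A minimal sketch (vector length and sketch count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)            # generated in advance and reused (determinism)
SIGNS = rng.choice([-1.0, 1.0], size=(4000, 10))

def ams_sketch(v):
    """One sketch value per random +/-1 vector."""
    return SIGNS @ v

def ams_inner(sketch_v, sketch_w):
    """Unbiased estimate of <v, w> from the two sketches."""
    return float(np.mean(sketch_v * sketch_w))

v = np.ones(10)
w = np.ones(10)
est = ams_inner(ams_sketch(v), ams_sketch(w))   # true inner product of v and w is 10
```

Averaging over more random vectors tightens the estimate, matching the error bounds of [2].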
The overall complexity of distribution factorization is linear in the window size m and independent of the original dimensionality n.
3.3 Divergence Computation
For each test window, we first map it to the pca space using the projection computed on the reference window (see Section 3.1). Then, we use the maximum spanning tree to extract the corresponding set of 1D and 2D distributions of the test window. Our goal now is to compute the divergence between the joint distributions of the reference and test windows. One important question to address here is: how can we do this using the two sets of low dimensional distributions? We answer this question based on the following observations.
Lemma 3.3. Let p and q be the joint distributions of the reference and test windows under the factorization model of Section 3.2 with edge set E. Then

KL(p || q) = Σ_{(X_i, X_j) ∈ E} KL(p(X_i, X_j) || q(X_i, X_j)) − Σ_{X_i} (deg(X_i) − 1) · KL(p(X_i) || q(X_i)).

We postpone the proof to Appendix A.
Lemma 3.3 tells us that the divergence w.r.t. the KL measure is equal to the sum of divergence scores of 2D distributions (called 2D divergence scores) offset by those of 1D distributions (called 1D divergence scores). Here, each 2D divergence score stands for the magnitude of change in two dimensions and their dependency. Thus, the sum of 2D divergence scores stands for the magnitude of changes in the involved dimensions and their dependencies. The subtraction of 1D divergence scores reduces the impact of dimensions contributing to more than one 2D score term.
Though KL divergence features a nice computation of the change score based on 1D and 2D distributions, these distributions need to be estimated, e.g. parametrically or by kernel density estimation [27]. Here, we aim at a purely nonparametric approach to maintain the exploratory nature of Light. Thus, we propose to generalize the result in Lemma 3.3 to divergence measures other than KL. In particular, for any base measure d we propose to set the change score to

score(p, q) = Σ_{(X_i, X_j) ∈ E} d(p(X_i, X_j) || q(X_i, X_j)) − λ · Σ_{X_i} (deg(X_i) − 1) · d(p(X_i) || q(X_i)),

where λ is a regularization factor that guarantees the score is non-negative. With our generalization, we give way to applying other nonparametric instantiations of d, e.g. the one in [21] with empirical computation in closed form, while still upholding the non-negativity property of KL divergence. An additional benefit of λ is that it can make the influence of dimensions with degree larger than one more prominent, i.e. more impact is given to the dimensions correlated to multiple other dimensions.
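As a worked sketch, combining precomputed 1D and 2D divergence scores along the spanning tree with the degree-based offset and regularization factor looks as follows; the inputs (per-edge and per-dimension scores under some base measure, and the value of λ) are assumed given.

```python
from collections import Counter

def light_score(edges, div2, div1, lam=0.5):
    """Combine 2D and 1D divergence scores along the spanning tree.
    edges: list of dimension pairs (i, j); div2[(i, j)] and div1[i] are
    precomputed non-negative divergences; lam is the regularization factor."""
    deg = Counter()
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    two_d = sum(div2[(i, j)] for i, j in edges)
    one_d = sum((deg[i] - 1) * div1[i] for i in deg)
    return two_d - lam * one_d
```

Dimensions of degree one contribute no 1D offset, so only dimensions shared by several tree edges are discounted.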
Before introducing the setting of λ, we show that the pca-based change detection methods [14, 24] – combined with the previous two steps of our framework – also generalize from Lemma 3.3, yet in a more restrictive manner. In particular, these methods estimate the overall divergence by summing divergence scores over individual pca dimensions; note that the forms they use are similar. From Lemma 3.3 we see that if every transformed dimension is independent of the others, then under KL the divergence of the joint distributions is exactly this sum of 1D divergence scores. Thus, pca-based methods [14, 24] also generalize from KL divergence; however, they impose the additional assumption that the transformed dimensions are statistically independent under the data of both the reference and the test window. We in turn do not impose this restriction and can capture correlation in the pca space, i.e. we provide a more general solution.
Choosing d. We use the quadratic measure of distribution divergence [21]. It is purely nonparametric and its empirical computation is in closed form. Under this measure, the 2D divergence of two distributions is the integrated squared difference of their cdfs, and the 1D divergence is defined similarly. We set the regularization factor λ following the lemma below.

Lemma. Setting d to the measure in [21] and λ accordingly, the change score is non-negative, with equality iff the distributions of the reference and test windows under the factorization model (cf. Section 3.2) are equal.
We postpone the proof to Appendix A.
Complexity analysis. The cost of computing the model for the initial reference window, and after every change, is dominated by the scalable pca mapping and factorization steps above. The cost of computing the divergence score for each new sample of the time series is low thanks to incremental computation; more details on this are in Appendix B. In case a large window size is required, e.g. due to the application scenario, we can further boost the efficiency of our method by sampling with replacement the data of the reference and test windows. The details are in Appendix C.
3.4 Change Score Thresholding
We perform an adaptive thresholding scheme to decide when a change indeed happens. The scheme we consider is the Page-Hinkley test, which has been used in [24]. Under this test, we keep track of the past change scores corresponding to time instants without change. For a new time instant, we assess how much its change score deviates from these historical scores, and raise the flag when the deviation is large (w.r.t. an easy-to-adapt threshold). More details are in [24].
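A standard Page-Hinkley test over the stream of change scores can be sketched as follows; the parameter values are illustrative assumptions, and [24] describes the variant actually used.

```python
class PageHinkley:
    """Page-Hinkley test: flag when the cumulative deviation of scores above
    their running mean exceeds a threshold. delta is a tolerance parameter."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta, self.threshold = delta, threshold
        self.reset()

    def reset(self):
        self.mean, self.cum, self.min_cum, self.t = 0.0, 0.0, 0.0, 0

    def update(self, score):
        self.t += 1
        self.mean += (score - self.mean) / self.t     # running mean of past scores
        self.cum += score - self.mean - self.delta    # cumulative deviation
        self.min_cum = min(self.min_cum, self.cum)
        if self.cum - self.min_cum > self.threshold:  # large upward deviation -> change
            self.reset()
            return True
        return False
```

Low, stable scores never trigger the test; a sustained jump in the score triggers it within a few samples.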
3.5 Summing Up
On a time series with changes, the total cost of Light is linear in the window size m, the original number of dimensions n, and the length of the time series. Being linear in n, it is suited to very high dimensional time series. Further, it can avoid large window sizes while still achieving high quality, as verified by our experiments in Section 5.
4 Related Work
pca-based change detection is studied in [14, 24]. Kuncheva and Faithfull [14] propose to use eigenvectors with small eigenvalues for the transformation. Qahtan et al. [24] in turn show theoretically and empirically that eigenvectors corresponding to large eigenvalues are more relevant. Both methods apply pca on the original reference windows and have cubic runtime in the number of dimensions. Further, they may miss complex changes due to the assumption that dimensions in the transformed space are independent, i.e. that dimensions in the original space have linear correlations only.
Change detection based on estimating the ratio between the distributions of the reference and test windows is introduced in [12, 18]. The main idea is to directly approximate this ratio by kernel models without approximating the two distributions. This procedure implicitly assumes that the data is uniformly distributed in the full dimensional space. When the dimensionality is large, we need a large window size to fill this space and uphold the assumption. Computing the distribution ratio analytically is costly in the window size; for efficiency purposes the window size must hence be kept small.

Song et al. [29] propose a divergence test based on kernel density estimation for change detection. By performing density estimation in the original high dimensional space, this test is susceptible to the curse of dimensionality [27]. While one could apply this test after distribution factorization, proving the non-negativity of the resulting score is non-trivial. Other change detection techniques on multivariate time series [1, 6, 10, 5] also work on the full dimensional space, and hence are likewise prone to the curse of dimensionality.
Our method, in contrast, alleviates the high dimensionality issue by scalable pca mapping using matrix sampling. Furthermore, it makes no assumption on the data distributions, nor that dimensions are linearly correlated. Lastly, it permits setting the window size to match the level of detail required by the domain expert.
5 Experiments
In this section, we empirically evaluate Light. In particular, we study its effectiveness in detecting known change points on both synthetic and real-world data sets. We implemented Light in Java, and make our code available for research purposes at http://eda.mmci.uni-saarland.de/light/. All experiments were performed single-threaded on an Intel Core i7-4600U CPU with 16 GB RAM. We report wall-clock running times.
We compare to pind [24] and spll [14], two state of the art methods for pca-based change detection. Both apply traditional pca, assuming that dimensions in the pca space are statistically independent. In addition, we consider rsif [18], which measures divergence scores by directly approximating the density ratio of distributions. For each competitor, we optimize parameter settings according to their respective papers. Light has four parameters: the number of sampled columns and the percentage of variance preserved, both used in scalable pca mapping; and the number of sketches and the number of average sketch values, both used in distribution factorization. Note that the last two parameters apply to any method using the AMS Sketch [2]. We use a fixed default setting of these parameters throughout, unless stated otherwise.
We experiment on both synthetic and real data sets. For the latter, we draw 7 data sets from the UCI Machine Learning Repository: Amazon, EMG Actions 1, EMG Actions 2, Human Activities, Human Postural Transitions, Sport Activities, and Youtube Video Games. All are high dimensional time series. As change points, we use existing class labels – a common practice in change detection [12, 29]. Table 1 summarizes the characteristics of these data sets.

Table 1: Characteristics of the real data sets.

Data                          Length   Dimensions
Amazon                        30 000       20 000
EMG Actions 1                  1 800        3 000
EMG Actions 2                  3 600        2 500
Human Activities              10 299          561
Human Postural Transitions    10 929          561
Sport Activities               9 120        5 625
Youtube Video Games          120 000       50 000
5.1 Synthetic Data Generation
Each time series we generate contains 100 segments, each having 2 000 time steps. We create the change point between every two consecutive segments by varying either the distributions or the correlation patterns of some dimensions. This means that each time series has 99 change points. We evaluate change detection techniques based on how well they retrieve these known change points. In line with previous work [14, 24], a change point detected by a method is considered a true positive if it is flagged correctly before too many points from the new distribution arrive. As performance metric, we use the F1 measure, the harmonic mean of precision and recall: F1 = 2 × precision × recall / (precision + recall). Note that the higher the F1 score, the better. Below we describe three different ways to create change points.
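The evaluation protocol above can be sketched as follows. The tolerance-based matching rule is our reading of the criterion; the exact tolerance used in the experiments is not restated here.

```python
def f1_score(detected, true_points, tolerance):
    """F1 for change detection: a detection counts as a true positive if it
    falls within `tolerance` points after an as-yet-unmatched true change."""
    matched = set()
    tp = 0
    for d in detected:
        for c in true_points:
            if c not in matched and 0 <= d - c < tolerance:
                matched.add(c)
                tp += 1
                break
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(true_points) if true_points else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

For example, with true changes at 100 and 200, detections at 105 and 400, and tolerance 50, exactly one detection matches, giving precision = recall = 0.5.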
Gaussian Distributions. In this setting, a subset of the dimensions follows a multivariate Gaussian distribution whose mean vector and covariance matrix are initialized randomly. We consider three types of change: (1) change in the mean vector, (2) change in individual variance terms, and (3) change in covariance terms. When creating segments, we switch among these three types of change in a round robin fashion. Each remaining dimension has its distribution fixed across segments.

Linear Correlations. We embed linear correlations among a subset of the dimensions. To model correlation in each segment, we first generate two disjoint sets of dimensions from fixed multivariate Gaussian distributions. Then, using a function f, we generate the dimensions of the second set from those of the first, adding Gaussian noise. We use two linear instantiations of f. When creating time series segments, we alternately pick the first and the second instantiation. In this way, in every two consecutive segments the correlations between the two sets of dimensions are different. Each remaining dimension is drawn from a fixed Gaussian distribution.
Nonlinear Correlations. To embed nonlinear correlations among dimensions, we follow the same procedure as for linear correlations, except that here we use four nonlinear and complex instantiations of f. When creating time series segments, we switch among the four instantiations in a round robin fashion. Each remaining dimension is drawn from a fixed Gaussian distribution.
5.2 Quantitative Results on Synthetic Data
We assess the quality of the methods tested under different values of the dimensionality and the window size. For quality against the dimensionality we fix the window size; for quality against the window size we fix the dimensionality. The results are in Figures 1(a)–1(f).
We find that Light consistently achieves the best performance across the different dimensionalities and window sizes, and the different types of change – from linear to highly nonlinear. Its quality improvement over pind and spll is up to 100%. pind and spll in turn do not perform well, most likely because they use unstable covariance matrices for pca transformation (note that the window size is well below the dimensionality in all cases tested). At the highest dimensionality, we have to stop the execution of pind and spll due to excessive runtime (more than 12 hours).
rsif does not perform well either (especially on nonlinear correlations), most likely because it assumes that data is uniformly distributed in the high dimensional space.
Light, in contrast, reliably detects changes in high dimensional time series with small window sizes. By making no assumptions on the data distributions, nor requiring dimensions to be linearly correlated, it copes well with different types of change and yields high quality results.
5.3 Efficiency Results using Synthetic Data
Here we test efficiency against the window size and the dimensionality; the setup is as above. We show representative results on data sets with nonlinear correlations in Figures 2(a) and 2(b). Results on other types of data sets are similar and hence skipped for brevity.

We see that in both cases Light is more efficient than all competitors. The performance improvement over pind and spll is more than 100% for high dimensionalities. Further, the experiments show that Light scales linearly with the window size and the dimensionality, which corroborates our analysis in Section 3.5.
5.4 Parameter Sensitivity
To assess the sensitivity of Light to its parameters, we reuse the synthetic data sets above, fixing the dimensionality and the window size. When testing against one parameter, we use the default setting for the other parameters.
The representative results on data sets with nonlinear correlations are in Figures 3(a)–3(d). We see that Light is very stable over different values of its parameters, which facilitates easy parameterization. The results also suggest that our default setting makes a reasonable choice in terms of both quality and efficiency.
5.5 Ablation Study
To study the impact of our design choices, we consider three variants of Light, each created by switching off one of its properties. The first variant applies our scalable pca mapping (see Section 3.1) but assumes that dimensions in the pca space are statistically independent; it could be seen as an extension of pind. The second variant applies our scalable pca mapping and then computes the change score using the joint distribution over the full pca space, without factorization; for it, we use the quadratic measure of divergence [21] with closed-form computation on empirical data. The third variant omits the pca mapping, i.e. factorization is performed on the original high dimensional space.
We show representative results on data sets with nonlinear correlations in Figures 4(a) and 4(b). We can see that Light outperforms all of its variants. That Light is better than the first variant highlights the importance of taking into account correlations in the pca space. That Light outperforms the second variant shows the effectiveness of our factorization step. Finally, the third variant tries to approximate very high dimensional distributions, which is much harder than approximating the lower dimensional ones Light works with; this explains why Light achieves better performance. In terms of runtime, the first two variants are faster than Light since they do not spend time factorizing the joint distribution in each pca space; the difference, however, is negligible.
5.6 Results on Real Data
We now study the performance of Light on real data. We give an overview of the data sets in Table 1. Each of them is a labeled time series; as change points, we use the class labels, a common practice in the change detection literature [12, 29]. For Human Activities (HAR) and Human Postural Transitions (HAPT), as their numbers of dimensions are only in the hundreds, we reduce the sampling parameters of Light accordingly. For the other time series, we use the default parameter setting. We evaluate quality using the F1 measure.
The results are in Tables 2 and 3. Overall, we can see that Light consistently yields the best quality and is the most efficient across all time series. Notice that EMG1, EMG2, and Sport are relatively short while having many dimensions. For effective change detection on such time series, it is necessary to use small window sizes. This is an issue for pind and spll: they have to perform pca transformation on very unstable covariance matrices, which could explain why they do not perform well on EMG1, EMG2, and Sport. On the Amazon and Youtube data, the runtime of pind and spll exceeds 12 hours. Light, in contrast, achieves high accuracy on all high dimensional time series tested. Further, it finishes within 1.5 hours even on the 20 000 dimensional Amazon and 50 000 dimensional Youtube data, and has very high accuracy on both.
Table 2: F1 scores on real data (higher is better; '–' marks runs stopped after 12 hours).

Data     Light  pind  spll  rsif
Amazon   0.91   –     –     0.64
EMG1     0.77   0.48  0.45  0.72
EMG2     0.84   0.41  0.44  0.67
HAR      0.83   0.62  0.55  0.70
HAPT     0.85   0.68  0.62  0.71
Sport    0.94   0.51  0.46  0.84
Youtube  0.93   –     –     0.76
Average  0.87   0.54  0.50  0.72
Table 3: Wall-clock runtime in seconds ('–' marks runs stopped after 12 hours).

Data     Light   pind    spll    rsif
Amazon   1273.6  –       –       1944.5
EMG1     1.2     92.6    98.1    3.1
EMG2     1.1     345.7   341.5   2.3
HAR      2.9     11.2    12.8    3.3
HAPT     2.4     12.5    12.4    3.3
Sport    5.6     1295.7  1280.4  11.9
Youtube  4863.5  –       –       7338.4
Average  878.6   –       –       1329.5
6 Discussion
The experiments show that Light is very efficient and yields high quality for change detection in very high dimensional time series containing different types of change, be they linear or nonlinear. Furthermore, it allows small window sizes to be used. This makes it applicable to different types of time series, e.g. those where the dimensionality is even larger than the series length, such as EMG1. Its good performance over the competition suggests the following. First, our scalable pca transformation is much more effective than traditional pca mapping when it comes to high dimensionality; the benefits lie in both quality and efficiency. Second, our distribution factorization yields better quality than using joint distributions or assuming statistical independence in pca spaces. That Light outperforms its competitors and its variants could also be attributed to our new divergence measure, which can capture changes in both linear and nonlinear structures.
Yet, there is room for alternative methods as well as further improvements. For instance, in this paper we use pca for dimension reduction; other methods, e.g. canonical correlation analysis [4], could also be used. Besides the matrix sampling method we employ, it is also interesting to explore related techniques, such as [26, 9]. The details, however, are beyond the scope of this work. In addition, we here pursue the nonparametric setting; when knowledge of the data distributions is available, one can resort to parametric methods to compute other divergence measures, e.g. Kullback-Leibler divergence or Jensen-Shannon divergence. As future work, we plan to extend Light to time series with mixed data types, e.g. those with numeric and categorical dimensions. This will enrich the applicability of Light in real-world settings.
7 Conclusion
In this paper, we studied the problem of change detection in very high dimensional time series, a setting that poses both efficiency and quality issues. To address these, we proposed Light. In short, it works in three steps: (1) scalable pca mapping to reduce dimensionality, (2) scalable factorization of joint distributions in pca spaces to increase robustness, and (3) scalable computation of divergence scores on the factorized distributions. Experiments on both synthetic and real-world data show that Light outperforms the state of the art, with up to 100% improvement in both quality and efficiency.
Acknowledgements
The authors are supported by the Cluster of Excellence “Multimodal Computing and Interaction” within the Excellence Initiative of the German Federal Government.
References
 [1] C. C. Aggarwal. A framework for change diagnosis of data streams. In SIGMOD Conference, pages 575–586, 2003.
 [2] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In STOC, pages 20–29, 1996.
 [3] A. Bifet and R. Gavaldà. Learning from time-changing data with adaptive windowing. In SDM, pages 443–448, 2007.
 [4] B. Chang, U. Krüger, R. Kustra, and J. Zhang. Canonical correlation analysis based on Hilbert-Schmidt independence criterion and centered kernel target alignment. In ICML (2), pages 316–324, 2013.
 [5] T. Dasu, S. Krishnan, D. Lin, S. Venkatasubramanian, and K. Yi. Change (detection) you can believe in: Finding distributional shifts in data streams. In IDA, pages 21–34, 2009.
 [6] F. Desobry, M. Davy, and C. Doncarli. An online kernel change detection algorithm. IEEE Trans. Signal Process., 53(8):2961–2974, 2005.
 [7] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM J. Comput., 36(1):158–183, 2006.
 [8] M. L. Fredman and R. E. Tarjan. Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the ACM, 34(3):596–615, 1987.
 [9] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
 [10] Z. Harchaoui, F. R. Bach, and E. Moulines. Kernel change-point analysis. In NIPS, pages 609–616, 2008.
 [11] J.-i. Takeuchi and K. Yamanishi. A unifying framework for detecting outliers and change points from time series. IEEE Trans. Knowl. Data Eng., 18(4):482–492, 2006.
 [12] Y. Kawahara and M. Sugiyama. Change-point detection in time-series data by direct density-ratio estimation. In SDM, pages 389–400, 2009.
 [13] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In VLDB, pages 180–191, 2004.
 [14] L. I. Kuncheva and W. J. Faithfull. PCA feature extraction for change detection in multidimensional unlabeled data. IEEE Trans. Neural Netw. Learning Syst., 25(1):69–80, 2014.
 [15] O. Ledoit and M. Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411, 2004.
 [16] J. A. Lee and M. Verleysen. Nonlinear Dimensionality Reduction. Springer, New York, 2007.
 [17] E. Liberty. Simple and deterministic matrix sketching. In KDD, pages 581–588, 2013.
 [18] S. Liu, M. Yamada, N. Collier, and M. Sugiyama. Change-point detection in time-series data by relative density-ratio estimation. Neural Networks, 43:72–83, 2013.
 [19] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker. Identifying suspicious URLs: an application of large-scale online learning. In ICML, pages 681–688, 2009.
 [20] H. V. Nguyen, E. Müller, and K. Böhm. 4S: Scalable subspace search scheme overcoming traditional apriori processing. In BigData Conference, pages 359–367, 2013.
 [21] H. V. Nguyen, E. Müller, J. Vreeken, and K. Böhm. Unsupervised interaction-preserving discretization of multivariate data. Data Min. Knowl. Discov., 28(5-6):1366–1397, 2014.
 [22] H. V. Nguyen and J. Vreeken. Nonparametric Jensen-Shannon divergence. In ECML/PKDD, pages 173–189, 2015.
 [23] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11):559–572, 1901.
 [24] A. A. Qahtan, B. Harbi, S. Wang, and X. Zhang. A PCA-based change detection framework for multidimensional data streams. In KDD, pages 935–944, 2015.
 [25] J. L. ReyesOrtiz, L. Oneto, A. Ghio, A. Samá, D. Anguita, and X. Parra. Human activity recognition on smartphones with awareness of basic activities and postural transitions. In ICANN, pages 177–184, 2014.
 [26] T. Sarlós. Improved approximation algorithms for large matrices via random projections. In FOCS, pages 143–152, 2006.
 [27] D. W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons Inc, New York, 1992.
 [28] S. Seth, M. Rao, I. Park, and J. C. Príncipe. A unified framework for quadratic measures of independence. IEEE Transactions on Signal Processing, 59(8):3624–3635, 2011.
 [29] X. Song, M. Wu, C. M. Jermaine, and S. Ranka. Statistical change detection for multidimensional data. In KDD, pages 667–676, 2007.
 [30] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
 [31] J. Whittaker. Graphical Models in Applied Multivariate Statistics. John Wiley & Sons, 1990.
Appendix A Proofs
[Lemma 3.3] By definition we have that
From Section 3.2 and based on the conventions of our framework, we have and . Using this information, we deduce that
Thus, we arrive at the result.
[Lemma 3.3] First, we prove that
In particular we have that and similarly for . Thus,
The inequality is in fact the Cauchy-Schwarz inequality. Hence,
The right hand side of the above inequality is indeed .
As s are obtained by pca it holds that
We now need to prove that for all , we can choose for each a set of terms of the form such that and different s will not share any common term. First, as the number of edges of the maximum spanning tree is , we have that
Now we pick the terms for all as follows. We consider edges whose one endpoint is a leaf. When it must be the case that either or . Assuming that the former holds we remove edge and assign term to . We carry on until all edges are removed. Note that under this procedure a node is not removed from the tree until becomes 1. This also means that when is removed there have been terms assigned to it. This completes the second part of the proof.
With the results in the first part and the second part, we arrive at . Equality happens when for and for . This means that and under the factorization model are equal.
When and are equal, we have that for and for . Thus, . We complete our proof.
Appendix B Incremental Computation of Divergence Score
We illustrate the idea of incrementally computing where . Incrementally computing where follows straightforwardly.
Assume that the empirical data forming contains data points . Analogously, we denote as the data points forming . According to [21],
Thus, can be factorized per data point. By storing the contribution of each point to , we can incrementally update this score in time. We have score terms in total. Thus, the cost to compute divergence score for each new sample of the time series is .
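The per-point factorization above can be sketched as a running sum of stored contributions over a sliding window. The class and the contribution function `phi` below are illustrative placeholders, not Light's actual implementation; the point is only that evicting the oldest stored contribution and adding the newest yields a constant-time update per sample.

```python
from collections import deque

class IncrementalScore:
    """Maintain a score of the form S = (1/n) * sum_i phi(x_i) over a
    sliding window, updating in O(1) per new sample by storing each
    point's contribution (phi stands in for the per-point score term)."""

    def __init__(self, window_size, phi):
        self.phi = phi                         # per-point contribution function
        self.window = deque(maxlen=window_size)  # stored contributions
        self.total = 0.0                       # running sum of contributions

    def update(self, x):
        # Evict the oldest stored contribution once the window is full;
        # deque(maxlen=...) drops the leftmost element on append.
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]
        c = self.phi(x)
        self.window.append(c)
        self.total += c
        return self.total / len(self.window)
```

For example, with `phi(x) = x**2` and a window of size 3, feeding 1, 2, 3, 4 gives windowed averages over {1, 4, 9} and then {4, 9, 16} without ever recomputing the full sum.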
Appendix C Scaling to Large Window Sizes
The idea is that is computed based on terms of the form where and are estimated from the data of and , respectively. An implicit assumption here is that the data of (similarly for ) are i.i.d. samples of . By definition, i.i.d. samples are obtained by randomly sampling from an infinite population or by randomly sampling with replacement from a finite population. In both cases, the distribution of the i.i.d. samples is assumed to be identical to the distribution of the population. This holds especially when the sample size is very large [27]. Thus, when is very large, the empirical distribution formed by approaches the true distribution . Assume now that we randomly draw with replacement samples of where . As mentioned above, these subsamples are i.i.d. samples of . As with any set of i.i.d. samples of reasonable size, we can assume that the empirical distribution formed by the subsamples is identical to . Thus, we can use them to approximate . In that case, the complexity of computing for initial and or after every change is reduced to . When subsampling is needed, we use it only once for each .
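A minimal sketch of this subsampling step, assuming the window is simply a list of data points (the function name and interface are ours, for illustration only):

```python
import random

def subsample_with_replacement(window, m, seed=None):
    """Draw m samples with replacement from the data points in a window.

    For m much smaller than len(window), the empirical distribution of
    the subsample approximates that of the window, so downstream density
    estimates can be computed on m points instead of len(window) points.
    """
    rng = random.Random(seed)  # seeded for reproducibility in experiments
    return [window[rng.randrange(len(window))] for _ in range(m)]
```

Downstream divergence estimation then runs on the `m` returned points; since the subsample is drawn once per window rather than per time step, the sampling cost is amortized away.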