Given multiple time series data (e.g., measurements from multiple sensors) and a time range (e.g., 1:00 am - 3:00 am yesterday), how can we efficiently discover latent factors of the time series in the range? Revealing hidden factors in time series is important for analyzing the patterns and tendencies encoded in the data. Singular value decomposition (SVD) effectively finds hidden factors in data, and has been extensively utilized in many data mining applications such as dimensionality reduction (Ravi Kanth et al., 1998), principal component analysis (PCA) (Jolliffe, 2002; Wall et al., 2003), data clustering (Simek et al., 2004; Osiński et al., 2004), tensor analysis (Sael et al., 2015; Jeon et al., 2015; Jeon et al., 2016b; Jeon et al., 2016a; Park et al., 2016; Oh et al., 2018), graph mining (Kang et al., 2012; Tong et al., 2006; Kang et al., 2011, 2014), and recommender systems (Koren et al., 2009; Park et al., 2017). SVD has also been successfully applied to stream mining tasks (Wall et al., 2003; Spiegel et al., 2011) in order to analyze time series data.
However, methods based on standard SVD (Brand, 2003; Ross et al., 2008; Zadeh et al., 2016; Halko et al., 2011) are not suitable for finding latent factors in an arbitrary time range: they have an expensive computational cost, and they have to store all the raw data. This limitation makes it difficult to investigate the patterns of a time range in a streaming environment, even though analyzing a specific past event or finding recurring patterns in time series is important (Papadimitriou and Yu, 2006). A naive approach for a time range query on time series is to store all of the arrived data and apply SVD to the data in the queried range, but this approach is inefficient since it requires huge storage space, and the computational cost of SVD for a long time range query is expensive.
In this paper, we propose Zoom-SVD (Zoomable SVD), an efficient method for revealing hidden factors of multiple time series in an arbitrary time range. With Zoom-SVD, users can zoom in to find patterns in a specific time range of interest, or zoom out to extract patterns in a wider time range. Zoom-SVD comprises two phases: the storage phase and the query phase. Zoom-SVD considers multiple time series as a set of blocks of a fixed length. In the storage phase, Zoom-SVD carefully compresses each block using SVD and low-rank approximation to reduce the storage cost, and incrementally updates the most recent block as new data arrive. In the query phase, Zoom-SVD efficiently computes the SVD result for a given time range based on the compressed blocks. Through extensive experiments with real-world multiple time series data, we demonstrate the effectiveness and the efficiency of Zoom-SVD compared to other methods, as shown in Figure 1. The main contributions of this paper are summarized as follows:
Algorithm. We propose Zoom-SVD, an efficient method for extracting key patterns from multiple time series data in an arbitrary time range.
Analysis. We theoretically analyze the time and the space complexities of our proposed method Zoom-SVD.
Experiment. We present experimental results showing that Zoom-SVD computes time range queries significantly faster, and requires significantly less space, than other methods. We also confirm that our proposed method Zoom-SVD provides the best trade-off between efficiency and accuracy.
The code and datasets for this paper are available at http://datalab.snu.ac.kr/zoomsvd. In the rest of this paper, we describe the preliminaries and formally define the problem in Section 2, propose our method Zoom-SVD in Section 3, present experimental results in Section 4, present a case study in Section 5, discuss related work in Section 6, and conclude in Section 7.
| Symbol | Definition |
|---|---|
| $b$ | initial block size |
| $\xi$ | threshold for low-rank approximation |
| $k$ | number of singular values |
| $k_i$ | number of singular values in the $i$-th block |
| $[A; B]$ | vertical concatenation of two matrices $A$ and $B$ |
| $X$ | raw multiple time series data |
| $B^{(i)}$ | $i$-th block of $X$ |
| $U^{(i)}$ | left singular vector matrix of $B^{(i)}$ |
| $\Sigma^{(i)}$ | singular value matrix of $B^{(i)}$ |
| $V^{(i)}$ | right singular vector matrix of $B^{(i)}$ |
| $U_q$ | left singular vector matrix computed in the query phase |
| $\Sigma_q$ | singular value matrix computed in the query phase |
| $V_q$ | right singular vector matrix computed in the query phase |
| $S_U$ | set of left singular vector matrices |
| $S_\Sigma$ | set of singular value matrices |
| $S_V$ | set of right singular vector matrices |
| $(t_s, t_e)$ | time range query |
| $t_s$ | starting point of the time range query |
| $t_e$ | ending point of the time range query |
| $i_s$ | index of the block matrix corresponding to $t_s$ |
| $i_e$ | index of the block matrix corresponding to $t_e$ |
We describe preliminaries on singular value decomposition (SVD) and incremental SVD (Sections 2.1 and 2.2). We then define the problem handled in this paper (Section 2.3). Table 1 lists the symbols used in this paper.
2.1. Singular Value Decomposition (SVD)
SVD is a decomposition method for finding latent factors in a matrix $A \in \mathbb{R}^{n \times m}$. Suppose the rank of the matrix $A$ is $r$. Then, the SVD of $A$ is represented as $A = U \Sigma V^T$, where $\Sigma \in \mathbb{R}^{r \times r}$ is a diagonal matrix whose diagonal entries are singular values. The $i$-th singular value $\sigma_i$ is located at $\Sigma_{ii}$, where $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$. $U \in \mathbb{R}^{n \times r}$ is called the left singular vector matrix (or a set of left singular vectors) of $A$;
$U$ is a column orthogonal matrix (i.e., $U^T U = I$) whose columns $u_1, \ldots, u_r$
are the eigenvectors of $AA^T$. $V \in \mathbb{R}^{m \times r}$ is the right singular vector matrix of $A$; $V$ is a column orthogonal matrix (i.e., $V^T V = I$) whose columns $v_1, \ldots, v_r$ are the eigenvectors of $A^T A$. Note that the singular vectors in $U$ and $V$ are used as hidden factors to analyze the data matrix $A$.
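The properties above are easy to verify numerically. The following minimal numpy sketch (a toy illustration, not from the paper) checks column orthogonality, the descending order of the singular values, and the eigenvector relation with $A^T A$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))  # toy data matrix with n=8 rows, m=4 columns

# Thin SVD: A = U @ diag(s) @ Vt; numpy returns singular values in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

assert np.allclose(U.T @ U, np.eye(4))      # U is column orthogonal
assert np.allclose(Vt @ Vt.T, np.eye(4))    # V is column orthogonal
assert np.all(np.diff(s) <= 0)              # singular values are descending
assert np.allclose(U @ np.diag(s) @ Vt, A)  # exact reconstruction at full rank

# The squared singular values are the eigenvalues of A^T A,
# and the columns of V are the corresponding eigenvectors
evals = np.linalg.eigvalsh(A.T @ A)[::-1]   # eigh order is ascending; reverse it
assert np.allclose(evals, s**2)
```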
Low-rank approximation. Low-rank approximation effectively approximates the original data matrix based on SVD. The key idea of low-rank approximation is to keep the top-$k$ highest singular values and the corresponding singular vectors, where $k$ is a number smaller than the rank $r$ of the original matrix. The low-rank approximation of $A$ is represented as follows:

$$A \approx \tilde{A} = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T$$

where the reconstruction $\tilde{A}$ is the rank-$k$ approximation of $A$, $U_k \in \mathbb{R}^{n \times k}$, $\Sigma_k \in \mathbb{R}^{k \times k}$, and $V_k \in \mathbb{R}^{m \times k}$. The error of the low-rank approximation is represented as follows:

$$\|A - \tilde{A}\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}$$

where $\|\cdot\|_F$ is the Frobenius norm of a matrix, and $r$ is the rank of the original matrix. The parameter $k$ for low-rank approximation is determined by the following equation:

$$k = \min \left\{ k' \;\middle|\; \frac{\sum_{i=1}^{k'} \sigma_i^2}{\sum_{i=1}^{r} \sigma_i^2} \ge \xi \right\}$$

where $\xi$ is a threshold between $0$ and $1$.
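As an illustration of the truncation rule, here is a small numpy sketch. The helper name `low_rank_approx` is ours, and it assumes the energy-based criterion above: $k$ is the smallest value whose retained squared singular values reach the threshold $\xi$.

```python
import numpy as np

def low_rank_approx(A, xi=0.95):
    """Truncated SVD keeping the smallest k whose retained energy
    (sum of the top-k squared singular values) reaches xi."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)   # retained energy ratio per k
    k = int(np.searchsorted(energy, xi) + 1)  # smallest k reaching xi
    return U[:, :k], s[:k], Vt[:k, :]

# Diagonal test matrix with known singular values 3, 2, 0.01
A = np.zeros((6, 4))
A[0, 0], A[1, 1], A[2, 2] = 3.0, 2.0, 0.01

Uk, sk, Vtk = low_rank_approx(A, xi=0.99)
A_hat = Uk @ np.diag(sk) @ Vtk

assert len(sk) == 2                               # the tiny third value is dropped
# Frobenius error equals the norm of the discarded singular values
assert np.isclose(np.linalg.norm(A - A_hat), 0.01)
```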
2.2. Incremental SVD
Incremental SVD dynamically calculates the SVD result of a matrix as new data rows arrive. Suppose that we have the SVD result $U_t$, $\Sigma_t$, and $V_t$ of a data matrix $A_t \in \mathbb{R}^{n \times m}$ at time $t$. When a $p \times m$ matrix $E$ arrives at time $t+1$, the purpose of incremental SVD is to efficiently obtain the SVD result of $A_{t+1} = [A_t; E]$ based on the previous result $U_t$, $\Sigma_t$, and $V_t$. Note that $[A; B]$ denotes the vertical concatenation of two matrices $A$ and $B$. Incremental SVD has been used to analyze patterns in time series data (Sarwar et al., 2002), and several efficient methods for incremental SVD have been proposed (Brand, 2003; Ross et al., 2008). This incremental SVD technique is exploited in our method to incrementally compress and store the data (see Algorithm 1 in Section 3.2).
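A common way to realize this update is to redecompose only the small matrix $[\Sigma_t V_t^T; E]$ instead of the full history. The sketch below is a simplified illustration of that idea, not the exact algorithm of any cited method:

```python
import numpy as np

def incremental_svd(U, s, Vt, E):
    """Update a thin SVD A = U diag(s) Vt when new rows E arrive, so the
    result is the SVD of np.vstack([A, E]). Only a small (k+p) x m matrix
    is decomposed, which is the source of the savings."""
    k, p = U.shape[1], E.shape[0]
    Z = np.vstack([np.diag(s) @ Vt, E])       # small (k+p) x m matrix
    Uz, s_new, Vt_new = np.linalg.svd(Z, full_matrices=False)
    # Build [[U, 0], [0, I_p]] and rotate it by Uz to get the new left factors
    big = np.zeros((U.shape[0] + p, k + p))
    big[:U.shape[0], :k] = U
    big[U.shape[0]:, k:] = np.eye(p)
    return big @ Uz, s_new, Vt_new

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))              # data seen so far
E = rng.standard_normal((3, 5))               # newly arrived rows

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U2, s2, Vt2 = incremental_svd(U, s, Vt, E)

# Matches an SVD computed from scratch on the concatenated matrix
assert np.allclose(s2, np.linalg.svd(np.vstack([A, E]), compute_uv=False))
assert np.allclose(U2 @ np.diag(s2) @ Vt2, np.vstack([A, E]))
```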
2.3. Problem Definition
We formally define the time range query problem as follows:
Problem 1 (Time Range Query on Multiple Time Series).
Given: a time range $(t_s, t_e)$ and multiple time series data represented by a matrix $X \in \mathbb{R}^{t \times c}$, where $t$ is the length of the time dimension and $c$ is the number of time series,
Find: the SVD result of the sub-matrix $X(t_s, t_e)$ of $X$ in the time range quickly, without storing all of $X$. The SVD result includes $U_q \in \mathbb{R}^{(t_e - t_s + 1) \times r}$, $\Sigma_q \in \mathbb{R}^{r \times r}$, and $V_q \in \mathbb{R}^{c \times r}$, where $r$ is the rank of the sub-matrix.
Applying standard SVD or incremental SVD to the time range query is impractical for the following reasons. Standard SVD needs to extract the sub-matrix corresponding to the time range before performing the decomposition. Iwen et al. (Iwen and Ong, 2016) proposed a hierarchical and distributed approach for computing $\Sigma$ and $V$, but not $U$, of a whole matrix $X$. Zadeh et al. (Zadeh et al., 2016) introduced Tall and Skinny SVD, which obtains $V$ and $\Sigma$ by computing the eigen-decomposition of $X^T X$, and then computes $U$ using $X$, $V$, and $\Sigma$. Halko et al. (Halko et al., 2011) proposed Randomized SVD, which computes the SVD of $X$ using randomized approximation techniques. However, such methods are inefficient here because they need to compute SVDs from scratch for multiple overlapping queries. Furthermore, those methods need to keep the entire time series data $X$, which is practically infeasible in many streaming applications. Incremental SVD considers updates only on newly added data, and thus cannot perform SVD on a specific time range.
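For concreteness, the Tall and Skinny approach described above can be sketched in a few lines of numpy; `tall_skinny_svd` is an illustrative name, and the sketch assumes $X$ has full column rank:

```python
import numpy as np

def tall_skinny_svd(X):
    """SVD of a tall matrix via eigen-decomposition of the small c x c
    Gram matrix X^T X, as in the Tall-and-Skinny approach sketched above.
    Assumes X has full column rank."""
    G = X.T @ X                               # c x c, c << t
    evals, V = np.linalg.eigh(G)              # eigh returns ascending order
    order = np.argsort(evals)[::-1]           # reorder to descending
    s = np.sqrt(np.clip(evals[order], 0, None))
    V = V[:, order]
    U = X @ V / s                             # U = X V Sigma^{-1}
    return U, s, V.T

rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 4))            # tall and skinny: t >> c

U, s, Vt = tall_skinny_svd(X)
assert np.allclose(s, np.linalg.svd(X, compute_uv=False))  # same spectrum
assert np.allclose(U @ np.diag(s) @ Vt, X)                 # reconstructs X
assert np.allclose(U.T @ U, np.eye(4))                     # U is column orthogonal
```

Note that even this cheap variant still needs the raw matrix $X$ to form $X^T X$, which is exactly what a streaming setting cannot afford.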
To address these limitations, we propose an efficient method for the time range query in Section 3.
3. Proposed Method
We propose Zoom-SVD, a fast and space-efficient method for extracting key patterns from multiple time series data in an arbitrary time range. We first give an overview of Zoom-SVD in Section 3.1. We describe details of Zoom-SVD in Sections 3.2 and 3.3. Finally, we analyze Zoom-SVD’s time and space complexities in Section 3.4.
3.1. Overview

Zoom-SVD efficiently extracts key patterns from multiple time series data in an arbitrary time range using SVD. The main challenges for the time range query problem (Problem 1) are as follows:
Minimize the space cost. The amount of multiple time series data increases over time. How can we reduce the space while supporting time range queries?
Minimize the time cost. How can we quickly compute SVD of multiple time series data in an arbitrary time range?
We address the above challenges with the following ideas:
Compress multiple time series data (Section 3.2). Zoom-SVD compresses the raw data using incremental SVD, and discards the raw data in the storage phase.
Optimize the computational time of Stitched-SVD (Section 3.3.2). We optimize the performance of Stitched-SVD by reducing numerical computations using a block matrix structure.
Zoom-SVD comprises two phases: the storage phase and the query phase. In the storage phase (Algorithm 1), Zoom-SVD stores the SVD results corresponding to length-$b$ blocks of the time series data in order to support time range queries, as shown in Figure 2. When new data arrive, Zoom-SVD incrementally updates the SVD result with the newly arrived data, block by block. In the query phase (Algorithm 2), Zoom-SVD returns the SVD result for a given time range $(t_s, t_e)$. The query phase utilizes our proposed Partial-SVD and Stitched-SVD modules to process the time range query. Partial-SVD (Algorithm 3) manipulates the SVD results of the blocks containing $t_s$ or $t_e$ to match the query time range, as shown in Figure 2. Stitched-SVD (Algorithm 2) efficiently computes the SVD result between $t_s$ and $t_e$ by stitching the SVD results of the blocks in the time range.
3.2. Storage Phase of Zoom-SVD
Given multiple time series stream , the objective of the storage phase is to incrementally compress the input data and discard the original input data to achieve space efficiency. A naive incremental SVD would update one large SVD result when the data are newly added. However, this approach is impractical because the processing cost for the newly added data increases over time. Also, the naive incremental SVD does not support a time range query quickly in the query phase because it manipulates the large SVD result stored for the total time regardless of the query time range.
The storage phase of Zoom-SVD (Algorithm 1) is designed to efficiently process newly added data and quickly support time range queries. Given multiple time series data, the storage phase incrementally compresses the input data block by block using incremental SVD, and discards the original input data to reduce the space cost. Assume the multiple time series data are represented by a matrix $X \in \mathbb{R}^{t \times c}$, where $t$ is the time length and $c$ is the number of time series (e.g., sensors). We conceptually divide the matrix $X$ into length-$b$ blocks represented by $B^{(i)}$, as shown in Figure 2. We then store the low-rank approximation result of each block matrix $B^{(i)}$, where we exploit an incremental SVD method in the process. We formally define the block matrix in Definition 1.
Definition 1 (Block matrix $B^{(i)}$).
Suppose a multivariate time series is $X = [x_1; \cdots; x_t]$, where $x_j \in \mathbb{R}^{1 \times c}$ is the $j$-th row vector of $X$, and $[\,\cdot\,;\,\cdot\,]$ denotes the vertical concatenation of vectors. The $i$-th block matrix $B^{(i)} \in \mathbb{R}^{b \times c}$ is then represented as follows:

$$B^{(i)} = \left[x_{(i-1)b + 1};\; x_{(i-1)b + 2};\; \cdots;\; x_{ib}\right]$$

where $b$ is the block size. In addition, $B^{(i)}_t$ denotes the $i$-th block matrix at time $t$, where $i = \lceil t / b \rceil$ indicates the index of the most recent block, as shown in Figure 2. Note that the number of rows in $B^{(i)}_t$ is less than or equal to $b$.
The computed SVD results $U^{(i)}$, $\Sigma^{(i)}$, and $V^{(i)}$ of each block matrix $B^{(i)}$ are stored as follows.
Definition 2 (Sets of SVD results $S_U$, $S_\Sigma$, and $S_V$).
The sets $S_U$, $S_\Sigma$, and $S_V$ store the SVD results $U^{(i)}$, $\Sigma^{(i)}$, and $V^{(i)}$ for all $i$, respectively.
Note that the original time series data are discarded, and we store only the SVD results which occupy less space than the original data. The SVD results for block matrices are used in the query phase (Algorithm 2). Now we are ready to describe the details of the storage phase.
The storage phase (Algorithm 1) compresses the multiple time series data block by block using incremental SVD to support time range queries. When a new row $x_{t+1} \in \mathbb{R}^{1 \times c}$ of the multiple time series is given at time $t+1$ (line 2), we start a new SVD result for the next block matrix $B^{(i+1)}$ if the SVD result of the current block has already been stored in $S_U$, $S_\Sigma$, and $S_V$ at time $t$ (lines 3 and 4). If not, we have the SVD result $U^{(i)}_t$, $\Sigma^{(i)}_t$, and $V^{(i)}_t$ of the most recent block matrix $B^{(i)}_t$, which is the $i$-th block matrix at time $t$ (i.e., the block matrix from time $(i-1)b + 1$ to $t$, as seen in Figure 2). We then update this SVD result into $U^{(i)}_{t+1}$, $\Sigma^{(i)}_{t+1}$, and $V^{(i)}_{t+1}$ for the new data using an incremental SVD method (line 6). If the number of rows of $B^{(i)}_{t+1}$ reaches $b$, we put the SVD result $U^{(i)}_{t+1}$, $\Sigma^{(i)}_{t+1}$, and $V^{(i)}_{t+1}$ into $S_U$, $S_\Sigma$, and $S_V$, respectively (lines 8-11). Equations (2) and (3) present the details of how to update the SVD result of $B^{(i)}_{t+1}$ for the new incoming data $x_{t+1}$, when $B^{(i)}_t$ contains fewer than $b$ rows. $B^{(i)}_{t+1}$ is represented by $x_{t+1}$ and the SVD result of $B^{(i)}_t$ in Equation (2):

$$B^{(i)}_{t+1} = \begin{bmatrix} B^{(i)}_t \\ x_{t+1} \end{bmatrix} = \begin{bmatrix} U^{(i)}_t \Sigma^{(i)}_t V^{(i)T}_t \\ x_{t+1} \end{bmatrix} = \begin{bmatrix} U^{(i)}_t & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \Sigma^{(i)}_t V^{(i)T}_t \\ x_{t+1} \end{bmatrix} \qquad (2)$$

We compute the SVD of the small matrix $\left[\Sigma^{(i)}_t V^{(i)T}_t;\; x_{t+1}\right] = U_s \Sigma_s V_s^T$, and obtain

$$U^{(i)}_{t+1} = \begin{bmatrix} U^{(i)}_t & 0 \\ 0 & 1 \end{bmatrix} U_s, \qquad \Sigma^{(i)}_{t+1} = \Sigma_s, \qquad V^{(i)}_{t+1} = V_s \qquad (3)$$

Note that $U^{(i)}_{t+1}$ is a column orthogonal matrix since it is the product of two column orthogonal matrices. $V^{(i)}_{t+1}$ is also column orthogonal, and $\Sigma^{(i)}_{t+1}$ is a diagonal matrix whose diagonal entries are sorted in descending order. Hence, $U^{(i)}_{t+1}$, $\Sigma^{(i)}_{t+1}$, and $V^{(i)}_{t+1}$ constitute the SVD result of $B^{(i)}_{t+1}$ by the definition of SVD (Trefethen and Bau III, 1997). Once the number of rows of $B^{(i)}_t$ reaches $b$, the time index $t$ can be omitted, as in $B^{(i)}$, $U^{(i)}$, $\Sigma^{(i)}$, and $V^{(i)}$, as described in Definitions 1 and 2.
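The storage phase can be sketched as follows. This is a simplified, hypothetical illustration (`storage_phase` is not the paper's code): it decomposes each complete block at once, whereas Algorithm 1 builds each block's SVD incrementally row by row.

```python
import numpy as np

def storage_phase(X, b, xi=0.99):
    """Simplified sketch of the storage phase: split the stream X into
    length-b blocks and keep only a truncated SVD of each block."""
    SU, SS, SV = [], [], []
    for start in range(0, X.shape[0], b):
        B = X[start:start + b]                    # i-th block matrix B^(i)
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        energy = np.cumsum(s**2) / np.sum(s**2)   # retained energy ratio
        k = int(np.searchsorted(energy, xi) + 1)  # smallest k reaching xi
        SU.append(U[:, :k]); SS.append(s[:k]); SV.append(Vt[:k])
    return SU, SS, SV                             # raw X can now be discarded

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 8))  # ~rank-2 stream
SU, SS, SV = storage_phase(X, b=25)
assert len(SU) == 4                               # 100 rows / block size 25

# Each compressed block reconstructs its slice within the energy bound:
# discarded energy <= 1 - xi, so relative Frobenius error <= sqrt(1 - xi)
for i in range(4):
    B = X[25 * i: 25 * (i + 1)]
    U, s, Vt = SU[i], SS[i], SV[i]
    err = np.linalg.norm(U @ np.diag(s) @ Vt - B) / np.linalg.norm(B)
    assert err <= np.sqrt(1 - 0.99) + 1e-9
```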
3.3. Query Phase of Zoom-SVD
Given the starting point $t_s$ and the ending point $t_e$ of a time range query, the goal of the query phase of Zoom-SVD is to obtain the SVD result of the time range from $t_s$ to $t_e$. A naive approach would reconstruct the time series data from the SVD results of the block matrices ranged between $t_s$ and $t_e$, and perform SVD on the reconstructed data in the range. However, this approach requires heavy computation, especially for a long time range query, and thus is not appropriate for serving time range queries quickly.
We propose two sub-modules, Partial-SVD and Stitched-SVD, which are used in the query phase of our proposed method (Algorithm 2) to efficiently process time range queries while avoiding reconstruction of the raw data. Let $i_s$ be the index of the block matrix including $t_s$, and $i_e$ be the index of the block matrix including $t_e$. Partial-SVD (Algorithm 3) adjusts the time ranges of the SVD results for $B^{(i_s)}$ and $B^{(i_e)}$, as seen in the red-colored boxes of Figure 3 (line 1 of Algorithm 2). Stitched-SVD combines the SVD results of Partial-SVD and those of the block matrices from $i_s + 1$ to $i_e - 1$ (lines 2 to 5 in Algorithm 2). We describe the details of Partial-SVD and Stitched-SVD in Sections 3.3.1 and 3.3.2, respectively.
3.3.1. Partial-SVD

This module manipulates the SVD results of the block matrices $B^{(i_s)}$ and $B^{(i_e)}$ to return the SVD results in a given time range $(t_s, t_e)$. As seen in Figure 2, $B^{(i_s)}$ may contain the time range before $t_s$, and $B^{(i_e)}$ may include the time range after $t_e$. Note that those time ranges are out of the time range of the given query; thus, our goal for this module is to extract SVD results from $B^{(i_s)}$ and $B^{(i_e)}$ according to the time range query without reconstructing raw data. Figure 3 depicts the operation of Partial-SVD. For the block matrix $B^{(i_s)}$ and its SVD $U^{(i_s)} \Sigma^{(i_s)} V^{(i_s)T}$, Partial-SVD first eliminates the rows of the left singular vector matrix $U^{(i_s)}$ which are out of the query time range. After that, Partial-SVD multiplies the remaining left singular vector matrix with the singular value matrix $\Sigma^{(i_s)}$, and performs SVD of the resulting matrix. The resulting left singular vector matrix and singular value matrix constitute the output of Partial-SVD. The right singular vector matrix output of Partial-SVD is computed by multiplying $V^{(i_s)}$ with the right singular vector matrix of this intermediate SVD. Similar operations are performed for the block matrix $B^{(i_e)}$ and its SVD $U^{(i_e)} \Sigma^{(i_e)} V^{(i_e)T}$.
Now, we describe the details of this module (Algorithm 3). We first introduce elimination matrices which are used in Partial-SVD to adjust the time range.
Definition 3 (Elimination matrices).
Suppose $m_s$ is the number of rows to be eliminated in $U^{(i_s)}$ according to $t_s$; then $b - m_s$ is the number of remaining rows in $U^{(i_s)}$. Similarly, let $m_e$ be the number of rows to be eliminated in $U^{(i_e)}$ according to $t_e$; then $b - m_e$ is the number of remaining rows in $U^{(i_e)}$. The elimination matrices $E_s \in \mathbb{R}^{(b - m_s) \times b}$ and $E_e \in \mathbb{R}^{(b - m_e) \times b}$ for $U^{(i_s)}$ and $U^{(i_e)}$ are defined as follows:

$$E_s = \begin{bmatrix} 0 & I_{b - m_s} \end{bmatrix}, \qquad E_e = \begin{bmatrix} I_{b - m_e} & 0 \end{bmatrix}$$

The matrices $U^{(i_s)}$ and $U^{(i_e)}$ are multiplied by the elimination matrices, and the time ranges of the resulting matrices $E_s U^{(i_s)}$ and $E_e U^{(i_e)}$ lie within the query time range $(t_s, t_e)$. Partial-SVD constructs these elimination matrices based on $t_s$ and $t_e$ (line 3 of Algorithm 3). The filtered block matrix is given by

$$B^{(i_s)}(t_s) = E_s U^{(i_s)} \Sigma^{(i_s)} V^{(i_s)T} \qquad (5)$$

where the SVD result $U^{(i_s)} \Sigma^{(i_s)} V^{(i_s)T}$ was computed in the storage phase.
Partial-SVD decomposes $E_s U^{(i_s)} \Sigma^{(i_s)}$ into $U^{(i_s)\prime} \Sigma^{(i_s)\prime} V'^{T}$ via SVD and low-rank approximation with threshold $\xi$, since $E_s U^{(i_s)}$ is not a column orthogonal matrix and Equation (5) is therefore not in the form of an SVD result; then, Equation (5) is written as follows:

$$B^{(i_s)}(t_s) = U^{(i_s)\prime} \Sigma^{(i_s)\prime} \left(V^{(i_s)} V'\right)^T \qquad (6)$$
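A minimal numpy sketch of this module, with a boolean row mask standing in for the elimination matrices (`partial_svd` is an illustrative name, not the paper's code):

```python
import numpy as np

def partial_svd(U, s, Vt, keep):
    """Sketch of Partial-SVD: restrict a stored block SVD (U, s, Vt) to
    the rows selected by the boolean mask `keep` without touching the
    raw data; `keep` plays the role of the elimination matrix."""
    M = U[keep] @ np.diag(s)                  # remaining rows of U, times Sigma
    Uu, s_new, Vut = np.linalg.svd(M, full_matrices=False)
    return Uu, s_new, Vut @ Vt                # right factor: (V Vu)^T = Vu^T V^T

rng = np.random.default_rng(4)
B = rng.standard_normal((30, 6))              # one stored block
U, s, Vt = np.linalg.svd(B, full_matrices=False)

keep = np.zeros(30, dtype=bool)
keep[10:] = True                              # the query starts mid-block
Up, sp, Vtp = partial_svd(U, s, Vt, keep)

# Agrees with an SVD computed directly on the raw sub-block
assert np.allclose(sp, np.linalg.svd(B[10:], compute_uv=False))
assert np.allclose(Up @ np.diag(sp) @ Vtp, B[10:])
```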
3.3.2. Stitched-SVD

This module combines the Partial-SVD results of $B^{(i_s)}$ and $B^{(i_e)}$ and the stored SVD results of the block matrices in the query time range to return the final SVD result corresponding to the query range, as shown in Figure 2. A naive approach is to reconstruct the data blocks using the stored SVD results and perform SVD on the reconstructed data of the given query time range. However, this approach cannot provide fast query speed for a long time range due to heavy computations induced by the reconstruction and the subsequent SVD. The goal of Stitched-SVD is to efficiently stitch the SVD results in the query time range, avoiding reconstruction and minimizing the numerical computation of matrix multiplications.
Specifically, Stitched-SVD stitches several consecutive block SVD results together to compute the SVD corresponding to the query time range: i.e., it combines the SVD results $U^{(i)}$, $\Sigma^{(i)}$, and $V^{(i)}$ of the $i$-th block matrix $B^{(i)}$, for $i_s \le i \le i_e$, to compute the final SVD $U_q$, $\Sigma_q$, and $V_q$. The main idea is to 1) carefully decouple the left singular vector matrices $U^{(i)}$ from the block SVD results, 2) construct a stacked matrix containing $\Sigma^{(i)} V^{(i)T}$ for $i_s \le i \le i_e$, 3) perform SVD on the stacked matrix to get the singular value matrix and the right singular vector matrix of the final SVD result, and 4) carefully combine the $U^{(i)}$ with the left singular vector matrix of the SVD of the stacked matrix to get the left singular vector matrix of the final SVD result.
Lines 2 to 5 of Algorithm 2 present how the stitched SVD matrices are computed. First, we construct the stacked matrix

$$M = \left[\Sigma^{(i_s)\prime} V^{(i_s)\prime T};\; \Sigma^{(i_s+1)} V^{(i_s+1)T};\; \cdots;\; \Sigma^{(i_e)\prime} V^{(i_e)\prime T}\right]$$

based on the block matrix structure, where $\Sigma^{(i_s)\prime} V^{(i_s)\prime T}$ and $\Sigma^{(i_e)\prime} V^{(i_e)\prime T}$ come from Partial-SVD. After organizing the block matrix structure of $M$, we define the block diagonal matrix as follows.
Definition 4 (Block diagonal matrix).
Suppose $U^{(i_s)\prime}$ and $U^{(i_e)\prime}$ are the left singular vector matrices produced by Partial-SVD. Let $U^{(i_s+1)}, \ldots, U^{(i_e-1)}$ be the left singular vector matrices stored in $S_U$. The block diagonal matrix $L$ is defined as follows:

$$L = \mathrm{diag}\left(U^{(i_s)\prime},\; U^{(i_s+1)},\; \ldots,\; U^{(i_e-1)},\; U^{(i_e)\prime}\right)$$

Then, the matrix $X(t_s, t_e)$ corresponding to the time range query is represented as follows:

$$X(t_s, t_e) = \left[E_s B^{(i_s)};\; B^{(i_s+1)};\; \cdots;\; E_e B^{(i_e)}\right] = LM \qquad (7)$$

where $E_s$ and $E_e$ are the elimination matrices of Partial-SVD, and $M$ is equal to the stacked matrix defined above. As we apply SVD and low-rank approximation to $M$, Equation (7) becomes as follows:

$$X(t_s, t_e) = L U_M \Sigma_M V_M^T \qquad (8)$$

where $U_M \Sigma_M V_M^T$ is computed by low-rank approximation and SVD of $M$, and $U_q = L U_M$, $\Sigma_q = \Sigma_M$, and $V_q = V_M$. To avoid matrix multiplications between $U_M$ and the zero sub-matrices of $L$, we split $U_M$ block by block as follows:

$$U_M = \left[U_M^{(i_s)};\; U_M^{(i_s+1)};\; \cdots;\; U_M^{(i_e)}\right]$$

where $U_M^{(i_s)}$ and $U_M^{(i_e)}$ correspond to $U^{(i_s)\prime}$ and $U^{(i_e)\prime}$, respectively, and $U_M^{(i)}$ corresponds to $U^{(i)}$ for $i_s < i < i_e$. Then $U_q$ of Equation (8) is computed as follows:

$$U_q = L U_M = \left[U^{(i_s)\prime} U_M^{(i_s)};\; U^{(i_s+1)} U_M^{(i_s+1)};\; \cdots;\; U^{(i_e)\prime} U_M^{(i_e)}\right] \qquad (9)$$

The column orthogonality of $U_q$ is established as it is the product of two column orthogonal matrices $L$ and $U_M$; $V_q$ is also column orthogonal. Note that we perform Partial-SVD to satisfy the column orthogonality condition before performing Stitched-SVD.
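The stitching idea can be sketched as follows. This simplified illustration stitches complete blocks only (omitting the Partial-SVD boundary adjustment and low-rank truncation), with `stitched_svd` as a hypothetical helper:

```python
import numpy as np

def stitched_svd(blocks_svd):
    """Sketch of Stitched-SVD: combine per-block SVDs (U_i, s_i, Vt_i)
    into the SVD of the vertically stacked blocks, decomposing only the
    small stacked matrix of Sigma_i @ Vt_i factors."""
    stacked = np.vstack([np.diag(s) @ Vt for _, s, Vt in blocks_svd])
    Um, s_out, Vt_out = np.linalg.svd(stacked, full_matrices=False)
    # Multiply the block diagonal of the U_i with Um block by block,
    # skipping the zero sub-matrices of the block diagonal structure.
    U_parts, row = [], 0
    for U_i, s_i, _ in blocks_svd:
        k = len(s_i)
        U_parts.append(U_i @ Um[row:row + k])
        row += k
    return np.vstack(U_parts), s_out, Vt_out

rng = np.random.default_rng(5)
X = rng.standard_normal((60, 5))
blocks_svd = [np.linalg.svd(X[i:i + 20], full_matrices=False)
              for i in range(0, 60, 20)]           # three stored blocks

U, s, Vt = stitched_svd(blocks_svd)
assert np.allclose(s, np.linalg.svd(X, compute_uv=False))  # same spectrum
assert np.allclose(U @ np.diag(s) @ Vt, X)                 # reconstructs X
assert np.allclose(U.T @ U, np.eye(5))                     # column orthogonal
```

Only the small stacked matrix of the $\Sigma^{(i)} V^{(i)T}$ factors is decomposed; the tall left factors are recombined by cheap per-block multiplications.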
3.4. Theoretical Analysis
We theoretically analyze our proposed method Zoom-SVD in terms of time and space cost. Note that a collection of multiple time series data is a dense matrix $X \in \mathbb{R}^{t \times c}$, and the time complexity of computing the SVD of $X$ is $O(tc^2)$ for $t \ge c$.
Theorem 5.
When a new row vector $x_{t+1} \in \mathbb{R}^{1 \times c}$ is given at time $t+1$, the computational cost of the storage phase in Zoom-SVD is $O((b + c)k^2)$, where $k$ is the number of singular values.
Proof. Performing SVD of $[\Sigma^{(i)}_t V^{(i)T}_t; x_{t+1}]$ takes $O(ck^2)$, and the multiplication of the block diagonal matrix of $U^{(i)}_t$ and $U_s$ in Equation (3) takes $O(bk^2)$, since the number of rows of $B^{(i)}_t$ is always smaller than or equal to the block size $b$. The total computational cost of the storage phase in Zoom-SVD is therefore $O(ck^2 + bk^2) = O((b + c)k^2)$. ∎
In Theorem 5, storing the incoming data at each time tick takes constant time, since $b$ and $c$ are constants and $k$ is at most $\min(b, c)$.
Theorem 6.
Given a time range query $(t_s, t_e)$, the time cost of the query phase (Algorithm 2) is $O((t_e - t_s)ck + bk^2)$, where $k$ is the number of singular values kept per block.
Proof. It takes $O((b + c)(k_{i_s}^2 + k_{i_e}^2))$ to compute Partial-SVD, where $k_{i_s}$ and $k_{i_e}$ are the numbers of singular values computed by Partial-SVD (line 1 in Algorithm 2).
The computational time to perform SVD of the stacked matrix $M$ in Stitched-SVD depends on $\sum_{i=i_s}^{i_e} k_i$ and $c$, since the vertical and horizontal lengths of the matrix are $\sum_{i=i_s}^{i_e} k_i$ and $c$, respectively (lines 2 to 4 in Algorithm 2); this takes $O(c^2 \sum_{i=i_s}^{i_e} k_i)$.
Also, the block matrix multiplication for $U_q$ (line 5 in Algorithm 2) takes $O(b k_q \sum_{i=i_s}^{i_e} k_i)$, where $k_q$ is the number of singular values with respect to the SVD result of the given time range.
Let all $k_i$'s, $k_{i_s}$, $k_{i_e}$, and $k_q$ be $k$ in the query phase. The number of blocks in the range is at most $\frac{t_e - t_s}{b} + 1$, so $\sum_{i=i_s}^{i_e} k_i \le (\frac{t_e - t_s}{b} + 1)k$. Then, the computational costs of Partial-SVD and Stitched-SVD are $O((b + c)k^2)$ and $O(\frac{t_e - t_s}{b}(c^2 k + b k^2))$, respectively. Since $k \le c \le b$ in practice, we can simply express the computational cost of the query phase of Zoom-SVD as $O((t_e - t_s)ck + bk^2)$. ∎
Theorem 6 implies that the computational time of Zoom-SVD in the query phase depends linearly on the length $t_e - t_s$ of the time range.
Space Complexity. We analyze the space complexity of the storage phase in Theorem 7.