A Periodicity-based Parallel Time Series Prediction Algorithm in Cloud Computing Environments

10/17/2018 ∙ by Jianguo Chen, et al. ∙ New Paltz University of Illinois at Chicago 0

In the era of big data, practical applications in various domains continually generate large-scale time-series data. Some of these data, such as meteorological and financial data, show significant or potential periodicity. It is critical to efficiently identify the potential periodic patterns from massive time-series data and provide accurate predictions. In this paper, a Periodicity-based Parallel Time Series Prediction (PPTSP) algorithm for large-scale time-series data is proposed and implemented in the Apache Spark cloud computing environment. To effectively handle the massive historical datasets, a Time Series Data Compression and Abstraction (TSDCA) algorithm is presented, which reduces the data scale while accurately extracting the characteristics. Based on this, we propose a Multi-layer Time Series Periodic Pattern Recognition (MTSPPR) algorithm using the Fourier Spectrum Analysis (FSA) method. In addition, a Periodicity-based Time Series Prediction (PTSP) algorithm is proposed. Data in the subsequent period are predicted based on all previous period models, in which a time attenuation factor is introduced to control the impact of different periods on the prediction results. Moreover, to improve the performance of the proposed algorithms, we propose a parallel solution on the Apache Spark platform, using the Streaming real-time computing module. To efficiently process the large-scale time-series datasets in distributed computing environments, Discretized Streams (DStreams) and Resilient Distributed Datasets (RDDs) are used to store and calculate these datasets. Extensive experimental results show that our PPTSP algorithm has significant advantages compared with other algorithms in terms of prediction accuracy and performance.

1 Introduction

1.1 Motivation

With the rapid development of the Internet, sensor networks, the Internet of Things (IoT), the mobile Internet, and other media, a large number of datasets are continuously generated in various fields, such as commerce, medicine, engineering, and the social sciences [1, 19, 27, 32]. Time-series data are collections of data points arranged in time order, such as stock prices, exchange rates, sales volumes, production capacity, weather data, and ocean engineering data [6, 35, 37]. As important and complex data objects, massive time-series data record valuable information and knowledge about the applications and play an important role in different application fields. Abundant data mining and analysis technologies have been developed to extract potentially useful knowledge from these datasets. Based on the previously observed time-series data, we can forecast the probable data in the coming periods. It is of interest to seek high-performance approaches that handle the large-scale and streaming arrival of time-series data. In addition, the accuracy and robustness of time-series data processing methods are also hot topics in the academic and industrial fields.

The era of big data has brought both opportunities and challenges to the processing of large-scale time-series datasets. On the one hand, data generation and collection are becoming easier and less costly. Massive datasets are continuously generated through various means, providing rich data sources for big data analysis and mining [21, 30]. On the other hand, for time-series prediction, the big data era has also posed serious problems and challenges alongside the obvious benefits.

  • Periodic pattern recognition of time-series data is essential for time series prediction. The periodic pattern of time-series data in the real world does not always keep a constant length (e.g., one day or one month) and may show dynamic length over time [35]. In addition, many time-series data have multi-layer periods. Most existing periodic pattern recognition work calculates and analyzes only single-layer period patterns. It is necessary to identify periodic patterns adaptively, in a data-driven manner, to discover the potential multi-layer periodic patterns.

  • To achieve accurate prediction, massive historical and real-time datasets must be combined and analyzed, and thoroughly mining the historical data takes a lot of time [17]. Therefore, how to quickly process and analyze the massive historical data during real-time prediction is an important challenge. The volume of massive datasets is usually much larger than the storage capacity of hard disks and memory on a single computer. Therefore, we need to use distributed computing clusters to store and calculate these datasets. This raises issues, such as data communication, synchronization waiting, and workload balancing, which need further consideration and resolution.

  • The performance of data analysis and prediction is also essential for large-scale time-series data. There are increasingly strict time requirements for real-time time series prediction in various application fields, such as stock market, real-time pricing, and online applications [36]. Rapidly developed cloud computing and distributed computing provide high-performance computing capabilities for big data mining. We need to propose efficient prediction algorithms for time-series data and execute these algorithms in high-performance computing environments. In such a case, these algorithms can take full advantage of high-performance computing capabilities and increase their performance and scalability, while keeping lower data communication costs.

1.2 Our Contributions

In this paper, we focus on the periodic pattern recognition and prediction of large-scale time-series data with periodic characteristics, and propose a Periodicity-based Parallel Time Series Prediction (PPTSP) algorithm for time-series data in cloud computing environments. A data compression and abstraction method is proposed for time-series data to effectively reduce the scale of massive historical datasets and extract the core information. The Fourier Spectrum Analysis (FSA) method is introduced to detect potential single-layer or multi-layer periodic patterns from the compressed time-series data. The prediction algorithm is parallelized on the Apache Spark cloud platform, which effectively improves the performance of the algorithm and maintains high scalability and low data communication. Extensive experimental results show that our PPTSP algorithm has significant advantages compared with other algorithms in terms of accuracy and performance. Our contributions in this paper are summarized as follows.

  • To effectively handle the massive historical datasets, a Time Series Data Compression and Abstraction (TSDCA) algorithm is presented, which reduces the data scale while accurately extracting the characteristics.

  • We propose a Multi-layer Time Series Periodic Pattern Recognition (MTSPPR) algorithm using the FSA method. The first-layer periodic pattern is identified adaptively with the FSA method and morphological similarity measure. Then, potential multi-layer periodic patterns are discovered in the same way.

  • Based on the detected periodic patterns, a Periodicity-based Time Series Prediction (PTSP) algorithm is proposed to predict data values in subsequent time periods. An exponential attenuation factor is defined to control the impact of each previous periodic model on the prediction results.

  • To improve the performance of the proposed algorithms, we propose a parallel solution on the Apache Spark platform, using the Streaming real-time computing module. Discretized Streams (DStreams) and Resilient Distributed Datasets (RDDs) are used to store and calculate these datasets in distributed computing environments.

The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 gives the multi-layer period prediction algorithm for time-series data, including the data compression and abstraction, FSA-based periodic pattern recognition, and periodicity-based time series prediction methods. Parallel implementation of the periodic pattern recognition algorithm with Spark Streaming is developed in Section 4. Experimental results and evaluations are shown in Section 5 from the aspects of prediction accuracy and performance. Finally, Section 6 concludes the paper with a discussion of future work and research directions.

2 Related Work

In this section, we review the related work about time-series data mining from the perspectives of data compression and representation, periodic pattern recognition, time-series data prediction, and performance acceleration.

Focusing on large-scale time-series data compression and representation, various effective methods were proposed in [5, 15, 23, 29]. In [5], the Chebyshev polynomials (CHEB) method was used to approximate and index the $d$-dimensional spatio-temporal trajectory, and the best extraction solution was obtained by minimizing the maximum deviation from the true value (termed the minimax polynomial). However, CHEB is a global technique and requires expensive computational overhead for the large eigenvalue and eigenvector matrices. As an approximation technique, the Piecewise Linear Approximation (PLA) algorithm was proposed in [25] to approximate a time series with line segments. The representation consists of piecewise linear segments that capture the shape of the original time series. In addition, an Indexable PLA (IPLA) algorithm was proposed in [9] for efficient similarity search on time-series datasets. Focusing on dimensionality reduction, Eamonn et al. introduced a Piecewise Aggregate Approximation (PAA) algorithm [23] for high-dimensional time-series datasets. In [24], a locally adaptive dimensionality reduction technique, the Adaptive Piecewise Constant Approximation (APCA) algorithm, was explored for indexing large-scale time-series databases. There are other dimensionality reduction techniques, such as Singular Value Decomposition (SVD) [14], Discrete Fourier Transform (DFT) [15], and Discrete Wavelet Transform (DWT) [29]. Detailed experiments were performed in [10] to compare the above time-series data representation methods and test their effectiveness on various time-series datasets. However, most of the existing algorithms are implemented by dimensionality reduction or approximation, where DWT, PAA, and APCA are approximation methods with a discontinuous piecewise function. The TSDCA algorithm proposed in this work falls in the category of approximation techniques. Different from the existing studies, TSDCA extracts the critical characteristics in each dimension to form a data abstraction without reducing the data dimensions. It keeps the data structure of the data abstraction identical to that of the raw dataset. Similarity measurements, periodic pattern recognition, and prediction methods can therefore be applied to the compressed dataset without any modification.

In the field of periodic pattern recognition of time series, various methods have been proposed [4, 26, 28], such as the complete periodic pattern, partial periodic pattern, period association rule, synchronous periodic pattern, and asynchronous periodic pattern. In [26], Loh et al. proposed an efficient method to mine temporal patterns in the popularity of web items, where the popularity of web items is treated as time series and a gap measurement method was proposed to quantify the difference between the popularity of two web items. They further proposed a density-based clustering algorithm using the gap measure to find clusters of web items and illustrated the effectiveness of the proposed approach using real-world datasets on the Google Trends website. In [12, 13], Elfeky et al. defined two types of periodicities, segment periodicity and symbol periodicity, and then proposed the corresponding algorithms (CONV and WARP) to discover the periodic patterns of unknown periods. However, being based on the convolution technique, the CONV algorithm works well on datasets with perfect periodicity but faces limitations on noisy time series datasets. The WARP algorithm uses the time warping technique to overcome the problem of noisy time series. However, both CONV and WARP can only detect segment periodicity rather than symbol or sequence periodicity, and they are limited in detecting partial periodic patterns. In [34], Sheng et al. developed a ParPer-based algorithm to detect periodic patterns in time series datasets, where the dense periodic areas in the time series are detected using optimization steps. However, this method requires pre-set expected period values, so users need specific domain knowledge to generate patterns. Rasheed et al. proposed a Suffix-Tree-based Noise-Resilient (STNR) algorithm to generate patterns and detect periodicity from time series datasets [31]. The STNR algorithm can find periodicity without user specification and interaction. However, STNR only works well in detecting fixed-length rigid periodic patterns and is ineffective at mining variable-length flexible patterns. To overcome this limitation, Chanda et al. introduced a Flexible Periodic Pattern Mining (FPPM) algorithm, which uses a suffix tree data structure and the Discrete Fourier Transform (DFT) to detect flexible periodic patterns by ignoring unimportant or undesired events and only considering the important events [26]. However, in practical time series mining, defining which events are important and which are unimportant is difficult and often infeasible. In addition, most of the existing studies focused on static time-series databases and on periodic pattern recognition in a single layer. Considering that there are multiple nested periods in some real-world time-series datasets, e.g., temperature shows both daily and seasonal periods, we focus on potential multi-layer periodicity pattern recognition in this work. In addition, to effectively detect flexible periodic patterns without prior user knowledge, we propose a novel morphological similarity measurement and introduce the Fourier Spectrum Analysis (FSA) method for multi-layer periodicity pattern detection. The morphological similarity is measured by a five-tuple ($S_A$, $S_T$, $S_{max}$, $S_{min}$, $S_V$), whose elements refer to the angular similarity, time-length similarity, maximum similarity, minimum similarity, and value-interval similarity, respectively. The combination of the FSA and the morphological similarity measurement can efficiently process the compressed time series from incremental online time series streams. Moreover, the morphological similarity measurement can be further applied to various periodic pattern recognition algorithms.

Over the past several decades, various time series prediction algorithms have been proposed, such as the seasonal autoregressive integrated moving average (SARIMA) model and Holt-Winters exponential smoothing [20, 3, 35, 22]. In [20], a novel high-order weighted fuzzy time series model was proposed and applied in nonlinear time series prediction. George et al. used an online sequential learning algorithm for time-series prediction, where a feed-forward neural network was introduced as an online sequential learning model [16]. Focusing on local modeling, Marcin et al. proposed a period-aware local modeling and data selection method for time series prediction [3], where the period of the time series is determined using the autocorrelation function and a moving average filter. Shi et al. proposed an offline seasonal adjustment factor plus GARCH model to model the seasonal heteroscedasticity in traffic flow series [35]. However, this model faces limitations in real-world transportation time-series processing. In [18], Huang et al. introduced an online seasonal adjustment factors plus adaptive Kalman filter (OSAF+AKF) algorithm for the prediction of the seasonal heteroscedasticity in traffic flow datasets. Considering the seasonal patterns in traffic time-series datasets, four types of online seasonal adjustment factors are introduced in the OSAF+AKF algorithm. In addition, Tan et al. defined a time-decaying online convex optimization problem and explored a Time-Decaying Adaptive Prediction (TDAP) algorithm for time series prediction [38]. In the biomedical field, time-series forward prediction algorithms were used for real-time brain oscillation detection and phase-locked stimulation in [8].

With the emergence of big data, the processing performance and real-time response requirements of large-scale time series applications have received increasing attention. Various acceleration and parallel methods were proposed for massive time-series data processing [17, 38, 33]. In [17], a GP-GPU parallelization solution was introduced for fast knowledge discovery from time-series datasets, where a General Programming (GP) framework was presented using the CUDA platform. Efforts on distributed and parallel time-series data mining based on high-performance computing and cloud computing have produced many promising results [40, 11]. Apache Spark [2] is another cloud platform that is well suited to data mining. It can cache data in memory and perform iterative computations on the same data directly from memory, which saves a large amount of disk I/O time. Spark Streaming is a real-time computing framework based on the Spark cloud environment. It provides rich APIs and a high-speed engine based on in-memory computing. Users can combine Spark Streaming with applications such as stream computing, batch processing, and interactive queries. In [30], the Spark Streaming module was used to implement the nearest neighbor classification algorithm for high-speed big data streams. In [36], an effective prediction algorithm based on Apache Spark was proposed for missing data over multi-variable time series.

3 Periodicity-based Time Series Prediction Algorithm

In this section, we propose a Multi-layer Time Series Periodic Pattern Recognition (MTSPPR) algorithm for time-series data with periodic characteristics. In Section 3.1, to accelerate the periodic pattern recognition process of large-scale time-series datasets, a data compression and abstraction method is proposed, which effectively extracts the characteristics of the data while reducing the scale of massive datasets. In Section 3.2, the Fourier Spectrum Analysis (FSA) method is used to identify periodic patterns from the compressed time-series dataset. On this basis, Section 3.3 describes the multi-layer periodic pattern recognition algorithm. Each potential higher-layer period model is constructed successively based on the periods in the previous lower-layer models.

3.1 Time-series Data Compression and Abstraction

In many actual applications, time-series datasets grow at high speed over time. Although various storage technologies continue to be improved and storage costs are declining, it is still difficult to cope with the rapid development of large-scale datasets. To process large-scale and continuous time-series datasets using limited storage and computing resources, we propose a Time-Series Data Compression and Abstraction (TSDCA) algorithm to effectively reduce the data volume and extract key knowledge.

Given a big data processing application, let $X = \{x_1, x_2, \ldots, x_n\}$ be the raw time-series dataset with temporal and periodic attributes, where $x_i$ is the data point at the time stamp $t_i$. The raw dataset can be compressed into a series of data points and the slopes between these points. An example of compressing a raw two-dimensional time-series dataset is shown in Figure 1.

Figure 1: Data compression and abstraction of large-scale time-series datasets.

(1) Inclination measurement and inflection point marking.

To extract the characteristics of a large-scale time-series dataset, we calculate the inclination of every two data points and identify the inflection points of the dataset. The inclination between two data points is the ratio of the value difference to the time difference between the two data points, as defined in Equation (1):

$$k_{i,j} = \frac{x_j - x_i}{t_j - t_i}, \tag{1}$$

where $k_{i,j}$ is the inclination between data points $x_i$ and $x_j$. There are three conditions for $k_{i,j}$: (a) $k_{i,j} > 0$ refers to an upward trend; (b) $k_{i,j} = 0$ shows a steady trend; and (c) $k_{i,j} < 0$ refers to a downward trend. Examples of the inclination relationships between two data points are shown in Figure 2.

Figure 2: Inclination relationships between two data points.

The inflection point set $X'$ for $X$ is initialized as an empty set ($X' = \emptyset$). Set the first inflection point $x_i = x_1$ and add it to $X'$. We continuously calculate the inclination $k_{i,j}$ between $x_i$ and data point $x_j$, and the inclination $k_{j,j+1}$ between data points $x_j$ and $x_{j+1}$. If $k_{i,j} \times k_{j,j+1} > 0$, the data points $x_i$, $x_j$, and $x_{j+1}$ have a congruous trend. Namely, $x_j$ is not an inflection point here. In this case, we continue to calculate the slopes of the subsequent data points and multiply them with the inclination rate $k_{i,j}$. Otherwise, if $k_{i,j} \times k_{j,j+1} \le 0$, it indicates an incongruous trend from $x_i$ to $x_j$ and from $x_j$ to $x_{j+1}$. That is, $x_j$ here represents an inflection point. We append $x_j$ to the inflection point set $X'$ and set $x_i = x_j$, $j = j + 1$. Similarly, the slopes of the remaining data points are computed sequentially by repeating the above steps. In this way, the large-scale raw time-series dataset $X$ is compressed and re-expressed in the form of inflection points $X'$, described as:

$$X' = \{x'_1, x'_2, \ldots, x'_m\},$$

where $m$ is the number of inflection points. Note that the scale of the compressed dataset is much smaller than that of the raw dataset. Based on these inflection points, the raw time-series dataset is divided into multiple linear segments. These segments can be connected to form an abstract representation of the raw dataset, as shown in Figure 3.

Figure 3: Inflection points of the raw time-series dataset.
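
The inflection point marking step described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `(time stamp, value)` tuple layout, and the choice to always keep the first and last points are assumptions made for the example.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (time stamp, value); layout assumed for this sketch


def inclination(p: Point, q: Point) -> float:
    """Value difference divided by time difference, as in Equation (1)."""
    return (q[1] - p[1]) / (q[0] - p[0])


def mark_inflection_points(series: List[Point]) -> List[Point]:
    """Return the points at which the trend direction changes.

    Assumptions of this sketch: time stamps strictly increase, and the
    first and last points are always kept so that the abstraction spans
    the whole series.
    """
    if len(series) < 3:
        return list(series)
    inflections = [series[0]]
    anchor = series[0]  # most recent inflection point
    for j in range(1, len(series) - 1):
        k_before = inclination(anchor, series[j])
        k_after = inclination(series[j], series[j + 1])
        if k_before * k_after > 0:
            continue  # congruous trend: series[j] is not an inflection point
        inflections.append(series[j])
        anchor = series[j]
    inflections.append(series[-1])
    return inflections


raw = [(0, 1.0), (1, 2.0), (2, 3.0), (3, 2.0), (4, 1.0), (5, 2.0)]
print(mark_inflection_points(raw))
```

In this toy series the rise from 1.0 to 3.0 and the fall back to 1.0 each collapse to their end points, so six raw points are represented by four.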

(2) Pseudo-inflection points deletion.

The set of inflection points may still contain many inflection points whose values are similar to those of their neighbors in the abstract representation dataset. We need to further identify and remove these pseudo-inflection points to effectively describe the significant outlines of the raw dataset.

Definition 1: (Pseudo-inflection point). Pseudo-inflection points refer to the inflection points of which the values have a negligible difference from their neighbors. These data points have little impact on the distribution trends and patterns of the neighborhood of the abstract representation. After removing these pseudo-inflection points, the overall outlines of the abstract representation dataset will be well maintained.

We calculate the slopes among every three adjacent inflection points to determine whether the middle one is a pseudo-inflection point. Let $k'_{a,b}$ be the inclination between inflection points $x'_a$ and $x'_b$, as calculated in Equation (2):

$$k'_{a,b} = \frac{x'_b - x'_a}{t'_b - t'_a}, \tag{2}$$

where $t'_a$ and $t'_b$ are the time stamps of $x'_a$ and $x'_b$. According to Equation (2), we continue to calculate the slopes $k'_{a,b}$, $k'_{b,c}$, and $k'_{a,c}$ among $x'_a$, $x'_b$, and $x'_c$, respectively. The inclination relationship of inflection points $x'_a$, $x'_b$, and $x'_c$ is calculated in terms of value differences and time difference, as defined in Equations (3) and (4):

(3)
(4)

where $\varepsilon$ is the inclination threshold and $\delta$ is the threshold of the length of time. If the inflection points $x'_a$, $x'_b$, and $x'_c$ satisfy the inclination relationships in Equations (3) and (4), then the middle point $x'_b$ is identified as a pseudo-inflection point and removed from the inflection point set $X'$.

For example, in Figure 4, and are identified as pseudo-inflection points. After removing and , the set of inflection points is updated to .

Figure 4: Pseudo-inflection points of data abstract representation.
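
The pseudo-inflection point deletion step can be illustrated with the following Python sketch. The exact forms of Equations (3) and (4) are not recoverable from this text, so the two tests used here are stand-in assumptions: a middle point counts as a pseudo-inflection point when both adjacent inclinations are at most `eps` in magnitude and the three points span at most `delta` time units. The names `remove_pseudo_inflections`, `eps`, and `delta` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (time stamp, value); layout assumed for this sketch


def remove_pseudo_inflections(points: List[Point], eps: float, delta: float) -> List[Point]:
    """Drop inflection points that barely differ from their neighbors.

    Stand-in criteria (assumed, not taken from Equations (3) and (4)):
    both adjacent slopes have magnitude <= eps, and the triple spans
    at most delta time units.
    """
    if len(points) < 3:
        return list(points)
    kept = [points[0]]
    for i in range(1, len(points) - 1):
        a, b, c = kept[-1], points[i], points[i + 1]
        k_ab = (b[1] - a[1]) / (b[0] - a[0])
        k_bc = (c[1] - b[1]) / (c[0] - b[0])
        nearly_flat = abs(k_ab) <= eps and abs(k_bc) <= eps
        short_span = (c[0] - a[0]) <= delta
        if not (nearly_flat and short_span):
            kept.append(b)  # a significant inflection point stays
    kept.append(points[-1])
    return kept


inflections = [(0, 0.0), (10, 10.0), (11, 10.05), (12, 10.0), (20, 0.0)]
print(remove_pseudo_inflections(inflections, eps=0.1, delta=5))
```

Because each triple is formed with the last kept point on the left, a run of several small wiggles is judged against the same significant neighbor instead of against each other.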

(3) Data compression and abstraction of the raw time-series dataset.

For the raw time-series dataset $X$, the inflection points, except the pseudo-inflection points, are collected to form a compressed and abstracted representation $X'$. In this way, the large-scale dataset can be compressed to reduce the data size while still extracting the core information. For example, the data abstraction of the raw dataset in Figure 1 is shown in Figure 5. Detailed steps of the time-series data compression and abstraction algorithm are given in Algorithm 1.

Figure 5: Data compression and abstraction of the raw time-series dataset.

Input:
   $X$: the raw time-series dataset;
   $\varepsilon$: the inclination threshold;
   $\delta$: the threshold value of the time-window length;
Output:
   $X'$: the data abstraction of $X$.
initialize the inflection point set as empty, $X' \leftarrow \emptyset$;
set the first inflection point $x_i \leftarrow x_1$ and append $x_i$ to $X'$;
for each data point $x_j$ in $X$ do
   calculate inclination rates $k_{i,j}$, $k_{j,j+1}$;
   if $k_{i,j} \times k_{j,j+1} > 0$ then
      continue;
   else
      mark inflection point $x_j$: append $x_j$ to $X'$ and set $x_i \leftarrow x_j$;
for each inflection point $x'_b$ in $X'$ do
   calculate inclination rates $k'_{a,b}$, $k'_{b,c}$;
   calculate the inclination relationship of value (Equation (3));
   calculate the inclination relationship of time (Equation (4));
   if Equations (3) and (4) are both satisfied then
      remove pseudo-inflection point $x'_b$ from $X'$;
return $X'$.
Algorithm 1 Time-series data compression and abstraction (TSDCA) algorithm

The TSDCA algorithm consists of inflection point marking and pseudo-inflection point deletion. Assuming that the raw dataset $X$ has $n$ data points and the data abstraction $X'$ has $m$ inflection points, the time complexity of Algorithm 1 is $O(n + m)$. The data compression ratio between $X'$ and $X$ is $m/n$. Benefitting from the data compression and abstraction, the storage requirement and the data processing workload of big data are reduced effectively.

3.2 Multi-layer Time Series Periodic Pattern Recognition

In this section, we propose a Multi-layer Time Series Periodic Pattern Recognition (MTSPPR) algorithm. A morphological similarity measurement is proposed for continuously arriving time-series datasets. The Fourier Spectrum Analysis (FSA) method is used to identify the potential periodic models of time-series datasets.

3.2.1 FSA-based Periodic Pattern Recognition

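Given a data abstraction $X'$ of the raw time-series dataset $X$, $X'$ can be described as a non-stationary data model, including a trend item $T(t)$, a periodic item $P(t)$, and a random item $R(t)$, as described in Equation (5):

$$X'(t) = T(t) + P(t) + R(t). \tag{5}$$

If the periodic item $P(t)$ satisfies the expansion conditions of the Fourier series, we can extend $P(t)$ periodically using a periodic length $N$ and obtain its Fourier expansion over the observed interval. Namely, $P(t)$ is represented as the sum of a series of spectrums whose frequencies are integer multiples of the basic frequency, as described in Equation (6):

$$\hat{P}(t) = a_0 + \sum_{k=1}^{K} \left( a_k \cos k\omega t + b_k \sin k\omega t \right) = a_0 + \sum_{k=1}^{K} A_k \sin\left( k\omega t + \theta_k \right), \tag{6}$$

where $\hat{P}(t)$ is an estimate of $P(t)$, which is made up of $K$ spectrums and the average component $a_0$. $K$ is the highest order among these spectrums. $a_k$ and $b_k$ are the amplitudes of the cosine and sine components of each spectrum, with $A_k = \sqrt{a_k^2 + b_k^2}$. $\theta_k$ is the initial phase angle of each spectrum. $\omega = 2\pi / N$ is the basic angular frequency. The number of these spectrums does not exceed $\lfloor N/2 \rfloor$, namely, $P(t)$ is approximated by a limited number of spectrums.

According to the least square method, to obtain the values of the coefficients in Equation (6), the quadratic sum $Q$ of the fitting error in Equation (7) should be minimized.

$$Q = \sum_{t=1}^{N} \left( P(t) - \hat{P}(t) \right)^2. \tag{7}$$

We calculate the partial derivatives of $Q$ with respect to $a_k$ and $b_k$ (and likewise $a_0$) and set them equal to 0, then

$$\frac{\partial Q}{\partial a_k} = -2 \sum_{t=1}^{N} \left( P(t) - \hat{P}(t) \right) \cos k\omega t = 0, \qquad \frac{\partial Q}{\partial b_k} = -2 \sum_{t=1}^{N} \left( P(t) - \hat{P}(t) \right) \sin k\omega t = 0. \tag{8}$$

According to the orthogonality of the trigonometric functions, we solve Equation (8) to get the estimated expression of each Fourier spectrum coefficient in Equation (9):

$$a_0 = \frac{1}{N} \sum_{t=1}^{N} P(t), \qquad a_k = \frac{2}{N} \sum_{t=1}^{N} P(t) \cos k\omega t, \qquad b_k = \frac{2}{N} \sum_{t=1}^{N} P(t) \sin k\omega t. \tag{9}$$

The overall variance $S^2$ of the periodic item $P(t)$ of the time-series data abstraction is defined in Equation (10):

$$S^2 = \frac{1}{N} \sum_{t=1}^{N} \left( P(t) - a_0 \right)^2. \tag{10}$$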

Let $P_k(t)$ ($k = 1, \ldots, K$) be the spectrum components of $\hat{P}(t)$. We use a statistic to evaluate the significance of the variance of each spectrum, as defined in Equation (11):

(11)

According to Equation (11), we find the spectrum with the maximum significance and take its period as the period length of the periodic item $P(t)$.
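
As a rough illustration of the FSA-based period detection in this subsection, the sketch below estimates the cosine and sine amplitudes of each harmonic by the standard least-squares formulas and picks the harmonic that explains the most variance. It assumes uniformly spaced samples, and it ranks harmonics by $(a_k^2 + b_k^2)/2$ as a simple stand-in for the significance statistic of Equation (11), whose exact form is not shown here. The function names are hypothetical.

```python
import math
from typing import List, Tuple


def fourier_coefficients(p: List[float], k: int) -> Tuple[float, float]:
    """Least-squares cosine and sine amplitudes (a_k, b_k) of harmonic k."""
    n = len(p)
    omega = 2 * math.pi / n  # basic angular frequency for a length-n sequence
    a_k = 2 / n * sum(p[t] * math.cos(k * omega * t) for t in range(n))
    b_k = 2 / n * sum(p[t] * math.sin(k * omega * t) for t in range(n))
    return a_k, b_k


def dominant_period(p: List[float]) -> float:
    """Period length (in samples) of the harmonic that explains most variance."""
    n = len(p)
    if n < 3:
        raise ValueError("need at least 3 samples")

    def explained_variance(k: int) -> float:
        a_k, b_k = fourier_coefficients(p, k)
        return (a_k ** 2 + b_k ** 2) / 2

    # Harmonics strictly below the Nyquist limit n/2.
    best_k = max(range(1, (n - 1) // 2 + 1), key=explained_variance)
    return n / best_k


# Synthetic periodic item: mean 5, amplitude 3, period of 8 samples.
samples = [5.0 + 3.0 * math.sin(2 * math.pi * t / 8) for t in range(48)]
print(dominant_period(samples))
```

For a sequence of length 48 with a true period of 8 samples, the sixth harmonic dominates, and the returned period is 48 / 6 = 8.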

3.2.2 Morphological Similarity Measurement

From Equation (9), we can see that the Fourier coefficients of each spectrum depend on the time sequence length $N$. In practical applications, time-series datasets are generated in an endless flow. Namely, newly arriving time-series data are continuously appended to the original sequence. In such a case, $N$ is constantly updated as new data arrive, so the Fourier coefficients need to be recalculated repeatedly. To effectively improve the performance of the periodic pattern recognition, we propose a morphological similarity measurement and optimize Equation (9) for the newly arriving time-series datasets. In the morphological similarity measurement, the sequence of the data abstraction is partitioned into multiple subsequences. Then, we calculate the morphological similarities of these subsequences and provide a new estimated expression of each Fourier spectrum coefficient.

Given a periodic item $P(t)$, assume that there are two subsequences $P_a$ and $P_b$ in $P(t)$, and each subsequence consists of $q$ inflection points. Namely, there are $q - 1$ linear segments in $P_a$ and $P_b$, respectively. The morphological similarity between subsequences $P_a$ and $P_b$ is measured from five aspects: angular similarity, time-length similarity, maximum similarity, minimum similarity, and value-interval similarity.

(1) Angular similarity.

Definition 2: (Angular similarity). The angular similarity $S_A(P_a, P_b)$ between two subsequences $P_a$ and $P_b$ refers to the average of the angular similarities between the individual linear segments in the two subsequences. The angular similarity of each pair of linear segments in $P_a$ and $P_b$ is equal to the ratio of the difference of inclination rates between these two segments to the larger inclination rate. $S_A(P_a, P_b)$ is calculated by Equation (12):

(12)

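where $k_{a,i}$ is the inclination rate of linear segment $i$ in subsequence $P_a$ and $k_{b,i}$ is that of segment $i$ in $P_b$. The time-length similarity $S_T(P_a, P_b)$, maximum similarity $S_{max}(P_a, P_b)$, and minimum similarity $S_{min}(P_a, P_b)$ between subsequences $P_a$ and $P_b$ are calculated in the same way.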
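
A small Python sketch may help make Definition 2 concrete. The prose describes a per-segment ratio of the slope difference to the larger slope; since every indicator is later scored in (0, 1] with larger values meaning closer shapes, this sketch assumes the per-segment score is one minus that ratio, clipped at 0, and treats two flat segments as identical. Both choices, and the function name, are assumptions for illustration only.

```python
from typing import Sequence


def angular_similarity(slopes_a: Sequence[float], slopes_b: Sequence[float]) -> float:
    """Average per-segment slope agreement between two subsequences.

    Assumed scoring: 1 - |k_a - k_b| / max(|k_a|, |k_b|), clipped at 0.
    """
    if len(slopes_a) != len(slopes_b) or not slopes_a:
        raise ValueError("both subsequences need the same, non-zero number of segments")
    scores = []
    for k_a, k_b in zip(slopes_a, slopes_b):
        larger = max(abs(k_a), abs(k_b))
        if larger == 0:
            scores.append(1.0)  # two flat segments are treated as identical
            continue
        ratio = abs(k_a - k_b) / larger
        scores.append(max(0.0, 1.0 - ratio))
    return sum(scores) / len(scores)


print(angular_similarity([1.0, 2.0], [0.5, 2.0]))
```

Here the first pair of segments differs by half of the larger slope and the second pair matches exactly, so the two per-segment scores are 0.5 and 1.0.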

(2) Value-interval similarity.

Definition 3: (Value-interval similarity). The value interval of a sequence is the difference between the mean value of all peaks of the sequence and the mean value of all valleys of the sequence. The value-interval similarity $S_V(P_a, P_b)$ of two subsequences $P_a$ and $P_b$ refers to the degree of similarity between their value intervals. The value-interval similarity of $P_a$ and $P_b$ is defined in Equation (13):

(13)

where $x^{\mathrm{peak}}_{a,i}$ and $x^{\mathrm{valley}}_{a,i}$ are the values of the peaks and valleys of the $i$-th segment in $P_a$, respectively.

Based on the above five similarity indicators, we propose a five-dimension radar chart measurement method to evaluate the morphological similarity of the time-series data abstraction. The morphological similarity between subsequences $P_a$ and $P_b$ is defined as $S(P_a, P_b) = (S_A, S_T, S_{max}, S_{min}, S_V)$, where the score range of each indicator is (0, 1]. Therefore, as shown in Figure 6, the radar chart of $S(P_a, P_b)$ is plotted inside a pentagon, where the distance from the center to each vertex is equal to 1.

Figure 6: Morphological similarity measurement of time-series data.

According to the radar chart, the value of $S(P_a, P_b)$ is the area enclosed by the five indicators, as calculated in Equation (14):

(14)

It is easy to see that each side length of the pentagon is approximately equal to 1.18 and that the area of the pentagon is approximately 2.38. Hence, the value of $S(P_a, P_b)$ is within the range (0, 2.38]. This similarity measure addresses the problem of inaccurate distance measurement caused by different data shifts and time lengths.
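
The radar chart arithmetic above can be checked with a short Python sketch. It assumes the five indicators are placed on equally spaced axes (72 degrees apart) in the order listed, so the enclosed area is the sum of five triangles, each with area $\frac{1}{2} s_i s_{i+1} \sin 72^\circ$. The function name `radar_area` is hypothetical, and the ordering of the axes is an assumption that affects the result when the scores differ.

```python
import math
from typing import Sequence


def radar_area(scores: Sequence[float]) -> float:
    """Area of the polygon spanned by scores on equally spaced radar axes."""
    n = len(scores)
    wedge = 0.5 * math.sin(2 * math.pi / n)  # half the sine of the angle between axes
    return wedge * sum(scores[i] * scores[(i + 1) % n] for i in range(n))


full = radar_area([1.0] * 5)       # all five indicators at their maximum
side = 2 * math.sin(math.pi / 5)   # side length of a pentagon with circumradius 1
print(round(side, 2), round(full, 2))  # prints: 1.18 2.38
```

With all five indicators at 1 the polygon is the full pentagon, which reproduces the side length of about 1.18 and the upper bound of about 2.38 quoted in the text.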

Based on the morphological similarity measurement in Equation (14), we update the estimated expression of each Fourier spectrum coefficient. Assuming that $N$ is the length of $P(t)$ and $g$ is the growth step of the comparison subsequences $P_a$ and $P_b$, the estimated expression of each Fourier spectrum coefficient is calculated by Equation (15):

(15)

We calculate the quadratic sum of the fitting residual sequences for each subsequence pair in $P(t)$ and get the results $Q_1, Q_2, \ldots, Q_K$, where $K$ is the number of spectrums. Finally, the optimal period length of the periodic item $P(t)$ is the one given by the fundamental frequency corresponding to the smallest $Q_k$.

(2) Periodic pattern recognition.

Different from the traditional periodic pattern recognition algorithms, a new method of periodic pattern recognition based on the time-series data abstraction is proposed in this section. The similarity of the time-series data is calculated between subsequences with the same number of inflection points. Afterwards, the subsequence with the highest similarity is taken as a period of the time-series data.

Set as the growth step of the comparison subsequences; that is, inflection points are added to the comparison subsequences each time. Take as an example, in which 2 inflection points are added to the comparison subsequences each time. Set as the first subsequence and as a comparison subsequence. The two subsequences are compared using the morphological similarity measure, which is defined as . The calculation method is detailed in the previous section. We then continue to incorporate the subsequent inflection points into , namely . The same number of inflection points in the data abstraction are then collected to compose the comparison subsequence , namely .

In addition, the number of inflection points in comparison subsequences that may contain periodic patterns can differ slightly because of the inflection-point marking and pseudo-inflection-point deletion operations. Therefore, we introduce a scaling ratio factor () to control the number of inflection points of the latter comparison subsequence . In this way, the comparison subsequences are relaxed from fixed-length rigid sequences to variable-length flexible sequences. The length of lies within a range that extends to the left and right of the length of the previous comparison subsequence . Let be the number of inflection points of subsequence and be the number of inflection points of subsequence . The scaling ratio factor is then calculated by Equation (16):

(16)

For example, assuming that and , then the value of is in the range of (). In other words, for subsequence with 5 inflection points, inflection points that closely follow it are taken as the corresponding candidate subsequences , respectively. Namely, for subsequence with 10 inflection points (), 5 different candidate comparison subsequences with different numbers of inflection points are constructed for the similarity measure. The candidate comparison subsequences are listed as follows: ; ; ; ; . Each is then used to calculate the similarity with . Finally, the candidate subsequence with the maximum similarity value is selected as the comparison subsequence , and the corresponding number of inflection points is taken as the length of . This pair of comparison subsequences and is denoted as , namely , where with 5 inflection points is the comparison subsequence with the highest similarity. Thus, the first-layer period of the time-series dataset is recognized using the optimal period length , as described in Equation (17):

(17)

where the length of each period is (). An example of the first-layer period of time-series dataset is shown in Figure 7. The detailed steps of the FSA-based time series periodic pattern recognition algorithm are presented in Algorithm 2.

Figure 7: The first-layer period model of time-series dataset.

Algorithm 2: Multi-layer time series periodic pattern recognition (MTSPPR) algorithm

Input:
   : the abstraction of the raw time-series dataset;
   : the growth step of the comparison subsequences;
   : the scaling ratio factor of the comparison subsequence length.
Output:
   : the first-layer periodic model of .

1:  calculate the non-stationary data model ;
2:  for each in  do
3:     ;
4:     for each in  do
5:         get subsequence from ;
6:         set the length of ;
7:         get comparison subsequence from ;
8:         calculate morphological similarity ;
9:         ;
10:         ;
11:     calculate the estimate value ;
12:     calculate the overall variance ;
13:     calculate the spectrum composition ;
14:     calculate ;
15:  find the maximum ;
16:  obtain the period length ;
17:  for  in  do
18:     obtain period model ;
19:     append period model ;
20:  return  .

In Algorithm 2, is the highest item of the Fourier spectra and is the length of . The length of the comparison subsequence pairs is increased by the step size of . Assume that the time complexity of each morphological similarity measurement is and that the time complexity of the first-layer period recognition is . The computational complexity of Algorithm 2 is then .
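The candidate search controlled by the scaling ratio factor can be sketched as follows. The sketch assumes a symmetric window: a first subsequence with `n_a` inflection points is compared against candidates with between `n_a * (1 - ratio)` and `n_a * (1 + ratio)` inflection points. Equation (16) is not reproduced in this text, so this rule is an assumption. The `similarity` argument is a placeholder for the morphological similarity measure, and all function names are illustrative.

```python
import math


def candidate_lengths(n_a, ratio):
    """Candidate inflection-point counts allowed by the scaling ratio."""
    # Small tolerances guard against floating-point rounding at the bounds.
    low = max(1, math.ceil(n_a * (1.0 - ratio) - 1e-9))
    high = math.floor(n_a * (1.0 + ratio) + 1e-9)
    return list(range(low, high + 1))


def best_comparison(points, start, n_a, ratio, similarity):
    """Return (score, length) of the most similar following subsequence."""
    first = points[start:start + n_a]
    follow = start + n_a  # candidates begin right after the first subsequence
    best = None
    for length in candidate_lengths(n_a, ratio):
        candidate = points[follow:follow + length]
        if len(candidate) < length:
            break  # not enough inflection points left
        score = similarity(first, candidate)
        if best is None or score > best[0]:
            best = (score, length)
    return best


def toy_similarity(a, b):
    """Hypothetical stand-in for the morphological similarity, in (0, 1]."""
    return 1.0 / (1.0 + abs(sum(a) - sum(b)))


pattern = [1, 3, 1, 4, 2, 5, 1, 3, 2, 2]
result = best_comparison(pattern * 3, 0, 10, 0.2, toy_similarity)
```

With 10 inflection points and a ratio of 0.2, the window yields five candidate lengths (8 to 12), which agrees with the five candidates mentioned in the example above.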

3.2.3 Multi-layer Periodic Pattern Recognition

Considering that a given time-series dataset may exhibit multi-layer periodicity, we propose a multi-layer periodic pattern model to adaptively recognize the periods at multiple layers. After the first-layer periodic pattern is obtained, the time-series dataset is divided into multiple periods. The contents of each period in the first-layer periodic pattern are further abstracted and represented using the Gaussian Blur function. Let be the dataset of the -th period in the first-layer periodic model , where is the period length of . We calculate the weight of each data point in using the Gaussian Blur function, as defined in Equation (18):

(18)

where is the variance of all data points in . Based on , we obtain the new value . In this way, the dataset is updated as .
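A minimal sketch of Gaussian weighting inside one period is given below. It assumes that the weight of a neighboring point depends on its index distance through a Gaussian whose variance is the variance of the data points in the period, and that the new value of a point is the normalized weighted average over a small window. Equation (18) is not reproduced in this text, so the exact form may differ. The `radius` parameter and the function name are illustrative.

```python
import math
import statistics


def gaussian_smooth(values, radius=2):
    """Replace each point by a Gaussian-weighted average of its neighbors."""
    variance = statistics.pvariance(values)  # variance of all points in the period
    if variance == 0:
        return list(values)  # a constant period is left unchanged
    smoothed = []
    for i in range(len(values)):
        total, weight_sum = 0.0, 0.0
        for j in range(max(0, i - radius), min(len(values), i + radius + 1)):
            # Gaussian weight; the normalizing constant cancels in the ratio.
            w = math.exp(-((j - i) ** 2) / (2.0 * variance))
            total += w * values[j]
            weight_sum += w
        smoothed.append(total / weight_sum)
    return smoothed


period_values = [0.0, 0.0, 10.0, 0.0, 0.0]
blurred = gaussian_smooth(period_values)
```

Because the weights are normalized, every smoothed value stays between the minimum and maximum of the original period.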

For the updated dataset , we apply the big data compression and abstraction method on to further reduce the volume of each period and extract the key information. Then, the FSA-based periodic pattern recognition algorithm is applied to the compressed first-layer dataset to obtain the second-layer periodic patterns. These steps are repeated until no significant periodic pattern can be recognized. Thus, the multi-layer periodic model of the time-series dataset is built, defined as:

,

where is the number of period layers for the time-series dataset. An example of the multi-layer periodic model of a given time-series dataset is shown in Figure 8.

Figure 8: Multi-layer periodic model of time-series dataset.

3.3 Periodicity-based Time Series Prediction

Based on the multi-layer periodic model described in Section 3.2, we propose a Periodicity-based Time Series Prediction (PTSP) algorithm in this section. Unlike traditional time series prediction methods, PTSP uses one complete period, rather than one timestamp, as the forecasting unit: according to the identified periodic models, each prediction step forecasts the contents of the next complete period instead of the data point at the next timestamp. The previous periodic models in different layers make different contributions to the contents of the coming period. The periodicity-based time series prediction method is shown in Figure 9.

Figure 9: Periodicity-based time series prediction method.

(1) Prediction based on periodic model.

For each previous periodic model, its impact on the contents of the coming period is measured by a weight value, which is calculated using the time attenuation factor. Given a multi-layer periodic model for the time-series dataset, there are multiple period models in each layer, where is the number of period layers and is the number of periods in the -th layer. Assume that is the current time period and is the next time period to be predicted. The contents of are predicted based on all of the periodic models in each layer of the identified multi-layer periodic model. To evaluate the impact of each previous model on the contents of , a time attenuation factor is introduced to calculate the weight value of each periodic model in each layer. For example, for each period in the first layer, the weight of for is defined in Equation (19):

(19)

Based on the weights of periodic models in the first layer , we can calculate the prediction component from , as defined in Equation (20):

(20)

We continue to calculate the weight of the periods in each layer for . Assume that there are periods in the -th layer model ; that is, the current period in the -th layer is , which corresponds to the current period . For each period in the -th layer, we use the time attenuation factor to measure the weight of for , as calculated in Equation (21):

(21)

Based on each prediction component calculated from each layer , we get the predicted contents of , as defined in Equation (22):

(22)

An example of the periodicity-based time series prediction process is illustrated in Figure 10.

Figure 10: Example of the periodicity-based time series prediction process.
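The weighting of previous periods can be sketched for a single layer as follows. The sketch assumes that the weight of a period decays exponentially with its distance from the current period and that the weights are normalized to sum to 1. Equations (19) to (22) are not reproduced in this text, so the exact attenuation form and the way layers are combined may differ. The parameter `decay` and both function names are illustrative.

```python
import math


def attenuation_weights(n_periods, decay):
    """Normalized weights that favor the most recent periods."""
    # Period i (1-based) lies n_periods - i steps before the current period.
    raw = [math.exp(-decay * (n_periods - i)) for i in range(1, n_periods + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def predict_next_period(periods, decay=0.5):
    """Weighted pointwise combination of previous periods of equal length."""
    if not periods:
        raise ValueError("at least one previous period is required")
    length = len(periods[0])
    if any(len(p) != length for p in periods):
        raise ValueError("all periods must have the same length")
    weights = attenuation_weights(len(periods), decay)
    return [
        sum(w * p[t] for w, p in zip(weights, periods))
        for t in range(length)
    ]


history = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
weights = attenuation_weights(len(history), 0.5)
forecast = predict_next_period(history, decay=0.5)
```

A larger `decay` concentrates the weight on the latest period, while `decay = 0` reduces the forecast to a plain average of all previous periods.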

(2) Calculation of the inflection points.

Due to the big data compression and abstraction, each periodic model in the identified multi-layer periodic model is built based on the inflection points rather than the raw time-series datasets. In this way, the predicted contents of the next time period are inflection points with the corresponding time points, rather than the data values at all time points. Therefore, we should calculate the values of all inflection points in and further fit the data values at all time points.

Considering that different periods in each layer contain different numbers of inflection points located at different time points, we need to map them to the corresponding positions on the time axis of the predicting period to form new predicting inflection points. The set of predicting inflection points in is defined as , and the prediction components from all periods together form the values of . Assuming that there is a set of inflection points in the period in , we calculate the prediction component of for , as defined in Equation (23):

(23)

According to Equation (23), the set of predicting inflection points in is integrated based on the prediction components from all previous periodic models.

(3) Fit data values at all time points in the predicting period .

Based on the predicted inflection points , the data values at all time points between these inflection points are fitted. For every two adjacent inflection points and in , the fitted data value at each time point in the range of is calculated by Equation (24):