1 Introduction
1.1 Motivation
With the rapid development of the Internet, sensor network, Internet of Things (IoT), mobile Internet and other media, a large number of datasets are continuously generated in various fields, such as large commercial, medical, engineering, and social sciences [1, 19, 27, 32]. Timeseries data are collections of datasets arranged in a time sequence, such as stock price, exchange rate, sales volume, production capacity, weather data, ocean engineering [6, 35, 37]
. As important and complex data objects, massive timeseries data truly record valuable information and knowledge about the applications, playing an important role in different application fields. Abundant data mining and analysis technologies have been provided to seek the potentially available knowledge from these datasets. Based on the previously observed timeseries data, we can forecast the probable data in the coming periods. It is interesting to seek highperformance approaches to handle the largescale and streaming arrivals of timeseries data. In addition, the accuracy and robustness of timeseries data processing methods are also hot topics in the academic and industrial fields.
The era of big data has brought both opportunities and challenges to the processing of largescale timeseries datasets. On the one hand, in the era of big data, data generation and collection are becoming easier and less costly. Massive datasets are continuously generated through various means, providing rich data sources for big data analysis and mining [21, 30]. On the other hand, for timeseries prediction, the emergence of the big data era also posed serious problems and challenges besides the obvious benefits.

Periodic pattern recognition of timeseries data is essential for time series prediction. The periodic pattern of timeseries data in the real world does not always keep a constant length (e.g. one day or one month) and may show dynamic length over time [35]. In addition, many timeseries data have the characteristic of multilayer periods. Most of the existing periodic pattern recognition work calculate and analyze the singlelayer period patterns. It is necessary to adaptively identify time periodic patterns based on datadriven to discover the potential multilayer periodic patterns.

To achieve accurate prediction, massive historical and realtime datasets are required for combination and analysis, which costs a lot of time to thoroughly excavate the historical data [17]. Therefore, it is an important challenge that how to quickly process and analyze the massive historical data in the realtime prediction process. The volume of massive datasets is usually much larger than the storage capacity of hard disks and memory on a single computer. Therefore, we need to use distributed computing clusters to store and calculate these datasets. This raises issues, such as data communication, synchronization waiting, and workload balancing, which need further consideration and resolution.

The performance of data analysis and prediction is also essential for largescale timeseries data. There are increasingly strict time requirements for realtime time series prediction in various application fields, such as stock market, realtime pricing, and online applications [36]. Rapidly developed cloud computing and distributed computing provide highperformance computing capabilities for big data mining. We need to propose efficient prediction algorithms for timeseries data and execute these algorithms in highperformance computing environments. In such a case, these algorithms can take full advantage of highperformance computing capabilities and increase their performance and scalability, while keeping lower data communication costs.
1.2 Our Contributions
In this paper, we focus on the periodic pattern recognition and prediction of largescale timeseries data with periodic characteristics, and a Periodicitybased Parallel Time Series Prediction (PPTSP) algorithm for timeseries data in cloud computing environments. A data compression and abstraction method is proposed for timeseries data to effectively reduce the scale of massive historical datasets and extract the core information. Fourier Spectrum Analysis (FSA) method is introduced to detect potential singlelayer or multilayer periodic patterns from the compressed timeseries data. The prediction algorithm is parallelized in the Apache Spark cloud platform, which effectively improves the performance of the algorithm and maintains high scalability and low data communication. Extensive experimental results show that our PPTSP algorithm has significant advantages compared with other algorithms in terms of accuracy and performance. Our contributions in this paper are summarized as follows.

To effectively handle the massive historical datasets, a Time Series Data Compression and Abstraction (TSDCA) algorithm is presented, which can reduce the data scale as well as accurately extracting the characteristics.

We propose a Multilayer Time Series Periodic Pattern Recognition (MTSPPR) algorithm using the FSA method. The firstlayer periodic pattern is identified adaptively with the FSA method and morphological similarity measure. Then, potential multilayer periodic patterns are discovered in the same way.

Based on the detected periodic patterns, a Periodicitybased Time Series Prediction (PTSP) algorithm is proposed to predict data values in subsequent time periods. An exponential attenuation factor is defined to control the impact of each previous periodic model on the prediction results.

To improve the performance of the proposed algorithms, we propose a parallel solution on the Apache Spark platform, using the Streaming realtime computing module. DistributedStreams (DStreams) objects and Resilient Distributed Datasets (RDDs) are used to store and calculate these datasets in distributed computing environments.
The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 gives the multilayer period prediction algorithm for timeseries data, including the data compression and abstraction, FSAbased periodic pattern recognition, and periodicitybased time series prediction methods. Parallel implementation of the periodic pattern recognition algorithm with Spark Streaming is developed in Section 4. Experimental results and evaluations are shown in Section 5 from the aspects of prediction accuracy and performance. Finally, Section 6 concludes the paper with a discussion of future work and research directions.
2 Related Work
In this section, we review the related work about timeseries data mining from the perspectives of data compression and representation, periodic pattern recognition, timeseries data prediction, and performance acceleration.
Focusing on largescale timeseries data compression and representation, various effective methods were proposed in [5, 15, 23, 29]. In [5], the Chebyshev polynomials (CHEB) method was used to approximate and index the
dimensional SpatioTemporal trajectory, and the best extraction solution was obtained by minimizing the maximum deviation from the true value (termed the minimax polynomial). However, CHEB is a global technique and requires expensive computational overhead for the large eigenvalue and eigenvector matrices. As an approximation technique, the Piecewise Linear Approximation (PLA) algorithm was proposed in
[25] to approximate a timeseries with line segments. The representation consists of piecewise linear segments to represent the shape of the original time series. In addition, an Indexable PLA (IPLA) algorithm was proposed in [9] for efficient similarity search on timeseries datasets. Focusing on dimensionality reduction technique, Eamonn et al. introduced a Piecewise Aggregate Approximation (PAA) algorithm [23] for highdimensional timeseries datasets. In [24], a locally adaptive dimensionality reduction technique  Adaptive Piecewise Constant Approximation (APCA) algorithm was explored for indexing largescale timeseries databases. There are other dimensionality reduction techniques, such as Singular Value Decomposition (SVD)
[14], Discrete Fourier transform (DFT)
[15], and Discrete Wavelet Transform (DWT) [29]. Detail experiments were performed in [10] to compare the above timeseries data representation methods and test their effectiveness on various timeseries datasets. However, most of the existing algorithms are implemented by dimensionality reduction or approximation, where DWT, PAA, and APCA are approximation methods with a discontinuous piecewise function. The TSDCA algorithm proposed in this work falls in the category of approximation technique. Different from the existing studies, TSDCA can extract the critical characteristics in each dimension to form a data abstraction without reducing the data dimensions. It can guarantee the invariability of the data structure between the data abstraction and the raw dataset. Similarity measurements, periodic pattern recognition, and prediction methods can be applied indiscriminately to the compressed dataset without any modification.In the field of periodic pattern recognition of time series, various methods have been proposed [4, 26, 28], such as the complete periodic pattern, partial periodic pattern, period association rule, synchronous periodic pattern, and asynchronous periodic pattern. In [26], Loh et al. proposed an efficient method to mine temporal patterns in the popularity of web items, where the popularity of web items is treated as time series and a gap measurement method was proposed to quantify the difference between the popularity of two web items. They further proposed a densitybased clustering algorithm using the gap measure to find clusters of web items and illustrated the effectiveness of the proposed approach using realworld datasets on the Google Trends website. In [12, 13], Elfeky et al. defined two types of periodicities: segment periodicity and symbol periodicity, and then proposed the corresponding algorithms (CONV and WARP) to discover the periodic patterns of unknown periods. However, based on the convolution technique, the CONV algorithm works well on datasets with perfect periodicity, but faces limitations on noisy time series datasets. The WARP algorithm uses the time warping technique to overcome the problem of noisy time series. However, both CONV and WARP can only detect segment periodicity rather than symbol or sequence periodicity, and limited in detecting partial periodic patterns. In [34], Sheng et al. developed a ParPerbased algorithm to detect periodic patterns in time series datasets, where the dense periodic areas in the time series are detected using optimization steps. However, this method requires preset expected period values. In such a case, users should have the specific domain knowledge to generate patterns. Rasheed et al. proposed a SuffixTreebased NoiseResilient (STNR) algorithm to generate patterns and detect periodicity from time series datasets [31]. The STNR algorithm can overcome the problem of finding periodicity without user specification and interaction. However, the limitation of STNR is that it only works well in detecting fixedlength rigid periodic patterns, and it is poor effectiveness in tapping variablelength flexible patterns. To overcome this limitation, Chanda et al. introduced a Flexible Periodic Pattern Mining (FPPM) algorithm, which uses a suffix tree data structure and Discrete Fourier Transform (DFT) to detect flexible periodic patterns by ignoring unimportant or undesired events and only considering the important events [26]. However, in practical time series mining, the definition of the events of unimportant and important is difficult and infeasible. In addition, most of the existing studies focused on static timeseries database and the periodic pattern recognition in a single layer. Considering that there are multiple nested periods on some realworld timeseries datasets, i.e., the temperature shows periods both daily and seasonally, we focus on the potential multilayer periodicity pattern recognition in this work. In addition, to effectively detect flexible periodic patterns without user preparation knowledge, we propose a novel morphological similarity measurement and introduce the Fourier Spectrum Analysis (FSA) method for multilayer periodicity pattern detection. The morphological similarity is measured by a fivetuple (, , , , ), which refer to the angular similarity, timelength similarity, maximum similarity, minimum similarity, and the valueinterval similarity, respectively. The combination of the FSA and morphological similarity measurement can efficiently calculate the compressed time series from incremental online time series streams. Moreover, the morphological similarity measurement can be further applied to various periodic pattern recognition algorithms.
Over the past several decades, various time series prediction algorithms were proposed in existing studies, such as seasonal autoregressive differential sliding average, HoltWinters index [20, 3, 35, 22]. In [20], a novel highorder weighted fuzzy time series model was proposed and applied in nonlinear time series prediction. George et al.
used an online sequential learning algorithm for timeseries prediction, where a feedforward neural network was introduced as an online sequential learning model
[16]. Focus on local modeling, Marcin et al. proposed a periodaware local modeling and data selection for time series prediction [3], where the period of time series is determined by using autocorrelation function and moving average filter. Shi et al. proposed an offline seasonal adjustment factor plus GARCH model to model the seasonal heteroscedasticity in traffic flow series
[35]. However, this model faces limitations in realworld transportation timeseries processing. In [18], Huang et al. introduced an online seasonal adjustment factors plus adaptive Kalman filter (OSAF+AKF) algorithm for the prediction of the seasonal heteroscedasticity in traffic flow datasets. Considering the seasonal patterns in traffic timeseries datasets, four types of online seasonal adjustment factors are introduced in the OSAF+AKF algorithm. In addition, Tan
et al. defined a timedecaying online convex optimization problem and explored a TimeDecaying Adaptive Prediction (TDAP) algorithm for time series prediction [38]. In the biomedical field, timeseries forward prediction algorithms were used for realtime brain oscillation detection and phaselocked stimulation in [8].With the emergence of big data, the processing performance and realtime response requirements of largescale time series applications have received increasing attention. Various acceleration and parallel methods were proposed for massive timeseries data processing [17, 38, 33]. In [17], a GPGPU parallelization solution was introduced for fast knowledge discovery from timeseries datasets, where a General Programming (GP) framework was presented using the CUDA platform. Efforts on distributed and parallel timeseries data mining based on highperformance computing and cloud computing have achieved abundant favorable achievements [40, 11]. Apache Spark [2] is another good cloud platform that is suitable for data mining. It allows us to store a data cache in memory and to perform computations and iteration of the same data directly from memory. The Spark platform saves huge amounts of disk I/O operation time. Spark Streaming is a realtime computing framework based on the Spark cloud environment. It provides many rich APIs and highspeed engines based on memory computing. Users can combine the Spark Streaming with applications such as flowing computing, batch processing, and interactive queries. In [30], the Spark Streaming module was used to implement the nearest neighbor classification algorithm for highspeed big data streams. In [36], an effective prediction algorithm was proposed based on the Apache Spark for missing data over multivariable time series.
3 Periodicitybased Time Series Prediction Algorithm
In this section, we propose a Multilayer Time Series Periodic Pattern Recognition (MTSPPR) algorithm for timeseries data with periodic characteristics. In Section 3.1, to accelerate the periodic pattern recognition process of largescale timeseries datasets, a data compression and abstraction method is proposed, which can effectively extract the characteristics of data while reducing the scale of massive datasets. In Section 3.2, the Fourier Spectrum Analysis (FSA) method is used to identify periodic patterns from the compressed timeseries dataset. On these bases, Section 3.3 describes the multilayer periodic pattern recognition algorithm. Each potential seniorlayer period model is constructed successively based on the periods in the previous lowlayer models.
3.1 Timeseries Data Compression and Abstraction
In many actual applications, timeseries datasets grow at high speed over time. Although various storage technologies continue to be improved and storage costs are declining, it is still difficult to cope with the rapid development of largescale datasets. To process largescale and continuous timeseries datasets using limited storage and computing resources, we propose a TimeSeries Data Compression and Abstraction (TSDCA) algorithm to effectively reduce the data volume and extract key knowledge.
Given a big data processing application, let be the raw timeseries dataset with temporal and periodic attributes, where is the data point at the time stamp . In this way, the raw dataset can be compressed by a series of data points and the slopes between these points. An example of the raw twodimensional timeseries dataset is compressed in Figure 1.
(1) Inclination measurement and inflection points mark.
To extract the characteristics of a largescale timeseries dataset, we calculate the inclination of every two data points and identify the inflection points of the dataset. The inclination between two data points is the ratio of the value difference and time difference between the two data points, as defined in Equation (1):
(1) 
where is the inclination between data points and . There are three conditions for : (a) refers to an upward trend; (b) shows a steady trend; and (c) refers to a downward trend. Examples of the inclination relationships between two data points are shown in Figure 2.
The inflection points set for is initialized as an empty set (). Set the first inflection point , . We continuously calculate the inclination between and data point , and between data points and . If , the data points , , and have a congruous trend. Namely, is not an inflection point here. In this case, we continue to calculate the slopes of the subsequent data points and multiply them with the inclination rate . Otherwise, if , it indicates an incongruous trend of data points to and to . That is, here represents an inflection point. We append to the inflection point set and set , . Similarly, the slopes of the remaining data points are computed sequentially by repeating the above steps. In this way, the largescale raw timeseries dataset is compressed and reexpressed as the form of inflection points , as described as:
,
where () is the number of inflection points. Note that the scale of the compressed dataset is much smaller than the raw dataset. Depending on these inflection points, the raw timeseries dataset is divided into multiple linear segments. These segments can be connected to form an abstract representation of the raw dataset, as shown in Figure 3.
(2) Pseudoinflection points deletion.
Considering that the set of inflection points still contains a lot of inflection points that have similar values to the neighbors in the abstract representation dataset. We need to further identify and remove these pseudoinflection points to effectively describe the significant outlines of the raw dataset.
Definition 1: (Pseudoinflection point). Pseudoinflection points refer to the inflection points of which the values have a negligible difference from their neighbors. These data points have little impact on the distribution trends and patterns of the neighborhood of the abstract representation. After removing these pseudoinflection points, the overall outlines of the abstract representation dataset will be well maintained.
We respectively calculate the slopes of every three adjacent inflection points to determine whether the middle one is a pseudoinflection point. Let be the inclination between inflection points and , as calculated in Equation (2):
(2) 
According to Equation (2), we continue to calculate the slopes , , and among , , and , respectively. The inclination relationship of inflection points , , and is calculated in terms of value differences and time difference, as defined in Equation (3) and (4):
(3) 
(4) 
where is the inclination threshold ( ) and is the threshold of the length of time ( ). If the inflection points , , and satisfy the inclination relationships in Equations (3) and (4), then is identified as a pseudoinflection point and removed from .
For example, in Figure 4, and are identified as pseudoinflection points. After removing and , the set of inflection points is updated to .
(3) Data compression and abstraction of the raw timeseries dataset.
For the raw timeseries dataset , inflection points, excepting the pseudoinflection points, are collected to form a compressed and abstracted representation . In this way, the largescale dataset can be effectively compressed to reduce the data size while effectively extracting the core information. For example, the data abstraction of the raw dataset in Figure 1 is shown in Figure 5. Detailed steps for timeseries data compression and abstraction algorithm are given in Algorithm 1.
The TSDCA algorithm consists of the processes of inflection points marking and pseudoinflection points deletion. Assuming that the number of data points of the raw dataset is equal to and the data abstraction has inflection points, the time complexity of Algorithm 1 is . The data compression ratio between and is . Benefitting from the data compression and abstraction, the storage requirement and the data processing workload of big data are reduced effectively.
3.2 Multilayer Time Series Periodic Pattern Recognition
In this section, we propose a Multilayer Time Series Periodic Pattern Recognition (MTSPPR) algorithm. A morphological similarity measurement is proposed for the continuous arrival timeseries datasets. The Fourier Spectrum Analysis (FSA) method is used to identify the potential periodic models of timeseries datasets.
3.2.1 FSAbased Periodic Pattern Recognition
Given a data abstraction of the raw timeseries dataset , can be described as a nonstationary data model, including trend item , periodic item , and random item , as described as:
(5) 
If the periodic item satisfies the expansion conditions of the Fourier series, we can extend in using a periodic length . Then, we get the Fourier expansion of in the interval . Namely, is represented as the sum of a series of spectrums doubling in frequency in the interval , as described in Equation (6):
(6) 
where
is an estimate of
, which makeups by spectrums and the average component . is the highest item in these spectrums. and are the amplitudes of the cosine and sine components of each spectrum. is the initial phase angle of each spectrum. is the basic angular frequency. The number of these spectrums does not exceed , namely, is approximated by a limited number of spectrums.According to the least square method, to obtain the values of the coefficients in Equation (6), the quadratic sum of the fitting error in Equation (7) should be minimized.
(7) 
We calculate partial derivatives of Q with and and make them equal to 0, then
(8) 
According to the orthogonality of the trigonometric function, we calculate Equation (9) to get the estimated expression of each Fourier spectrum coefficient:
(9) 
The overall variance of the periodic item
of the timeseries data abstraction is defined in Equation (10):(10)  
Let be the spectrum compositions of , we use a statistic to evaluate the significance of the variance of each spectrum, as defined in Equation (11):
(11) 
According to Equation (11), we get the spectrum with the maximum significance and set as the period length of .
3.2.2 Morphological Similarity Measurement
From Equation (9), we can see that the Fourier coefficients of each spectrum depend on the time sequence length . In practical applications, timeseries datasets are generated in an endless flow. Namely, new arriving timeseries datasets continuously append to the original sequence. In such a case, is constantly updated with the arriving of new datasets, resulting in the Fourier coefficients need to be recalculated repeatedly. To effectively improve the performance of the periodic pattern recognition, we propose a morphological similarity measurement and optimize Equation (9) for the new arriving timeseries datasets. In the morphological similarity measurement, the sequence of the data abstraction is partitioned into multiple subsequences. Then, we calculate the morphological similarities of these subsequences and provide new estimated expression of each Fourier spectrum coefficient.
Given a periodic item , assuming that there are two subsequences and in , and each subsequence consists of inflection points. Namely, there are subsequences in and , respectively. The morphological similarity between subsequences and is measured from five aspects: angular similarity, time length similarity, maximum similarity, minimum similarity, and valueinterval similarity.
(1) Angular similarity.
Definition 2: (Angular similarity). The angular similarity between two subsequences and refers to the average of the angular similarities between the individual linear segments in the two subsequences. The angular similarity of each linearsegment part in and is equal to the ratio of the difference of inclination rates between these two segments to the larger inclination rate. is calculated by Equation (12):
(12) 
where is the inclination rate of linear segment in subsequence and is that of in . The timelength similarity , maximum similarity , and minimum similarity between subsequences and are calculated in the same way.
(2) Valueinterval similarity.
Definition 3: (Valueinterval similarity). The value interval of a sequence is the difference between the mean value of all peaks of the sequence and the mean value of all valleys of the sequence. The valueinterval similarity of two subsequences and refers to the degree of similarity between their value intervals. The valueinterval similarity of and is defined in Equation (13):
(13) 
where and are the values of the peaks and valleys of the th segment in , respectively.
Based on the above five similarity indicators, we propose a fivedimension radar chart measurement method to evaluate the morphological similarity of the timeseries data abstraction. The morphological similarity between subsequences and is defined as , where each score range of each indicator is (0 1]. Therefore, as shown in Figure 6, the radar chart of is plotted as a pentagon, where the distance from the center to each vertex is equal to 1.
According to the radar chart, the value of is the area composed of the five indicators, as calculated in Equation (14):
(14)  
It is easy to obtain each sidelength of the pentagon is approximately equal to 1.18, and the area of the pentagon is . Hence, the value of is within the range of (). This novel similarity measure method addresses the problem of inaccurate distance measurement due to the different data shifts and time lengths.
Based on the morphological similarity measurement in Equation (14), we update the estimated expression of each Fourier spectrum coefficient. Assuming that is the length of and is the growth step of the comparison subsequences and , the estimated expression of each Fourier spectrum coefficient is calculated by Equation (15):
(15) 
We calculate the quadratic sum of fitting residual sequences for each subsequence pair in and get the results , , , , where is the number of spectrums. Finally, the optimal period length of the periodic item is the fundamental frequency corresponding to .
(2) Periodic pattern recognition.
Different from the traditional periodic pattern recognition algorithms, a new method of periodic pattern recognition based on the timeseries data abstraction is proposed in this section. The similarity of the timeseries data is calculated by subsequences with the same number of inflection points. Afterwards, the subsequence with the most similarity is found out as a period of timeseries data.
Set as the growth step of the comparison subsequences, namely, there are inflection points increasingly incorporated into the comparison subsequences each time. Let as an example, that is, 2 inflection points are incorporated into the comparison subsequences each time. Set as the first time subsequence and as a comparison subsequence. The two subsequences are compared with the morphological similarity measure, which is defined as . The detail of calculation method has been explained in the previous section. We continue incorporate the subsequent inflection points into the , namely . And then, the same number of inflection points in the data abstraction are collected to compose the comparison subsequence , namely .
In addition, the number of inflection points in the comparison subsequences that might exist periodic patterns may be slightly different due to the inflection points marking and the pseudo inflection points deletion operations. Therefore, we introduce a scaling ratio factor () to control the number of inflection points of the latter comparison subsequence . In this way, the comparison subsequences are optimized from fixedlength rigid sequences to variablelength flexible sequences. The length of is within the range of the left and right extension of the length of the previous comparison subsequence . Let be the number of inflection points of subsequence and be the number of inflection points of subsequences , the scaling ratio factor is calculated in Equation (16):
(16) 
For example, assuming that and , then the value of is in the range of (). In other words, for subsequence with 5 inflection points, inflection points closely followed are taken as the corresponding candidate subsequences , respectively. Namely, for subsequence with 10 inflection points (), there are 5 different candidate comparison subsequences with different numbers of inflection points constructed for similarity measure. The candidate comparison subsequences are listed as follows: ; ; ; ; . And each is introduced to calculate the similarity with respectively. Finally, the candidate subsequence with the maximum similarity value is obtained as the comparison subsequence , and the corresponding number of inflection points is taken as the length of . Thus, this pair of comparison subsequences and are termed as , namely , where with 5 inflection points is the comparison subsequence with the highest similarity. Thus, the firstlayer period of timeseries dataset is recognized using the optimal period length , as described as:
(17) 
where the length of each period is (). An example of the firstlayer period of timeseries dataset is shown in Figure 7. The detailed steps of the FSAbased time series periodic pattern recognition algorithm are presented in Algorithm 2.
In Algorithm 2, is the highest item of the Fourier spectrums and is the length of . The length of comparison subsequence pairs is increased by the step size of . Assuming that the time complexity of each morphological similarity measurement process is and the time complexity of the firstlayer period recognition is . Hence, the computational complexity of Algorithm 2 is .
3.2.3 Multilayer Periodic Pattern Recognition
Considering that there exists potential multilayer periodicity in given timeseries datasets, we propose a multilayer periodic pattern model to adaptively recognize the multilayer time periods. After obtaining the firstlayer periodic pattern, the timeseries dataset is recognized into multiple periods. The contents of each period in the firstlayer periodic pattern are further abstracted and represented by the Gaussian Blur function. Let be the dataset of the th period in the firstlayer periodic model , where is the period length of . We calculate the weight of each data point in using the Gaussian Blur function, as defined in Equation (18):
(18)  
where is the variance of all data points in . Based on , we obtain the new value . In this way, the dataset is updated as .
For the updated dataset , we apply the big data compression and abstraction method on to further reduce the volume of each period and extract the key information. Then, the FSAbased periodic pattern recognition algorithm is used on the compressed firstlayer dataset to obtain the secondlayer periodic patterns. Repeat these steps until there is no significant periodic pattern can be recognized. Thus, the multilayer periodic model of the timeseries dataset is built, as defined as:
,
where is the number of period layers for the timeseries dataset. An example of the multilayer periodic model of a given timeseries dataset is shown in Figure 8.
3.3 Periodicitybased Time Series Prediction
Based on the multilayer periodic model described in Section 3.2, we propose a Periodicitybased Time Series Prediction (PTSP) algorithm in this section. Different from the traditional time series prediction methods, in PTSP, the forecasting unit of upcoming data is one complete period rather than one timestamp. According to the identified periodic models, the forecasting object of each prediction behavior is the contents of the next complete period, instead of the data point in the next timestamp. The previous periodic models in different layers involve different contributions to the contents of the coming period. The periodicitybased time series prediction method is shown in Figure 9.
(1) Prediction based on periodic model.
For each previous periodic model, its impact on the contents of the coming period is measured by a weight value, which is calculated using the time attenuation factor. Given a multilayer periodic model for the timeseries dataset, there are multiple period models in each layer, where is the number of period layers and is the number of periods in the th layer. Assuming that is the current time period and is the next time period that will be predicted. The contents of are predicted based on all of the periodic models in each layer in the identified multilayer periodic model. To evaluate the impact of each previous model on the contents of , a time attenuation factor is introduced to calculate the weight value of each periodic model in each layer, respectively. For example, for each period in the first layer, the weight of for is defined in Equation (19):
(19) 
Based on the weights of periodic models in the first layer , we can calculate the prediction component from , as defined in Equation (20):
(20) 
We continue to calculate the weight of periods in each layer for . Assuming that there are periods in the th layer model , namely, the current period in the th layer is , which corresponds to the current period . For each period in the th layer, we use the time attenuation factor to measure the weight of for , as calculated in Equation (21):
(21) 
Based on each prediction component calculated from each layer , we get the predicted contents of , as defined in Equation (22):
(22) 
An example of the periodicitybased time series prediction process is illustrated in Figure 10.
(2) Calculation of the inflection points.
Due to the big data compression and abstraction, each periodic model in the identified multilayer periodic model is built based on the inflection points rather than the raw timeseries datasets. In this way, the predicted contents of the next time period are inflection points with the corresponding time points, rather than the data values at all time points. Therefore, we should calculate the values of all inflection points in and further fit the data values at all time points.
Considering that different periods in each layer contain different numbers of inflection points located at different time points, we need to map them to the corresponding positions on the time axis of the predicting period to form new predicting inflection points. The set of predicting inflection points in is defined as , there are multiple prediction components from all periods to form the values of . Assuming that there is a set of inflection points in the period in , we calculate the prediction component of for , as defined in Equation (23):
(23)  
According to Equation (23), the set of predicting inflection points in is integrated based on the prediction components from all previous periodic models.
(3) Fit data values at all time points in the predicting period .
Based on the predicted inflection points , the data values at all time points among these inflection points are fitted. For each two adjacent inflection points and in , the fitting data value at each time point in the range of is calculated in Equation (24):
Comments
There are no comments yet.