Time series analysis is the study of data that are collected in time order. Commonly, a time series contains a sequence of data that is taken at fixed sampling time. Nowadays, the applications of time-series data are proliferating. For examples, self-driving cars collect data about the environment evolving around them in a continuous manner, and trading algorithms monitor the changing markets to create accurate transaction decisions. According to , time-series databases (TSDBs) have emerged as the fastest growing type of databases for the last 12 months, as can be seen in Figure 1.
In general, time series can be categorized into two types: stationary and non-stationary. Roughly speaking, a time series is considered as stationary if its statistical properties remain the same every time. Formally, given a sequence and a sequence in a time series, if the joint statistical distribution of the first sequence is identical as that of the second sequence for all , then the time series is strictly stationary 
. This means that the moments, e.g., expectations, variance, third-order, and higher, are identical at all times. This definition is extremely strict for real-world applications. Therefore, a weaker definition, namely second-order or weak stationarity, is usually used to analyze time-series for practical applications. Second-order stationary time-series is a time series that has constant mean and variance over time. From this point, a second-order stationary time series is considered as a stationary time series.
The stationarity assumption is especially appealing in time series analysis due to the widely available models, prediction methods, and well-established theory. However, applying this assumption to real-world data, which mostly are non-stationary, might lead to inappropriate forecasts. One of the solutions to handle non-stationarity is to consider non-stationary time series as a collection of piece-wise, or locally, stationary time-series. This means the parameters of the time series are changing but remain constant for a specific period of time. In relation to prediction or forecasting problems, using the piece-wise stationarity assumption, we can employ stationary methods for the prediction and update the predictor when the time-series move to a different stationary state. Therefore it is imperative to continuously check whether the time series is stationary or not.
A vast array of tools for detecting non-stationarity has been introduced by researchers. Most of the detection mechanisms are based on spectral densities [3, 4], covariance structures comparisons [5, 6], and, more recently, using locally stationary wavelet model [7, 8]. These tests are developed based on a specific choice of segments of the data, which is sometimes delicate and highly subjective. In 
, the test is developed based on the discrete Fourier transform using the entire length of the data, which is undesirable in online settings.
In this work, we are interested in developing a non-stationarity detection method that can be used in real-time and combined with powerful predictors such as state-of-the-art machine learning techniques. In the machine learning community, researchers are more interested in Concept Drift detection since most of them are dealing with classification problems [10, 11, 12]. However, in regression problems such as time-series predictions, it is more suitable to consider non-stationarity. Non-stationarity is a more general concept in a sense that time-series without concept drift might contain non-stationarity, e.g., a near-unit-root auto-regressive process. Although the concept is not drifting, i.e., the parameters of the model are static, the time series evolution might contain changing mean and variance. Here, we treat non-stationarity as concept drift and can be used interchangeably.
Generally, there are two ways to detect concept drift: passive and active methods . In the passive method, predictors adaptation is done regularly, independent of the occurrence of non-stationarity [14, 15, 16]. These methods are quite effective for prediction with gradual changes in the data. However, the main issues are the resource consumption and the potential of overfitting.
On the other hand, the active detection methods monitor the data to detect changes and perform adaptation only when it is needed. These can be done either by monitoring the error of the predictor  or monitoring the features of the data [18, 19, 20, 21]. The main issue of the first approach is that the error might not reflect the non-stationarity of the data and it heavily relies on the prediction accuracy of the predictor, which can be misleading if a poor training process is used to build the predictor. In this work, we are interested in developing an active detection mechanism based on the features of the data.
We propose an online non-stationary detection method based on monitoring the evolution of the spectral contents of a time series. Our main hypothesis is that frequency domain features contain more information than time domain ones. Furthermore, we specifically develop the method to work in combination with state-of-the-art machine learning methods such as Deep Neural Networks (DNN). By combining the power of frequency domain features and the known generalization capability and scalability of DNN in handling real-world data, we hope to achieve high prediction performances.
However, it is known that connectionist models are subjected to a serious problem known as Catastrophic Forgetting, i.e., forgetting the previously learned data when learning new data . Researchers have been trying to combat this problem by using ensemble learning methods [23, 24]25], and focusing on the regularization aspects of the models [26, 27]. These methods are mainly tested on classification problems. In regression problems, more specifically real-world time series problem, it is highly possible that the patterns in the past might not appear again in the future, such as the IBM stock closing price prediction problem, and the future data is highly affected by the past data close to the future only. Therefore, we propose an approach to include some previous data in the past that, which size is variable with respect to the degree of non-stationarity.
Our contribution is summarized as follows:
We develop an algorithm to detect non-stationarity based on the evolution of the spectral contents of the time series.
We develop an online learning framework that combines the detection method with online machine learning methods efficiently.
We propose an algorithm to proportionally include some data in the past to handle Catastrophic Forgetting.
We present comprehensive experiments to validate our hypothesis and show the effectiveness of our proposed approach. We performed rigorous experiments to test different distance functions to monitor the evaluation of the spectral contents. We are interested in comparing the frequency domain feature extraction performances with the time-domain feature extraction ones. Finally, we show the superiority of the DNN over several machine learning methods.
The rest of the paper is organized as follows. Section II describes the main approach developed in this work, namely SAFE. Section III explains the mechanism for embedding predictors with SAFE. Section IV elaborates the datasets, experimental settings, and performance metrics used to validate our hypothesis and show the effectiveness of our proposed framework. Section V presents the experimental results and discussions. Finally, section VI concludes the paper and provides directions for further research.
Ii The SAFE Approach
In this section, the proposed SAFE approach is presented. SAFE is a technique for explicitly detecting non-stationarity in time series. This technique monitors the evolution of the local spectral energy of a time series and computes the discrepancy between the energy at present and previous time instances. The discrepancy then is used to test whether non-stationarity presents or not.
SAFE consists of two main modules: feature extraction, and non-stationarity detection modules. In the first module, Short-Time Fourier Transform (STFT) is applied to extract frequency contents of the time series at each instant time. The results of STFT are frequency values in a complex form. Therefore, the spectral energy of each frequency is computed to simplify our calculations.
The second module uses Simple Moving Average (SMA) and Exponentially Weighted Moving Average (EWMA) 
methods to estimate the long-term and immediate responses of the evolution of the spectral energy through a distance function. In other words, the difference of the spectral energy at every instant time is considered rather than the spectral energy itself.
In an online learning setting, an incoming observation together with its past values are concatenated to form a window in which the STFT is performed. The result of the transformation then compared with the previous window using a distance function. This way, changes can be detected as soon as a new observation arrives.
Ii-a Frequency-Domain Feature Extraction
In the literature, several time-domain statistical features are used to characterize a time series [18, 19]. In this work, a frequency domain approach is presented. There are two main hypotheses in this work:
In stationary conditions, the extracted features are expected to be stationary, or at least fluctuating around a stationary value. Therefore, whenever this is not the case, it can be deduced that a non-stationarity presents.
More information can be gained in the frequency domain than that can be gained in its counterpart. Therefore, it is expected that the non-stationary detection is more accurate, in terms of true positive and detection delay.
In the previous section, it is assumed that non-stationary time-series can be split into chunks of stationary parts. Therefore, a sliding window with a sufficient width can be applied to obtain local stationarity status of a signal. Therefore, it is intuitively suitable to apply STFT to extract the frequency contents of the signal. The discrete STFT can be expressed as
where the is the time series of interest, , and represent the indicators of time and frequency, respectively; while is the length of the window function . In this work, a Hamming window function is used. The choice of the sliding window width determines the time-frequency resolution. A wide window provides better frequency and blurred time resolutions while a narrow window works in an inverse way. Once the complex values of each frequency of interest are computed, their spectral energies are then computed. Figure 2 illustrates the process of STFT for every sliding window. Take STFT as an example. When a new observation arrives, the STFT is computed using this values and its several values in the past. It is expected that STFT and will be similar if the series is stationary and diverged if it is not the case.
To show the effectiveness of the frequency-domain feature extraction on capturing the non-stationarity, a small experiment is conducted and the results are shown in Figure 3. In this experiment, four modes of non-stationarity are injected into the time series:
The first point, which is denoted by at , illustrates a gradual non-stationarity in term of variance.
The second point, which is denoted by at , shows an abrupt non-stationarity in term of the mean of the series. After this point, the mean is constant, which introduces a bias in the series.
The third point, which is denoted by at , depicts an abrupt non-stationarity in term of the mean of the series. However, the mean keeps increasing after this point. At this interval, the non-stationarity is in continuous mode.
In the last point, the series goes back to the original stationary form.
Indeed, there are many modes of non-stationarity that are not included in the experiment. However, these modes are sufficient to illustrate the necessary behavior of non-stationarity for time series prediction. The figure also shows that the spectral energies represent the behavior of the process, which in this case is concentrated in the lower frequency bin. The energy behavior after point (2) looks similar to the one before point(1) although they have different means. This should not be considered as a problem since the important part is the changes at the point of interest, which will be reflected when the distance between points is calculated. It should also be noted that the last point, when the system goes back to the original form, is important to consider since in connectionist modeling the predictor tends to forget the past when a new concept is learned. This creates a problem called Catastrophic Forgetting. By keep monitoring the changes, the predictor can be trained to learn the previous concept when necessary.
Ii-B Non-stationarity Detection Module
The next step after extracting features is to detect whether a non-stationarity occurs or not using the non-stationarity detection module. There are two sub-modules in this module: the distance module, which computes the similarity between to consecutive extracted features, and the non-stationarity test module, which decides whether the observed distances translate to non-stationarity or not. Furthermore, to find the most compatible distance function that can capture the non-stationarity better, three distance functions are tested, namely the Euclidean, absolute Cosine, and absolute Pearson distances. The absolute term is needed since we are interested only in the magnitude of similarity, and not in the direction.
The subsequent step is detecting non-stationarity based on the computed distances, which is developed based on SMA and EWMA. EWMA method estimates the moving mean of a random sequence by giving more importance to recent data. For a set of values of a variable given as , which has mean , the EWMA estimates the following
where . The parameter
weighs the importance of the recent data compared to the older ones. This is suitable to the proposed feature extraction method, where a new observation is appended to the sliding window and, in the same time, should provide immediate insight that a non-stationarity occurs. In addition, this capability is especially important for detecting gradual non-stationarity. Furthermore, the standard deviation ofis given as
The output of EWMA represents the immediate estimated mean of the distance while SMA represents the long-term evolution of the mean. Based on this output, a decision about the stationarity has to be made. To do this, an algorithm called SAFE is proposed. This algorithm is illustrated in Algorithm 1. The input of the algorithm is a sequence of data or time series, and initialization is required for some parameters such as for the weight of EWMA; for the warning counter; for the trigger detection threshold; for the warning detection threshold; and for the warning duration.
The first step in the algorithm after initialization is to append new incoming data to the window of previous data to form , as depicted in line 2 of the algorithm. Next, STFT is applied to , which results in . Subsequently, the distance of with its previous is computed. In the initial stage of the algorithm, and are not available. It is safe to assume that and since it is not going to significantly affect the rest of the computation.
The next step is to apply both EWMA and SMA to , which results in and . The is necessary since it represents the long-term states of , in particular when a non-stationarity is continuously occurring, as depicted Figure 3 point 3 to 4. Furthermore, the standard deviation of is calculated using Equation 3. This standard deviation is used together with the control limits and as moving thresholds to determine whether is still inside a particular stationary mode or not. This idea is illustrated in Figure 4. Line 8 and 11 of Algorithm 1 impose the moving thresholds to . If at an instant time is greater than , then the non-stationary flag is raised. However, when is greater than , the algorithm waits until certain duration to raise the flag. This is also useful when
is fluctuating insignificantly, which might be due to outliers and/or other unpredictable factors in the data. The flag can be used by predictors to update their parameters when necessary, which is more efficient compared to the blind adaptation scheme.
Iii Embedding SAFE to Online Predictors
An online predictor is applied only when a non-stationarity is detected. Initially, a predictor is trained using a presumed stationary dataset in an off-line manner. The parameters obtained from this training are used as initial conditions. In a simulation case, we can select a period of data where the stationary properties hold; while in a real-world case, an initial predictor is trained using all data available at hand.
Once the non-stationarity flag is raised, the next step is to update the parameters of the chosen predictor. The predictor should support online learning since some predictors require training from scratch when new data are available. Some notable online learning algorithms are online passive-aggressive 29], and deep neural networks. These learning algorithms are suitable to combine with SAFE. Furthermore, mini-batch SGD also will be more suitable for SAFE since we can include previous data points to form a mini batch and update the predictors accordingly.
It is known that updating these predictors leads to catastrophic forgetting. To avoid this problem we include previous data so that the model also learn the new data using a portion of data from the past. The question is, how many data points should we include in the online adaptation? To answer this question, we introduce the proportional deviation algorithm.
The main idea of this algorithm is to include several numbers of data points proportional to the absolute deviation of from . Large deviation means the new data is far from the previous stationary point. Therefore, it is intuitively suitable to include more data when the deviation is large and vice versa. The size of the mini batch is computed as follows
where is the number of data points included in the past or the size of the mini batch, and is the proportional gain to scale the mini batch. The choice of depends on the applications and affects the speed of the adaptation. The round operation rounds the calculation to the nearest integer. This operation is necessary since the size of the mini batch has to be an integer. Algorithm 2 is introduced to illustrate this procedure.
Iv Experimental Settings and Datasets
In this section, we describe the datasets and experimental settings that are used to test our hypothesis and to illustrate the effectiveness of the proposed approach. The datasets consist of two types: artificial and real-world data.
Artificial data allow us to perform an effective analysis because the exact locations of non-stationarity are known, which are not the case in real-world data. However, unless we are able to capture all possible non-stationary conditions, the proposed approaches that are tested using artificial data, might not work well in practice. Therefore, we use real-world data to test the effectiveness our approach. Since the exact locations and behavior of the non-stationarity is not known, we test the real-world data in the prediction stages only. The goal of the test is to see whether the proposed approach is able to correct the prediction that is harmed by non-stationarity.
We propose several experimental settings to test each element in our approach. The settings are defined as follows:
Experiment to find the most suitable distance function for our approach. We investigate three types of distance functions: Euclidean, Pearson, and Cosine distances. This is done using artificial data.
Experiment to test whether extracting features in frequency domain leads to better non-stationarity detection results than in time domain. This is done using artificial data.
Experiment to investigate which predictor performs well in our approach. This is done using linear and nonlinear artificial data.
Experiment to test which domain of feature extraction is better for prediction using real-world data.
At the end of this section, we introduce performance metrics used in each experiment.
Here, we describe the datasets used in the experiments.
Iv-A1 Artificial Datasets
We used two sets of artificial data for experimental settings 1, 2, and 3. The first set contains five time-series data sets that illustrate some behavior of stationary and non-stationary time series. These data sets are inspired, with modifications, by  and , and are given by the following processes
(TS-A) Stationary AR(1) process. This is a stationary process with , and a range of values of .
(TS-B) Piece-wise stationary AR process with obvious changes. The changes in this process occur at two known locations .
(TS-C) Piece-wise stationary AR process with less obvious changes. The change in this process is less observable and occur at .
(TS-D) Piecewise stationary near-unit-root process with changing variance. This process has changes in variance occur at two points , where .
(TS-E) Piecewise stationary ARMA(1, 1) process. This process has three points of changes .
The second set of artificial data is inspired by . This set consists of two linear time series and two non-linear time series; both linear and non-linear time series are known to have parameter changes in some specified points. The linear datasets have similar structure, which is given by AR(p) process. Time series linear-1 and linear-2 are respectively given by AR(4) and AR(5) processes. AR(p) process is given as follows:
The nonlinear time series are constructed using the following processes:
where Equation IV-A1 and Equation 12 represent nonlinear-1 and nonlinear-2 time series, respectively. Tabel I presents the parameters used in all the linear and nonlinear time series. Column Point provides points where the parameters are implemented.
Iv-A2 Real-World Datasets
We use two real-world datasets to test our proposed approach. The first one is IBM stocks closing price, and the second one is traffic flow of freeways in California.
The IBM dataset was gathered from Yahoo Finance historical data. We collected the stock daily closing price from January 8th 1962 to September 5th 2017, which is around 14000 data points were collected.
The traffic flow dataset was obtained from the Caltrans Performance Measurements Systems (PeMS) . The traffic flow was sampled every 30 seconds using inductive-loop deployed throughout the freeways. These data were aggregated into 5-min duration by PeMS. Furthermore, the traffic data are aggregated further into 15-min duration based on the recommendation of Highway Capacity Manual 2010 . We collected the traffic flow data of a freeway from January 1st 2011 to December 31st 2012.
Iv-B Experimental Settings
We apply the datasets to four different experimental settings. In the first setting, we use TS-B to TS-C to find which distance function gives us the best performance on clearly and not so clearly observable changes. In this part, we also illustrate the evolution of each distance function in relation to the time series. Each dataset is tested over 100 trials to account for the randomness introduced by the noise. To gain insight on the performance, we measure the total number of detection, false alarm, hit rate, missed detection, specificity, detection delay, and execution time.
In the second setting, we validate our hypothesis that by extracting the features in frequency domain we can get more information that in the time domain. We use TS-A to measure the false alarm and TS-B to TS-E to measure the total number of detection, false alarm, hit rate, missed detection, specificity, detection delay, and execution time. In the experiment, each dataset is tested over 100 trials.
For the third setting, we use linear (Linear-1, Linear-2) and nonlinear (Nonlinear-1, Nonlinear-2) time series with changes to test the prediction performance when SAFE is embedded into three different predictors namely Passive-Aggressive Regressors, Kernel Linear-SVM, and Deep Neural Networks. In the Kernel Linear-SVM, the original features are mapped into a randomized low-dimensional feature space, which is then trained using linear SVM . In this experiment, we measure the mean-squared error (MSE), execution time, and percentage update required. We run 20 simulations to account for the randomness.
Finally, we use real-world datasets, namely IBM and Traffic Flow, to measure the performance of deep neural networks which are combined with time-domain feature extraction and SAFE. In this experiment, we observe the MSE, execution time, and percentage update required. In addition, we illustrate all the experimental settings performance with graphs showing their responses over time.
Iv-C Evaluation metrics
In this section, we describe the performance metrics used in the experiments.
Total number of detection. We report the number of points detected in a time-series. The change point is considered true if it within of the sample size. As an example, the number of detection = 1 means there is only 1 change point is detected. This number is then cumulated over 100 trials. We define false positive as FP, true positive as TP, true negative as TN, and false negative as TN.
False Alarm Rate. It measures the number of change points detected that when there are no actual changes occur. Another name for this metric is False Positive Rate, which is given as .
Hit Rate. It measures the proportion of correctly detected changes over the actual number of change points, which is given as .
Missed Detection Rate. It is the number of undetected changes when there are actual change points over the actual number of change points, which is given as .
. It reflects the number of stationary points classified as stationary points over the actual number of stationary points. This is calculated as
Detection Delay. It measures the distance, in the number of steps, of the detected changes from the actual points.
Execution Time. It measures the average time, in seconds, required to perform an experiment over a specific number of trials.
Mean-Squared Error. This metric measures the similarity between predictions and actual time series. We used two types of MSE: the trajectory of MSE, which is defined as the evolution of the MSE at every instant time, and the overall MSE, which measures the total MSE of the whole test dataset.
Percentage of Update. The number of updates performed over the number of possible updates. As an example, we have a time series with 100 data points, and we start the online update procedure from . Let us assume our algorithm updates the predictors 15 times. It means the percentage of the update is 15%. If we perform blind adaptation scheme, the percentage of the update will be 100%.
V Results and Discussion
In this section, we discuss the experimental results to evaluate the performances of our proposed approach. The results are grouped according to the experimental settings mentioned in the previous section.
V-a Distance Functions
There are various ways to compute a distance between two vectors; the popular ones are Euclidean, Pearson, and Cosine distances. To decide which distance function gives better performance on non-stationary detection, we conducted experiments using TS-B and TS-C datasets. These datasets illustrate the obvious and non-obvious changes that may occur in a time series.
First, it is important to show that these three distance functions can capture changes occur in a time series. Therefore, we present a simulation result showing the responses of the distances to the time series. In this simulation, we applied the SAFE approach to the TS-C dataset with a sliding window of 5 samples. We chose the TS-C dataset because this dataset contains non-obvious changes. Intuitively, when more obvious changes occur, the response of the distance will be more distinguishable. Furthermore, we calculated the distances using the three distance functions mentioned in the previous paragraph. The result of this experiment is shown in Figure 5. This figure illustrates that when a non-obvious change occurs at a breakpoint , the three distances respond to the change immediately. It should be noted that the responses of these three distances are different, especially when using the Euclidean distance. Although Pearson and Cosine distances yield similar responses, they are still distinguishable. The response of the Pearson distance is a little bit smoother than that of the Cosine distance.
Next, we investigate the detection performances of the SAFE when it is combined with three distance functions. The size of the sliding window is kept at 5 samples. Other parameters that have to be set is and . It is suggested in  that . Since the detection performances highly depend on the thresholds
, we have to set these thresholds so that three distances have equal performance in one of the evaluation metrics. To achieve this, we set the thresholds such that the three distances have similar false alarm rate on TS-B, which is. From here, we can judge the other performance metrics given the false alarm rate.
To achieve the targeted false alarm rate, we performed several trials. We tried several values of , and we found that works best for all the three distances. The warning thresholds for Euclidean, Pearson, and Cosine distances are 2.85, 0.75 and 1.4, respectively, and the trigger thresholds are 3.35, 1,25 and 1.9, respectively. The size of the sliding window for the SMA is 20; this value is able to capture the long-term evolution of the distances. It is expected that these parameters work for different types of non-stationarity. Therefore, these parameters are fixed for all experiments using TS-A to TS-E.
Table II shows the detection performances of the three distances over 100 simulations. We can see that in TS-B and TS-C datasets, Euclidean distance provides superior results in terms of the number of detection, hit rate, missed detection, and execution time. The highest average detection delay on TS-B is produced by Euclidean distance, but this is not the case on TS-C. Overall, Euclidean distance produces better performances than other distances do. Therefore, we chose Euclidean distance as the distance function for the rest of the experiments.
|Det. delay (samples)|
|Exec. time (s)|
V-B SAFE Vs Time-domain Feature Extraction
In the second setting, we compare the performances of SAFE with those of time-domain feature extraction (FE). The time-domain FE is inspired by 
. We extracted 4 linear (auto-correlation, variance, skewness coefficient, kurtosis coefficient) and 1 nonlinear (bi-correlation) time-domain features. The detection method after computing the distance is similar to one used by SAFE in Algorithm1.
The parameters of the SAFE approach is kept the same as the previous experimental setting, and the parameters of the time-domain FE are set to achieve around 0.05 false alarm rate on TS-B. To achieve this false alarm rate, the forgetting factor is set to 0.3 while and are set to 3 and 3.5, respectively. The sliding window size of the SMA is set to 20. Both SAFE and time-domain FE are experimented using TS-(A-E).
The first experiment on this setting is performed on TS-A. Since TS-A is a stationary time series, it is expected that low false alarm rates are found on both SAFE and time-domain FE. Table III shows that acceptable false alarm rates are found in both cases. It is evident that using different , SAFE produces lower false alarm rates than time-domain FE does. Indeed, the false alarm rate on stationary time series ideally should be 0. This can be achieved if we set both the thresholds to the higher values. However, setting higher thresholds may lead to poor detection performances. This is up to the designer to decide which performances are important in their applications.
The second set of the experiments is done to test the non-stationarity detection performances using TS-(B-E). The results are summarized in Table IV. This table shows that the overall performance of SAFE on the datasets a better in terms of number detection, false alarm rate, hit rate, missed detection, specificity, detection delay, and execution time. It should be noted that although in some aspects time-domain FE has almost similar performances than SAFE does, time-domain FE has significantly higher average time to execute the experiments. In conclusion, time-domain FE executes the experiments more than 3 times slower than SAFE does.
|SAFE||time-domain FE||SAFE||time-domain FE|
|Det. delay (samples)|
|Exec. time (s)|
|SAFE||time-domain FE||SAFE||time-domain FE|
|Det. delay (samples)|
|Exec. time (s)|
To illustrate the performance of SAFE in more details, we present several graphs showing the predicted change points versus the ground truth. Figure 6 shows the detection on TS-B. From the previous section, we know that TS-B has two change points. However, the figure shows that the proposed algorithm detects three change points, where the third detected point is considered as a false positive. This is acceptable since it can be seen, by inspection from the first row of the graph, that the time series actually looks non-stationary in term of the variance. The second row of the graphs shows the evolution of and the warning and trigger thresholds. In Figure 7, we can see that similar behavior, where few false alarms present, is also shown in the experiment on TS-C.
Furthermore, different behavior is observable in an experiment using TS-D. Figure 8 shows that the false alarm rate is somewhat higher than the other experiments using different time series. If we look at the behavior of the time series closely, it certainly depicts a non-stationary behavior. This behavior is due to the fact that TS-D is a piece-wise stationary near-unit-root process with changing variance. Near-unit-root process means the process dynamic is close to unstable behavior. Therefore, it is expected that the SAFE approach considers some parts of the time-series as non-stationary, especially in the last part of the series. It is logical to consider that the process contains continuous non-stationarity, rather than just a change in the breakpoint. This way, our chosen online predictor can be continuously updated to adapt to the non-stationary process.
V-C Choice of Predictors
The third setting of the experiments is employed to test which predictor gives better prediction performances. We use Linear-1, Linear-2, Nonlinear-1, and Nonlinear-2 datasets and test them with three different predictors namely Passive-Aggressive Regressors (PAR), Kernel Linear-SVR (KSVR), and Deep Neural Networks (DNN). Before we run the experiments, both parameters in Algorithm 1 and 2 have to be set. We set to be 0.3, 10, 20, and 0.1 respectively.
Furthermore, the parameters of the predictors also need initialization. For PAR, we set the aggressiveness parameter to 0.05. For KSVR, we used epsilon insensitive as the loss function,regularizer with a constant equals to
. Finally, for the DNN, we set the number of hidden layers to 2, where each hidden layer contains 200 neurons, and use Rectified Linear Unit (ReLU) as the activation function. To avoid over-fitting, we implemented a drop out regularizer with rate equals to 0.1. All of the parameters were tuned using a similar off-line training scheme. We trained the predictors using the first 2000 samples, and validated using the next 1000 samples. We did not implement K-fold cross-validation since, in time series prediction, it is not suitable to train using data that come after the validation data. Basically, we trained the predictors until the validation errors stop decreasing. The trained predictors are then used as initial models that will be updated when they are triggered by the SAFE algorithm.
Finally, the rest of the data are used to test the predictors. The experimental results of these experiments are summarized in Table V(a). The results show that the smallest prediction errors on all datasets are achieved using DNN as the predictor. However, the average execution time of this predictor is also shown to be the highest among all. In some applications, it might be crucial to have slow execution time. Accordingly, if we concern more about execution time than prediction errors, then DNN might not be the best choice as a predictor. However, in an application such as traffic flow prediction, where the sampling time is 15 minutes, DNN is suitable as a predictor since it produces significantly lower prediction errors than the rest of the predictors do.
Figure 9 depicts the prediction performances on Linear-1 dataset. The first graph shows that all of the predictors are able to follow the ground truth closely. However, if we look at the errors on the second graph, the DNN error is fairly lower than those of other predictors although the off-line error of the DNN is higher than that of PAR. The next step is to compare the performance of the DNN with the baseline predictor, which is the predictor that is not updated. It can be seen in Figure 10 that while the error of the DNN stays constant, the error of the baseline predictor is drifting to a larger value. We can conclude that updating the SAFE approach is reliable for non-stationary time series prediction.
Lastly, Figure 11 and 12 shows that similar performances are shown in the nonlinear datasets. Although the predictor errors start around similar values, the DNN performs better in the long run. In addition, all of these prediction performances are achieved by updating the predictors when necessary only. This is reflected by the percentages of the update for all datasets that are less than 35%.
V-D Real-world Data Experiments
The objective of the last set of experiments is to test the prediction performance of DNN on two real-world datasets under SAFE and time-domain FE. The first dataset is IBM dataset. We use the data from January 8th 1962 to January 7th 1977 for training; January 8th 1977 to January 7th 1982 for validation; and January 8th 1982 to September 5th 2017 for testing. The second dataset is traffic flow dataset of San Fransisco Bay Area, District 4, California. We use the data from January 1st 2011 to August 31st 2011 for training; September 1st 2011 to December 31st 2011 for validation; and January 1st 2012 to December 31st 2012 for testing.
The off-line predictors for both SAFE and time-domain FE in each dataset are identical. For the IBM dataset, the DNN consists of 3 hidden layers, where each layer has 100 neurons. To avoid over-fitting, we implemented drop-out regularizers with rate equals to 0.1. The training scheme of the DNN is similar to that of explained in the previous section. Furthermore, before the online predictor is applied, the parameters of the SAFE approach and time-domain FE have to be set. The warning and trigger thresholds for the SAFE approach are 0.025 and 0.5 respectively while the warning and trigger thresholds for the time-domain FE are 0.27 and 0.55, respectively. The proportional deviation gain for both the approaches is set to 0.1.
The results of the proposed approach on IBM dataset is summarized in Table V(b). In general, the results show that higher percentage update leads to significantly better MSE, but the execution time also increases significantly. The table also shows that SAFE produces errors around three times lower than time-domain FE does, and executes the experiments faster than time-domain FE does. However, both of the approaches are able to provide acceptable performances compare to the baseline predictor.
Figure 13 illustrates the performance comparisons between the baseline predictor, SAFE-DNN, and time-domain FE-DNN in terms of the time-series prediction and MSE. We present a case where the percentage update is around 2.7%. We can see from the figure that if the predictor does not adapt to the non-stationarity, the time-series prediction drifts away from the actual values. This is not the case on both the approaches. In term of the errors, our proposed approach maintains low error throughout the experiments. The competing approach, however, drifts a little bit at the end when the prediction becomes more difficult due to the highly non-stationarity in the data.
Next, we test our approach with traffic flow dataset. The baseline DNN is configured as follows: the number of hidden layers is 3; each hidden layer contains 125 neurons; and the drop-out rate equals to 0.5. The activation function of both the hidden layers and the output layer is ReLU since we know that the traffic flow values cannot be negative. We select 5 freeways in our experiments. The results are shown in Table V(b). The results illustrate similar behavior as the ones on IBM datasets. In general, the combination of SAFE and DNN produces superior results in terms of errors and execution time.The evolution of the errors is shown in Figure 14. It can be seen that the errors start at the same level. However, the baseline error drifts away while the online predictor’s error keeps decreasing, and the SAFE-DNN error is always the lowest compared to the other errors.
Lastly, Figure 15 shows the prediction of traffic flow time-series using baseline predictor, SAFE approach, and time-domain FE approach. We select a portion of the prediction to better illustrate the results. It can be seen that the baseline predictor does not produce acceptable traffic flow prediction while the SAFE-DNN and time-domain FE-DNN do. Although the predictions of both the approaches look similar, they are essentially different, especially in the valley parts of the traffic flow. It should be noted that this excellent prediction is obtained only by updating the predictor 5% throughout the experiments. This means we save around 95% of the processor or GPU cycles.
This paper presents an approach to actively detect non-stationarity for time series prediction. The approach monitors the evolution of the spectral contents of time series using a distance function. We have successfully conducted comprehensive experiments to validate our hypothesis and test the effectiveness of our proposed approach on artificial and real-world datasets.
The experiments show that the approach is able to achieve high long-term prediction performances while significantly saving computational resources in terms of processor and GPU cycles. Although DNN requires more computational time than other predictors do, it is clearly worth to consider as an online predictor since its overall prediction errors are notably lower than those of the other predictors. The implementation of the proportional algorithm to variably include some data in the past makes the online adaptation of the predictors more flexible, i.e., there is no need to fix the batch size of the online training procedure.
To go further with this research, we can expand the approach to work with multi-step time series predictions. Furthermore, since Long-Short Term Memory recurrent neural networks are powerful in handling sequential data, this type of neural networks is worth to investigate. Moreover, the proposed method can be used to work with large-scale time-series, where distributed neural networks, i.e., DNN with multitask learning, are appropriate.
-  DB-engines, DBMS popularity broken down by database model, 2017 (accessed October 8, 2017).
-  G. P. Nason, “Stationary and non-stationary time series,” Statistics in Volcanology. Special Publications of IAVCEI, vol. 1, pp. 000–000, 2006.
-  M. Priestley and T. S. Rao, “A test for non-stationarity of time-series,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 140–149, 1969.
-  S. Adak, “Time-dependent spectral analysis of nonstationary time series,” Journal of the American Statistical Association, vol. 93, no. 444, pp. 1488–1501, 1998.
-  E. Andreou and E. Ghysels, “Structural breaks in financial time series,” Handbook of financial time series, pp. 839–870, 2009.
-  I. Berkes, E. Gombay, and L. Horváth, “Testing for changes in the covariance structure of linear processes,” Journal of Statistical Planning and Inference, vol. 139, no. 6, pp. 2044–2063, 2009.
-  H. Cho and P. Fryzlewicz, “Multiscale and multilevel technique for consistent segmentation of nonstationary time series,” Statistica Sinica, pp. 207–229, 2012.
-  K. K. Korkas and P. Fryzlewicz, “Multiple change-point detection for non-stationary time series using wild binary segmentation,” Statistica Sinica, vol. 27, no. 1, pp. 287–311, 2017.
-  Y. Dwivedi and S. Subba Rao, “A test for second-order stationarity of a time series based on the discrete fourier transform,” Journal of Time Series Analysis, vol. 32, no. 1, pp. 68–91, 2011.
-  F. Fdez-Riverola, E. L. Iglesias, F. Díaz, J. R. Méndez, and J. M. Corchado, “Applying lazy learning algorithms to tackle concept drift in spam filtering,” Expert Systems with Applications, vol. 33, no. 1, pp. 36–48, 2007.
-  G. J. Ross, N. M. Adams, D. K. Tasoulis, and D. J. Hand, “Exponentially weighted moving average charts for detecting concept drift,” Pattern Recognition Letters, vol. 33, no. 2, pp. 191–198, 2012.
-  P. M. Gonçalves Jr and R. S. M. De Barros, “Rcd: A recurring concept drift framework,” Pattern Recognition Letters, vol. 34, no. 9, pp. 1018–1025, 2013.
-  G. Ditzler, M. Roveri, C. Alippi, and R. Polikar, “Learning in nonstationary environments: A survey,” IEEE Computational Intelligence Magazine, vol. 10, no. 4, pp. 12–25, 2015.
-  J. Z. Kolter and M. A. Maloof, “Dynamic weighted majority: An ensemble method for drifting concepts,” Journal of Machine Learning Research, vol. 8, no. Dec, pp. 2755–2790, 2007.
-  J. A. Guajardo, R. Weber, and J. Miranda, “A model updating strategy for predicting time series with seasonal patterns,” Applied Soft Computing, vol. 10, no. 1, pp. 276–283, 2010.
-  R. Elwell and R. Polikar, “Incremental learning of concept drift in nonstationary environments,” IEEE Transactions on Neural Networks, vol. 22, no. 10, pp. 1517–1531, 2011.
-  L. Moreira-Matias, J. Gama, and J. Mendes-Moreira, “Concept neurons–handling drift issues for real-time industrial data mining,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 96–111, Springer, 2016.
-  R. C. Cavalcante, L. L. Minku, and A. L. Oliveira, “Fedd: Feature extraction for explicit concept drift detection in time series,” in Neural Networks (IJCNN), 2016 International Joint Conference on, pp. 740–747, IEEE, 2016.
C. Alippi, G. Boracchi, and M. Roveri, “A just-in-time adaptive classification system based on the intersection of confidence intervals rule,”Neural Networks, vol. 24, no. 8, pp. 791–800, 2011.
-  S. Liu, M. Yamada, N. Collier, and M. Sugiyama, “Change-point detection in time-series data by relative density-ratio estimation,” Neural Networks, vol. 43, pp. 72–83, 2013.
-  C. Alippi, G. Boracchi, and M. Roveri, “Just-in-time classifiers for recurrent concepts,” IEEE transactions on neural networks and learning systems, vol. 24, no. 4, pp. 620–634, 2013.
-  R. M. French, “Catastrophic forgetting in connectionist networks,” Trends in cognitive sciences, vol. 3, no. 4, pp. 128–135, 1999.
-  A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,” arXiv preprint arXiv:1606.04671, 2016.
-  R. Polikar, L. Upda, S. S. Upda, and V. Honavar, “Learn++: An incremental learning algorithm for supervised neural networks,” IEEE transactions on systems, man, and cybernetics, part C (applications and reviews), vol. 31, no. 4, pp. 497–508, 2001.
-  C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra, “Pathnet: Evolution channels gradient descent in super neural networks,” arXiv preprint arXiv:1701.08734, 2017.
-  K. Hashimoto, C. Xiong, Y. Tsuruoka, and R. Socher, “A joint many-task model: Growing a neural network for multiple nlp tasks,” arXiv preprint arXiv:1611.01587, 2016.
-  J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the National Academy of Sciences, p. 201611835, 2017.
-  K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, “Online passive-aggressive algorithms,” Journal of Machine Learning Research, vol. 7, no. Mar, pp. 551–585, 2006.
-  O. Bousquet and L. Bottou, “The tradeoffs of large scale learning,” in Advances in neural information processing systems, pp. 161–168, 2008.
-  C. D. of Transportation, “Caltrans Performance Measurement System.” http://pems.dot.ca.gov/, 2016. ”[Online; accessed June-2016]”.
-  H. C. Manual, “Volumes 1-4.(2010),” Transporation Research Board, 2010.
-  A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” in Advances in neural information processing systems, pp. 1177–1184, 2008.