I Introduction
Time series analysis is the study of data that are collected in time order. Commonly, a time series contains a sequence of data taken at a fixed sampling interval. Nowadays, applications of time-series data are proliferating. For example, self-driving cars continuously collect data about the environment evolving around them, and trading algorithms monitor changing markets to make accurate transaction decisions. According to [1], time-series databases (TSDBs) have emerged as the fastest-growing type of database over the last 12 months, as can be seen in Figure 1.
In general, time series can be categorized into two types: stationary and nonstationary. Roughly speaking, a time series is considered stationary if its statistical properties remain the same over time. Formally, given a sequence (x_{t_1}, ..., x_{t_n}) and a time-shifted sequence (x_{t_1+\tau}, ..., x_{t_n+\tau}) in a time series, if the joint statistical distribution of the first sequence is identical to that of the second sequence for all shifts \tau, then the time series is strictly stationary [2]. This means that the moments, e.g., expectations, variance, third-order, and higher, are identical at all times. This definition is extremely strict for real-world applications. Therefore, a weaker definition, namely second-order or weak stationarity, is usually used to analyze time series in practice. A second-order stationary time series is a time series that has constant mean and variance over time. From this point on, a second-order stationary time series is simply referred to as a stationary time series.
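The weak-stationarity notion above can be probed numerically. The following sketch (illustrative only; the chunk width and the amount of drift are arbitrary choices, not from the paper) contrasts the chunk means of a stationary series with those of a series whose mean drifts:

```python
import numpy as np

def chunk_stats(x, width):
    """Mean and variance over consecutive non-overlapping chunks."""
    n = len(x) // width
    chunks = np.asarray(x[:n * width]).reshape(n, width)
    return chunks.mean(axis=1), chunks.var(axis=1)

rng = np.random.default_rng(0)
stationary = rng.normal(0.0, 1.0, 4000)                          # constant mean and variance
drifting = rng.normal(0.0, 1.0, 4000) + np.linspace(0, 5, 4000)  # mean grows over time

m_stat, _ = chunk_stats(stationary, 500)
m_drift, _ = chunk_stats(drifting, 500)

spread_stat = float(m_stat.max() - m_stat.min())    # small: chunk means agree
spread_drift = float(m_drift.max() - m_drift.min()) # large: the mean is not constant
```

For the stationary series the chunk means agree up to sampling noise, while the drifting series violates the constant-mean requirement of weak stationarity.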
The stationarity assumption is especially appealing in time series analysis due to the widely available models, prediction methods, and well-established theory. However, applying this assumption to real-world data, which are mostly nonstationary, might lead to inappropriate forecasts. One solution for handling nonstationarity is to consider a nonstationary time series as a collection of piecewise, or locally, stationary time series. This means the parameters of the time series change but remain constant for a specific period of time. In relation to prediction or forecasting problems, under the piecewise stationarity assumption we can employ stationary methods for the prediction and update the predictor when the time series moves to a different stationary state. Therefore, it is imperative to continuously check whether the time series is stationary or not.
A vast array of tools for detecting nonstationarity has been introduced by researchers. Most of the detection mechanisms are based on spectral densities [3, 4], comparisons of covariance structures [5, 6], and, more recently, locally stationary wavelet models [7, 8]. These tests are developed based on a specific choice of segments of the data, which is sometimes delicate and highly subjective. In [9], the test is developed based on the discrete Fourier transform using the entire length of the data, which is undesirable in online settings.
In this work, we are interested in developing a nonstationarity detection method that can be used in real time and combined with powerful predictors such as state-of-the-art machine learning techniques. In the machine learning community, researchers are more interested in concept drift detection since most of them deal with classification problems [10, 11, 12]. However, in regression problems such as time-series prediction, it is more suitable to consider nonstationarity. Nonstationarity is a more general concept in the sense that a time series without concept drift might still contain nonstationarity, e.g., a near-unit-root autoregressive process. Although the concept is not drifting, i.e., the parameters of the model are static, the evolution of the time series might exhibit changing mean and variance. Here, we treat nonstationarity and concept drift interchangeably.
Generally, there are two ways to detect concept drift: passive and active methods [13]. In the passive method, predictor adaptation is performed regularly, independent of the occurrence of nonstationarity [14, 15, 16]. These methods are quite effective for prediction under gradual changes in the data. However, their main issues are resource consumption and the potential for overfitting.
On the other hand, active detection methods monitor the data to detect changes and perform adaptation only when it is needed. This can be done either by monitoring the error of the predictor [17] or by monitoring features of the data [18, 19, 20, 21]. The main issue with the first approach is that the error might not reflect the nonstationarity of the data; it also relies heavily on the prediction accuracy of the predictor, which can be misleading if a poor training process was used to build it. In this work, we are interested in developing an active detection mechanism based on the features of the data.
We propose an online nonstationarity detection method based on monitoring the evolution of the spectral contents of a time series. Our main hypothesis is that frequency-domain features contain more information than time-domain ones. Furthermore, we specifically develop the method to work in combination with state-of-the-art machine learning methods such as deep neural networks (DNNs). By combining the power of frequency-domain features with the known generalization capability and scalability of DNNs on real-world data, we aim to achieve high prediction performance.
However, it is known that connectionist models are subject to a serious problem known as Catastrophic Forgetting, i.e., forgetting previously learned data when learning new data [22]. Researchers have tried to combat this problem using ensemble learning methods [23, 24, 25] and by focusing on the regularization aspects of the models [26, 27]. These methods are mainly tested on classification problems. In regression problems, more specifically real-world time series problems such as the IBM stock closing price prediction problem, it is highly possible that patterns from the past might not appear again in the future, and future data are mainly affected by the recent past. Therefore, we propose an approach that includes a portion of past data whose size varies with the degree of nonstationarity. Our contributions are summarized as follows:

We develop an algorithm to detect nonstationarity based on the evolution of the spectral contents of the time series.

We develop an online learning framework that combines the detection method with online machine learning methods efficiently.

We propose an algorithm to proportionally include some data in the past to handle Catastrophic Forgetting.

We present comprehensive experiments to validate our hypothesis and show the effectiveness of our proposed approach. We performed rigorous experiments to test different distance functions for monitoring the evolution of the spectral contents. We compare the performance of frequency-domain feature extraction with that of time-domain feature extraction. Finally, we show the superiority of the DNN over several other machine learning methods.
The rest of the paper is organized as follows. Section II describes the main approach developed in this work, namely SAFE. Section III explains the mechanism for embedding predictors with SAFE. Section IV elaborates the datasets, experimental settings, and performance metrics used to validate our hypothesis and show the effectiveness of the proposed framework. Section V presents the experimental results and discussion. Finally, Section VI concludes the paper and provides directions for further research.
II The SAFE Approach
In this section, the proposed SAFE approach is presented. SAFE is a technique for explicitly detecting nonstationarity in time series. It monitors the evolution of the local spectral energy of a time series and computes the discrepancy between the energy at the present and previous time instants. The discrepancy is then used to test whether nonstationarity is present or not.
SAFE consists of two main modules: a feature extraction module and a nonstationarity detection module. In the first module, the Short-Time Fourier Transform (STFT) is applied to extract the frequency contents of the time series at each time instant. The results of the STFT are complex-valued frequency coefficients; therefore, the spectral energy of each frequency is computed to simplify the calculations.
The second module uses the Simple Moving Average (SMA) and Exponentially Weighted Moving Average (EWMA) [11] methods to estimate the long-term and immediate responses of the evolution of the spectral energy through a distance function. In other words, the change of the spectral energy at every time instant is considered rather than the spectral energy itself.
In an online learning setting, an incoming observation is concatenated with its past values to form a window on which the STFT is performed. The result of the transformation is then compared with that of the previous window using a distance function. This way, changes can be detected as soon as a new observation arrives.
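The streaming mechanics described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a plain magnitude-squared DFT stands in for the windowed STFT, the Euclidean distance is used, and the window width and variance jump are arbitrary choices.

```python
import numpy as np

def spectral_energy(window):
    """Magnitude-squared DFT of one window (stand-in for the STFT step)."""
    return np.abs(np.fft.rfft(window)) ** 2

def stream_distances(series, width):
    """As each observation arrives, compare the spectrum of the newest
    window with that of the previous window."""
    prev = None
    for t in range(width, len(series) + 1):
        energy = spectral_energy(series[t - width:t])
        if prev is not None:
            yield t - 1, float(np.linalg.norm(energy - prev))  # Euclidean distance
        prev = energy

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(0, 4, 300)])  # variance jump at t = 300
dists = dict(stream_distances(x, 32))
```

Because each distance only needs the newest window and the previous spectrum, the change shows up in the distance stream as soon as the affected samples enter the window.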
II-A Frequency-Domain Feature Extraction
In the literature, several timedomain statistical features are used to characterize a time series [18, 19]. In this work, a frequency domain approach is presented. There are two main hypotheses in this work:

In stationary conditions, the extracted features are expected to be stationary, or at least to fluctuate around a stationary value. Therefore, whenever this is not the case, it can be deduced that nonstationarity is present.

More information can be gained in the frequency domain than in the time domain. Therefore, it is expected that nonstationarity detection will be more accurate, in terms of true positive rate and detection delay.
In the previous section, it was assumed that a nonstationary time series can be split into chunks of stationary parts. A sliding window of sufficient width can therefore be applied to obtain the local stationarity status of a signal, which makes the STFT intuitively suitable for extracting the frequency contents of the signal. The discrete STFT can be expressed as
X(n, k) = \sum_{m=0}^{N-1} x(n+m) \, w(m) \, e^{-j 2 \pi k m / N},   (1)
where x is the time series of interest, n and k represent the indices of time and frequency, respectively, while N is the length of the window function w. In this work, a Hamming window function is used. The choice of the sliding-window width determines the time-frequency resolution: a wide window provides better frequency resolution but blurred time resolution, while a narrow window works the inverse way. Once the complex values of each frequency of interest are computed, their spectral energies are computed. Figure 2 illustrates the process of the STFT for every sliding window. Take the window ending at time n as an example: when a new observation arrives, the STFT is computed using this value and several values in its past. It is expected that consecutive transforms will be similar if the series is stationary and diverge if it is not the case.
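A minimal sketch of the Hamming-windowed spectral energy computation, using `numpy.fft.rfft`; the tone frequency and window width below are illustrative choices, not values from the paper:

```python
import numpy as np

def stft_energy(x, end, width):
    """Spectral energy of the Hamming-tapered window ending at sample `end`."""
    segment = np.asarray(x[end - width + 1:end + 1]) * np.hamming(width)
    return np.abs(np.fft.rfft(segment)) ** 2

t = np.arange(1024)
tone = np.sin(2 * np.pi * 0.1 * t)       # a single tone at normalized frequency 0.1
energy = stft_energy(tone, 1023, 128)
peak_bin = int(np.argmax(energy))        # bin k corresponds to frequency k / 128
```

With a window of 128 samples, the tone at normalized frequency 0.1 concentrates its energy near bin 0.1 × 128 ≈ 13, illustrating how the window width sets the frequency resolution of each bin.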
To show the effectiveness of the frequency-domain feature extraction in capturing nonstationarity, a small experiment was conducted; the results are shown in Figure 3. In this experiment, four modes of nonstationarity are injected into the time series:

The first point, denoted by (1), illustrates a gradual nonstationarity in terms of variance.

The second point, denoted by (2), shows an abrupt nonstationarity in terms of the mean of the series. After this point, the mean is constant, which introduces a bias in the series.

The third point, denoted by (3), depicts an abrupt nonstationarity in terms of the mean of the series. However, the mean keeps increasing after this point; in this interval, the nonstationarity is continuous.

At the last point, (4), the series goes back to its original stationary form.
Indeed, there are many modes of nonstationarity that are not included in this experiment. However, these modes are sufficient to illustrate the behavior of nonstationarity relevant to time series prediction. The figure also shows that the spectral energies represent the behavior of the process, which in this case is concentrated in the lower frequency bin. The energy behavior after point (2) looks similar to that before point (1) although the means differ. This is not a problem, since what matters is the change at the point of interest, which is reflected when the distance between points is calculated. It should also be noted that the last point, where the system goes back to its original form, is important to consider: in connectionist modeling, the predictor tends to forget the past when a new concept is learned, a problem called Catastrophic Forgetting. By continuously monitoring the changes, the predictor can be retrained on the previous concept when necessary.
II-B Nonstationarity Detection Module
The next step after extracting features is to detect whether nonstationarity occurs using the nonstationarity detection module. There are two submodules in this module: the distance module, which computes the similarity between two consecutive extracted features, and the nonstationarity test module, which decides whether the observed distances translate to nonstationarity or not. Furthermore, to find the distance function that captures nonstationarity best, three distance functions are tested, namely the Euclidean, absolute Cosine, and absolute Pearson distances. The absolute value is needed since we are interested only in the magnitude of similarity, not in the direction.
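The paper does not spell out the distance modules in code; the following is one plausible reading, where "absolute" means taking the absolute value of the similarity before converting it to a distance:

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def abs_cosine(a, b):
    # 1 - |cosine similarity|: only the magnitude of alignment matters.
    c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - abs(float(c))

def abs_pearson(a, b):
    # 1 - |correlation|: cosine distance on mean-centred vectors.
    r = np.corrcoef(a, b)[0, 1]
    return 1.0 - abs(float(r))

u = np.array([1.0, 2.0, 3.0, 4.0])
d_scale = abs_cosine(u, 2 * u)   # scaled copy: distance ~ 0
d_flip = abs_cosine(u, -u)       # a sign flip also gives ~ 0, by design
```

Note the design consequence of the absolute value: a spectrum and its negated or rescaled copy are treated as similar, which is consistent with ignoring the direction of the feature vector.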
The subsequent step is detecting nonstationarity based on the computed distances, which is developed based on the SMA and EWMA. The EWMA method estimates the moving mean of a random sequence by giving more importance to recent data. For a set of values x_1, x_2, \ldots, x_n of a variable with mean \mu, the EWMA estimates the following
z_i = \lambda x_i + (1 - \lambda) z_{i-1},   (2)
where 0 < \lambda \le 1. The parameter \lambda weighs the importance of the recent data compared to the older ones. This suits the proposed feature extraction method, where a new observation appended to the sliding window should, at the same time, provide immediate insight that a nonstationarity has occurred. In addition, this capability is especially important for detecting gradual nonstationarity. Furthermore, the standard deviation of z_i is given as
\sigma_{z_i} = \sigma \sqrt{\frac{\lambda}{2 - \lambda} \left(1 - (1 - \lambda)^{2i}\right)},   (3)
where \sigma is the standard deviation of the observed sequence.
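A minimal implementation of the EWMA recursion and the standard control-chart expression for its standard deviation (we assume the paper's elided Equations 2 and 3 follow these standard forms):

```python
import numpy as np

def ewma(values, lam):
    """EWMA recursion z_i = lam * x_i + (1 - lam) * z_{i-1}."""
    z = np.empty(len(values))
    z[0] = values[0]
    for i in range(1, len(values)):
        z[i] = lam * values[i] + (1.0 - lam) * z[i - 1]
    return z

def ewma_std(sigma, lam, i):
    """Standard deviation of the EWMA statistic after i steps; sigma is the
    standard deviation of the underlying observations."""
    return sigma * np.sqrt((lam / (2.0 - lam)) * (1.0 - (1.0 - lam) ** (2 * i)))
```

As i grows, the standard deviation converges to sigma * sqrt(lam / (2 - lam)), which is the asymptotic width commonly used for EWMA control limits.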
The output of the EWMA represents the immediately estimated mean of the distance, while the SMA represents the long-term evolution of the mean. Based on these outputs, a decision about stationarity has to be made. To do this, an algorithm called SAFE is proposed, illustrated in Algorithm 1. The input of the algorithm is a sequence of data, or time series, and initialization is required for parameters such as the weight of the EWMA, the warning counter, the trigger detection threshold, the warning detection threshold, and the warning duration.
The first step in the algorithm after initialization is to append the newly incoming data to the window of previous data, as depicted in line 2 of the algorithm. Next, the STFT is applied to this window. Subsequently, the distance between the resulting spectrum and the previous one is computed. In the initial stage of the algorithm, previous values are not yet available; it is safe to initialize them with default values since this does not significantly affect the rest of the computation.
The next step is to apply both the EWMA and the SMA to the computed distance. The SMA is necessary since it represents the long-term state of the distance, in particular when a nonstationarity is continuously occurring, as depicted in Figure 3 between points 3 and 4. Furthermore, the standard deviation is calculated using Equation 3. This standard deviation is used together with the control limits as moving thresholds to determine whether the distance is still inside a particular stationary mode or not. This idea is illustrated in Figure 4. Lines 8 and 11 of Algorithm 1 impose the moving thresholds. If at a time instant the EWMA of the distance exceeds the trigger threshold, the nonstationarity flag is raised immediately. However, when it exceeds only the warning threshold, the algorithm waits for a certain duration before raising the flag. This is useful when the distance fluctuates insignificantly, which might be due to outliers and/or other unpredictable factors in the data. The flag can be used by predictors to update their parameters when necessary, which is more efficient than a blind adaptation scheme.
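The warning/trigger decision rule can be sketched as below. This is not Algorithm 1 reproduced verbatim: the parameter names (`lam`, `L_w`, `L_t`, `patience`) and the exact form of the moving limits are illustrative assumptions.

```python
import numpy as np

def safe_detect(distances, lam=0.3, L_w=2.0, L_t=3.0, patience=5, sma_width=20):
    """Flag nonstationarity when the EWMA of the window distances escapes
    moving control limits built around their SMA."""
    flags, warn_count = [], 0
    z = float(distances[0])
    for i, d in enumerate(distances):
        z = lam * d + (1.0 - lam) * z                       # immediate response (EWMA)
        lo = max(0, i - sma_width + 1)
        mu = float(np.mean(distances[lo:i + 1]))            # long-term response (SMA)
        # asymptotic EWMA std from the windowed sample std (control-chart form)
        sigma_z = float(np.std(distances[lo:i + 1])) * np.sqrt(lam / (2.0 - lam))
        if abs(z - mu) > L_t * sigma_z:                     # trigger limit: flag at once
            flags.append(i); warn_count = 0
        elif abs(z - mu) > L_w * sigma_z:                   # warning limit: wait it out
            warn_count += 1
            if warn_count >= patience:
                flags.append(i); warn_count = 0
        else:
            warn_count = 0
    return flags

rng = np.random.default_rng(2)
dists = np.concatenate([rng.normal(1.0, 0.05, 100), rng.normal(8.0, 0.05, 100)])
flags = safe_detect(dists)
```

On this synthetic distance stream, the jump at index 100 pushes the EWMA outside the trigger limit within a few samples, while small fluctuations only accumulate warnings.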
III Embedding SAFE into Online Predictors
An online predictor is adapted only when a nonstationarity is detected. Initially, a predictor is trained offline using a presumably stationary dataset, and the parameters obtained from this training are used as initial conditions. In a simulation, we can select a period of data where the stationarity properties hold, while in a real-world case, an initial predictor is trained using all data available at hand.
Once the nonstationarity flag is raised, the next step is to update the parameters of the chosen predictor. The predictor should support online learning, since some predictors require training from scratch when new data become available. Notable online learning algorithms include online passive-aggressive learning [28], linear support vector machines (SVM) with stochastic gradient descent (SGD) [29], and deep neural networks. These learning algorithms are suitable to combine with SAFE. Furthermore, mini-batch SGD is particularly suitable for SAFE, since we can include previous data points to form a mini-batch and update the predictors accordingly. It is known that updating these predictors leads to catastrophic forgetting. To avoid this problem, we include previous data so that the model learns the new data together with a portion of data from the past. The question is: how many past data points should we include in the online adaptation? To answer this, we introduce the proportional deviation algorithm.
The main idea of this algorithm is to include a number of past data points proportional to the absolute deviation of the EWMA z_i from the SMA. A large deviation means the new data are far from the previous stationary point. Therefore, it is intuitively suitable to include more data when the deviation is large and vice versa. The size of the mini-batch is computed as follows
M = \mathrm{round}\left(K_p \, |z_i - \mathrm{SMA}_i|\right),   (4)
where M is the number of data points included from the past, i.e., the size of the mini-batch, and K_p is the proportional gain that scales the mini-batch. The choice of K_p depends on the application and affects the speed of the adaptation. The round operation rounds the calculation to the nearest integer, which is necessary since the size of the mini-batch has to be an integer. Algorithm 2 illustrates this procedure.
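A minimal sketch of the proportional deviation rule, reading the mini-batch size as the rounded, gain-scaled absolute deviation of the EWMA from the SMA; the clamp to a minimum and maximum size is an added practical guard, not part of the paper:

```python
import numpy as np

def minibatch_size(z, mu, k_p, min_size=1, max_size=512):
    """Mini-batch size proportional to the absolute EWMA-SMA deviation.
    z: EWMA value, mu: SMA value, k_p: proportional gain.
    The [min_size, max_size] clamp is an illustrative addition."""
    m = int(round(k_p * abs(z - mu)))
    return int(np.clip(m, min_size, max_size))
```

For example, with a gain of 10, a deviation of 4 yields a mini-batch of 40 past points, while a negligible deviation falls back to the minimum size.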
IV Experimental Settings and Datasets
In this section, we describe the datasets and experimental settings used to test our hypothesis and to illustrate the effectiveness of the proposed approach. The datasets are of two types: artificial and real-world data.
Artificial data allow us to perform an effective analysis because the exact locations of nonstationarity are known, which is not the case in real-world data. However, unless we are able to capture all possible nonstationary conditions, approaches tested only on artificial data might not work well in practice. Therefore, we use real-world data to test the effectiveness of our approach. Since the exact locations and behavior of the nonstationarity are not known, we test the real-world data in the prediction stage only. The goal of the test is to see whether the proposed approach is able to correct predictions that are harmed by nonstationarity.
We propose several experimental settings to test each element in our approach. The settings are defined as follows:

Experiment to find the most suitable distance function for our approach. We investigate three types of distance functions: Euclidean, Pearson, and Cosine distances. This is done using artificial data.

Experiment to test whether extracting features in the frequency domain leads to better nonstationarity detection results than in the time domain. This is done using artificial data.

Experiment to investigate which predictor performs well in our approach. This is done using linear and nonlinear artificial data.

Experiment to test which domain of feature extraction is better for prediction using realworld data.
At the end of this section, we introduce performance metrics used in each experiment.
IV-A Datasets
Here, we describe the datasets used in the experiments.
IV-A1 Artificial Datasets
We used two sets of artificial data for experimental settings 1, 2, and 3. The first set contains five time-series datasets that illustrate typical behavior of stationary and nonstationary time series. These datasets are inspired, with modifications, by [7] and [8], and are given by the following processes.
(TSA) Stationary AR(1) process, evaluated over a range of parameter values.
(5) 
(TSB) Piecewise stationary AR process with obvious changes. The changes in this process occur at two known locations.
(6) 
(TSC) Piecewise stationary AR process with less obvious changes. The change in this process is less observable and occurs at a single known location.
(7) 
(TSD) Piecewise stationary near-unit-root process with changing variance. This process has changes in variance occurring at two points.
(8) 
(TSE) Piecewise stationary ARMA(1, 1) process. This process has three known change points.
(9) 
The second set of artificial data is inspired by [18]. This set consists of two linear and two nonlinear time series; both are known to have parameter changes at some specified points. The linear datasets share a similar structure, given by an AR(p) process: time series Linear1 and Linear2 are given by AR(4) and AR(5) processes, respectively. The AR(p) process is given as follows:
x_t = \sum_{i=1}^{p} a_i x_{t-i} + \epsilon_t,   (10)
where \epsilon_t is a zero-mean white-noise term.
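As an illustration of the piecewise-stationary construction used throughout this section, the sketch below generates an AR(1) series whose coefficient switches at each segment boundary (the coefficients and segment length are arbitrary choices, not the paper's parameters):

```python
import numpy as np

def piecewise_ar1(phis, seg_len, noise_std=1.0, seed=0):
    """Piecewise-stationary AR(1): the coefficient phi switches at each
    segment boundary, mimicking the TSB/TSC-style processes."""
    rng = np.random.default_rng(seed)
    out, prev = [], 0.0
    for phi in phis:
        for _ in range(seg_len):
            prev = phi * prev + rng.normal(0.0, noise_std)
            out.append(prev)
    return np.asarray(out)

series = piecewise_ar1([0.3, 0.9, -0.5], seg_len=500)
```

Within each segment the process is stationary, but the variance changes across segments (an AR(1) with coefficient phi has stationary variance noise_var / (1 - phi^2)), which is exactly the kind of change a detector should flag.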
The nonlinear time series are constructed using the following processes:
(11)  
(12) 
where Equations 11 and 12 represent the Nonlinear1 and Nonlinear2 time series, respectively. Table I presents the parameters used in all the linear and nonlinear time series. The Point column gives the intervals over which each parameter value is applied.
Time Series  Point  
Linear1  1–3000  0.5  
  3001–6000  1.5  
  6001–9000  2.5  
  9001–12000  3.5  
Linear2  1–3000  0.5  
  3001–6000  1.5  
  6001–9000  2.5  
  9001–12000  3.5  
Nonlinear1  1–3000  0.5  
  3001–6000  1.5  
  6001–9000  2.5  
  9001–12000  3.5  
Nonlinear2  1–3000  0.5  
  3001–6000  1.5  
  6001–9000  2.5  
  9001–12000  3.5
IV-A2 Real-World Datasets
We use two real-world datasets to test our proposed approach. The first one is the IBM stock closing price, and the second one is the traffic flow of freeways in California.
The IBM dataset was gathered from the Yahoo Finance historical data. We collected the daily stock closing price from January 8th, 1962 to September 5th, 2017, which amounts to around 14,000 data points.
The traffic flow dataset was obtained from the Caltrans Performance Measurement System (PeMS) [30]. The traffic flow was sampled every 30 seconds using inductive loops deployed throughout the freeways. These data were aggregated into 5-minute intervals by PeMS. Furthermore, the traffic data are aggregated further into 15-minute intervals based on the recommendation of the Highway Capacity Manual 2010 [31]. We collected the traffic flow data of a freeway from January 1st, 2011 to December 31st, 2012.
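The 5-minute to 15-minute aggregation is a simple sum over groups of three consecutive counts; a minimal sketch (the example counts are made up for illustration):

```python
import numpy as np

def aggregate_flow(flow_5min, factor=3):
    """Sum consecutive 5-minute counts into 15-minute totals (factor = 15 / 5)."""
    n = (len(flow_5min) // factor) * factor          # drop a trailing partial group
    return np.asarray(flow_5min[:n]).reshape(-1, factor).sum(axis=1)

counts = np.array([10, 12, 11, 20, 22, 21])  # six hypothetical 5-minute vehicle counts
agg = aggregate_flow(counts)                  # two 15-minute totals
```

Summing (rather than averaging) is the natural choice here because flow is a count of vehicles per interval, so totals over 15 minutes are directly comparable.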
IV-B Experimental Settings
We apply the datasets to four different experimental settings. In the first setting, we use TSB and TSC to find which distance function gives the best performance on clearly and less clearly observable changes. In this part, we also illustrate the evolution of each distance function in relation to the time series. Each dataset is tested over 100 trials to account for the randomness introduced by the noise. To gain insight into the performance, we measure the total number of detections, false alarm rate, hit rate, missed detections, specificity, detection delay, and execution time.
In the second setting, we validate our hypothesis that extracting features in the frequency domain yields more information than in the time domain. We use TSA to measure the false alarm rate, and TSB to TSE to measure the total number of detections, false alarm rate, hit rate, missed detections, specificity, detection delay, and execution time. Each dataset is tested over 100 trials.
For the third setting, we use the linear (Linear1, Linear2) and nonlinear (Nonlinear1, Nonlinear2) time series with changes to test the prediction performance when SAFE is embedded into three different predictors, namely Passive-Aggressive regressors, kernel linear SVM, and deep neural networks. In the kernel linear SVM, the original features are mapped into a randomized low-dimensional feature space, which is then trained using a linear SVM [32]. In this experiment, we measure the mean-squared error (MSE), execution time, and percentage of updates required. We run 20 simulations to account for the randomness.
Finally, we use the real-world datasets, namely IBM and Traffic Flow, to measure the performance of deep neural networks combined with either time-domain feature extraction or SAFE. In this experiment, we observe the MSE, execution time, and percentage of updates required. In addition, we illustrate the performance of all experimental settings with graphs showing their responses over time.
IV-C Evaluation Metrics
In this section, we describe the performance metrics used in the experiments.
Total number of detections. We report the number of change points detected in a time series. A detected change point is considered true if it falls within a small tolerance, proportional to the sample size, of an actual change point. As an example, number of detections = 1 means that only one change point is detected. This number is then accumulated over 100 trials. We denote false positives as FP, true positives as TP, true negatives as TN, and false negatives as FN.
False Alarm Rate. It measures the proportion of change points detected when no actual changes occur. Another name for this metric is the False Positive Rate, which is given as FP/(FP + TN).
Hit Rate. It measures the proportion of correctly detected changes over the actual number of change points, which is given as TP/(TP + FN).
Missed Detection Rate. It is the number of undetected changes over the actual number of change points, which is given as FN/(TP + FN).
Specificity. It reflects the number of stationary points classified as stationary over the actual number of stationary points. This is calculated as TN/(TN + FP).
Detection Delay. It measures the distance, in the number of steps, of the detected changes from the actual points.
Execution Time. It measures the average time, in seconds, required to perform an experiment over a specific number of trials.
Mean-Squared Error. This metric measures the similarity between predictions and the actual time series. We use two types of MSE: the trajectory of the MSE, defined as the evolution of the MSE at every time instant, and the overall MSE, which measures the total MSE over the whole test dataset.
Percentage of Update. The number of updates performed over the number of possible updates. As an example, suppose we have a time series with 100 data points and run the online update procedure over all of them. If our algorithm updates the predictors 15 times, the percentage of update is 15%. Under a blind adaptation scheme, the percentage of update would be 100%.
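The rate metrics above follow directly from the confusion counts; a small sketch (the counts below are made-up illustrative numbers, not experimental results):

```python
def detection_metrics(tp, fp, tn, fn):
    """Rates defined above, computed from confusion counts."""
    return {
        "false_alarm": fp / (fp + tn),   # false positive rate
        "hit_rate": tp / (tp + fn),      # detected changes / actual changes
        "missed": fn / (tp + fn),        # complement of the hit rate
        "specificity": tn / (tn + fp),   # complement of the false alarm rate
    }

m = detection_metrics(tp=82, fp=5, tn=95, fn=18)
```

Note that hit rate and missed detection rate sum to one, as do specificity and false alarm rate, so reporting all four is partly redundant but convenient for reading the tables.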
V Results and Discussion
In this section, we discuss the experimental results to evaluate the performances of our proposed approach. The results are grouped according to the experimental settings mentioned in the previous section.
V-A Distance Functions
There are various ways to compute a distance between two vectors; the popular ones are the Euclidean, Pearson, and Cosine distances. To decide which distance function gives better performance on nonstationarity detection, we conducted experiments using the TSB and TSC datasets. These datasets illustrate the obvious and non-obvious changes that may occur in a time series.
First, it is important to show that these three distance functions can capture changes occurring in a time series. Therefore, we present a simulation result showing the responses of the distances to the time series. In this simulation, we applied the SAFE approach to the TSC dataset with a sliding window of 5 samples. We chose the TSC dataset because it contains non-obvious changes; intuitively, when more obvious changes occur, the response of the distance will be more distinguishable. Furthermore, we calculated the distances using the three distance functions mentioned in the previous paragraph. The result of this experiment is shown in Figure 5, which illustrates that when a non-obvious change occurs at the breakpoint, the three distances respond to the change immediately. It should be noted that the responses of the three distances differ, especially for the Euclidean distance. Although the Pearson and Cosine distances yield similar responses, they are still distinguishable; the response of the Pearson distance is slightly smoother than that of the Cosine distance.
Next, we investigate the detection performance of SAFE when it is combined with the three distance functions. The size of the sliding window is kept at 5 samples. Other parameters that have to be set are the EWMA weight and the warning and trigger thresholds, following the guidance in [11]. Since the detection performance highly depends on the thresholds, we set them so that the three distances have equal performance in one of the evaluation metrics. Specifically, we set the thresholds such that the three distances have a similar false alarm rate on TSB. From here, we can judge the other performance metrics given the false alarm rate. To achieve the targeted false alarm rate, we performed several trials over different values of the EWMA weight and found a value that works well for all three distances. The warning thresholds for the Euclidean, Pearson, and Cosine distances are 2.85, 0.75, and 1.4, respectively, and the trigger thresholds are 3.35, 1.25, and 1.9, respectively. The size of the sliding window for the SMA is 20; this value is able to capture the long-term evolution of the distances. These parameters are expected to work for different types of nonstationarity and are therefore fixed for all experiments using TSA to TSE.
Table II shows the detection performance of the three distances over 100 simulations. We can see that on the TSB and TSC datasets, the Euclidean distance provides superior results in terms of the number of detections, hit rate, missed detection, and execution time. The highest average detection delay on TSB is produced by the Euclidean distance, but this is not the case on TSC. Overall, the Euclidean distance produces better performance than the other distances do. Therefore, we chose the Euclidean distance as the distance function for the rest of the experiments.
Perf.dataset  TSB  TSC  

euclidean  pearson  cosine  euclidean  pearson  cosine  
# detection:  
0  0  3  0  0  18  4 
1  4  30  5  100  82  96 
2  96  67  95       
False Alarm  0.050  0.050  0.063  
Hit rate  0.82  0.97  1.00  0.82  0.96  
Missed detection  0.18  0.03  0.00  0.12  0.04  
Specificity  0.95  0.95  0.95  0.95  0.96  0.95 
Det. delay (samples)  
Exec. time (s) 
V-B SAFE vs. Time-Domain Feature Extraction
In the second setting, we compare the performance of SAFE with that of time-domain feature extraction (FE). The time-domain FE is inspired by [18]. We extracted four linear (autocorrelation, variance, skewness coefficient, kurtosis coefficient) and one nonlinear (bicorrelation) time-domain features. The detection method applied after computing the distance is the same as the one used by SAFE in Algorithm 1. The parameters of the SAFE approach are kept the same as in the previous experimental setting, and the parameters of the time-domain FE are set to achieve a false alarm rate of around 0.05 on TSB. To achieve this, the forgetting factor is set to 0.3 while the warning and trigger thresholds are set to 3 and 3.5, respectively. The sliding-window size of the SMA is set to 20. Both SAFE and the time-domain FE are evaluated on TS(A–E).
The first experiment in this setting is performed on TSA. Since TSA is a stationary time series, low false alarm rates are expected for both SAFE and time-domain FE. Table III shows that acceptable false alarm rates are obtained in both cases. Across the different parameter values, SAFE consistently produces lower false alarm rates than time-domain FE does. Ideally, the false alarm rate on a stationary time series should be 0. This can be achieved by setting both thresholds to higher values; however, higher thresholds may lead to poor detection performance. It is up to the designer to decide which performance metrics matter most in their application.
Table III: False alarm rates on the stationary series TSA for different parameter values.

          |        False alarms
          |   SAFE       time-domain FE
  -0.7    |   0.001          0.008
  -0.4    |   0.004          0.015
  -0.1    |   0.011          0.023
   0.1    |   0.019          0.030
   0.4    |   0.035          0.041
   0.7    |   0.052          0.053
  Average | 0.020±0.018    0.028±0.015
The second set of experiments tests the nonstationarity detection performance on TSB to TSE. The results are summarized in Table IV. The table shows that SAFE performs better overall in terms of the number of detections, false alarm rate, hit rate, missed detections, specificity, detection delay, and execution time. Although time-domain FE achieves nearly similar performance to SAFE in some respects, it has a significantly higher average execution time: time-domain FE runs the experiments more than three times slower than SAFE does.
Table IV: Detection performances of SAFE and time-domain FE on TSB to TSE.

Perf. \ dataset      |          TSB           |          TSC
                     | SAFE   time-domain FE  | SAFE   time-domain FE
# detections: 0      |   0         0          |   0         0
              1      |   4        14          | 100       100
              2      |  96        86          |  --        --
              3      |  --        --          |  --        --
False alarm          |  --        --          |  --        --
Hit rate             | 1.00      1.00         |  --        --
Missed detection     | 0.00      0.00         |  --        --
Specificity          | 0.95      0.95         | 0.95      0.95
Det. delay (samples) |  --        --          |  --        --
Exec. time (s)       |  --        --          |  --        --

Perf. \ dataset      |          TSD           |          TSE
                     | SAFE   time-domain FE  | SAFE   time-domain FE
# detections: 0      |   0         0          |   0         0
              1      |   8        15          |   3         1
              2      |  92        85          |  23        30
              3      |  --        --          |  74        69
False alarm          | 0.039     0.041        | 0.042     0.047
Hit rate             | 0.96      0.94         | 0.94      0.88
Missed detection     | 0.04      0.06         | 0.06      0.12
Specificity          | 0.96      0.96         | 0.96      0.95
Det. delay (samples) |  --        --          |  --        --
Exec. time (s)       |  --        --          |  --        --
To illustrate the performance of SAFE in more detail, we present several graphs showing the predicted change points versus the ground truth. Figure 6 shows the detections on TSB. From the previous section, we know that TSB has two change points. However, the figure shows that the proposed algorithm detects three change points, where the third detected point is considered a false positive. This is acceptable since, by inspection of the first row of the graph, the time series actually looks nonstationary in terms of the variance. The second row of the graphs shows the evolution of the monitored distance together with the warning and trigger thresholds. In Figure 7, we can see similar behavior, with a few false alarms present, in the experiment on TSC.
Furthermore, different behavior is observable in the experiment using TSD. Figure 8 shows that the false alarm rate is somewhat higher than in the experiments on the other time series. If we look at the behavior of the time series closely, it certainly depicts nonstationary behavior. This is due to the fact that TSD is a piecewise stationary near-unit-root process with changing variance; a near-unit-root process is one whose dynamics are close to unstable behavior. Therefore, it is expected that the SAFE approach considers some parts of the time series nonstationary, especially in the last part of the series. It is reasonable to treat this process as containing continuous nonstationarity, rather than just a change at the breakpoint. This way, the chosen online predictor can be continuously updated to adapt to the nonstationary process.
V-C Choice of Predictors
The third experimental setting tests which predictor gives the best prediction performance. We use the Linear1, Linear2, Nonlinear1, and Nonlinear2 datasets and test them with three different predictors, namely the Passive-Aggressive Regressor (PAR), Kernel Linear SVR (KSVR), and a Deep Neural Network (DNN). Before running the experiments, the parameters in Algorithms 1 and 2 have to be set; we set them to 0.3, 10, 20, and 0.1, respectively.
Furthermore, the parameters of the predictors also need initialization. For PAR, we set the aggressiveness parameter to 0.05. For KSVR, we used the epsilon-insensitive loss function together with a regularization term. Finally, for the DNN, we set the number of hidden layers to 2, where each hidden layer contains 200 neurons, and use the Rectified Linear Unit (ReLU) as the activation function. To avoid overfitting, we implemented a dropout regularizer with a rate of 0.1. All of the parameters were tuned using a similar offline training scheme: we trained the predictors on the first 2000 samples and validated on the next 1000 samples. We did not use K-fold cross-validation since, in time series prediction, it is not appropriate to train on data that come after the validation data. In essence, we trained the predictors until the validation error stopped decreasing. The trained predictors are then used as initial models that are updated whenever they are triggered by the SAFE algorithm.
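A minimal sketch of this predictor setup using scikit-learn is shown below. Only the hyperparameters stated in the text (C = 0.05 for PAR, epsilon-insensitive loss for KSVR, two 200-neuron ReLU layers for the DNN, and the 2000/1000 chronological split) come from the paper; scikit-learn's MLPRegressor has no dropout, so an L2 penalty stands in for it here, and the remaining settings are assumptions.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

def make_predictors():
    """Construct the three candidate predictors with the stated hyperparameters."""
    par = PassiveAggressiveRegressor(C=0.05)        # aggressiveness parameter 0.05
    ksvr = SVR(kernel='rbf', epsilon=0.1)           # epsilon-insensitive loss (epsilon assumed)
    dnn = MLPRegressor(hidden_layer_sizes=(200, 200),
                       activation='relu',
                       alpha=1e-4,                  # L2 penalty in place of dropout
                       max_iter=500)
    return par, ksvr, dnn

def chronological_split(X, y, n_train=2000, n_val=1000):
    """Time-ordered split: validation data come strictly after training data,
    so no K-fold cross-validation is used."""
    return ((X[:n_train], y[:n_train]),
            (X[n_train:n_train + n_val], y[n_train:n_train + n_val]),
            (X[n_train + n_val:], y[n_train + n_val:]))
```

After offline training, each fitted model serves as the initial predictor that the SAFE algorithm updates online when triggered.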
Finally, the rest of the data are used to test the predictors. The results are summarized in Table V(a). They show that the smallest prediction errors on all datasets are achieved using the DNN as the predictor. However, the average execution time of this predictor is also the highest of the three. In some applications, a long execution time might be unacceptable; accordingly, if execution time matters more than prediction error, the DNN might not be the best choice of predictor. However, in an application such as traffic flow prediction, where the sampling time is 15 minutes, the DNN is a suitable predictor since it produces significantly lower prediction errors than the other predictors do.
Figure 9 depicts the prediction performance on the Linear1 dataset. The first graph shows that all of the predictors are able to follow the ground truth closely. However, if we look at the errors in the second graph, the DNN error is noticeably lower than those of the other predictors, even though the offline error of the DNN is higher than that of PAR. The next step is to compare the performance of the DNN with the baseline predictor, i.e., the predictor that is never updated. It can be seen in Figure 10 that while the error of the DNN stays constant, the error of the baseline predictor drifts to larger values. We can conclude that updating the predictor with the SAFE approach is reliable for nonstationary time series prediction.
Lastly, Figures 11 and 12 show similar performance on the nonlinear datasets. Although the predictor errors start at similar values, the DNN performs better in the long run. In addition, all of these prediction performances are achieved by updating the predictors only when necessary: the percentage of updates is below 35% for all datasets.
V-D Real-world Data Experiments


The objective of the last set of experiments is to test the prediction performance of the DNN on two real-world datasets under SAFE and time-domain FE. The first is the IBM stock dataset. We use the data from January 8, 1962 to January 7, 1977 for training; January 8, 1977 to January 7, 1982 for validation; and January 8, 1982 to September 5, 2017 for testing. The second is a traffic flow dataset from the San Francisco Bay Area, District 4, California. We use the data from January 1, 2011 to August 31, 2011 for training; September 1, 2011 to December 31, 2011 for validation; and January 1, 2012 to December 31, 2012 for testing.
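Such a chronological split can be expressed with pandas as below; the column name and the frame construction are illustrative assumptions.

```python
import pandas as pd

def split_by_date(df, train_end, val_end):
    """Chronological train/validation/test split of a date-indexed frame.
    `train_end` and `val_end` are inclusive boundary dates for the first
    two segments; the boundary rows are not duplicated across segments."""
    train = df.loc[:train_end]
    val = df.loc[train_end:val_end].iloc[1:]   # drop the duplicated boundary row
    test = df.loc[val_end:].iloc[1:]
    return train, val, test
```

Keeping the three segments in strict time order mirrors the no-K-fold rule used for the synthetic datasets: the model never trains on data that come after its validation data.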
The offline predictors for both SAFE and time-domain FE are identical for each dataset. For the IBM dataset, the DNN consists of 3 hidden layers, each with 100 neurons. To avoid overfitting, we implemented dropout regularizers with a rate of 0.1. The training scheme of the DNN is similar to that explained in the previous section. Furthermore, before the online predictor is applied, the parameters of the SAFE approach and the time-domain FE have to be set. The warning and trigger thresholds for the SAFE approach are 0.025 and 0.5, respectively, while those for the time-domain FE are 0.27 and 0.55, respectively. The proportional deviation gain for both approaches is set to 0.1.
The results of the proposed approach on the IBM dataset are summarized in Table V(b). In general, a higher update percentage leads to a significantly better MSE, but the execution time also increases significantly. The table also shows that SAFE produces errors around three times lower than time-domain FE does and executes the experiments faster. Nevertheless, both approaches provide acceptable performance compared to the baseline predictor.
Figure 13 compares the baseline predictor, SAFE-DNN, and time-domain FE-DNN in terms of time series prediction and MSE. We present a case where the update percentage is around 2.7%. The figure shows that if the predictor does not adapt to the nonstationarity, the time series prediction drifts away from the actual values; this does not happen with either adaptive approach. In terms of the errors, our proposed approach maintains a low error throughout the experiments. The competing approach, however, drifts slightly at the end, when the prediction becomes more difficult due to the highly nonstationary data.
Next, we test our approach on the traffic flow dataset. The baseline DNN is configured as follows: the number of hidden layers is 3; each hidden layer contains 125 neurons; and the dropout rate is 0.5. The activation function of both the hidden layers and the output layer is ReLU, since traffic flow values cannot be negative. We select 5 freeways for our experiments. The results, shown in Table V(b), exhibit behavior similar to that on the IBM dataset. In general, the combination of SAFE and DNN produces superior results in terms of errors and execution time. The evolution of the errors is shown in Figure 14. The errors start at the same level; however, the baseline error drifts away while the online predictors' errors keep decreasing, with the SAFE-DNN error always the lowest.
Lastly, Figure 15 shows the prediction of the traffic flow time series using the baseline predictor, the SAFE approach, and the time-domain FE approach. We select a portion of the prediction to better illustrate the results. The baseline predictor does not produce acceptable traffic flow predictions, while SAFE-DNN and time-domain FE-DNN do. Although the predictions of the two adaptive approaches look similar, they differ, especially in the valley parts of the traffic flow. It should be noted that this prediction quality is obtained by updating the predictor only about 5% of the time throughout the experiments, saving around 95% of the processor or GPU cycles.
VI Conclusion
This paper presents an approach to actively detect nonstationarity for time series prediction. The approach monitors the evolution of the spectral contents of time series using a distance function. We have successfully conducted comprehensive experiments to validate our hypothesis and test the effectiveness of our proposed approach on artificial and realworld datasets.
The experiments show that the approach is able to achieve high long-term prediction performance while significantly saving computational resources in terms of processor and GPU cycles. Although the DNN requires more computation time than the other predictors, it is clearly worth considering as an online predictor since its overall prediction errors are notably lower. The implementation of the proportional algorithm to variably include some past data makes the online adaptation of the predictors more flexible, i.e., there is no need to fix the batch size of the online training procedure.
To take this research further, the approach can be extended to multi-step time series prediction. Furthermore, since Long Short-Term Memory (LSTM) recurrent neural networks are powerful in handling sequential data, this type of neural network is worth investigating. Moreover, the proposed method can be applied to large-scale time series, where distributed neural networks, i.e., DNNs with multi-task learning, are appropriate.
References
 [1] DB-Engines, "DBMS popularity broken down by database model," 2017 (accessed October 8, 2017).
 [2] G. P. Nason, “Stationary and nonstationary time series,” Statistics in Volcanology. Special Publications of IAVCEI, vol. 1, pp. 000–000, 2006.
 [3] M. Priestley and T. S. Rao, “A test for nonstationarity of timeseries,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 140–149, 1969.
 [4] S. Adak, “Timedependent spectral analysis of nonstationary time series,” Journal of the American Statistical Association, vol. 93, no. 444, pp. 1488–1501, 1998.
 [5] E. Andreou and E. Ghysels, “Structural breaks in financial time series,” Handbook of financial time series, pp. 839–870, 2009.
 [6] I. Berkes, E. Gombay, and L. Horváth, “Testing for changes in the covariance structure of linear processes,” Journal of Statistical Planning and Inference, vol. 139, no. 6, pp. 2044–2063, 2009.
 [7] H. Cho and P. Fryzlewicz, “Multiscale and multilevel technique for consistent segmentation of nonstationary time series,” Statistica Sinica, pp. 207–229, 2012.
 [8] K. K. Korkas and P. Fryzlewicz, “Multiple changepoint detection for nonstationary time series using wild binary segmentation,” Statistica Sinica, vol. 27, no. 1, pp. 287–311, 2017.
 [9] Y. Dwivedi and S. Subba Rao, “A test for secondorder stationarity of a time series based on the discrete fourier transform,” Journal of Time Series Analysis, vol. 32, no. 1, pp. 68–91, 2011.
 [10] F. FdezRiverola, E. L. Iglesias, F. Díaz, J. R. Méndez, and J. M. Corchado, “Applying lazy learning algorithms to tackle concept drift in spam filtering,” Expert Systems with Applications, vol. 33, no. 1, pp. 36–48, 2007.
 [11] G. J. Ross, N. M. Adams, D. K. Tasoulis, and D. J. Hand, “Exponentially weighted moving average charts for detecting concept drift,” Pattern Recognition Letters, vol. 33, no. 2, pp. 191–198, 2012.
 [12] P. M. Gonçalves Jr and R. S. M. De Barros, “Rcd: A recurring concept drift framework,” Pattern Recognition Letters, vol. 34, no. 9, pp. 1018–1025, 2013.
 [13] G. Ditzler, M. Roveri, C. Alippi, and R. Polikar, “Learning in nonstationary environments: A survey,” IEEE Computational Intelligence Magazine, vol. 10, no. 4, pp. 12–25, 2015.
 [14] J. Z. Kolter and M. A. Maloof, “Dynamic weighted majority: An ensemble method for drifting concepts,” Journal of Machine Learning Research, vol. 8, no. Dec, pp. 2755–2790, 2007.
 [15] J. A. Guajardo, R. Weber, and J. Miranda, “A model updating strategy for predicting time series with seasonal patterns,” Applied Soft Computing, vol. 10, no. 1, pp. 276–283, 2010.
 [16] R. Elwell and R. Polikar, “Incremental learning of concept drift in nonstationary environments,” IEEE Transactions on Neural Networks, vol. 22, no. 10, pp. 1517–1531, 2011.
 [17] L. MoreiraMatias, J. Gama, and J. MendesMoreira, “Concept neurons–handling drift issues for realtime industrial data mining,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 96–111, Springer, 2016.
 [18] R. C. Cavalcante, L. L. Minku, and A. L. Oliveira, “Fedd: Feature extraction for explicit concept drift detection in time series,” in Neural Networks (IJCNN), 2016 International Joint Conference on, pp. 740–747, IEEE, 2016.

 [19] C. Alippi, G. Boracchi, and M. Roveri, "A just-in-time adaptive classification system based on the intersection of confidence intervals rule," Neural Networks, vol. 24, no. 8, pp. 791–800, 2011.
 [20] S. Liu, M. Yamada, N. Collier, and M. Sugiyama, "Change-point detection in time-series data by relative density-ratio estimation," Neural Networks, vol. 43, pp. 72–83, 2013.
 [21] C. Alippi, G. Boracchi, and M. Roveri, “Justintime classifiers for recurrent concepts,” IEEE transactions on neural networks and learning systems, vol. 24, no. 4, pp. 620–634, 2013.
 [22] R. M. French, “Catastrophic forgetting in connectionist networks,” Trends in cognitive sciences, vol. 3, no. 4, pp. 128–135, 1999.
 [23] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,” arXiv preprint arXiv:1606.04671, 2016.
 [24] R. Polikar, L. Upda, S. S. Upda, and V. Honavar, “Learn++: An incremental learning algorithm for supervised neural networks,” IEEE transactions on systems, man, and cybernetics, part C (applications and reviews), vol. 31, no. 4, pp. 497–508, 2001.
 [25] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra, “Pathnet: Evolution channels gradient descent in super neural networks,” arXiv preprint arXiv:1701.08734, 2017.
 [26] K. Hashimoto, C. Xiong, Y. Tsuruoka, and R. Socher, “A joint manytask model: Growing a neural network for multiple nlp tasks,” arXiv preprint arXiv:1611.01587, 2016.
 [27] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. GrabskaBarwinska, et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the National Academy of Sciences, p. 201611835, 2017.
 [28] K. Crammer, O. Dekel, J. Keshet, S. ShalevShwartz, and Y. Singer, “Online passiveaggressive algorithms,” Journal of Machine Learning Research, vol. 7, no. Mar, pp. 551–585, 2006.
 [29] O. Bousquet and L. Bottou, “The tradeoffs of large scale learning,” in Advances in neural information processing systems, pp. 161–168, 2008.
 [30] California Department of Transportation, "Caltrans Performance Measurement System." http://pems.dot.ca.gov/, 2016. [Online; accessed June 2016].
 [31] Highway Capacity Manual, Volumes 1–4, Transportation Research Board, 2010.
 [32] A. Rahimi and B. Recht, “Random features for largescale kernel machines,” in Advances in neural information processing systems, pp. 1177–1184, 2008.