Time series forecasting has received lots of attention recently. This is because many phenomena such as stock price, temperature, and weather can be modeled as time series. The fundamental challenge in time series forecasting is that the observations at different points in time are correlated, which makes some of the algorithms that change or permute the order of observations unusable.
Scientists and researchers have done extensive research in time series forecasting, such as [10, 28, 3, 8, 9]. They have borrowed tools from various domains, such as graphical modeling and statistics, to improve forecasting accuracy. For example, in  and 
, the authors used the hidden Markov model (HMM) to predict time series data. In, the authors showed that the HMM can achieve high accuracy to predict student performance in an educational game. In [3, 8] and , the authors used the well-known time series method, namely, autoregressive integrated moving average (ARIMA) to make the forecast of the stock market, electricity prices and energy demands.
Although classical methods like ARIMA and HMM proved to be successful in modeling time-series data, however, they exhibit some limitations as follows: 1- HMM follows the one step Markov assumption, which means conditioning on the current state the future state is independent of all the past states. This assumption might be violated for most real-world time series data. 2- HMM considers there are underlying hidden states that are responsible to generate the observations, which might not be correct for many time series forecasting problems. 3- ARIMA and most other classical time series models can only be used for univariate time series; furthermore, it is shown that they cannot model nonlinear time series accurately. [2, 32, 6].
Recent advancements in the processing power of the computers enable researchers to take advantage of the massive volume of data and develop more sophisticated algorithms such as neural networks to model and predict the data. Many researchers have used neural networks in the time series forecasting and showed that they can lead to better prediction than ARIMA model. [6, 32, 26].
Time series clustering is another major topic that receives lots of attention. Extensive research such as [1, 18] and  is done in developing accurate and efficient clustering algorithms. In  and , the authors performed the comprehensive surveys of the time series clustering methods. They described distance-based clustering methods such as dynamic time warping (DTW) as well as feature-based and representation based clustering methods. In 
, the authors combined the recurrent neural network, specifically long short term memory (LSTM) and feature-based clustering to improve the time series forecasting accuracy. In particular, they showed that their method outperforms the baseline LSTM model in terms of mean sMAPE accuracy.
Some papers such as [31, 17] and  made forecasting using the neural network by combining static and dynamic features. They showed that combining static and dynamic features can outperform the classification accuracy using the static or dynamic features alone.
In this paper, we are proposing to combine anomaly detection, clustering, and forecasting using LSTM for time series data to achieve better prediction. The contribution of this paper is fourfold. First, we perform anomaly detection to identify and replace the outliers in the time series data. This is a major preprocessing step that can significantly affect the prediction and clustering. Second, we introduce some methods to impute the missing values for time series data. Third, we compare distance-based and feature-based clustering in terms of speed and prediction accuracy. Fourth, we introduce multiple architectures to make forecasting using time series data as well as multiple architectures to combine static data and time series data.
The rest of this article is organized as follows: Section II defines the methods and algorithms employed in the later sections. Section III describes the dataset used in the article. Section IV describes the problem formulation. Section V presents and discusses the results. Finally, section VI presents our conclusion.
Ii Algorithms and Methods
Ii-a Anomaly Detection
Anomaly detection is an essential step before applying any machine learning algorithm. This step should be taken before imputing the missing values since the outliers will influence the missing values. Outliers are defined as observations that are significantly different from all other observations in data. If they are not detected and replaced correctly, they will have a negative impact on clustering and forecasting outcomes.
are proposed to detect anomalies in time series using the ARMA process and the neural network, specifically LSTM. In this paper, we are using seasonality and trend decomposition based on locally estimated scatterplot smoothing (Loess) to detect the outliers in time series. This is the algorithm developed by Twitter and works by decomposing the time series to seasonality, trend, and random components. It is suitable for seasonal time series and is claimed to work accurately for additive outliers. An example of the implementation of this algorithm can be found in the tsclean package at R.
Ii-B Missing Value Imputation
The second major preprocessing step after detecting and replacing the outliers is imputing the missing values. Depending on the nature of data, several methods exist to impute the missing values. Following techniques are examples of missing value imputation for time series data:
|Time Series Data|
|Interpolation||Linear, Spline, …|
|Locf||Last Observation Carry Forward|
|Nocb||Next Observation Carry Backward|
|Moving Average||Exponential,Linear, …|
|Mean||Mean, Median, …|
Table 1 lists only some of the well-known methods to impute missing values. It is also worth mentioning that the tsclean package mentioned in the last section performs linear interpolation to impute missing values after anomaly detection. Several algorithms are proposed to impute missing values for the static data such as [5, 29, 20, 11]. In 
the authors performed a comparative study of several missing value imputation techniques such as Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average (filling missing values with zeros). They showed that KNNimpute provides a more robust and sensitive method than row average and SVDimpute. In, the authors proposed the SOFT-IMPUTE algorithm to iteratively replace the missing elements with those obtained from a soft-thresholded SVD. The softImpute Package in R described In  implements Iterative methods for matrix completion that use nuclear-norm regularization.
Clustering is defined as dividing the data into some groups such that data in the same groups are more similar. It is divided into two major categories of distance-based and feature-based clustering. In distance-based clustering, the distance between data points is considered as similarity measure, and the objective is to have lower intra-cluster distance than inter-cluster distance. The main challenge in distance-based clustering is the notion of the distance itself. Choosing the appropriate distance is dependent on the nature of the dataset. In the case of static and cross-sectional data, euclidean distance is the most popular distance metric, and K-mean and hierarchical algorithms are two popular clustering choices. Hierarchical clustering is divided into two categories of agglomerative and divisive. Agglomerative hierarchical clustering is a more popular choice than divisive hierarchical clustering in which each data point is initially in its own cluster; then, at each iteration, the similar clusters merge with each other until K clusters are formed. The advantage of hierarchical clustering over K-mean is that we do not need to specify the number of clusters in advance.
Dynamic time warping (DTW) is one of the most popular distance metrics used for time series clustering. It is a well-known technique that was originally developed for speech recognition application and is used to find an optimal alignment between two given time-dependent sequences. It is a dynamic programming algorithm that Unlike the Euclidean distance, is not susceptible to distortions in the time-axis. It works by warping the sequences in a nonlinear fashion to match each other . Another advantage of DTW over euclidean distance is that it does not require the sequences to have the same lengths. However,despite its numerous applications, DTW suffers from quadratic computational complexity. This means if m and n represent the length of two sequences, then the computational complexity of finding the DTW distance between them is . Some papers such as [27, 25] proposed algorithms such as PrunedDTW, FastDTW to speedup the classical DTW algorithm.
Feature-based clustering works by first extracting features from data and then constructing the feature matrix from the extracted features. This method has several advantages over distance-based clustering. First, it is faster, as will be shown later in this paper. Second, standard clustering methods such as K-mean and hierarchical clustering can directly be applied to the feature matrix.
In this paper, we are introducing two feature extraction methods for time series data. Method A mainly extracts time series specific features such as entropy, autocorrelation, partial autocorrelation, stability, and holt parameter. These features showed to be very effective for clustering and detecting unusual time series[30, 16].
Ii-D Forecasting with Neural Networks
Neural networks are drawing lots of attention in recent years. They have been used successfully in various domains such as signal and image processing, control, biology, and finance. Different neural network architectures are used depending on the application and the dataset. For example, if the objective is classification and the input data is a set of images, then a convolutional neural network (CNN) might be the best choice. If the input data is a time series, then recurrent neural network (RNN), specifically LSTM, is a better choice. However, sometimes more than one network type can be used in a given problem. For instance, a multilayer perceptron (MLP) and CNN are also used for time series forecasting. Time series should be reshaped in input and output format to be used by MLP. For example, assume a time series X with length n and the following elements:. In order to use MLP to forecast the next value, X needs to be reshapes as follows:
RNN networks are mainly designed for sequence prediction problem. Unlike MLP, which considers input and output to be independent, RNN networks consist of memory cells that can remember the long term dependencies between elements of a sequence. In theory, RNN can remember arbitrary long time steps, but in practice, they suffer from vanishing gradient problem[12, 23]. LSTM is designed to address the vanishing gradient problem. It consists of cell states and various gate elements that decide which data in a sequence is necessary to keep or throw away. By doing that, it can pass only relevant information down the long chain of sequences to make predictions.
Ii-E Error Metrics
Forecasting error is divided into two categories of scale-dependent and scale-independent. Two well-known scale-dependent measures are mean absolute error (MAE), and root mean square error (RMSE). Minimizing the MAE will lead to forecasts of the median while minimizing the RMSE will lead to forecasts of the mean. Although both MAE and RMSE are widely used error measures, MAE has the advantage that it is easier to compute and understand . MAE and RMSE are defined as follows:
Mean absolute percentage error (MAPE) is a widely used scale-independent error measure that is defined as follows:
Despite being commonly used in practice, it has the following disadvantages: 1- It is undefined when the true value is zero, and it will be a very large value when is close to zero. 2- It puts a heavier weight on the negative errors than the positive ones. 3- It does not make sense when the measurement has an arbitrary zero point [14, 15].
In this paper, We are using the synthetic data created as follows: First, we employ ARIMA models with random AR and MA coefficients to generate multivariate time series with length L and a different range of values for different columns. Second, we find the absolute value of each column of the time series to make sure that each column consists of only positive values. Third, we add arbitrary number of columns to the previously generated columns in the last two steps by simulating the piecewise continuous function that takes constant value from 0 to and from to L. Fourth, replace elements of time series by big values larger than the maximum value of each column (measurement) and elements by zeros to simulate the effect of outliers. Finally, additive white Gaussian noise (AWGN) with a specific variance depending on the range of the data is added to each column.
Above method has the following advantages: First, parameters of data such as the dimension of time series (number of measurements in multivariate time series), the variance of the noise, the minimum and the maximum value of time series corresponding to each measurement, and number of outliers can easily be defined and modified in simulation. Second, an arbitrary number of multivariate time series data can easily be created inside a loop.
Generating static data is more straightforward and is done as follows: First, identify the number of features as K. Second, determine the maximum and the minimum value and the type of each feature. Third, generate or sample data for each feature given the range of the values and the type of each feature.
For this paper, we are generating 400 multivariate time series data with three measurement columns. The total number of outliers per each measurement is 10. The length of each measurement is 400 as well. We generate 400 static data with five continuous features. Therefore the static data is a matrix of 400 by 5.
For the multivariate time series, we will be using names corresponding to physical phenomena. For example, the first measurement is oil, then is water, and the third column is gas. For static features, we will call them feature 1, feature 2, … , feature 5.
All the simulations run on Mac pro laptop with 16 GB memory and 2.80 GHz CPU.
Iv Problem Formulation
In this paper, The objective is to predict the cumulative gas value (Third column of time series) given all other data and measurements. We use the first N elements of time series and the static data for training and try to predict the cumulative gas value where . More specifically, we use and K=150, 200, 300, 400.
Neural networks and, more specifically, the LSTM are used for the prediction. We developed several architectures when time series data is only used for training and several other architectures when the combination of time series and static data are used for the training. The followings are the architectures used for training neural networks using the time series measurements.
To combine static and time series data architectures in Fig. 2 are proposed:
Only gas measurement (third column of time series data) and the static data are used to predict the future values of gas.
All time series measurements or a subset of time series measurements, in addition to static data, are used to predict the future value gas.
V Results and Discussion
In this section, we describe the results for anomaly detection as well as forecasting with and without clustering. The plots in Fig. 3 present some examples of anomaly detection algorithm presented in section II.
In Fig. 3 the blue curve is the original curve, and the red is the curve when outliers are removed. As it is shown, the anomaly detection algorithm described in section II can accurately identify the outliers. This is a crucial preprocessing step to make sure the data used for prediction is reliable.
Table III presents the MAPE, MAE, and RMSE error measures using 5-fold cross-validations for different neural network architectures introduced in the last section when only gas measurement (third column of time series) is used to predict the future values of gas.
In Table III, red color identifies the best RMSE; blue identifies best MAPE, and cyan identifies best MAE values among all architectures. According to Table III, Model 3 (Bidirectional LSTM) overall has the best performance since it gives the best RMSE values for K=150, 300, 400, and best MAPE values for K=200, 300, 400. In terms of time, Model 4 has the highest average time complexity with 2938 (s). Prediction without clustering can have the following problems:
Long forecasting time. If the process is interrupted due to insufficient memory or CPU, then training and forecasting should be repeated for all the data again.
Since data exhibit various patterns, shapes, and characteristics, training all of them together might lower the prediction performance.
Table IV summarizes the clustering characteristics for various clustering algorithms described in section II part C.
|Number of Measurements||Time(s)||Number of cluster|
According to Table IV, feature-based methods outperform the distance-based method in terms of speed; moreover, Method A is faster of the two feature-based methods. The optimal number of clusters in Table IV is determined using several cluster validity indexes (CVI) such as Silhouette, Dunn, and Gamma. Next, we will present the prediction results with DTW clustering when only the gas measurement is used to predict the future value of the gas.
|Cluster 1||Architecture||Performance Measure||150||200||300||400|
|Cluster 2||Architecture||Performance Measure||150||200||300||400|
|Cluster 3||Architecture||Performance Measure||150||200||300||400|
According to Table V clustering has the following advantages:
Prediction for different clusters can be made at different times, or it can be done parallel, which significantly reduces the forecasting time.
For each K, the best architecture for different clusters can be chosen to improve the overall performance. For example, consider K=150 when there is no clustering; the error metrics are as follows: RMSE =9. 216, MAPE = 33.94 and MAE = 6462.38. Now when DTW clustering is used for prediction given that 94 time series are in cluster 1, 188 time series are in cluster 2 and 118 time series are in cluster 3 the error metrics are calculated as follows:
(6) (7) (8) (9)
In (6) is the total error, is the error corresponding to each cluster and
Comparing the results in (7), (8) and (9) with RMSE, MAPE, and MAE in Table III reveal that clustering can significantly improve the prediction performance.
Tables VI and VII present the forecasting results with feature-based clustering using method A and method B, respectively.
|Cluster 1||Architecture||Performance Measure||150||200||300||400|
|Cluster 2||Architecture||Performance Measure||150||200||300||400|
|Cluster 1||Architecture||Performance Measure||150||200||300||400|
|Cluster 2||Architecture||Performance Measure||150||200||300||400|
According to the Table VI, the total error measures for K=150 for method A is as follows:
The same calculation can be done according to the Table VII for method B and K=150:
Comparing (7) up to (15) reveal that DTW and method B clustering methods provide the best improvement to prediction Errors. Furthermore, according to Table IV, feature-based clustering methods (method A, method B) significantly outperform the DTW method in terms of speed. Therefore if speed and efficiency are the primary concern, method A and method B can be combined with the neural networks to improve efficiency and error. If improving error is the main objective, then method B and DTW are a clear choice.
Now consider the goal is to compare different clustering algorithms based on the improvement they provide to forecasting errors for all the architectures. This means we are not choosing the best model like before, but we consider all the models for each clustering method. For example, to find the RMSE error for model 1 for DTW clustering, we find the total RMSE when model 1 is used for all the clusters. The results of this analysis for DTW, method A, and method B are summarized in Tables VIII, IX, and X, respectively.
Analyzing Tables VIII, IX, and X reveal that DTW clustering provides better overall improvement to MAPE, RMSE and MAE scores when K=150,200, 300, while for K=400 both methods A and method B outperform DTW. However, comparing Tables III, VIII, IX, and X reveal that regardless of the clustering algorithms, forecasting accuracies can significantly be improved when data clustering is done. For example, consider K=300, with no clustering the best MAPE score is 38.52 for model 3; however, with DTW clustering model 3 achieves MAPE score of 27.57, methods A and B achieve MAPE scores of 31.10 and 28.88 respectively.
V-a Influence of adding more Time Series Measurements
Thus far, we have used the gas measurement along with static data to predict future gas values. However, an important question to ask is, can prediction errors be improved by including more time series measurement?
To answer this question, we will consider all possible combinations of multivariate time series column and compare their prediction errors to Table III. To show the concept and also save simulation time, the idea is tested only against model 1 and model 3.
|Model 1||Measurements Column||Performance Measure||150||200||300||400|
|First and Second Columns||RMSE||39450.01||56527.14||84660.19||86700.9|
|First and Third Columns||RMSE||10218.19||27168.96||58130.94||72662.22|
|Second and Third Columns||RMSE||10039.66||26831.03||60065.97||69787.29|
|First, Second and Third||RMSE||10319.54||26638.20||55663.48||72562.50|
|Model 3||Measurements Column||Performance Measure||150||200||300||400|
|First and Second Columns||RMSE||42960.62||52977.89||94306.01||88665.62|
|First and Third Columns||RMSE||10973.27||28103.06||53204.26||71642.94|
|Second and Third Columns||RMSE||10417.47||26390.57||55478.94||71627.27|
|First, Second and Third||RMSE||10287.64||26195.90||63499.57||71540.70|
Comparing Table XI to Table III reveals that adding more time series measurements is not, in general, improving the RMSE, MAPE, and MAE. Such high errors in Table XI corresponding to the first entries of model 1 and model 3 suggest that the first and second columns alone are not good predictors of the third column.
In this paper, we provided a comprehensive analysis of time series forecasting using neural networks. We introduced several architectures to combine static and dynamic measurements to forecast time series using neural networks. Our results reveal that: First, not one architecture is right for predicting all future values of time series, but multiple architectures should be tested, and the best architecture should be determined based on model selection criterion such as cross-validation. Second, adding static data to forecast does not necessarily improve the forecasting errors.
We discussed two primary clustering methods, namely, distance-based and feature-based clustering. We showed that DTW outperforms the feature-based clustering in most architectures; however, feature-based clustering methods outperform the DTW in terms of speed and time complexity and still improve the forecasting accuracies significantly when they are compared to no clustering case.
Finally, we presented several methods, such as anomaly detection and data imputation, to prepare time series data for forecasting and the clustering.
-  (2015) Time-series clustering–a decade review. Information Systems 53, pp. 16–38. Cited by: §I.
-  (2009) Forecasting nonlinear time series with a hybrid methodology. Applied Mathematics Letters 22 (9), pp. 1467–1470. Cited by: §I.
-  (2014) Stock price prediction using the arima model. In 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation, pp. 106–112. Cited by: §I.
-  (2020) Forecasting across time series databases using recurrent neural networks on groups of similar series: a clustering approach. Expert Systems with Applications 140, pp. 112896. Cited by: §I.
-  (2002) A study of k-nearest neighbour as an imputation method.. HIS 87 (251-260), pp. 48. Cited by: §II-B.
-  (1992) Forecasting the behavior of multivariate time series using neural networks. Neural networks 5 (6), pp. 961–970. Cited by: §I, §I.
-  (2018) Time series feature extraction on basis of scalable hypothesis tests (tsfresh–a python package). Neurocomputing 307, pp. 72–77. Cited by: §II-C.
-  (2003) ARIMA models to predict next-day electricity prices. IEEE transactions on power systems 18 (3), pp. 1014–1020. Cited by: §I.
-  (2007) ARIMA forecasting of primary energy demand by fuel in turkey. Energy policy 35 (3), pp. 1701–1708. Cited by: §I.
-  (2005) Stock market forecasting using hidden markov model: a new approach. In 5th International Conference on Intelligent Systems Design and Applications (ISDA’05), pp. 192–196. Cited by: §I.
-  (2015) Matrix completion and low-rank svd via fast alternating least squares. The Journal of Machine Learning Research 16 (1), pp. 3367–3402. Cited by: §II-B.
-  (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6 (02), pp. 107–116. Cited by: §II-D.
-  (2019) Enhanced recurrent neural network for combining static and dynamic features for credit card default prediction. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1572–1576. Cited by: §I.
-  (2018) Forecasting: principles and practice. OTexts. Cited by: §II-E.
-  (2006) Another look at measures of forecast accuracy. International journal of forecasting 22 (4), pp. 679–688. Cited by: §II-E.
-  (2015) Large-scale unusual time series detection. In 2015 IEEE international conference on data mining workshop (ICDMW), pp. 1616–1619. Cited by: §II-C.
Combining static and dynamic features for multivariate sequence classification.
2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 21–30. Cited by: §I.
-  (2005) Clustering of time series data—a survey. Pattern recognition 38 (11), pp. 1857–1874. Cited by: §I.
-  (2015) Long short term memory networks for anomaly detection in time series. In Proceedings, pp. 89. Cited by: §II-A.
-  (2010) Spectral regularization algorithms for learning large incomplete matrices. Journal of machine learning research 11 (Aug), pp. 2287–2322. Cited by: §II-B.
-  (2007) Dynamic time warping. Information retrieval for music and motion, pp. 69–84. Cited by: §II-C.
Clustering time series with hidden markov models and dynamic time warping.
Proceedings of the IJCAI-99 workshop on neural, symbolic and reinforcement learning methods for sequence learning, pp. 17–21. Cited by: §I.
-  (2013) On the difficulty of training recurrent neural networks. In International conference on machine learning, pp. 1310–1318. Cited by: §II-D.
-  (2005) Anomaly detection in time series of graphs using arma processes. Asor Bulletin 24 (4), pp. 2. Cited by: §II-A.
-  (2007) Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis 11 (5), pp. 561–580. Cited by: §II-C.
-  (2018) A comparison of arima and lstm in forecasting time series. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 1394–1401. Cited by: §I.
-  (2016) Speeding up all-pairwise dynamic time warping matrix calculation. In Proceedings of the 2016 SIAM International Conference on Data Mining, pp. 837–845. Cited by: §II-C.
-  (2019) Predicting student performance in an educational game using a hidden markov model. arXiv preprint arXiv:1904.11857. Cited by: §I.
-  (2001) Missing value estimation methods for dna microarrays. Bioinformatics 17 (6), pp. 520–525. Cited by: §II-B.
-  (2006) Characteristic-based clustering for time series data. Data mining and knowledge Discovery 13 (3), pp. 335–364. Cited by: §II-C.
-  (2019) Integration of static and dynamic analysis for malware family classification with composite neural network. arXiv preprint arXiv:1912.11249. Cited by: §I.
-  (2003) Time series forecasting using a hybrid arima and neural network model. Neurocomputing 50, pp. 159–175. Cited by: §I, §I.