Comprehensive Analysis of Time Series Forecasting Using Neural Networks

01/27/2020 ∙ Manie Tadayon, et al. ∙ NASA

Time series forecasting has gained significant attention recently because many real-world phenomena can be modeled as time series. The massive volume of data and recent advancements in the processing power of computers enable researchers to develop more sophisticated machine learning algorithms, such as neural networks, to forecast time series data. In this paper, we propose various neural network architectures to forecast time series data using dynamic measurements; moreover, we introduce various architectures for combining static and dynamic measurements for forecasting. We also investigate the effect of techniques such as anomaly detection and clustering on forecasting accuracy. Our results indicate that clustering can improve the overall prediction time as well as the forecasting performance of the neural network. Furthermore, we show that feature-based clustering can outperform distance-based clustering in terms of speed and efficiency. Finally, our results indicate that adding more predictors to forecast the target variable does not necessarily improve forecasting accuracy.


I Introduction

Time series forecasting has received considerable attention recently because many phenomena, such as stock prices, temperature, and weather, can be modeled as time series. The fundamental challenge in time series forecasting is that observations at different points in time are correlated, which makes algorithms that change or permute the order of observations unusable.

Scientists and researchers have done extensive research in time series forecasting, such as [10, 28, 3, 8, 9]. They have borrowed tools from various domains, such as graphical modeling and statistics, to improve forecasting accuracy. For example, in [10] and [28], the authors used the hidden Markov model (HMM) to predict time series data. In [28], the authors showed that the HMM can achieve high accuracy in predicting student performance in an educational game. In [3, 8] and [9], the authors used the well-known autoregressive integrated moving average (ARIMA) model to forecast the stock market, electricity prices, and energy demand.

Although classical methods like ARIMA and HMM have proved successful in modeling time series data, they exhibit the following limitations: 1) The HMM follows the one-step Markov assumption: conditioned on the current state, the future state is independent of all past states. This assumption might be violated for most real-world time series data. 2) The HMM assumes there are underlying hidden states that are responsible for generating the observations, which might not be correct for many time series forecasting problems. 3) ARIMA and most other classical time series models can only be used for univariate time series; furthermore, it has been shown that they cannot model nonlinear time series accurately [2, 32, 6].

Recent advancements in the processing power of computers enable researchers to take advantage of the massive volume of data and develop more sophisticated algorithms, such as neural networks, to model and predict the data. Many researchers have used neural networks for time series forecasting and showed that they can lead to better predictions than the ARIMA model [6, 32, 26].

Time series clustering is another major topic that has received considerable attention. Extensive research, such as [1, 18] and [22], has been done on developing accurate and efficient clustering algorithms. In [1] and [18], the authors performed comprehensive surveys of time series clustering methods. They described distance-based clustering methods such as dynamic time warping (DTW) as well as feature-based and representation-based clustering methods. In [4], the authors combined a recurrent neural network, specifically long short-term memory (LSTM), with feature-based clustering to improve time series forecasting accuracy. In particular, they showed that their method outperforms the baseline LSTM model in terms of mean sMAPE accuracy.

Some papers, such as [31, 17] and [13], performed forecasting with neural networks by combining static and dynamic features. They showed that combining static and dynamic features can outperform the classification accuracy obtained using static or dynamic features alone.

In this paper, we propose combining anomaly detection, clustering, and LSTM-based forecasting for time series data to achieve better predictions. The contribution of this paper is fourfold. First, we perform anomaly detection to identify and replace outliers in the time series data; this is a major preprocessing step that can significantly affect prediction and clustering. Second, we introduce methods to impute missing values in time series data. Third, we compare distance-based and feature-based clustering in terms of speed and prediction accuracy. Fourth, we introduce multiple architectures for forecasting with time series data as well as multiple architectures for combining static and time series data.

The rest of this article is organized as follows: Section II defines the methods and algorithms employed in the later sections. Section III describes the dataset used in the article. Section IV describes the problem formulation. Section V presents and discusses the results. Finally, Section VI presents our conclusion.

II Algorithms and Methods

II-A Anomaly Detection

Anomaly detection is an essential step before applying any machine learning algorithm. It should be performed before imputing missing values, since outliers would otherwise influence the imputed values. Outliers are defined as observations that are significantly different from all other observations in the data. If they are not detected and replaced correctly, they will have a negative impact on clustering and forecasting outcomes.

Several algorithms, such as [19] and [24], have been proposed to detect anomalies in time series using the ARMA process and neural networks, specifically LSTM. In this paper, we use seasonality and trend decomposition based on locally estimated scatterplot smoothing (Loess) to detect outliers in time series. This algorithm was developed by Twitter and works by decomposing the time series into seasonal, trend, and remainder components. It is suitable for seasonal time series and is claimed to work accurately for additive outliers. An example implementation of this algorithm can be found in the tsclean package in R.
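
As an illustration, the sketch below shows one way to implement this decomposition-based outlier detection in Python, assuming the statsmodels library (its STL implementation); the robust-spread threshold rule and parameter values are illustrative assumptions, not the exact procedure used by tsclean.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

def replace_outliers_stl(series: pd.Series, period: int = 12, k: float = 3.0):
    """Decompose a seasonal series with STL and flag points whose remainder
    is more than k times the interquartile range away from the median
    remainder, then replace them by linear interpolation."""
    decomposition = STL(series, period=period, robust=True).fit()
    resid = decomposition.resid
    q1, q3 = np.percentile(resid, [25, 75])
    outliers = np.abs(resid - np.median(resid)) > k * (q3 - q1)
    cleaned = series.copy()
    cleaned[outliers] = np.nan
    return cleaned.interpolate(), outliers

# Hypothetical usage on a monthly series:
# ts = pd.Series(values, index=pd.date_range("2015-01", periods=120, freq="M"))
# cleaned_ts, flags = replace_outliers_stl(ts, period=12)
```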

II-B Missing Value Imputation

The second major preprocessing step, after detecting and replacing outliers, is imputing the missing values. Depending on the nature of the data, several methods exist to impute missing values. The following techniques are examples of missing value imputation methods for time series data:

Method            Variants / Description
Interpolation     Linear, Spline, …
LOCF              Last Observation Carried Forward
NOCB              Next Observation Carried Backward
Moving Average    Exponential, Linear, …
Mean              Mean, Median, …
TABLE I: Time Series Missing Value Imputation Methods

Table I lists only some of the well-known methods to impute missing values. It is also worth mentioning that the tsclean package mentioned in the last section performs linear interpolation to impute missing values after anomaly detection. Several algorithms have been proposed to impute missing values in static data, such as [5, 29, 20, 11]. In [29], the authors performed a comparative study of several missing value imputation techniques, such as a singular value decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average (filling missing values with zeros). They showed that KNNimpute is a more robust and sensitive method than row average and SVDimpute. In [20], the authors proposed the SOFT-IMPUTE algorithm, which iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. The softImpute package in R, described in [11], implements iterative methods for matrix completion that use nuclear-norm regularization.
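
For reference, the sketch below shows how the time series methods of Table I look in Python with pandas; the toy series and the moving-average window are illustrative.

```python
import numpy as np
import pandas as pd

# Toy series with missing values
ts = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0, 7.0])

linear = ts.interpolate(method="linear")            # interpolation (linear)
spline = ts.interpolate(method="spline", order=2)   # interpolation (spline, needs SciPy)
locf = ts.ffill()                                   # LOCF: last observation carried forward
nocb = ts.bfill()                                   # NOCB: next observation carried backward
moving = ts.fillna(ts.rolling(3, min_periods=1, center=True).mean())  # moving average
mean = ts.fillna(ts.mean())                         # mean imputation
```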

II-C Clustering

Clustering is defined as dividing the data into groups such that data points in the same group are more similar to each other than to those in other groups. It falls into two major categories: distance-based and feature-based clustering. In distance-based clustering, the distance between data points is used as the similarity measure, and the objective is to have a lower intra-cluster distance than inter-cluster distance. The main challenge in distance-based clustering is the notion of the distance itself; choosing an appropriate distance depends on the nature of the dataset. For static and cross-sectional data, Euclidean distance is the most popular distance metric, and K-means and hierarchical algorithms are two popular clustering choices. Hierarchical clustering is divided into two categories: agglomerative and divisive. Agglomerative hierarchical clustering, in which each data point initially forms its own cluster and the most similar clusters are merged at each iteration until K clusters remain, is the more popular of the two. The advantage of hierarchical clustering over K-means is that we do not need to specify the number of clusters in advance.
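
As a minimal sketch, assuming scikit-learn and illustrative data, the two standard choices for static data mentioned above can be run as follows.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

# Hypothetical static (cross-sectional) data: 100 samples, 5 features
X = np.random.default_rng(0).normal(size=(100, 5))

# K-means: the number of clusters must be fixed in advance
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Agglomerative (bottom-up) hierarchical clustering with Euclidean distance;
# conceptually the full merge tree is built and then cut at the desired K
hier_labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)
```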

Dynamic time warping (DTW) is one of the most popular distance metrics used for time series clustering. It is a well-known technique originally developed for speech recognition and is used to find an optimal alignment between two time-dependent sequences. It is a dynamic programming algorithm that, unlike the Euclidean distance, is not susceptible to distortions in the time axis: it warps the sequences in a nonlinear fashion to match each other [21]. Another advantage of DTW over the Euclidean distance is that it does not require the sequences to have the same length. However, despite its numerous applications, DTW suffers from quadratic computational complexity: if m and n are the lengths of the two sequences, the computational complexity of finding the DTW distance between them is O(mn). Papers such as [27, 25] proposed algorithms such as PrunedDTW and FastDTW to speed up the classical DTW algorithm.
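
The classical dynamic-programming recurrence can be written compactly; the sketch below is a straightforward O(mn) implementation for two 1-D sequences of possibly different lengths (absolute difference is used as the local cost, an illustrative choice).

```python
import numpy as np

def dtw_distance(x, y):
    """Classical O(m*n) dynamic-programming DTW between two 1-D sequences;
    the sequences may have different lengths."""
    m, n = len(x), len(y)
    cost = np.full((m + 1, n + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = abs(x[i - 1] - y[j - 1])              # local cost
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[m, n]

# dtw_distance([0, 1, 2, 3], [0, 0, 1, 2, 2, 3]) -> 0.0
```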

Feature-based clustering works by first extracting features from the data and then constructing a feature matrix from the extracted features. This method has several advantages over distance-based clustering. First, it is faster, as will be shown later in this paper. Second, standard clustering methods such as K-means and hierarchical clustering can be applied directly to the feature matrix.

In this paper, we introduce two feature extraction methods for time series data. Method A mainly extracts time-series-specific features such as entropy, autocorrelation, partial autocorrelation, stability, and the Holt parameters. These features have been shown to be very effective for clustering and for detecting unusual time series [30, 16]. Method B treats the time series more like a signal and extracts a more exhaustive list of features such as energy, fast Fourier transform (FFT) coefficients, continuous wavelet transform (CWT) coefficients, and variance [7].
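
A minimal sketch of feature-based clustering, assuming statsmodels and scikit-learn, with a small hand-picked feature set in the spirit of Method A (a Method B-style exhaustive feature set could instead come from a library such as tsfresh [7]).

```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def extract_features(ts):
    """Small illustrative feature vector for one univariate series."""
    ts = np.asarray(ts, dtype=float)
    a = acf(ts, nlags=5, fft=True)
    p = pacf(ts, nlags=5)
    return np.array([ts.mean(), ts.var(), a[1], a[2], p[1], p[2]])

def feature_based_clustering(series_list, n_clusters=2):
    """Build the feature matrix, standardize it, and apply K-means."""
    features = np.vstack([extract_features(ts) for ts in series_list])
    features = StandardScaler().fit_transform(features)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)

# labels = feature_based_clustering([ts1, ts2, ts3], n_clusters=2)
```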

II-D Forecasting with Neural Networks

Neural networks have drawn a lot of attention in recent years. They have been used successfully in various domains such as signal and image processing, control, biology, and finance. Different neural network architectures are used depending on the application and the dataset. For example, if the objective is classification and the input data is a set of images, then a convolutional neural network (CNN) might be the best choice. If the input data is a time series, then a recurrent neural network (RNN), specifically LSTM, is a better choice. However, sometimes more than one network type can be used for a given problem. For instance, a multilayer perceptron (MLP) and a CNN can also be used for time series forecasting. A time series should be reshaped into input and output format to be used by an MLP. For example, assume a time series X with length n and elements X(1), X(2), …, X(n). To use an MLP to forecast the next value, X needs to be reshaped as follows:

Input     Output
X(1)      X(2)
X(2)      X(3)
…         …
X(n-1)    X(n)
TABLE II: Time Series Formulation for MLP
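
A small sketch of this reshaping, generalized to a window of lagged values (window = 1 reproduces Table II); the function name is ours.

```python
import numpy as np

def to_supervised(x, window=1):
    """Reshape a 1-D series into (input, output) pairs: each row of X holds
    `window` consecutive values and y holds the value that follows them."""
    x = np.asarray(x)
    X = np.array([x[t - window:t] for t in range(window, len(x))])
    y = x[window:]
    return X, y

# window=1 reproduces Table II: X(1)->X(2), X(2)->X(3), ..., X(n-1)->X(n)
# X, y = to_supervised([10, 20, 30, 40, 50], window=1)
```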

RNNs are mainly designed for sequence prediction problems. Unlike an MLP, which considers inputs and outputs to be independent, RNNs contain memory cells that can remember long-term dependencies between elements of a sequence. In theory, an RNN can remember arbitrarily long time steps, but in practice it suffers from the vanishing gradient problem [12, 23]. LSTM is designed to address the vanishing gradient problem. It consists of a cell state and various gates that decide which data in a sequence should be kept or thrown away. By doing so, it can pass only relevant information down a long chain of sequences to make predictions.
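
As an illustration only (the specific Models 1–7 are defined in Fig. 1 and Fig. 2, not here), a generic single-layer LSTM forecaster in Keras might look as follows; layer sizes and hyperparameters are illustrative assumptions.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def build_lstm(window, n_features=1):
    """Single-layer LSTM regressor mapping `window` past steps to the next value."""
    model = Sequential([
        LSTM(64, input_shape=(window, n_features)),
        Dense(1),
    ])
    model.compile(optimizer="adam", loss="mae")
    return model

# X has shape (samples, window, n_features); y has shape (samples,)
# model = build_lstm(window=10)
# model.fit(X, y, epochs=50, batch_size=32, validation_split=0.1)
```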

II-E Error Metrics

Forecasting error measures are divided into two categories: scale-dependent and scale-independent. Two well-known scale-dependent measures are the mean absolute error (MAE) and the root mean square error (RMSE). Minimizing the MAE leads to forecasts of the median, while minimizing the RMSE leads to forecasts of the mean. Although both MAE and RMSE are widely used error measures, MAE has the advantage that it is easier to compute and understand [14]. MAE and RMSE are defined as follows:

$e_t = y_t - \hat{y}_t$ (1)
$\mathrm{MAE} = \operatorname{mean}(|e_t|)$ (2)
$\mathrm{RMSE} = \sqrt{\operatorname{mean}(e_t^2)}$ (3)

where $y_t$ is the observed value and $\hat{y}_t$ is the forecast at time $t$.

The mean absolute percentage error (MAPE) is a widely used scale-independent error measure, defined as follows:

$p_t = 100\, e_t / y_t$ (4)
$\mathrm{MAPE} = \operatorname{mean}(|p_t|)$ (5)

Despite being commonly used in practice, MAPE has the following disadvantages: 1) It is undefined when the true value $y_t$ is zero and takes very large values when $y_t$ is close to zero. 2) It puts a heavier weight on negative errors than on positive ones. 3) It does not make sense when the measurement has an arbitrary zero point [14, 15].
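
For completeness, the three error measures can be computed directly; the sketch below assumes NumPy and mirrors definitions (1)–(5).

```python
import numpy as np

def mae(y_true, y_pred):
    e = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(e))                      # Eq. (2)

def rmse(y_true, y_pred):
    e = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean(e ** 2))                # Eq. (3)

def mape(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    e = y_true - np.asarray(y_pred, dtype=float)
    # undefined when y_true == 0 and explodes when y_true is near zero
    return 100.0 * np.mean(np.abs(e / y_true))     # Eq. (5)
```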

III Dataset

In this paper, we use synthetic data created as follows. First, we employ ARIMA models with random AR and MA coefficients to generate a multivariate time series of length L, with a different range of values for different columns. Second, we take the absolute value of each column of the time series to ensure that each column contains only positive values. Third, we add an arbitrary number of columns to those generated in the previous two steps by simulating piecewise-constant functions that take one constant value on the first part of the interval [0, L] and another on the remainder. Fourth, we replace some elements of the time series with values larger than the maximum value of each column (measurement) and set other elements to zero to simulate the effect of outliers. Finally, additive white Gaussian noise (AWGN) with a variance chosen according to the range of the data is added to each column.
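
A partial sketch of this generation procedure, assuming statsmodels' ArmaProcess; the piecewise-constant columns are omitted, and all scales, coefficient ranges, and outlier placements are illustrative rather than the exact values used for the experiments.

```python
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess

rng = np.random.default_rng(0)

def simulate_measurement(length, scale):
    """One positive-valued column driven by a random stationary ARMA(2, 2) process."""
    ar = np.r_[1, -rng.uniform(-0.5, 0.5, size=2)]  # lag polynomial, zero-lag term included
    ma = np.r_[1, rng.uniform(-0.5, 0.5, size=2)]
    return np.abs(ArmaProcess(ar, ma).generate_sample(nsample=length)) * scale

def make_series(length=400, n_outliers=10, noise_std=1.0):
    """Three measurement columns with injected outliers, zeros, and AWGN."""
    cols = []
    for scale in (10.0, 100.0, 1000.0):                 # different value ranges per column
        col = simulate_measurement(length, scale)
        idx = rng.choice(length, size=n_outliers, replace=False)
        col[idx[: n_outliers // 2]] = col.max() * 5.0   # large spikes
        col[idx[n_outliers // 2:]] = 0.0                # zero dropouts
        col += rng.normal(0.0, noise_std, size=length)  # additive white Gaussian noise
        cols.append(col)
    return np.column_stack(cols)

# dataset = np.stack([make_series() for _ in range(400)])   # 400 multivariate series
```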

The above method has the following advantages. First, the parameters of the data, such as the dimension of the time series (number of measurements in the multivariate time series), the variance of the noise, the minimum and maximum value of the time series for each measurement, and the number of outliers, can easily be defined and modified in simulation. Second, an arbitrary number of multivariate time series can easily be created inside a loop.

Generating static data is more straightforward and is done as follows. First, identify the number of features, K. Second, determine the maximum and minimum value and the type of each feature. Third, generate or sample data for each feature given its range of values and type.

For this paper, we generate 400 multivariate time series with three measurement columns. The total number of outliers per measurement is 10, and the length of each measurement is 400 as well. We also generate 400 static data points with five continuous features; therefore, the static data is a 400-by-5 matrix.

For the multivariate time series, we use names corresponding to physical phenomena: the first measurement is oil, the second is water, and the third column is gas. The static features are called feature 1, feature 2, …, feature 5.

All simulations were run on a MacBook Pro with 16 GB of memory and a 2.80 GHz CPU.

IV Problem Formulation

In this paper, the objective is to predict the cumulative gas value (the third column of the time series) given all other data and measurements. We use the first N elements of the time series and the static data for training and try to predict the cumulative gas value at step K, where K > N. More specifically, we use K = 150, 200, 300, 400.

Neural networks, and more specifically LSTM, are used for the prediction. We developed several architectures for the case where only time series data is used for training, and several others for the case where a combination of time series and static data is used. Fig. 1 shows the architectures used for training neural networks on the time series measurements.

Fig. 1: Neural Network Architectures for Time Series Data: (a) Model 1, (b) Model 2, (c) Model 3.

To combine static and time series data, the architectures in Fig. 2 are proposed:

Fig. 2: Neural Network Architectures for Combining Time Series and Static Data: (a) Model 4, (b) Model 5, (c) Model 6, (d) Model 7.
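
Since Fig. 2 is not reproduced here, the sketch below shows only one generic way to combine the two data types in Keras (an LSTM branch over the time series concatenated with a dense branch over the static features); it is not any particular Model 4–7, and all layer sizes are illustrative.

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import LSTM, Dense, Concatenate

def build_combined_model(window, n_dynamic, n_static):
    """LSTM branch for the time series measurements, dense branch for the
    static features, merged by concatenation before the regression head."""
    dyn_in = Input(shape=(window, n_dynamic), name="dynamic")
    sta_in = Input(shape=(n_static,), name="static")
    h = Concatenate()([LSTM(64)(dyn_in), Dense(16, activation="relu")(sta_in)])
    out = Dense(1)(h)
    model = Model(inputs=[dyn_in, sta_in], outputs=out)
    model.compile(optimizer="adam", loss="mae")
    return model

# model = build_combined_model(window=100, n_dynamic=1, n_static=5)
# model.fit([X_dynamic, X_static], y, epochs=50, batch_size=32)
```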

The architectures in Figs. 1 and 2 are tested against clustered and unclustered data under the following conditions:

  • Only the gas measurement (third column of the time series data) and the static data are used to predict the future values of gas.

  • All time series measurements, or a subset of them, in addition to the static data, are used to predict the future value of gas.

V Results and Discussion

In this section, we describe the results for anomaly detection as well as forecasting with and without clustering. The plots in Fig. 3 present some examples of the anomaly detection algorithm described in Section II.

Fig. 3: Anomaly Detection Algorithm

In Fig. 3, the blue curve is the original series, and the red curve is the series after the outliers have been removed. As shown, the anomaly detection algorithm described in Section II can accurately identify the outliers. This is a crucial preprocessing step to ensure that the data used for prediction is reliable.

Table III presents the MAPE, MAE, and RMSE error measures, using 5-fold cross-validation, for the different neural network architectures introduced in the last section when only the gas measurement (third column of the time series) is used to predict the future values of gas.

Architecture Performance Measure K=150 K=200 K=300 K=400
Model 1 RMSE 9684.88 27005.89 57649.70 71668.95
MAPE 33.94 42.41 44.54 20.51
MAE 6827.62 19259.43 44179.18 53633.80
Time(s) 1754 1613 1420 1250
Model 2 RMSE 9736.41 25782.56 53861.46 69246.07
MAPE 35.53 40.50 38.69 18.88
MAE 6716.27 17245.11 43024.10 54552.27
Time(s) 3003 2521 2067 2125
Model 3 RMSE 9216.22 26161.17 50678.83 65593.37
MAPE 35.42 40.37 38.52 18.18
MAE 6462.38 19731.54 45336.42 51144.22
Time(s) 1900 1861 1600 1550
Model 4 RMSE 10866.09 24287.97 56205.88 73167.68
MAPE 44.19 44.12 38.73 20.98
MAE 8065.32 17892.22 43378.11 57337.14
Time(s) 2900 2851 3421 2550
Model 5 RMSE 10313.77 26131.29 53851.36 72557.48
MAPE 40.17 46.08 37.63 21.07
MAE 7533.38 19163.40 42630.06 55765.28
Time(s) 1228 1888 1536 1409
Model 6 RMSE 11473.68 26938.94 54924.93 74088.37
MAPE 45.24 56.77 38.54 22.62
MAE 8042.11 20872.36 44216.48 59516.29
Time(s) 1800 1600 1770 1662
Model 7 RMSE 16198.66 32106.55 66425.61 76797.46
MAPE 71.09 71.10 41.13 22.56
MAE 12662.59 25438.29 52732.66 61946.76
Time(s) 1600 1523 1327 1710
TABLE III: Forecasting Results with No Clustering

In Table III, red identifies the best RMSE, blue the best MAPE, and cyan the best MAE values among all architectures. According to Table III, Model 3 (bidirectional LSTM) has the best overall performance, since it gives the best RMSE values for K=150, 300, 400 and the best MAPE values for K=200, 300, 400. In terms of time, Model 4 has the highest average running time at 2938 s. Prediction without clustering can have the following problems:

  • Long forecasting time. If the process is interrupted due to insufficient memory or CPU, then training and forecasting must be repeated for all the data.

  • Since data exhibit various patterns, shapes, and characteristics, training all of them together might lower the prediction performance.

Table IV summarizes the clustering characteristics of the various clustering algorithms described in Section II-C.

Clustering Method          Number of Measurements   Time(s)   Number of Clusters
Distance-Based: DTW        1                        1352.40   3
                           3                        1486.57   3
Feature-Based: Method A    1                        12.61     2
                           3                        24.48     2
Feature-Based: Method B    1                        275.71    2
                           3                        406.52    2
TABLE IV: Characteristics of Various Clustering Algorithms

According to Table IV, the feature-based methods outperform the distance-based method in terms of speed; moreover, Method A is the faster of the two feature-based methods. The optimal number of clusters in Table IV is determined using several cluster validity indices (CVIs), such as Silhouette, Dunn, and Gamma. Next, we present the prediction results with DTW clustering when only the gas measurement is used to predict the future value of gas.

Cluster 1 Architecture Performance Measure K=150 K=200 K=300 K=400
Model 1 RMSE 13492.88 27227.05 49960.58 71390.70
MAPE 14.18 17.73 15.95 15.44
MAE 10493.94 19177.90 39821.71 57437.43
Time(s) 378 340 400 391
Model 2 RMSE 15664.32 29477.30 50128.86 72614.45
MAPE 16.13 18.19 15.80 15.47
MAE 13932.04 23122.89 38783.94 60840.61
Time(s) 687 1044 971 1200
Model 3 RMSE 14319.32 29365.25 50414.84 72061.09
MAPE 14.91 17.05 16.56 16.16
MAE 10814.49 24784.66 42159.97 60357.10
Time(s) 503 600 726 800
Model 4 RMSE 17292.00 32476.94 44582.48 72725.99
MAPE 19.78 18.94 12.88 16.39
MAE 14332.13 25061.84 38270.20 61102.26
Time(s) 1000 1080 1087 1200
Model 5 RMSE 15303.55 31857.90 61536.90 79902.93
MAPE 16.95 18.27 21.28 18.20
MAE 12397.09 25548.43 49572.39 66493.49
Time(s) 821 546 612 782
Model 6 RMSE 14399.24 32669.23 54847.72 82953.44
MAPE 14.89 17.93 17.62 19.19
MAE 11715.45 19259.43 44179.18 53633.80
Time(s) 453 460 433 421
Model 7 RMSE 19112.97 32594.89 51854.97 75612.36
MAPE 19.12 16.24 15.86 16.76
MAE 15384.47 25164.28 40750.27 65085.91
Time(s) 430 491 455 435
Cluster 2 Architecture Performance Measure K=150 K=200 K=300 K=400
Model 1 RMSE 4414.00 10923.24 31035.56 52708.65
MAPE 33.37 43.41 40.05 20.52
MAE 3640.56 9157.58 25500.18 41605.72
Time(s) 800 557 800 871
Model 2 RMSE 4419.03 11557.68 31079.12 49957.10
MAPE 29.67 46.51 38.68 19.26
MAE 3452.11 9547.25 25161.26 39200.76
Time(s) 1450 1200 1745 1800
Model 3 RMSE 4375.45 10533.33 30225.99 47035.53
MAPE 32.53 43.11 38.26 16.62
MAE 3827.21 9393.74 24633.34 37753.89
Time(s) 952 942 923 968
Model 4 RMSE 4028.32 10457.87 34559.18 57166.64
MAPE 27.26 45.30 47.08 22.70
MAE 3297.35 8413.90 29013.81 44908.45
Time(s) 1379 1800 1826 1913
Model 5 RMSE 4604.66 12290.53 33935.11 53214.73
MAPE 31.24 52.95 41.64 20.69
MAE 3623.62 10039.21 27668.60 42097.45
Time(s) 584 713 920 678
Model 6 RMSE 4413.09 10595.32 33016.45 54944.10
MAPE 33.25 44.65 39.54 19.40
MAE 3562.62 8526.28 27270.80 42646.56
Time(s) 715 880 979 917
Model 7 RMSE 5012.49 12623.80 33588.65 51467.56
MAPE 41.53 53.82 42.48 19.79
MAE 4225.53 10208.43 27638.36 41616.74
Time(s) 712 725 781 796
Cluster 3 Architecture Performance Measure K=150 K=200 K=300 K=400
Model 1 RMSE 8035.00 20863.14 41267.41 50736.26
MAPE 14.91 19.95 20.11 12.27
MAE 5999.42 16170.73 34492.70 41323.80
Time(s) 542 646 632 753
Model 2 RMSE 8138.11 18217.21 45578.03 55843.99
MAPE 14.35 17.24 19.37 12.15
MAE 5702.11 12931.99 34985.26 44785.75
Time(s) 1020 879 1225 1410
Model 3 RMSE 9606.88 19171.64 50842.20 65310.41
MAPE 17.07 20.13 19.33 17.51
MAE 6507.99 14906.23 36474.90 51834.14
Time(s) 802 1007 921 879
Model 4 RMSE 9513.50 24396.60 36394.98 62606.17
MAPE 19.66 19.87 17.92 15.15
MAE 7497.21 16790.39 29828.65 48750.67
Time(s) 1414 1834 1821 1690
Model 5 RMSE 8857.02 22665.07 35642.91 59074.92
MAPE 17.31 18.82 16.41 14.48
MAE 6841.79 16854.21 28238.08 47178.45
Time(s) 973 767 718 723
Model 6 RMSE 7891.00 16401.70 38307.10 60533.33
MAPE 15.09 17.90 18.70 15.72
MAE 6014.56 12895.91 31065.94 45152.46
Time(s) 800 617 648 712
Model 7 RMSE 9100.24 17191.88 39321.63 58002.70
MAPE 19.70 19.57 19.51 13.52
MAE 7582.63 13583.04 32742.17 44045.32
Time(s) 723 711 689 656
TABLE V: Forecasting Results with DTW Clustering

According to Table V, clustering has the following advantages:

  • Prediction for different clusters can be made at different times, or it can be done in parallel, which significantly reduces the forecasting time.

  • For each K, the best architecture for each cluster can be chosen to improve the overall performance. For example, consider K=150 when there is no clustering; the error metrics are RMSE = 9216.22, MAPE = 33.94, and MAE = 6462.38. Now, when DTW clustering is used for prediction, given that 94 time series are in cluster 1, 188 are in cluster 2, and 118 are in cluster 3, the error metrics are calculated as follows:

    $E_{\text{total}} = \sum_{i} P(C_i)\, E_i$ (6)
    $\mathrm{RMSE}_{\text{total}} = \sum_{i} P(C_i)\, \mathrm{RMSE}_i$ (7)
    $\mathrm{MAPE}_{\text{total}} = \sum_{i} P(C_i)\, \mathrm{MAPE}_i$ (8)
    $\mathrm{MAE}_{\text{total}} = \sum_{i} P(C_i)\, \mathrm{MAE}_i$ (9)

    In (6), $E_{\text{total}}$ is the total error, $E_i$ is the error corresponding to cluster $i$, and $P(C_i)$ is the probability of choosing cluster $i$, i.e., the fraction of time series assigned to it. Equation (6) follows the law of total probability.

Comparing the results in (7), (8), and (9) with the RMSE, MAPE, and MAE values in Table III reveals that clustering can significantly improve the prediction performance.
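
The aggregation in (6) is simple to compute; the sketch below uses the cluster sizes quoted above, with placeholder per-cluster errors rather than the paper's values.

```python
def total_error(cluster_errors, cluster_sizes):
    """Eq. (6): E_total = sum_i P(C_i) * E_i with P(C_i) = n_i / sum(n)."""
    n = sum(cluster_sizes)
    return sum(e * s / n for e, s in zip(cluster_errors, cluster_sizes))

# DTW clustering assigns 94, 188, and 118 of the 400 series to the three clusters;
# rmse_1, rmse_2, rmse_3 below stand for the per-cluster RMSE values (placeholders).
# total_rmse = total_error([rmse_1, rmse_2, rmse_3], [94, 188, 118])
```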

Tables VI and VII present the forecasting results with feature-based clustering using method A and method B, respectively.

Cluster 1 Architecture Performance Measure K=150 K=200 K=300 K=400
Model 1 RMSE 4038.76 8832.14 27142.08 45872.40
MAPE 34.56 38.37 38.89 18.52
MAE 3204 7341.46 23202.30 37073.63
Time(s) 621 691 630 681
Model 2 RMSE 4545.92 9660.00 27239.60 44317.91
MAPE 32.44 41.28 36.48 17.11
MAE 3400.99 7842.32 23022.10 35806.13
Time(s) 1249 1398 1100 1080
Model 3 RMSE 3975.74 9906.88 27657.59 42505.00
MAPE 33.62 46.85 40.81 17.44
MAE 3306.95 8172.18 23897.82 35207.36
Time(s) 900 861 812 871
Model 4 RMSE 4328.14 10353.43 28500.70 42600.88
MAPE 36.20 49.93 37.78 15.72
MAE 3472.24 8346.69 24439.28 34096.96
Time(s) 1313 1814 1600 1324
Model 5 RMSE 4949.87 9389.69 25986.47 45082.67
MAPE 42.09 38.83 33.53 18.50
MAE 3951.77 7696.16 21675.31 36561.82
Time(s) 671 671 712 650
Model 6 RMSE 4279.94 9743.54 27923.25 45612.06
MAPE 32.34 54.16 40.99 17.29
MAE 3413.10 8544.21 23223.17 36221.62
Time(s) 591 660 687 612
Model 7 RMSE 4972.13 10147.80 27145.73 46906.73
MAPE 38.52 45.45 38.78 18.25
MAE 4040.88 8444.79 22175.59 38482.42
Time(s) 623 711 601 589
Cluster 2 Architecture Performance Measure K=150 K=200 K=300 K=400
Model 1 RMSE 11872.12 29413.02 56085.34 65408.21
MAPE 28.97 29.99 25.27 15.75
MAE 8838.86 22715.02 46822.08 51364.60
Time(s) 800 882 1262 922
Model 2 RMSE 14263.18 29695.96 53459.04 60645.00
MAPE 33.67 34.36 23.32 13.93
MAE 11584.29 24174.64 43544.11 47841.09
Time(s) 2137 1945 1462 1541
Model 3 RMSE 14049.78 29075.38 56893.44 58575.78
MAPE 35.24 29.36 25.76 13.61
MAE 11626.64 23097.41 47130.03 47059.17
Time(s) 1162 1101 1053 996
Model 4 RMSE 15978.33 32114.92 58155.73 63148.46
MAPE 41.68 36.38 25.97 14.26
MAE 13298.98 26018.09 47476.40 48745.07
Time(s) 1902 2084 1969 2010
Model 5 RMSE 12947.65 29301.81 59024.33 65776.89
MAPE 30.35 29.24 25.84 16.16
MAE 10555.66 23042.84 46318.66 53606.05
Time(s) 1239 1300 1312 1080
Model 6 RMSE 11963.81 28878.83 53073.28 61825.07
MAPE 26.55 33.24 23.08 14.99
MAE 9487.32 23477.52 42112.35 48812.02
Time(s) 1001 1211 1194 951
Model 7 RMSE 11988.71 37235.03 61843.47 64530.66
MAPE 50.88 50.18 32.29 16.38
MAE 9263.22 32333.89 57271.93 53606.42
Time(s) 923 1187 1015 986
TABLE VI: Forecasting Results with Feature-Based Method A Clustering
Cluster 1 Architecture Performance Measure K=150 K=200 K=300 K=400
Model 1 RMSE 6262.29 14602.69 40447.69 55142.00
MAPE 34.51 43.28 48.18 19.65
MAE 4705.40 11756.55 32875.25 43141.80
Time(s) 1211 1031 876 1105
Model 2 RMSE 5790.1 12494.61 33793.05 52679.51
MAPE 32.10 43.72 42.01 19.88
MAE 4670.78 10739.64 27789.09 42660.21
Time(s) 2086 2055 2010 1865
Model 3 RMSE 4906.85 12144.00 32997.05 49625.67
MAPE 31.89 42.44 37.96 17.46
MAE 3591.33 10180.60 27072.80 40025.59
Time(s) 964 1353 1469 1100
Model 4 RMSE 6212.31 12706.26 32646.63 52751.72
MAPE 41.08 50.42 37.80 19.56
MAE 4775.04 10641.48 26849.83 42106.65
Time(s) 2154 1772 2100 2254
Model 5 RMSE 5403.45 14679.94 36773.00 52649.04
MAPE 35.38 55.43 45.63 19.76
MAE 4361.95 12294.69 30163.46 42906.34
Time(s) 1129 945 1001 1122
Model 6 RMSE 5621.37 13893.23 32005.10 54944.21
MAPE 35.07 48.77 34.76 20.83
MAE 4480.75 11734.42 26004.11 43438.74
Time(s) 1122 1029 983 912
Model 7 RMSE 6515.12 15772.53 34345.54 53317.02
MAPE 43.30 65.44 45.34 20.82
MAE 5208.36 13215.27 27508.65 43108.25
Time(s) 987 998 1112 1010
Cluster 2 Architecture Performance Measure K=150 K=200 K=300 K=400
Model 1 RMSE 12059.14 29017.86 47730.52 62920.93
MAPE 12.81 22.03 27.12 13.93
MAE 8583.00 22420.10 39235.51 51245.27
Time(s) 730 529 625 565
Model 2 RMSE 12281.21 28399.50 52188.18 57151.25
MAPE 15.99 22.09 19.81 11.44
MAE 8950.43 22083.35 42345.36 45136.57
Time(s) 1048 1137 1300 1323
Model 3 RMSE 11296.55 26270.88 48964.51 56242.17
MAPE 13.36 26.19 16.36 11.75
MAE 7776.63 19426.56 39846.93 45253.69
Time(s) 1100 913 848 874
Model 4 RMSE 14693.48 31948.36 46154.93 57638.41
MAPE 20.96 23.20 16.05 11.26
MAE 11306.93 24519.02 37135.12 44802.18
Time(s) 1283 1525 1469 1198
Model 5 RMSE 12000.75 33271.13 52588.69 56051.11
MAPE 14.56 25.98 18.75 11.49
MAE 8835.99 26458.75 42617.07 43542.71
Time(s) 664 661 869 867
Model 6 RMSE 13363.42 31299.26 47829.81 57017.62
MAPE 19.77 22.93 14.89 12.19
MAE 10311.96 23920.97 38111.54 45107.26
Time(s) 701 779 923 951
Model 7 RMSE 14424.54 30437.25 52917.75 59926.56
MAPE 17.57 22.48 17.76 11.96
MAE 10690.42 23874.18 43415.56 47402.78
Time(s) 923 1187 1015 986
TABLE VII: Forecasting Results with Feature-Based Method B Clustering

According to Table VI and Eq. (6), the total error measures for Method A at K = 150 are as follows:

(10)
(11)
(12)

The same calculation can be done using Table VII for Method B at K = 150:

(13)
(14)
(15)

Comparing (7) through (15) reveals that DTW and Method B provide the best improvement in prediction errors. Furthermore, according to Table IV, the feature-based clustering methods (Method A and Method B) significantly outperform DTW in terms of speed. Therefore, if speed and efficiency are the primary concern, Method A and Method B can be combined with the neural networks to improve efficiency and error. If reducing the error is the main objective, then Method B and DTW are the clear choices.

Now consider comparing the different clustering algorithms based on the improvement they provide to the forecasting errors over all the architectures. This means we are not choosing the best model as before, but considering every model for each clustering method. For example, to find the RMSE error of Model 1 under DTW clustering, we compute the total RMSE when Model 1 is used for all the clusters. The results of this analysis for DTW, Method A, and Method B are summarized in Tables VIII, IX, and X, respectively.

Architecture Performance Measure K=150 K=200 K=300 K=400
Model 1 RMSE 7615.73 17686.91 38501.34 56517.08
MAPE 23.41 30.45 28.50 16.89
MAE 5946.97 13581.36 31518.53 45243.00
Model 2 RMSE 8158.80 17733.35 39832.98 57018.20
MAPE 21.96 31.22 27.60 16.27
MAE 6578.64 13736.02 31260.66 45133.94
Model 3 RMSE 8255.53 17507.13 41052.15 58307.62
MAPE 23.82 30.20 27.57 16.77
MAE 6260.05 14636.79 32245.35 47219.31
Model 4 RMSE 8763.41 19744.27 37456.21 62427.74
MAPE 23.26 31.60 30.44 18.98
MAE 7129.48 14797.23 31429.43 49847.45
Model 5 RMSE 8373.34 19949.35 40925.33 61215.21
MAPE 23.77 34.73 29.41 18.27
MAE 6634.74 15694.30 32983.98 49329.41
Model 6 RMSE 7785.81 17495.57 39707.54 63175.11
MAPE 23.57 30.47 28.24 18.26
MAE 6201.85 12337.61 32363.83 45967.80
Model 7 RMSE 11808.89 18664.58 39572.46 59069.45
MAPE 30.02 34.88 29.44 17.22
MAE 9216.95 14718.56 32225.28 47848.42
TABLE VIII: Forecasting Results For DTW Clustering For All the Models
Architecture Performance Measure K=150 K=200 K=300 K=400
Model 1 RMSE 9091.27 22106.80 45810.48 58472.99
MAPE 30.95 32.96 30.10 16.73
MAE 6838.48 17257.40 38437.05 41366.98
Model 2 RMSE 10813.55 22583.19 44151.13 55400.72
MAPE 33.23 36.81 27.99 15.55
MAE 8679.21 18376.66 38882.59 44018.64
Model 3 RMSE 10473.49 22270.56 46514.71 52870.65
MAPE 34.66 35.56 31.10 14.96
MAE 8673.15 17798.95 38882.59 42851.77
Model 4 RMSE 11842.51 24389.59 47628.19 55854.06
MAPE 39.73 41.19 30.16 14.77
MAE 9810.48 19744.74 39298.22 43544.99
Model 5 RMSE 10108.43 22233.00 47295.88 58430.44
MAPE 34.51 32.64 28.56 16.99
MAE 8211.27 17594.76 37570.27 47555.34
Model 6 RMSE 9236.03 22085.80 44145.01 56069.45
MAPE 28.60 40.66 29.43 15.80
MAE 7330.97 18176.19 35406.69 44342.42
Model 7 RMSE 9497.82 27619.06 49525.77 58274.16
MAPE 46.49 48.50 34.59 17.04
MAE 7409.28 23853.25 44812.72 48237.39
TABLE IX: Forecasting Results For Method A Clustering For All Models
Architecture Performance Measure K=150 K=200 K=300 K=400
Model 1 RMSE 8725.95 20729.13 43542.89 58448.04
MAPE 25.28 34.24 39.22 17.21
MAE 6489.63 16288.55 35578.36 46585.77
Model 2 RMSE 8548.82 19254.18 41610.98 54579.99
MAPE 25.25 34.52 32.57 16.29
MAE 6489.63 15560.71 33975.50 43712.66
Model 3 RMSE 7622.47 18147.92 39783.22 52437.68
MAPE 24.01 35.53 28.77 15.03
MAE 5370.08 14110.13 32501.80 42247.53
Model 4 RMSE 9816.80 20884.15 38387.65 54828.56
MAPE 32.52 38.85 28.55 16.03
MAE 7551.09 16539.43 31221.07 43252.25
Model 5 RMSE 8207.30 22581.19 43494.66 54094.91
MAPE 26.53 42.91 34.20 16.24
MAE 6263.41 18314.41 35456.24 43176.79
Model 6 RMSE 8911.74 21290.79 38730.60 55825.40
MAPE 28.56 37.78 26.31 17.15
MAE 6959.01 16913.70 31149.76 44147.86
Model 7 RMSE 9876.62 22005.03 40076.35 56126.07
MAPE 32.36 47.18 32.39 17.05
MAE 7538.23 17745.30 34269.08 44933.42
TABLE X: Forecasting Results For Method B Clustering For All Models

Analyzing Tables VIII, IX, and X reveals that DTW clustering provides the better overall improvement to the MAPE, RMSE, and MAE scores for K = 150, 200, 300, while for K = 400 both Method A and Method B outperform DTW. However, comparing Tables III, VIII, IX, and X reveals that, regardless of the clustering algorithm, forecasting accuracy can be significantly improved when the data is clustered. For example, consider K = 300: with no clustering, the best MAPE score is 38.52 for Model 3; with DTW clustering, Model 3 achieves a MAPE score of 27.57, and with Methods A and B it achieves MAPE scores of 31.10 and 28.77, respectively.

V-A Influence of Adding More Time Series Measurements

Thus far, we have used the gas measurement along with the static data to predict future gas values. However, an important question is: can the prediction errors be improved by including more time series measurements?

To answer this question, we consider all possible combinations of the multivariate time series columns and compare their prediction errors with Table III. To illustrate the concept and save simulation time, the idea is tested only with Model 1 and Model 3.

Model 1 Measurement Columns Performance Measure K=150 K=200 K=300 K=400
First and Second Columns RMSE 39450.01 56527.14 84660.19 86700.9
MAPE 150.38 91.89 62.33 25.86
MAE 29850.68 44804.45 69698.23 70742.31
First and Third Columns RMSE 10218.19 27168.96 58130.94 72662.22
MAPE 34.36 47.14 45.89 20.66
MAE 7122.69 19889.56 45096.30 55307.80
Second and Third Columns RMSE 10039.66 26831.03 60065.97 69787.29
MAPE 34.53 43.16 41.94 21.14
MAE 7311.21 18740.35 46414.14 55175.58
First, Second and Third RMSE 10319.54 26638.20 55663.48 72562.50
MAPE 36.33 42.75 45.02 20.97
MAE 7257.43 19944.56 49724.80 54406.62
Model 3 Measurement Columns Performance Measure K=150 K=200 K=300 K=400
First and Second Columns RMSE 42960.62 52977.89 94306.01 88665.62
MAPE 98.84 106.53 80.84 25.47
MAE 30140.50 41278.36 78561.96 72679.35
First and Third Columns RMSE 10973.27 28103.06 53204.26 71642.94
MAPE 34.33 48.45 39.75 20.41
MAE 7886.56 20174.22 41335.21 61775.57
Second and Third Columns RMSE 10417.47 26390.57 55478.94 71627.27
MAPE 34.56 47.36 39.31 21.55
MAE 7036.22 20164.46 45503.44 55921.33
First, Second and Third RMSE 10287.64 26195.90 63499.57 71540.70
MAPE 36.33 45.75 38.02 18.97
MAE 7257.43 19944.56 49724.80 52406.62
TABLE XI: Influence of Adding more Measurements on Future Gas Value (Third Column)

Comparing Table XI to Table III reveals that adding more time series measurements does not, in general, improve the RMSE, MAPE, and MAE. The high errors in Table XI for the first entries of Model 1 and Model 3 suggest that the first and second columns alone are not good predictors of the third column.

VI Conclusion

In this paper, we provided a comprehensive analysis of time series forecasting using neural networks. We introduced several architectures for combining static and dynamic measurements to forecast time series with neural networks. Our results reveal that, first, no single architecture is right for predicting all future values of a time series; multiple architectures should be tested, and the best one should be determined using a model selection criterion such as cross-validation. Second, adding static data to the forecast does not necessarily reduce the forecasting errors.

We discussed two primary clustering approaches, namely distance-based and feature-based clustering. We showed that DTW outperforms feature-based clustering for most architectures; however, the feature-based clustering methods outperform DTW in terms of speed and time complexity and still improve the forecasting accuracy significantly compared to the no-clustering case.

Finally, we presented several methods, such as anomaly detection and data imputation, to prepare time series data for forecasting and clustering.

References

  • [1] S. Aghabozorgi, A. S. Shirkhorshidi, and T. Y. Wah (2015) Time-series clustering–a decade review. Information Systems 53, pp. 16–38. Cited by: §I.
  • [2] C. H. Aladag, E. Egrioglu, and C. Kadilar (2009) Forecasting nonlinear time series with a hybrid methodology. Applied Mathematics Letters 22 (9), pp. 1467–1470. Cited by: §I.
  • [3] A. A. Ariyo, A. O. Adewumi, and C. K. Ayo (2014) Stock price prediction using the arima model. In 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation, pp. 106–112. Cited by: §I.
  • [4] K. Bandara, C. Bergmeir, and S. Smyl (2020) Forecasting across time series databases using recurrent neural networks on groups of similar series: a clustering approach. Expert Systems with Applications 140, pp. 112896. Cited by: §I.
  • [5] G. E. Batista, M. C. Monard, et al. (2002) A study of k-nearest neighbour as an imputation method.. HIS 87 (251-260), pp. 48. Cited by: §II-B.
  • [6] K. Chakraborty, K. Mehrotra, C. K. Mohan, and S. Ranka (1992) Forecasting the behavior of multivariate time series using neural networks. Neural networks 5 (6), pp. 961–970. Cited by: §I, §I.
  • [7] M. Christ, N. Braun, J. Neuffer, and A. W. Kempa-Liehr (2018) Time series feature extraction on basis of scalable hypothesis tests (tsfresh–a python package). Neurocomputing 307, pp. 72–77. Cited by: §II-C.
  • [8] J. Contreras, R. Espinola, F. J. Nogales, and A. J. Conejo (2003) ARIMA models to predict next-day electricity prices. IEEE transactions on power systems 18 (3), pp. 1014–1020. Cited by: §I.
  • [9] V. Ş. Ediger and S. Akar (2007) ARIMA forecasting of primary energy demand by fuel in turkey. Energy policy 35 (3), pp. 1701–1708. Cited by: §I.
  • [10] M. R. Hassan and B. Nath (2005) Stock market forecasting using hidden markov model: a new approach. In 5th International Conference on Intelligent Systems Design and Applications (ISDA’05), pp. 192–196. Cited by: §I.
  • [11] T. Hastie, R. Mazumder, J. D. Lee, and R. Zadeh (2015) Matrix completion and low-rank svd via fast alternating least squares. The Journal of Machine Learning Research 16 (1), pp. 3367–3402. Cited by: §II-B.
  • [12] S. Hochreiter (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6 (02), pp. 107–116. Cited by: §II-D.
  • [13] T. Hsu, S. Liou, Y. Wang, Y. Huang, et al. (2019) Enhanced recurrent neural network for combining static and dynamic features for credit card default prediction. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1572–1576. Cited by: §I.
  • [14] R. J. Hyndman and G. Athanasopoulos (2018) Forecasting: principles and practice. OTexts. Cited by: §II-E.
  • [15] R. J. Hyndman and A. B. Koehler (2006) Another look at measures of forecast accuracy. International journal of forecasting 22 (4), pp. 679–688. Cited by: §II-E.
  • [16] R. J. Hyndman, E. Wang, and N. Laptev (2015) Large-scale unusual time series detection. In 2015 IEEE international conference on data mining workshop (ICDMW), pp. 1616–1619. Cited by: §II-C.
  • [17] A. Leontjeva and I. Kuzovkin (2016) Combining static and dynamic features for multivariate sequence classification. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 21–30. Cited by: §I.
  • [18] T. W. Liao (2005) Clustering of time series data—a survey. Pattern recognition 38 (11), pp. 1857–1874. Cited by: §I.
  • [19] P. Malhotra, L. Vig, G. Shroff, and P. Agarwal (2015) Long short term memory networks for anomaly detection in time series. In Proceedings, pp. 89. Cited by: §II-A.
  • [20] R. Mazumder, T. Hastie, and R. Tibshirani (2010) Spectral regularization algorithms for learning large incomplete matrices. Journal of machine learning research 11 (Aug), pp. 2287–2322. Cited by: §II-B.
  • [21] M. Müller (2007) Dynamic time warping. Information retrieval for music and motion, pp. 69–84. Cited by: §II-C.
  • [22] T. Oates, L. Firoiu, and P. R. Cohen (1999) Clustering time series with hidden markov models and dynamic time warping. In Proceedings of the IJCAI-99 workshop on neural, symbolic and reinforcement learning methods for sequence learning, pp. 17–21. Cited by: §I.
  • [23] R. Pascanu, T. Mikolov, and Y. Bengio (2013) On the difficulty of training recurrent neural networks. In International conference on machine learning, pp. 1310–1318. Cited by: §II-D.
  • [24] B. Pincombe (2005) Anomaly detection in time series of graphs using arma processes. Asor Bulletin 24 (4), pp. 2. Cited by: §II-A.
  • [25] S. Salvador and P. Chan (2007) Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis 11 (5), pp. 561–580. Cited by: §II-C.
  • [26] S. Siami-Namini, N. Tavakoli, and A. S. Namin (2018) A comparison of arima and lstm in forecasting time series. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 1394–1401. Cited by: §I.
  • [27] D. F. Silva and G. E. Batista (2016) Speeding up all-pairwise dynamic time warping matrix calculation. In Proceedings of the 2016 SIAM International Conference on Data Mining, pp. 837–845. Cited by: §II-C.
  • [28] M. Tadayon and G. Pottie (2019) Predicting student performance in an educational game using a hidden markov model. arXiv preprint arXiv:1904.11857. Cited by: §I.
  • [29] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman (2001) Missing value estimation methods for dna microarrays. Bioinformatics 17 (6), pp. 520–525. Cited by: §II-B.
  • [30] X. Wang, K. Smith, and R. Hyndman (2006) Characteristic-based clustering for time series data. Data mining and knowledge Discovery 13 (3), pp. 335–364. Cited by: §II-C.
  • [31] Y. S. Yen, Z. W. Chen, Y. R. Guo, and M. C. Chen (2019) Integration of static and dynamic analysis for malware family classification with composite neural network. arXiv preprint arXiv:1912.11249. Cited by: §I.
  • [32] G. P. Zhang (2003) Time series forecasting using a hybrid arima and neural network model. Neurocomputing 50, pp. 159–175. Cited by: §I, §I.