Accurate forecasting of time series is a fundamental challenge for machine learning and it naturally requires understanding the sequential behavior. Traditionally, many studies on time series have focused on learninginternal patterns (e.g., auto-correlation) in a given series. However, most of the time series in many practical applications are highly correlated to other time series. For instance, a stock price of a particular company is likely dependent on the prices of other companies in similar business field [19, 18]. Since many multivariate time series are mutually related, it is desired to learn not only intra-series patterns but also inter-series patterns. Moreover, the time series we observe have shown seasonality or abrupt rising/dropping patterns due to specific events. Recently, many works [20, 2] have analyzed a given series as interactions of different components such as the trend, seasonality, and event. While such an analysis is interpretable and mostly works well on many time series, a key open challenge is developing methods that can directly learn entangled relations among multiple time series. In other words, it is beneficial to build a model to handle both characteristics, 1) highly correlated multivariate and 2) multiple structural components, such as trend, seasonality, and events, for accurate time series forecasts.
In this paper, we develop a framework to model correlated and structural time series. Inspired by the recent work of using 1D CNN for text classification 
, we propose to use 1D CNN layer with multiple kernels to learn the complicated interactions among multivariate time series. However, unlike 1D CNN text classification, our model performs the 1D convolution over the feature dimension, rather than the temporal dimension. To improve the forecasts of time series, especially the forecasts of the trend, we further propose to use a 2D convolutional filters over the temporal dimension. In computer vision, convolutional filters are applied to local area of a given image, which aim to extract features corresponding to the structure of the filters. For example, if the filter is, the convolution computes the difference of adjacent pixels’ values and can help to detect an edge feature. These properties are equivalently applicable to time series and the gradient-based filters can be very useful to model a function based on first order derivative of each series. This is somewhat similar in spirit to the first-order differencing, a technique frequently used by time series forecasting models, such as ARIMA, to extract/eliminate the trend in the temporal sequences.
Recently, many deep learning models are data driven, instead of being based on carefully hand-crafted architectures. Among them, recurrent neural networks (RNNs) including LSTM and GRU  are especially designed to learn sequential dependencies. These models are powerful to learn long term patterns by reducing the troublesome vanishing gradients 
. To memorize the long-term sequential pattern while leveraging the correlations among different time series, we stack a LSTM layer on top of our 1D and 2D CNN layers. Since the features extracted by the 1D and 2D convolutional filters can be stacked along the temporal dimension, these extracted features are naturally time series, and thus can be fed into LSTMs to learn the long range temporal pattern changes in trend. The combination of CNN and RNN has been studied in video recognition and natural language model .  used convolutional filters to extract features from still images and did not consider any differencing features between adjacent images.  has multiple character-level convolutional filters to extract character -gram features, however, they do not use 1D convolutional filters to extract correlation features.
Besides the trend component learned by the convolutional filters, the seasonal and the event-based components are considered as additional components that compose the time series. These components have unique properties such as periodic behavior and discontinuous peaks, respectively. These properties give us a prior knowledge to build an expressive model. For instance, a known periodical function (e.g., sinusoidal functions) can be a base function to represent the cyclic characteristics efficiently. For the event-based components, a binary vector encoding is used to tell a model when the events happen. In our model, the weights of both components would be updated by data-driven learning.
To summarise, our paper has the following contributions:
Our model employs 1D CNN and 2D CNN to learn multivariate correlations and gradient-based features (weighted differencing). By stacking LSTM layer on top of the CNNs, our proposed model is able to learn long-term dependencies (e.g., trend) of the extracted features.
Our structural model is capable of decomposing time series into multiple latent structural factors including trend, seasonality, and event components, which is helpful for the interpretation of the time series forecasts.
Our model is compared with several state-of-the-art baselines on Amazon AWS daily billing data for different services and the closing price for stocks, and shows improvement in forecasting accuracy.
Ii The Model
In this section, we will introduce our proposed model by starting from structural time series (STM) analysis, a framework based on which our model is built. Then, we will describe problem formulation and each component of our model in more details.
Ii-a Structural Time Series Analysis
The STM framework assumes that a temporal sequence is generated by several independent additive components, which have a direct interpretation in terms of quantities of interest. There exist many variations of STM models, depending how each component is defined. Here we are interested in a specific STM model , in which a time series is decomposed into three components: the trend, seasonality, and event-related item. Denote the time series at time as , the model can be formulated as
where , , and are trend term, seasonality term, and event term, respectively. The STM model enables data analysts to analyze the trend, seasonal patterns, and irregular event effects of the data. It also allows for incorporation of prior knowledge such as seasonality, irregular event effects into time series forecasting. Many time series models, such as Prophet  and Bayesian STM , fall into the category of this formulation.
Ii-B Problem Formulation
Multivariate time series are often encountered in many applications, and in many cases there exists strong correlation among multiple temporal sequences. For instance, running a machine learning task may involve the use of Amazon EC2, S3, and relational database services (RDS). It is therefore expected the billings for these services are correlated. It is desirable if correlation among time series of different types is considered. Unlike the STM model, which assumes uni-variate time series input, we proposed a Deep Multivariate Structural Time series Model, which generalizes STM to handle multiple time series. We assume the input multivariate time series are , where is the number of input time series. We are interested in forecasting the -th time series by rewriting Equation (1) as
In our model, the three components , , and are learned jointly. The trend component is learned by a CNN-LSTM hierarchical deep architecture, which takes multiple temporal sequences as the input, and is capable of leveraging correlations among the input temporal sequences and first-order temporal trending information. The seasonality and event components are learned by a fully connected neural net and linear model, respectively. The architecture of our model is demonstrated in Figure 1.
Ii-C The Trend Component
The trend is the component of a time series that represents variations of low frequency in a time series. Predicting the trend is very important in many real-world applications, such as sales analysis, budget planning, and stock investment. In our model, we employ a CNN-LSTM hierarchical deep neural net to learn the trend of the time series. The CNN-LSTM neural net consists of two types of CNN layers, a one dimensional CNN layer and a two dimensional temporal CNN, followed by a LSTM layer. The hierarchical model enjoys several attractive properties. It is capable of (1) leveraging the correlations among multiple time series, (2) capturing the trend of the time series by convolution of signals temporally, and (3) memorizing changes of trend over long sequences. In the following part, we will describe the input and output, the CNN layers, and the LSTM layer of our hierarchical neural net.
(1) Input and output: The input for our trend learner includes time series. In addition, one important goal for extracting the trend component is to smooth out the irregular roughness to uncover a clearer signal. Many time series analysis models often only look at one-step lagged historical observation, which will introduce randomness in the forecasts of trend. To reduce the randomness in one lagged sample, our trend learner takes time series at time , each with
step lagged observations as the input. The input are then fed into CNN-LSTM hierarchical layer. The output is the estimation of the trend component for the time series that we would like to forecast.
(2) 1D CNN layer: The dependency among multiple time series is leveraged using a 1D CNN layer consisting of kernels of size . The multiple input time series can be reshaped as a matrix. The convolution is performed on each column of the matrix. The output of 1D CNN layer is a matrix . Assume is the -th kernel learned by the model. If we pick the first entry in the -th row of , it can be expressed by
From Equation 3, we can find the output of 1D CNN layer is equal to a weighted summation of the multiple input temporal sequences. By learning the weighting coefficients, we expect leveraging signals from all correlated time series can improve the forecasts of the time series we are interested in.
(3) 2D temporal CNN layer: Differencing [9, 21] is a technique commonly used in many time series models for extract the trend. For instance, ARIMA model [1, 12, 5, 22] usually uses differencing to remove the temporal trend to produce more stationary time series. Take the first-order differencing as an example, it employs the transformation
The 2D temporal CNN layer employed by our model is somewhat similar in spirit to differencing. We would like to extract a weighted trend/differencing of current signals for better forecasting of the future trend. Our 2D temporal CNN layer consists of kernels of size , and the convolution is performed on both columns and rows of the input matrix. The output of this layer is a matrix . Assume is the -th kernel learned by the model, then the first entry in the -th row of can be expressed by
where in the right side is somewhat similar to differencing. However, the original differencing assumes fixed weighting coefficients for lagged observations (e.g. and for and respectively in Equation 4), whereas the coefficients and in our model are learned.
(4) LSTM layer:
We use 1D and 2D CNN layers for feature extraction on multiple temporal sequences, and then stack with a long short-term memory (LSTM)[14, 8] layer followed by a Dense layer (with linear activation) to support temporal sequence forecast. The input for the LSTM layer is a matrix , which is obtained by row-wise concatenation of matrices and . Note that has columns, whereas has
columns. Zero padding is performed on the last column ofbefore concatenation. Details can be found in Figure 2.
Ii-D Modelling The Seasonality Component
Time series often shows cyclical patterns. For instance, sales in auto industry and real estate typically have changes of one year cyclical length. Fourier series  has been used to approximate the cyclical behavior of the time series in existing works [11, 7, 20]. In the study of Fourier series, complicated but periodic functions are written as the sum of simple waves mathematically represented by sines and cosines. Inspired by this, we model the seasonality component of the time series as a nonlinear function of Fourier terms (pairs of sines and cosines of different frequencies).
is the Fourier terms at time , and . is the length of seasonal cycle (e.g. for weekly seasonality, and for yearly seasonality), and is the number of Fourier terms, or the number of pre-defined frequencies. A smaller corresponds to applying a low-pass filter to the seasonality components, whereas a larger allows for fitting cyclical patters with more frequent changes. is a nonlinear function which is learned by a fully connected neural network.
Note that a time series may contain multiple seasonal cycles (e.g. weekly, yearly) of different lengths. In that case, multiple sets of Fourier terms with different seasonal cycle lengths can be used as the input for the fully connected neural net.
Ii-E Modeling The Event Component
Specific events (e.g. holidays) can have significant impact on daily time series, causing strong upward or downward changes. For instance, retailers often see boosted sales during Black Friday and Thanksgiving Day. It is difficult for the trend and seasonality components to capture these changes. However, the event effect is usually fixed and somewhat predictable. Therefore, we add an additional component to model the fixed event effect. We encoded the events as a binary regressor , where is the number of unique event types. Unlike seasonality component, is assumed to be a simple linear function of a regressor
where . The reason linear function is used is due to the fact that in this work the time series we use do not have complex event types. For data with a variation of different event types, nonlinear function, such as fully connect neural network, can be used.
We evaluate our model, both qualitatively (forecasts of the trend, seasonality, and event components) and quantitatively (in its ability to forecast future values), by performing experiments on two real-world data sets. One is a set of Amazon AWS daily billing data for different services, such as Amazon Elastic Compute Cloud (EC2), Simple Storage Service (S3), and Relational Database Service (RDS), in the first few months of 2018. The other dataset is the closing price for stocks of Verizon and T-mobile, both of which are telecommunications companies, from 2013 to 2017.
Iii-a Experimental Setup and Baselines
For the Amazon AWS billing time series, weekly seasonality () is considered since we find weekly cyclical patterns in the billings for specific service, such as S3. For the stock price, yearly seasonality (
) is considered. We used a smaller neural network with 4 filters for both 1D CNN and 2D CNN for AWS billing data, since this data has relatively fewer samples, and a relatively larger network with 16 filters for 1D and 2D CNN for the stock data since it has more samples. For both datasets, the number of hidden units for our model’s LSTM layer is set as 8. Our model is implemented in Python using the open source neural network library Keras. Mean absolute error between the ground truth signal and estimation is used as the objective function, and Adam algorithm is utilized for optimization. We compared our proposed method with the following four state-of-the-art baselines.
(1) Seasonal ARIMA: Autoregressive integrated moving average (ARIMA) model is a popular and widely used statistical method for time series forecasting. The basic assumption behind the ARIMA model is that a univariate time series is a combination of autoregressive (AR) and moving average (MA) lags which capture the autocorrelation within the time series. In seasonal ARIMA, AR and MA terms predict future values of time series using data values and errors at times with lags that are multiples of cyclical length.
(2) Prophet: Prophet  is a Bayesian nonlinear univariate generative model for time series forecasting which was proposed by Facebook in 2018. Like our method, Prophet is also a structural time series analysis method, which explicitly models the trend, seasonality, and event effects. The parameters for cyclical length and event date for the Prophet are set the same as our model.
(3) Univariate LSTM: The univariate LSTM model consists of one LSTM layer followed by a dense layer with linear activation. The input for the univariate LSTM model is a single time series. The univariate LSTM considers the same number of lagged steps of historical observations for training as our model does.
(4) Multivariate LSTM: The multivariate LSTM (MLSTM) model have the same neural net architecture as the univariate LSTM. Unlike univaraite LSTM, the input for the multivariate LSTM model are multiple temporal sequences.
Iii-B Forecasts of The Trend, Seasonality and Event
We picked two AWS billing time series to evaluate the proposed model’s capability of learning the trend, seasonality and event components. Multivariate input is not considered here. The input for our model is a single time series, and the output is a one-step short-term forecast.
The first time series is the S3 billing for an AWS account from February 2018 to March 2018, as shown in Figure 3(a). The data from February 1 to March 12 is used as the training set, and the rest is used for the testing. The time series has apparent changes in trend, and a weekly -shaped upward-and-downward cyclical change, as indicated by the red dashed curve. The trend, seasonality, and event components for the forecasting in Figure 3(a) are demonstrated in 3(b) and (c). As observed, the learned trend component is consistent with the overall trend of the ground truth signal, and the -shaped cyclical changes are successfully captured by the seasonality component.
The second time series we pick is the EC2 billing for the same AWS account from January 2018 to May 2018, as shown in Figure 4(a). It has a relatively flat trend, but several spikes on the first day of each month. The spikes are caused by Reserved Instance Fee (RI Fee), which is usually charged on the first day of each month. We, therefore, set the first day of each month as the event date, and use a binary scalar to encode whether it is an event date. Although here we only consider one type of event, more types of events can be easily encoded by using a binary vector as we mentioned before. The trend, seasonality, and event components are shown in Figure 4(b) and (c). As the figure shows, both the trend and seasonality components are relatively flat. However, the spiky signals caused by RI Fee is captured by the event component.
Iii-C Multivariate and Differencing
To evaluate the benefits of introducing multivariate time series learning and the weighted “differencing” (using 2D CNN), we compare our model (Model 3), which includes 1D and 2D CNN and is trained using multivariate time series, with two variations. The first variation (Model 1) is trained on univariate input time series with only 2D CNN, and the second variation (Model 2) is trained on multivariate input time series but without 2D CNN (only consider 1D CNN). Two data sets are used in this experiment. The first one is the daily Amazon AWS billings for RDS and S3 for two AWS accounts from January 2018 to May 2018. For multivariate models, the inputs are RDS and S3 billing sequences, and the predicted time series is the S3 time series. Data from January to April are used for training, and the forecasts of billings for May are evaluated. The second data is the daily stock closing prices for Verizon (stock symbol: VZ) and T-mobile (stock symbol: TMUS) from 2013 to 2017. For multivariate models, the input are the closing prices of the two stocks, and then forecasts are performed on one stock each time. Data from January 2013 to April 2017 are used for training, and the remaining are used as testing data. Root mean square error (RMSE) is used to evaluate the performance. As the amplitudes for different time series are different, we also consider relative RMSE (RRMSE), which normalizes RMSE with the mean of the time series.
where is the ground truth value of the time series at time . The results evaluated by RMSE and relative RMSE are shown in Table I and II respectively. By comparing between Model 1 and Model 3, we find including correlated multivariate time series, such as AWS RDS billing or other stocks, leads to reduced errors for forecasts of S3 and stock closing prices. Furthermore, introducing the weighted differencing features extracted by 2D temporal CNN will result in more accurate trend forecasts, as indicated by the comparison between Model 2 and Model 3.
|Data||Model 1||Model 2||Model 3|
|Data||Model 1||Model 2||Model 3|
Iii-D Comparing with the Baselines
The comparison of our DeepMSTM model with Prophet, seasonal ARIMA, univariate LSTM and multivariate LSTM (MLSTM) are demonstrated in Table III and IV. The data sets and the training/testing data splits used in this experiment are the same as the last section. As can be observed, DeepMSTM consistently perform better than other baselines. In our experiments, we find multivariate LSTM always outperforms univariate LSTM, which further evidence the fact that including correlated multivariate time series can help. Furthermore, by comparing our DeepMSTM with multivariate LSTM model, we can conclude that the 1D CNN and 2D temporal CNN indeed lead to improved performance for the forecasts.
The forecasts for the Amazon AWS S3 billings of the two AWSIDs obtained by Prophet, seasonal ARIMA, MLSTM, and DeepMSTM are shown in Figure 5
. Due to the relatively small error differences among all models for the stock data, we do not present the plot for stock data here. For AWSID 1, the S3 billing shows apparent weekly spikes, this seasonality change is successfully forecasted by seasonal ARIMA, Prophet, and DeepMSTM since all the three methods consider seasonality, whereas forecast by MLSTM is very smooth, and fails to learn the weekly changes. For AWSID 2, it is interesting to notice that both MLSTM and DeepMSTM seem to somehow remember the “Z” shaped changing pattern they observed far before, as indicated by the purple dotted lines. This is probably due to the LSTM layer’s capability to “memorize” long-term patterns. while the forecasts for seasonal ARIMA and Prophet mainly depends on the trend of the most recent historical data.
We have presented the CNN-LSTM model for structural time series analysis which learns the trend, seasonality, and event components jointly. It allows data analysts to incorporate their knowledge/understanding about cyclical patterns and irregular changes related to specific events into the model. The extracted trend, seasonality, and event components can give interpretable results for further qualitative analyses. Our method enriches the family of time series analysis models by seamlessly leveraging the correlations among multivariate time series, as well as extracting the weighted differencing/trend feature, and leads to improved performance in time series forecasts. To the best of our knowledge, our method is the first framework that extends structural time series models to a deep architecture.
-  (1970) Time series analysis: forecasting and control. In , pp. . Cited by: §II-C.
-  (2015) Inferring causal impact using bayesian structural time-series models. In Annals of Applied Statistics, pp. 247–274. Cited by: §I, §II-A.
-  (1921) Chapter 7: fourier’s series. Introduction to the Theory of Fourier’s Series and Integrals 1 (), pp. 196. Cited by: §II-D.
On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259. Cited by: §I.
-  (1997) Time series for macroeconomics and finance. In , pp. . Cited by: §II-C.
Long-term recurrent convolutional networks for visual recognition and description.
Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2625–2634. Cited by: §I.
-  (2013) Fourier analysis for demand forecasting in fashion company. In International Journal of Engineering Business Management, pp. . Cited by: §II-D.
-  (2017) Deep learning: a practitioner’s approach. In O’Reilly Media, pp. . Cited by: §II-C.
-  (1994) Time series analysis. In , pp. . Cited by: §II-C.
-  (1990) Estimation procedures for structural time series models. In Journal of Forecasting, pp. 89–108. Cited by: §II-A.
-  (1997) Structural time series models. In Handbook of Statistics, pp. 261–302. Cited by: §II-D.
-  (1994) Time series modelling of water resources and environmental systems. In , pp. . Cited by: §II-C.
-  (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §I.
-  (1997) Long short-term memory. In Neural Computation, pp. 1735–1780. Cited by: §II-C.
The vanishing gradient problem during learning recurrent neural nets and problem solutions. International J of Uncertainty, Fuzziness and Knowledge-Based Systems 6 (02), pp. 107–116. Cited by: §I.
-  (2016) Character-aware neural language models.. In AAAI, pp. 2741. Cited by: §I.
-  (2014) Convolutional neural networks for sentence classification. In Conference on Empirical Methods in NLP, Cited by: §I.
-  (2018) Minute-ahead stock price forecasting based on singular spectrum analysis and support vector regression. Applied Mathematics and Computation 320, pp. 444–451. Cited by: §I.
A hybrid arima and support vector machines model in stock price forecasting. Omega, pp. 497. Cited by: §I.
-  (2018) Forecasting at scale. In The American Statistician, pp. 37–45. Cited by: §I, §II-A, §II-D, §III-A.
-  (2005) Analysis of financial time series. In , pp. . Cited by: §II-C.
-  (2003) Time series forecasting using a hybrid arima and neural network model. pp. 159–175. Cited by: §II-C.