Multi-variable LSTM neural network for autoregressive exogenous model

06/17/2018 · by Tian Guo, et al. · EPFL, ETH Zurich

In this paper, we propose a multi-variable LSTM capable of accurate forecasting and variable importance interpretation for time series with exogenous variables. Current attention mechanisms in recurrent neural networks mostly focus on the temporal aspect of the data and fall short of characterizing variable importance. To this end, we develop a multi-variable LSTM equipped with tensorized hidden states that learns hidden states for individual variables, which give rise to our mixture temporal and variable attention. Based on this attention mechanism, we infer and quantify variable importance. Extensive experiments using real datasets with a Granger-causality test and a synthetic dataset with ground truth demonstrate the prediction performance and interpretability of the multi-variable LSTM in comparison to a variety of baselines. The results exhibit the prospect of the multi-variable LSTM as an end-to-end framework for both forecasting and knowledge discovery.


1 Introduction

Our daily life is now surrounded by various types of sensors, ranging from smartphones, video cameras and the Internet of things to robots. The observations yielded by such devices over time are naturally organized as time series data Qin et al. (2017); Yang et al. (2015). In this paper, we focus on time series with exogenous variables. Specifically, given a target time series as well as an additional set of time series corresponding to exogenous variables, a predictive model that uses the historical observations of both the target and the exogenous variables to predict the future values of the target variable is an autoregressive exogenous model, referred to as ARX. ARX models have been successfully used for modeling the input-output behavior of many complex systems DiPietro et al. (2017); Zemouri et al. (2010); Lin et al. (1996). In addition to forecasting, the interpretability of such models is essential for deployment, e.g., understanding the different importance of exogenous variables w.r.t. the evolution of the target one Hu et al. (2018); Siggiridou & Kugiumtzis (2016); Zhou et al. (2015).

Meanwhile, long short-term memory units (LSTM) Hochreiter & Schmidhuber (1997) and the gated recurrent unit (GRU) Cho et al. (2014), a class of recurrent neural networks (RNN), have achieved great success in various applications on sequence and time series data Lipton et al. (2015); Wang et al. (2016); Guo et al. (2016); Lin et al. (2017); Sutskever et al. (2014).

However, current recurrent neural networks fall short of achieving interpretability at the variable level when they are used for ARX models. For instance, when fed with the multi-variable historical observations of the target and exogenous variables, an LSTM blindly blends the information of all variables into the memory cells and hidden states that are used for prediction. Therefore, it is intractable to distinguish the contribution of individual variables to the prediction by looking into the hidden states Zhang et al. (2017).

Recently, attention-based neural networks Bahdanau et al. (2014); Vinyals et al. (2015); Chorowski et al. (2015); Choi et al. (2016); Qin et al. (2017); Cinar et al. (2017) have been proposed to enhance the ability of RNNs to selectively use long-term memory as well as their interpretability. Current attention mechanisms are mostly applied to hidden states across time steps, thereby capturing temporally important information but failing to uncover the different importance of input variables.

To this end, we aim to develop an LSTM-based ARX model that provides a unified framework for both forecasting and knowledge discovery. In particular, the contribution is fourfold. First, we propose the multi-variable LSTM, referred to as MV-LSTM, with tensorized hidden states and an associated updating scheme, such that each element of the hidden state tensor encodes information for a certain input variable. Second, using the variable-wise hidden states, we develop a probabilistic mixture representation of temporal and variable attention. Learning and forecasting in MV-LSTM are built on top of this mixture attention mechanism. Third, we propose to interpret and quantify variable importance via the posterior inference of variable attention. Lastly, we perform an extensive experimental evaluation of MV-LSTM against statistical, machine learning and neural network baselines to demonstrate the prediction performance and interpretability of MV-LSTM. The idea of MV-LSTM easily applies to other RNN variants, e.g., GRU, or to stacking multiple MV-LSTM layers; we leave these for future work.

2 Related work

Vanilla recurrent neural networks have been used to study the nonlinear ARX problem in Zemouri et al. (2010); Diaconescu (2008); DiPietro et al. (2017). Tank et al. (2017, 2018) proposed to identify causal variables w.r.t. the target one via sparse regularization. Our MV-LSTM is intended to provide accurate prediction as well as interpretability of variable importance via an attention mechanism.

Recently, attention mechanisms have gained increasing popularity due to their ability to let recurrent neural networks select parts of the hidden states across time steps, as well as to enhance the interpretability of networks Bahdanau et al. (2014); Vinyals et al. (2015); Choi et al. (2016); Vaswani et al. (2017); Lai et al. (2017); Qin et al. (2017); Cinar et al. (2017); Choi et al. (2018); Guo et al. (2018). However, current attention mechanisms are normally applied to hidden states across time steps and, for multi-variable input sequences, fail to characterize variable-level importance. Only some very recent studies Choi et al. (2016); Qin et al. (2017) attempted to develop attention mechanisms capable of handling multi-variable sequence data. Qin et al. (2017); Choi et al. (2016) first use neural networks to learn weights on input variables and then feed the weighted input data into another neural network Qin et al. (2017) or use it directly for forecasting Choi et al. (2016). In our MV-LSTM, temporal and variable attention are jointly derived from hidden states of individual variables learned via one end-to-end network.

Another line of related research concerns the tensorization and selective updating of hidden states in recurrent neural networks. Novikov et al. (2015); Do et al. (2017) proposed to represent hidden states as matrices. He et al. (2017) developed a tensorized LSTM in which hidden states are represented by tensors to enhance the capacity of networks without additional parameters. Koutnik et al. (2014); Neil et al. (2016); Kuchaiev & Ginsburg (2017) proposed to partition the hidden layer into separate modules serving as independent feature groups. In MV-LSTM, hidden states are organized in a matrix, each element of which encodes information specific to one input variable. Meanwhile, the hidden states are updated jointly, so that the inter-correlation among input variables is still captured.

3 Multi-Variable LSTM

Assume we have $N-1$ exogenous time series and a target series $y$ of length $T$, where the observations of both the exogenous and the target series are real-valued. (Vectors are assumed to be in column form throughout this paper.) By stacking the exogenous time series and the target series, we define a multi-variable input sequence as $\mathbf{X}_T = \{\mathbf{x}_1, \cdots, \mathbf{x}_T\}$, where $\mathbf{x}_t = [x_t^1, \cdots, x_t^{N-1}, y_t]^{\top} \in \mathbb{R}^{N}$ is the multi-variable input at time step $t$ and $x_t^n$ is the observation of the $n$-th exogenous time series at time $t$. Given $\mathbf{X}_T$, we aim to learn a non-linear mapping $\mathcal{M}(\cdot)$ to predict the next value of the target series, namely $\hat{y}_{T+1} = \mathcal{M}(\mathbf{X}_T)$. The model $\mathcal{M}$ should be interpretable in the sense that we can understand which exogenous variables are crucial for the prediction.

3.1 Network Architecture

Inspired by He et al. (2017); Kuchaiev & Ginsburg (2017), in MV-LSTM we develop tensorized hidden states and an associated update scheme that ensure each element of the hidden state tensor encapsulates information exclusively from a certain variable of the input. As a result, a flexible temporal and variable attention mechanism can be developed on top of such hidden states.


Figure 1: A toy example of a MV-LSTM with a two-variable input sequence, where the hidden state matrix holds a $d$-dimensional hidden state per variable. Panel (a) exhibits the derivation of the cell update matrix $\tilde{\mathbf{j}}_t$, i.e. Eq. (1). Purple and blue colors correspond to the two variables. Rectangles with circles inside represent the input data and hidden states at one step. Grey areas outline the corresponding weights, hidden states, and input. Panel (b) demonstrates the process of gate calculation, i.e. Eq. (2). (best viewed in color)

Specifically, we define the hidden state tensor (matrix) at time step $t$ in a MV-LSTM layer as $\tilde{\mathbf{h}}_t = [\mathbf{h}_t^1, \cdots, \mathbf{h}_t^N] \in \mathbb{R}^{d \times N}$, where $\mathbf{h}_t^n \in \mathbb{R}^{d}$ and $D = N \cdot d$ is the overall size of the layer. The element $\mathbf{h}_t^n$ of $\tilde{\mathbf{h}}_t$ is a hidden state vector specific to the $n$-th input variable. Then, we define the input-to-hidden transition tensor (matrix) as $\mathcal{U}_j = [\mathbf{U}_j^1, \cdots, \mathbf{U}_j^N]$, where $\mathbf{U}_j^n \in \mathbb{R}^{d}$. The hidden-to-hidden transition tensor is defined as $\mathcal{W}_j = [\mathbf{W}_j^1, \cdots, \mathbf{W}_j^N]$, where $\mathbf{W}_j^n \in \mathbb{R}^{d \times d}$.

Similar to standard LSTM neural networks Hochreiter & Schmidhuber (1997), MV-LSTM has input, forget and output gates as well as memory cells to control the update of the hidden state matrix. Given the newly incoming input $\mathbf{x}_t$ at time $t$ and the hidden state matrix $\tilde{\mathbf{h}}_{t-1}$ and memory cell $\mathbf{c}_{t-1}$ up to time $t-1$, we formulate the iterative update process in a MV-LSTM layer as follows:

(1) $\tilde{\mathbf{j}}_t = \tanh\big(\mathcal{W}_j \circledast \tilde{\mathbf{h}}_{t-1} + \mathcal{U}_j \circledast \mathbf{x}_t + \mathbf{b}_j\big)$
(2) $\mathbf{i}_t,\ \mathbf{f}_t,\ \mathbf{o}_t = \sigma\big(\mathbf{W}\,[\mathrm{vec}(\tilde{\mathbf{h}}_{t-1}) \oplus \mathbf{x}_t] + \mathbf{b}\big)$
(3) $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \mathrm{vec}(\tilde{\mathbf{j}}_t)$
(4) $\tilde{\mathbf{h}}_t = \mathrm{mat}\big(\mathbf{o}_t \odot \tanh(\mathbf{c}_t)\big)$

Overall, Eq. (1) gives rise to the cell update matrix $\tilde{\mathbf{j}}_t = [\mathbf{j}_t^1, \cdots, \mathbf{j}_t^N]$, where column $\mathbf{j}_t^n$ corresponds to the update w.r.t. input variable $n$. The terms $\mathcal{W}_j \circledast \tilde{\mathbf{h}}_{t-1}$ and $\mathcal{U}_j \circledast \mathbf{x}_t$ respectively capture the update from the hidden states of the previous step and from the new input. Concretely, the tensor-dot operation $\circledast$ in Eq. (1) returns the product of two tensors along a specified axis. Thus, given the tensors $\mathcal{W}_j = [\mathbf{W}_j^1, \cdots, \mathbf{W}_j^N]$ and $\tilde{\mathbf{h}}_{t-1} = [\mathbf{h}_{t-1}^1, \cdots, \mathbf{h}_{t-1}^N]$, the tensor-dot of $\mathcal{W}_j$ and $\tilde{\mathbf{h}}_{t-1}$ along the variable axis is expressed as $\mathcal{W}_j \circledast \tilde{\mathbf{h}}_{t-1} = [\mathbf{W}_j^1 \mathbf{h}_{t-1}^1, \cdots, \mathbf{W}_j^N \mathbf{h}_{t-1}^N]$, where $\mathbf{W}_j^n \mathbf{h}_{t-1}^n \in \mathbb{R}^{d}$. Additionally, we define $\mathcal{U}_j \circledast \mathbf{x}_t$ as the variable-wise product between the transition matrix and the input vector: $\mathcal{U}_j \circledast \mathbf{x}_t = [\mathbf{U}_j^1 x_t^1, \cdots, \mathbf{U}_j^N x_t^N]$.

Eq. (2) derives the input gate $\mathbf{i}_t$, forget gate $\mathbf{f}_t$ and output gate $\mathbf{o}_t$ by using $\mathrm{vec}(\tilde{\mathbf{h}}_{t-1})$ and $\mathbf{x}_t$. All these gates are vectors of dimension $D$. $\mathrm{vec}(\cdot)$ refers to the vectorization operation, where in Eq. (2) it concatenates the columns of $\tilde{\mathbf{h}}_{t-1}$ into a vector of dimension $D$. $\oplus$ is the concatenation operation.

$\sigma(\cdot)$ represents the element-wise sigmoid activation function. Each element in the gate vectors is derived based on $\mathrm{vec}(\tilde{\mathbf{h}}_{t-1}) \oplus \mathbf{x}_t$, which carries information regarding all input variables, so as to utilize the cross-correlation between input variables.

In Eq. (3), the memory cell vector $\mathbf{c}_t \in \mathbb{R}^{D}$ is updated by using the previous cell $\mathbf{c}_{t-1}$ and the vectorized cell update matrix $\mathrm{vec}(\tilde{\mathbf{j}}_t)$ obtained in Eq. (1). $\odot$ denotes element-wise multiplication. Finally, in Eq. (4) the new hidden state matrix at $t$ is the matricization of $\tanh(\mathbf{c}_t)$ weighted by the output gate, where in our case matricization $\mathrm{mat}(\cdot)$ is the operation that reshapes a vector in $\mathbb{R}^{D}$ back into a matrix in $\mathbb{R}^{d \times N}$.
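To make the variable-wise update concrete, the following is a minimal NumPy sketch of one step of Eqs. (1)-(4) under the notation above. It is an illustration rather than the released implementation: the joint gate parameterization and the tanh nonlinearity in Eq. (1) follow the standard LSTM convention, and all parameter names are assumptions.

```python
import numpy as np

def mv_lstm_step(x_t, h_prev, c_prev, params):
    """One MV-LSTM step for N input variables with d hidden units per variable.

    x_t:    (N,)    multi-variable input at time t
    h_prev: (d, N)  hidden state matrix, column n belongs to variable n
    c_prev: (d*N,)  memory cell vector
    params: dict with
        W_j: (N, d, d)  hidden-to-hidden transition tensor
        U_j: (N, d)     input-to-hidden transition tensor
        b_j: (d, N)     bias of the cell update
        W_g: (3*d*N, d*N + N), b_g: (3*d*N,)  joint gate parameters
    """
    N = x_t.shape[0]
    d = h_prev.shape[0]
    D = d * N

    # Eq. (1): variable-wise cell update matrix j_t (tensor-dot plus per-variable input term)
    j_t = np.empty((d, N))
    for n in range(N):
        j_t[:, n] = np.tanh(params["W_j"][n] @ h_prev[:, n]
                            + params["U_j"][n] * x_t[n]
                            + params["b_j"][:, n])

    # Eq. (2): input/forget/output gates from vec(h_{t-1}) concatenated with x_t,
    # mixing all variables to capture their cross-correlation
    z = np.concatenate([h_prev.flatten(order="F"), x_t])  # vec() stacks columns
    gates = 1.0 / (1.0 + np.exp(-(params["W_g"] @ z + params["b_g"])))
    i_t, f_t, o_t = gates[:D], gates[D:2 * D], gates[2 * D:]

    # Eq. (3): element-wise cell update keeps the variable-wise blocks separated
    c_t = f_t * c_prev + i_t * j_t.flatten(order="F")

    # Eq. (4): new hidden state matrix = matricization of o_t ⊙ tanh(c_t)
    h_t = (o_t * np.tanh(c_t)).reshape(d, N, order="F")
    return h_t, c_t
```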

3.2 Mixture Temporal and Variable Attention

After feeding a sequence $\{\mathbf{x}_1, \cdots, \mathbf{x}_T\}$ into MV-LSTM, we obtain a sequence of hidden state matrices, denoted by $\{\tilde{\mathbf{h}}_1, \cdots, \tilde{\mathbf{h}}_T\}$, where $\tilde{\mathbf{h}}_t = [\mathbf{h}_t^1, \cdots, \mathbf{h}_t^N]$ and element $\mathbf{h}_t^n \in \mathbb{R}^{d}$. This sequence is then used in our mixture temporal and variable attention mechanism, which facilitates the subsequent learning, inference and interpretation of variable importance.


Figure 2: Illustration of the mixture temporal and variable attention process in a MV-LSTM layer with a two-variable input sequence, where the hidden state matrix holds a $d$-dimensional hidden state per variable. (best viewed in color)

Specifically, our attention mechanism is based on a probabilistic mixture of experts model Zong et al. (2018); Graves (2013); Shazeer et al. (2017) over $y_{T+1}$ as:

(5) $p(y_{T+1} \,|\, \mathbf{X}_T) = \sum_{n=1}^{N} p(y_{T+1} \,|\, z_{T+1} = n,\, \mathbf{X}_T)\; p(z_{T+1} = n \,|\, \mathbf{X}_T)$

where $\mathbf{X}_T = \{\mathbf{x}_1, \cdots, \mathbf{x}_T\}$ and $z_{T+1}$ is a latent categorical variable over $\{1, \cdots, N\}$.

In Eq. (5), we introduce the latent random variable $z_{T+1}$ into the density function of $y_{T+1}$ to govern the generation of $y_{T+1}$ conditional on the historical data $\mathbf{X}_T$. $z_{T+1}$ is a discrete variable over the set of $N$ values corresponding to the $N$ input variables. Mathematically, the joint density of $y_{T+1}$ and $z_{T+1}$ is decomposed into a component model (i.e. $p(y_{T+1} \,|\, z_{T+1}=n, \mathbf{X}_T)$) and the prior of $z_{T+1}$ conditioned on $\mathbf{X}_T$ (i.e. $p(z_{T+1}=n \,|\, \mathbf{X}_T)$). The component model characterizes the density of $y_{T+1}$ conditioned on the historical data of variable $n$, while the prior of $z_{T+1}$ controls to what extent $y_{T+1}$ is generated by variable $n$, enabling the model to adaptively adjust the contribution of each variable to fit $y_{T+1}$. For $z_{T+1} = N$, i.e. the target variable itself, the component refers to the autoregressive part.

Evaluating each part of Eq. (5) amounts to the temporal and variable attention process using the hidden states in MV-LSTM. Temporal attention is first applied to the sequence of hidden states of each individual variable, so as to obtain a summarized hidden state per variable. The history of each variable is encoded in such temporally summarized hidden states, which are used to calculate $p(y_{T+1} \,|\, z_{T+1}=n, \mathbf{X}_T)$ and $p(z_{T+1}=n \,|\, \mathbf{X}_T)$. Then, since the prior in Eq. (5) is a discrete distribution on $\{1, \cdots, N\}$, it naturally characterizes the attention on the exogenous and autoregressive parts for predicting $y_{T+1}$.

In detail, the weights and biases of the temporal attention process are defined per variable as $\mathbf{f}^n \in \mathbb{R}^{d}$ and $b_e^n$, where $\mathbf{f}^n$ and $b_e^n$ correspond to the $n$-th variable. The temporal attention is then derived as:

(6) $e_k^n = \tanh\big( (\mathbf{f}^n)^{\top} \mathbf{h}_k^n + b_e^n \big), \qquad k = 1, \cdots, T,\quad n = 1, \cdots, N$
(7) $\alpha_k^n = \exp(e_k^n) \Big/ \sum_{k'=1}^{T} \exp(e_{k'}^n)$
(8) $\mathbf{g}^n = \sum_{k=1}^{T} \alpha_k^n \, \mathbf{h}_k^n$
(9) $\tilde{\mathbf{h}}^n = \mathbf{g}^n \oplus \mathbf{h}_T^n, \qquad \mathbf{H} = [\tilde{\mathbf{h}}^1, \cdots, \tilde{\mathbf{h}}^N]$

In Eq. (6), the score matrix $\mathbf{E} = [e_k^n]$ is derived via the variable-wise (tensor-dot) transformation of the hidden states, where element $e_k^n$ is the attention score on the $k$-th of the previous $T$ steps of variable $n$ (other methods of deriving attention scores are compatible with MV-LSTM Cinar et al. (2017); Qin et al. (2017); we use this simple one-layer transformation in the present paper). Then, the attention weights in Eq. (7) are obtained by performing softmax on each row of $\mathbf{E}$, i.e. over the $T$ time steps of each variable. Eq. (8) gives rise to the variable-wise context matrix $\mathbf{G} = [\mathbf{g}^1, \cdots, \mathbf{g}^N]$. Recall that the hidden state matrix at $T$ is $\tilde{\mathbf{h}}_T = [\mathbf{h}_T^1, \cdots, \mathbf{h}_T^N]$. By concatenating $\mathbf{G}$ and $\tilde{\mathbf{h}}_T$ along the hidden dimension in Eq. (9), we obtain the context-enhanced hidden state matrix $\mathbf{H} = [\tilde{\mathbf{h}}^1, \cdots, \tilde{\mathbf{h}}^N]$, where $\tilde{\mathbf{h}}^n = \mathbf{g}^n \oplus \mathbf{h}_T^n$ is a hidden state summarizing the temporal information of variable $n$.
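As an illustration of Eqs. (6)-(9), the sketch below applies temporal attention independently to each variable's sequence of hidden states; the array layout and parameter names mirror the notation above but are otherwise assumptions, not the authors' code.

```python
import numpy as np

def temporal_attention(H, F, b):
    """Per-variable temporal attention, Eqs. (6)-(9).

    H: (T, d, N) hidden state matrices h_1..h_T; column n of each is variable n's state
    F: (N, d)    per-variable attention weight vectors f^n
    b: (N,)      per-variable attention biases b_e^n
    Returns the context-enhanced matrix of shape (2*d, N), column n = [g^n ; h_T^n].
    """
    T, d, N = H.shape
    # Eq. (6): attention scores e_k^n = tanh(f^n · h_k^n + b_e^n)
    E = np.tanh(np.einsum("nd,tdn->nt", F, H) + b[:, None])   # (N, T)
    # Eq. (7): softmax over time steps, one distribution per variable (row)
    A = np.exp(E - E.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                          # (N, T)
    # Eq. (8): variable-wise context g^n = sum_k alpha_k^n h_k^n
    G = np.einsum("nt,tdn->dn", A, H)                          # (d, N)
    # Eq. (9): concatenate context and last hidden state per variable
    return np.concatenate([G, H[-1]], axis=0)                  # (2d, N)
```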

Now we can formulate the individual component models in Eq. (5) as:

(10) $p(y_{T+1} \,|\, z_{T+1} = n,\, \mathbf{X}_T) = \mathcal{N}\big(\mu_n,\ \sigma_n^2\big), \qquad \mu_n = (\mathbf{w}^n)^{\top} \tilde{\mathbf{h}}^n + b_w^n$

where we impose a normal distribution over $y_{T+1}$, and $\mathbf{w}^n \in \mathbb{R}^{2d}$ and $b_w^n$ are the output weight and bias of variable $n$. In experiments, we simply set $\sigma_n$ to one. Meanwhile, by using the summarized hidden states $\{\tilde{\mathbf{h}}^n\}_{n=1}^{N}$, we derive $p(z_{T+1}=n \,|\, \mathbf{X}_T)$ to characterize variable-level attention as:

(11) $p(z_{T+1} = n \,|\, \mathbf{X}_T) = \dfrac{\exp\big((\mathbf{v}^n)^{\top} \tilde{\mathbf{h}}^n + b_v^n\big)}{\sum_{n'=1}^{N} \exp\big((\mathbf{v}^{n'})^{\top} \tilde{\mathbf{h}}^{n'} + b_v^{n'}\big)}$

where $\mathbf{v}^n \in \mathbb{R}^{2d}$ is the variable attention weight and $b_v^n$ is the bias.
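Continuing the sketch, the component means of Eq. (10) (with unit variance, as in the experiments) and the prior variable attention of Eq. (11) can be computed from the summarized hidden states as follows; the parameter names are again illustrative.

```python
import numpy as np

def mixture_heads(H_tilde, W_out, b_out, V_att, b_att):
    """Component means (Eq. 10) and prior variable attention (Eq. 11).

    H_tilde: (2*d, N) summarized hidden states, one column per variable
    W_out:   (N, 2*d), b_out: (N,)  per-variable output weights/biases
    V_att:   (N, 2*d), b_att: (N,)  per-variable attention weights/biases
    """
    # Eq. (10): mean of the Gaussian component of each variable (sigma fixed to 1)
    mu = np.einsum("nk,kn->n", W_out, H_tilde) + b_out
    # Eq. (11): prior attention p(z = n | X_T) via softmax over variables
    s = np.einsum("nk,kn->n", V_att, H_tilde) + b_att
    prior = np.exp(s - s.max())
    prior /= prior.sum()
    return mu, prior
```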

3.3 Learning, Inference and Interpretation

In the learning phase, denote by $\Theta$ the set of parameters in MV-LSTM. Given a set of $M$ training sequences $\{\mathbf{X}_T^{(m)}\}_{m=1}^{M}$ with targets $\{y_{T+1}^{(m)}\}_{m=1}^{M}$, the loss function to optimize is defined based on the negative log-likelihood of the mixture model plus a regularization term as:

(12) $\mathcal{L}(\Theta) = -\sum_{m=1}^{M} \log \Big( \sum_{n=1}^{N} p\big(y_{T+1}^{(m)} \,|\, z_{T+1}=n,\, \mathbf{X}_T^{(m)}\big)\, p\big(z_{T+1}=n \,|\, \mathbf{X}_T^{(m)}\big) \Big) + \lambda \lVert \Theta \rVert_2^2$

In the inference phase, the prediction of $y_{T+1}$ is obtained by the weighted sum of the component means Graves (2013); Bishop (1994): $\hat{y}_{T+1} = \sum_{n=1}^{N} \mu_n \, p(z_{T+1}=n \,|\, \mathbf{X}_T)$.

For the interpretation of variable importance via mixture attention, we use the posterior of $z_{T+1}$, i.e.

(13) $p(z_{T+1}=n \,|\, y_{T+1}, \mathbf{X}_T) = \dfrac{p(y_{T+1} \,|\, z_{T+1}=n, \mathbf{X}_T)\; p(z_{T+1}=n \,|\, \mathbf{X}_T)}{\sum_{n'=1}^{N} p(y_{T+1} \,|\, z_{T+1}=n', \mathbf{X}_T)\; p(z_{T+1}=n' \,|\, \mathbf{X}_T)}$

which takes the prediction performance of individual variables into account. We refer to the derived $p(z_{T+1} \,|\, y_{T+1}, \mathbf{X}_T)$ and $p(z_{T+1} \,|\, \mathbf{X}_T)$ respectively as posterior and prior attention.

Meanwhile, note that we obtain the posterior of $z_{T+1}$ for each training sequence. In order to attain a uniform view of variable importance over the whole dataset, we define the importance of input variable $n$ by aggregating its posterior attention over all $M$ training sequences as follows:

(14) $I_n = \dfrac{1}{M} \sum_{m=1}^{M} p\big(z_{T+1}=n \,|\, y_{T+1}^{(m)}, \mathbf{X}_T^{(m)}\big)$
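A compact sketch of the learning and interpretation quantities, assuming Gaussian components with unit variance and using a simple average for the aggregation in Eq. (14): the single-sequence negative log-likelihood of Eq. (12) (regularization omitted), the point prediction used at inference, the posterior attention of Eq. (13), and the resulting variable importance.

```python
import numpy as np

def gaussian_pdf(y, mu):
    # unit-variance Gaussian density, as used for the component models
    return np.exp(-0.5 * (y - mu) ** 2) / np.sqrt(2.0 * np.pi)

def nll(y, mu, prior):
    """Eq. (12), single-sequence negative log-likelihood (regularization omitted)."""
    return -np.log(np.sum(gaussian_pdf(y, mu) * prior) + 1e-12)

def predict(mu, prior):
    """Inference: weighted sum of the component means."""
    return np.sum(mu * prior)

def posterior_attention(y, mu, prior):
    """Eq. (13): posterior of z given the observed target."""
    w = gaussian_pdf(y, mu) * prior
    return w / (w.sum() + 1e-12)

def variable_importance(Y, MU, PRIOR):
    """Eq. (14): aggregate posterior attention over M training sequences.

    Y: (M,), MU: (M, N), PRIOR: (M, N)
    """
    post = np.stack([posterior_attention(y, m, p) for y, m, p in zip(Y, MU, PRIOR)])
    return post.mean(axis=0)   # one importance value per variable, summing to 1
```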

4 Experiments

In this part, we report the experimental results. Due to page limitations, please refer to the appendix for the full results.

4.1 Datasets

We use three real datasets (https://archive.ics.uci.edu/ml/datasets.html) and one synthetic dataset to evaluate MV-LSTM and the baselines.

PM2.5: This dataset contains hourly PM2.5 measurements and the associated meteorological data in Beijing, China. The PM2.5 measurement is the target series. The exogenous time series include dew point, temperature, pressure, combined wind direction, cumulated wind speed, cumulated hours of snow, and cumulated hours of rain. The hourly records are organized into multi-variable sequences for training and evaluation.

Energy: This dataset records the appliance energy use in a low-energy building. The target series is the energy consumption logged every 10 minutes. The exogenous time series consist of the indoor temperature conditions of individual rooms as well as outside weather information, including temperature, wind speed, humidity and dew point from the nearest weather station.

Plant: This dataset records the time series of energy production of a photovoltaic (PV) power plant in Italy Ceci et al. (2017). The exogenous data consist of multiple time series of weather conditions (such as temperature, cloud coverage, etc.).

Synthetic: This dataset is generated based on the idea of the Lorenz model Tank et al. (2017, 2018). The exogenous series are generated via ARMA processes with randomized parameters. The target series is driven by an ARMA process plus a coupled subset of the exogenous series with randomized autoregressive orders, and thus the synthetic dataset has ground truth of variable importance. In total, we generate sequences with 10 exogenous time series each.

For each dataset, we perform Augmented Dickey-Fuller (ADF) and Kwiatkowski-Phillips-Schmidt-Shin (KPSS) tests to determine whether differencing of the time series is necessary Kirchgässner et al. (2012). The window size, namely $T$ in Sec. 3, is set to 30. We further study the prediction performance under different window sizes in the supplementary material. Each dataset is split into training, validation and testing sets.
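For illustration, a hedged sketch of this preprocessing (not the authors' exact pipeline): stationarity checks with statsmodels' ADF and KPSS tests, and the sliding-window construction of $(\mathbf{X}_T, y_{T+1})$ pairs with $T = 30$; the differencing decision rule shown is one common convention.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

def needs_differencing(series, alpha=0.05):
    """Combine ADF (H0: unit root) and KPSS (H0: stationary) to flag differencing."""
    adf_p = adfuller(series, autolag="AIC")[1]
    kpss_p = kpss(series, regression="c")[1]
    return adf_p > alpha or kpss_p < alpha   # non-stationary by either test

def make_windows(data, target_col, T=30):
    """Slice a (length, N) multi-variable series into (X_T, y_{T+1}) training pairs."""
    X, y = [], []
    for start in range(len(data) - T):
        X.append(data[start:start + T])          # window of T multi-variable inputs
        y.append(data[start + T, target_col])    # next value of the target series
    return np.asarray(X), np.asarray(y)
```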

4.2 Baselines and Evaluation Setup

The first category of statistical baselines includes:

STRX is the structural time series model with exogenous variables Scott & Varian (2014); Radinsky et al. (2012). It is formulated in terms of unobserved components via the state space method.

ARIMAX augments the classical time series autoregressive integrated moving average model (ARIMA) by adding regression terms on exogenous variables Hyndman & Athanasopoulos (2014).

The second category of machine learning baselines includes popular tree ensemble methods and regularized regression:

RF refers to random forests, an ensemble learning method consisting of several decision trees Liaw et al. (2002); Meek et al. (2002), which has been used in time series prediction Patel et al. (2015).

XGT refers to extreme gradient boosting Chen & Guestrin (2016), the application of boosting methods to regression trees Friedman (2001).

ENET represents Elastic-Net, which is a regularized regression method combining both L1 and L2 penalties of the lasso and ridge methods Zou & Hastie (2005) and has been used in time series analysis Liu et al. (2010); Bai & Ng (2008).

The third category of neural network baselines includes:

RETAIN pre-trains two recurrent neural networks to respectively derive weights on temporal steps and variables, which are then used to perform prediction Choi et al. (2016).

DUAL is built upon encoder-decoder architecture Qin et al. (2017), which uses an encoder LSTM to learn weights on input variables and then feeds pre-weighted input data into a decoder LSTM for forecasting.

cLSTM identifies Granger-causal variables via sparse regularization on the weights of an LSTM Tank et al. (2017, 2018).

Additionally, we have two variants of MV-LSTM, denoted MV-Indep and MV-Fusion, which are developed to evaluate the efficacy of the updating and mixture mechanisms of MV-LSTM. MV-Indep builds an independent recurrent neural network for each input variable, whose outputs are fed into the mixture attention process to obtain the prediction. The only difference between MV-Fusion and MV-LSTM is that, instead of using mixture attention, MV-Fusion fuses the hidden states of all variables into one hidden state via variable attention.

In ARIMAX, the orders of the autoregressive and moving-average terms are set via the autocorrelation and partial autocorrelation functions. For RF and XGT, the hyper-parameters tree depth and number of iterations are chosen via grid search. For XGT, L2 regularization is added, with the coefficient also selected by grid search. As for ENET, the coefficients of the L2 and L1 penalties are likewise selected by grid search. For these machine learning baselines, multi-variable input sequences are flattened into feature vectors.
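As a sketch of how such baselines can be set up with flattened windows and grid search, using scikit-learn; the candidate grids below are placeholders, since the exact search ranges are not reproduced here, and the random data only stands in for the real windows.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Placeholder data: M windows of length T with N variables, flattened to feature vectors
X = np.random.randn(200, 30, 8)
y = np.random.randn(200)
X_flat = X.reshape(X.shape[0], -1)

searches = {
    "RF": GridSearchCV(RandomForestRegressor(),
                       {"max_depth": [3, 5, 7], "n_estimators": [100, 300]},      # placeholder grid
                       cv=TimeSeriesSplit(n_splits=3),
                       scoring="neg_root_mean_squared_error"),
    "ENET": GridSearchCV(ElasticNet(max_iter=10000),
                         {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]},  # placeholder grid
                         cv=TimeSeriesSplit(n_splits=3),
                         scoring="neg_root_mean_squared_error"),
}
for name, gs in searches.items():
    gs.fit(X_flat, y)
    print(name, gs.best_params_, -gs.best_score_)  # best hyper-parameters and RMSE
```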

We implemented MV-LSTM and the neural network baselines in TensorFlow (code will be released upon request). For training, we used Adam Kingma & Ba (2014) with mini-batches. The sizes of the recurrent and dense layers in the baselines are chosen by grid search. The size of the MV-LSTM recurrent layer is set by the number of neurons per variable, likewise selected by grid search. The dropout rate, learning rate and L2 regularization coefficient are tuned on the validation set. We train each approach multiple times and report the average performance.

We consider two metrics to measure the prediction performance. Specifically, RMSE is defined as $\mathrm{RMSE} = \sqrt{\tfrac{1}{K}\sum_{k=1}^{K} (y_k - \hat{y}_k)^2}$ and MAE is defined as $\mathrm{MAE} = \tfrac{1}{K}\sum_{k=1}^{K} \lvert y_k - \hat{y}_k \rvert$, where $y_k$ and $\hat{y}_k$ denote the observed and predicted values over the $K$ testing instances.

4.3 Prediction Performance

We report the prediction errors of all approaches in Table 1 and Table 2. In Table 1, we observe that most of the time STRX and ARIMAX underperform the machine learning and neural network solutions. Among RF, XGT and ENET, XGT mostly performs best. As for the neural network baselines, DUAL outperforms RETAIN and cLSTM as well as the machine learning baselines on the Synthetic and Energy datasets. Our MV-LSTM outperforms the baselines across datasets. MV-LSTM also performs slightly better than both MV-Fusion and MV-Indep, while additionally providing the interpretation benefit shown in the next group of experiments. The above observations also apply to the MAE results in Table 2, so we skip the detailed description.

Methods \ Dataset    Synthetic    Energy    Plant    PM2.5
STRX
ARIMAX
RF
XGT
ENET
DUAL
RETAIN
cLSTM
MV-Fusion
MV-Indep
MV-LSTM
Table 1: Average test RMSE and std. errors
Methods \ Dataset    Synthetic    Energy    Plant    PM2.5
STRX
ARIMAX
RF
XGT
ENET
DUAL
RETAIN
cLSTM
MV-Fusion
MV-Indep
MV-LSTM
Table 2: Average test MAE and std. errors

4.4 Model Interpretation

In this part, we compare MV-LSTM to the baselines that also provide interpretability of variable importance, i.e. DUAL, RETAIN and cLSTM. For the real datasets without ground truth about variable importance, we perform the Granger causality test Arnold et al. (2007) to identify causal variables, which are considered as the important variables for the comparison. For the synthetic dataset, we evaluate by observing whether an approach assigns high importance values to the ground-truth driving variables.
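A minimal sketch of such a pairwise Granger-causality screening using statsmodels; the lag order, the significance level, and taking the minimum p-value across lags are simplifying assumptions, not the exact protocol used in the paper.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

def granger_important(target, exog_matrix, maxlag=5, alpha=0.05):
    """Return indices of exogenous variables that Granger-cause the target.

    target: (L,) target series; exog_matrix: (L, N_exog) exogenous series.
    """
    important = []
    for n in range(exog_matrix.shape[1]):
        # column order expected by statsmodels: [effect, candidate cause]
        res = grangercausalitytests(np.column_stack([target, exog_matrix[:, n]]),
                                    maxlag=maxlag, verbose=False)
        # SSR F-test p-value for each tested lag; keep the variable if any lag is significant
        pvals = [res[lag][0]["ssr_ftest"][1] for lag in range(1, maxlag + 1)]
        if min(pvals) < alpha:
            important.append(n)
    return important
```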

Similar to MV-LSTM, we can collect the variable attentions of each sequence in DUAL and RETAIN and obtain importance values by Eq. (14). Note that the variable attentions obtained in RETAIN are unnormalized values. In cLSTM, we identify important variables by the non-zero corresponding weights of the neural network Tank et al. (2018) and thus have no importance values to report in Table 3.

Table 3 reports the top-ranked variables, with the corresponding importance values in brackets. The higher the importance value, the more crucial the variable. On the PM2.5 dataset, the three variables identified as Granger-causal (i.e. dew point, cumulated wind speed and pressure) are also ranked on top by the variable importance in MV-LSTM. As pointed out by Liang et al. (2015), dew point and pressure are the most influential factors, and strong wind, which brings dry and fresh air, is crucial as well. This is in line with the variable importance detected by MV-LSTM. On the contrary, the baselines miss some of these variables. Likewise, for the Plant dataset, as suggested by Mekhilef et al. (2012); Ghazi & Ip (2014), in addition to cloud cover, humidity, wind speed and temperature affect the efficiency of PV cells and are thus important for power generation.

Dataset / Method / Rank of variables according to importance

PM2.5:
MV-LSTM: Dew point, Cumulated wind speed, Pressure
DUAL: Temperature (0.29), Dew point (0.26), Pressure (0.21)
RETAIN: Pressure (1.14), Cumulated hours of snow (0.04), Cumulated wind speed (-0.42)
cLSTM: Dew point, Pressure, Temperature

Plant:
MV-LSTM: Cloud cover, Wind speed, Temperature, Humidity (0.07)
DUAL: Humidity, Cloud cover, Wind speed, Temperature (0.14)
RETAIN: Plant temp., Wind speed, Dew point, Temperature (0.25)
cLSTM: Dew point, Humidity, Plant temperature, Wind bearing

Energy:
MV-LSTM: Living room temp., Office room temp., Parents room temp.
DUAL: Humidity outside (0.17), Wind speed (0.16), Living room temp. (0.10)
RETAIN: Building outside temp. (0.13), Parents room temp. (0.11), Outside temp. (0.11)
cLSTM: Humidity outside, Office room temp., Living room temp.

Synthetic:
MV-LSTM: Variable 3 (0.18), Variable 2 (0.18), Variable 8 (0.17), Variable 6 (0.15), Variable 4 (0.13)
DUAL: Variable 1 (0.12), Variable 0 (0.12), Variable 7 (0.11), Variable 3 (0.10), Variable 6 (0.10)
RETAIN: Variable 10 (1.08), Variable 8 (0.09), Variable 9 (0.07), Variable 4 (0.06), Variable 6 (0.05)
cLSTM: Variable 7, Variable 6, Variable 0, Variable 3, Variable 1

*A color box marks a variable that is important according to the Granger causality test or the ground truth.

Table 3: Interpretation of variable importance.

Figure 3: Histogram visualization of variable attentions w.r.t. two example variables in PM2.5, for (a) MV-LSTM, (b) DUAL, and (c) RETAIN. For MV-LSTM, both prior and posterior attentions are shown; DUAL and RETAIN only have attention weights.

Furthermore, Figure 3 visualizes the histograms of attention values of two example variables in the PM2.5 dataset. In MV-LSTM, compared with priors, the posterior attention of the variable “dew point” shifts rightward, while the posterior of variable “cumulated hours of rain” moves towards zero. It indicates that posterior attention rectifies the prior by taking into account the predictive likelihood. As a result, the variable importance derived from posterior attention is more distinguishable and informative, compared with the attention weights in DUAL and RETAIN.

5 Conclusion

In this paper, we propose an interpretable multi-variable LSTM for time series with exogenous variables. Based on the tensorized hidden states of MV-LSTM, we develop a mixture temporal and variable attention mechanism, which enables us to infer and quantify variable importance w.r.t. the target series. Extensive experiments on a synthetic dataset with ground truth and on real datasets with the Granger causality test exhibit the superior prediction performance and interpretability of MV-LSTM.

References

  • Arnold et al. (2007) Andrew Arnold, Yan Liu, and Naoki Abe. Temporal causal modeling with graphical granger methods. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 66–75. ACM, 2007.
  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2014.
  • Bai & Ng (2008) Jushan Bai and Serena Ng. Forecasting economic time series using targeted predictors. Journal of Econometrics, 146(2):304–317, 2008.
  • Bishop (1994) Christopher M Bishop. Mixture density networks. 1994.
  • Ceci et al. (2017) Michelangelo Ceci, Roberto Corizzo, Fabio Fumarola, Donato Malerba, and Aleksandra Rashkovska. Predictive modeling of pv energy production: How to set up the learning task for a better prediction? IEEE Transactions on Industrial Informatics, 13(3):956–966, 2017.
  • Chen & Guestrin (2016) Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In SIGKDD, pp. 785–794. ACM, 2016.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
  • Choi et al. (2016) Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter Stewart. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, pp. 3504–3512, 2016.
  • Choi et al. (2018) Heeyoul Choi, Kyunghyun Cho, and Yoshua Bengio. Fine-grained attention mechanism for neural machine translation. Neurocomputing, 284:171–176, 2018.
  • Chorowski et al. (2015) Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In Advances in neural information processing systems, pp. 577–585, 2015.
  • Cinar et al. (2017) Yagmur Gizem Cinar, Hamid Mirisaee, Parantapa Goswami, Eric Gaussier, Ali Aït-Bachir, and Vadim Strijov. Position-based content attention for time series forecasting with sequence-to-sequence rnns. In International Conference on Neural Information Processing, pp. 533–544. Springer, 2017.
  • Diaconescu (2008) Eugen Diaconescu. The use of narx neural networks to predict chaotic time series. Wseas Transactions on computer research, 3(3):182–191, 2008.
  • DiPietro et al. (2017) Robert DiPietro, Christian Rupprecht, Nassir Navab, and Gregory D Hager. Analyzing and exploiting narx recurrent neural networks for long-term dependencies. In International Conference on Learning Representations, 2017.
  • Do et al. (2017) Kien Do, Truyen Tran, and Svetha Venkatesh. Matrix-centric neural networks. arXiv preprint arXiv:1703.01454, 2017.
  • Friedman (2001) Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pp. 1189–1232, 2001.
  • Ghazi & Ip (2014) Sanaz Ghazi and Kenneth Ip. The effect of weather conditions on the efficiency of pv panels in the southeast of uk. Renewable Energy, 69:50–59, 2014.
  • Graves (2013) Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
  • Guo et al. (2016) Tian Guo, Zhao Xu, Xin Yao, Haifeng Chen, Karl Aberer, and Koichi Funaya. Robust online time series prediction with recurrent neural networks. In 2016 IEEE DSAA, pp. 816–825. IEEE, 2016.
  • Guo et al. (2018) Tian Guo, Tao Lin, and Yao Lu. An interpretable lstm neural network for autoregressive exogenous model. In workshop track at International Conference on Learning Representations, 2018.
  • He et al. (2017) Zhen He, Shaobing Gao, Liang Xiao, Daxue Liu, Hangen He, and David Barber. Wider and deeper, cheaper and faster: Tensorized lstms for sequence learning. In Advances in Neural Information Processing Systems, pp. 1–11, 2017.
  • Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Hu et al. (2018) Ziniu Hu, Weiqing Liu, Jiang Bian, Xuanzhe Liu, and Tie-Yan Liu. Listening to chaotic whispers: A deep learning framework for news-oriented stock trend prediction. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 261–269. ACM, 2018.
  • Hyndman & Athanasopoulos (2014) Rob J Hyndman and George Athanasopoulos. Forecasting: principles and practice. OTexts, 2014.
  • Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kirchgässner et al. (2012) Gebhard Kirchgässner, Jürgen Wolters, and Uwe Hassler. Introduction to modern time series analysis. Springer Science & Business Media, 2012.
  • Koutnik et al. (2014) Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. A clockwork rnn. In International Conference on Machine Learning, pp. 1863–1871, 2014.
  • Kuchaiev & Ginsburg (2017) Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for lstm networks. arXiv preprint arXiv:1703.10722, 2017.
  • Lai et al. (2017) Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long-and short-term temporal patterns with deep neural networks. arXiv preprint arXiv:1703.07015, 2017.
  • Liang et al. (2015) Xuan Liang, Tao Zou, Bin Guo, Shuo Li, Haozhe Zhang, Shuyi Zhang, Hui Huang, and Song Xi Chen. Assessing beijing’s pm2. 5 pollution: severity, weather impact, apec and winter heating. In Proc. R. Soc. A, volume 471, pp. 20150257. The Royal Society, 2015.
  • Liaw et al. (2002) Andy Liaw, Matthew Wiener, et al. Classification and regression by randomforest. R news, 2(3):18–22, 2002.
  • Lin et al. (2017) Tao Lin, Tian Guo, and Karl Aberer. Hybrid neural networks for learning the trend in time series. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 2273–2279, 2017.
  • Lin et al. (1996) Tsungnan Lin, Bill G Horne, Peter Tino, and C Lee Giles. Learning long-term dependencies in narx recurrent neural networks. IEEE Transactions on Neural Networks, 7(6):1329–1338, 1996.
  • Lipton et al. (2015) Zachary C Lipton, David C Kale, Charles Elkan, and Randall Wetzell. Learning to diagnose with lstm recurrent neural networks. arXiv preprint arXiv:1511.03677, 2015.
  • Liu et al. (2010) Yan Liu, Alexandru Niculescu-Mizil, Aurelie C Lozano, and Yong Lu. Learning temporal causal graphs for relational time-series analysis. In ICML, pp. 687–694, 2010.
  • Meek et al. (2002) Christopher Meek, David Maxwell Chickering, and David Heckerman. Autoregressive tree models for time-series analysis. In SDM, pp. 229–244. SIAM, 2002.
  • Mekhilef et al. (2012) S Mekhilef, R Saidur, and M Kamalisarvestani. Effect of dust, humidity and air velocity on efficiency of photovoltaic cells. Renewable and sustainable energy reviews, 16(5):2920–2925, 2012.
  • Neil et al. (2016) Daniel Neil, Michael Pfeiffer, and Shih-Chii Liu. Phased lstm: Accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems, pp. 3882–3890, 2016.
  • Novikov et al. (2015) Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pp. 442–450, 2015.
  • Patel et al. (2015) Jigar Patel, Sahil Shah, Priyank Thakkar, and K Kotecha. Predicting stock and stock price index movement using trend deterministic data preparation and machine learning techniques. Expert Systems with Applications, 42(1):259–268, 2015.
  • Qin et al. (2017) Yao Qin, Dongjin Song, Haifeng Cheng, Wei Cheng, Guofei Jiang, and Garrison W. Cottrell. A dual-stage attention-based recurrent neural network for time series prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI’17, pp. 2627–2633. AAAI Press, 2017.
  • Radinsky et al. (2012) Kira Radinsky, Krysta Svore, Susan Dumais, Jaime Teevan, Alex Bocharov, and Eric Horvitz. Modeling and predicting behavioral dynamics on the web. In WWW, pp. 599–608. ACM, 2012.
  • Scott & Varian (2014) Steven L Scott and Hal R Varian. Predicting the present with bayesian structural time series. International Journal of Mathematical Modelling and Numerical Optimisation, 5(1-2):4–23, 2014.
  • Shazeer et al. (2017) Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. International Conference on Learning Representations, 2017.
  • Siggiridou & Kugiumtzis (2016) Elsa Siggiridou and Dimitris Kugiumtzis. Granger causality in multivariate time series using a time-ordered restricted vector autoregressive model. IEEE Transactions on Signal Processing, 64(7):1759–1773, 2016.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.
  • Tank et al. (2017) Alex Tank, Ian Cover, Nicholas J Foti, Ali Shojaie, and Emily B Fox. An interpretable and sparse neural network model for nonlinear granger causality discovery. arXiv preprint arXiv:1711.08160, 2017.
  • Tank et al. (2018) Alex Tank, Ian Covert, Nicholas Foti, Ali Shojaie, and Emily Fox. Neural granger causality for nonlinear time series. arXiv preprint arXiv:1802.05842, 2018.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 6000–6010, 2017.
  • Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692–2700, 2015.
  • Wang et al. (2016) Linlin Wang, Zhu Cao, Yu Xia, and Gerard de Melo. Morphological segmentation with window lstm neural networks. In AAAI, 2016.
  • Yang et al. (2015) Jian Bo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiao Li Li, and Shonali Krishnaswamy. Deep convolutional neural networks on multichannel time series for human activity recognition. In IJCAI, pp. 25–31, 2015.
  • Zemouri et al. (2010) Ryad Zemouri, Rafael Gouriveau, and Noureddine Zerhouni. Defining and applying prediction performance metrics on a recurrent narx time series model. Neurocomputing, 73(13-15):2506–2521, 2010.
  • Zhang et al. (2017) Liheng Zhang, Charu Aggarwal, and Guo-Jun Qi. Stock price prediction via discovering multi-frequency trading patterns. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2141–2149. ACM, 2017.
  • Zhou et al. (2015) Xiabing Zhou, Wenhao Huang, Ni Zhang, Weisong Hu, Sizhen Du, Guojie Song, and Kunqing Xie. Probabilistic dynamic causal model for temporal data. In Neural Networks (IJCNN), 2015 International Joint Conference on, pp. 1–8. IEEE, 2015.
  • Zong et al. (2018) Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Daeki Cho, and Haifeng Chen. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations, 2018.
  • Zou & Hastie (2005) Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.

6 Appendix

6.1 Multi-Variable LSTM

Theorem 1.

The hidden states and memory cells in MV-LSTM are updated by the process of Eqs. (1)-(4), and therefore each element $\mathbf{h}_t^n$ of the hidden state matrix $\tilde{\mathbf{h}}_t$ encodes information exclusively from the corresponding input variable $n$.

Proof.

By the tensor-dot operation $\mathcal{W}_j \circledast \tilde{\mathbf{h}}_{t-1}$, only the elements of $\mathcal{W}_j$ and $\tilde{\mathbf{h}}_{t-1}$ corresponding to variable $n$ are matched in the calculation, namely $\mathbf{W}_j^n \mathbf{h}_{t-1}^n$. Meanwhile, since the product between the input-to-hidden transition weights and the input vector is taken variable-wise, i.e. $\mathbf{U}_j^n x_t^n$, each resulting element only carries information about variable $n$. Then, though the derivation of the gates $\mathbf{i}_t$, $\mathbf{f}_t$ and $\mathbf{o}_t$ mixes information from all input variables in order to capture the cross-correlation among variables, the memory cells are updated by element-wise multiplication between the gates and $\mathrm{vec}(\tilde{\mathbf{j}}_t)$, and therefore the information encoded in $\mathbf{c}_t$ remains specific to each input variable. Likewise, the hidden state matrix $\tilde{\mathbf{h}}_t$ derived from the updated memory cells retains the variable-wise hidden states. ∎

6.2 Prediction Performance

In addition to the results under window size 30 in Tables 1 and 2, we report the prediction errors under different window sizes, i.e. different values of $T$ in Sec. 3.

Methods \ Dataset    Synthetic    Energy    Plant    PM2.5
STRX
ARIMAX
RF
XGT
ENET
DUAL
RETAIN
cLSTM
MV-Fusion
MV-Indep
MV-LSTM
Table 4: Average test RMSE and std. errors under window size
Methods \ Dataset    Synthetic    Energy    Plant    PM2.5
STRX
ARIMAX
RF
XGT
ENET
DUAL
RETAIN
cLSTM
MV-Fusion
MV-Indep
MV-LSTM
Table 5: Average test MAE and std. errors under window size
Methods \ Dataset    Synthetic    Energy    Plant    PM2.5
STRX
ARIMAX
RF
XGT
ENET
DUAL
RETAIN
cLSTM
MV-Fusion
MV-Indep
MV-LSTM
Table 6: Average test RMSE and std. errors under window size
Methods \ Dataset    Synthetic    Energy    Plant    PM2.5
STRX
ARIMAX
RF
XGT
ENET
DUAL
RETAIN
cLSTM
MV-Fusion
MV-Indep
MV-LSTM
Table 7: Average test MAE and std. errors under window size

6.3 Model Interpretation

In this part, we provide the variable list for each dataset in Table 8 and report the full variable importance results in Table 9. Figures 4 to 7 visualize the histograms of the attentions of all variables in each dataset. In MV-LSTM, compared with the priors, the posterior attention rectifies the prior by taking the predictive likelihood into account, while the attention weights in DUAL and RETAIN are not representative enough.

Dataset Variables
PM2.5
Dew Point, Temperature, Pressure, Cumulated wind speed,
Cumulated hours of snow, Cumulated hours of rain, PM2.5 measurement
Plant
Plant temperature, Cloud cover, Dew point, Humidity,
Temperature, Wind bearing, Wind speed, Power production
Energy
Kitchen temperature, Living room temperature, Laundry room temperature,
Office room temperature, Bathroom temperature, Building outside temperature,
Ironing room temperature, Teenager room temperature, Parents room temperature,
Outside temperature, Wind speed, Humidity outside, Dew point, Energy consumption
Synthetic Variable 0 to 9, target variable 10
Table 8: Datasets.

Dataset / Method / Rank of variables according to importance

PM2.5:
MV-LSTM: Dew point, Cumulated wind speed, Pressure, Temperature (0.067), Autoregressive (0.05), Cumulated hours of snow (0.04), Cumulated hours of rain (0.04)
DUAL: Temperature (0.29), Dew point (0.26), Pressure (0.21), Cumulated wind speed (0.09), Cumulated hours of snow (0.08), Cumulated hours of rain (0.07)
RETAIN: Autoregressive (3.79), Pressure (1.14), Cumulated hours of snow (0.04), Cumulated wind speed (-0.42), Cumulated hours of rain (-0.47), Dew point (-1.24), Temperature (-1.80)
cLSTM: Dew point, Pressure, Temperature

Plant:
MV-LSTM: Cloud cover, Wind speed, Temperature, Humidity (0.07), Autoregressive (0.30), Dew point (0.06), Wind bearing (0.05), Plant temp. (0.05)
DUAL: Humidity, Cloud cover, Wind speed, Temperature (0.14), Wind bearing (0.09), Plant temp. (0.09), Dew point (0.08)
RETAIN: Plant temp., Wind speed, Dew point, Autoregressive (0.77), Temperature (0.25), Wind bearing (-0.13), Cloud cover (-0.46), Humidity (-0.84)
cLSTM: Dew point, Humidity, Plant temperature, Autoregressive, Wind bearing, Wind speed

Energy:
MV-LSTM: Living room temp., Office room temp., Parents room temp., Humidity outside (0.15), Dew point (0.06), Wind speed (0.05), Kitchen temp. (0.02), Bathroom temp. (0.02), Teenager room temp. (0.02), Building outside temp. (0.01), Outside temp. (0.01), Autoregressive (0.01), Ironing room temp. (0.008), Laundry room temp. (0.002)
DUAL: Humidity outside (0.17), Wind speed (0.16), Living room temp. (0.10), Parents room temp. (0.07), Laundry room temp. (0.06), Bathroom temp. (0.06), Building outside temp. (0.06), Teenager room temp. (0.06), Outside temp. (0.06), Ironing room temp. (0.05), Kitchen temp. (0.05), Dew point (0.05), Office room temp. (0.05)
RETAIN: Building outside temp. (0.13), Autoregressive (0.12), Parents room temp. (0.11), Outside temp. (0.11), Ironing room temp. (0.09), Bathroom temp. (0.09), Laundry room temp. (0.09), Living room temp. (0.08), Office room temp. (0.06), Dew point (0.06), Teenager room temp. (0.05), Kitchen temp. (0.047), Wind speed (0.03), Humidity outside (-0.07)
cLSTM: Humidity outside, Office room temp., Living room temp., Laundry room temp., Parents room temp., Dew point, Ironing room temp.

Synthetic:
MV-LSTM: Variable 3 (0.18), Variable 2 (0.18), Variable 8 (0.17), Variable 6 (0.15), Variable 4 (0.13), Variable 1 (0.08), Variable 7 (0.06), Variable 0 (0.02), Variable 9 (0.01), Autoregressive (0.01), Variable 5 (0.01)
DUAL: Variable 1 (0.12), Variable 0 (0.12), Variable 7 (0.11), Variable 3 (0.10), Variable 6 (0.10), Variable 2 (0.09), Variable 4 (0.09), Variable 8 (0.09), Variable 5 (0.09), Variable 9 (0.09)
RETAIN: Variable 10 (1.08), Variable 8 (0.09), Variable 9 (0.07), Variable 4 (0.06), Variable 6 (0.05), Variable 2 (0.02), Variable 1 (0.01), Variable 7 (-0.04), Variable 0 (-0.1), Variable 5 (-0.11), Variable 3 (-0.13)
cLSTM: Variable 7, Variable 6, Variable 0, Variable 3, Variable 1

*A color box marks a variable that is important according to the Granger causality test or the ground truth.

Table 9: Interpretation of variable importance (full results).


Figure 4: Histogram visualization of variable attentions in the PM2.5 dataset. For MV-LSTM, both prior and posterior attentions are shown. DUAL and RETAIN only have attention weights.


Figure 5: Histogram visualization of variable attentions in the Plant dataset. For MV-LSTM, both prior and posterior attentions are shown. DUAL and RETAIN only have attention weights.


Figure 6: Histogram visualization of variable attentions in the Energy dataset. For MV-LSTM, both prior and posterior attentions are shown. DUAL and RETAIN only have attention weights.


Figure 7: Histogram visualization of variable attentions in the Synthetic dataset. For MV-LSTM, both prior and posterior attentions are shown. DUAL and RETAIN only have attention weights.