Multivariate time series (MTS) forecasting is widely used in many applications such as weather forecasting [xingjian2015convolutional], clinical diagnosis [che2018recurrent], sales forecasting [wu2018restful, wu2019neural] and traffic analysis [yao2019revisiting, yao2018deep, yao2019learning, tang2019joint]. The popularity of MTS forecasting has attracted increasing attention, and many efforts have been made to address the problem in the past few years [box2015time, qin2017dual, chang2018memory]. Recurrent neural networks (RNNs), a class of deep learning frameworks designed for modeling sequential data, have been successfully applied to this problem. For example, LSTNet [lai2018modeling] adopts LSTM to capture long-term dependencies for time series forecasting. Qin et al. [qin2017dual] designed two attention mechanisms in RNN to improve forecasting accuracy. Despite their success, the majority of the aforementioned models assume MTS data is complete.
In the real world, MTS data are usually incomplete due to various reasons, such as broken sensors, failed data transmissions, or damaged storage. For example, Figure 1 gives two multivariate time series snippets from Beijing air pollution data, both of which contain apparent missing values marked by gray boxes (i.e., 20 of the 144 data points are unobserved). Missing values damage temporal dependencies in MTS sequences [luo2018multivariate, cao2018brits], make it hard to apply existing RNN-based models to incomplete sequences, and increase the difficulty of MTS forecasting tasks. As shown in Figure 1, because of missing values, the second peak of the blue signal is not observed and cannot be inferred by simply relying on existing RNNs. Therefore, it is vital to design models that handle missing values in MTS data to perform accurate forecasting. Many prior efforts have been dedicated to this direction. For example, two-step approaches that first omit or impute missing values and then perform time series forecasting on the pre-processed data are explored in [yi2016st, garcia2010pattern]. End-to-end solutions, where the missing patterns are modeled jointly with forecasting tasks, are investigated in [che2018recurrent, cao2018brits, luo2018multivariate]. However, those methods only explore local statistical features, while the global temporal patterns in exogenous sequences are neglected.
Jointly modeling local and global temporal dynamics is very promising for MTS forecasting with missing values. Though constructing local statistics (e.g., empirical mean and last observations) to estimate missing variables has certain potential [che2018recurrent], these local statistics are unreliable when the missing ratio rises or consecutive missing values occur, as illustrated in Figure 1. This problem can be alleviated by adopting global temporal dynamics. From a global perspective, there exist many MTS snippets with close temporal patterns. For example, in Figure 1, two MTS sequences from different air quality stations share similar temporal patterns. Although it is hard to recover consecutive missing values (i.e., those circled by the gray dashed boxes) purely from local statistics of one MTS, aggregating temporal patterns in both sequences is rather promising. The temporal patterns of one time series can be utilized for the other when dealing with missing values. However, how to take advantage of global temporal dynamics is a very challenging problem, which is under-explored in existing work.
To address the aforementioned challenges, we propose a novel framework LGnet to model Local and Global temporal dynamics jointly for MTS forecasting. LGnet absorbs model designs from previous work [che2018recurrent], where LSTM is leveraged for MTS forecasting. Since the original LSTM is unable to handle incomplete input, we first construct estimations for missing values. Specifically, representative local statistic features are constructed for each variable in an MTS. Besides, a memory module is designed for LGnet to explicitly leverage knowledge from exogenous sequences to generate global estimations for missing values. This is achieved by using local statistics as keys to query a globally optimized memory component. We further introduce adversarial training to enhance the modeling of global temporal distribution. A discriminator is built to distinguish the generated MTS from real samples. Meanwhile, LGnet aims at producing forecasting sequences that are hard to distinguish from real ones, and which are therefore closer to the actual global distribution of real MTS data. The main contributions of the paper are:
We study a new problem of MTS forecasting with missing values by exploring local and global temporal dynamics.
We propose a novel framework LGnet, with a memory module to capture global temporal dynamics for missing values and adversarial training to enhance the modeling of global temporal distribution.
We conduct extensive experiments on four large-scale real-world datasets to validate the proposed approach.
Various methods have been proposed for MTS forecasting, such as Autoregressive (AR), Vector Autoregression (VAR), Autoregressive moving average (ARMA), and standard regression models (e.g., support vector regression [smola2004tutorial], linear regression, and regression tree methods [chen2016xgboost]). Inspired by the recent success of deep neural networks, many RNN-based methods [lai2018modeling, qin2017dual] have been developed for MTS forecasting. Even some vanilla RNNs, such as GRU [chung2014empirical] and LSTM [hochreiter1997long], can outperform the non-deep-learning models significantly [chang2018memory]. However, none of those approaches can handle input with missing values.
To handle missing values in MTS, the simplest solution would be removing all samples with missing values, such as pairwise deletion [marsh1998pairwise]. Obviously, such methods ignore much useful information, especially with a high missing ratio [king1998list]. General data imputation methods such as statistical imputation (e.g., mean, median), EM-based imputation [nelwamondo2007missing], K-nearest neighbors [friedman2001elements], and matrix factorization [friedman2001elements] can be applied for the unobserved variables. However, those general approaches fail to model temporal dynamics of time series. Even if MTS imputation methods, such as multivariate imputation by chained equations [azur2011multiple] and generative adversarial networks [luo2018multivariate], can be applied to fill in missing values first, training a forecasting model on pre-processed MTS data would lead to sub-optimal results, since the temporal patterns of missing values are totally isolated from forecasting models [wells2013strategies]. To tackle this issue, some researchers propose end-to-end frameworks that jointly estimate missing values and forecast future MTS. Che et al. [che2018recurrent] introduce GRU-D, which imputes missing values using a linear combination of statistical features. Yoon et al. [Yoon2017MultidirectionalRN] propose M-RNN, which leverages a bi-directional RNN for the imputation. Cao et al. [cao2018brits] model the relationships between missing variables to simultaneously perform imputation and classification/regression in one neural graph. However, those solutions focus on localized temporal dependencies and fail to model global temporal dynamics.
In practice, many multivariate time series signals are sampled evenly. Thus, we assume the time span is divided into equal-length time intervals. Let $X = (x_1, x_2, \dots, x_T)^\top \in \mathbb{R}^{T \times D}$ denote one MTS of length $T$, where $x_t \in \mathbb{R}^D$ is the observation at the $t$-th time interval, $x_t^d$ is the $d$-th variable of $x_t$, and $D$ is the number of variables. Let mask matrix $M \in \{0, 1\}^{T \times D}$ denote the missing status of each variable, where $m_t^d = 0$ if $x_t^d$ is unobserved/missing, otherwise, $m_t^d = 1$. Note that we can use a symbol to denote missing values in $X$ (e.g., ).
We are interested in a general MTS forecasting task. Given incomplete MTS observations $\{X_n\}_{n=1}^{N}$ and their masks $\{M_n\}_{n=1}^{N}$, we aim at learning a function $f$ that can forecast the values in the future $\tau$ time intervals () of any new MTS, given its historical observations $X$ with the mask matrix $M$.
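As a minimal illustration of the mask convention above (0 for missing, 1 for observed), the following sketch builds a mask matrix from a toy MTS. The helper name `build_mask`, the use of `None`/NaN as the missing-value symbol, and the 0.0 placeholder written into missing positions are assumptions made for this example only.

```python
import math

def build_mask(series):
    """Return (values, mask) for an MTS given as T rows of D entries.

    Missing entries are marked with None or NaN in the input. In the output
    they are replaced by a 0.0 placeholder (an assumption of this sketch)
    and flagged with mask 0; observed entries get mask 1.
    """
    values, mask = [], []
    for row in series:
        observed = [v is not None and not math.isnan(v) for v in row]
        mask.append([1 if ok else 0 for ok in observed])
        values.append([float(v) if ok else 0.0 for v, ok in zip(row, observed)])
    return values, mask

# Toy MTS with T = 3 time intervals and D = 2 variables.
x = [[1.0, None], [float("nan"), 4.0], [5.0, 6.0]]
values, mask = build_mask(x)
```

Keeping the mask separate from the values lets later steps decide how to estimate each missing entry instead of committing to a placeholder.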
The Proposed Framework
Figure 2 illustrates the proposed framework LGnet. LGnet is built on LSTM to forecast future MTS values. We design a memory module which contains temporal information from exogenous sequences to impute MTS during the running time of LSTM. Specifically, we first extract local statistic features for every time interval, then use them as keys to query a memory component, which is jointly optimized with LSTM on all MTS data. The query results, which preserve global temporal dynamics, are further combined with local statistic features to serve as the input of the LSTM. We also introduce adversarial training on forecasted sequences to make sure they follow the global distribution. The whole framework of LGnet is trained in an end-to-end manner. Next, we introduce each module of LGnet in detail.
MTS forecasting with LSTM
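Recurrent neural networks (RNNs) have demonstrated remarkable success in various MTS forecasting tasks [lai2018modeling, chang2018memory]. To leverage the advances of RNNs, we build LGnet on top of the long short-term memory (LSTM) network, a variant of RNN which is able to capture long- and short-term dependencies. Note that other RNN variants such as GRU [cho2014learning] can serve as a replacement for the LSTM. Formally, LSTM takes one data point of the time series as input in each step, and iterates from $t = 1$ to $T$. Suppose $x_t$ is currently fed to the LSTM, and $h_{t-1}$ and $c_{t-1}$ are the hidden state and memory cell of LSTM at the previous step $t-1$, then the hidden state $h_t$ and memory cell $c_t$ at time $t$ can be calculated as:
$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}
$$
where $i_t$ is the input gate, $f_t$ is the forget gate, $o_t$ is the output gate, $\odot$ is the element-wise product, $\sigma$ is the sigmoid function, and $W_*$, $U_*$, and $b_*$ are parameters. Given the current hidden state $h_t$, the forecasting of the next data point can be generated recurrently as:
$$\hat{x}_{t+1} = W_x h_t + b_x. \tag{1}$$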
However, the original LSTM cannot handle missing values in its input. Obviously, if $x_t$ contains unobserved variables, matrix products such as $W_i x_t$ are invalid. One solution is using the LSTM's own forecast $\hat{x}_t$ as an alternative to $x_t$. However, early errors in $\hat{x}_t$ can be quickly amplified in the following steps [bengio2015scheduled], leading to inaccurate forecasting.
Therefore, appropriate estimations of missing values should be constructed for the LSTM. We tackle the problem by exploring temporal dynamics from both local and global perspectives with a memory module. In the next section, we discuss its technical details.
The basic idea of the memory module is to learn a parameterized memory which caches global temporal patterns and projects each variable to the same feature space as the memory. For each variable in an MTS, we first capture informative statistics from the local context of this time series, then leverage local statistics as keys to query the memory component, which returns representation vectors with global temporal dynamics. The memory module brings two advantages: (i) it learns and stores meaningful temporal patterns from a global perspective; and (ii) it utilizes the knowledge of temporal patterns to construct global representations. Note that the memory module is not the memory cell of the LSTM, as shown in Figures 2 and 3.
Capturing Local Statistics
We extract useful local statistic features using the contextual information from observed parts of the time series for missing values. Following prior studies [che2018recurrent], we first generate empirical mean and last observation of every time stamp as follows:
Empirical Mean: for variable $x_t^d$, we construct its empirical mean $\bar{x}_t^d$ using all available observations of the $d$-th variable before time $t$, i.e., $\bar{x}_t^d = \sum_{i=1}^{t-1} m_i^d x_i^d \big/ \sum_{i=1}^{t-1} m_i^d$. The mean of previous observations reflects the time-aware data distribution of the variable and serves as prior knowledge of the variable.
Last Observation: the last observation of $x_t^d$ is the most recent available value of the $d$-th variable before time interval $t$, which is its temporally closest neighbor. We use $x_{t'}^d$ to denote the last observation of $x_t^d$. Note that $x_{t'}^d$ is not necessarily equal to $x_{t-1}^d$ because $x_{t-1}^d$ could also be missing. We further introduce an indicator $\delta_t^d$ to record the temporal distance between each $x_t^d$ and its last observation, which reflects the confidence in and trustworthiness of previous values:
$$
\delta_t^d =
\begin{cases}
\delta_{t-1}^d + 1, & t > 1,\ m_{t-1}^d = 0, \\
1, & t > 1,\ m_{t-1}^d = 1, \\
0, & t = 1.
\end{cases}
\tag{2}
$$
Generally, when $\delta_t^d$ is small, we tend to trust $x_{t'}^d$ more; and when $\delta_t^d$ becomes larger, the empirical mean $\bar{x}_t^d$ would be more representative. Based on the above assumption, we propose the following decaying mechanism to balance empirical mean and last observation:
$$\gamma_t = \exp\{-\max(0, W_\gamma \delta_t + b_\gamma)\}, \tag{3}$$
where $W_\gamma$ and $b_\gamma$ are parameters. The above decay mechanism leverages an exponentiated negative rectifier so that the decay value is monotonically decreasing in the range between 0 and 1 w.r.t. $\delta_t$ [che2018recurrent, luo2018multivariate]. The localized estimation for $x_t$ is constructed as follows:
$$z_t^f = m_t \odot x_t + (1 - m_t) \odot \left(\gamma_t \odot x_{t'} + (1 - \gamma_t) \odot \bar{x}_t\right). \tag{4}$$
We use $z_t^f$ to denote local statistic features for $x_t$. For observed variables, their original values are directly used. For missing values, we combine the empirical mean with the last observation to construct $z_t^f$.
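The local statistics described above can be sketched for a single variable as follows. This is an illustrative sketch rather than the authors' implementation: the function name `forward_local_stats` is hypothetical, the decay parameters `w` and `b` are fixed scalars here (they are learned in the model), and the 0.0 fallback used when no earlier observation exists is an assumption.

```python
import math

def forward_local_stats(x, m, w=1.0, b=0.0):
    """Forward local estimate for one variable.

    x: list of floats (entries at missing positions are ignored)
    m: list of 0/1 flags, 1 = observed
    w, b: scalar decay parameters (fixed here purely for illustration)
    """
    total, count = 0.0, 0   # running sum / count of earlier observations
    last = None             # most recent observed value before step t
    delta = 0               # number of steps since that observation
    estimates = []
    for t in range(len(x)):
        if t > 0:
            delta = 1 if m[t - 1] == 1 else delta + 1
        if m[t] == 1:
            z = x[t]                      # observed: keep the original value
        elif count == 0:
            z = 0.0                       # nothing observed yet (assumed fallback)
        else:
            mean = total / count
            gamma = math.exp(-max(0.0, w * delta + b))
            z = gamma * last + (1.0 - gamma) * mean
        estimates.append(z)
        if m[t] == 1:
            total += x[t]
            count += 1
            last = x[t]
    return estimates
```

With `w = 1` and `b = 0`, a gap of one step weights the last observation by about 0.37 and a gap of two steps by about 0.14, so the estimate drifts from the last observation toward the empirical mean as the gap grows.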
However, the forward estimation $z_t^f$ only takes data points observed before the $t$-th time interval into consideration. Similar local statistics can also be extracted from time interval $t+1$ to $T$. This is achieved by first reversing $X$ and $M$ on the temporal dimension, then extracting local statistics following the same definition as $z_t^f$ on time interval $T$ to $t+1$. We use $z_t^b$ to denote local statistics from time interval $T$ to $t+1$. As shown in Figure 2, $z_t^f$ and $z_t^b$ are forward and backward local statistic features, respectively.
In addition to the forward and backward local statistic features, the LSTM naturally provides estimations for missing variables. Specifically, we follow Equation 1 and use the output of LSTM at the previous time step as another local statistic from a model view:
$$z_t^m = W_x h_{t-1} + b_x. \tag{5}$$
Modeling Global Dynamics
The above imputations $z_t^f$, $z_t^b$, and $z_t^m$ can be fed directly into the LSTM to train an MTS forecasting model [che2018recurrent]. However, such an approach is sub-optimal. $z_t^f$ and $z_t^b$ become less trustworthy as the missing ratio rises. Besides, purely relying on local statistics ignores global temporal dynamics from exogenous sequences, which potentially benefit the estimation of missing values. It is possible to capture time series snippets/patterns from other sequences that are temporally similar (e.g., in periodicity) to the context of a missing value. For example, to impute one missing data point in a trajectory, similar time series snippets may be found in other trajectories.
However, capturing such global temporal dynamics to find informative temporal patterns for missing values is very challenging. Simply comparing with all potential snippets to find similar ones is impractical due to high computational costs. Recently, memory networks [weston2014memory] have shown promising results in capturing patterns for sequential data [sukhbaatar2015end, chang2018memory]. Generally, a memory network initializes a memory component to store feature representations that are optimized explicitly on the whole dataset. Those stored representations can be retrieved and utilized for specific tasks [tang2017end, kumar2016ask]. We design a memory module to capture global temporal dynamics explicitly, as shown in Figure 3. We assume there are $K$ temporal patterns existing in the dataset ($K$ is a hyperparameter), and initialize a parameterized memory $\mathcal{M} \in \mathbb{R}^{K \times E}$, where $E$ is the dimension of pattern representation. The memory is updated jointly with the LSTM. We utilize local statistics as keys to query the memory module because they can represent the uniqueness of variables. Specifically, queries to the memory module are constructed as follows:
$$q_t = W_q [z_t^f; z_t^b; z_t^m] + b_q, \tag{6}$$
where $[\cdot\,;\cdot]$ denotes concatenation on column. $W_q$ and $b_q$ are parameters. Then we calculate the similarity between $q_t$ and the memory component $\mathcal{M}$:
$$a_t = \mathrm{softmax}(q_t \mathcal{M}^\top). \tag{7}$$
The similarity scores measure the importance of each temporal pattern in the memory. Any pattern with a higher attention score is more similar to the context of the target missing value. The representation vector of $x_t$ that preserves global temporal dynamics is then constructed from the weighted sum of all temporal patterns in $\mathcal{M}$:
$$z_t^g = \sum_{k=1}^{K} a_{t,k} \mathcal{M}_k, \tag{8}$$
where $\mathcal{M}_k$ represents the $k$-th row of the memory component. Besides, since variables at the same time interval interact with each other in Equation 6, $z_t^g$ also preserves inner correlations of variables at the same time interval. We combine local statistic features and global representations to construct the input of LSTM. Specifically, $z_t^f$, $z_t^b$, $z_t^m$, and $z_t^g$ are averaged as the input at time interval $t$. Note that some of the local statistics can become unavailable in some cases. For example, we cannot construct $z_t^f$ for the first missing value, and no $z_t^b$ is available for the forecasting stage. We set unavailable local statistics to . However, we can generate either the forward or the backward local statistic features unless the whole time series is empty. This ensures LGnet is more reliable than methods that rely purely on the forward local statistics [che2018recurrent]. The forecasting results $\hat{Y}$ are generated after the -th iteration of LSTM. The aligned ground truth data for $\hat{Y}$ is denoted as $Y$, which also contains missing values. Therefore, we incorporate the mask matrix into the mean squared error and propose the following objective function to train LGnet:
$$\mathcal{L}_F(\theta) = \frac{1}{N} \sum_{n=1}^{N} \left\| (Y_n - \hat{Y}_n) \odot M_n^Y \right\|_2^2, \tag{9}$$
where $\theta$ denotes the parameters of LGnet, including parameters of the LSTM and the memory component, $M_n^Y$ is the mask matrix of the $n$-th MTS data sample over the predicted variables, and $\odot$ is the element-wise product. Because of the mask matrix, LGnet is optimized over the observed part of $Y$.
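The memory read described above (scoring every stored pattern against a query, then returning the attention-weighted sum of patterns) can be sketched as follows. The sketch assumes the query has already been projected to the same dimension as the memory rows, and it uses dot-product similarity with a softmax, which is a common choice for memory networks and an assumption here. The names `softmax` and `read_memory` and the toy memory contents are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

def read_memory(query, memory):
    """Attend over memory rows with dot-product similarity.

    Returns (attention weights, weighted sum of memory rows).
    """
    scores = [sum(q * v for q, v in zip(query, row)) for row in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    readout = [sum(w * row[j] for w, row in zip(weights, memory))
               for j in range(dim)]
    return weights, readout

# Toy memory with K = 3 patterns of dimension 2 (learned jointly in the model).
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, readout = read_memory(query, memory)
```

In this toy case the first and third patterns score equally against the query and share most of the attention mass, while the second pattern receives a much smaller weight.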
The objective function of LGnet in Equation 9 only considers available variables. When the missing ratio is relatively high, the proposed objective function becomes inefficient because most values of the mask matrix are zero when sampling under the same data distribution. The predicted future sequences should also follow the same data distribution as the true MTS data. If we can encourage LGnet to generate a more realistic data distribution, the overall accuracy of MTS forecasting can be improved. To achieve this goal, we introduce adversarial training to control the distribution of generated MTS.
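To make the limitation above concrete, a masked squared-error loss can be sketched as below: only entries whose mask is 1 contribute, so a high missing ratio leaves few terms to learn from. The function name `masked_mse` is hypothetical, and dividing by the number of observed entries is a normalization chosen for this sketch, not necessarily the one used in the paper.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over observed entries only (mask == 1).

    pred, target, mask: lists of rows with identical shapes.
    Returns 0.0 when no entry is observed.
    """
    sq_err, n_obs = 0.0, 0
    for p_row, y_row, m_row in zip(pred, target, mask):
        for p, y, m in zip(p_row, y_row, m_row):
            if m == 1:
                sq_err += (p - y) ** 2
                n_obs += 1
    return sq_err / n_obs if n_obs else 0.0

pred = [[1.0, 2.0], [3.0, 4.0]]
target = [[1.0, 0.0], [5.0, 4.0]]
mask = [[1, 0], [1, 1]]          # the second entry of the first row is missing
loss = masked_mse(pred, target, mask)
```

The error at the masked position is ignored entirely, which is why the forecasting loss alone provides weak supervision on sparse targets.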
Recently, generative adversarial networks (GANs) [goodfellow2014generative] have been widely applied to various domains [yu2017seqgan, sun2019megan, shu2018deep]. A typical GAN consists of a generator and a discriminator. The discriminator tries to distinguish samples from the generator from those from real data. The generator tries to generate samples that can “fool” the discriminator by modeling the data distribution. With such a min-max game, the generator can create more realistic samples. This motivates us to adopt adversarial learning to enhance the forecasting. We design a discriminator $D$, as illustrated in Figure 2. The LSTM generates future sequences as the forecasting to “fool” the discriminator $D$, while $D$ is trained to identify whether the input sequence is fake. Through iterative training, the LSTM becomes more capable of generating time series that fit the underlying distribution [goodfellow2014generative], which makes the forecasting result more accurate.
Specifically, we adopt W-GAN [arjovsky2017wasserstein] and construct a two-layer convolutional network as $D$. Given an MTS $X$ as input, $D$ outputs a real value $D(X)$, which is higher if $X$ is real, and lower if $X$ is “fake”. The “fake” multivariate time series of length $L$ are generated after the forecasting part. Let $\tilde{X}$ denote a generated (fake) time series. To compile a “true” dataset that preserves the latent data distribution, we sample a subset of complete time series snippets with the same length $L$ from the raw dataset. Let $\mathcal{S}$ denote the sampled subset of time series snippets. Empirically, it is not a difficult task when $L$ is small (i.e., ). The training objective of the discriminator is:
$$\mathcal{L}_D(\phi) = \mathbb{E}_{\tilde{X}}\left[D(\tilde{X})\right] - \mathbb{E}_{X \sim \mathcal{S}}\left[D(X)\right], \tag{10}$$
where $\sim$ denotes “sampling from”, and $\phi$ denotes the parameters of the discriminator. Generally, time series with a high probability of being a “true” sample will receive a higher score. To generate more realistic sequences, the objective function for the LSTM is defined as:
$$\mathcal{L}_G(\theta) = -\mathbb{E}_{\tilde{X}}\left[D(\tilde{X})\right], \tag{11}$$
which aims at fooling the discriminator. Note that there is no overlap between the forecasting and the generated part, as we empirically find that adding adversarial loss on the forecasting part may hurt the performance. A potential reason is that the best time series to “fool” the discriminator might not be the most accurate forecasting result. Therefore, we put $D$ on extra generated sequences to achieve the best performance. Luo et al. [luo2018multivariate] state a similar conclusion.
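The two adversarial objectives can be sketched from discriminator scores alone, following the standard Wasserstein GAN formulation. This is a minimal sketch: the function names are hypothetical, the scores are plain lists of floats standing in for discriminator outputs, and the Lipschitz constraint that W-GAN enforces on the discriminator (weight clipping in the original formulation) is omitted.

```python
def critic_loss(real_scores, fake_scores):
    """Discriminator (critic) loss: minimizing it raises scores on real
    snippets and lowers scores on generated ones."""
    mean_fake = sum(fake_scores) / len(fake_scores)
    mean_real = sum(real_scores) / len(real_scores)
    return mean_fake - mean_real

def generator_loss(fake_scores):
    """Generator loss: minimizing it pushes the discriminator's scores
    on generated series upward."""
    return -sum(fake_scores) / len(fake_scores)

# Hypothetical discriminator outputs for two real and two generated snippets.
d_loss = critic_loss(real_scores=[2.0, 4.0], fake_scores=[1.0, 1.0])
g_loss = generator_loss(fake_scores=[1.0, 3.0])
```

The two losses pull the scores of generated series in opposite directions, which is what drives the min-max game between the LSTM and the discriminator.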
Objective Function and Training
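We define the overall objective function to learn model parameters for accurate MTS forecasting with adversarial training as follows:
$$\mathcal{L}(\theta) = \mathcal{L}_F(\theta) + \lambda \mathcal{L}_G(\theta), \tag{12}$$
where $\mathcal{L}_F$ is the masked forecasting loss, $\mathcal{L}_G$ is the adversarial loss for the LSTM, and $\lambda$ balances the MTS forecasting part and the adversarial training part.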
We use stochastic gradient descent to update model parameters. The discriminator and the LSTM are trained alternately until convergence. We first update $D$ with real MTS snippets and generated ones, then optimize the overall objective for the LSTM and the memory module while fixing $D$.
In this section, we present experiments to evaluate the proposed framework LGnet. Specifically, we aim at answering the following research questions: (i) RQ1: Can LGnet improve the accuracy of MTS forecasting with missing values? (ii) RQ2: How robust is LGnet w.r.t. different missing ratios? (iii) RQ3: How does the memory module benefit LGnet? (iv) RQ4: How does adversarial training contribute to LGnet? Next, we start by introducing various experiments on MTS forecasting to answer the above questions.
Four large-scale real-world MTS datasets from different domains are selected to validate LGnet:
Beijing Air (https://www.kdd.org/kdd2018/kdd-cup): This dataset is introduced by KDD Cup 2018. We extract PM2.5 values from 35 monitoring stations in Beijing, and formulate multivariate time series. The values are reported every hour between 05/01/2014 and 04/30/2015. It has a missing rate of 13% over the temporal dimension. We use past 9-hour observations to train each model, and forecast PM2.5 values for the following 3 hours.
PhysioNet: PhysioNet [silva2012predicting] provides 4000 multivariate clinical time series from the intensive care unit (ICU). Each time series records 35 measurements such as glucose and heart rate from the first 48 hours since the patient entered the hospital. We compile time series from 12 important measurements such as heart rate and temperature. The missing ratio of PhysioNet is about 78% over the temporal dimension. We use the past 6 observations to forecast values in the coming 3 hours.
Porto Taxi (https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i): This dataset includes approximately one million trajectories for 442 taxis running in the city of Porto during a complete year (from 01/07/2013 to 30/06/2014). Each trajectory contains many GPS coordinates (i.e., longitude and latitude) recorded chronologically. The sampling rate is one coordinate every 15 seconds. We use the past 7 GPS coordinates to forecast the location of future points.
London Weather (https://www.kdd.org/kdd2018/kdd-cup): The dataset includes temperature, pressure, humidity, wind direction, and wind speed from 861 regions in London from 01/01/2017 to 03/27/2018. All features are collected hourly. We use the past 5 observations to forecast the coming values.
The first and second datasets naturally contain missing values, and are used as real-world settings to answer the first question. For the remaining two datasets, we randomly remove $p$% of observed values () to study the robustness of LGnet and the compared methods.
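The synthetic missing condition can be sketched as follows: a given percentage of the currently observed entries is selected uniformly at random and marked as missing. The function name `drop_observed`, the fixed seed, and the rounding of the number of dropped entries are assumptions of this sketch.

```python
import random

def drop_observed(mask, p, seed=0):
    """Return a copy of `mask` with p% of its observed entries (1s) set to 0.

    mask: list of rows of 0/1 flags, 1 = observed
    p: percentage of observed entries to remove, between 0 and 100
    """
    rng = random.Random(seed)
    observed = [(t, d) for t, row in enumerate(mask)
                for d, m in enumerate(row) if m == 1]
    n_drop = round(len(observed) * p / 100.0)
    new_mask = [list(row) for row in mask]
    for t, d in rng.sample(observed, n_drop):
        new_mask[t][d] = 0
    return new_mask

# A complete toy MTS with 4 time intervals and 5 variables.
full = [[1] * 5 for _ in range(4)]
sparse = drop_observed(full, 30)
```

Sampling only among observed positions keeps the removal rate well defined even when the input mask already contains missing entries.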
We compare LGnet with classical and state-of-the-art baselines, including two non-RNN methods, two time series imputation methods, and two RNN methods:
Linear Regression (LR): Because conventional linear regression model cannot directly handle missing values, we concatenate each MTS with its mask matrix as the input features to train LR for the forecasting task.
MICE [azur2011multiple]: MICE fills the missing values using multiple imputations with chained equations. We first apply MICE to impute missing values. We then train LSTM for the forecasting task.
GRUI [luo2018multivariate]: GRUI leverages GAN and RNN for time series imputation. Similar to MICE, we first train GRUI for time series imputation. Then we train LSTM on imputed data as the forecasting model.
GRU-D [che2018recurrent]: GRU-D combines statistical features and linear decay in RNN to tackle missing variables in time series. It is proposed for the multivariate time series forecasting task.
BRITS [cao2018brits]: BRITS designs a bidirectional recurrent neural architecture for time series imputation and forecasting. It models missing patterns explicitly and improves forecasting accuracy.
We normalize each dataset and ensure all time series variables have the same scale (i.e., mean and variance) on each dataset so that their averaged results are comparable. For each dataset, we randomly select 70% of MTS data for training, 10% for validation to tune hyperparameters, and the remaining 20% for testing. We set the dimension of the hidden unit to 32 for the LSTM. We select $K$ from to create the memory component according to the performance on validation sets. The dimension of each memory slot is 128. We tune $\lambda$, which balances MTS forecasting and adversarial training, on validation sets for the best performance. The discriminator contains two convolutional layers followed by two fully-connected layers. kernels are used for both convolutional layers. The channel sizes are 64 and 128 for the first and second convolutional layer, respectively. The dimensions of the fully-connected layers are 1024 and 1.
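The data preparation above can be sketched as a random 70/10/20 split over samples plus per-variable normalization statistics. The function names are hypothetical, and computing the mean and standard deviation over observed entries only is an assumption of this sketch, since missing entries carry no value to normalize.

```python
import random

def split_indices(n, seed=0):
    """Shuffle sample indices and split them 70% / 10% / 20%."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = n * 7 // 10          # integer arithmetic avoids float rounding
    n_val = n // 10
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def zscore_params(values, mask, d):
    """Mean and standard deviation of variable d over observed entries only."""
    obs = [row[d] for row, m_row in zip(values, mask) if m_row[d] == 1]
    mean = sum(obs) / len(obs)
    var = sum((v - mean) ** 2 for v in obs) / len(obs)
    return mean, var ** 0.5

train_idx, val_idx, test_idx = split_indices(100)
mean, std = zscore_params([[1.0], [0.0], [3.0]], [[1], [0], [1]], 0)
```

In practice the statistics would be computed on the training split only and then applied to the validation and test splits.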
Two widely used evaluation metrics, i.e., root mean square error (RMSE) and mean absolute error (MAE), are adopted. Since different variables have different scales, we report the RMSE and MAE on their normalized values. The smaller the RMSE and MAE are, the better the performance is.
To answer RQ1, we compare LGnet with baselines on Beijing Air and PhysioNet, where missing values naturally exist. We report the performance on the two datasets for (forecasting horizon) in Table 1, and make the following observations: (i) LGnet outperforms all the baseline methods in the majority of cases, which shows the effectiveness of the memory module and adversarial learning for multivariate time series forecasting with missing values. The memory module explores global temporal dynamics and generates appropriate estimations for missing values; (ii) when the forecasting horizon increases, i.e., when forecasting far future values, the performance of all the methods decreases, which is reasonable because it is more difficult to forecast far future values than near ones. However, LGnet still significantly outperforms the compared methods, because we adopt adversarial training on the predicted sequences to make the forecasting more realistic; (iii) in addition, the performance improvement of LGnet is much more significant on PhysioNet than on Beijing Air. Compared with Beijing Air, PhysioNet has a higher missing ratio, which challenges the baseline methods, while LGnet can still handle such a high missing ratio. This further implies the effectiveness of designing the memory network and adopting adversarial training in LGnet.
Robustness of LGnet
Real-world applications could encounter various data missing conditions. It is interesting to understand the robustness of LGnet under different missing ratios. To this end, we design experiments on two complete MTS datasets, namely Porto Taxi and London Weather. In particular, for each dataset, we randomly drop $p$% of observed values to generate a synthetic missing condition and we alter $p$ from . We train LGnet and the compared methods to forecast the next observation of all variables (i.e., $\tau = 1$). The performance of LGnet and all compared methods in terms of RMSE and MAE with varying $p$ is reported in Figure 4. Clearly, the forecasting error of non-RNN methods rises dramatically as $p$ increases, because they fail to model missing temporal patterns for the forecasting. GRU-D and BRITS explicitly handle missing values and achieve lower errors compared with LR and XGBoost. However, they fail to maintain accurate forecasting when the missing ratio is high. LGnet achieves the highest accuracy even if the data is extremely sparse (e.g., ), which illustrates the effectiveness of the memory module and the adversarial schema. The global temporal patterns in the memory module help LGnet perform well as the missing ratio increases. Extra guidance from the discriminator improves the capability of the LSTM in modeling the global distribution of MTS.
Memory Module Analysis
We analyze the contribution of the memory module. We create an ablated variant by removing the memory module from LGnet, and use only the local statistic features as the input for LSTM. The performance of this variant is reported in Table 2. LGnet significantly outperforms the variant, indicating that modeling global temporal dynamics with the memory module benefits the forecasting. Moreover, the performance improvement of LGnet over the variant is relatively bigger as the missing ratio rises. This is because local statistic features are less reliable with a high missing ratio. Under such circumstances, it is vital to leverage global patterns stored in the memory component as support to estimate missing values.
[Table 2: Performance of LGnet and its ablated variants on Porto Taxi and London Weather.]
Adversarial Schema Analysis
We further study the effectiveness of adversarial training. The parameter $\lambda$ balances the weight between the forecasting loss and the adversarial part. A variant of LGnet without the adversarial training (i.e., $\lambda = 0$) is also evaluated, and its performance is reported in Table 2. Clearly, the adversarial training contributes substantially to the forecasting, reducing RMSE by 2% to 10% under different circumstances. More concretely, more significant error reductions occur on the London Weather dataset compared with the Porto Taxi dataset. One possible reason is that MTS from the London Weather dataset contain more variables and provide a better description of the realistic data distribution. Besides, improvements from incorporating the discriminator are relatively greater when the missing ratio increases. This is because the original MTS forecasting objective is less efficient with a high missing ratio, as it only relies on observed parts of the time series. In conclusion, it is beneficial to introduce adversarial training for LGnet.
We investigate the sensitivity of $\lambda$, which balances the forecasting loss and the adversarial training part. Generally, more emphasis is put on the forecasting part when $\lambda$ is closer to 0. We alter the value of $\lambda$ among and report the performance of LGnet on the Beijing Air dataset with . As shown in Table 3, the forecasting accuracy of LGnet first increases as $\lambda$ becomes larger. However, extremely large values of $\lambda$ result in low performance.
In this paper, we investigate a novel problem of exploring local and global temporal dynamics for MTS forecasting with missing values. We propose a new framework LGnet, which adopts a memory network to capture global temporal patterns using local statistics as keys. To make the generated MTS more realistic, we further adopt adversarial training to enhance the modeling of global temporal data distribution. Experimental results on four large-scale real-world datasets show the efficacy of LGnet.
This material is based upon work supported by, or in part by, the National Science Foundation (NSF) under grant #1909702.