1 Introduction
A number of time series applications naturally produce missing values. Examples include electronic health records (EHR) consisting of patient visits in which not every possible test is reported during every visit, perhaps due to the cost of running healthcare tests. Other examples include climate/weather data, ecology, and astronomy. Historically, such missing data has been handled by imputation, which requires choosing a function to impute with, followed by modeling the imputed time series with well-understood time series modeling approaches [boxjenkins:book08]. Given that modern data collection techniques allow voluminous data to be collected with relative ease, approaches involving imputation have fallen out of favor, especially in cases where imputing the missing values would cause an explosion in data size. For these reasons, a number of recent works have explored a variety of approaches, ranging from recurrent neural networks to attention mechanisms, to model time series data in its raw form, without imputation.
Given that patient EHR has been the primary benchmark in most of the recent works, certain commonalities exist. For instance, while an EHR may contain multiple healthcare indicators in each visit, it is unlikely that one patient’s healthcare indicators will correlate with another patient’s. Thus, in almost all of the recent works, it is easy to determine which variables are reported at the same (irregularly sampled) timestep and which are not. Another commonality is to test the approaches on downstream classification tasks (e.g., does the patient suffer from diabetes?) rather than standard time series tasks such as forecasting. We tackle the much more challenging setting where different spatially located sensors record multiple physical attributes whose values may be correlated with each other depending on how far apart the sensors are located. Figure 1 shows how the sensors are located across the lab. Each sensor reports four attributes: temperature, humidity, light, and voltage of the battery in the sensor. Each time a sensor reports, it reports the value of all four attributes, but different sensors report their attributes independently. In other words, it is not immediately clear which sensors’ reporting patterns are correlated. Moreover, sensors can break. For instance, towards the end of the reporting period the data (available at http://db.csail.mit.edu/labdata/labdata.html) contains unrealistically high temperature values exceeding 100 °C.
In this paper, we show how to model the above setting using recurrent neural networks. As mentioned above, our setting is more challenging than the standard setting of time series with missing data on at least two counts: 1) we are not given which multivariate attributes are correlated with each other and need to learn such correlations, and 2) in our setting sensors can break and may report erroneous values. Moreover, we evaluate the proposed methods on downstream forecasting tasks, as opposed to the classification tasks that have been the focus of previous works. In terms of results, we show that a fairly simple model outperforms GRU-D [Che et al., 2018], a state-of-the-art technique for modeling time series data with missing values (see [RiseDise] for a comparison of various techniques, in which GRU-D turned out to be one of the strongest baselines).
2 Related Work
Machine learning has a long history in time series modelling (Mitchell, 1999; Gers et al., 2000; Wang et al., 2006; Chung et al., 2014). However, recent massive real-world data collections increase the need for models capable of handling such complex data. Often this data is multivariate and is collected at irregular intervals, leading to missing values. The above-mentioned techniques do not handle multivariate, irregularly sampled data with missing values; handling this sporadic data is the main difficulty.
To address irregular sampling, a popular approach is to recast observations into fixed-duration time bins. However, this representation results in missing observations both in time and across feature dimensions, which makes the direct usage of neural network architectures tricky. To overcome this issue, the main approach consists of some form of data imputation combined with jointly feeding the observation mask and the times of observations to the recurrent network (Che et al., 2018; Choi et al., 2016a; Lipton et al., 2016; Du et al., 2016; Choi et al., 2016b; Cao et al., 2018). This approach relies strongly on the assumption that the network will learn to process true and imputed samples differently. Despite some promising experimental results, there is no guarantee that it will do so. Some researchers have tried to alleviate this limitation by introducing a more meaningful data representation for sporadic time series (Rajkomar et al., 2018; Razavian & Sontag, 2015; Ghassemi et al., 2015). [RiseDise] evaluates a number of the approaches described above by casting them as specific instances of the RISE framework (Recursive Input and State Estimation), which allows an internal hidden state vector to be computed (usually by some variant of the RNN family), a transformed hidden state vector (usually an older hidden state decayed depending on how long ago it was computed), and a transformed input that allows for the inclusion of some imputation mechanism. One of its primary observations is that current techniques lack mechanisms to handle long-horizon forecasts. Their experiments are conducted on a univariate time series (patients with glucose measurements made every 5 minutes) and a 2-variable task (PM2.5 and PM10 measured every hour across monitoring stations in Beijing).
Among slightly older works, RETAIN [choi:nips16] combines attention with RNNs to address missing data in EHR while retaining some semblance of interpretability, so that doctors can gain trust in the underlying modeling mechanism and the predictions made. Notably, they run their RNNs in reverse, i.e., starting at the most recent observed values and iterating into the past, and they do not make use of the time lags among the observations. Doctor AI [choi:mlhc16]
uses various codes available in the medical domain (e.g., diagnostic codes such as ICD) to feed into an RNN to construct a hidden representation of the patient’s status. They describe an approach to learn embeddings for the input codes using word2vec
[mikolov:nips13] to feed to the RNN, besides feeding in the time lag (delta) since the last observation. T-LSTM [baytas:kdd17], like GRU-D, also proposes to discount the state of the RNN depending on how long ago the previous observation was made, by dividing the RNN hidden state vector into a long-term component and a short-term component.
Others have addressed the missing data problem with generative probabilistic models. Among those, (multi-task) Gaussian processes (GPs) are by far the most popular (Bonilla et al., 2008). They have been used for smart imputation before an RNN or CNN architecture (Futoma et al., 2017; Moor et al., 2019), for modelling a hidden process in joint models (Soleimani et al., 2018), and to derive informative representations of time series (Ghassemi et al., 2015). GPs have also been used for direct forecasting (Cheng et al., 2017). However, they usually suffer from high uncertainty outside the observation support, are computationally intensive, and learning the optimal kernel is tricky. Neural Processes, a neural version of GPs, have also been introduced by Garnelo et al. (2018).
Most recently, the seminal work of Chen et al. (2018) suggested a continuous version of neural networks that overcomes the limits imposed by discrete-time recurrent neural networks. Coupled with a variational autoencoder architecture (Kingma & Welling, 2013), it proposed a natural way of generating irregularly sampled data. However, it transferred the difficult task of processing sporadic data to the encoder, which is a discrete-time RNN. Related autoencoder approaches with sequential latents operating in discrete time have also been proposed (Krishnan et al., 2015, 2017). These models rely on classical RNN architectures in their inference networks, hence not addressing the sporadic nature of the data. Moreover, while they have been shown useful for smoothing and counterfactual inference, their formulation is less suited for forecasting. Importantly, the ability of the GRU to learn long-term dependencies remains a significant advantage.
Most of these models focus on classification, whereas we focus on forecasting the values along with the times at which they are expected to occur.
3 Methods
In this section we first formally define the problem statement, then discuss some preliminaries, and finally present the proposed model.
3.1 Problem Statement
We consider the general problem of one-step forecasting for data from sensors, where D climate longitudinal variables can potentially be measured. We pick a single longitudinal variable (D = 1). We use an autoregression hyperparameter w. T denotes the length of the sequence data (number of time steps), and each window of w steps constitutes a time series. The time series are obtained by taking windows of w steps over the D-dimensional data, where consecutive sequences overlap on w − 1 common steps. Each time series is measured at time points specified by a vector of observation times s = (s_1, …, s_T). Let s_t denote the timestamp when the t-th observation is obtained; we assume that the first observation is made at timestamp 0 (i.e., s_1 = 0). The values of these observations are specified by a matrix of observations X ∈ R^(T×D), an observation mask M ∈ {0, 1}^(T×D), and a matrix Δ ∈ R^(T×D) specifying the time difference since each variable was last observed. To be more specific, for a given D-dimensional time series of length T, we have:

    m_t^d = 1 if x_t^d is observed, and m_t^d = 0 otherwise;
    δ_t^d = s_t − s_{t−1} + δ_{t−1}^d   if t > 1 and m_{t−1}^d = 0,
    δ_t^d = s_t − s_{t−1}               if t > 1 and m_{t−1}^d = 1,
    δ_t^d = 0                           if t = 1,

where x_t^d and m_t^d are both scalars and stand for the respective values of the d-th dimension at time step t. In this paper, we are interested in time series forecasting, where we predict the value x_{t+1} given the time series data (x_{t−w+1}, …, x_t), where t + 1 ≤ T.
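As an illustration of the definitions above, the mask and the per-variable time-difference matrix can be built from raw timestamps and NaN-coded observations as follows (a minimal numpy sketch; the function and variable names are ours):

```python
import numpy as np

def build_mask_and_delta(x, s):
    """x: (T, D) observations with NaN marking missing values.
    s: (T,) timestamps with s[0] == 0.
    Returns mask m (1 = observed) and delta (time since each
    variable was last observed), both of shape (T, D)."""
    T, D = x.shape
    m = (~np.isnan(x)).astype(float)
    delta = np.zeros((T, D))
    for t in range(1, T):
        for d in range(D):
            gap = s[t] - s[t - 1]
            # accumulate the gap while the variable stays unobserved
            delta[t, d] = gap if m[t - 1, d] == 1 else gap + delta[t - 1, d]
    return m, delta
```

For a series observed at timestamps (0, 5, 12) in which a variable is missing at the middle step, delta keeps growing for that variable until it is observed again.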
3.2 Preliminaries: The Gated Recurrent Unit
Recurrent Neural Networks (RNNs), such as the Long Short-Term Memory (LSTM) [hochreiter1997long] and the Gated Recurrent Unit (GRU) [cho2014learning], have many useful properties, such as strong prediction performance and the ability to capture long-term temporal dependencies (like seasonality) and variable-length observations. They have been shown to achieve state-of-the-art results in many applications with time series or sequential data. RNNs for missing data have been studied in the past [DBLP:journals/corr/ChePCSL16, bengio1996recurrent, kim2018temporal, parveen2002speech].
Given our setting, these RNNs are well suited for time series forecasting. The sensor data has seasonality and trend in addition to irregular samples with missing values, so we need a model that can exploit the correlation between the variables and also capture the seasonality; the recursive nature of RNNs allows it to do both. As the parameters are shared across time steps, the number of parameters is relatively small. We use the GRU for our experiments, but similar techniques are also valid for other variants of RNNs.
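For concreteness, one step of a GRU cell (whose update equations are given next) can be sketched in plain numpy; the parameter dictionary and its key names are our own convention:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, params):
    """One GRU update: update gate z, reset gate r, candidate state.
    params holds the weight matrices W*, U* and bias vectors b*."""
    Wz, Uz, bz = params["Wz"], params["Uz"], params["bz"]
    Wr, Ur, br = params["Wr"], params["Ur"], params["br"]
    W,  U,  b  = params["W"],  params["U"],  params["b"]
    z = sigmoid(Wz @ x + Uz @ h_prev + bz)            # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev + br)            # reset gate
    h_tilde = np.tanh(W @ x + U * 0 + U @ (r * h_prev) + b)  # candidate state
    return (1 - z) * h_prev + z * h_tilde             # new hidden state
```

With all-zero parameters both gates evaluate to 0.5 and the candidate state is zero, so the hidden state is simply halved at each step.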
Figure 2 shows the structure of the GRU. For each j-th hidden unit, the GRU has a reset gate r_t^j and an update gate z_t^j to control the hidden state h_t^j at each time t. The update functions are:

    z_t = σ(W_z x_t + U_z h_{t−1} + b_z),
    r_t = σ(W_r x_t + U_r h_{t−1} + b_r),
    h̃_t = tanh(W x_t + U (r_t ⊙ h_{t−1}) + b),
    h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t,

where the matrices W_z, W_r, W, U_z, U_r, U and the vectors b_z, b_r, b are model parameters. Here σ represents the element-wise sigmoid function, and ⊙ stands for element-wise multiplication. This setup assumes that all the inputs are observed, and hence the missing values need to be either explicitly or implicitly imputed. Depending on the task, classification or regression, an appropriate last layer, sigmoid/softmax or dense respectively, is applied to the output of the GRU at the last time step.

3.3 Proposed Model
Inspired by GRU-Simple [che2018recurrent], our model takes w consecutive timesteps from the vector x, with the corresponding m and δ concatenated, as the input to the GRU. The target vector is the concatenation of x and δ for the next timestep. More specifically, the input at time step t is the concatenation [x_t; m_t; δ_t].
The missing values are imputed using forward imputation. More formally, for a given sequence x,

    x_t^d ← x_{t'}^d,

where t' < t is the last time the d-th variable was observed.
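The forward-imputation rule above can be sketched as follows (plain numpy; the function name and the choice of 0.0 for variables never yet observed are ours):

```python
import numpy as np

def forward_impute(x, fill_value=0.0):
    """Replace each NaN with the last observed value of that
    variable; values missing from the start get fill_value."""
    x = x.copy()
    T, D = x.shape
    last = np.full(D, fill_value)
    for t in range(T):
        observed = ~np.isnan(x[t])
        last[observed] = x[t, observed]    # update last-seen values
        x[t, ~observed] = last[~observed]  # carry them forward
    return x
```

A pandas ffill over each column would achieve the same effect.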
We differ from GRU-Simple [che2018recurrent] in three aspects:

1. We focus on forecasting instead of classification.
2. We not only forecast the values but also the time at which they are expected to occur.
3. We use a custom loss function which optimizes only over the present values.
We can consider our neural network as a function f parameterized by θ. Then the output is ŷ_t = f_θ(x_{t−w+1}, …, x_t). The loss for the input sequence is

    L = Σ_t m_{t+1} ⊙ ℓ(ŷ_t, x_{t+1}),

where ℓ is chosen to be the Huber loss in order to minimize the Mean Absolute Error (MAE); multiplying by the mask m_{t+1} ensures that the loss is optimized only over the present values. The same loss function is used for the delta prediction layer too. Figure 3 shows the model.
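A sketch of the masked loss, averaging the Huber loss over observed entries only (plain numpy; the Huber threshold of 1.0 and the function name are our assumptions):

```python
import numpy as np

def masked_huber_loss(y_pred, y_true, mask, threshold=1.0):
    """Huber loss averaged over observed entries only (mask == 1)."""
    err = np.abs(y_pred - y_true)
    quadratic = 0.5 * err ** 2                    # small errors: squared
    linear = threshold * err - 0.5 * threshold ** 2  # large errors: linear
    loss = np.where(err <= threshold, quadratic, linear)
    return (loss * mask).sum() / np.maximum(mask.sum(), 1.0)
```

Masking out missing targets means a wildly wrong "prediction" at an unobserved step contributes nothing to the gradient, which is exactly the behavior motivated in Section 4.4.2.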
4 Evaluation
This section provides details about the dataset and the results obtained using different techniques, and then describes the properties and shortcomings of the proposed approach.
4.1 Dataset
We use Intel’s Berkeley Lab data from 2005, which consists of 58 sensors measuring 4 physical attributes: temperature, humidity, voltage, and light. Of these 4, we select temperature for a subset of 4 sensors (sensor IDs 1, 2, 3, and 4) for simplicity and faster experimentation. The original data at microsecond granularity is compressed to second-level granularity. The details for the temperature variable in the dataset are given below.
Timesteps | #Sensors | Temperature Avg (°C) | Delta Avg (s) | Sparsity
----------|----------|----------------------|---------------|---------
968,535   | 58       | 39                   | 62            | 0.04117
Details of the subset used for experiments:
Timesteps | #Sensors | Temperature Avg (°C) | Delta Avg (s) | Sparsity
----------|----------|----------------------|---------------|---------
167,875   | 4        | 38.77                | 61.64         | 0.3587
The test set is created by holding out the last 30% of the dataset. See Figure 5 for the plot.
4.2 Results
We compare our technique with GRU-D and GRU-ODE-Bayes on the above dataset.
Technique     | Temperature MAE (°C) | Delta MAE (s)
--------------|----------------------|--------------
GRU-Simple    | 2.89                 | 18.2
GRU-D         | 3.47                 | n/a
GRU-ODE-Bayes | 3.58                 | n/a
Figure 6 compares GRU-D and GRU-Simple for different temperature cutoffs on the test set, showing that GRU-Simple performs better on the part of the test set without anomalies.
GRU-D is sensitive to shuffling the dataset. The plots in Figure 9 were obtained by training on an unshuffled training set and then evaluating on an unshuffled test set (left: zoomed-in version; right: zoomed-out version). The dataset was also shuffled and tested, and the plots obtained were similar. Why certain irregularities occur in these plots is difficult to explain. In the next section we examine some properties of GRU-Simple.
4.3 GRU-Simple Properties
The plots in Figure 7 overlay forecasted temperatures over ground-truth temperatures, while the plots in Figure 8 do the same for the delta predictions. The δ values across sensors are highly uncorrelated, yet GRU-Simple performs well despite that. The temperatures for the 4 sensors are highly correlated, and the predictions are likewise highly correlated. Figure 10 shows the plots.
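The correlation claims above can be checked numerically; a small sketch with synthetic data standing in for the real sensor series (the shared sinusoid is our assumption, chosen only to mimic a common daily cycle):

```python
import numpy as np

def pairwise_correlation(series):
    """series: (num_sensors, T) array of aligned readings.
    Returns the num_sensors x num_sensors Pearson correlation matrix."""
    return np.corrcoef(series)

# synthetic stand-in for four sensors tracking a shared daily cycle
np.random.seed(0)
t = np.linspace(0, 4 * np.pi, 500)
temps = np.stack([np.sin(t) + 0.05 * np.random.randn(500) for _ in range(4)])
corr = pairwise_correlation(temps)
```

Off-diagonal entries close to 1 indicate the strongly correlated behavior observed for the temperature predictions, while the δ series would yield near-zero off-diagonal entries.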
4.4 Imputation and loss function
In this section, we justify the imputation and the loss function used. Later we also discuss the problems with GRU-Simple.
4.4.1 Is imputation important?
As Figure 11 shows, a reasonable imputation technique is necessary. If no imputation is performed, the missing values are replaced by 0 by default, since the model cannot take NaNs as input. In our model, we use forward imputation. The literature on imputation is already replete with various techniques, so imputation itself is not the focus of our work.
4.4.2 Is directly optimizing the present values important?
As Figure 12 shows, it is necessary. The predictions are sometimes jagged, sometimes smooth. Optimizing the present values directly makes sure the correct values are optimized, and the prediction plot becomes smoother.
4.4.3 Problems with GRU-Simple
GRU-Simple has two problems: 1) the temperature and delta outputs are predicted jointly and end up highly correlated, even though the time at which a sensor reports is essentially random and independent of the measured value, and 2) the predictions for the different sensors are highly correlated, so an anomalous reading from one (possibly broken) sensor can corrupt the forecasts for the others.
To overcome the above problems we tried a few variations as discussed in the next section.
5 Variations on GRU-Simple
The motivation behind the variations is to overcome the two shortcomings discussed in the previous section. The next few subsections explain them.
5.1 Bi-Layer GRU
This approach was taken to overcome the high correlation between the outputs. The motivation is to treat temperature as a variable dependent on δ: the time at which a particular sensor reports a value is random and hence should be treated as an independent variable. Figure 14 shows the model.
The above approach resulted in temperature predictions that were affected by the delta predictions, as seen in the plots in Figure 15, and hence it cannot be used reliably.
5.2 GRU-Velocity
The motivation here is to obtain the next temperature prediction from the current value and a predicted rate of change (velocity) integrated over the predicted time gap:

    x_{t+1} = x_t + v_t · δ_{t+1}
The architecture is shown in Figure 16.
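The velocity-style update, in which the network's predicted rate of change is integrated over the predicted time gap (our reading of the formulation; names are ours), can be sketched as:

```python
def velocity_step(x_t, v_t, delta_next):
    """Next-value prediction as the current value plus the
    predicted rate of change times the predicted time gap."""
    return x_t + v_t * delta_next
```

For example, a current temperature of 20 °C with a predicted rate of 0.5 °C/s over a 4 s gap yields a forecast of 22 °C.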
Although this added “momentum" to the predictions, as seen in the next plot (Figure 17), the predictions for all sensors still remain highly correlated.
Notice that, due to this “momentum", the orange curve in Figure 17 is no longer directly affected by the other sensor at around step 2500. However, the overall predictions do not improve and remain correlated.
6 Future Work
To overcome the problems in GRU-Simple, the approach suggested by [shukla2019interpolation] looks promising. A “covariance layer" can be introduced before feeding the inputs to the GRU, which “decorrelates" the input in proportion to the trust in a particular sensor’s input: if a value is abnormally high, the trust should be reduced, and vice versa. Other possible extensions are:

- Multi-step forecasting: sequence-to-sequence models.
- Imputation: a learned mechanism instead of forward imputation.
- Scalability: to scale to more sensors and larger datasets, batch-wise loading with imputation can be used; the Keras library, for instance, supports custom data loaders.
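A framework-agnostic sketch of such a batch-wise loader, performing forward imputation per window as batches are produced (names and the per-window fill value of 0.0 are ours; in Keras the same pattern fits a custom keras.utils.Sequence):

```python
import numpy as np

def batch_generator(x, window, batch_size):
    """Yield batches of overlapping windows from a (T, D) series,
    with NaNs forward-imputed inside each window."""
    starts = np.arange(0, len(x) - window)
    for i in range(0, len(starts), batch_size):
        batch = []
        for s in starts[i:i + batch_size]:
            w = x[s:s + window].copy()
            for d in range(w.shape[1]):
                last = 0.0
                for t in range(w.shape[0]):
                    if np.isnan(w[t, d]):
                        w[t, d] = last   # carry last value forward
                    else:
                        last = w[t, d]
            batch.append(w)
        yield np.stack(batch)
```

Because imputation happens per batch, the full imputed dataset never needs to be materialized in memory.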
While [che2018recurrent] presents the GRU-D model, which can be broadly applied to datasets with missing values, its assumption that a missing value decays toward the mean value may not always hold. For instance, in stock market prediction there is no meaningful mean value to decay to. Also, as the authors themselves state, the GRU-D model cannot be used in unsupervised settings without prediction labels. Furthermore, all the recurrent neural network models discussed here are discrete in nature, whereas multivariate irregularly sampled time series are better suited to continuous modelling of the data, since we are no longer dealing with fixed time steps. A general model that works in both supervised and unsupervised settings, without the assumptions of discreteness and of decay to some constant value, would address the broadest range of problems in this setting; [DBLP:journals/corr/abs190703907] looks promising.