
Multivariate Time Series Imputation by Graph Neural Networks

Dealing with missing values and incomplete time series is a labor-intensive, time-consuming and inevitable task when handling data coming from real-world applications. Effective spatio-temporal representations would allow imputation methods to reconstruct missing temporal data by exploiting information coming from sensors at different locations. However, standard methods fall short in capturing the nonlinear time and space dependencies existing within networks of interconnected sensors and do not take full advantage of the available - and often strong - relational information. Notably, most state-of-the-art imputation methods based on deep learning do not explicitly model relational aspects and, in any case, do not exploit processing frameworks able to adequately represent structured spatio-temporal data. Conversely, graph neural networks have recently surged in popularity as both expressive and scalable tools for processing sequential data with relational inductive biases. In this work, we present the first assessment of graph neural networks in the context of multivariate time series imputation. In particular, we introduce a novel graph neural network architecture, named GRIL, which aims at reconstructing missing data in the different channels of a multivariate time series by learning spatio-temporal representations through message passing. Preliminary empirical results show that our model outperforms state-of-the-art methods in the imputation task on relevant benchmarks, with mean absolute error improvements often higher than 20%.





1 Introduction

Imputation of missing values is a prominent problem in multivariate time-series analysis (TSA) from both theoretical and practical perspectives (Yi et al., 2016; White et al., 2011; Che et al., 2018; Fedus et al., 2018). In a world of complex interconnected systems, such as sensor networks or the Internet of Things, faulty sensors, network failures and physical interferences are widespread phenomena that cause disruptions in the data acquisition process. Luckily, failures of this type are often sparse and localized at the single-sensor level, i.e., they do not compromise the whole sensor network at once. In other terms, it is often the case that, at a certain time step, missing data appear only in some of the channels of the resulting multivariate time series. Such missing data can affect multiple channels, are not necessarily synchronized in time, and can show different transient or even permanent dynamics. In this context, spatio-temporal imputation methods (Yi et al., 2016; Yoon et al., 2018b) aim at reconstructing the missing parts of the signal by exploiting both temporal and spatial dependencies. In particular, effective spatio-temporal approaches reconstruct missing values by taking into account not only past and future values, but also the concurrent measurements of spatially close neighbouring sensors. Here, spatial similarity does not necessarily mean physical (e.g., geographic) proximity, but rather indicates that the considered sensors are related in terms of a generic (quantifiable) functional dependency (e.g., Pearson correlation or Granger causality) and/or that they are close in a certain latent space. Relational information, then, can be interpreted as a set of constraints that allow replacing a malfunctioning sensor with a virtual one.

Among different imputation methods, approaches based on deep learning (LeCun et al., 2015; Schmidhuber, 2015; Goodfellow et al., 2016) have become increasingly popular (Yoon et al., 2018a; Cao et al., 2018; Liu et al., 2019). These methods often completely disregard available relational information or rely on rather simplistic modifications of standard neural architectures tailored for sequence processing (Hochreiter and Schmidhuber, 1997; Chung et al., 2014; Bai et al., 2018; Vaswani et al., 2017). We argue that stronger, structural inductive biases are needed to advance the state of the art in this direction and to build effective inference engines in the context of the large and complex sensor networks found in real-world applications.

In this preliminary work, we model input time series as sequences of graphs where edges represent the relationships among different channels at each time step. We propose graph neural networks (GNNs) (Scarselli et al., 2008; Bronstein et al., 2017; Battaglia et al., 2018) as the building block of a novel, bidirectional, recurrent neural network for multivariate time series imputation (MTSI). Our method, named Graph Recurrent Imputation Layer (GRIL), has at its core a recurrent neural network cell whose gates are implemented by message-passing neural networks (MPNNs). Two of these networks process the input multivariate time series in the forward and backward directions at each node, while hidden states are processed by a message-passing imputation layer that exploits information at neighboring nodes to perform imputation.

Our contributions are manifold: 1) we introduce graph neural networks in the context of MTSI, 2) we propose a novel, practical and effective implementation of a GNN-based architecture for MTSI, and 3) we achieve state-of-the-art results on several varied MTSI benchmarks. Our method does not rely on any assumption on the distribution of the missing values (e.g., presence and duration of transient dynamics and/or length of missing sequences) other than stationarity of the underlying process. The rest of the paper is organized as follows. In Section 2 we discuss related works. Then, in Section 3, we formally introduce the problem setting and the task of MTSI. We present our approach to MTSI in Section 4, describing the novel methodological framework to implement imputation architectures based on GNNs. We proceed with an empirical evaluation of the presented method against state-of-the-art baselines in Section 5 and, finally, we draw our conclusions in Section 6.

2 Related works

Time series imputation

There exists a large literature addressing missing value imputation in time series. Besides simple and standard interpolation methods based on linear and polynomial curve fitting, popular approaches aim at filling in missing values by taking advantage of the similarities among time series. For example, several approaches rely on k-nearest neighbors (Troyanskaya et al., 2001; Beretta and Santaniello, 2016) or matrix factorization (Cichocki and Phan, 2009; Mei et al., 2017). Other standard methods exploit the expectation-maximization algorithm (Ghahramani and Jordan, 1994; Nelwamondo et al., 2007) or linear predictors and state-space models (Durbin and Koopman, 2012; Kihoro et al., 2013; Moritz and Bartz-Beielstein, 2017). A more advanced method, namely STMVL (Yi et al., 2016), combines spatial and temporal information in a hybrid solution consisting of standard forecasting techniques and methods from the recommender systems literature to impute geographically tagged time series.

More recently, several deep learning approaches have been proposed for MTSI. Among others, methods based on recurrent neural networks found widespread success (Lipton et al., 2016; Yoon et al., 2018b; Che et al., 2018; Luo et al., 2018; Liu et al., 2019; Luo et al., 2019). In particular, Cao et al. (2018) propose BRITS, a bidirectional recurrent neural network for multivariate time series imputation that takes into account the correlation among different channels to perform the imputation. Other successful strategies in the literature exploit adversarial training to generate realistic reconstructed sequences (Yoon et al., 2018a; Fedus et al., 2018; Luo et al., 2018, 2019). For example, Yoon et al. (2018a) use a GAN (Goodfellow et al., 2014) to learn models that match the underlying data generating process. Liu et al. (2019), instead, use adversarial learning to train a multi-scale model that imputes highly sparse time series in a hierarchical fashion. However, we argue that none of the above-cited methods can take full advantage of relational information and spatio-temporal dependencies, unless the time series are generated by linear time-invariant state-space models. Most importantly, the above methods lack the flexibility and expressiveness enabled by operating in the context of graph processing.

Graph neural networks for TSA

Graph neural networks have been exploited in TSA mostly to implement spatio-temporal forecasting methods. The basic idea behind most of the methods in the literature is to modify standard neural network architectures for sequential data by relying on operators that work in the graph domain. For example, Seo et al. (2018) propose a gated recurrent graph neural network where the standard gates used by a GRU cell (Chung et al., 2014) are implemented by spectral GNNs (Defferrard et al., 2016); Li et al. (2018) propose an analogous architecture replacing spectral GNNs with a diffusion-convolutional network (Atwood and Towsley, 2016). Yu et al. (2017) and Wu et al. (2019) propose, instead, spatio-temporal convolutional neural networks that alternate convolutions on the temporal and spatial dimensions. Similar approaches have also been studied in the context of attention-based sequence models (Vaswani et al., 2017), with Transformer-like architectures alternating spatial and temporal attention layers (Zhang et al., 2018; Cai et al., 2020). Another particularly interesting line of research concerns the problem of learning, end-to-end, the best graph structure to process an input multivariate time series (Kipf et al., 2018; Wu et al., 2020; Shang et al., 2020).

While the above approaches focus on multivariate time series prediction and use the graph structure only as a representational tool, other methods aim at predicting changes in the graph topology (Zambon et al., 2019; Paassen et al., 2020). Methods such as Temporal Graph Networks (Rossi et al., 2020), instead, are tailored to learn node embeddings in dynamical graphs and adopt an event-based paradigm that is out of the scope of this work. Finally, recent works have proposed GNNs for imputing missing features in the context of i.i.d. data. Among others, Spinelli et al. (2020) propose an adversarial framework to train GNNs on the data reconstruction task, while You et al. (2020) propose a bipartite graph representation for feature imputation. To the best of our knowledge, no previous GNN-based method has targeted missing value imputation in the context of sequential data streams.

3 Preliminaries

Figure 1: An example of a graph stream, in which red circles denote nodes with missing values. Note that a node may have missing values in consecutive graphs, and multiple (possibly connected) nodes may have missing values in the same graph.

Sequences of graphs

We consider sequences of weighted directed graphs, where at each time step $t$ we observe a graph $\mathcal{G}_t$ with $N_t$ nodes. The graph is a couple $\mathcal{G}_t = \langle X_t, W_t \rangle$, where $X_t \in \mathbb{R}^{N_t \times d}$ is the node-attribute matrix which has as $i$-th row the $d$-dimensional node-attribute vector $x_t^i$ associated with the $i$-th node; the adjacency matrix $W_t \in \mathbb{R}^{N_t \times N_t}$, on the other hand, has entries $w_t^{i,j}$ which indicate the weight of the edge (if any) connecting the $i$-th and $j$-th node. We assume that nodes are identified, i.e., have a unique ID that enables time-wise consistent processing. This problem setting can be easily extended to more general classes of graphs with attributed edges and global attributes (Battaglia et al., 2018). In this work, we mainly focus on problems where the topology of the graph is fixed and does not change over time, i.e., $W_t = W$ and $N_t = N$ at each time step.

Any generic multivariate time series can be trivially described in the above framework by letting each channel of the sequence correspond to a node and using the identity matrix as adjacency if no relational information is available. However, this would defeat the purpose of the formulation: a more proper choice of $W$ could be made using any standard similarity measure (e.g., Pearson correlation) or a (thresholded) kernel; a more advanced approach could instead aim at learning a suitable adjacency directly from data by using, for instance, spatial attention scores or graph learning techniques, e.g., Kipf et al. (2018).
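As a concrete illustration of the first option, the sketch below builds a weighted adjacency matrix from absolute Pearson correlations between channels, zeroing weights below a threshold; the function name and the threshold value are illustrative choices, not prescribed by the text.

```python
import numpy as np

def correlation_adjacency(x, threshold=0.5):
    """Build a weighted adjacency matrix for a multivariate series.

    x: array of shape (time, channels). Edge weights are absolute
    Pearson correlations between channels, thresholded, with
    self-loops removed. Threshold is an illustrative choice.
    """
    corr = np.abs(np.corrcoef(x, rowvar=False))  # (channels, channels)
    adj = np.where(corr >= threshold, corr, 0.0)
    np.fill_diagonal(adj, 0.0)                   # no self-loops
    return adj
```

Any other similarity (e.g., a thresholded kernel over pairwise distances, as used later for the air quality experiment) can be swapped in without changing the downstream processing.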

Multivariate time series imputation

To model the presence of missing values, we augment, at each step, the graph $\mathcal{G}_t$ with a binary mask $M_t \in \{0,1\}^{N_t \times d}$, where each row $m_t^i$ indicates which of the corresponding node attributes in $x_t^i$ are available in $X_t$. In particular, $m_t^{i,j} = 0$ implies that $x_t^{i,j}$ is not valid; conversely, if $m_t^{i,j} = 1$, then $x_t^{i,j}$ stores the actual sensor reading. We indicate with $Y_t$ the ground-truth node-attribute matrix, i.e., the complete node-attribute matrix without any missing data. We do not make any special assumption on the distribution of the missing values.

The objective of MTSI is to impute missing values in a sequence of input data. More formally, given a sequence of length $T$, we can define the reconstruction error as

$$\mathcal{L}\left(\hat{X}_{[t,t+T]}, Y_{[t,t+T]}, \overline{M}_{[t,t+T]}\right) = \sum_{h=t}^{t+T} \frac{\left\langle \overline{M}_h,\, \ell(\hat{X}_h, Y_h) \right\rangle}{\left\langle \overline{M}_h, \overline{M}_h \right\rangle}, \qquad (1)$$

where $\hat{X}_h$ is the reconstructed $X_h$, $\overline{M}_h$ and $\overline{m}_h^i$ are respectively the logical negation of $M_h$ and $m_h^i$, $\ell(\cdot,\cdot)$ is an element-wise error function (e.g., absolute or squared error) and $\langle \cdot, \cdot \rangle$ indicates the standard dot product. Note that, in practice, it is impossible to have access to $Y_h$ and, as a consequence, it is necessary to define a surrogate optimization objective by, for example, using a prediction loss or generating synthetic missing values.
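A minimal sketch of this masked reconstruction error, using the absolute error as the element-wise function and a mask convention of 1 at observed positions (names are illustrative; in practice the ground truth is only available at positions where missingness is simulated):

```python
import numpy as np

def masked_mae(x_hat, y, mask):
    """Mean absolute error computed only at the missing positions.

    x_hat: reconstructed values; y: ground truth; mask: 1 where the
    value was observed in the input. The error is averaged over the
    negated mask, i.e., over the imputed entries only.
    """
    neg_mask = 1 - mask                       # 1 where data was missing
    err = np.abs(x_hat - y) * neg_mask        # element-wise error, masked
    return err.sum() / max(neg_mask.sum(), 1)  # guard against empty mask
```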

In the context of trainable, parametric imputation methods, we consider two different operational settings. In the first one, which we call in-sample imputation, the model is trained on the same input sequence on which it is evaluated. In the second one (referred to as out-of-sample imputation), the model is trained and evaluated on disjoint sequences. Note that, in both cases, the model does not have access to the ground-truth data used for evaluation. The first operational setting simulates the case where a practitioner fits the model directly on the target sequence to fill in its missing parts. The second, instead, simulates the case where one wishes to use a model fitted on historical data to impute missing values in an unseen target sequence.

4 Graph Recurrent Imputation Layer

In this section, we present our approach, the Graph Recurrent Imputation Layer (GRIL), a graph-based, recurrent neural network for MTSI. Given a multivariate time series $X_{[t,t+T]}$ with mask $M_{[t,t+T]}$, our objective is to reconstruct the input sequence by combining the information coming from both the temporal and spatial dimensions. To do so, we design an architecture based on two blocks: a spatio-temporal encoder and a spatial decoder. The spatio-temporal encoder maps the input sequence into a spatio-temporal representation $H_{[t,t+T]}$ by exploiting an ad-hoc designed recurrent GNN. The spatial decoder, instead, projects the learned representations back into the input space. In particular, the decoder is implemented by an MPNN which learns to infer, at each time step $t$, the observed value at the $i$-th node by considering – locally – only the hidden representation $h_{t-1}^i$ and the values observed at time $t$ at its neighboring nodes. In practice, in order to benefit from both the forward and backward temporal dynamics, we duplicate this encoder-decoder architecture and add a final layer to combine the two representations; the final imputation thus depends on the output of two parallel networks processing the sequence in opposite directions.

An overview of GRIL is given in Figure 2. As can be seen from the schema, we impute missing values iteratively, using at each time step the previously imputed values as input. In the following, we describe in detail the architecture of the encoder and decoder blocks and the computational steps required to perform imputation; we first present the unidirectional version of the model, and then provide the straightforward bidirectional extension.

Figure 2: An overview of GRIL. For simplicity, here we show only the unidirectional processing. We model a multivariate time series as a sequence of graph signals, in which each channel is mapped into a node. In this example, the network is processing the $t$-th step of a graph stream; in this step, two of the nodes have a missing value. GRIL performs a first imputation by means of the spatio-temporal representations. Afterwards, the imputation for each node is refined using messages received from its neighbors. Second-stage imputations are then used to continue the processing at the next step.

4.1 Spatio-temporal Encoder

In the encoder, the input sequence and the mask, $X_{[t,t+T]}$ and $M_{[t,t+T]}$, are processed sequentially, one step at a time, by means of a recurrent neural network whose gates are implemented by message-passing layers.

Message-passing gates

The gates of the GRIL encoder can be implemented by any MPNN (Gilmer et al., 2017). In particular, given $h_t^{i,k}$, i.e., the feature vector of the $i$-th node at the $k$-th layer, we consider the general class of message-passing layers described as

$$h_t^{i,k+1} = \gamma^k\!\left( h_t^{i,k},\ \bigoplus_{j \in \mathcal{N}(i)} \phi^k\!\left(h_t^{i,k}, h_t^{j,k}\right) \right), \qquad (2)$$

where $\mathcal{N}(i)$ is the set of neighbors of the $i$-th node in $\mathcal{G}_t$, $\gamma^k$ and $\phi^k$ are generic, differentiable, update and message functions (e.g., multilayer perceptrons), and $\bigoplus$ is a permutation-invariant, differentiable aggregation function (e.g., sum or mean). In particular, the $\bigoplus$ function specifies how to aggregate the messages received from the neighbors. For simplicity's sake, from now on we indicate with MPNN$(\cdot)$ the forward pass of a generic multi-layered message-passing neural network. In the following, we use MPNNs as the building blocks of our spatio-temporal feature extractors.
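A minimal numpy sketch of one such message-passing layer, with linear message and update functions and weighted-sum aggregation (a simplification: in the general scheme the message function may also depend on the receiving node's features; all names are illustrative):

```python
import numpy as np

def mp_layer(h, adj, w_msg, w_upd):
    """One generic message-passing layer with sum aggregation.

    h: node features (n, d); adj: weighted adjacency (n, n) with
    adj[i, j] the weight of the edge j -> i; w_msg, w_upd: linear
    stand-ins for the message function phi and update function gamma
    (the paper allows arbitrary MLPs here).
    """
    messages = adj @ (h @ w_msg)  # weighted sum of phi(h_j) over neighbors
    # gamma: combine each node's own features with its aggregated messages
    return np.tanh(np.concatenate([h, messages], axis=1) @ w_upd)
```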

Spatio-temporal recurrent cell

To learn the dynamics of the system, we leverage gated recurrent units (GRUs) (Cho et al., 2014). As previously mentioned, similarly to Seo et al. (2018) and Li et al. (2018), we implement the GRU gates by relying on the message-passing layers defined above. At the node level, the elements of the message-passing GRU (MPGRU) can be described by the following computational steps:

$$r_t^i = \sigma\!\left(\mathrm{MPNN}\!\left(\left[\hat{x}_t^{i,(2)} \,\|\, m_t^i \,\|\, h_{t-1}^i\right], \mathcal{G}_t\right)\right) \qquad (3)$$
$$u_t^i = \sigma\!\left(\mathrm{MPNN}\!\left(\left[\hat{x}_t^{i,(2)} \,\|\, m_t^i \,\|\, h_{t-1}^i\right], \mathcal{G}_t\right)\right) \qquad (4)$$
$$c_t^i = \tanh\!\left(\mathrm{MPNN}\!\left(\left[\hat{x}_t^{i,(2)} \,\|\, m_t^i \,\|\, r_t^i \odot h_{t-1}^i\right], \mathcal{G}_t\right)\right) \qquad (5)$$
$$h_t^i = u_t^i \odot h_{t-1}^i + \left(1 - u_t^i\right) \odot c_t^i \qquad (6)$$

where $r_t^i$ and $u_t^i$ are the reset and update gates, respectively, $h_t^i$ is the hidden representation of the $i$-th node at time $t$, and $\hat{x}_t^{i,(2)}$ is the output of the decoding block at the previous time-step (see the next subsection). The symbols $\odot$ and $\|$ denote the Hadamard product and the concatenation operation, respectively. The initial representation $h_{-1}^i$ can either be initialized as a constant or with a learnable embedding. Note that, for the steps where input data are missing, the encoder is fed with predictions from the decoder block, as explained in the next subsection. By carrying out the computation above time- and graph-wise, we get the encoded sequence $H_{[t,t+T]}$.
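The computational steps above can be sketched as follows. This is a simplified stand-in, not the paper's exact parametrization: each gate uses a plain one-hop graph convolution $(W + I)\,z\,\Theta$ in place of the diffusion convolution used in the experiments, and all names are illustrative.

```python
import numpy as np

def mpgru_step(x, m, h_prev, adj, params):
    """One MPGRU step at the graph level (sketch).

    x: current node inputs (n, d); m: mask (n, d); h_prev: hidden
    state (n, l); adj: adjacency (n, n); params: dict of weight
    matrices of shape (2 * d + l, l) each.
    """
    def gate(z, w, act):
        # one-hop message passing: mix each node with its neighbors
        return act((adj + np.eye(adj.shape[0])) @ z @ w)

    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    z = np.concatenate([x, m, h_prev], axis=1)
    r = gate(z, params["Wr"], sigmoid)                 # reset gate
    u = gate(z, params["Wu"], sigmoid)                 # update gate
    zc = np.concatenate([x, m, r * h_prev], axis=1)
    c = gate(zc, params["Wc"], np.tanh)                # candidate state
    return u * h_prev + (1 - u) * c                    # new hidden state
```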

4.2 Spatial Decoder

As a first decoding step, we generate one-step predictions from the hidden representations of the MPGRU by means of a linear readout

$$\hat{X}_t^{(1)} = H_{t-1} V_h + b_h, \qquad (7)$$

where $V_h$ is a learnable weight matrix and $b_h$ is a learnable bias vector. We then define the filler operator as

$$\Phi\left(\hat{X}_t\right) = M_t \odot X_t + \overline{M}_t \odot \hat{X}_t; \qquad (8)$$

intuitively, the filler operator replaces the missing values in the input with the values at the same positions in $\hat{X}_t$. By feeding $\hat{X}_t^{(1)}$ to the filler operator, we get the matrix $\Phi(\hat{X}_t^{(1)})$, i.e., $X_t$ with missing values replaced by the one-step-ahead predictions $\hat{X}_t^{(1)}$. The resulting node-level predictions are then concatenated to the mask $M_t$ and the hidden representation $H_{t-1}$, and processed by a final one-layer MPNN which computes for each node an imputation representation $s_t^i$ as

$$S_t = \mathrm{MPNN}\!\left(\left[\Phi\left(\hat{X}_t^{(1)}\right) \,\|\, M_t \,\|\, H_{t-1}\right], \mathcal{G}_t\right). \qquad (9)$$
Notice that, as previously highlighted, the imputation representations only depend on messages received from neighboring nodes: in fact, by aggregating only messages from the one-hop neighborhood, the representations $s_t^i$ are independent of the input features of the $i$-th node itself. This constraint forces the model to learn how to reconstruct a target input by taking into account spatial dependencies.

Afterwards, we concatenate the imputation representation $S_t$ with the hidden representation $H_{t-1}$ and the mask $M_t$, and we generate the second-stage imputations by using a second linear readout

$$\hat{X}_t^{(2)} = \left[S_t \,\|\, H_{t-1} \,\|\, M_t\right] V_s + b_s. \qquad (10)$$

Note that concatenating $H_{t-1}$ does not change the fact that the representation of each node (and the final prediction) is still independent of its input features at time step $t$. We use the filler operator to get an updated prediction $\Phi(\hat{X}_t^{(2)})$ of the reconstructed input. Then, we use $\Phi(\hat{X}_t^{(2)})$ as input to update the state of the MPGRU (Eq. 3 – 6) and proceed to the next input graph $\mathcal{G}_{t+1}$.
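The filler operator itself is a one-liner; a minimal sketch, with mask entries equal to 1 at observed positions:

```python
import numpy as np

def filler(x, x_hat, mask):
    """Filler operator: keep the observed entries of x (mask == 1) and
    replace the missing ones (mask == 0) with the predictions x_hat."""
    return mask * x + (1 - mask) * x_hat
```

The same function is applied to both the first-stage and second-stage predictions, so only the entries that were actually missing are ever overwritten.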

4.3 Bidirectional Model

Extending GRIL to account for both forward and backward dynamics is straightforward: we simply duplicate the architecture described in the two previous subsections. The first network processes the sequence in the forward direction (from $t$ to $t+T$), while the second one processes it in the backward direction (from $t+T$ back to $t$). The final imputation is then obtained with a multilayer perceptron (MLP) aggregating the representations extracted by the two networks:

$$\hat{y}_t^i = \mathrm{MLP}\!\left(\left[\,\overrightarrow{s}_t^i \,\|\, \overrightarrow{h}_{t-1}^i \,\|\, \overleftarrow{s}_t^i \,\|\, \overleftarrow{h}_{t+1}^i\,\right]\right), \qquad (11)$$

where the forward and backward arrows denote the forward and backward models, respectively. The final output can then be easily obtained by applying the filler operator.

5 Empirical evaluation

In this section, we empirically evaluate our approach against state-of-the-art baselines in three relevant application domains. In particular, we consider BRITS (Cao et al., 2018) as the principal competing alternative among non-adversarial approaches, as it shares architectural similarities with our method. However, it is worth noting that BRITS can only operate on multivariate time series with a fixed size, as opposed to GRIL. For all the experiments, we use as message-passing operator the diffusion convolution introduced by Atwood and Towsley (2016), which makes our recurrent GNN cell similar to DCRNN (Li et al., 2018). As additional baselines we consider: 1) MEAN, i.e., imputation using the node-level average; 2) KNN, i.e., imputation by averaging the values of the neighboring nodes with the highest weights in the adjacency matrix; 3) MICE (White et al., 2011), limiting the maximum number of iterations and the number of nearest features; 4) Matrix Factorization (MF) with a fixed rank.
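Baseline 2) can be sketched as follows; this is an illustrative implementation which assumes the neighbor readings used for each average are themselves observed (the function name and the default k are placeholders):

```python
import numpy as np

def knn_impute(x, mask, adj, k=2):
    """KNN baseline: replace each missing entry with the mean of the
    k neighbors having the largest adjacency weight.

    x, mask: arrays of shape (time, nodes), mask == 1 where observed;
    adj: weighted adjacency (nodes, nodes).
    """
    x_hat = x.copy()
    for i in range(x.shape[1]):
        nbrs = np.argsort(adj[i])[::-1][:k]   # top-k weighted neighbors
        fill = x[:, nbrs].mean(axis=1)        # average neighbor readings
        missing = mask[:, i] == 0
        x_hat[missing, i] = fill[missing]
    return x_hat
```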

We consider four different datasets coming from relevant applications; our approach achieves state-of-the-art performance on all of them. As performance metrics, we use the mean absolute error (MAE), mean squared error (MSE) and mean absolute percentage error (MAPE) computed over the imputation window.

5.1 Air quality

The problem of air pollution is nowadays ubiquitous in the world. The Urban Computing project (Zheng et al., 2014, 2015) published several datasets containing real measurements of different indices affecting human life in urban spaces. We consider as a benchmark the dataset regarding the air quality index (AQI). The complete dataset contains hourly measurements of six pollutants from 437 air quality monitoring stations, spread over 43 cities in China, over a period of one year (from May 2014 to April 2015). For this experiment, we consider only the pollutant PM2.5. Prior works on imputation (Yi et al., 2016; Cao et al., 2018) considered a reduced version of this dataset, including only 36 sensors (in the following, AQI-36). We evaluate our model on both datasets (i.e., the reduced and the complete one), to further show the advantages of our approach over the baselines when the number of dimensions increases. This dataset is particularly interesting as a benchmark for imputation due to the high rate of missing values in both AQI and AQI-36. In accordance with Yi et al. (2016), we consider as the test set the months of March, June, September and December. We consider both the in-sample and out-of-sample scenarios; in the latter, we do not consider windows overlapping with any of the test months. We use the same procedure as Yi et al. (2016) to simulate the presence of missing data for evaluation. Besides air quality readings, the dataset provides the geographical coordinates of each monitoring station. We create a weighted adjacency matrix by considering each station as a node and weighting edges according to a thresholded Gaussian kernel applied to pairwise distances (Shuman et al., 2013). For more details on the dataset and the experimental settings, please refer to subsection A.1.
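A sketch of this adjacency construction, using Euclidean distances between coordinates for illustration (the paper uses geographical distances; the bandwidth default and threshold value are placeholders):

```python
import numpy as np

def gaussian_kernel_adjacency(coords, sigma=None, threshold=0.1):
    """Thresholded Gaussian kernel over pairwise distances
    (in the spirit of Shuman et al., 2013).

    coords: (n, 2) station coordinates; sigma defaults to the std of
    the distance matrix; weights below threshold are zeroed to
    sparsify the graph, and self-loops are removed.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))          # pairwise distances
    sigma = sigma or dist.std()                  # kernel bandwidth
    adj = np.exp(-(dist ** 2) / (2 * sigma ** 2))
    adj[adj < threshold] = 0.0                   # sparsify
    np.fill_diagonal(adj, 0.0)                   # no self-loops
    return adj
```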

AQI-36     In-sample (MAE / MSE / MAPE)                 Out-of-sample (MAE / MSE / MAPE)
Mean       –                                            55.08±0.00 / 4715.70±0.00 / 307.30±0.00
KNN        –                                            30.20±0.00 / 2892.30±0.00 / 129.69±0.00
MF         30.34±0.13 / 2784.14±19.24 / 139.36±1.37     –
MICE       29.82±0.11 / 2572.46±10.22 / 149.22±0.44     30.34±0.10 / 2592.56±4.02 / 159.41±0.28
BRITS      14.01±0.22 / 711.05±28.37 / 44.57±2.75       14.98±0.46 / 719.57±72.55 / 52.04±2.87
GRIL       11.52±0.24 / 455.88±24.36 / 35.67±2.62       12.28±0.33 / 497.09±42.43 / 42.67±4.20

AQI        In-sample (MAE / MSE / MAPE)                 Out-of-sample (MAE / MSE / MAPE)
Mean       –                                            40.26±0.00 / 3276.70±0.00 / 149.73±0.00
KNN        –                                            30.23±0.00 / 2973.81±0.00 / 76.59±0.00
MF         26.82±0.15 / 2023.92±14.67 / 73.86±0.48      –
MICE       26.43±0.01 / 1859.85±9.36 / 80.74±0.10       26.88±0.10 / 1929.08±5.41 / 83.27±0.67
BRITS      17.16±0.17 / 908.03±15.14 / 44.44±0.19       19.82±0.18 / 1114.30±18.58 / 55.00±0.72
GRIL       15.37±0.07 / 798.90±11.31 / 35.78±0.56       15.69±0.22 / 834.58±25.78 / 36.84±0.58

Table 1: Results on the air datasets (mean ± std). Performance averaged over independent runs; dashes mark settings in which a method does not operate.

Table 1 shows the experimental results. Note that for Mean and KNN we consider only the out-of-sample scenario, since they do not need pretraining to be used on unseen sequences. Conversely, MF can operate only in-sample. As can be seen, GRIL largely outperforms the baselines in all the settings. In particular, in the out-of-sample case, the gain of GRIL over BRITS (in terms of MAE) exceeds 20% on the higher-dimensional dataset AQI. Notice that the results of the baselines (including BRITS) on AQI-36 are different from the ones published in Cao et al. (2018). This might be due to differences in the data preprocessing and in the computation of the metrics. Indeed, here we evaluate performance by averaging the error over the windows; in Cao et al. (2018), instead, the evaluation is carried out on the whole test sequence, in which missing values are imputed using the mean of multiple imputations. We point out that, for the implementation of BRITS, we used the code provided by the authors.

5.2 Traffic

The study of traffic networks is key for the development of intelligent transportation systems and a relevant application field for network science. While previous works (Yu et al., 2017; Li et al., 2018) have assessed spatio-temporal deep learning methods on the traffic forecasting task, we focus on reconstruction. We use as benchmarks the PEMS-BAY and METR-LA datasets from Li et al. (2018). PEMS-BAY contains six months of data from 325 traffic sensors in the San Francisco Bay Area, while METR-LA contains four months of sensor readings from 207 detectors in the Los Angeles County Highway (Jagadish et al., 2014); for both datasets, the sampling rate corresponds to 5 minutes. We use the geographical distance between sensors and a thresholded Gaussian kernel to obtain the adjacency matrices. We test our imputation method in two different settings. In the first one (indicated as Block missing), we use the following policy: at each step, for each sensor, we simulate a failure with a fixed probability; in case of a failure, we sample its duration uniformly between a minimum and a maximum number of time steps, corresponding to a few hours each. Furthermore, we simulate the presence of noisy communications (e.g., random interruptions in the communication channel) by randomly dropping a small fraction of the available data. The final result is a mask that removes a significant share of the original dataset. In the second setting (indicated as Noise missing), we simply mask out a fraction of the available data at random. We fix the seeds of the random number generators used to sample these masks across baselines and training runs. For both datasets, we consider only out-of-sample imputation, except for MICE and MF. We hold out the last portion of each dataset as the test set and split the remainder into training and validation sets. Note that previous works (e.g., Li et al. (2018) and Shang et al. (2020)) showed that linear models are inadequate to capture the dynamics of these systems.
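The Block missing policy can be sketched as follows; the failure probability, window lengths and noise rate below are placeholders, not the values used in the experiments:

```python
import numpy as np

def simulate_block_missing(n_steps, n_sensors, p_fail=0.002,
                           min_len=12, max_len=48, p_noise=0.05, seed=0):
    """Return a binary mask (0 = removed) simulating sensor failures.

    Per sensor, at each step a failure starts with probability p_fail
    and lasts a uniform number of steps in [min_len, max_len]; on top
    of that, a fraction p_noise of entries is dropped at random.
    All parameter values are illustrative placeholders.
    """
    rng = np.random.default_rng(seed)
    mask = np.ones((n_steps, n_sensors), dtype=int)
    for s in range(n_sensors):
        t = 0
        while t < n_steps:
            if rng.random() < p_fail:                 # sensor failure
                length = rng.integers(min_len, max_len + 1)
                mask[t:t + length, s] = 0             # block of misses
                t += length
            else:
                t += 1
    # random communication noise: drop entries independently
    mask &= (rng.random(mask.shape) > p_noise).astype(int)
    return mask
```

Fixing the seed, as noted above, keeps the sampled masks identical across baselines and training runs.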

PEMS-BAY   Block missing (MAE / MSE / MAPE)            Noise missing (MAE / MSE / MAPE)
Mean       5.48±0.00 / 88.07±0.00 / 13.96±0.00         5.43±0.00 / 87.07±0.00 / 13.89±0.00
KNN        4.30±0.00 / 49.90±0.00 / 9.96±0.00          4.30±0.00 / 49.80±0.00 / 9.88±0.00
MF         3.28±0.01 / 49.95±0.52 / 9.44±0.04          3.30±0.01 / 51.63±0.58 / 9.60±0.04
MICE       2.75±0.01 / 25.29±0.35 / 6.28±0.06          2.74±0.02 / 25.65±0.25 / 6.37±0.04
BRITS      1.74±0.00 / 10.93±0.04 / 3.72±0.01          1.53±0.00 / 8.46±0.09 / 3.31±0.02
GRIL       1.27±0.00 / 8.46±0.08 / 2.71±0.02           0.69±0.00 / 1.63±0.00 / 1.31±0.00

METR-LA    Block missing (MAE / MSE / MAPE)            Noise missing (MAE / MSE / MAPE)
Mean       7.43±0.00 / 140.18±0.00 / 24.31±0.00        7.52±0.00 / 142.82±0.00 / 24.90±0.00
KNN        7.79±0.00 / 124.61±0.00 / 20.65±0.00        7.88±0.00 / 129.29±0.00 / 21.26±0.00
MF         5.34±0.01 / 105.16±3.12 / 19.23±0.37        5.52±0.02 / 112.22±0.95 / 20.50±0.12
MICE       4.19±0.06 / 50.52±1.37 / 11.71±0.26         4.38±0.03 / 54.46±0.74 / 12.54±0.09
BRITS      2.37±0.01 / 17.10±0.22 / 5.98±0.04          2.37±0.00 / 16.89±0.26 / 6.04±0.07
GRIL       2.18±0.00 / 15.58±0.07 / 5.25±0.02          1.95±0.00 / 10.90±0.00 / 4.56±0.01

Table 2: Results on the traffic datasets (mean ± std). Performance averaged over independent runs.

Results are shown in Table 2. GRIL outperforms BRITS (and the other baselines) by a wide margin in all the considered settings, while using a much lower number of parameters (see Appendix A.2). On average, GRIL substantially reduces MAE w.r.t. BRITS and, in particular, in the Noise missing setting of the PEMS-BAY dataset, the error is more than halved.

5.3 Smart grids

As a last benchmark, we tackle the problem of imputing missing values in energy consumption data collected by smart meters. In particular, we consider data from the Irish Commission for Energy Regulation (CER) Smart Metering Project (from now on CER-Energy, for short). We select only the subset of the available smart meters monitoring the energy consumption of small and medium-sized enterprises (SMEs), i.e., time series with samples acquired every 30 minutes. We build an adjacency matrix by extracting a k-nearest neighbor graph from the similarity matrix obtained by computing the correntropy (Liu et al., 2007) among the time series. We generate missing data with the same approach used for the traffic datasets, but with minimum and maximum failure durations on the order of hours and days, respectively. As in the previous case, we split the data into training, validation and test sets. Table 3 shows the results of the compared approaches on this dataset. Here, we do not consider MAPE as a metric due to the large number of null values in the dataset. As in the previous case, GRIL largely outperforms the baselines. Besides showing the effectiveness of our approach in a relevant application field, this experiment also shows that GRIL can be exploited in settings where relational information is not readily available.
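Extracting the k-nearest-neighbor graph from a precomputed similarity matrix can be sketched as follows (any symmetric similarity works here; the correntropy computation itself is omitted, and the default k is a placeholder):

```python
import numpy as np

def knn_graph(sim, k=3):
    """Extract a symmetric k-nearest-neighbor graph from a similarity
    matrix: each node keeps its k strongest off-diagonal similarities,
    and the result is symmetrized by taking the element-wise maximum."""
    n = sim.shape[0]
    s = sim.copy().astype(float)
    np.fill_diagonal(s, -np.inf)            # exclude self-similarity
    adj = np.zeros_like(s)
    for i in range(n):
        nbrs = np.argsort(s[i])[::-1][:k]   # k most similar nodes
        adj[i, nbrs] = sim[i, nbrs]
    return np.maximum(adj, adj.T)           # symmetrize
```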

CER-Energy   Block missing (MAE / MSE)      Noise missing (MAE / MSE)
Mean         1.50±0.00 / 6.05±0.00          1.52±0.00 / 6.19±0.00
KNN          1.13±0.00 / 5.91±0.00          1.19±0.00 / 6.49±0.00
MF           0.97±0.00 / 4.40±0.04          1.01±0.01 / 4.69±0.06
MICE         0.90±0.00 / 2.69±0.03          0.93±0.00 / 2.85±0.01
BRITS        0.64±0.00 / 1.61±0.01          0.64±0.00 / 1.60±0.00
GRIL         0.46±0.00 / 1.36±0.01          0.31±0.00 / 0.58±0.00

Table 3: Results on the smart grid dataset (mean ± std). Performance averaged over independent runs.

6 Conclusions

In this paper, we presented GRIL, a novel approach for MTSI exploiting modern graph neural networks. Our method imputes missing data by leveraging the relational information characterizing the underlying sensor network. Compared against state-of-the-art baselines, our framework offers higher modeling flexibility and expressiveness, and achieves better reconstruction accuracy in all the considered scenarios. There are several possible directions for future work. From a theoretical perspective, it would be interesting to study the properties of the graph and of the underlying data-generating process that would guarantee an accurate reconstruction. Furthermore, future work should study extensions able to deal with non-stationary settings and further assess the ability of GRIL w.r.t. virtual sensing. Finally, it would be interesting to evaluate the performance of GRIL-like models against different sequence models (e.g., convolutional or attention-based architectures).


  • J. Atwood and D. Towsley (2016) Diffusion-convolutional neural networks. In Advances in neural information processing systems, pp. 1993–2001. Cited by: §2, §5.
  • S. Bai, J. Z. Kolter, and V. Koltun (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271. Cited by: §1.
  • P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261. Cited by: §1, §3.
  • L. Beretta and A. Santaniello (2016) Nearest neighbor imputation algorithms: a critical evaluation. BMC medical informatics and decision making 16 (3), pp. 197–208. Cited by: §2.
  • M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst (2017) Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42. Cited by: §1.
  • L. Cai, K. Janowicz, G. Mai, B. Yan, and R. Zhu (2020) Traffic transformer: capturing the continuity and periodicity of time series for traffic forecasting. Transactions in GIS 24 (3), pp. 736–755. Cited by: §2.
  • W. Cao, D. Wang, J. Li, H. Zhou, Y. Li, and L. Li (2018) BRITS: bidirectional recurrent imputation for time series. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 6776–6786. Cited by: §A.1, Appendix A, §1, §2, §5.1, §5.1, §5.
  • Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu (2018) Recurrent neural networks for multivariate time series with missing values. Scientific reports 8 (1), pp. 1–12. Cited by: §1, §2.
  • K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §4.1.
  • J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §1, §2.
  • A. Cichocki and A. Phan (2009) Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE transactions on fundamentals of electronics, communications and computer sciences 92 (3), pp. 708–721. Cited by: §2.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems 29, pp. 3844–3852. Cited by: §2.
  • J. Durbin and S. J. Koopman (2012) Time series analysis by state space methods. Oxford university press. Cited by: §2.
  • W. Fedus, I. Goodfellow, and A. M. Dai (2018) MaskGAN: better text generation via filling in the _. In International Conference on Learning Representations. Cited by: §1, §2.
  • M. Fey and J. E. Lenssen (2019) Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428. Cited by: 3rd item.
  • Z. Ghahramani and M. Jordan (1994) Supervised learning from incomplete data via an em approach. In Advances in Neural Information Processing Systems, J. Cowan, G. Tesauro, and J. Alspector (Eds.), Vol. 6. Cited by: §2.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In International conference on machine learning, pp. 1263–1272. Cited by: §4.1.
  • I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio (2016) Deep learning. Vol. 1, MIT Press. Cited by: §1.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. Advances in neural information processing systems 27. Cited by: §2.
  • C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, et al. (2020) Array programming with numpy. Nature 585 (7825), pp. 357–362. Cited by: 2nd item.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1.
  • H. V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan, and C. Shahabi (2014) Big data and its technical challenges. Communications of the ACM 57 (7), pp. 86–94. Cited by: §5.2.
  • J. Kihoro, K. Athiany, et al. (2013) Imputation of incomplete nonstationary seasonal time series data. Mathematical Theory and Modeling 3 (12), pp. 142–154. Cited by: §2.
  • T. Kipf, E. Fetaya, K. Wang, M. Welling, and R. Zemel (2018) Neural relational inference for interacting systems. In International Conference on Machine Learning, pp. 2688–2697. Cited by: §2, §3.
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436–444. Cited by: §1.
  • Y. Li, R. Yu, C. Shahabi, and Y. Liu (2018) Diffusion convolutional recurrent neural network: data-driven traffic forecasting. In International Conference on Learning Representations, Cited by: Appendix A, §2, §4.1, §5.2, §5.
  • Z. C. Lipton, D. Kale, and R. Wetzel (2016) Directly modeling missing data in sequences with rnns: improved classification of clinical time series. In Proceedings of the 1st Machine Learning for Healthcare Conference, F. Doshi-Velez, J. Fackler, D. Kale, B. Wallace, and J. Wiens (Eds.), Proceedings of Machine Learning Research, Vol. 56, Northeastern University, Boston, MA, USA, pp. 253–270. Cited by: §2.
  • W. Liu, P. P. Pokharel, and J. C. Principe (2007) Correntropy: properties and applications in non-gaussian signal processing. IEEE Transactions on signal processing 55 (11), pp. 5286–5298. Cited by: §5.3.
  • Y. Liu, R. Yu, S. Zheng, E. Zhan, and Y. Yue (2019) NAOMI: non-autoregressive multiresolution sequence imputation. Advances in Neural Information Processing Systems 32, pp. 11238–11248. Cited by: §1, §2.
  • Y. Luo, X. Cai, Y. Zhang, J. Xu, and Y. Xiaojie (2018) Multivariate time series imputation with generative adversarial networks. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. Cited by: §2.
  • Y. Luo, Y. Zhang, X. Cai, and X. Yuan (2019) E²gan: end-to-end generative adversarial network for multivariate time series imputation. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 3094–3100. Cited by: §2.
  • J. Mei, Y. De Castro, Y. Goude, and G. Hébrail (2017) Nonnegative matrix factorization for time series recovery from a few temporal aggregates. In International Conference on Machine Learning, pp. 2382–2390. Cited by: §2.
  • S. Moritz and T. Bartz-Beielstein (2017) ImputeTS: time series missing value imputation in r. The R Journal 9 (1), pp. 207–218. Cited by: §2.
  • F. V. Nelwamondo, S. Mohamed, and T. Marwala (2007) Missing data: a comparison of neural network and expectation maximization techniques. Current Science, pp. 1514–1521. Cited by: §2.
  • B. Paassen, D. Grattarola, D. Zambon, C. Alippi, and B. E. Hammer (2020) Graph edit networks. In International Conference on Learning Representations, Cited by: §2.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. Advances in neural information processing systems 32, pp. 8026–8037. Cited by: 1st item.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011) Scikit-learn: machine learning in python. the Journal of machine Learning research 12, pp. 2825–2830. Cited by: 4th item.
  • E. Rossi, B. Chamberlain, F. Frasca, D. Eynard, F. Monti, and M. Bronstein (2020) Temporal graph networks for deep learning on dynamic graphs. arXiv preprint arXiv:2006.10637. Cited by: §2.
  • A. Rubinsteyn and S. Feldman (2016) Fancyimpute: an imputation library for python External Links: Link Cited by: 5th item.
  • F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. IEEE transactions on neural networks 20 (1), pp. 61–80. Cited by: §1.
  • J. Schmidhuber (2015) Deep learning in neural networks: an overview. Neural networks 61, pp. 85–117. Cited by: §1.
  • Y. Seo, M. Defferrard, P. Vandergheynst, and X. Bresson (2018) Structured sequence modeling with graph convolutional recurrent networks. In International Conference on Neural Information Processing, pp. 362–373. Cited by: §2, §4.1.
  • C. Shang, J. Chen, and J. Bi (2020) Discrete graph structure learning for forecasting multiple time series. In International Conference on Learning Representations, Cited by: §2, §5.2.
  • D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst (2013) The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE signal processing magazine 30 (3), pp. 83–98. Cited by: §A.1, §5.1.
  • I. Spinelli, S. Scardapane, and A. Uncini (2020) Missing data imputation with adversarially-trained graph convolutional networks. Neural Networks 129, pp. 249–260. Cited by: §2.
  • O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman (2001) Missing value estimation methods for dna microarrays. Bioinformatics 17 (6), pp. 520–525. Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2.
  • I. R. White, P. Royston, and A. M. Wood (2011) Multiple imputation using chained equations: issues and guidance for practice. Statistics in medicine 30 (4), pp. 377–399. Cited by: §1, §5.
  • Z. Wu, S. Pan, G. Long, J. Jiang, X. Chang, and C. Zhang (2020) Connecting the dots: multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, pp. 753–763. Cited by: §2.
  • Z. Wu, S. Pan, G. Long, J. Jiang, and C. Zhang (2019) Graph wavenet for deep spatial-temporal graph modeling. arXiv preprint arXiv:1906.00121. Cited by: §2.
  • X. Yi, Y. Zheng, J. Zhang, and T. Li (2016) ST-mvl: filling missing values in geo-sensory time series data. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI’16, pp. 2704–2710. External Links: ISBN 9781577357704 Cited by: §1, §2, §5.1.
  • J. Yoon, J. Jordon, and M. Schaar (2018a) Gain: missing data imputation using generative adversarial nets. In International Conference on Machine Learning, pp. 5689–5698. Cited by: §1, §2.
  • J. Yoon, W. R. Zame, and M. van der Schaar (2018b) Estimating missing data in temporal data streams using multi-directional recurrent neural networks. IEEE Transactions on Biomedical Engineering 66 (5), pp. 1477–1490. Cited by: §1, §2.
  • J. You, X. Ma, D. Y. Ding, M. Kochenderfer, and J. Leskovec (2020) Handling missing data with graph representation learning. In Neural Information Processing Systems (NeurIPS). Cited by: §2.
  • B. Yu, H. Yin, and Z. Zhu (2017) Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875. Cited by: §2, §5.2.
  • D. Zambon, D. Grattarola, L. Livi, and C. Alippi (2019) Autoregressive models for sequences of graphs. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §2.
  • J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D. Y. Yeung (2018) GaAN: gated attention networks for learning on large and spatiotemporal graphs. In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, Cited by: §2.
  • Y. Zheng, L. Capra, O. Wolfson, and H. Yang (2014) Urban computing: concepts, methodologies, and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 5 (3), pp. 1–55. Cited by: §5.1.
  • Y. Zheng, X. Yi, M. Li, R. Li, Z. Shan, E. Chang, and T. Li (2015) Forecasting fine-grained air quality based on big data. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 2267–2276. Cited by: §5.1.

Appendix A Detailed experimental settings

In this appendix, we give more details on the experimental settings used to evaluate our approach. We train BRITS and GRIL by sampling random batches at each epoch, we fix a maximum number of epochs, and we use early stopping on the validation set with a fixed patience. The loss function used for training is a sum of terms, each of the form of Equation 1, where the element-wise error function is the MAE. While training both models, we randomly mask out an additional fraction of the input data for each batch to increase robustness to noise and missing data; the estimation error on these masked values is included in the computation of the loss.
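The training-time masking strategy described above can be sketched as follows. This is an illustrative NumPy implementation, not the authors' code: the function names and the masking fraction `p` are our assumptions, since the exact fraction is not stated in this excerpt.

```python
import numpy as np

def masked_mae(y_hat: np.ndarray, y: np.ndarray, mask: np.ndarray) -> float:
    """MAE restricted to the entries flagged by the binary mask, i.e. the
    values the model must reconstruct."""
    mask = mask.astype(float)
    denom = max(mask.sum(), 1.0)  # avoid division by zero on empty masks
    return float((np.abs(y_hat - y) * mask).sum() / denom)

def whiten_batch(x: np.ndarray, mask: np.ndarray, p: float = 0.05, rng=None):
    """Hide an extra fraction p of the observed inputs: the hidden values are
    removed from the model input but kept as loss targets. p = 0.05 is an
    illustrative value."""
    rng = rng or np.random.default_rng()
    observed = mask.astype(bool)
    eval_mask = (rng.random(mask.shape) < p) & observed   # extra values to hide
    input_mask = observed & ~eval_mask                     # what the model sees
    return x * input_mask, input_mask, eval_mask
```

At each batch, the loss would then be evaluated on the union of the originally simulated missing values and the entries flagged by `eval_mask`.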

For BRITS, we use the same network hyperparameters as Cao et al. [2018]. For GRIL, we use the same hidden dimension for both the spatio-temporal encoder and the spatial decoder, and a separate hidden size for the last hidden layer of the decoder. We use diffusion convolution as the message-passing operation, with a fixed number of diffusion steps. Furthermore, GRIL uses a cosine learning rate scheduler. Note that, due to architectural differences, BRITS has a number of parameters far higher than GRIL's (the exact ratio depends on the considered dataset). For data processing, we follow the same steps as Li et al. [2018]: data are normalized across the feature dimension (which means graph-wise for GRIL and node-wise for BRITS). Data masked out to simulate the presence of missing values are never used to train the models.
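For reference, the diffusion-convolution operation used as message passing (Atwood and Towsley, 2016; Li et al., 2018) can be sketched in its simplest, forward-direction-only form. This is a didactic NumPy version under our own naming, not the implementation used in the experiments (which relies on PyTorch Geometric).

```python
import numpy as np

def diffusion_conv(x: np.ndarray, adj: np.ndarray, weights: list) -> np.ndarray:
    """One diffusion-convolution layer: out = sum_k P^k @ x @ W_k, where
    P = D^{-1} A is the random-walk transition matrix of the graph.
    `weights` holds one (d_in, d_out) matrix per diffusion step."""
    deg = adj.sum(axis=1, keepdims=True)
    P = adj / np.maximum(deg, 1e-8)   # row-normalized transition matrix
    out = 0.0
    h = x                             # node features, shape (n_nodes, d_in)
    for W in weights:
        out = out + h @ W             # contribution of the current hop
        h = P @ h                     # diffuse features one more hop
    return out
```

The full operator in Li et al. [2018] also includes the reverse-direction transition matrix; we omit it here for brevity.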

All the models were developed in python using the following open-source libraries:

  • PyTorch [Paszke et al., 2019];

  • numpy [Harris et al., 2020];

  • PyTorch Geometric [Fey and Lenssen, 2019];

  • scikit-learn [Pedregosa et al., 2011];

  • fancyimpute [Rubinsteyn and Feldman, 2016].

For the implementation of BRITS, we used the code provided by the authors. We plan to release the code used to run the experiments, together with a library for spatio-temporal time series processing, in the near future.

A.1 Air quality

We select windows of data of fixed length for AQI and for AQI-36 (in line with Cao et al. [2018]). To evaluate imputation performance, we mask out values from the test set and use a value as ground truth if: (1) the value is not missing, and (2) the value is missing at the same hour and day in the following month. To obtain an adjacency matrix from the geographical distances of the nodes, we make use of a thresholded Gaussian kernel [Shuman et al., 2013]. The weight of the edge connecting the i-th and j-th node is

    w_ij = exp( - dist(i, j)^2 / gamma^2 )   if it exceeds the threshold delta, and 0 otherwise,

where dist(·,·) is the geographical distance operator, gamma is the standard deviation of the distances, and delta is the threshold. Here, we applied the threshold directly to the computed weights w_ij, rather than to the distances, to increase sparsity.
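A minimal sketch of the thresholded Gaussian kernel described above, assuming the threshold is applied to the computed weights as stated in the text. The function name and the default threshold value are illustrative.

```python
import numpy as np

def gaussian_kernel_adjacency(dist: np.ndarray, gamma: float = None,
                              threshold: float = 0.1) -> np.ndarray:
    """Thresholded Gaussian kernel (Shuman et al., 2013): weights below the
    threshold are zeroed out to increase sparsity. `threshold=0.1` is an
    illustrative default, not the value used in the paper."""
    gamma = gamma if gamma is not None else dist.std()
    w = np.exp(-(dist ** 2) / (gamma ** 2))
    w[w < threshold] = 0.0        # threshold applied to weights, not distances
    np.fill_diagonal(w, 0.0)      # no self-loops
    return w
```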

A.2 Traffic

For this setting, we increase the hidden size of BRITS to account for the higher input dimensionality. We use input sequences of fixed length, corresponding to hours of data.

A.3 Smart grids

For the smart grid setting, we use the same hyperparameters as in the traffic benchmarks. To compute the similarity among channels, we use week-wise correntropy. Data were normalized using standard scaling, as in the previous settings, and no additional preprocessing steps were performed.
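The correntropy measure used to compare channels (Liu et al., 2007) is the mean of a Gaussian kernel evaluated on the pairwise differences of two series. The sketch below is an assumption-laden illustration: the kernel bandwidth `sigma` and the interpretation of "week-wise" (averaging the score over week-long windows) are our own choices, not details confirmed by the paper.

```python
import numpy as np

def correntropy(x: np.ndarray, y: np.ndarray, sigma: float = 1.0) -> float:
    """Empirical correntropy between two equal-length series: the mean of a
    Gaussian kernel applied to their element-wise differences."""
    diff = x - y
    return float(np.mean(np.exp(-diff ** 2 / (2 * sigma ** 2))))

def weekwise_correntropy(x: np.ndarray, y: np.ndarray,
                         week_len: int, sigma: float = 1.0) -> float:
    """One possible reading of 'week-wise': average the correntropy computed
    over non-overlapping week-long windows (our assumption)."""
    n = (len(x) // week_len) * week_len  # drop the incomplete trailing window
    scores = [correntropy(x[i:i + week_len], y[i:i + week_len], sigma)
              for i in range(0, n, week_len)]
    return float(np.mean(scores))
```

Identical series give a correntropy of 1, and the score decays toward 0 as the series diverge relative to the bandwidth.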