1 Introduction
Imputation of missing values is a prominent problem in multivariate time series analysis (TSA) from both theoretical and practical perspectives (Yi et al., 2016; White et al., 2011; Che et al., 2018; Fedus et al., 2018). In a world of complex interconnected systems, such as those characterizing sensor networks or the Internet of Things, faulty sensors, network failures and physical interference are widespread phenomena that cause disruptions in the data acquisition process. Luckily, failures of this type are often sparse and localized at the single-sensor level, i.e., they do not compromise the whole sensor network at once. In other terms, it is often the case that, at a given time step, missing data appear only at some of the channels of the resulting multivariate time series. Missing data can affect multiple channels, are not necessarily synchronized in time, and may show different transient or even permanent dynamics. In this context, spatiotemporal imputation methods (Yi et al., 2016; Yoon et al., 2018b) aim at reconstructing the missing parts of the signals by exploiting both temporal and spatial dependencies. In particular, an effective spatiotemporal approach reconstructs missing values by taking into account not only past and future values, but also the concurrent measurements of spatially close neighboring sensors. Here, spatial similarity does not necessarily mean physical (e.g., geographic) proximity, but rather indicates that the considered sensors are related w.r.t. a generic (quantifiable) functional dependency (e.g., Pearson correlation or Granger causality) and/or that they are close in some latent space. Relational information, then, can be interpreted as a set of constraints that allow replacing a malfunctioning sensor with a virtual one.
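To make the notion of functional dependency concrete, the following is a minimal, purely illustrative sketch (in plain Python, with a hypothetical threshold value) of how a weighted adjacency structure could be derived from pairwise Pearson correlations between channels:

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (std_a * std_b)

def correlation_adjacency(channels, threshold=0.7):
    """Weighted adjacency over channels: keep |r| >= threshold, zero diagonal."""
    n = len(channels)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                r = abs(pearson(channels[i], channels[j]))
                if r >= threshold:
                    adj[i][j] = r
    return adj
```

Any other quantifiable dependency measure (e.g., Granger causality) could play the same role as `pearson` here.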
Among imputation methods, approaches based on deep learning (LeCun et al., 2015; Schmidhuber, 2015; Goodfellow et al., 2016) have become increasingly popular (Yoon et al., 2018a; Cao et al., 2018; Liu et al., 2019). These methods often completely disregard available relational information, or rely on rather simplistic modifications of standard neural architectures tailored to sequence processing (Hochreiter and Schmidhuber, 1997; Chung et al., 2014; Bai et al., 2018; Vaswani et al., 2017). We argue that stronger, structural inductive biases are needed to advance the state of the art in this direction and to enable effective inference engines in the context of the large and complex sensor networks found in real-world applications.
In this preliminary work, we model input time series as sequences of graphs where edges represent the relationships among different channels at each time step. We propose graph neural networks (GNNs) (Scarselli et al., 2008; Bronstein et al., 2017; Battaglia et al., 2018)
as the building block of a novel, bidirectional, recurrent neural network for multivariate time series imputation (MTSI). Our method, named
Graph Recurrent Imputation Layer (GRIL), has at its core a recurrent neural network cell whose gates are implemented by message-passing neural networks (MPNNs). Two of these networks process the input multivariate time series in the forward and backward directions at each node, while hidden states are processed by a message-passing imputation layer that exploits information at neighboring nodes to perform imputation.

Our contributions are manifold: 1) we introduce graph neural networks in the context of MTSI, 2) we propose a novel, practical and effective implementation of a GNN-based architecture for MTSI, and 3) we achieve state-of-the-art results on several and varied MTSI benchmarks. Our method does not rely on any assumption on the distribution of the missing values (e.g., presence and duration of transient dynamics and/or length of missing sequences) other than stationarity of the underlying process. The rest of the paper is organized as follows. In Section 2 we discuss related works. Then, in Section 3, we formally introduce the problem settings and the task of MTSI. We present our approach in Section 4, describing a novel methodological framework to implement imputation architectures based on GNNs. We proceed with an empirical evaluation of the proposed method against state-of-the-art baselines in Section 5 and, finally, we draw our conclusions in Section 6.
2 Related works
Time series imputation
There exists a large literature addressing missing value imputation in time series. Besides simple interpolation methods based on linear and polynomial curve fitting, popular approaches aim at filling in missing values by taking advantage of the similarities among time series. For example, several approaches rely on k-nearest neighbors
(Troyanskaya et al., 2001; Beretta and Santaniello, 2016) or matrix factorization (Cichocki and Phan, 2009; Mei et al., 2017). Other standard methods exploit the expectation-maximization algorithm (Ghahramani and Jordan, 1994; Nelwamondo et al., 2007) or linear predictors and state-space models (Durbin and Koopman, 2012; Kihoro et al., 2013; Moritz and Bartz-Beielstein, 2017). A more advanced method, namely ST-MVL (Yi et al., 2016), combines spatial and temporal information through a hybrid solution consisting of standard forecasting techniques and methods from the recommender systems literature, in order to impute geographically tagged time series.

More recently, several deep learning approaches have been proposed for MTSI. Among others, methods based on recurrent neural networks have found widespread success (Lipton et al., 2016; Yoon et al., 2018b; Che et al., 2018; Luo et al., 2018; Liu et al., 2019; Luo et al., 2019). In particular, Cao et al. (2018) propose BRITS, a bidirectional recurrent neural network for multivariate time series imputation that takes into account the correlation among different channels. Other successful strategies exploit adversarial training to generate realistic reconstructed sequences (Yoon et al., 2018a; Fedus et al., 2018; Luo et al., 2018, 2019). For example, Yoon et al. (2018a) use a GAN (Goodfellow et al., 2014) to learn models that match the underlying data-generating process, while Liu et al. (2019) use adversarial learning to train a multiscale model that imputes highly sparse time series in a hierarchical fashion. However, we argue that none of the above-cited methods is able to take full advantage of relational information and spatiotemporal dependencies, unless the time series are generated by linear time-invariant state-space models. Most importantly, the above methods lack the flexibility and expressiveness enabled by operating in the context of graph processing.
Graph neural networks for TSA
Graph neural networks have been exploited in TSA mostly to implement spatiotemporal forecasting methods. The basic idea behind most methods in the literature is to modify standard neural network architectures for sequential data by relying on operators that work in the graph domain. For example, Seo et al. (2018) propose a gated recurrent graph neural network where the standard gates of a GRU cell (Chung et al., 2014) are implemented by spectral GNNs (Defferrard et al., 2016); Li et al. (2018) propose an analogous architecture, replacing spectral GNNs with a diffusion-convolutional network (Atwood and Towsley, 2016). Yu et al. (2017) and Wu et al. (2019)
propose, instead, spatiotemporal convolutional neural networks that alternate convolutions over the temporal and spatial dimensions. Similar approaches have also been studied in the context of attention-based sequence models (Vaswani et al., 2017), with Transformer-like architectures alternating spatial and temporal attention layers (Zhang et al., 2018; Cai et al., 2020). Another particularly interesting line of research concerns the problem of learning, end-to-end, the best graph structure to process an input multivariate time series (Kipf et al., 2018; Wu et al., 2020; Shang et al., 2020).

While the above approaches focus on multivariate time series prediction and use the graph structure only as a representational tool, other methods aim at predicting changes in the graph topology (Zambon et al., 2019; Paassen et al., 2020). Methods such as Temporal Graph Networks (Rossi et al., 2020), instead, are tailored to learning node embeddings in dynamic graphs and adopt an event-based paradigm that is out of the scope of this work. Finally, recent works have proposed GNNs for imputing missing features in the context of i.i.d. data. Among others, Spinelli et al. (2020) propose an adversarial framework to train GNNs on the data reconstruction task, while You et al. (2020) propose a bipartite graph representation for feature imputation. To the best of our knowledge, no previous GNN-based method has targeted missing value imputation in the context of sequential data streams.
3 Preliminaries
Sequences of graphs
We consider sequences of weighted directed graphs where, at each time step $t$, we observe a graph $\mathcal{G}_t$ with $N$ nodes. The graph is a couple $\mathcal{G}_t = \langle X_t, A_t \rangle$, where $X_t \in \mathbb{R}^{N \times d}$ is the node-attribute matrix whose $i$-th row is the $d$-dimensional node-attribute vector $x_t^i$ associated with the $i$-th node; the adjacency matrix $A_t \in \mathbb{R}^{N \times N}$, on the other hand, has entries $a_t^{i,j}$ indicating the weight of the edge (if any) connecting the $i$-th and the $j$-th node. We assume that nodes are identified, i.e., have a unique ID that enables time-wise consistent processing. This problem setting can easily be extended to more general classes of graphs with attributed edges and global attributes (Battaglia et al., 2018). In this work, we mainly focus on problems where the topology of the graph is fixed and does not change over time, i.e., $A_t = A$ at each time step $t$.

Any generic multivariate time series can be trivially described in the above framework by letting each channel of the sequence correspond to a node and using the identity matrix as adjacency if no relational information is available. However, this would defeat the purpose of the formulation: a more proper choice of $A$
could be made using any standard similarity measure (e.g., Pearson correlation) or a (thresholded) kernel; a more advanced approach, instead, could aim at learning a proper adjacency directly from data by using, for instance, spatial attention scores or resorting to graph learning techniques, e.g., Kipf et al. (2018).

Multivariate time series imputation
To model the presence of missing values, we augment, at each step, the graph with a binary mask $M_t \in \{0,1\}^{N \times d}$, where each row $m_t^i$ indicates which of the node attributes in $x_t^i$ are available in $X_t$. In particular, $m_t^{i,j} = 0$ implies that the entry $x_t^{i,j}$ is not valid; conversely, if $m_t^{i,j} = 1$, then $x_t^{i,j}$ stores an actual sensor reading. We indicate with $\widetilde{X}_t$ the ground-truth node-attribute matrix, i.e., the complete node-attribute matrix without any missing data. We do not make any special assumption on the distribution of the missing values.
The objective of MTSI is to impute missing values in a sequence of input data. More formally, given a sequence $X_{t:t+T}$ of length $T$, we can define the reconstruction error as
$$\mathcal{L}\big(\widehat{X}_{t:t+T},\,\widetilde{X}_{t:t+T},\,\overline{M}_{t:t+T}\big)=\frac{\sum_{h=t}^{t+T}\big\langle \overline{M}_h,\ \ell\big(\widehat{X}_h,\widetilde{X}_h\big)\big\rangle}{\sum_{h=t}^{t+T}\big\langle \overline{M}_h,\,\overline{M}_h\big\rangle}\qquad(1)$$
where $\widehat{X}_h$ is the reconstruction of $\widetilde{X}_h$, $\overline{M}_h$ is the logical negation of $M_h$, $\ell(\cdot,\cdot)$ is an element-wise error function (e.g., absolute or squared error) and $\langle\cdot,\cdot\rangle$ indicates the standard dot product. Note that, in practice, it is impossible to have access to $\widetilde{X}_h$ and, as a consequence, it is necessary to define a surrogate optimization objective by, for example, using a prediction loss or generating synthetic missing values.
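As a concrete illustration, the masked reconstruction error with absolute error as element-wise cost can be sketched as follows; here the evaluation mask plays the role of the negated availability mask, flagging only the held-out entries (the function name and flattened representation are illustrative):

```python
def masked_mae(x_hat, x_true, eval_mask):
    """Mean absolute error restricted to the entries flagged by eval_mask.

    x_hat:     imputed values (flattened sequence);
    x_true:    ground-truth values;
    eval_mask: 1 where the value was held out for evaluation, 0 elsewhere
               (i.e., the logical negation of the availability mask).
    """
    num = sum(m * abs(p - t) for p, t, m in zip(x_hat, x_true, eval_mask))
    den = sum(eval_mask)
    return num / den if den > 0 else 0.0
```

Observed entries (mask value 1) contribute nothing to the error, matching the fact that only the reconstruction of missing values is assessed.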
In the context of trainable, parametric imputation methods, we consider two different operational settings. In the first one, which we call in-sample imputation, the model is trained on the same input sequence on which it is evaluated. In the second one (referred to as out-of-sample imputation), the model is trained and evaluated on disjoint sequences. Note that in both cases the model does not have access to the ground-truth data used for evaluation. The first operational setting simulates the case where a practitioner fits the model directly on the target sequence to fill in its missing parts; the second, instead, simulates the case where one wishes to use a model fitted on historical data to impute missing values in an unseen target sequence.
4 Graph Recurrent Imputation Layer
In this section, we present our approach, the Graph Recurrent Imputation Layer (GRIL), a graph-based, recurrent neural network for MTSI. Given a multivariate time series $X_{t:t+T}$ with mask $M_{t:t+T}$, our objective is to reconstruct the input sequence by combining the information coming from both the temporal and spatial dimensions. To do so, we design an architecture based on two blocks: a spatiotemporal encoder and a spatial decoder. The spatiotemporal encoder maps the input sequence to a spatiotemporal representation by exploiting an ad hoc designed recurrent GNN. The spatial decoder, instead, projects the learned representations back into the input space to obtain the imputations $\widehat{X}_{t:t+T}$. In particular, the decoder is implemented by an MPNN which learns to infer the value observed at the $i$-th node at time step $t$ by considering – locally – the hidden representations and the values observed at time $t$ at its neighboring nodes. In practice, to benefit from both the forward and backward temporal dynamics, we duplicate this encoder-decoder architecture and add a last layer to combine the two representations; the final imputation thus depends on the outputs of two parallel networks, one for each temporal direction.
An overview of GRIL is given in Figure 2. As can be seen from the schema, we proceed by imputing missing values iteratively, using at each time step previously imputed values as input. In the following, we describe in detail the architecture of the encoder and decoder blocks, and the computational steps required to perform the imputation. Then, we describe the unidirectional version of the model, and finally we provide the straightforward bidirectional extension.
4.1 Spatiotemporal Encoder
In the encoder, the input sequence and mask are processed sequentially, one step at a time, by means of a recurrent neural network whose gates are implemented by message-passing layers.
Message-passing gates
The gates of the GRIL encoder can be implemented by any MPNN (Gilmer et al., 2017). In particular, given $h^{k,i}$, i.e., the feature vector of the $i$-th node at layer $k$, we consider the general class of message-passing layers described as
$$h^{k+1,i}=\gamma^{k}\Big(h^{k,i},\ \mathop{\Gamma}_{j\in\mathcal{N}(i)}\,\rho^{k}\big(h^{k,i},\,h^{k,j}\big)\Big)\qquad(2)$$
where $\mathcal{N}(i)$ is the set of neighbors of the $i$-th node in $\mathcal{G}$, $\gamma^{k}$ and $\rho^{k}$ are generic, differentiable update and message functions (e.g., multilayer perceptrons), and $\Gamma$ is a permutation-invariant, differentiable aggregation function (e.g., sum or mean). In particular, the function $\Gamma$ specifies how to aggregate the messages received from the neighbors. For simplicity's sake, from now on we indicate with MPNN the forward pass of a generic, layered message-passing neural network. In the following, we use MPNNs as the building blocks of our spatiotemporal feature extractors.

Spatiotemporal recurrent cell
To learn the dynamics of the system, we leverage gated recurrent units (GRUs)
(Cho et al., 2014). As previously mentioned, similarly to Seo et al. (2018) and Li et al. (2018), we implement the GRU gates by relying on the message-passing layers defined above. At the node level, the elements of the message-passing GRU (MPGRU) can be described by the following computational steps:

$$r_t^{i}=\sigma\big(\mathrm{MPNN}\big(\big[x_t^{(2),i}\,\|\,m_t^{i}\,\|\,h_{t-1}^{i}\big],\,A\big)\big)\qquad(3)$$

$$u_t^{i}=\sigma\big(\mathrm{MPNN}\big(\big[x_t^{(2),i}\,\|\,m_t^{i}\,\|\,h_{t-1}^{i}\big],\,A\big)\big)\qquad(4)$$

$$c_t^{i}=\tanh\big(\mathrm{MPNN}\big(\big[x_t^{(2),i}\,\|\,m_t^{i}\,\|\,r_t^{i}\odot h_{t-1}^{i}\big],\,A\big)\big)\qquad(5)$$

$$h_t^{i}=u_t^{i}\odot h_{t-1}^{i}+\big(1-u_t^{i}\big)\odot c_t^{i}\qquad(6)$$
where $r_t^{i}$ and $u_t^{i}$ are the reset and update gates, respectively, $h_t^{i}$ is the hidden representation of the $i$-th node at time step $t$, and $x_t^{(2),i}$ is the input at time step $t$ with missing values filled in by the output of the decoding block (see the next subsection). The symbols $\odot$ and $\|$ denote the Hadamard product and the concatenation operation, respectively. The initial representation $h_{0}^{i}$ can either be initialized as a constant or with a learnable embedding. Note that, at the steps where input data are missing, the encoder is fed with predictions from the decoder block, as explained in the next subsection. By carrying out the above computation time- and graph-wise, we get the encoded sequence $H_{t:t+T}$.

4.2 Spatial Decoder
As a first decoding step, we generate one-step-ahead predictions from the hidden representations of the MPGRU by means of a linear readout
$$\widehat{X}_t^{(1)}=H_{t-1}V_h+b_h\qquad(7)$$
where $V_h \in \mathbb{R}^{l \times d}$ is a learnable weight matrix ($l$ being the size of the hidden representations) and $b_h \in \mathbb{R}^{d}$ is a learnable bias vector. We then define the filler operator as

$$\Phi\big(X_t,\widehat{X}_t\big)=M_t\odot X_t+\overline{M}_t\odot\widehat{X}_t\qquad(8)$$
Intuitively, the filler operator replaces the missing values in the first input with the values at the same positions in the second one. By feeding $\widehat{X}_t^{(1)}$ to the filler operator, we get the matrix $X_t^{(1)}=\Phi\big(X_t,\widehat{X}_t^{(1)}\big)$, i.e., $X_t$ with missing values replaced by the one-step-ahead predictions $\widehat{X}_t^{(1)}$. The resulting node-level predictions are then concatenated with the mask $M_t$ and the hidden representations $H_{t-1}$, and processed by a final one-layer MPNN which computes, for each node, an imputation representation $s_t^{i}$ as
$$S_t=\mathrm{MPNN}\big(\big[X_t^{(1)}\,\|\,M_t\,\|\,H_{t-1}\big],\,A\big)\qquad(9)$$
Notice that, as previously highlighted, the imputation representations only depend on messages received from neighboring nodes: in fact, by aggregating only messages from the one-hop neighborhood, the representations $s_t^{i}$ are independent of the input features of the $i$-th node itself. This constraint forces the model to learn how to reconstruct a target input by taking into account spatial dependencies.
Afterwards, we concatenate the imputation representations $S_t$ with the hidden representations $H_{t-1}$ and the mask $M_t$, and generate the second-stage imputations by means of a second linear readout
$$\widehat{X}_t^{(2)}=\big[S_t\,\|\,H_{t-1}\,\|\,M_t\big]\,V_s+b_s\qquad(10)$$
Note that concatenating $H_{t-1}$ does not change the fact that the representation of each node (and hence its final prediction) is still independent of its own features at time step $t$. We use the filler operator to get the updated reconstruction $X_t^{(2)}=\Phi\big(X_t,\widehat{X}_t^{(2)}\big)$ of the input. Then, we use $X_t^{(2)}$ as input to update the state of the MPGRU (Eq. 3–6) and proceed to the next input graph $\mathcal{G}_{t+1}$.
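A minimal sketch of the filler operation used throughout this section, shown element-wise over flattened arrays (the actual model operates on node-attribute matrices; this scalar version is only illustrative):

```python
def filler(x, x_hat, mask):
    """Keep observed entries (mask == 1); take predictions where data is missing."""
    return [xv if m == 1 else pv for xv, pv, m in zip(x, x_hat, mask)]
```

Observed values always pass through unchanged, so imputations never overwrite valid sensor readings.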
4.3 Bidirectional Model
Extending GRIL to account for both forward and backward dynamics is straightforward: we simply duplicate the architecture described in the two previous subsections. The first network processes the sequence in the forward direction (from time step $t$ to $t+T$), while the second one processes it in the backward direction (from $t+T$ back to $t$). The final imputation $\widehat{X}_t$ is then obtained with a Multi-Layer Perceptron (MLP) that aggregates, node-wise, the representations extracted by the forward and backward networks. The final output can then easily be obtained as $X_t^{out}=\Phi\big(X_t,\widehat{X}_t\big)$.
5 Empirical evaluation
In this section, we empirically evaluate our approach against state-of-the-art baselines in three relevant application domains. In particular, we consider BRITS (Cao et al., 2018) as the principal competing alternative among non-adversarial approaches, as it shares architectural similarities with our method. However, it is worth noting that BRITS can only operate on multivariate time series of fixed size, as opposed to GRIL. For all the experiments, we use as message-passing operator the diffusion convolution introduced by Atwood and Towsley (2016), which makes our recurrent GNN cell similar to DCRNN (Li et al., 2018). As additional baselines we consider: 1) MEAN, i.e., imputation using the node-level average; 2) KNN, i.e., imputation by averaging the values of the neighboring nodes with the highest weights in the adjacency matrix; 3) MICE (White et al., 2011), limiting the maximum number of iterations and the number of nearest features considered; 4) Matrix Factorization (MF).
We consider four different datasets coming from relevant applications; our approach achieves stateoftheart performance on all of them. As performance metrics, we use the mean absolute error (MAE), mean squared error (MSE) and mean absolute percentage error (MAPE) computed over the imputation window.
5.1 Air quality
The problem of air pollution is nowadays of worldwide concern. The Urban Computing project (Zheng et al., 2014, 2015) published several datasets containing real measurements of different indices affecting human life in urban spaces. We consider as a benchmark the dataset regarding the air quality index (AQI). The complete dataset contains hourly measurements of six pollutants from 437 air quality monitoring stations, spread over 43 cities in China, over a period of one year (from May 2014 to April 2015). For this experiment, we consider only the PM2.5 pollutant. Prior works on imputation (Yi et al., 2016; Cao et al., 2018) considered a reduced version of this dataset including only 36 sensors (AQI-36 in the following). We evaluate our model on both datasets (i.e., the reduced and the complete one), to further show the advantage of our approach over the baselines as the number of dimensions increases. This dataset is particularly interesting as an imputation benchmark due to its high rate of missing values in both variants. In accordance with Yi et al. (2016), we use the months of March, June, September and December as the test set. We consider both the in-sample and out-of-sample scenarios; in the latter case, we do not consider windows overlapping with any of the test months. We use the same procedure of Yi et al. (2016) to simulate the presence of missing data for evaluation. Besides air quality readings, the dataset provides the geographic coordinates of each monitoring station. We obtain a weighted adjacency matrix by considering each station as a node and weighting edges with a thresholded Gaussian kernel applied to pairwise distances (Shuman et al., 2013). For more details on the dataset and experimental settings, please refer to subsection A.1.
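The construction of the adjacency matrix from pairwise distances can be sketched as follows; the bandwidth `sigma` and the `threshold` are free parameters whose actual values are deferred to the appendix, so the numbers used in any call are hypothetical:

```python
import math

def gaussian_kernel_adjacency(dist, sigma, threshold):
    """Thresholded Gaussian kernel: w_ij = exp(-d_ij**2 / sigma**2),
    kept only when it reaches the threshold; self-loops are excluded."""
    n = len(dist)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                w = math.exp(-(dist[i][j] ** 2) / (sigma ** 2))
                if w >= threshold:
                    adj[i][j] = w
    return adj
```

Thresholding sparsifies the graph by discarding edges between far-apart stations, which keeps message passing local.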
| Dataset | Method | In-sample MAE | In-sample MSE | In-sample MAPE (%) | Out-of-sample MAE | Out-of-sample MSE | Out-of-sample MAPE (%) |
|---|---|---|---|---|---|---|---|
| AQI-36 | Mean | – | – | – | 55.08 ± 0.00 | 4715.70 ± 0.00 | 307.30 ± 0.00 |
| AQI-36 | KNN | – | – | – | 30.20 ± 0.00 | 2892.30 ± 0.00 | 129.69 ± 0.00 |
| AQI-36 | MF | 30.34 ± 0.13 | 2784.14 ± 19.24 | 139.36 ± 1.37 | – | – | – |
| AQI-36 | MICE | 29.82 ± 0.11 | 2572.46 ± 10.22 | 149.22 ± 0.44 | 30.34 ± 0.10 | 2592.56 ± 4.02 | 159.41 ± 0.28 |
| AQI-36 | BRITS | 14.01 ± 0.22 | 711.05 ± 28.37 | 44.57 ± 2.75 | 14.98 ± 0.46 | 719.57 ± 72.55 | 52.04 ± 2.87 |
| AQI-36 | GRIL | 11.52 ± 0.24 | 455.88 ± 24.36 | 35.67 ± 2.62 | 12.28 ± 0.33 | 497.09 ± 42.43 | 42.67 ± 4.20 |
| AQI | Mean | – | – | – | 40.26 ± 0.00 | 3276.70 ± 0.00 | 149.73 ± 0.00 |
| AQI | KNN | – | – | – | 30.23 ± 0.00 | 2973.81 ± 0.00 | 76.59 ± 0.00 |
| AQI | MF | 26.82 ± 0.15 | 2023.92 ± 14.67 | 73.86 ± 0.48 | – | – | – |
| AQI | MICE | 26.43 ± 0.01 | 1859.85 ± 9.36 | 80.74 ± 0.10 | 26.88 ± 0.10 | 1929.08 ± 5.41 | 83.27 ± 0.67 |
| AQI | BRITS | 17.16 ± 0.17 | 908.03 ± 15.14 | 44.44 ± 0.19 | 19.82 ± 0.18 | 1114.30 ± 18.58 | 55.00 ± 0.72 |
| AQI | GRIL | 15.37 ± 0.07 | 798.90 ± 11.31 | 35.78 ± 0.56 | 15.69 ± 0.22 | 834.58 ± 25.78 | 36.84 ± 0.58 |
Table 1 shows the experimental results. Note that for MEAN and KNN we consider only the out-of-sample scenario, since they do not need pretraining to be used on unseen sequences; conversely, MF can operate only in-sample. As can be seen, GRIL largely outperforms the baselines in all the settings. In particular, in the out-of-sample case, the gain of GRIL over BRITS (in terms of MAE) is largest on the higher-dimensional AQI dataset. Notice that the results of the baselines (including BRITS) on AQI-36 differ from the ones published in Cao et al. (2018). This might be due to differences in the data preprocessing and in the computation of the metrics: here we evaluate performance by averaging the error over the windows, while in Cao et al. (2018) the evaluation is carried out on the whole test sequence, in which missing values are imputed using the mean of multiple imputations. We point out that, for the implementation of BRITS, we used the code provided by the authors.
5.2 Traffic
The study of traffic networks is key for the development of intelligent transportation systems and a relevant application field for network science. While previous works (Yu et al., 2017; Li et al., 2018) have assessed spatiotemporal deep learning methods on the traffic forecasting task, we focus on reconstruction. We use as benchmarks the PEMS-BAY and METR-LA datasets from Li et al. (2018). PEMS-BAY contains data from traffic sensors in the San Francisco Bay Area, while METR-LA contains sensor readings from detectors on the Los Angeles County highway (Jagadish et al., 2014); for both datasets, readings are sampled at regular intervals of a few minutes. We use the geographical distance between sensors and a thresholded Gaussian kernel to obtain the adjacency matrices. We test our imputation method in two different settings. In the first one (indicated as Block missing), we use the following policy: at each step, for each sensor, we simulate a failure with a fixed probability; in case of a failure, its duration is sampled uniformly between a minimum and a maximum number of time steps, corresponding to a few hours of data. Furthermore, we simulate the presence of noisy communications (e.g., random interruptions in the communication channel) by randomly dropping a small fraction of the available data. In the second setting (indicated as Noise missing), we simply mask out a fraction of the available data uniformly at random. We fix the seeds of the random number generators used to sample these masks across baselines and training runs. For both datasets, we consider only out-of-sample imputation, except for MICE and MF. We hold out the last portion of each dataset as the test set and use the preceding data for training and validation. Note that previous works (e.g., Li et al. (2018) and Shang et al. (2020)) showed that linear models are inadequate to capture the dynamics of these systems.

| Dataset | Method | Block missing MAE | Block missing MSE | Block missing MAPE (%) | Noise missing MAE | Noise missing MSE | Noise missing MAPE (%) |
|---|---|---|---|---|---|---|---|
| PEMS-BAY | Mean | 5.48 ± 0.00 | 88.07 ± 0.00 | 13.96 ± 0.00 | 5.43 ± 0.00 | 87.07 ± 0.00 | 13.89 ± 0.00 |
| PEMS-BAY | KNN | 4.30 ± 0.00 | 49.90 ± 0.00 | 9.96 ± 0.00 | 4.30 ± 0.00 | 49.80 ± 0.00 | 9.88 ± 0.00 |
| PEMS-BAY | MF | 3.28 ± 0.01 | 49.95 ± 0.52 | 9.44 ± 0.04 | 3.30 ± 0.01 | 51.63 ± 0.58 | 9.60 ± 0.04 |
| PEMS-BAY | MICE | 2.75 ± 0.01 | 25.29 ± 0.35 | 6.28 ± 0.06 | 2.74 ± 0.02 | 25.65 ± 0.25 | 6.37 ± 0.04 |
| PEMS-BAY | BRITS | 1.74 ± 0.00 | 10.93 ± 0.04 | 3.72 ± 0.01 | 1.53 ± 0.00 | 8.46 ± 0.09 | 3.31 ± 0.02 |
| PEMS-BAY | GRIL | 1.27 ± 0.00 | 8.46 ± 0.08 | 2.71 ± 0.02 | 0.69 ± 0.00 | 1.63 ± 0.00 | 1.31 ± 0.00 |
| METR-LA | Mean | 7.43 ± 0.00 | 140.18 ± 0.00 | 24.31 ± 0.00 | 7.52 ± 0.00 | 142.82 ± 0.00 | 24.90 ± 0.00 |
| METR-LA | KNN | 7.79 ± 0.00 | 124.61 ± 0.00 | 20.65 ± 0.00 | 7.88 ± 0.00 | 129.29 ± 0.00 | 21.26 ± 0.00 |
| METR-LA | MF | 5.34 ± 0.01 | 105.16 ± 3.12 | 19.23 ± 0.37 | 5.52 ± 0.02 | 112.22 ± 0.95 | 20.50 ± 0.12 |
| METR-LA | MICE | 4.19 ± 0.06 | 50.52 ± 1.37 | 11.71 ± 0.26 | 4.38 ± 0.03 | 54.46 ± 0.74 | 12.54 ± 0.09 |
| METR-LA | BRITS | 2.37 ± 0.01 | 17.10 ± 0.22 | 5.98 ± 0.04 | 2.37 ± 0.00 | 16.89 ± 0.26 | 6.04 ± 0.07 |
| METR-LA | GRIL | 2.18 ± 0.00 | 15.58 ± 0.07 | 5.25 ± 0.02 | 1.95 ± 0.00 | 10.90 ± 0.00 | 4.56 ± 0.01 |
Results are shown in Table 2. GRIL outperforms BRITS (and the other baselines) by a wide margin in all the considered settings, while using a much lower number of parameters (see Appendix A.2). GRIL consistently reduces the MAE w.r.t. BRITS and, in particular, in the Noise missing setting of the PEMS-BAY dataset, the error is more than halved.
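For reference, the Block missing failure policy described above can be sketched as follows; all probabilities and window lengths below are placeholder values, not the ones used in our experiments:

```python
import random

def block_missing_mask(n_steps, n_sensors, p_fail=0.002, min_len=12, max_len=48,
                       p_noise=0.05, seed=0):
    """Simulate sensor failures: at each step a sensor may fail with probability
    p_fail, staying offline for a uniformly sampled number of steps; on top of
    that, a fraction p_noise of the remaining readings is dropped at random.
    Returns a binary mask (1 = observed, 0 = missing)."""
    rng = random.Random(seed)
    mask = [[1] * n_sensors for _ in range(n_steps)]
    for s in range(n_sensors):
        t = 0
        while t < n_steps:
            if rng.random() < p_fail:
                length = rng.randint(min_len, max_len)
                for k in range(t, min(t + length, n_steps)):
                    mask[k][s] = 0
                t += length
            else:
                t += 1
    # point-wise random drops simulating noisy communications
    for t in range(n_steps):
        for s in range(n_sensors):
            if mask[t][s] == 1 and rng.random() < p_noise:
                mask[t][s] = 0
    return mask
```

Fixing the seed, as done in our experiments, guarantees that every baseline sees exactly the same missing-data pattern.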
5.3 Smart grids
As a last benchmark, we tackle the problem of imputing missing energy consumption readings in smart metering data. In particular, we consider data from the Irish Commission for Energy Regulation (CER) Smart Metering Project (see https://www.ucd.ie/issda/data/commissionforenergyregulationcer); from now on, CER-Energy for short. We select only the subset of the available smart meters monitoring the energy consumption of small and medium-sized enterprises (SMEs), i.e., time series of meter readings acquired at regular intervals. We build an adjacency matrix by extracting a k-nearest-neighbor graph from the similarity matrix obtained by computing the correntropy (Liu et al., 2007) among the time series. We generate missing data with the same approach used for the traffic datasets, but with longer failure windows, ranging from hours up to days. As in the previous case, we split the data into training, validation and test sets. Table 3 shows the results of the compared approaches on this dataset. Here, we do not consider MAPE as a metric due to the large number of null values in the dataset. As in the previous case, GRIL largely outperforms the baselines. Besides showing the effectiveness of our approach in a relevant application field, this experiment also shows that GRIL can be exploited in settings where relational information is not readily available.
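The extraction of a k-nearest-neighbor graph from a generic similarity matrix (here, correntropy would play the role of `sim`) can be sketched as:

```python
def knn_graph(sim, k):
    """Binary adjacency keeping, for each node, its k most similar neighbors."""
    n = len(sim)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        neighbors = sorted((j for j in range(n) if j != i),
                           key=lambda j: sim[i][j], reverse=True)[:k]
        for j in neighbors:
            adj[i][j] = 1
    return adj
```

Note that the resulting adjacency is in general asymmetric, since the k-nearest-neighbor relation is not symmetric.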
| Dataset | Method | Block missing MAE | Block missing MSE | Noise missing MAE | Noise missing MSE |
|---|---|---|---|---|---|
| CER-Energy | Mean | 1.50 ± 0.00 | 6.05 ± 0.00 | 1.52 ± 0.00 | 6.19 ± 0.00 |
| CER-Energy | KNN | 1.13 ± 0.00 | 5.91 ± 0.00 | 1.19 ± 0.00 | 6.49 ± 0.00 |
| CER-Energy | MF | 0.97 ± 0.00 | 4.40 ± 0.04 | 1.01 ± 0.01 | 4.69 ± 0.06 |
| CER-Energy | MICE | 0.90 ± 0.00 | 2.69 ± 0.03 | 0.93 ± 0.00 | 2.85 ± 0.01 |
| CER-Energy | BRITS | 0.64 ± 0.00 | 1.61 ± 0.01 | 0.64 ± 0.00 | 1.60 ± 0.00 |
| CER-Energy | GRIL | 0.46 ± 0.00 | 1.36 ± 0.01 | 0.31 ± 0.00 | 0.58 ± 0.00 |
6 Conclusions
In this paper, we presented GRIL, a novel approach for MTSI exploiting modern graph neural networks. Our method imputes missing data by leveraging the relational information characterizing the underlying sensor network. Compared against state-of-the-art baselines, our framework offers higher modeling flexibility and expressiveness and achieves better reconstruction accuracy in all the considered scenarios. There are several possible directions for future work. From a theoretical perspective, it would be interesting to study the properties of the graph and of the underlying data-generating process that would guarantee an accurate reconstruction. Furthermore, future work should study extensions able to deal with non-stationary settings and further assess the capabilities of GRIL for virtual sensing. Finally, it would be interesting to evaluate the performance of GRIL-like models against different sequence models (e.g., convolutional or attention-based architectures).
References
Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1993–2001.
An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
Nearest neighbor imputation algorithms: a critical evaluation. BMC Medical Informatics and Decision Making 16 (3), pp. 197–208.
Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42.
Traffic transformer: capturing the continuity and periodicity of time series for traffic forecasting. Transactions in GIS 24 (3), pp. 736–755.
BRITS: bidirectional recurrent imputation for time series. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 6776–6786.
Recurrent neural networks for multivariate time series with missing values. Scientific Reports 8 (1), pp. 1–12.
Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 92 (3), pp. 708–721.
Convolutional neural networks on graphs with fast localized spectral filtering. Advances in Neural Information Processing Systems 29, pp. 3844–3852.
Time series analysis by state space methods. Oxford University Press.
MaskGAN: better text generation via filling in the ______. In International Conference on Learning Representations.
Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428.
Supervised learning from incomplete data via an EM approach. In Advances in Neural Information Processing Systems, Vol. 6.

Neural message passing for quantum chemistry.
In
International conference on machine learning
, pp. 1263–1272. Cited by: §4.1.  Deep learning. Vol. 1, MIT Press. Cited by: §1.
Generative adversarial nets. Advances in Neural Information Processing Systems 27.
Array programming with NumPy. Nature 585 (7825), pp. 357–362.
Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
Big data and its technical challenges. Communications of the ACM 57 (7), pp. 86–94.
Imputation of incomplete non-stationary seasonal time series data. Mathematical Theory and Modeling 3 (12), pp. 142–154.
Neural relational inference for interacting systems. In International Conference on Machine Learning, pp. 2688–2697.
Deep learning. Nature 521 (7553), pp. 436–444.
Diffusion convolutional recurrent neural network: data-driven traffic forecasting. In International Conference on Learning Representations.
Directly modeling missing data in sequences with RNNs: improved classification of clinical time series. In Proceedings of the 1st Machine Learning for Healthcare Conference, F. Doshi-Velez, J. Fackler, D. Kale, B. Wallace, and J. Wiens (Eds.), Proceedings of Machine Learning Research, Vol. 56, pp. 253–270.
Correntropy: properties and applications in non-Gaussian signal processing. IEEE Transactions on Signal Processing 55 (11), pp. 5286–5298.
NAOMI: non-autoregressive multiresolution sequence imputation. Advances in Neural Information Processing Systems 32, pp. 11238–11248.

Multivariate time series imputation with generative adversarial networks. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31.
E²GAN: end-to-end generative adversarial network for multivariate time series imputation. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 3094–3100.
Non-negative matrix factorization for time series recovery from a few temporal aggregates. In International Conference on Machine Learning, pp. 2382–2390.
ImputeTS: time series missing value imputation in R. The R Journal 9 (1), pp. 207–218.
Missing data: a comparison of neural network and expectation maximization techniques. Current Science, pp. 1514–1521.
Graph edit networks. In International Conference on Learning Representations.
PyTorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32, pp. 8026–8037.
Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
Temporal graph networks for deep learning on dynamic graphs. arXiv preprint arXiv:2006.10637.
Fancyimpute: an imputation library for Python.
The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80.
Deep learning in neural networks: an overview. Neural Networks 61, pp. 85–117.
Structured sequence modeling with graph convolutional recurrent networks. In International Conference on Neural Information Processing, pp. 362–373.
Discrete graph structure learning for forecasting multiple time series. In International Conference on Learning Representations.

The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine 30 (3), pp. 83–98.
Missing data imputation with adversarially-trained graph convolutional networks. Neural Networks 129, pp. 249–260.
Missing value estimation methods for DNA microarrays. Bioinformatics 17 (6), pp. 520–525.
Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
Multiple imputation using chained equations: issues and guidance for practice. Statistics in Medicine 30 (4), pp. 377–399.
Connecting the dots: multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 753–763.
Graph WaveNet for deep spatial-temporal graph modeling. arXiv preprint arXiv:1906.00121.
ST-MVL: filling missing values in geo-sensory time series data. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, pp. 2704–2710. ISBN 9781577357704.
GAIN: missing data imputation using generative adversarial nets. In International Conference on Machine Learning, pp. 5689–5698.
Estimating missing data in temporal data streams using multi-directional recurrent neural networks. IEEE Transactions on Biomedical Engineering 66 (5), pp. 1477–1490.
Handling missing data with graph representation learning. In Advances in Neural Information Processing Systems (NeurIPS).
Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875.
Autoregressive models for sequences of graphs. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
GaAN: gated attention networks for learning on large and spatiotemporal graphs. In 34th Conference on Uncertainty in Artificial Intelligence (UAI 2018).
Urban computing: concepts, methodologies, and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 5 (3), pp. 1–55.
Forecasting fine-grained air quality based on big data. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2267–2276.
Appendix A Detailed experimental settings
In this appendix, we give more details on the experimental settings used to evaluate our approach. We train BRITS and GRIL by sampling batches at random at each epoch; we fix a maximum number of epochs and use early stopping on the validation set with a fixed patience. The loss function used for training GRIL is a sum of terms, each of the form of Equation 1, where the element-wise error function is the MAE. While training both models, we randomly mask out an additional fraction of the input data for each batch to increase robustness to noise and missing data; the estimation error on these data is included in the computation of the loss.
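The additional input masking described above can be sketched as follows. This is an illustrative NumPy sketch, not the released implementation: the masking fraction `p`, the zero-filling of hidden entries and the (time steps, channels) array layout are assumptions.

```python
import numpy as np

def mask_inputs(x, mask, p=0.05, rng=None):
    """Randomly hide an extra fraction p of the observed entries of x.

    x:    (time steps, channels) array of values
    mask: same shape, True where the value is actually observed
    Returns the corrupted input, its mask, and the evaluation mask
    flagging the entries hidden on purpose.
    """
    mask = np.asarray(mask, dtype=bool)
    rng = np.random.default_rng(rng)
    # Only observed entries can be hidden for evaluation.
    eval_mask = (rng.random(x.shape) < p) & mask
    input_mask = mask & ~eval_mask
    x_in = np.where(input_mask, x, 0.0)  # zero-fill hidden entries
    return x_in, input_mask, eval_mask

def masked_mae(y_hat, y, eval_mask):
    # MAE restricted to the entries that were masked out on purpose.
    return float(np.abs(y_hat - y)[eval_mask].mean())
```

The loss would then be accumulated from `masked_mae` terms computed on the entries flagged by `eval_mask`, on top of the reconstruction error on the regularly observed entries.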
For BRITS, we use the same network hyperparameters of Cao et al. [2018]. For GRIL, we fix the hidden dimension of both the spatiotemporal encoder and the spatial decoder, as well as that of the last hidden layer of the decoder. We use diffusion convolution as the message-passing operation, with a fixed number of diffusion steps. Furthermore, GRIL uses a cosine learning rate scheduler. Note that, due to architectural differences, BRITS has a far higher number of parameters than GRIL (the gap depends on the considered dataset). For data processing, we use the same steps as Li et al. [2018]: data are normalized across the feature dimension (i.e., graph-wise for GRIL and node-wise for BRITS). Data masked out to simulate the presence of missing values are never used to train the models.

All the models were developed in Python using the following open-source libraries:

PyTorch [Paszke et al., 2019];

numpy [Harris et al., 2020];

PyTorch Geometric [Fey and Lenssen, 2019];

scikit-learn [Pedregosa et al., 2011];

fancyimpute [Rubinsteyn and Feldman, 2016].
For the implementation of BRITS, we used the code provided by the authors (https://github.com/caow13/BRITS). We plan to release the code used to run the experiments, together with a library for spatiotemporal time-series processing, in the near future.
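As background for the diffusion convolution used as the message-passing operation above, the following is a minimal sketch in the spirit of Li et al. [2018], restricted to the forward random-walk transition matrix; the scalar filter coefficients `theta` and all variable names are illustrative assumptions, not the actual parameterization.

```python
import numpy as np

def diffusion_conv(x, adj, theta):
    """Diffusion convolution sketch: sum_k theta[k] * (D^-1 A)^k x.

    x:     (nodes, features) node signal
    adj:   (nodes, nodes) weighted adjacency matrix A
    theta: (K + 1,) filter coefficients, one per diffusion step
    """
    # Random-walk transition matrix (rows normalized by degree).
    deg = adj.sum(axis=1, keepdims=True)
    p = adj / np.where(deg > 0, deg, 1.0)
    out = np.zeros_like(x, dtype=float)
    z = x.astype(float).copy()
    for t in theta:
        out += t * z  # contribution of the current diffusion step
        z = p @ z     # diffuse the signal one step further
    return out
```

In the actual architecture each step would carry learned weight matrices rather than scalars, but the propagation scheme is the same.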
A.1 Air quality
We select windows of data of fixed length for AQI and for AQI-36 (in line with Cao et al. [2018]). To evaluate the imputation performance, we mask out a value from the test set and use it as ground truth if: (1) the value is not missing and (2) the value is missing at the same hour and day in the following month. To obtain an adjacency matrix from the geographical distances of the nodes, we make use of a thresholded Gaussian kernel [Shuman et al., 2013]: the weight of the edge connecting the i-th and j-th nodes is w_ij = exp(−dist(i, j)² / γ) if dist(i, j) ≤ δ, and w_ij = 0 otherwise, where dist(·, ·) is the geographical distance operator, γ is the standard deviation of the distances and δ is the threshold. Here, we applied the threshold directly to the computed values to increase sparsity.
A.2 Traffic
For this setting, we increase the hidden size of BRITS to account for the higher input dimensionality. We use input sequences of a fixed number of steps, corresponding to a fixed span of hours of data.
A.3 Smart grids
For the smart grid setting, we use the same hyperparameters as in the traffic benchmarks. To compute the similarity among channels, we use week-wise correntropy. Data were normalized using standard scaling, as in the previous settings, and we did not perform additional preprocessing steps.
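The week-wise correntropy similarity between two channels could be computed as sketched below, using a Gaussian kernel as in Liu et al. [2007] (up to the kernel normalization constant); the kernel width `sigma` and the fixed-length windowing scheme are illustrative assumptions.

```python
import numpy as np

def correntropy(x, y, sigma=1.0):
    """Sample correntropy of two equal-length series: the mean of a
    Gaussian kernel applied to their pointwise differences."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.mean(np.exp(-d ** 2 / (2 * sigma ** 2))))

def weekwise_similarity(a, b, steps_per_week, sigma=1.0):
    """Average correntropy over aligned week-long windows of two channels."""
    n_weeks = min(len(a), len(b)) // steps_per_week
    vals = [
        correntropy(a[w * steps_per_week:(w + 1) * steps_per_week],
                    b[w * steps_per_week:(w + 1) * steps_per_week],
                    sigma)
        for w in range(n_weeks)
    ]
    return float(np.mean(vals))
```

Identical channels yield a similarity of 1, and the score decays toward 0 as the channels diverge, so the resulting matrix can be thresholded into an adjacency matrix in the same way as the distance-based kernels of the other settings.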