## 1 Introduction

Missing data imputation is a ubiquitous issue that arises in a variety of domains. Most supervised deep learning methods require complete datasets, but real-world datasets often suffer from incompleteness due to access problems or mistakes in data collection (yoon2016discovery; yoon2018personalized; kreindler2006effects; van2018flexible). Numerous fields thus require missing data imputation (MDI) methods to reconstruct a complete dataset, including biostatistics (mackinnon2010use), epidemiology (sterne2009multiple), and irregular time-series analysis (kreindler2006effects). Classically, the underlying mechanism giving rise to missing data is categorized into three types (rubin1976inference): (1) if the probability of being missing is completely independent of the data, the data are said to be missing completely at random (MCAR); (2) data are said to be missing at random (MAR) if the probability of being missing is the same only within groups defined by the observed data; (3) if the probability of being missing depends on both observed and unobserved variables, the data are missing not at random (MNAR).
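As an illustration, the three mechanisms can be simulated on a complete data matrix. The sketch below is ours, not from the paper; the MAR and MNAR generators are one simple choice among many possible designs.

```python
import numpy as np

def mcar_mask(X, p, rng):
    """MCAR: every entry goes missing independently with probability p."""
    return rng.random(X.shape) < p

def mar_mask(X, p, rng):
    """MAR (illustrative): missingness in columns 1..d-1 depends only on
    the always-observed first column."""
    n, d = X.shape
    mask = np.zeros((n, d), dtype=bool)
    # rows above the median of the conditioning column are masked at rate 2p
    probs = 2 * p * (X[:, 0] > np.median(X[:, 0]))
    mask[:, 1:] = rng.random((n, d - 1)) < probs[:, None]
    return mask

def mnar_mask(X, p):
    """MNAR (illustrative): an entry's own (possibly unobserved) value
    determines its missingness: the top p-fraction of each column is lost."""
    return X > np.quantile(X, 1 - p, axis=0)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
print(round(mcar_mask(X, 0.2, rng).mean(), 2))  # close to 0.2
```

Under MCAR the mask is independent of `X`, under MAR it depends only on observed columns, and under MNAR it depends on the very values that end up missing, which is what makes the last setting the hardest to handle.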

Predictive approaches to MDI (bertsimas2017predictive) can be categorized into two families: (i) building a global model for data imputation, or (ii) inferring the missing components from data points similar to the one with missing values. The second class typically uses advanced k-NN strategies (acuna2004treatment). The first class ranges from simple statistics of the entire dataset (e.g., medians) and linear models (lakshminarayan1996imputation; wang2006missing) to, more recently, deep neural architectures (smieja2018processing; yoon2018gain; nazabal2020handling; spinelli2020missing).

Recently, it was noted that unifying the two paradigms (i.e., inferring similar data points for each imputation and building a global model from the overall dataset) can be beneficial for MDI. This can be done by exploiting Graph Neural Networks (GNNs) (narang2013signal; chen2020learning; jiang2020incomplete; rossi2021unreasonable), a class of neural networks that process graph-based data in a differentiable fashion. In particular, the graph imputer neural network (GINN) (spinelli2020missing) endows tabular data with a graph topology based on a pre-computed similarity between points in the feature space, and then exploits a GNN to tackle the MDI problem. However, the GINN model has two limitations. First, the method is tested only on the entire dataset (i.e., no mini-batching is performed, which is challenging with GNN models), and scaling it up to large datasets is unfeasible due to the quadratic cost of computing the similarity matrix. Second, the graph connectivity must be computed beforehand in the feature space, employing only the values observed by both data points, which requires defining a suitable distance metric and a customized procedure to sparsify the graph.

In general, building the connectivity beforehand, as done in GINN, is necessary because the underlying assumption of most GNNs is that the graph topology is given and fixed; convolution-like operations then typically amount to modifying the node-wise features by averaging information from the neighbours. However, the hand-crafted connectivity might be sub-optimal, and it requires a number of hyper-parameters to be defined and manually optimized. Automatically learning the latent graph structure can overcome these limitations by inferring the underlying graph relationships. Such latent graphs can capture the actual topology of structured data through the downstream task, yielding a task-related topology and thus conveying model interpretability. Recently, latent graph learning has become an important research topic in the GNN literature, as described in Section 2.2.

Based on these considerations, the main contributions of this paper are as follows. (i) We introduce an end-to-end trainable network architecture that learns the optimal underlying latent graph for tabular data using a novel EdGe Generation Graph AutoEncoder (EGG-GAE) module. The EGG-GAE sampling scheme is optimized with respect to downstream task metrics, using a straight-through estimator (jang2016categorical) in the backward pass to ensure differentiability. (ii) We demonstrate in a large experimental evaluation that employing the latent graph predictions for the missing data imputation (MDI) problem yields consistent improvements across a number of datasets, outperforming baselines and reaching state-of-the-art performance. (iii) We propose the concept of tabular graph mini-batching, along with an ensembling technique, to resolve the MDI scalability issues (miao2022experimental). (iv) We propose the novel concept of learnable *prototype nodes*, which encode a learnable data representation in the form of an additional set of nodes added to the imputed graph, providing each data point in the mini-batch with a reliable neighbourhood.

## 2 Related Works

Before describing our proposed solution for tabular MDI, we provide a brief overview of related works along three lines: MDI methods for tabular data (Section 2.1), latent graph learning for GNNs (Section 2.2), and methods to employ GNNs on graphs with missing data (Section 2.3).

### 2.1 Tabular missing data imputation

MDI algorithms can be categorized depending on whether they are discriminative or generative, univariate or multivariate, and on whether they provide one or multiple imputations for each missing data point. In this work we present a generative model which performs multivariate imputations and can provide multiple imputations for each missing datum.

The imputation strategies can be divided into three categories: (1) statistical, (2) machine learning (ML), and (3) deep learning (DL) based. Statistical methods exploit the observed data to obtain mean, median, or mode estimates of missing data points (farhangfar2007novel). Among traditional machine learning imputation approaches, multiple imputation using chained equations (MICE) (van2011mice) is considered one of the most flexible and powerful. MICE iteratively imputes each dataset variable while keeping the others constant, selecting one or more observations from a predictive distribution on that variable. Although MICE has performed well in some cases, its underlying assumptions may result in biased predictions and lower accuracy (azur2011multiple). ML algorithms include k-nearest neighbours (KNN) (acuna2004treatment; lakshminarayan1996imputation), support vector techniques (wang2006missing), and several others. In practice, these approaches have mixed results compared to more straightforward techniques such as mean imputation (bertsimas2017predictive). KNN is generally limited to weighted averaging among similar feature vectors, whereas the other algorithms must build a global dataset model for imputation. DL models include deep denoising autoencoders (gondara2018mida; bengio1995recurrent) and generative models (yoon2018gain; nazabal2020handling). Multilayer nonlinear computation allows these methods to capture more complex correlations in the data; however, they still require building a global model from the dataset while ignoring potentially significant contributions from similar points. GINN (spinelli2020missing) addressed the problem of leveraging both the global aspect of the dataset and the local dependencies between data points utilizing GNNs. However, GINN requires the calculation of a pre-defined similarity matrix on the entire dataset in the feature space, which is unfeasible for most real-world databases.

### 2.2 Latent graph learning

GNNs exploit the general idea of localized message-passing, e.g., using graph convolutional layers (kipf2016semi), GraphSAGE (hamilton2017inductive), edge convolutions (gilmer2017neural), and graph attention (velivckovic2018graph). In their basic formulation, GNN layers require the graph connectivity to be provided as input to the model. Mini-batches of data can be formed by sampling several graphs from a pool of graphs, or by sampling sub-graphs with a fixed number of nodes or connections from the original graph (zou2019layer; cong2020minimal). The common disadvantages of these methods are that they require a given, fixed input graph, and that the graph must be sparse to keep the approach computationally feasible, especially for large graphs. Recently, methods that do not assume a given or fixed graph connectivity were proposed; such methods construct the graph dynamically during training. Dynamic Graph CNNs (DGCNN) (wang2019dynamic) use KNN to construct the graph on-the-fly in the feature space of the neural network. Later, a graph learning model was proposed (cosmo2020latent) that builds a probabilistic graph as a weighted adjacency matrix optimized for the classification result. The graph is built in a fully connected manner, however, implying that the model cannot exploit the possible sparseness of the graph. The more general Differentiable Graph Module (DGM) (kazi2022differentiable) explicitly models sparse latent graphs employing the Gumbel-top-k trick (kool2019stochastic), overcoming the dense-graph limitation by fixing the number of neighbours per node, which allows working with bigger graphs. We take inspiration from DGM for our sampling scheme, and we detail the major differences in Section 4.

### 2.3 Node feature imputation in GNNs

GNN models typically cannot directly handle attribute-incomplete graph data, where rows represent nodes and columns feature channels; in real-world scenarios, however, features are often observed only for a subset of the nodes. Several works address missing node features in graph machine learning tasks. SAT (chen2020learning) uses transformer-like models for feature completion, followed by an independent GNN to solve the downstream task, which leads to a sub-optimal solution. GCNMF (taguchi2021graph) overcomes this limitation by employing a Gaussian mixture model (GMM) to represent missing node features and jointly learning the GMM and GNN parameters; however, this significantly increases the number of trainable parameters, implying a high computational cost. PaGNNs (jiang2020incomplete) introduced partial aggregation functions that propagate only the observed features, but these cannot scale to large graphs (rossi2021unreasonable). Recently, a discrete diffusion-based feature reconstruction framework was proposed that leads to a simple, fast, and scalable iterative algorithm (rossi2021unreasonable), though the method is designed for homophilic graphs.

## 3 Problem Formulation

Let the matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$ denote a $d$-dimensional dataset, and let $\mathbf{y}$ represent its corresponding target vector. Without loss of generality, we assume each data vector contains $d_{\text{num}}$ numerical variables and $d_{\text{cat}}$ categorical variables, with $d = d_{\text{num}} + d_{\text{cat}}$. We assume that each categorical variable takes values among a finite number of classes. The dataset is referred to as mixed if $d_{\text{num}}, d_{\text{cat}} > 0$, numerical if $d_{\text{cat}} = 0$, and categorical if $d_{\text{num}} = 0$. By definition, some percentage of the dataset entries are missing (corrupted). We associate a binary matrix $\mathbf{M} \in \{0, 1\}^{N \times d}$ to identify missing and observed variables, where $M_{ij} = 1$ corresponds to an observed value and $M_{ij} = 0$ indicates a missing one. Note that the corruption process can be of many types (MCAR, MNAR, MAR) and is generally unknown to the user. The imputation process aims to provide a plausible estimation $\hat{\mathbf{X}}$ for the unobserved values, such that (i) the imputed dataset is as close as possible to the real complete dataset (if such exists), and (ii) the imputed dataset achieves strong downstream task performance when adopted as input to predict the corresponding target vector $\mathbf{y}$.

##### Dataset preprocessing

We assume that the dataset $\mathbf{X}$ is properly normalized and corrupted in advance, so that the corresponding corruption matrix $\mathbf{M}$ is determined. Training the different imputation approaches discussed in this paper requires distinct data preprocessing strategies. A straightforward way is to employ statistics of the observed values: in our work, numerical values are initially filled with the corresponding *mean*, while categorical ones are approximated with the corresponding *most frequent class*, unless otherwise stated. Some of the imputation baselines discussed in this paper also require one-hot encoding of categorical values. The preprocessed dataset is denoted as $\tilde{\mathbf{X}}$, which is the input to the subsequent EGG-GAE module.

## 4 Method

We propose a general solution for the tabular data MDI problem based on graph representation learning; the overall pipeline is depicted in Fig. 1. At every iteration, a mini-batch of data is sampled and preprocessed according to the procedure described in Section 4.1. This mini-batching is necessary, as it allows the algorithm to scale to large datasets. Next, we build a graph where each node is a row of the sampled mini-batch, and its corresponding node features are a non-linear mapping of the original inputs. The connectivity between nodes is learned through a differentiable sampling procedure, instead of being fixed as in previous works (e.g., spinelli2020missing). We describe different variations of this component in Section 4.2. We also propose two extensions to this basic architecture, namely ensembling and prototype nodes, in Section 4.3. The entire system is trained end-to-end with a combination of imputation losses and a downstream classification loss, as described in Section 4.4.

### 4.1 Mini-batch preprocessing

As in spinelli2020missing, we turn MDI into a predictive task by employing a surrogate task: we randomly remove elements from the mini-batch and impute them, training the model to reconstruct the removed values. In contrast to denoising auto-encoders (vincent2008extracting), we only predict the removed elements rather than reconstructing the entire input. To obtain a surrogate batch-level corruption matrix, we use the MCAR mechanism to mask a certain percentage of the sampled batch. Precomputed statistics replace the masked numerical values, while unobserved categorical entries are replaced with auxiliary tokens. We then replace each categorical variable with a trainable dense embedding, whose size is a hyperparameter. Note that the initial missing values, which have to be imputed during inference, are represented as observed variables in the surrogate batch-level corruption matrix.
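A minimal sketch of this surrogate corruption step (our own illustration; the embedding lookup and the exact token handling in the paper may differ):

```python
import numpy as np

def surrogate_corrupt(num, cat, p, num_means, cat_token, rng):
    """Mask a fraction p of a mini-batch with the MCAR mechanism.

    num: (B, d_num) float array; cat: (B, d_cat) integer-coded array.
    Masked numerical entries are replaced by precomputed column means,
    masked categorical entries by an auxiliary token id (which would be
    mapped to a trainable dense embedding downstream)."""
    m_num = rng.random(num.shape) < p   # True = artificially removed
    m_cat = rng.random(cat.shape) < p
    num_c = np.where(m_num, num_means, num)
    cat_c = np.where(m_cat, cat_token, cat)
    return num_c, cat_c, m_num, m_cat

rng = np.random.default_rng(0)
num = rng.standard_normal((4, 3))
cat = rng.integers(0, 5, size=(4, 2))
num_c, cat_c, m_num, m_cat = surrogate_corrupt(
    num, cat, 0.3, num.mean(axis=0), cat_token=5, rng=rng)
```

The returned masks play the role of the surrogate corruption matrix: the reconstruction loss is computed only on the entries they flag as removed.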

The concatenation of the preprocessed batch parts, categorical and numerical, yields the final preprocessed batch. To avoid cumbersome notation, we refer to the final batch representation as $\tilde{\mathbf{X}}_b$.

### 4.2 Architecture

The core feature of the proposed EGG-GAE model is the EGG block, which endows the tabular representation of the sampled mini-batch with a graph topology in order to predict the missing values with a GNN module. The general EGG module comprises three components, depicted schematically in Fig. 2: (i) a node projector, which transforms the input features by projecting each row into a new space that we call the graph embedding space; (ii) a sampler, which obtains the edge set by sparsifying the set of all possible edges between nodes; (iii) a GNN head that operates on the obtained graph. The EGG blocks can be stacked sequentially, with their outputs concatenated to obtain the final representation. In the following, we describe two variations of the EGG block: the standard EGG block, which samples each edge independently, and the restricted $k$-EGG block, which samples exactly $k$ neighbours for each node.

Before the first EGG block, the preprocessed batch is encoded via an initial mapping function:

$$\mathbf{H}^{(0)} = f\big(\tilde{\mathbf{X}}_b\big) \qquad (1)$$

where $f$ is a two-layer MLP with a ReLU activation and batch normalization in between.
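In inference-mode NumPy, such a mapping can be sketched as follows (the weights are random stand-ins for trained parameters, and the batch-norm scale/shift are omitted for brevity):

```python
import numpy as np

def mlp_block(X, W1, b1, W2, b2, eps=1e-5):
    """Two-layer MLP with batch normalization and ReLU between the
    linear layers, mirroring the description of the initial mapping."""
    h = X @ W1 + b1
    h = (h - h.mean(0)) / np.sqrt(h.var(0) + eps)  # batch normalization
    h = np.maximum(h, 0.0)                         # ReLU
    return h @ W2 + b2

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 6))       # a mini-batch of 8 rows, 6 features
W1, b1 = rng.standard_normal((6, 16)), np.zeros(16)
W2, b2 = rng.standard_normal((16, 16)), np.zeros(16)
H0 = mlp_block(X, W1, b1, W2, b2)     # node features for the first EGG block
```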

##### EGG

Each EGG block first projects the input with an additional row-wise operation, obtaining:

$$\mathbf{Z} = g\big(\mathbf{H}\big) \qquad (2)$$

where $g$ has the same architecture as $f$. The sampler block represents pairwise edge relations of the projected $\mathbf{Z}$ by first forming a probability matrix $\mathbf{P}$, whose $(i,j)$-th element is computed as:

$$P_{ij} = \exp\big(-T \lVert \mathbf{z}_i - \mathbf{z}_j \rVert^2\big) \qquad (3)$$

where $T$ is a learnable scale and $\mathbf{z}_i$, $\mathbf{z}_j$ are the $i$-th and $j$-th rows of the matrix $\mathbf{Z}$. Each element of $\mathbf{P}$ represents the probability of sampling the corresponding edge in the output graph. In order to sample an undirected graph, corresponding to a strictly upper triangular adjacency matrix $\tilde{\mathbf{A}}$, we combine a Gumbel-Softmax trick (jang2016categorical) for sampling from $\mathbf{P}$ with masking:

$$\tilde{\mathbf{A}} = \mathbf{U} \odot \operatorname{GumbelSoftmax}\big(\mathbf{P};\, \tau\big) \qquad (4)$$

where $\tau$ is a separate temperature hyperparameter, $\mathbf{U}$ is a strictly upper triangular matrix with ones above the main diagonal, $\odot$ is the Hadamard product, and $\operatorname{GumbelSoftmax}(\cdot)$ applies the relaxed sampling element-wise. We obtain a sparse matrix $\hat{\mathbf{A}}$ by then thresholding $\tilde{\mathbf{A}}$. The final sparse unweighted adjacency matrix is computed as:

$$\mathbf{A} = \hat{\mathbf{A}} + \hat{\mathbf{A}}^{\top} + \mathbf{I} \qquad (5)$$

where $\mathbf{I}$ is the identity matrix. In order to have a valid gradient for back-propagation with respect to the thresholding operation, we use a straight-through estimator in the backward pass (jang2016categorical), allowing the gradient to flow through the sampling scheme. The final input of the GCN head is then the sparse unweighted adjacency matrix $\mathbf{A}$ along with the corresponding feature representation $\mathbf{H}$. The updated node representation is computed as:

$$\tilde{\mathbf{H}} = \operatorname{GNN}\big(\mathbf{H}, \mathbf{A}\big) \qquad (6)$$

$$\mathbf{H}' = \operatorname{LayerNorm}\big(\mathbf{H} + \tilde{\mathbf{H}}\big) \qquad (7)$$

where $\operatorname{GNN}(\cdot, \cdot)$ is a graph convolutional layer (kipf2016semi) or any other message-passing layer that operates on the graph connectivity.
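The sampler can be illustrated end-to-end in NumPy. This is a forward-pass sketch under our reading of the sampling scheme: the exact probability model and threshold may differ from the paper, and the straight-through trick only matters in the backward pass, which plain NumPy does not model.

```python
import numpy as np

def sample_adjacency(Z, tau=1.0, rng=None):
    """Sample a sparse symmetric adjacency matrix from node embeddings Z."""
    rng = rng or np.random.default_rng()
    n = len(Z)
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    P = np.exp(-d2)                          # pairwise edge probabilities
    U = np.triu(np.ones((n, n)), k=1)        # strictly upper triangular mask
    # binary Gumbel-Softmax (concrete) relaxation, one draw per edge
    g1, g0 = rng.gumbel(size=(2, n, n))
    logits = (np.log(P + 1e-9) - np.log(1 - P + 1e-9) + g1 - g0) / tau
    A_soft = U * (1.0 / (1.0 + np.exp(-np.clip(logits, -30, 30))))
    A_hat = (A_soft > 0.5).astype(float)     # hard threshold; a straight-through
                                             # estimator would pass gradients here
    return A_hat + A_hat.T + np.eye(n)       # symmetrize and add self-loops

A = sample_adjacency(np.random.default_rng(0).standard_normal((5, 3)))
print(A.shape)  # (5, 5)
```

Lowering `tau` pushes the relaxed samples towards hard 0/1 edges, which is the usual annealing behaviour of Gumbel-Softmax relaxations.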

##### $k$-EGG

We also explore a variant of EGG, called $k$-EGG, which is more directly inspired by the sampling procedure in kool2019stochastic. It enforces sparsity of the adjacency matrix by limiting the number of neighbours sampled per node to a fixed constant $k$. For each row of $\mathbf{P}$, instead of thresholding each entry, we extract the $k$ edges corresponding to the highest values and fill the corresponding positions of the matrix $\hat{\mathbf{A}}$ with ones. The unweighted adjacency matrix is then computed as in Eq. (5).
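A NumPy sketch of the restricted selection step (our illustration; in training the top-$k$ choice would be applied to Gumbel-perturbed scores so that it stays stochastic):

```python
import numpy as np

def knn_adjacency(P, k):
    """k-EGG sketch: keep the k highest-scoring entries of each row of the
    probability matrix P as edges, then symmetrize and add self-loops."""
    n = P.shape[0]
    P = P.copy()
    np.fill_diagonal(P, -np.inf)          # exclude self edges from the top-k
    idx = np.argsort(-P, axis=1)[:, :k]   # k best neighbours per node
    A_hat = np.zeros((n, n))
    np.put_along_axis(A_hat, idx, 1.0, axis=1)
    return np.clip(A_hat + A_hat.T, 0, 1) + np.eye(n)
```

Fixing `k` bounds the number of edges to `n * k`, which is what makes the restricted variant practical for larger mini-batches.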

##### Comparisons with DGM

Our graph sampling procedure is inspired by the recently proposed DGM block (kazi2022differentiable), with a number of important differences. First, DGM was designed for a graph scenario where the full set of nodes is available beforehand and a single underlying graph connectivity is assumed to exist; instead, we work on randomly sampled mini-batches of data coming from a generic tabular dataset. Second, since the mini-batch size is a user-defined parameter, we directly use the straight-through estimator in the backward pass, avoiding the additional surrogate loss of kazi2022differentiable. Moreover, the proposed module does not require limiting the output dimensions of the node projector to small values, hence obtaining greater model expressivity. Finally, the GNN head in this paper follows a design with skip-connections and layer normalization.

### 4.3 Extensions to the basic formulation

We describe here two simple extensions of the model that we have found to work consistently better in our experimental evaluation.

##### Ensembling

The proposed model has two sources of stochasticity: mini-batch sampling and the continuous relaxation of discrete random variables (the edge sampling procedure). We propose to exploit this stochasticity during inference by repeatedly sampling batches from the test set until each datum (node) has the desired number of predictions, so that each node has multiple predictions relying on different neighbourhoods. In the classification case, we select the class with the maximum average soft prediction, whereas we use the mean prediction in the regression case.
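The inference-time aggregation can be sketched as follows, assuming the per-run soft predictions have already been collected (an illustration, not the authors' code):

```python
import numpy as np

def ensemble_predict(prob_runs, task="classification"):
    """prob_runs: (runs, n, n_classes) soft predictions for classification,
    or (runs, n) values for regression, gathered from differently sampled
    batches/graphs. Predictions are averaged per datum."""
    mean = np.asarray(prob_runs).mean(axis=0)
    if task == "classification":
        return mean.argmax(axis=-1)   # class with the max average soft score
    return mean                        # mean prediction for regression

runs = np.array([[[0.9, 0.1], [0.4, 0.6]],
                 [[0.7, 0.3], [0.2, 0.8]]])
print(ensemble_predict(runs))  # [0 1]
```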

##### Prototype nodes

In this paper, we also propose to use trainable prototype nodes, which we denote as $\mathbf{X}_p$. Applying mini-batching to tabular or graph-structured data may, in exceptional cases, lead to a lack of data expressiveness during the forward pass due to the same sources of stochasticity mentioned above. For instance, in the case of tabular data, there is no guarantee that the sampled mini-batch contains a reliable set of neighbours for the prediction of a particular datum. The prototype nodes are therefore designed to encode common data patterns and allow each data point to have reliable neighbours regardless of the sampled mini-batch. The number of prototype nodes is a hyperparameter. We initialize the prototype nodes randomly, although other strategies, such as the data mean, can be applied. The prototype nodes are then added to the batch before every EGG block. $\mathbf{X}_p$ does not participate explicitly in the objective function, but contributes as sampled neighbours.
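A minimal sketch of how prototype rows can be appended to a mini-batch and stripped again before the loss (class and method names are ours):

```python
import numpy as np

class PrototypeNodes:
    """n_proto extra rows appended to each mini-batch's node features
    before an EGG block; randomly initialized here (the data mean is
    another option), and trainable in the actual model."""
    def __init__(self, n_proto, dim, rng=None):
        rng = rng or np.random.default_rng()
        self.P = rng.standard_normal((n_proto, dim))

    def attach(self, H):
        # prototypes become additional candidate neighbours for every datum
        return np.concatenate([H, self.P], axis=0)

    def detach(self, H_out, batch_size):
        # prototypes contribute only as neighbours, so they are dropped
        # before any loss is computed
        return H_out[:batch_size]
```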

### 4.4 Objective

The objective function consists of two main terms: a downstream loss and an imputation loss. The downstream loss, paired with the model construction, allows us to perform end-to-end training, including the optimization of the sampling scheme (in contrast to the DGM paper (kazi2022differentiable)). In the experiments we use classification datasets; therefore, the proposed model is optimized through a cross-entropy loss.

The imputed data along with the network predictions are obtained from the final representation $\mathbf{H}$ as:

$$\hat{\mathbf{X}} = p_x\big(\mathbf{H}\big) \qquad (8)$$

$$\hat{\mathbf{y}} = p_y\big(\mathbf{H}\big) \qquad (9)$$

where $p_x$ and $p_y$ are linear projectors acting row-wise on the input matrix. The downstream task loss depends on the dataset; in our case it is the cross-entropy. Note that the downstream task loss implicitly contributes to finding the best solution for the imputation problem and for latent graph representation learning. The prototype nodes do not participate in any loss explicitly; however, their implicit contribution through neighbouring batch nodes allows the gradient to flow backwards and learn their representation. The mini-batch masking procedure introduces a surrogate objective that simulates the presence of missing data, for which the reconstruction loss is computed. We optimize the MDI solution with the sum of Eq. (10) and Eq. (11), where $\mathbf{M}^{s}_{\text{num}}$ and $\mathbf{M}^{s}_{\text{cat}}$ are the numerical and categorical parts of the surrogate batch-level corruption matrix, respectively.

$$\mathcal{L}_{\text{num}} = \big\lVert \big(\mathbf{1} - \mathbf{M}^{s}_{\text{num}}\big) \odot \big(\hat{\mathbf{X}}_{\text{num}} - \mathbf{X}_{\text{num}}\big) \big\rVert_F^2 \qquad (10)$$

$$\mathcal{L}_{\text{cat}} = -\sum_{(i,j)\,:\,M^{s}_{\text{cat},ij}=0} \log \hat{p}_{ij}\big(x_{ij}\big) \qquad (11)$$

where $\hat{p}_{ij}(x_{ij})$ denotes the predicted probability of the true class $x_{ij}$.

$$\mathcal{L}_{\text{hom}} = \sum_{i,j} \bar{A}^{\text{id}}_{ij}\, A_{ij} \qquad (12)$$

Graph learning generally assumes a homophilic structure of the data, i.e., similar patterns are connected with a higher probability, which in our case means that we expect patterns of the same class to be linked. We can explicitly enforce homophily by penalizing interclass entries within the sampled adjacency matrix $\mathbf{A}$. The target vector provides an idealized adjacency matrix $\mathbf{A}^{\text{id}}$, where $A^{\text{id}}_{ij} = 1$ if the $i$-th and $j$-th data points belong to the same class, and zero otherwise. The complement of the idealized adjacency is denoted as $\bar{\mathbf{A}}^{\text{id}}$. Equation (12) represents the penalization term, which accumulates the sampled edges that connect nodes of different classes. The final objective is computed as a weighted combination of the losses (10-12) and the downstream loss:

$$\mathcal{L} = \mathcal{L}_{\text{down}} + \lambda_{\text{num}}\,\mathcal{L}_{\text{num}} + \lambda_{\text{cat}}\,\mathcal{L}_{\text{cat}} + \lambda_{\text{hom}}\,\mathcal{L}_{\text{hom}} \qquad (13)$$

where $\lambda_{\text{num}}$, $\lambda_{\text{cat}}$, and $\lambda_{\text{hom}}$ are loss weights, and $\mathcal{L}_{\text{down}}$ is the downstream (cross-entropy) loss.
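The homophily regularizer can be illustrated as follows; we normalize by the number of sampled edges, which is one plausible choice (the exact normalization used in the paper is not specified here):

```python
import numpy as np

def homophily_penalty(A, y):
    """Penalize sampled edges whose endpoints carry different labels."""
    same = (y[:, None] == y[None, :]).astype(float)  # idealized adjacency
    interclass = (1.0 - same) * A                    # edges to penalize
    return interclass.sum() / max(A.sum(), 1.0)

A = np.array([[1., 1., 0.],
              [1., 1., 1.],
              [0., 1., 1.]])
y = np.array([0, 0, 1])
print(homophily_penalty(A, y))  # -> 2/7: two of the seven sampled entries cross classes
```

Driving this term to zero encourages the sampler to connect only same-class rows, which is exactly the homophily assumption the section describes.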

## 5 Experiments and Results

The primary experimental comparison is discussed in Section 5.1, whereas Section 5.2 concentrates on architectural ablations. All experiments are conducted at three different noise levels.

##### Datasets

We validate our method on 15 datasets from the UCI repository by artificially introducing missing values using the MCAR, MNAR, or MAR mechanisms (the setup for MAR and MNAR is replicated from muzellec2020missing). Table 1 displays the aggregate statistics of the datasets. The training and test parts of the SUSY dataset were subsampled to make comparison with the majority of imputation baselines feasible. The remaining datasets are divided into training (70%) and validation (30%) sets. To reflect a real-world scenario, we optimise the model with respect to additional noise introduced into the validation set (using the MCAR mechanism regardless of the source of the initial missingness) and report the results on the initially missing values. Following the relevant literature, we evaluate the imputation and post-imputation prediction performance for each experiment.

Data type | Dataset | #Samples | #Num. Fs. | #Cat. Fs.
---|---|---|---|---
Numerical | Yeast | 1484 | 8 | 0
 | Wireless | 2000 | 7 | 0
 | Abalone | 4177 | 8 | 0
 | Wine-quality | 4898 | 11 | 0
 | Page blocks | 5473 | 10 | 0
 | Electrical grid stability | 10000 | 14 | 0
 | SUSY (small) | 25000 | 18 | 0
Mixed | Anuran | 7195 | 22 | 3
 | Default credit card | 30000 | 14 | 10
 | Adult | 32561 | 6 | 8
Categorical | Car | 1728 | 0 | 6
 | Phishing websites | 2456 | 0 | 9
 | Letter | 20000 | 0 | 16
 | Chess | 28056 | 0 | 6
 | Connect | 67557 | 0 | 42

##### Algorithms

In the experiments, the proposed architectures EGG-GAE and $k$-EGG-GAE are compared with two statistical imputation methods, Mean (little2019statistical) and KNN (troyanskaya2001missing); two machine learning imputation approaches, MICE (van2011mice) and MissForest (MF) (stekhoven2012missforest); and four deep learning approaches: MIDA (gondara2018mida), GINN (spinelli2020missing), GAIN (yoon2018gain), and NN (the proposed architecture wherein the EGG block is substituted with an identical MLP block). To provide a fair comparison with the baselines, we apply the hyper-parameters and data preprocessing steps from the original papers for all datasets. The proposed models and the NN baseline utilize the data pipeline described in this paper.

##### Proposed architecture details

The surrogate batch-level corruption is introduced with the MCAR mechanism at a fixed rate. The number of EGG blocks, the hidden sizes of all MLPs (the feature propagation block and the node mapper), the homophily regularization parameter, and the ensembling parameter used during inference are kept fixed across experiments; the temperature parameter is linearly decreased during training, and the batch size is equal to 300. EGG-GAE and $k$-EGG-GAE share the majority of architectural hyperparameters, and for $k$-EGG-GAE we additionally fix the number of sampled neighbours per node. We utilize RMSprop for the optimization with a fixed learning rate.

### 5.1 Imputation

The main set of experiments addresses the imputation (reconstruction) and predictive performance of the proposed networks in comparison to the baseline algorithms under the MCAR, MNAR, and MAR mechanisms. The predictive performance of an MDI solution for tabular data is typically measured by running classical machine learning (ML) algorithms on a downstream task; to assess post-imputation downstream task performance, we employ a random forest (breiman2001random). We show a schematic result with a unified count of wins and a unified average ranking that take all levels of noise into account. The unified count of wins shows the summary for each level of noise and missingness mechanism (MCAR, MNAR, and MAR). It represents the unified number of times that each method achieves the best performance with respect to the imputation or post-imputation task metrics, i.e., the lowest RMSE and MAE, and the highest imputation and predictive accuracy. To compute the average ranking, we first rank the models for each dataset according to the performance metric. The average ranking is then calculated by averaging the obtained rankings across all datasets. We obtain a separate average ranking for each level of noise, resulting in a matrix of average rankings. The unified average ranking aggregates multiple average rankings based on a variety of performance metrics and is displayed as a bar graph with error bars: the bar height represents the mean of the average rankings across performance metrics and noise levels, while the error bars show the corresponding variations. Note that GINN does not participate in the unified count of wins or the unified average ranking, because the execution time of GINN for big datasets is unfeasible within a reasonable amount of time. The full results for a fixed percentage of missing entries can be found in the supplementary material, Appendix B, where the GINN model is represented as well.
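The average-ranking computation can be reproduced in a few lines (our sketch; ties are left to the default argsort order rather than being averaged):

```python
import numpy as np

def average_ranking(scores):
    """scores[i, j] is method j's error on dataset i (lower is better);
    methods are ranked per dataset (rank 1 = best) and the ranks are
    averaged across datasets."""
    ranks = scores.argsort(axis=1).argsort(axis=1) + 1
    return ranks.mean(axis=0)

scores = np.array([[0.10, 0.20, 0.15],
                   [0.30, 0.25, 0.40]])
print(average_ranking(scores))  # [1.5 2.  2.5]
```

Repeating this for each noise level and metric, and then averaging the resulting matrix, gives the bar heights of the unified average ranking, with the spread across metrics and noise levels providing the error bars.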

Fig. 3 shows that the proposed models outperform the baselines for every missingness mechanism, especially at higher noise levels. Fig. 2(a) shows that $k$-EGG-GAE achieves the best score considerably more often than EGG-GAE, especially in the MCAR and MNAR scenarios. Aggregating the results across every noise level, we observe that EGG-GAE and $k$-EGG-GAE together accumulate the largest share of best cases against their best competitors for the MCAR, MNAR, and MAR mechanisms. The machine learning baselines (KNN, MICE, and MF) are the strongest competitors, cumulatively achieving the best performance in a minority of the cases for the MCAR, MNAR, and MAR scenarios. In Fig. 2(b), we can see that the proposed model framework using an MLP instead of EGG (referred to as NN) performs on average as well as MF, even though it rarely achieves the highest score. In addition, we can see that the proposed EGG-GAE and $k$-EGG-GAE models stay roughly on par.

### 5.2 Ablation experiments

We perform ablation studies over the numerical datasets (reported in Table 1). The ablation study examines model architecture alterations, evaluating ensembling and prototype nodes. The results are averaged across five runs and presented as unified average rankings based on end-to-end accuracy, RMSE, and MAE. In Section 5.2.1 we analyse the proposed methods further by comparing training/inference timings for various neural baselines. The architectural experiments can be found in the supplementary material, Appendix A.

Figure 3(a) shows that performing ensembling during inference enhances the performance on both the downstream and MDI problems regardless of the missingness mechanism. We can see that increasing the number of ensembling iterations causes, on average, an improvement for every type of missingness, as hypothesized. Both EGG-GAE-20 and EGG-GAE-50 have close mean values, and the corresponding variations share the same interval; this suggests that a plateau is reached once the number of iterations surpasses 20. We argue that performing ensembling during inference leads to (i) reliable performance and (ii) enhanced performance on the downstream and MDI problems. Figure 3(b) demonstrates that the models with learnable prototype nodes (EGG-GAE-10, EGG-GAE-20, and EGG-GAE-50) have a lower mean average ranking compared to the model without them (EGG-GAE-0), independent of the missingness mechanism. Increasing the number of prototype nodes helps to achieve better performance; however, introducing too many prototype nodes might worsen it (EGG-GAE-50). We believe that the number of prototype nodes should be at least equal to the number of classes of the downstream task, but that an excessive number affects the sampling scheme by increasing the likelihood that the graphs are built primarily with prototype nodes, resulting in a poorer solution.

#### 5.2.1 Time comparison

Figure 4(a) reports the average training time until convergence of the validation loss for the numerical datasets, while Figure 4(b) depicts the average inference time of the models. Note that for the proposed EGG-GAE architecture, the average training time and the average inference time do not differ much from each other. This follows from the batch size (300) used to train EGG-GAE and $k$-EGG-GAE, which is fixed for all experiments; increasing the batch size would result in quadratic growth of the time consumption due to the pairwise distance calculation. The average training time is approximately the same as MIDA and five times slower than GAIN. The average inference time is approximately 5-6 times greater than that of the NN, MIDA, and GAIN models, mostly due to the ensembling procedure. In Figures 4(c) and 4(d) we illustrate the evolution of accuracy and RMSE over the training time (in seconds) for SUSY, averaged over 10 runs.

## 6 Discussion and Conclusion

In this paper, we propose a generic framework for handling missing values in tabular data employing graph deep learning. In particular, we presented an end-to-end trainable graph autoencoder (EGG-GAE) model that learns the underlying graph representation of tabular data, applied to the MDI problem. We performed extensive experiments on real-world datasets from different fields and found that our model outperforms current state-of-the-art algorithms in terms of imputation and downstream task performance. We described several improvements to our model, demonstrating that ensembling improves both MDI and downstream task performance, and we introduced novel learnable prototype nodes that encode common data patterns and serve as a generic, reliable subset of nodes for the predicted graphs. Finally, we introduced a regularization method that enforces homophily in the learned latent graph representation.

As future work, the proposed EGG-GAE network can be applied to any type of data to introduce a graph topology, e.g., for imputing missing data over images, audio, or other high-dimensional data, exploiting the modularity of modern deep learning architectures. In addition, the Euclidean distance calculation can be substituted with a trainable network to construct probabilistic graphs based on a learned metric distance function. The assumptions and limitations of the proposed framework are described in the supplementary materials, Appendix C.

## Supplementary Materials for EGG-GAE: scalable graph neural networks for tabular data imputation

## Appendix A Architectural Experiments

We perform architectural experiments on the numerical datasets (reported in Table 1). The results are averaged across five runs and presented as unified average rankings based on end-to-end accuracy, RMSE, and MAE. In Section A.1 we analyse the proposed homophily penalization term; in Section A.2 we evaluate different GNN heads; in Section A.3 we investigate the effect of additional manipulation of the embedding space produced by the node projector; and in Section A.4 we examine the impact of varying the number of neighbours sampled per node in the restricted sampling scheme of the -EGG-GAE model.

### a.1 Homophily experiment

We argue that increasing the weight of the penalization term in Eq. 13 enhances the sampling scheme of the EGG-GAE model by restricting the sampling of non-homophilic neighbours. Here we investigate the influence of the proposed homophily loss on the EGG-GAE model. Fig. 6 demonstrates that, on average, using the homophily regularisation term is beneficial, and that increasing the regularisation hyperparameter improves the unified solution on average. Although high penalization improves performance, the variation of EGG-GAE- indicates that the performance enhancement plateaus.
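Eq. 13 itself is not reproduced in this excerpt, so the following is only one plausible reading of a homophily penalty, under assumed names: penalise the probability mass that the latent-graph sampler assigns to edges whose endpoints carry different labels.

```python
import numpy as np

def homophily_penalty(edge_probs, labels):
    """Illustrative stand-in for a homophily regulariser (NOT the paper's
    Eq. 13): the fraction of edge-probability mass placed on cross-class
    pairs. Driving this down discourages sampling non-homophilic
    neighbours.
    """
    same = labels[:, None] == labels[None, :]       # (B, B) same-class mask
    cross_mass = np.sum(edge_probs * ~same)         # mass on cross-class edges
    return float(cross_mass / np.sum(edge_probs))

# Uniform sampler over 4 nodes, two per class: half the mass crosses classes.
probs = np.full((4, 4), 0.25)
labels = np.array([0, 0, 1, 1])
penalty = homophily_penalty(probs, labels)  # 0.5 for this uniform sampler
```

A sampler that concentrated its mass on same-class pairs would drive this penalty toward zero, which is the behaviour the regularisation term rewards.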

### a.2 Heads experiment

Here we inspect how performance changes under different GNN heads. We explore four heads: GCNConv (kipf2016semi), EdgeConv (wang2019dynamic), ARMAConv (bianchi2021graph), and SGConv (wu2019simplifying). As can be seen in Fig. 7, ARMAConv and EdgeConv on average perform better than GCNConv and SGConv, which achieve roughly the same results, further improving on the results from Section 5.1. A potential explanation for the superior performance of ARMAConv and EdgeConv is as follows: ARMAConv is more resistant to noise, which increases its resilience to incorrectly sampled connectivity, while EdgeConv intrinsically weights the contribution of each neighbour, providing additional noise resistance and reducing the contribution of dissimilar examples (sampled due to stochasticity) to the prediction for a given datum. As a result, an additional filter is applied to the sampled nodes.

### a.3 Metric learning experiment

In this part, we investigate the possibility of influencing the embedding space acquired by the node projector. We add additional regularization on the embeddings obtained by Eq. 2 using a triplet loss (schroff2015facenet), calculated as:

$$\mathcal{L}_{\text{triplet}} = \beta \sum_{(e_a,\, e_p,\, e_n)} \max\!\left(0,\; \|e_a - e_p\|_2^2 - \|e_a - e_n\|_2^2 + m\right) \tag{14}$$

where $m$ is the margin, $\beta$ is a regularization hyperparameter, and $(e_a, e_p, e_n)$ are the triplets formed from the embeddings, enforcing homophily by selecting the anchor $e_a$ and positive $e_p$ from the same class and the negative $e_n$ from another. We mine the triplets with the distance-weighted margin-based approach of (wu2017sampling). Fig. 8 demonstrates that applying further regularization on the node embedding space can lead to a better solution; nevertheless, the scale parameter $\beta$ has to be chosen carefully, since high values of $\beta$ result in suboptimal solutions.
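A minimal sketch of the per-triplet term of Eq. 14 (the margin value and the mining strategy of wu2017sampling are not reproduced here; the margin below is an assumption for illustration):

```python
import numpy as np

def triplet_loss(anchors, positives, negatives, margin=0.2):
    """Margin-based triplet loss over batches of embedding triplets.

    Pulls same-class (anchor, positive) embeddings together and pushes
    other-class negatives at least `margin` further away. The margin value
    here is illustrative, not the paper's.
    """
    d_pos = np.sum((anchors - positives) ** 2, axis=1)  # anchor-positive dist.
    d_neg = np.sum((anchors - negatives) ** 2, axis=1)  # anchor-negative dist.
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))

a = np.array([[0.0, 0.0]])   # anchor embedding
p = np.array([[0.1, 0.0]])   # close same-class embedding
n = np.array([[1.0, 0.0]])   # distant other-class embedding
loss = triplet_loss(a, p, n)  # 0.0: the triplet already satisfies the margin
```

When the positive and negative roles are swapped, the hinge activates and the loss becomes positive, which is the gradient signal that reshapes the embedding space toward homophily.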

### a.4 Restricted sampling

In this section we investigate the restricted sampling procedure by varying the number of sampled neighbours k per node. Figure 9 reports the corresponding experiment, where -EGG-GAE-0 is a model that uses only self-nodes. Models that rely on the sampled neighbourhood consistently outperform models with only self-nodes in terms of both the MDI solution and predictive accuracy. Next, we observe that increasing the number of sampled neighbours improves performance on average, up to an optimal value of k. In addition, as the number of sampled neighbours grows, both the average ranking and its variation increase. We hypothesise that this is because a larger sampled neighbourhood contains a higher proportion of noisy neighbours, which degrades performance.
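The restricted scheme can be sketched as drawing k neighbours per node from a distance-based distribution. The softmax-over-negative-distances form and the function names below are assumptions for illustration; with k = 0 the scheme degenerates to self-nodes only, matching the -EGG-GAE-0 baseline.

```python
import numpy as np

def sample_neighbours(dists, k, rng):
    """For each node, draw k distinct neighbours with probability
    decreasing in squared distance (softmax over negated distances).
    Illustrative sketch, not the paper's exact sampler.
    """
    B = dists.shape[0]
    logits = -dists.copy()
    np.fill_diagonal(logits, -np.inf)  # never sample the node itself
    neigh = np.zeros((B, k), dtype=int)
    for i in range(B):
        p = np.exp(logits[i] - logits[i].max())  # stable softmax weights
        neigh[i] = rng.choice(B, size=k, replace=False, p=p / p.sum())
    return neigh

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))
d = np.sum((x[:, None] - x[None, :]) ** 2, axis=-1)  # pairwise sq. distances
nbrs = sample_neighbours(d, k=3, rng=rng)            # (10, 3) neighbour ids
```

Raising k widens each sampled neighbourhood and, as argued above, raises the odds of admitting distant, noisy neighbours.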

## Appendix B Imputation Experiment

The main set of experiments addresses the imputation reconstruction and predictive performance of the proposed networks in comparison to baseline algorithms under the MCAR, MNAR, and MAR mechanisms. The predictive performance of an MDI solution for tabular data is typically measured by classical machine learning (ML) algorithms on a downstream task. To assess post-imputation downstream task performance we employ a random forest (breiman2001random) for all models and report the findings in Table 2. Tables 3, 4, and 5 display the MDI reconstruction error in terms of RMSE and MAE for numerical values and accuracy for categorical values, respectively, when 20% of values are missing. Some GINN results are unavailable, denoted by “-” in the tables, because their execution time exceeds 24 hours.

According to Tables 2-5, for the majority of datasets the proposed EGG-GAE and -EGG-GAE prevail as the best or second-best solution in terms of post-imputation predictive performance and MDI solution, regardless of the missingness mechanism. Tables 2 and 5 demonstrate that for categorical data, algorithms inferring similar data points (EGG-GAE, -EGG-GAE, GINN, and KNNI) achieve the best predictive and MDI performance; in terms of MDI performance, models employing similar data points win in 23 out of 24 cases. Additionally, Table 3 shows that the -EGG-GAE model dominates in 50% of cases overall, and in 60% of cases under the MNAR missingness mechanism. From Table 4, we can see that the machine learning algorithm MF achieves the best result in 10 out of 24 cases, compared to a cumulative win for the EGG-GAE and -EGG-GAE models (cases where the proposed models share first and second place) in 9 out of 24 instances; however, from the schematic representation (Figure 2(a)), it is evident that -EGG-GAE dominates over the MF algorithm when all performance metrics are considered (predictive and imputation accuracy, RMSE, MAE).

## Appendix C Assumptions and Limitations

1) The pairwise distance calculation in the sampling procedure of the EGG block requires a number of operations quadratic in the batch size, so increasing the batch size results in a quadratic increase in training/inference time. In addition, we believe the procedure could generally be replaced with a learnable block. In fact, we believe the sampling process should be iterative, such that the first iteration provides an initial approximation of the neighbourhood and subsequent iterations eliminate noisy neighbours.

2) In the paper we rely on a relatively simple graph construction approach: from the sampled batch we construct a neighbourhood for each node and pass the obtained graph through a GNN head. A number of modifications would allow obtaining better gradients and, as a result, a better solution. For example, we could extract combinations of rows from the obtained matrix (with or without repetitions) and then carry out the subsequent operations described in Section 4 without modification. Such a modification would allow us to obtain multiple predictions for the same data point in a single pass, resulting in a theoretically better gradient estimate.
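The row-combination idea above can be sketched as enumerating several k-subsets of the batch, so each datum appears in multiple candidate graphs within one pass. The helper name and parameters are illustrative, not from the paper:

```python
from itertools import combinations

def neighbourhood_variants(batch_idx, k, n_variants):
    """Sketch of the modification described above: instead of a single
    sampled neighbourhood per node, enumerate up to `n_variants` distinct
    k-subsets of the batch rows. Each subset defines one candidate graph,
    so every datum it contains gets an extra prediction per forward pass.
    """
    pools = []
    for combo in combinations(batch_idx, k):
        pools.append(list(combo))
        if len(pools) == n_variants:
            break
    return pools

# Four 3-row subsets of a 5-element batch; row 0 appears in all of them,
# yielding four predictions for that datum in a single pass.
variants = neighbourhood_variants(range(5), k=3, n_variants=4)
```

Averaging the per-variant predictions (and hence their gradients) for each shared datum is what would yield the lower-variance gradient estimate the paragraph alludes to.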