1. Introduction
Given a graph with partial observations of node features, how can we estimate the missing features accurately? Many real-world data are represented as graphs to model the relationships between entities. Social networks, seller-item graphs in electronic commerce, and user-movie graphs in streaming services are all examples of graph data that have been studied widely in the literature (Yoo2017; Kipf2017; Velickovic2018; Shchur2018; Hu2020). Such graphs become more powerful when combined with feature vectors that describe the diverse properties of nodes (Duong2019; Yang2020).
However, node features are commonly missing in real-world graphs. Users in an online social network set their profiles private, and sellers in electronic commerce often register items without an informative description. In such cases, even the observed features cannot be used properly due to the missing ones, since many graph algorithms assume the full observation of node features. Figure 1 illustrates the feature estimation problem in an example graph. An accurate estimation of missing features not only provides diverse information on node properties but also improves the performance of essential tasks such as node classification or link prediction by providing important evidence for training a classifier.
However, accurate estimation of missing features is challenging for the following reasons. First, target nodes have no specific information that describes their properties. The main evidence for estimation is the graph structure, which gives only partial information about nodes based on their relationships with the other nodes. Second, the target variables are high-dimensional vectors containing up to thousands of elements. This requires large representation power for accurate estimation, involving a high risk of overfitting as a consequence. Existing approaches (Huang2019; Chen2019; Chen2020) failed to address such challenges effectively, resulting in limited performance.
We propose SVGA (Structured Variational Graph Autoencoder), an accurate method for missing feature estimation. The main idea for addressing the challenges is to run structured variational inference, which effectively regularizes the distribution of latent variables by modeling their correlations from the structure of a graph. We first propose stochastic inference, which models the prior of latent variables as a Gaussian Markov random field (GMRF). Then, we improve the stability of inference with our proposed deterministic modeling, which results in a new graph-based regularizer. These allow us to avoid overfitting without degrading the representation power, achieving state-of-the-art performance on real-world datasets.
Our contributions are summarized as follows:

Method. We propose SVGA, an accurate method for missing feature estimation. SVGA introduces a new way to run variational inference on graph-structured data by modeling the correlations between target variables as a GMRF.

Theory. We analyze the theoretical properties of structured variational inference with the stochastic and deterministic modelings. We also analyze the time and space complexities of SVGA, which are both linear in the numbers of nodes and edges of a given graph, showing its scalability.

Experiments. Extensive experiments on eight real-world datasets show that SVGA provides state-of-the-art performance with up to 16.3% higher recall and 14.0% higher nDCG scores in feature estimation, and up to 10.4% higher accuracy in node classification compared to the best competitors.
The rest of this paper is organized as follows. In Section 2, we introduce the problem definition and preliminaries of SVGA. In Section 3, we propose SVGA and discuss its theoretical properties. We present experimental results in Section 4 and describe related works in Section 5. We conclude in Section 6. The code and datasets are available at https://github.com/snudatalab/SVGA.
2. Preliminaries
We introduce the problem definition and preliminaries, including Gaussian Markov random field and variational inference.
2.1. Missing Feature Estimation
The feature estimation problem is defined as follows. We have an undirected graph $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ and $\mathcal{E}$ represent the sets of nodes and edges, respectively. A feature vector $x_i \in \mathbb{R}^d$ exists for every node $i \in \mathcal{V}$, but is observable only for a subset $\mathcal{V}_x \subset \mathcal{V}$ of nodes. Our goal is to predict the missing features of the test nodes $\mathcal{V} \setminus \mathcal{V}_x$ using the structure of $G$ and the observations for $\mathcal{V}_x$. The problem differs from generative learning (Kingma2014) in that there exist correct answers; generative learning is typically an unsupervised problem.
We also assume that the label $y_i$ of each node can be given as an additional input for a set $\mathcal{V}_y$ of nodes such that $\mathcal{V}_y \subseteq \mathcal{V}$. Such labels improve the accuracy of feature estimation, especially when they provide information for the test nodes: $\mathcal{V}_y \cap (\mathcal{V} \setminus \mathcal{V}_x) \neq \emptyset$. This is based on the idea that categorical labels are often easier to acquire than high-dimensional features, and knowing the labels of target nodes gives a meaningful advantage for estimation. Thus, we design our framework to work with $\mathcal{V}_y \neq \emptyset$, although we consider $\mathcal{V}_y = \emptyset$ as the base setup of experiments for consistency with previous approaches that take only the observed features.
2.2. Gaussian Markov Random Field
Gaussian Markov random field (GMRF) (Koller2009) is a graphical model that represents a multivariate Gaussian distribution. Given a graph whose nodes have continuous signals that are correlated by the graph structure, GMRF represents the distribution of signals with two kinds of potential functions, $\phi_i$ and $\phi_{ij}$, for every node $i$ and edge $(i, j)$, respectively. We consider the signal of each node $i$ as a random variable $Z_i$ with a possible value $z_i$. Specifically, the node potential $\phi_i$ for each node $i$ and the edge potential $\phi_{ij}$ for each edge $(i, j)$ are defined as follows:
(1) $\phi_i(z_i) = \exp\left(-\tfrac{1}{2}\Lambda_{ii} z_i^2 + \eta_i z_i\right)$
(2) $\phi_{ij}(z_i, z_j) = \exp\left(-\Lambda_{ij} z_i z_j\right)$
where $\Lambda \in \mathbb{R}^{n \times n}$ and $\eta \in \mathbb{R}^n$ are the parameters of the GMRF, and $n$ is the number of nodes. The nonzero elements of $\Lambda$ correspond to the edges of the graph as depicted in Figure 2.
Then, the joint probability $p(z_1, \ldots, z_n)$ is given as the multiplication of all potential functions:
(3) $p(z_1, \ldots, z_n) = \frac{1}{C} \prod_{i \in \mathcal{V}} \phi_i(z_i) \prod_{(i, j) \in \mathcal{E}} \phi_{ij}(z_i, z_j)$
where $C$ is a normalization constant. Each potential measures how likely $z_i$ or $(z_i, z_j)$ appears under the current probabilistic assumption with the parameters $\Lambda$ and $\eta$, and the joint probability is computed by multiplying the potentials for all nodes and edges.
The roles of the parameters $\Lambda$ and $\eta$ can be understood with respect to the distribution that the GMRF represents. Lemma 2.1 shows that a GMRF is equivalent to a multivariate Gaussian distribution whose mean and covariance are determined by $\Lambda$ and $\eta$. $\Lambda$ is the inverse of the covariance $\Sigma$, and a pair of signals $z_i$ and $z_j$ is more likely to be observed together if $\Lambda_{ij}$ is small. $\eta$ determines the mean of the signals when $\Lambda$ is fixed, and is typically set to zero as we assume no initial bias of signals for the simplicity of computation.
Lemma 2.1.
The joint probability of Equation (3) is the same as the probability density function of a multivariate Gaussian distribution $\mathcal{N}(\mu, \Sigma)$, where $\mu = \Lambda^{-1}\eta$ and $\Sigma = \Lambda^{-1}$.
Proof.
See Appendix A.1. ∎
We utilize GMRF to incorporate a real-world graph into a probabilistic framework. Specifically, we generate a multivariate Gaussian distribution that models the probabilistic relationships between nodes by designing $\Lambda$ and $\eta$ from the adjacency matrix of the given graph. GMRF plays a crucial role in our proposed approach, which aims to run variational inference on graph-structured data without ignoring the correlations between target variables.
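To make the equivalence of Lemma 2.1 concrete, the following minimal sketch builds a small GMRF and checks it against the corresponding Gaussian; the 4-node graph, the diagonal values, and all names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# A minimal sketch of Lemma 2.1: a GMRF with information matrix Lambda and
# potential vector eta is the Gaussian N(Lambda^{-1} eta, Lambda^{-1}).
# The 4-node graph below is an illustrative assumption.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)

# Diagonal dominance makes Lambda positive definite (a valid GMRF).
Lambda = np.diag(A.sum(axis=1) + 1.0) - 0.5 * A
eta = np.zeros(4)  # zero eta, i.e., no initial bias of signals

Sigma = np.linalg.inv(Lambda)   # covariance of the equivalent Gaussian
mu = Sigma @ eta                # mean of the equivalent Gaussian

# Sampling from N(mu, Sigma) simulates the signals of the GMRF.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, Sigma, size=100_000)
print(np.round(np.cov(samples.T), 2))  # empirical covariance ~ Sigma
```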
2.3. Variational Inference for Joint Learning
Variational inference (Kingma2014; Kipf2016; Tomczak2020) is a technique for approximating intractable posterior distributions, which has been used widely for generative learning. Given the adjacency matrix $A$ of a graph, our goal is to find optimal parameters $\theta$ that maximize the likelihood of observed features $X_o$ and labels $Y_o$. We introduce a latent variable for each node and denote the realization of all latent variables by $Z \in \mathbb{R}^{n \times k}$, where $n$ is the number of nodes and $k$ is the size of the variables. The latent variable $z_i$ represents the characteristic of each node $i$ for estimating its feature $x_i$.
With variational inference, we change the problem into maximizing the evidence lower bound (ELBO):
(4) $\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(Z \mid A)}\left[\log p_\theta(X_o, Y_o \mid Z)\right] - D_{\mathrm{KL}}\left(q_\phi(Z \mid A) \,\|\, p(Z)\right)$
where $\mathcal{L}$ is the ELBO, $q_\phi$ is a parameterized distribution of $Z$, and $p_\theta$ is a parameterized distribution of $X$ and $Y$. The first term of $\mathcal{L}$ is the likelihood of observed variables given $Z$, while the second term measures the difference between $q_\phi$ and the prior distribution $p(Z)$ by the KL divergence.
We assume the conditional independence between $x_i$, $y_i$, and the variables of all other nodes given $z_i$, expecting that each variable $z_i$ has sufficient information about node $i$ to generate its feature $x_i$ and label $y_i$. Then, the first term of $\mathcal{L}$ in Equation (4) is rewritten as follows:
(5) $\mathbb{E}_{q}\left[\log p_\theta(X_o, Y_o \mid Z)\right] = \sum_{i \in \mathcal{V}_x} \mathbb{E}_{q}\left[\log p_\theta(x_i \mid z_i)\right] + \sum_{i \in \mathcal{V}_y} \mathbb{E}_{q}\left[\log p_\theta(y_i \mid z_i)\right]$
where $\mathcal{V}_x$ and $\mathcal{V}_y$ are the sets of nodes whose features and labels are observed, respectively, and $q$ denotes $q_\phi(Z \mid A)$.
Equation (5) represents the conditional likelihood of observed features and labels given $Z$. Thus, maximizing Equation (5) is the same as minimizing the reconstruction error of observed variables in typical autoencoders. On the other hand, the KL divergence term in Equation (4) works as a regularizer that forces the distribution of latent variables to be close to the prior $p(Z)$. The characteristic of the regularization depends on how we model the prior $p(Z)$, which plays an essential role in our framework.
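To make the roles of the two ELBO terms concrete, here is a minimal sketch of the resulting training loss; the function and argument names are our own illustrative assumptions, and the Gaussian case of the feature likelihood is used for simplicity.

```python
import torch

def negative_elbo(x_hat, x_true, x_mask, y_logits, y_true, y_mask, kl):
    """Negative ELBO of Eq. (4), a sketch. The reconstruction term follows
    Eq. (5): it sums over nodes with observed features (x_mask) and
    observed labels (y_mask). All argument names are illustrative."""
    recon_x = ((x_hat[x_mask] - x_true[x_mask]) ** 2).sum()   # Gaussian case
    recon_y = torch.nn.functional.cross_entropy(
        y_logits[y_mask], y_true[y_mask], reduction="sum")    # categorical labels
    return recon_x + recon_y + kl                             # KL acts as regularizer
```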
3. Proposed Method
We propose SVGA (Structured Variational Graph Autoencoder), an accurate method for missing feature estimation. The main ideas of SVGA are summarized as follows:

GNN with identity node features (Sec. 3.1). We address the deficiency of input features by utilizing a graph neural network (GNN) with identity node features as an encoder function, which allows us to learn an independent embedding vector for each node during the training.

Structured variational inference (Sec. 3.2). We propose a new way to run variational inference on graphstructured data without ignoring the correlations between target examples. This is done by modeling the prior distribution of latent variables with Gaussian Markov random field (GMRF).

Unified deterministic modeling (Sec. 3.3). We improve the stability of inference by changing the stochastic sampling of latent variables into a deterministic process. This turns the KL divergence term of the ELBO into a general regularizer that controls the space of node representations.
In Section 3.1, we introduce the overall structure of SVGA and the objective function for its training. Then in Sections 3.2 and 3.3, we derive our graph-based regularizer from structured variational inference. Specifically, we propose the basic parameterization of structured inference in Section 3.2 and improve its stability with the deterministic modeling of latent variables in Section 3.3.
3.1. Overall Structure of SVGA
Figure 3 shows the overall structure of SVGA, which consists of an encoder network $f$ and two decoder networks $g_x$ and $g_y$. The networks $f$, $g_x$, and $g_y$ are designed to estimate the target distributions of the ELBO of Equation (4): $q_\phi(Z \mid A)$, $p_{\theta_x}(X \mid Z)$, and $p_{\theta_y}(Y \mid Z)$, respectively, where $\phi$, $\theta_x$, and $\theta_y$ are their parameters. The encoder generates latent representations of nodes, and the decoders $g_x$ and $g_y$ use the generated representations to estimate the features and labels of nodes. The feature decoder $g_x$ makes the final estimation of missing features, while the label decoder $g_y$ helps the training of $f$ and is not used if no labels are observed.
3.1.1. Encoder Network
The encoder network aims to model the latent distribution $q_\phi(Z \mid A)$ with parameters $\phi$. We propose to use a graph neural network (GNN) as $f$, because the main functionality required for $f$ is to generate an embedding vector for each node following the graph structure. In experiments, we adopt a simple graph convolutional network (GCN) (Kipf2017) as $f$, which works well even when the amount of training data is insufficient.
Still, running the GNN encoder on the given graph requires every node to have a feature vector. Only a few nodes have observed features in our case, which creates an imbalance between nodes with and without observed features. Thus, we use the identity matrix $I_n$ as the input of $f$, using the observed features only as the answer for the training of SVGA. This allows $f$ to learn an independent embedding for each node at its first layer and to have sufficient capacity to generate diverse node representations.
If we use a GCN with two layers as in previous work (Kipf2017), the encoder function is defined as $f(A) = \hat{A}\,\sigma(\hat{A} I_n W_1) W_2$, where $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ is the normalized adjacency matrix, $\tilde{A} = A + I_n$ is the adjacency matrix with self-loops, $\tilde{D}$ is the degree matrix such that $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, and $\sigma$ is the ReLU function. $W_1$ and $W_2$ are the weight matrices of layers 1 and 2, respectively, where the input size $n$ of $W_1$ is the number of nodes and the output size $k$ of $W_2$ is the size of latent variables. We do not represent the bias terms for brevity. Note that the node feature matrix of the original formulation of GCN (Kipf2017) is replaced with the identity matrix based on our idea of identity node features.
Unit normalization. A possible limitation of introducing the identity feature matrix is the large size of $W_1$, which can make the training process unstable. Thus, we project the latent representations generated from the encoder onto a unit hypersphere by normalizing the vector of each node $i$ as $z_i / \lVert z_i \rVert_2$. This does not alter the main functionality of making diverse node representations for estimating high-dimensional features, but improves the stability of training by restricting the output space (Ying2018).
3.1.2. Decoder Networks
We propose two decoder networks $g_x$ and $g_y$ to model $p_{\theta_x}(X \mid Z)$ and $p_{\theta_y}(Y \mid Z)$, respectively. We assume that the latent variables $Z$ have sufficient information to construct the observed features and labels. Thus, we minimize the complexity of the decoder networks by adopting the simplest linear transformation as $g_x(z_i) = W_x z_i + b_x$ and $g_y(z_i) = W_y z_i + b_y$, where $W_x \in \mathbb{R}^{d \times k}$ and $W_y \in \mathbb{R}^{c \times k}$ are learnable weights, $b_x$ and $b_y$ are learnable biases, $d$ is the number of features, and $c$ is the number of classes.
3.1.3. Optimization
We update the parameters of all the networks $f$, $g_x$, and $g_y$ in an end-to-end way. We rewrite the ELBO of Equation (4) as the following objective function to be minimized:
(6) $\mathcal{L}(\phi, \theta_x, \theta_y) = \mathcal{L}_x + \mathcal{L}_y + \beta \mathcal{R}$
where $\mathcal{L}_x$ and $\mathcal{L}_y$ are loss terms for features and labels, respectively. $\mathcal{R}$ is our proposed regularizer, whose details are described in Sections 3.2 and 3.3 through the process of structured inference. We use a hyperparameter $\beta$ for the amount of regularization.
The loss terms $\mathcal{L}_x$ and $\mathcal{L}_y$ are determined by how we model the distributions $p_{\theta_x}(X \mid Z)$ and $p_{\theta_y}(Y \mid Z)$ following the distribution of true data. Common distributions for features include Gaussian, Bernoulli, and categorical (or one-hot) distributions:
(7) $\mathcal{L}_x \in \{\mathcal{L}_{\mathrm{Gaussian}}, \mathcal{L}_{\mathrm{Bernoulli}}, \mathcal{L}_{\mathrm{categorical}}\}$
where the specific loss terms are defined as follows:
(8) $\mathcal{L}_{\mathrm{Gaussian}} = \sum_{i \in \mathcal{V}_x} \lVert x_i - \hat{x}_i \rVert_2^2$
(9) $\mathcal{L}_{\mathrm{Bernoulli}} = -\sum_{i \in \mathcal{V}_x} \sum_{j=1}^{d} \left(\alpha\, x_{ij} \log \sigma(\hat{x}_{ij}) + (1 - x_{ij}) \log\left(1 - \sigma(\hat{x}_{ij})\right)\right)$
(10) $\mathcal{L}_{\mathrm{categorical}} = -\sum_{i \in \mathcal{V}_x} \sum_{j=1}^{d} x_{ij} \log \mathrm{softmax}(\hat{x}_i)_j$
$\hat{x}_i = g_x(z_i)$ is the output of the feature decoder, and $\sigma$ is the logistic sigmoid function. We introduce $\alpha$ in Equation (9) to balance the effects of zero and nonzero entries of true features based on their occurrences (Chen2020); $\alpha$ is the ratio of zero entries in the observed feature matrix. For the output of the label decoder, we use the categorical loss, i.e., $\mathcal{L}_y = -\sum_{i \in \mathcal{V}_y} \log \mathrm{softmax}(g_y(z_i))_{y_i}$, due to the property of labels.
Algorithm 1 summarizes the training process of SVGA. It makes latent variables and predictions in lines 1 and 2, respectively, and computes the error between predictions and observations in line 3. Then, it computes our regularizer function in lines 4 and 5, whose details are described in the following subsections, to update the parameters of all three networks in an end-to-end way.
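As an illustration, a minimal sketch of the balanced Bernoulli loss of Equation (9); computing α as the ratio of zero entries follows the text above, while the function and tensor names are our own.

```python
import torch

def bernoulli_feature_loss(x_logits, x_true, obs_idx):
    """Balanced Bernoulli loss of Eq. (9), a sketch. x_logits are the raw
    decoder outputs, x_true is the binary feature matrix, and obs_idx
    indexes the nodes with observed features (names are illustrative)."""
    x_hat, x = x_logits[obs_idx], x_true[obs_idx]
    alpha = (x == 0).float().mean()                          # ratio of zero entries
    pos = alpha * x * torch.nn.functional.logsigmoid(x_hat)  # nonzero entries
    neg = (1 - x) * torch.nn.functional.logsigmoid(-x_hat)   # log(1 - sigmoid)
    return -(pos + neg).sum()
```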
3.2. Structured Variational Inference
Previous works utilizing variational inference (Kingma2014; Kipf2016) assume the prior of latent variables to be a multivariate Gaussian distribution with an identity covariance matrix, and run inference independently for each variable. This assumption is inappropriate in our case, since the correlations between variables, represented as a graph, are the main evidence in our graph-based learning.
We thus model the prior distribution $p(Z)$ of Equation (4) as a Gaussian Markov random field (GMRF) to incorporate the graph structure into the probabilistic modeling of variables. Specifically, we model $p(Z)$ as a GMRF with parameters $\Lambda$ and $\eta = 0$. We make the information matrix $\Lambda$ from $A$ as a graph Laplacian matrix with symmetric normalization (Zhang2015):
(11) $\Lambda = I_n - D^{-1/2} A D^{-1/2}$
where $I_n$ is the identity matrix, and $D$ is the degree matrix such that $D_{ii} = \sum_j A_{ij}$. The resulting $\Lambda$ preserves the structural information of the graph as a positive-semidefinite matrix that satisfies the constraint of GMRF; the nonzero entries of $\Lambda$ except the diagonal ones correspond to those of $A$. Note that $\Lambda$ is a constant, since it represents the fixed prior distribution of variables.
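For concreteness, a small sketch of Equation (11) with SciPy sparse matrices; the function name is ours, and we assume no isolated nodes so that $D^{-1/2}$ is well defined.

```python
import numpy as np
import scipy.sparse as sp

def gmrf_information_matrix(adj):
    """Builds Lambda = I - D^{-1/2} A D^{-1/2} of Eq. (11) from a sparse
    adjacency matrix, a sketch assuming no isolated nodes."""
    deg = np.asarray(adj.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    n = adj.shape[0]
    return sp.identity(n) - d_inv_sqrt @ adj @ d_inv_sqrt
```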
We also model our target distribution $q_\phi(Z \mid A)$ as a multivariate Gaussian distribution $\mathcal{N}(M, \Sigma)$, where $M$ and $\Sigma$ are the mean and covariance matrices of size $n \times k$ and $n \times n$, respectively. We assume that all $k$ elements at each node share the same covariance matrix. $M$ and $\Sigma$ are generated from the encoder functions $f_M$ and $f_\Sigma$, respectively, which contain the set $\phi$ of learnable parameters.
Given the Gaussian modelings of $p(Z)$ and $q_\phi(Z \mid A)$, the KL divergence is formulated as follows:
(12) $D_{\mathrm{KL}}\left(q_\phi(Z \mid A) \,\|\, p(Z)\right) = \frac{1}{2}\left(k \operatorname{tr}(\Lambda \Sigma) + \operatorname{tr}(M^\top \Lambda M) - k \log\det\Sigma\right) + C$
where $C$ is a constant related to $\Lambda$ and $n$. The goal of minimizing the KL divergence is to update the parameters $\phi$ of the encoder functions to make $q_\phi$ similar to $p(Z)$, working as a regularizer of latent variables.
The computational bottleneck of Equation (12) is $\log\det\Sigma$, whose computation is $O(n^3)$ (Han2015). Thus, we decompose the covariance as $\Sigma = \delta I_n + B B^\top$ with a rectangular matrix $B \in \mathbb{R}^{n \times r}$, where $\delta > 0$ and $r$ are hyperparameters such that $r \ll n$ (Tomczak2020). As a result, $\log\det\Sigma$ is computed efficiently by the matrix determinant lemma (Harville1998):
(13) $\log\det\Sigma = \log\det\left(I_r + \delta^{-1} B^\top B\right) + n \log\delta$
where $I_n$ and $I_r$ are the identity matrices of sizes $n$ and $r$, respectively. The computation of Equation (13) is $O(n r^2)$, which is tractable even in graphs with a large number of nodes.
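The log-determinant trick of Equation (13) can be sketched as follows; the function and argument names are illustrative.

```python
import torch

def logdet_lowrank(B, delta):
    """log det(delta * I_n + B B^T) via the matrix determinant lemma
    (Eq. (13)): only an r x r determinant is needed, so the cost is
    O(n r^2) instead of O(n^3)."""
    n, r = B.shape
    inner = torch.eye(r) + (B.T @ B) / delta
    return torch.logdet(inner) + n * torch.log(torch.tensor(delta))
```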
For each inference, we sample $Z$ randomly from $\mathcal{N}(M, \Sigma)$ based on $M$ and $B$ generated from the encoder functions $f_M$ and $f_B$, respectively, where $f_B$ generates the factor $B$ of the covariance. Since a gradient-based update is not possible with the direct sampling of $Z$, we use the reparametrization trick of variational autoencoders (Kingma2014; Tomczak2020):
(14) $Z = M + \sqrt{\delta}\, E_1 + B E_2$
where $E_1 \in \mathbb{R}^{n \times k}$ and $E_2 \in \mathbb{R}^{r \times k}$ are matrices of standard normal variables, which are sampled randomly at each time to simulate the sampling of $Z$ while supporting backpropagation. The detailed process of inference is described in Appendix B.
We verify that the variables sampled from Equation (14) follow the target distribution by Lemmas 3.1 and 3.2.
Lemma 3.1.
Let $z_i$ be a latent variable sampled from Equation (14) for node $i$, and $m_i$ be the $i$-th row of $M$. Then, $\mathbb{E}[z_i] = m_i$.
Proof.
See Appendix A.2. ∎
Lemma 3.2.
Assume that the size $k$ of latent variables is one. Let $z_i$ and $z_j$ be latent variables sampled from Equation (14) for nodes $i$ and $j$, respectively. Then, $\operatorname{Cov}(z_i, z_j) = \Sigma_{ij}$.
Proof.
See Appendix A.3. ∎
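The two lemmas can also be checked numerically with a quick simulation of Equation (14); the sizes below are illustrative, with $k = 1$ to match the setting of Lemma 3.2.

```python
import torch

n, r, k, delta = 6, 2, 1, 0.5           # illustrative sizes (k = 1 as in Lemma 3.2)
M, B = torch.randn(n, k), torch.randn(n, r)

num = 200_000
E1 = torch.randn(num, n, k)
E2 = torch.randn(num, r, k)
Z = M + delta ** 0.5 * E1 + B @ E2      # Eq. (14), batched over samples

print(Z.mean(dim=0).squeeze())          # ~ M (Lemma 3.1)
emp_cov = torch.cov(Z.squeeze(-1).T)    # empirical n x n covariance
print(emp_cov - (delta * torch.eye(n) + B @ B.T))  # ~ 0 (Lemma 3.2)
```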
Table 1. Feature estimation for binary features in terms of recall and nDCG at k. The three columns of each dataset correspond to k = 10, 20, and 50 (k = 3, 5, and 10 for Steam).

Metric  Model     Cora                   Citeseer               Computers              Photo                  Steam
                  @10    @20    @50      @10    @20    @50      @10    @20    @50      @10    @20    @50      @3     @5     @10
Recall  NeighAgg  .0906  .1413  .1961    .0511  .0908  .1501    .0321  .0593  .1306    .0329  .0616  .1361    .0603  .0881  .1446
        VAE       .0887  .1228  .2116    .0382  .0668  .1296    .0255  .0502  .1196    .0276  .0538  .1279    .0564  .0820  .1251
        GNN*      .1350  .1812  .2972    .0620  .1097  .2058    .0273  .0533  .1278    .0295  .0573  .1324    .2395  .3431  .4575
        GraphRNA  .1395  .2043  .3142    .0777  .1272  .2271    .0386  .0690  .1465    .0390  .0703  .1508    .2490  .3208  .4372
        ARWMF     .1291  .1813  .2960    .0552  .1015  .1952    .0280  .0544  .1289    .0294  .0568  .1327    .2104  .3201  .4512
        SAT       .1653  .2345  .3612    .0811  .1349  .2431    .0421  .0746  .1577    .0427  .0765  .1635    .2536  .3620  .4965
        SVGA      .1718  .2486  .3814    .0943  .1539  .2782    .0437  .0769  .1602    .0446  .0798  .1670    .2565  .3620  .4996
nDCG    NeighAgg  .1217  .1548  .1850    .0823  .1155  .1560    .0788  .1156  .1923    .0813  .1196  .1998    .0955  .1204  .1620
        VAE       .1224  .1452  .1924    .0601  .0839  .1251    .0632  .0970  .1721    .0675  .1031  .1830    .0902  .1133  .1437
        GNN*      .1791  .2099  .2711    .1026  .1423  .2049    .0673  .1028  .1830    .0712  .1083  .1896    .3366  .4138  .4912
        GraphRNA  .1934  .2362  .2938    .1291  .1703  .2358    .0931  .1333  .2155    .0959  .1377  .2232    .3437  .4023  .4755
        ARWMF     .1824  .2182  .2776    .0859  .1245  .1858    .0694  .1053  .1851    .0727  .1098  .1915    .3066  .3877  .4704
        SAT       .2250  .2723  .3394    .1385  .1834  .2545    .1030  .1463  .2346    .1047  .1498  .2421    .3585  .4400  .5272
        SVGA      .2381  .2894  .3601    .1579  .2076  .2892    .1068  .1509  .2397    .1084  .1549  .2472    .3567  .4391  .5299
3.3. Unified Deterministic Modeling
The reparametrization trick of variational inference requires us to sample a different $Z$ at each inference to approximate the expectation term in Equation (5). However, this sampling process makes the training unstable, considering the characteristics of our feature estimation problem, where 1) the inference is done for all nodes at once, not for each node independently, and 2) only a part of the target variables have meaningful observations. Even a small perturbation of each node can result in a catastrophic change of the prediction, since we consider the correlations between nodes in $\Sigma$.
We propose two ideas for improving the basic parameterization. First, we unify the parameter matrices $M$ and $B$ into a single matrix $H$, and generate it from a single encoder function $f$. This is based on the observation that $M$ and $B$ have similar roles of representing target nodes as low-dimensional vectors based on the graph structure. Second, we change the stochastic sampling of $Z$ from $\mathcal{N}(M, \Sigma)$ into a deterministic process that returns $M = H$ at every inference, which is the point with the largest probability in the distribution $\mathcal{N}(M, \Sigma)$. This improves the stability of inference, while still allowing us to regularize the distribution of $Z$ with the KL divergence. Figure 4 depicts the difference between the basic parameterization and the unified modeling.
This unified modeling turns the KL divergence of Equation (12) into a general regularizer function that works with the deterministic inference of node representations. First, we show in Lemma 3.3 that the first two terms of the right-hand side of Equation (12) become equivalent as we assume $M = H$ and $\Sigma = \delta I_n + H H^\top$ by the unified modeling.
Lemma 3.3.
Let $M = H$, $B = H$, and $\Sigma = \delta I_n + H H^\top$. Then, $\operatorname{tr}(\Lambda \Sigma) = \operatorname{tr}(M^\top \Lambda M) + C$, where $C$ is a constant unrelated to $H$.
Proof.
See Appendix A.4. ∎
Then, we propose our regularizer function $\mathcal{R}$ used in Algorithm 1 by rewriting the KL divergence of Equation (12) as follows:
(15) $\mathcal{R}(H) = \operatorname{tr}\left(H^\top \Lambda H\right) - \lambda \log\det\left(I_k + H^\top H\right)$
where $\lambda$ is a hyperparameter that controls the effect of the log determinant term; its setting is described in Appendix D.
The first term of $\mathcal{R}$ is called the graph Laplacian regularizer and has been widely used in graph learning (Ando06; Pang2017). Its minimization makes adjacent nodes have similar representations in $H$, and the symmetric normalization of $\Lambda$ alleviates the effect of node degrees in the regularization. The second term of $\mathcal{R}$ can be considered as measuring the amount of space occupied by $H$. In other words, its maximization makes $H$ distributed sparsely, alleviating the effect of the first term that squeezes the embeddings into a small space. The hyperparameter $\lambda$ controls the balance between the two terms having opposite goals; the second term is ignored if $\lambda = 0$, which means that the target nodes have no correlations.
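A sketch of the resulting regularizer of Equation (15), assuming PyTorch and a sparse $\Lambda$; the function and argument names are ours.

```python
import torch

def gmrf_regularizer(H, lam, lap):
    """Deterministic GMRF regularizer, a sketch of Eq. (15). lap is the
    sparse normalized Laplacian of Eq. (11), H the unified representation
    matrix, and lam the hyperparameter of the log-determinant term."""
    k = H.shape[1]
    laplacian_term = (H * torch.sparse.mm(lap, H)).sum()   # tr(H^T Lambda H)
    volume_term = torch.logdet(torch.eye(k) + H.T @ H)     # log det(I_k + H^T H)
    return laplacian_term - lam * volume_term
```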
3.4. Complexity Analysis
We analyze the time and space complexities of SVGA, assuming a graph convolutional network with two layers as the encoder function $f$. We define the space complexity as the amount of space required to store intermediate data during each inference. Let $k$, $d$, and $c$ be the size of latent variables, the number of node features, and the number of labels, respectively.
Lemma 3.4.
Given a graph $G = (\mathcal{V}, \mathcal{E})$, the time complexity of SVGA is $O(k(|\mathcal{V}|(d + c) + |\mathcal{E}|))$ for each inference.
Proof.
See Appendix A.5. ∎
Lemma 3.5.
Given a graph $G = (\mathcal{V}, \mathcal{E})$, the space complexity of SVGA is $O(|\mathcal{V}|(k + d + c) + |\mathcal{E}|)$ for each inference.
Proof.
See Appendix A.6. ∎
Lemmas 3.4 and 3.5 show that SVGA is an efficient method whose complexity is linear in both the number of nodes and the number of edges of the graph. The GMRF regularizer does not affect the inference of SVGA, because it is used only at training time. Still, the time and space complexities of the GMRF loss of Equation (15) are $O(k(k|\mathcal{V}| + |\mathcal{E}|))$ and $O(k|\mathcal{V}| + |\mathcal{E}|)$, respectively, which are also linear in both the numbers of nodes and edges.
Table 2. Summary of the datasets.

Dataset  Type  Nodes  Edges  Feat.  Classes
Cora^{1}  Binary  2,708  5,429  1,433  7 
Citeseer^{1}  Binary  3,327  4,732  3,703  6 
Photo^{2}  Binary  7,650  119,081  745  8 
Computers^{2}  Binary  13,752  245,861  767  10 
Steam^{3}  Binary  9,944  266,981  352  1 
Pubmed^{1}  Continuous  19,717  44,324  500  3 
Coauthor^{2}  Continuous  18,333  81,894  6,805  15 
Arxiv^{4}  Continuous  169,343  1,157,799  128  40 
^{1} https://github.com/kimiyoung/planetoid
^{2} https://github.com/shchur/gnn-benchmark
^{3} https://github.com/xuChenSJTU/SAT-master-online
^{4} https://ogb.stanford.edu/docs/nodeprop/
4. Experiments
We perform experiments to answer the following questions:

Feature estimation (Section 4.2). Does SVGA show higher accuracy in feature estimation than the baselines?

Node classification (Section 4.3). Are the features generated by SVGA meaningful for node classification?

Effect of observed labels (Section 4.4). Does the observation of labels help generate more accurate features?

Scalability (Section 4.5). How does the computational time of SVGA increase with the number of edges?

Ablation study (Section 4.6). How does the performance of SVGA for feature estimation change by the GMRF regularizer and the deterministic modeling of inference?
Table 3. Feature estimation for continuous features in terms of RMSE (lower is better) and CORR (higher is better). o.o.m. denotes an out-of-memory error.

Model      Pubmed            Coauthor          Arxiv
           RMSE     CORR     RMSE     CORR     RMSE     CORR
NeighAgg  0.0186  0.2133  0.0952  0.2279  0.1291  0.4943 
VAE  0.0170  0.0236  0.0863  0.0237  0.1091  0.4773 
GNN*  0.0168  0.0010  0.0850  0.0179  0.1091  0.0283 
GraphRNA  0.0172  0.0352  0.0897  0.1052  0.1131  0.0419 
ARWMF  0.0165  0.0434  0.0827  0.0710  o.o.m.  o.o.m. 
SAT  0.0165  0.0378  0.0820  0.0958  0.1055  0.0868 
SVGA  0.0158  0.1169  0.0798  0.1488  0.1005  0.1666 
Table 4. Accuracy of node classification with MLP and GCN classifiers trained on the estimated features.

Model            Cora          Citeseer      Computers     Photo         Pubmed        Coauthor      Arxiv
                 MLP    GCN    MLP    GCN    MLP    GCN    MLP    GCN    MLP    GCN    MLP    GCN    MLP    GCN
NeighAgg  .6248  .8365  .5150  .6494  .8715  .6564  .5549  .8846  .7562  .5413  .9010  .8031  .3979  .6493 
VAE  .2826  .3747  .4008  .3011  .4023  .4007  .2551  .2598  .2317  .2663  .3781  .2335  .1633  .1965 
GNN*  .4852  .3747  .4013  .5779  .4034  .4203  .3933  .2598  .2317  .4278  .3789  .2335  .2607  .4721 
GraphRNA  .7581  .6968  .6035  .8198  .8650  .8172  .6320  .8407  .7710  .6394  .9207  .8851  .1609  .1859 
ARWMF  .7769  .5608  .6180  .8205  .7400  .8089  .2267  .4675  .2320  .2764  .6146  .8347  o.o.m.  o.o.m. 
SAT  .7937  .8201  .4618  .8579  .8766  .7439  .6475  .8976  .7672  .6767  .9260  .8402  .3144  .5677 
SVGA (proposed)  .8493  .8806  .6227  .8533  .8854  .8808  .6757  .9209  .8293  .6879  .9264  .9037  .4394  .6644 
4.1. Experimental Setup
We introduce our experimental setup including datasets, baseline methods, evaluation metrics, and training processes.
Datasets. We use the graph datasets summarized in Table 2, which were used in previous works (McAuley2015; Yang2016; Shchur2018; Chen2020). Node features in Cora, Citeseer, Photo, Computers, and Steam are zero-one binary vectors, while those in Pubmed, Coauthor, and Arxiv are continuous. Each node has a single discrete label. All nodes in Steam have the same class, and thus the dataset is not used for node classification.
Baselines. We compare SVGA with existing models for feature estimation. NeighAgg (Simsek2008) is a simple approach that aggregates the features of neighboring nodes through mean pooling. VAE (Kingma2014) is a generative model that learns latent representations of examples. GCN (Kipf2017), GraphSAGE (Hamilton2017), and GAT (Velickovic2018) are popular graph neural networks that have been used in various domains. We report the best performance among the three models as GNN* for brevity.
GraphRNA (Huang2019) and ARWMF (Chen2019) are recent methods for representation learning, which can be applied to generating features. SAT (Chen2020) is the state-of-the-art model for missing feature estimation, which trains separate autoencoders with a shared latent space for the features and the graph structure, respectively. We use GAT and GCN as the backbone networks of SAT in datasets with discrete and continuous features, respectively, which are the settings that show the best performance in the original paper (Chen2020).
Figure 5. The accuracy of SVGA for feature estimation with additional observations of labels. We show the average and standard deviation of ten runs. SVGA effectively utilizes the given labels, making more accurate predictions.
Evaluation metrics. We evaluate the performance of feature estimation with four evaluation metrics. For binary features, we treat each nonzero entry as a target item, considering the task as a ranking problem to find all nonzero entries. Recall at $k$ measures the ratio of true entries contained in the top-$k$ predictions for each node, while nDCG at $k$ measures the overall quality of predicted scores in terms of information retrieval. We vary $k$ over $\{3, 5, 10\}$ in the Steam dataset and $\{10, 20, 50\}$ in the other datasets, because Steam has fewer features and thus the prediction is generally easier. For continuous features, we compare the predictions and the true features in an element-wise way with the root mean squared error (RMSE) and the square of the correlation coefficient (CORR). The definitions of the evaluation metrics are in Appendix C.
Experimental process. We take different experimental processes for feature estimation and node classification. For feature estimation, we split all nodes in each dataset into training, validation, and test sets with a 4:1:5 ratio as in previous work (Chen2020). We train each model based on the observed features of the training nodes and find the parameters that maximize the validation performance. We run each experiment ten times and report the average.
For node classification, we take only the test nodes of feature estimation, whose features are generated by our SVGA or the baseline models. Then, we perform 5-fold cross-validation on the target nodes, evaluating the quality of generated features with respect to the accuracy of node classification. We use a multilayer perceptron (MLP) and GCN as classifiers. For the training and evaluation of GCN, we use the induced subgraph of the target nodes.
Even though our SVGA can utilize observed labels as additional evidence, we do not assume the observation of labels unless otherwise noted. This is to make a fair comparison between SVGA and baseline models that assume only the observation of features. We perform experiments in Section 4.4 with observed labels.
Hyperparameters. The hyperparameter setting of our SVGA is described in Appendix D. For baselines, we take the experimental results from a previous work (Chen2020) that optimized the hyperparameters for the feature estimation problem on our datasets.
4.2. Performance on Feature Estimation (Q1)
Table 1 compares SVGA and the baseline models for feature estimation. SVGA outperforms all baselines by a significant margin in most cases, showing up to 16.3% higher recall and 14.0% higher nDCG compared with the best competitors. The improvement over the baselines is the largest in Cora and Citeseer, which are similar citation graphs, and the smallest in Steam. This is because the citation graphs have high-dimensional features with sparse graph structures, increasing the difficulty of estimation for the baseline methods. On the other hand, Steam has the smallest number of features, while having the densest structure.
Table 3 presents the results of feature estimation for continuous features. SVGA still outperforms all baselines, and the amount of improvement is similar in all three datasets. The combination of Tables 1 and 3 shows that SVGA works well with various types of node features, providing stable performance. ARWMF causes an out-of-memory error even with 256GB memory, due to its computation on the adjacency matrix when the number of nodes is large.
4.3. Performance on Node Classification (Q2)
Table 4 shows the accuracy of node classification with two types of classifiers: MLP and GCN. MLP relies only on the generated features for prediction, while GCN also utilizes the graph structure. SVGA outperforms all baselines in most cases, consistent with the results of feature estimation in Tables 1 and 3; SVGA achieves up to 10.4% and 7.8% higher accuracy with MLP and GCN, respectively, compared to the best competitors. The Steam dataset is excluded from Table 4, since it has the same label for all nodes.
4.4. Effect of Observed Labels (Q3)
Figure 5 shows the performance of SVGA for feature estimation with different ratios of observed labels. For instance, if the ratio is 0.5, half of all nodes have observed labels: $|\mathcal{V}_y| = 0.5\,|\mathcal{V}|$. Note that the experiments for Tables 1, 3, and 4 are done with no labels for a fair comparison with the baseline models; the results of these experiments correspond to the leftmost points in Figure 5. We also report the performance of SAT for comparison.
SVGA shows higher accuracy with more observations of labels in both datasets, demonstrating its ability to use labels to improve the performance of feature estimation. Since the parameters need to be optimized to predict both features and labels accurately, the observed labels work as an additional regularizer that guides the training of latent variables to avoid overfitting.
4.5. Scalability (Q4)
Figure 6 shows the scalability of SVGA with respect to the number of edges on the five largest datasets in Table 2. For each dataset, we sample nine random subgraphs of different sizes, from $0.1\,|\mathcal{E}|$ to $0.9\,|\mathcal{E}|$, where $|\mathcal{E}|$ denotes the number of original edges. We measure the inference time of SVGA on each graph ten times and report the average. The figure shows the linear scalability of SVGA with the number of edges in all datasets, supporting Lemma 3.4. Arxiv and Coauthor take the longest inference times, as they have the largest numbers of edges and features, respectively.
4.6. Ablation Study (Q5)
Figure 7 shows an ablation study that compares SVGA with its two variants, SVGA-U and SVGA-R. SVGA-U runs the stochastic inference described in Section 3.2, without our idea of unified deterministic modeling. The detailed process of stochastic inference is also described in Algorithm 2 of Appendix B. SVGA-R runs the deterministic inference but removes the regularizer term of Equation (15); it follows Algorithm 1 as in SVGA except for lines 4 and 5.
SVGA shows the best test accuracy during training with a stable curve. The training accuracy is the highest with SVGA-R, since it overfits to the training nodes without the regularizer term. On the other hand, the training accuracy of SVGA-U is the lowest among the three methods, while its test accuracy becomes similar to that of SVGA-R in the later epochs. This is because SVGA-U fails even at maximizing the training accuracy due to its unstable training. The standard deviation of the training accuracy is very small for all three methods, despite their different modelings.
5. Related Works
Graph neural networks. Graph neural networks (GNNs) refer to deep neural networks designed for graph-structured data (Hamilton2017; Velickovic2018; Velickovic2019; Kipf2017; You2020; You2021). Since GNNs require the features of all nodes, one needs to generate artificial features to apply a GNN to a graph with missing features. Derr2018 and Cui2021 generate features from the graph structure. Kipf2017 model the missing features as one-hot vectors, while Zhao2020 leave them as zero vectors and propose a new regularizer function.
Our SVGA enables a GNN to be applied to graphs with partial observations by estimating missing features. The main advantage of feature estimation is that the modification of a GNN classifier is not required, regardless of the number of observations given in the original graph. Previous works that directly deal with partially observed graphs require finding new hyperparameters (Zhao2020) or even making a new weight matrix (Kipf2017) when the number of observations changes, making it difficult to reuse a trained model.
Missing feature estimation. There are recent works that can be used directly for our feature estimation problem (Huang2019; Chen2019; Chen2020). Such methods are adopted as the main competitors in our experiments. The main advantage of SVGA over the previous approaches is the strong regularizer that allows us to effectively propagate the partial observations to the entire graph, avoiding the overfitting problem even with large representation power for feature estimation.
GRAPE (You2020) estimates missing features in tabular data by learning a graph between examples and features. The main difference from our work is that GRAPE assumes partial observations of feature elements, not feature vectors. In other words, GRAPE cannot be used to estimate the features of nodes that have no partial observations, which is the scenario assumed in our experiments.
Node representation learning. Unsupervised node representation learning (Hamilton2017b) aims to represent each node as a low-dimensional vector that summarizes its properties embedded in the structure and node features (Perozzi2014; Wang2016; Grover2016; Hamilton2017b; Velickovic2019). Such methods make embeddings in a latent space, while we aim to learn the representations of nodes in a high-dimensional feature space; the node features generated from our SVGA are interpretable in the feature domain.
Probabilistic modeling of graphs. Previous works model real-world graphs as pairwise Markov random fields with discrete variables and run graphical inference for node classification (Yoo2017; conf/wsdm/JoYK18; Yoo2019b; Yoo2020; Yoo2021). Our work can be considered as a generalization of such works to the challenging task of missing feature estimation, which requires us to estimate high-dimensional continuous variables.
6. Conclusion
We propose SVGA (Structured Variational Graph Autoencoder), an accurate method for missing feature estimation. SVGA estimates the high-dimensional features of nodes from a graph with partial observations, and its framework is carefully designed to model the target distributions of structured variational inference. The main idea of SVGA is the structural regularizer that models the prior of latent variables as a Gaussian Markov random field, which takes the graph structure as the main evidence for modeling the correlations between nodes in variational inference. SVGA outperforms previous methods for feature estimation and node classification, achieving state-of-the-art accuracy on benchmark datasets. Future works include extending the domain of SVGA to directed or heterogeneous graphs, which are common in real-world datasets.
Acknowledgements.
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [No. 2020-0-00894, Flexible and Efficient Model Compression Method for Various Applications and Environments], [No. 2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)], and [No. 2021-0-0268, Artificial Intelligence Innovation Hub (Artificial Intelligence Institute, Seoul National University)]. The Institute of Engineering Research and the ICT at Seoul National University provided research facilities for this work. This work was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1C1C1008526). U Kang is the corresponding author.
References
Appendix A Proofs of Lemmas
A.1. Proof of Lemma 2.1
Proof.
The probability density function of $\mathcal{N}(\mu, \Sigma)$ is
$p(\mathbf{z}) = C_1 \exp\left(-\tfrac{1}{2}(\mathbf{z} - \mu)^\top \Sigma^{-1} (\mathbf{z} - \mu)\right)$
where $C_1$ is a constant.
We rewrite $p(\mathbf{z})$ as follows with $\Lambda = \Sigma^{-1}$ and $\eta = \Lambda\mu$:
$p(\mathbf{z}) = C_2 \exp\left(-\tfrac{1}{2}\mathbf{z}^\top \Lambda \mathbf{z} + \eta^\top \mathbf{z}\right)$
where $C_2$ is also a constant.
By the definition of GMRF, $\Lambda_{ij} \neq 0$ only if edge $(i, j)$ exists in the given graph $G$. Then, we rewrite $p(\mathbf{z})$ as
$p(\mathbf{z}) = C_2 \exp\left(\sum_{i \in \mathcal{V}}\left(-\tfrac{1}{2}\Lambda_{ii} z_i^2 + \eta_i z_i\right)\right) \exp\left(-\sum_{(i,j) \in \mathcal{E}} \Lambda_{ij} z_i z_j\right)$
where $C_2$ absorbs the normalization. We prove the lemma by substituting $\prod_{i \in \mathcal{V}} \phi_i(z_i)$ and $\prod_{(i,j) \in \mathcal{E}} \phi_{ij}(z_i, z_j)$ for the first and second exponential terms, respectively. ∎
A.2. Proof of Lemma 3.1
Proof.
The random variables included in Equation (14) are $E_1$ and $E_2$. Since $E_1$ and $E_2$ are filled with standard normal values, it is satisfied that $\mathbb{E}[E_1] = 0$ and $\mathbb{E}[E_2] = 0$, regardless of the actual values of $M$ and $B$. Thus, $\mathbb{E}[z_i] = m_i$. ∎
A.3. Proof of Lemma 3.2
Proof.
By Equation (14) with $k = 1$, we have $z_i = m_i + \sqrt{\delta}\, e_{1i} + b_i^\top E_2$, where $e_{1i}$ is the $i$-th element of $E_1 \in \mathbb{R}^n$ and $b_i$ is the $i$-th row of $B$. Then, the covariance between $z_i$ and $z_j$ is given as
$\operatorname{Cov}(z_i, z_j) = \delta\,\mathbb{E}[e_{1i} e_{1j}] + \sqrt{\delta}\,\mathbb{E}[e_{1i}\, b_j^\top E_2] + \sqrt{\delta}\,\mathbb{E}[e_{1j}\, b_i^\top E_2] + \mathbb{E}[b_i^\top E_2 E_2^\top b_j]$
Since every element of $E_1$ and $E_2$ follows the standard normal distribution, the following are satisfied. First, $\mathbb{E}[e_{1i} e_{1j}] = 1$ if $i = j$ and zero otherwise. Second, the second and third terms of the right-hand side are zero. Third, $\mathbb{E}[E_2 E_2^\top] = I_r$. As a result, we have the following equality:
$\operatorname{Cov}(z_i, z_j) = \delta\,\mathbb{1}[i = j] + b_i^\top b_j$
where $\mathbb{1}[\cdot]$ is an indicator function that returns one if the condition holds, and zero otherwise. This equation is the same as the definition of $\Sigma = \delta I_n + B B^\top$ in matrix form. ∎
A.4. Proof of Lemma 3.3
Proof.
$\operatorname{tr}(\Lambda \Sigma) = \delta \operatorname{tr}(\Lambda) + \operatorname{tr}(\Lambda H H^\top)$ due to the definition of $\Sigma$. The cyclic property of the trace makes $\operatorname{tr}(\Lambda H H^\top) = \operatorname{tr}(H^\top \Lambda H)$. Thus, $\operatorname{tr}(\Lambda \Sigma) = \operatorname{tr}(M^\top \Lambda M) + \delta \operatorname{tr}(\Lambda)$, and $\delta \operatorname{tr}(\Lambda)$ is a constant. ∎
A.5. Proof of Lemma 3.4
Proof.
SVGA consists of an encoder $f$ and two decoders $g_x$ and $g_y$. The complexity of $f$ is $O(k(|\mathcal{V}| + |\mathcal{E}|))$ assuming the identity feature matrix, since the first layer reduces to a lookup of node embeddings and each layer propagates the representations along the edges. The complexities of $g_x$ and $g_y$ are $O(kd|\mathcal{V}|)$ and $O(kc|\mathcal{V}|)$, respectively. ∎
A.6. Proof of Lemma 3.5
Proof.
SVGA consists of an encoder $f$ and two decoders $g_x$ and $g_y$. The complexity of $f$ is $O(k|\mathcal{V}| + |\mathcal{E}|)$ assuming the identity feature matrix, since it stores $k$-dimensional representations for all nodes along with the sparse adjacency matrix. The complexities of $g_x$ and $g_y$ are $O(d|\mathcal{V}|)$ and $O(c|\mathcal{V}|)$, respectively. ∎
Appendix B Details of Stochastic Inference
We present a detailed algorithm of structured variational inference in Algorithm 2, which performs the stochastic sampling described in Section 3.2. It uses two encoder functions for generating the embedding matrices $M$ and $B$, respectively, in line 1. Then, it samples $Z$ from the Gaussian distribution with the reparametrization trick in lines 2 to 4. The prediction is done as in the deterministic inference, but the regularizer term works differently in line 8, taking $M$ and $B$ as its inputs. The parameters of all four networks are updated.
Appendix C Evaluation Metrics
We use four metrics for the evaluation of estimated features: recall at $k$ and nDCG at $k$ for binary features, and RMSE and CORR for continuous features. Categorical features are not included in our datasets in Table 2, but classification accuracy can be used as done for evaluating labels. The symbols used in this section are defined as follows: $n$ is the number of nodes, $d$ is the number of features, $x_i$ is the true feature vector of node $i$, $\hat{x}_i$ is the prediction for $x_i$, $x_{ij}$ is the $j$-th element of $x_i$, and $\mathbb{1}[\cdot]$ is a binary function that returns one if the given condition holds and zero otherwise.
Evaluation of binary features. We consider feature estimation for binary features as a ranking problem, which is to find the nonzero elements in the feature vector of each node $i$. Let $r_{ij}$ be the index having the $j$-th largest score in $\hat{x}_i$. Then, we use the top-$k$ predictions with the largest scores, i.e., $\{r_{i1}, r_{i2}, \ldots, r_{ik}\}$, at each node $i$, where $k$ is chosen as described in Section 4.1. This is done also in previous work for binary feature estimation (Chen2020) to focus more on the predictive performance of the top predictions.
Recall at $k$ measures the ratio of true entries contained in the top-$k$ predictions for each node:
(16) $\mathrm{Recall}@k = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{n_i}\sum_{j=1}^{k}\mathbb{1}\left[x_{i, r_{ij}} = 1\right]$
where $n_i$ is the number of nonzero entries in $x_i$. For instance, in the illustrative case of $x_i = (1, 0, 1, 1, 0)$ with the top 3 predicted indices $r_{i1} = 1$, $r_{i2} = 2$, and $r_{i3} = 3$, recall at 3 is $2/3$, since two of the $n_i = 3$ nonzero entries of $x_i$ are included in the top 3 predictions with the largest scores.
nDCG at $k$ also measures the quality of the ranking order of the top-$k$ predictions with respect to information retrieval. nDCG at $k$ is computed by normalizing DCG at $k$, which is defined as
(17) $\mathrm{DCG}@k = \sum_{j=1}^{k}\frac{\mathbb{1}\left[x_{i, r_{ij}} = 1\right]}{\log_2 (j + 1)}$
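For reference, a sketch of recall@$k$ and nDCG@$k$ as defined in Equations (16) and (17); function and variable names are illustrative.

```python
import numpy as np

def recall_ndcg_at_k(x_true, x_score, k):
    """Recall@k and nDCG@k of Eqs. (16) and (17), a sketch for binary
    feature matrices. x_true is {0,1}-valued, x_score holds predicted
    scores; both are (num_nodes, num_features) arrays."""
    top = np.argsort(-x_score, axis=1)[:, :k]          # indices r_{i1..ik}
    hits = np.take_along_axis(x_true, top, axis=1)     # 1 where a true entry is hit
    n_i = x_true.sum(axis=1).clip(min=1)               # nonzero entries per node
    recall = (hits.sum(axis=1) / n_i).mean()

    discounts = 1.0 / np.log2(np.arange(2, k + 2))     # 1/log2(j+1), j = 1..k
    dcg = (hits * discounts).sum(axis=1)
    ideal_hits = np.minimum(n_i, k)                    # best possible hit pattern
    idcg = np.array([discounts[: int(m)].sum() for m in ideal_hits])
    ndcg = (dcg / np.maximum(idcg, 1e-12)).mean()
    return recall, ndcg
```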
Evaluation of continuous features. We evaluate the predictions for continuous features with the simple metrics of RMSE and CORR. RMSE measures the error between the predictions and the true features:
(18) $\mathrm{RMSE} = \sqrt{\frac{1}{nd}\sum_{i=1}^{n}\sum_{j=1}^{d}\left(x_{ij} - \hat{x}_{ij}\right)^2}$
CORR measures how much the predictions and the true features are correlated, and is defined as follows:
(19) $\mathrm{CORR} = \frac{1}{d}\sum_{j=1}^{d}\left(\frac{\sum_{i}(x_{ij} - \bar{x}_j)(\hat{x}_{ij} - \bar{\hat{x}}_j)}{\sqrt{\sum_{i}(x_{ij} - \bar{x}_j)^2}\sqrt{\sum_{i}(\hat{x}_{ij} - \bar{\hat{x}}_j)^2}}\right)^2$
where $\bar{x}_j$ is the mean of the $j$-th feature over all nodes, and $\bar{\hat{x}}_j$ is defined analogously for the predictions. CORR is higher the better, while RMSE is lower the better.
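Similarly, a sketch of RMSE and CORR from Equations (18) and (19); names are illustrative.

```python
import numpy as np

def rmse_corr(x_true, x_pred):
    """RMSE and CORR of Eqs. (18) and (19), a sketch for continuous
    features; both inputs are (num_nodes, num_features) arrays."""
    rmse = np.sqrt(np.mean((x_true - x_pred) ** 2))
    xc = x_true - x_true.mean(axis=0)                    # center each feature
    pc = x_pred - x_pred.mean(axis=0)
    num = (xc * pc).sum(axis=0)
    den = np.sqrt((xc ** 2).sum(axis=0)) * np.sqrt((pc ** 2).sum(axis=0))
    corr = np.mean((num / np.maximum(den, 1e-12)) ** 2)  # squared per-feature r
    return rmse, corr
```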
Appendix D Hyperparameter Settings
We search the hyperparameters of our SVGA over predefined grids: the size of latent variables, the dropout probability, the regularization parameters $\beta$ and $\lambda$, and whether to apply the unit normalization of latent variables. We use the Adam (Kingma2015) optimizer, with a separate learning rate for Steam and another for all other datasets. Early stopping is used based on the validation performance, and all of our experiments were done on a workstation with an RTX 2080 using PyTorch. More detailed information can be found in our official implementation (https://github.com/snudatalab/SVGA).