1 Introduction
In recent years, representation learning on graphs has attracted great research interest because of its potential in numerous machine learning tasks, such as semi-supervised learning
kipf2016semi and zero/few-shot learning fu2015transductive ; garcia2017few , as well as numerous real-world applications, such as recommendation systems ying2018graph and action recognition li2019actional . A standard supervised-learning paradigm on graphs is to predict class information based on observed graph structures and node attributes. For example, kipf2016semi ; kipf2016variational proposed graph convolutional neural networks to learn graph embeddings based on both graph structures and node attributes. In practice, either the graph structures or the node attributes could be inaccurate, incomplete or missing, which would significantly deteriorate performance. To obtain a more complete description of graph structures,
you2018graphrnn ; bojchevski2018netgan consider graph structure generation by using random walks and Recurrent Neural Networks (RNNs).
In this paper, we focus on generating node attributes for missing data. Specifically, given a fixed graph structure, only some nodes have attributes, and the attributes of the remaining nodes are completely missing and inaccessible. We aim to generate attributes for those nodes that have no observed attributes. Compared to the completion problem kalofolias2014matrix , where all nodes may have partially observed attributes, in our task the attributes of some nodes are completely unknown; we thus call it node attribute generation. This task is related to numerous real-world challenges. For example, in citation networks, raw attributes and detailed descriptions of papers may be missing due to copyright protection. In item co-occurrence graphs, descriptive tags for items may be missing because of expensive tagging labour. The task of node attribute generation not only retrieves unknown information, but also benefits many supervised-learning tasks, such as profiling and node classification. The generated node attributes can also serve as additional inputs to improve subsequent tasks.
Recently, a series of deep generative methods have been proposed to generate real-world data, such as variational autoencoders (VAE) kingma2013auto and generative adversarial networks (GAN) goodfellow2014generative . However, neither VAE nor GAN is able to generate node attributes that are associated with complex and irregular graphs. The adversarial learning idea from GAN has been adopted by many other generative approaches. For example, UNIT liu2017unsupervised
performs unsupervised image-to-image translation based on the shared-latent-space assumption, in which all images come from the same latent space. However, simply extending this method to the graph domain is infeasible because: (1) the graph structures and node attributes come from two heterogeneous spaces; and (2) the node attributes can be either real-valued or categorical, which makes adversarial learning in the data space infeasible.
To this end, we propose a latent-space adversarial learning based method to generate node attributes, called the node attribute neural generator (NANG). NANG can generate both real-valued and categorical node attributes. Specifically, we assume that the attribute and structure information come from the same latent factor, which can be translated to both the attribute modality and the structure modality; we call this the shared latent factor assumption. We then use this latent factor as a bridge to convert information from one modality to the other, and further generate node attributes based on graph structures. The main contributions of our work can be summarized as follows:

Under the shared latent factor assumption, we propose the novel node attribute neural generator (NANG), which is compatible with both real-valued and categorical attributes;

We introduce practical measures to evaluate the quality of generated node attributes;

Empirical results on four real-world datasets show that our method can handle node classification and profiling tasks for nodes whose attributes are completely inaccessible.
2 Related Work
2.1 Deep Generative Models
Deep generative models have attracted a lot of attention recently. One of the earliest deep generative models is the wake-sleep algorithm hinton1995wake , which trains the wake phase (the generative model) and the sleep phase (the inference model) separately. VAE kingma2013auto and its variants Zheng2018DegenerationIV ; Zheng2019UnderstandingVI ; zhao2017infovae learn the generative model and the inference model jointly by maximizing the variational lower bound with the reparameterization trick.
In recent years, GAN goodfellow2014generative , as another family of deep generative models hu2017unifying , has emerged as a hot research topic. It contains a generator and a discriminator, where the discriminator tries to distinguish real samples from fake samples and the generator tries to fool the discriminator. Several works goodfellow2014generative ; nowozin2016f ; chen2016infogan have pointed out that the adversarial loss in GAN actually minimizes a lower bound on the Jensen-Shannon divergence (JSD) between the data distribution and the generator distribution. The original GAN risks vanishing gradients and mode collapse. To handle this, many works have improved the objective function, such as f-GAN nowozin2016f , LSGAN mao2017least and WGAN arjovsky2017wasserstein . The idea of adversarial learning from GAN is not limited to the generation setting and can also be applied to many other problems, such as domain adaptation ganin2016domain and domain translation isola2017image ; liu2017unsupervised .
2.2 Deep Representation Learning on Graphs
With the great success of deep neural networks in modeling speech signals, images and text, many researchers have started to apply deep neural networks to the graph domain. For example, DeepWalk perozzi2014deepwalk and Node2Vec grover2016node2vec learn node embeddings via random walks and the skip-gram model mikolov2013efficient . Graph convolutional networks (GCNs) were proposed in defferrard2016convolutional ; kipf2016semi and have been successfully applied to semi-supervised node classification. A hierarchical and differentiable pooling method was proposed in ying2018hierarchical to learn latent representations of graphs. GraphSAGE hamilton2017inductive introduces neighbour sampling and different aggregation schemes to enable inductive graph convolution on large graphs.
In many real-world applications, graph structures may be missing, incomplete or inaccessible. To handle this issue, graph structure generation has been studied recently. A junction tree variational autoencoder was proposed in jin2018junction to generate molecular graphs. MolGAN de2018molgan is an implicit, likelihood-free generative model for small molecular graphs. GraphRNN you2018graphrnn and NetGAN bojchevski2018netgan generate realistic graphs by combining random walks with RNNs. Despite the great potential in many applications, few works consider generating the node attributes associated with graphs.
3 Node Attribute Neural Generator (NANG)
3.1 Problem Statement
Let $\mathcal{G}$ be a graph with node set $\mathcal{V}$, let $\mathbf{A}$ be the graph adjacency matrix and $\mathbf{X}$ be the node attribute matrix. Note that the elements of $\mathbf{X}$ can be either categorical or real-valued. Let $\mathcal{V}_o$ be the set of nodes that have observed attributes; the corresponding node attribute matrix is $\mathbf{X}_o$. Let $\mathcal{V}_u$ be the set of nodes whose attributes are unobserved; the corresponding node attribute matrix $\mathbf{X}_u$ is unknown. We have $\mathcal{V}_o \cup \mathcal{V}_u = \mathcal{V}$ and $\mathcal{V}_o \cap \mathcal{V}_u = \emptyset$. Our task is to generate the unobserved node attributes $\mathbf{X}_u$ based on the observed node attributes $\mathbf{X}_o$ and the graph adjacency matrix $\mathbf{A}$.
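For concreteness, the setup can be sketched as follows; the sizes, the random graph and the observed/unobserved split below are our own toy choices, not the paper's data:

```python
import numpy as np

# Toy illustration of the problem setup (names are ours, not from the paper):
# A is the adjacency matrix, X the attribute matrix; the attributes of the
# nodes in `unobserved` are hidden from the model and must be generated.
rng = np.random.default_rng(0)
n_nodes, n_attrs = 6, 4

A = (rng.random((n_nodes, n_nodes)) < 0.4).astype(int)
A = np.triu(A, 1); A = A + A.T                            # symmetric, no self-loops
X = (rng.random((n_nodes, n_attrs)) < 0.5).astype(float)  # multi-hot attributes

observed = np.array([0, 1, 2, 3])    # V_o: nodes with observed attributes
unobserved = np.array([4, 5])        # V_u: attributes completely unknown

X_o = X[observed]                    # visible to the model
# X[unobserved] is the generation target; at training time the model
# only sees A and X_o.
assert set(observed) | set(unobserved) == set(range(n_nodes))
assert set(observed) & set(unobserved) == set()
```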
3.2 Model Formulation
Graph-structured data include two aspects: node attributes and graph structures. We assume that they share the same latent factor $\mathbf{Z}$, which can be translated to either the node attributes or the graph structures. Based on this, the marginal distributions of attributes and structures satisfy:
$$p(\mathbf{X}) = \int p(\mathbf{X}\mid\mathbf{Z})\,p(\mathbf{Z})\,\mathrm{d}\mathbf{Z}, \qquad p(\mathbf{A}) = \int p(\mathbf{A}\mid\mathbf{Z})\,p(\mathbf{Z})\,\mathrm{d}\mathbf{Z}, \quad (1)$$
where the conditional distributions $p(\mathbf{X}\mid\mathbf{Z})$ and $p(\mathbf{A}\mid\mathbf{Z})$ are the probabilistic decoders that decode $\mathbf{Z}$ to $\mathbf{X}$ and $\mathbf{A}$, respectively, and $p(\mathbf{Z})$ is the prior distribution of the latent factor $\mathbf{Z}$. To better exploit the prior to guide the learning process, we learn an aggregated distribution $q(\mathbf{Z})$ for the posteriors based on the observed data. Minimizing the distance between $q(\mathbf{Z})$ and $p(\mathbf{Z})$ encourages the shared latent factor to match the whole distribution of $p(\mathbf{Z})$.
Therefore, our goal is to learn the aggregated distribution $q(\mathbf{Z})$ and the probabilistic decoders $p(\mathbf{X}\mid\mathbf{Z})$ and $p(\mathbf{A}\mid\mathbf{Z})$ based on the partially observed node attributes $\mathbf{X}_o$ and the graph structure $\mathbf{A}$. In the attribute generation stage, we use the structure information to generate the missing attributes. More specifically, we infer the attribute information for the attribute-missing nodes by encoding their structure information into the latent factor $\mathbf{Z}$; this factor is then fed into the probabilistic decoder to predict their unobserved attributes. NANG achieves this goal via distribution matching makhzani2015adversarial ; zhao2017infovae . In the following parts, we give the details of distribution matching, followed by the objective function and the implementation.
3.2.1 Inference via Distribution Matching
NANG learns $q(\mathbf{Z})$ via distribution matching. Specifically, two parameterized encoders, $E_X$ and $E_A$, are used to encode the attribute information and the structure information, respectively. We then introduce adversarial learning to encourage distribution matching between the encoded posteriors and the prior $p(\mathbf{Z})$. In this way, the latent variables $\mathbf{Z}_X$ and $\mathbf{Z}_A$ are encouraged to come from the same aggregated posterior distribution makhzani2015adversarial ; makhzani2018implicit . Besides, the distance between $q(\mathbf{Z})$ and $p(\mathbf{Z})$ is also minimized, encouraging the shared latent factor to match the whole distribution of $p(\mathbf{Z})$. Let $\mathbf{Z}_A$ be the latent codes encoded from the structure information of all nodes, and let $\mathbf{A}_o$ represent the structure information of the attribute-observed nodes. Let $\mathbf{Z}_X^o$ and $\mathbf{Z}_A^o$ denote the latent codes encoded from the attributes and the structures of the attribute-observed nodes, respectively. To keep consistency with our shared latent factor assumption, the decoders $D_X$ and $D_A$ need to reconstruct $\mathbf{Z}_X^o$ (resp. $\mathbf{Z}_A$) to $\hat{\mathbf{X}}_o$ (resp. $\hat{\mathbf{A}}$), and translate $\mathbf{Z}_A^o$ (resp. $\mathbf{Z}_X^o$) to $\hat{\mathbf{X}}_o$ (resp. $\hat{\mathbf{A}}$). The architecture is shown in Figure 1. The objective function is formulated as:
$$\min_{\theta}\max_{D}\;\mathcal{L} = \ell\big(\mathbf{X}_o, D_X(\mathbf{Z}_X^o)\big) + \ell\big(\mathbf{A}, D_A(\mathbf{Z}_A)\big) + \lambda\Big[\ell\big(\mathbf{X}_o, D_X(\mathbf{Z}_A^o)\big) + \ell\big(\mathbf{A}, D_A(\mathbf{Z}_X^o)\big)\Big] + \mathcal{L}_{adv}(\mathbf{Z}_X^o) + \mathcal{L}_{adv}(\mathbf{Z}_A), \quad (2)$$
where $\theta$ denotes the network parameters, including the encoders (generators) $E_X$ and $E_A$ and the decoders $D_X$ and $D_A$, and $D$ is the shared discriminator for adversarial learning over $\mathbf{Z}_X$ and $\mathbf{Z}_A$. $\mathbf{Z}_p$ indicates true samples drawn from the prior distribution $p(\mathbf{Z})$ for adversarial learning (we apply a standard Gaussian prior here).
$\lambda$ is a hyperparameter that emphasizes the second two terms. The first two terms represent the self-reconstruction stream, i.e., information from attribute to attribute and from structure to structure. The second two terms represent the cross-reconstruction stream, i.e., information from structure to attribute and from attribute to structure. The last two terms are the adversarial loss defined in Eq. 3, which naturally regularizes the objective function:
$$\mathcal{L}_{adv}(\mathbf{Z}) = \mathbb{E}_{\mathbf{Z}_p\sim p(\mathbf{Z})}\big[\log D(\mathbf{Z}_p)\big] + \mathbb{E}_{\mathbf{Z}\sim q(\mathbf{Z})}\big[\log\big(1 - D(\mathbf{Z})\big)\big]. \quad (3)$$
For the objective function in Eq. 2, the self-reconstruction stream and the cross-reconstruction stream, together with our adversarial learning, form a coupling mechanism that lets the two modalities supplement each other and restores the original non-independence of node attributes and structures. In this way, the aggregated distribution $q(\mathbf{Z})$ and the probabilistic decoders $p(\mathbf{X}\mid\mathbf{Z})$ and $p(\mathbf{A}\mid\mathbf{Z})$ can be well captured, and we can generate the unobserved node attributes through the cross-reconstruction stream.
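As a minimal sketch of how the self-reconstruction, cross-reconstruction and adversarial terms combine, the following toy example uses linear maps in place of the MLP/GCN encoders and decoders and a fixed toy discriminator; all names, shapes and the unit weight for λ are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sketch of the NANG objective: linear maps stand in for the encoders
# E_X, E_A and decoders D_X, D_A.
n, d_attr, d_lat = 5, 8, 3
X_o = rng.random((n, d_attr))              # observed attributes
A = rng.random((n, n)); A = (A + A.T) / 2  # toy structure input

E_X = rng.normal(size=(d_attr, d_lat))     # attribute encoder
E_A = rng.normal(size=(n, d_lat))          # structure encoder (rows of A as input)
D_X = rng.normal(size=(d_lat, d_attr))     # attribute decoder
D_A = rng.normal(size=(d_lat, n))          # structure decoder

Z_x = X_o @ E_X                            # latents from attributes
Z_a = A @ E_A                              # latents from structure

mse = lambda a, b: float(np.mean((a - b) ** 2))
lam = 1.0                                  # lambda from Eq. 2 (our toy value)

self_rec = mse(X_o, Z_x @ D_X) + mse(A, Z_a @ D_A)   # self-reconstruction stream
cross_rec = mse(X_o, Z_a @ D_X) + mse(A, Z_x @ D_A)  # cross-reconstruction stream

# Adversarial term: a toy discriminator score; in NANG this is a shared
# two-layer MLP trained to tell Z_x / Z_a apart from Gaussian prior samples.
disc = lambda z: 1.0 / (1.0 + np.exp(-z.sum(axis=1)))
z_prior = rng.standard_normal((n, d_lat))
adv = -np.mean(np.log(disc(z_prior)) + np.log(1 - disc(Z_x)) + np.log(1 - disc(Z_a)))

loss = self_rec + lam * cross_rec + adv
```

In the actual model the reconstruction loss would be a cross-entropy for categorical attributes, and the discriminator and generators would be trained against each other rather than evaluated once.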
3.2.2 Implementation
We formulate NANG in an adversarial learning manner that encourages $q(\mathbf{Z})$ learned from the observed data to match the whole distribution of $p(\mathbf{Z})$; the architecture is shown in Figure 1. NANG consists of three modules: (1) a self-reconstruction stream, (2) a cross-reconstruction stream and (3) adversarial regularization.
In the self-reconstruction stream, both the observed attributes $\mathbf{X}_o$ and the structure $\mathbf{A}$ are encoded into the latent space by the two encoders $E_X$ and $E_A$. The corresponding latent codes $\mathbf{Z}_X$ and $\mathbf{Z}_A$ are decoded into $\hat{\mathbf{X}}_o$ and $\hat{\mathbf{A}}$ by the two decoders $D_X$ and $D_A$. As shown in Figure 1, $E_X$ is a two-layer MLP, $E_A$ is a two-layer GCN, $D_X$ is a two-layer MLP and $D_A$ is a two-layer MLP followed by $\sigma(\mathbf{H}\mathbf{H}^{\top})$, where $\mathbf{H}$ represents the output of the two-layer MLP in $D_A$ and $\sigma$ denotes the sigmoid function. Let $\mathbf{A}_o$ be the structure information of the attribute-observed nodes and $\mathbf{Z}_A^o$ be the corresponding latent codes encoded by $E_A$. In the cross-reconstruction stream, the latent codes from $\mathbf{A}_o$ are decoded into $\hat{\mathbf{X}}_o$ by $D_X$, and the latent codes from $\mathbf{X}_o$ are decoded into $\hat{\mathbf{A}}$ by $D_A$. In the adversarial regularization module, we apply adversarial learning between $\mathbf{Z}_X$, $\mathbf{Z}_A$ and samples from the standard Gaussian prior $p(\mathbf{Z})$, sharing the same two-layer MLP discriminator network. These three modules work together to drive the learning of $q(\mathbf{Z})$, $p(\mathbf{X}\mid\mathbf{Z})$ and $p(\mathbf{A}\mid\mathbf{Z})$.
4 Experiments
4.1 Experimental Setup
Datasets. We evaluate the proposed NANG on four real-world datasets to quantify its performance.

Cora. Cora mccallum2000automating is a citation graph whose nodes are papers and whose edges are citation links. It contains 10,556 edges and 2,708 papers that are categorized into 7 classes. Each node attribute vector indicates whether the corresponding paper contains certain word tokens, and it is represented as a multi-hot vector of dimension 1,433.

Citeseer. Citeseer sen2008collective is also a citation graph, containing 9,228 edges and 3,327 papers that are categorized into 6 classes. After stemming and stop-word removal, 3,704 distinct words make up the attribute corpus. Each node attribute vector is formed from this corpus and represented as a multi-hot vector of dimension 3,704.

Steam. Steam (https://store.steampowered.com/) is a dataset we collected from the official game website, with the user purchase histories of 9,944 items and 352 labels for these items. We count the co-purchase frequency between every two games and build a sparse item co-purchase graph by binarizing the counts with a threshold of 10, obtaining 533,962 edges. The label corpus constructs the multi-hot attribute vector for each item, with dimension 352.

Pubmed. Pubmed namata2012query is a citation graph whose nodes are categorized into 3 classes. There are 19,717 nodes and 88,651 edges in this graph. Each node attribute vector is a Term Frequency-Inverse Document Frequency (TF-IDF) vector over 500 distinct terms.
Among these datasets, the attributes of Cora, Citeseer and Steam are categorical and represented as multi-hot vectors. For Pubmed, the node attributes are real-valued. The statistics of these datasets are given in Table 3 of Appendix A.
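The co-purchase binarization described for Steam can be sketched as follows; the counts are made-up toy values, while the threshold of 10 comes from the text:

```python
import numpy as np

# Binarize pairwise co-purchase counts into a sparse adjacency matrix.
counts = np.array([
    [0, 25,  3],
    [25, 0, 12],
    [3, 12,  0],
])
threshold = 10
A = (counts >= threshold).astype(int)  # edge iff co-purchased >= threshold times
```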
Baselines of Generation Methods. To evaluate the generation performance of our method, we compare it with three baselines.

NeighAggre. NeighAggre directly aggregates neighbours' attributes via mean pooling for nodes without attributes. Neighbours whose attributes are also missing are excluded from the aggregation.

VAE. Although VAE kingma2013auto does not naturally suit this task, we consider it as a baseline with some adaptations. We train a standard VAE on the attribute-observed nodes, encoding their attributes into latent codes. For nodes without any attributes, we perform neighbour aggregation, as in NeighAggre, in the latent space. The VAE decoder is then used to generate node attributes.

GCN. For GCN kipf2016semi as a baseline, only the structure information is used as input and encoded into latent embeddings. The latent embeddings are then decoded by an additional two-layer MLP, supervised by the observed attributes. At test time, we feed the latent embeddings of test nodes through the two-layer MLP decoder to generate their attributes.
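For reference, the NeighAggre baseline can be sketched as follows; the function and array names are our own:

```python
import numpy as np

# Minimal sketch of NeighAggre: mean-pool the attributes of attribute-observed
# neighbours; neighbours with missing attributes are skipped.
def neigh_aggre(A, X, observed_mask):
    out = X.copy()
    for v in range(A.shape[0]):
        if observed_mask[v]:
            continue                        # keep observed attributes as-is
        neigh = np.where((A[v] > 0) & observed_mask)[0]
        if len(neigh) > 0:
            out[v] = X[neigh].mean(axis=0)  # mean pooling over observed neighbours
    return out

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])
X = np.array([[0.0, 0.0],   # node 0: attributes missing
              [1.0, 0.0],
              [0.0, 1.0]])
observed = np.array([False, True, True])
X_hat = neigh_aggre(A, X, observed)
```

Here node 0's missing attributes become the mean of its two observed neighbours' attributes, i.e. [0.5, 0.5].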
4.2 Applications and Evaluation Measures
Generative models based on VAEs kingma2013auto ; higgins2017beta and GANs isola2017image ; liu2017unsupervised excel at generating realistic data that share the same distribution as the true data. They mainly use reconstruction error or visual quality to evaluate generation performance. For GAN-based methods in particular, other objectives such as MMD, total variation and the Wasserstein distance have been proposed to measure the distance between the true data distribution and the generated data distribution hu2017unifying . In our problem, however, whether the generated attributes can benefit real-world applications matters more. Consequently, we propose to measure the quality of generated node attributes at both the node level and the attribute level, using two real-world applications.
Node classification. This task tests whether the generated node attributes serve as data augmentation and benefit a classification model. We use the generated attributes of test nodes to compare node classification performance across methods, with accuracy as the evaluation metric. In other words, this task evaluates the overall quality of the generated node attributes via classification, which we also term node-level evaluation. We run this task on Cora, Citeseer and Pubmed, since they have node class information.

Profiling. A profile provides a cognitive description of objects, such as key terms for papers on Cora and Citeseer and labels for items on Steam. Profiling aims to predict the possible profile of test nodes; we use Recall@k and NDCG@k as evaluation metrics. In other words, this task evaluates the recall and ranking quality of the generated node attributes, which we also term attribute-level evaluation. For this task, we compare methods on Cora, Citeseer and Steam, since their attributes are categorical.
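The two metrics can be sketched as follows; these are our own minimal implementations of the standard definitions:

```python
import numpy as np

def recall_at_k(scores, truth, k):
    """Fraction of true attributes that appear among the top-k scored ones."""
    topk = np.argsort(-scores)[:k]
    return len(set(topk) & set(truth)) / len(truth)

def ndcg_at_k(scores, truth, k):
    """Discounted gain of true attributes in the top-k, normalized by the
    ideal ordering."""
    topk = np.argsort(-scores)[:k]
    gain = sum(1.0 / np.log2(i + 2) for i, a in enumerate(topk) if a in set(truth))
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(truth), k)))
    return gain / ideal

scores = np.array([0.9, 0.1, 0.8, 0.3])  # predicted attribute probabilities
truth = [0, 3]                           # indices of the true attributes
r = recall_at_k(scores, truth, 2)        # top-2 = {0, 2}, so recall is 0.5
```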
Table 1: Node classification accuracy. "X": only generated node attributes are used; "A": only structural information is used; "A+X": both are used.

|  | atts. generation method | classification method | Cora | Citeseer | Pubmed |
|---|---|---|---|---|---|
| X | NeighAggre | MLP | 0.6248 | 0.5539 | 0.5150 |
| X | VAE | MLP | 0.2826 | 0.2551 | 0.4008 |
| X | GCN | MLP | 0.3943 | 0.3768 | 0.3992 |
| X | NANG-Cross | MLP | 0.7074 | 0.4976 | 0.4000 |
| X | NANG-Self | MLP | 0.3036 | 0.2289 | 0.4023 |
| X | NANG | MLP | 0.7644 | 0.6010 | 0.4652 |
| X | True atts. | MLP | 0.7618 | 0.7174 | 0.656 |
| A | - | DeepWalk+MLP | 0.7149 | 0.4802 | 0.6917 |
| A | - | Node2Vec+MLP | 0.6830 | 0.4422 | 0.6721 |
| A | - | GCN | 0.7631 | 0.5651 | 0.7125 |
| A+X | NeighAggre | GCN | 0.6494 | 0.5413 | 0.6564 |
| A+X | VAE | GCN | 0.3011 | 0.2663 | 0.4007 |
| A+X | GCN | GCN | 0.4387 | 0.4079 | 0.4203 |
| A+X | NANG-Cross | GCN | 0.7727 | 0.5358 | 0.4197 |
| A+X | NANG-Self | GCN | 0.3402 | 0.2698 | 0.4204 |
| A+X | NANG | GCN | 0.8327 | 0.6599 | 0.7537 |
| A+X | True atts. | GCN | 0.8493 | 0.7348 | 0.8723 |
4.3 Node Classification
In the node classification task, the generated node attributes are split into 80% training data and 20% test data, with five-fold validation performed 10 times. We consider two classifiers, MLP and GCN, both using the class information as supervision. Three settings are used for comparison: the node-attribute-only approach, the graph-structure-only approach, and the fused approach. In the node-attribute-only approach, we directly use the generated attributes with a two-layer MLP classifier. In the graph-structure-only approach, we only use the graph structure without considering node attributes, as studied by many methods such as DeepWalk perozzi2014deepwalk , Node2Vec grover2016node2vec and GCN kipf2016semi . DeepWalk and Node2Vec both learn node embeddings, on which an MLP classifier is trained, while GCN is an end-to-end method that learns node embeddings supervised by the classification loss. In the fused approach, we combine the generated node attributes and the structure information with a GCN classifier.
Table 1 shows the classification performance, in which "X" indicates that only generated node attributes are used, "A" indicates that only structural information is used, and "A+X" is the fused setting. We can summarize that: (1) When only the generated attributes are used for node classification, the proposed NANG obtains significant gains over the baseline methods NeighAggre, VAE and GCN. Compared to the most competitive baseline, NeighAggre, NANG achieves nearly 14% and 5% gains on Cora and Citeseer, respectively. On Pubmed, NeighAggre appears to suit this dataset and setting well, but it deteriorates quickly when fewer attribute-observed nodes are available, as shown in Section 4.5. (2) The performance of NANG approaches that of the true attributes. NANG even outperforms the true attributes on Cora, mainly because the attributes generated by NANG may contain graph structure information that is beneficial to the classification task. (3) Both NANG-Cross and NANG-Self perform worse than NANG, because their incomplete objective functions cannot guarantee the shared latent factor assumption.
Apart from the generated attributes, we can also use the graph structure information alone for node classification; the results are shown in the rows of Table 1 marked "A". Among these methods, GCN performs best, which is in accordance with recent works. This raises the question of whether the attributes generated by our method can further augment the GCN classification performance, so we conduct the fused "A+X" experiment. Comparing the "A" rows with the "A+X" rows in Table 1, we can summarize that: (1) The attributes generated by NANG augment the GCN classification performance with gains of 6.96%, 9.48% and 4.12% on Cora, Citeseer and Pubmed, respectively, while NeighAggre harms the GCN performance by 11.37%, 2.38% and 5.61% on Cora, Citeseer and Pubmed, respectively. (2) The attributes generated by other methods such as GCN and VAE are of inferior quality and hurt the GCN performance considerably, because these methods cannot capture the complex translation pattern between the attribute and structure modalities.
Table 2: Profiling performance (Recall@k and NDCG@k).

Cora
| Method | Recall@10 | Recall@20 | Recall@50 | NDCG@10 | NDCG@20 | NDCG@50 |
|---|---|---|---|---|---|---|
| NeighAggre | 0.0906 | 0.1413 | 0.1961 | 0.1217 | 0.1548 | 0.1850 |
| VAE | 0.0887 | 0.1228 | 0.2116 | 0.1224 | 0.1452 | 0.1924 |
| GCN | 0.1271 | 0.1772 | 0.2962 | 0.1736 | 0.2076 | 0.2702 |
| NANG-Cross | 0.1378 | 0.2018 | 0.3339 | 0.1931 | 0.2360 | 0.3052 |
| NANG-Self | 0.1224 | 0.1724 | 0.2823 | 0.1686 | 0.2023 | 0.2599 |
| NANG | 0.1508 | 0.2182 | 0.3429 | 0.2112 | 0.2546 | 0.3212 |

Citeseer
| Method | Recall@10 | Recall@20 | Recall@50 | NDCG@10 | NDCG@20 | NDCG@50 |
|---|---|---|---|---|---|---|
| NeighAggre | 0.0511 | 0.0908 | 0.1501 | 0.0823 | 0.1155 | 0.1560 |
| VAE | 0.0382 | 0.0668 | 0.1296 | 0.0601 | 0.0839 | 0.1251 |
| GCN | 0.0620 | 0.1097 | 0.2052 | 0.1026 | 0.1423 | 0.2049 |
| NANG-Cross | 0.0679 | 0.1163 | 0.2140 | 0.1167 | 0.1570 | 0.2209 |
| NANG-Self | 0.0564 | 0.1013 | 0.1963 | 0.0863 | 0.1238 | 0.1860 |
| NANG | 0.0764 | 0.1280 | 0.2377 | 0.1298 | 0.1729 | 0.2447 |

Steam
| Method | Recall@3 | Recall@5 | Recall@10 | NDCG@3 | NDCG@5 | NDCG@10 |
|---|---|---|---|---|---|---|
| NeighAggre | 0.0603 | 0.0881 | 0.1446 | 0.0955 | 0.1204 | 0.1620 |
| VAE | 0.0564 | 0.0820 | 0.1251 | 0.0902 | 0.1133 | 0.1437 |
| GCN | 0.2392 | 0.3258 | 0.4575 | 0.3366 | 0.4025 | 0.4848 |
| NANG-Cross | 0.2429 | 0.3116 | 0.4614 | 0.3414 | 0.3969 | 0.4889 |
| NANG-Self | 0.2382 | 0.3381 | 0.4611 | 0.3282 | 0.4057 | 0.4835 |
| NANG | 0.2527 | 0.3560 | 0.4933 | 0.3544 | 0.4332 | 0.5215 |
4.4 Profiling
In this profiling task, the attributes generated on Cora, Citeseer and Steam are the probabilities that a node is active in each attribute dimension. Good generated attributes should assign high probability to the same attribute dimensions as the true attributes. Accordingly, taking both recall and ranking ability into consideration, we use Recall@k and NDCG@k to evaluate the attribute generation performance at the attribute level. The results are shown in Table 2.
From this table, it is clear that NeighAggre performs the worst among all methods on the three datasets, especially on Steam, since it is not a learning algorithm and cannot generate reliable attributes for attribute-level profiling. NANG, in contrast, generates attributes based on the translation knowledge from structure information to attribute information, which is more adaptable and flexible. Indeed, the results in Table 2 show that NANG achieves superior performance over the other methods on this attribute-level evaluation. Compared to GCN, NANG achieves gains of 4.67% and 3.25% in Recall@50 on Cora and Citeseer, respectively.
4.5 Fewer Attribute-observed Nodes
The attribute-observed nodes provide the supervision for this node attribute generation task. In some scenarios, this supervision may be scarce, so it is necessary to examine whether NANG can still generate reliable, high-quality node attributes when fewer attributed nodes are observed. We conduct an experiment to explore the node classification and profiling performance under this condition. The results are shown in Figure 2.
Figure 2 (a) and (d) show node classification when only "X" is used: NANG performs much better than the other methods on Citeseer, and the gap widens as fewer attribute-observed nodes are available. On Pubmed, although NeighAggre performs better than NANG when more attribute-observed nodes are available, it is not robust when they become scarce. Figure 2 (b) and (e) show the node classification performance when "A+X" is used. In these two figures, the dotted line represents using only "A" with a GCN classifier, which serves as a criterion for judging whether the generated attributes enhance the GCN classifier. Clearly, in the "A+X" setting, our NANG outperforms the other methods. Moreover, it remains robust when fewer attribute-observed nodes are available, while NeighAggre fails to enhance the GCN classifier. Figure 2 (c) and (f) show the profiling performance on Citeseer and Steam. NANG performs better than the other methods when fewer attribute-observed nodes are available, because it learns the translation knowledge between the attribute and structure modalities, which is more adaptive and flexible.
5 Conclusions
In this paper, we propose an adversarial learning method called NANG for the node attribute generation problem. In NANG, we assume that node attribute information and structure information share the same latent factor, which can be translated into different modalities. The implicit distribution of this latent factor is modeled in an auto-encoding Bayes and adversarial learning manner. We further evaluate the quality of generated node attributes at both the node level and the attribute level through practical applications. Empirical results validate the superiority of our method on node classification and profiling for nodes whose attributes are completely inaccessible.
References
 [1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
 [2] Aleksandar Bojchevski, Oleksandr Shchur, Daniel Zügner, and Stephan Günnemann. Netgan: Generating graphs via random walks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 610–619, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
 [3] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
 [4] Nicola De Cao and Thomas Kipf. MolGAN: An implicit generative model for small molecular graphs. In ICML 2018 workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.
 [5] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pages 3844–3852, 2016.
 [6] Yanwei Fu, Timothy M Hospedales, Tao Xiang, and Shaogang Gong. Transductive multi-view zero-shot learning. IEEE transactions on pattern analysis and machine intelligence, 37(11):2332–2345, 2015.
 [7] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
 [8] Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [9] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.
 [10] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
 [11] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
 [12] Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158–1161, 1995.
 [13] Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, and Eric P. Xing. On unifying deep generative models. In International Conference on Learning Representations, 2018.

 [14] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
 [15] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In Proceedings of the 35th International Conference on Machine Learning, 2018.
 [16] Vassilis Kalofolias, Xavier Bresson, Michael Bronstein, and Pierre Vandergheynst. Matrix completion on graphs. arXiv preprint arXiv:1408.1717, 2014.
 [17] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014.

 [18] Thomas N Kipf and Max Welling. Variational graph auto-encoders. In NIPS Workshop on Bayesian Deep Learning, 2016.
 [19] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
 [20] Maosen Li, Siheng Chen, Xu Chen, Ya Zhang, Yanfeng Wang, and Qi Tian. Actionalstructural graph convolutional networks for skeletonbased action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2019.
 [21] MingYu Liu, Thomas Breuel, and Jan Kautz. Unsupervised imagetoimage translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
 [22] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov):2579–2605, 2008.
 [23] Alireza Makhzani. Implicit autoencoders. arXiv preprint arXiv:1805.09804, 2018.
 [24] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian Goodfellow. Adversarial autoencoders. In International Conference on Learning Representations, 2016.
 [25] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2017.
 [26] Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127–163, 2000.

 [27] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In International Conference on Learning Representations, 2013.
 [28] Galileo Namata, Ben London, Lise Getoor, Bert Huang, and UMD EDU. Query-driven active surveying for collective classification. In 10th International Workshop on Mining and Learning with Graphs, 2012.
[29] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
[30] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.
[31] Victor Garcia Satorras and Joan Bruna Estrach. Few-shot learning with graph neural networks. In International Conference on Learning Representations, 2018.
[32] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–93, 2008.
[33] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 974–983. ACM, 2018.
[34] Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L. Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pages 4800–4810, 2018.
[35] Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. GraphRNN: Generating realistic graphs with deep auto-regressive models. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5708–5717, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[36] Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Information maximizing variational autoencoders. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019.
[37] Huangjie Zheng, Jiangchao Yao, Ya Zhang, and Ivor Wai-Hung Tsang. Degeneration in VAE: In the light of Fisher information loss. arXiv preprint arXiv:1802.06677, 2018.
[38] Huangjie Zheng, Jiangchao Yao, Ya Zhang, Ivor Wai-Hung Tsang, and Jia Wang. Understanding VAEs in Fisher-Shannon plane. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019.
Appendix A Supplementary Experimental Setup
Dataset Statistics. The statistics of the datasets used are shown in Table 3.
                           Cora         Citeseer     Steam        Pubmed
#nodes                     2,708        3,327        9,944        19,717
#edges                     10,556       9,228        533,962      88,651
graph sparsity             0.14%        0.08%        0.53%        0.02%
attribute dimension        1,433        3,703        352          500
avg. non-zero attributes   18.17        31.6         8.45         –
#classes                   7            6            –            3
attribute form             categorical  categorical  categorical  real-valued
Parameter Setting. For each dataset, we randomly sample a subset of the attributed nodes as training data and another subset as validation data; the remaining nodes serve as test data whose attributes need to be generated. For NeighAggre, we directly use the one-hop neighbours as a node's neighbours. For all learning-based methods (VAE, GCN, NANG), we set the latent dimension to 64 and the learning rate to 0.005. A dropout rate of 0.5 is used and the maximum number of iterations is 1,000. The Adam optimizer is applied to learn the model parameters. Recall that different datasets have different attribute forms, either categorical or real-valued. For datasets with categorical attributes (Cora, Citeseer and Steam), a weighted binary cross-entropy (BCE) loss is applied, where the weight on non-zero entries is computed from the training node attribute matrix. For Pubmed, whose attributes are real-valued, a mean squared error (MSE) loss is used.
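As a minimal sketch of this loss selection, the two reconstruction losses could look as follows in plain NumPy. The `pos_weight` argument is an assumption standing in for the weight the paper computes from the training attribute matrix; the exact value is dataset-dependent and not specified here.

```python
import numpy as np

def weighted_bce(pred, target, pos_weight):
    """Weighted binary cross-entropy for categorical attributes.

    pos_weight up-weights the sparse non-zero entries; in the paper it is
    derived from the training node attribute matrix (value assumed here).
    """
    eps = 1e-12
    pred = np.clip(pred, eps, 1.0 - eps)
    loss = -(pos_weight * target * np.log(pred)
             + (1.0 - target) * np.log(1.0 - pred))
    return loss.mean()

def mse(pred, target):
    """Mean squared error for real-valued attributes (e.g. Pubmed)."""
    return ((pred - target) ** 2).mean()
```

In a PyTorch implementation the same choice would typically map to `BCEWithLogitsLoss(pos_weight=...)` versus `MSELoss`.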
For our adversarial NANG, we use 2 generator steps per discriminator step for Cora and Citeseer, and 5 generator steps per discriminator step for Steam and Pubmed. The loss-weighting hyper-parameter takes one value for Cora, Citeseer and Steam and a different value for Pubmed. Each experiment is run multiple times and the mean is reported as the performance. The method is implemented in PyTorch on a machine with a single Nvidia Titan X GPU.
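The alternating update schedule described above can be sketched generically as follows. The `update_g` and `update_d` callbacks are hypothetical placeholders for one optimizer step of the generator and discriminator respectively; this is a generic adversarial-training loop, not the paper's exact implementation.

```python
def train_epoch(batches, g_steps, d_steps, update_g, update_d):
    """Alternate d_steps discriminator updates with g_steps generator
    updates per batch (e.g. g_steps=2, d_steps=1 for Cora/Citeseer;
    g_steps=5, d_steps=1 for Steam/Pubmed)."""
    schedule = []
    for batch in batches:
        for _ in range(d_steps):
            update_d(batch)     # one discriminator optimizer step
            schedule.append("D")
        for _ in range(g_steps):
            update_g(batch)     # one generator optimizer step
            schedule.append("G")
    return schedule
```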
Appendix B Latent Embedding Visualization
The methods compared in this paper (VAE, GCN and NANG) encode node information into low-dimensional embeddings and decode them as node attributes. Good representation ability means that a method learns embeddings in which nearby points correspond to similar nodes. We therefore visualize the learned node embeddings with t-SNE [22]. Specifically, we sample the latent embeddings of all test nodes on the Cora dataset, reduce them to 2-D with t-SNE, and plot them; nodes of the same class are expected to cluster together. Note that none of the methods uses label information during training, so the t-SNE visualization is learned without class supervision for all methods. Figure 3 shows the result.
For VAE with a Gaussian prior in Figure 3 (a), nodes of different classes are clearly mixed together, meaning it cannot distinguish nodes belonging to different classes. For GCN in Figure 3 (b), the nodes are encoded into a narrow, stream-like space in which different nodes are mixed and overlapped. Compared to VAE, GCN's narrow latent space arises mainly because it has no prior assumption and thus lacks a distributional constraint. For our NANG in Figure 3 (c), the nodes cluster well according to their classes. Although a Gaussian prior is imposed on the latent spaces of both VAE and NANG, NANG lets the attribute and structure modalities complement each other and thus captures more complex patterns in the latent space, where VAE fails. NANG therefore produces a better t-SNE visualization.
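The visualization procedure above can be sketched with scikit-learn's t-SNE. This is an illustrative helper, not the paper's exact plotting code; the returned 2-D points would then be scattered and coloured by class label, with labels used only for colouring.

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_2d(latent, random_state=0):
    """Reduce latent embeddings (e.g. 64-d codes of test nodes) to 2-D.

    Perplexity is clipped so the call stays valid for small sample sets.
    """
    latent = np.asarray(latent)
    perplexity = min(30, len(latent) - 1)
    return TSNE(n_components=2, init="pca",
                perplexity=perplexity,
                random_state=random_state).fit_transform(latent)
```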
Appendix C Hyperparameter
In our NANG, we introduce a hyper-parameter to emphasize the cross-reconstruction stream in the objective function. It is worth examining how our method responds to this hyper-parameter. We therefore run node classification under the "A+X" setting and profiling with different values of it.
Figure 4 (a-c) shows the node classification results under the "A+X" setting on Cora, Citeseer and Pubmed, respectively, and Figure 4 (d-f) shows the profiling results on Cora, Citeseer and Steam, respectively. These figures show that this hyper-parameter matters for our method, since we rely on the cross-reconstruction stream to generate node attributes. Figure 4 (a-c) shows that a large value is needed to generate high-quality node attributes that can augment a GCN classifier relative to using "A" alone. In Figure 4 (d-f), NANG mostly outperforms the most competitive baseline in our paper, GCN. On Steam, however, too large a value degrades performance, because it weakens the importance of distribution matching. The hyper-parameter should therefore be chosen according to the specific dataset.
Appendix D Learning Process Visualization
To understand the learning process of our method, we plot several learning curves over the course of training: the joint training loss, the GAN training loss, the validation metric, and the MMD distance. The results are shown in Figure 5.
The figure shows that both the joint training loss and the GAN training loss converge, and that the validation Recall@10 increases step by step before finally converging. The train and validation MMD distances are shown in Figure 5 (d). During training, the two encoders map their inputs to a shared latent factor whose distribution is an aggregate of the posteriors; the decreasing MMD distance indicates that this implicit distribution gradually matches the target distribution.
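The MMD distance tracked in Figure 5 (d) can be estimated between two sample sets with a kernel two-sample statistic. The sketch below uses a Gaussian (RBF) kernel with an assumed bandwidth `sigma`; it is a generic biased MMD² estimator, not necessarily the kernel or bandwidth used in the paper.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel matrix between two sample sets of shape (n, d) and (m, d)."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd(x, y, sigma=1.0):
    """Squared maximum mean discrepancy between samples x and y.

    A small value indicates the two sample distributions are close, e.g.
    the implicit latent distribution approaching the target distribution.
    """
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean())
```

For identical sample sets this estimator is exactly zero, and it grows as the two distributions separate.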