Node Attribute Generation on Graphs

07/23/2019 ∙ by Xu Chen, et al. ∙ University of Technology Sydney ∙ The University of Texas at Austin ∙ Shanghai Jiao Tong University ∙ Carnegie Mellon University

Graph-structured data provide two-fold information: graph structures and node attributes. Numerous graph-based algorithms rely on both kinds of information to succeed in supervised tasks, such as node classification and link prediction. However, node attributes can be missing or incomplete, which significantly degrades performance. The task of node attribute generation aims to generate attributes for those nodes whose attributes are completely unobserved. This task benefits many real-world problems such as profiling, node classification and graph data augmentation. To tackle this task, we propose a deep adversarial learning based method to generate node attributes, called node attribute neural generator (NANG). NANG learns a unifying latent representation that is shared by both node attributes and graph structures and can be translated to different modalities. We thus use this latent representation as a bridge to convert information from one modality to another. We further introduce practical applications to quantify the performance of node attribute generation. Extensive experiments are conducted on four real-world datasets, and the empirical results show that node attributes generated by the proposed method are of high quality and beneficial to other applications. The datasets and code are available online.




1 Introduction

In recent years, representation learning on graphs has attracted great research interest because of its potential in numerous machine learning tasks, such as semi-supervised learning kipf2016semi and zero/few-shot learning fu2015transductive ; garcia2017few , as well as numerous real-world applications, such as recommendation systems ying2018graph and action recognition li2019actional . A standard supervised-learning paradigm on graphs is to predict class information from the observed graph structure and node attributes. For example, kipf2016semi ; kipf2016variational proposed graph convolutional neural networks that learn graph embeddings from both graph structures and node attributes. In practice, either graph structures or node attributes can be inaccurate, incomplete or missing, which significantly degrades performance. To obtain a more complete description of graph structures, you2018graphrnn ; bojchevski2018netgan consider graph structure generation using random walks and Recurrent Neural Networks (RNNs).

In this paper, we focus on generating node attributes for missing data. Specifically, given a fixed graph structure, only some nodes have attributes, and the attributes of the remaining nodes are completely missing and inaccessible. We aim to generate attributes for those nodes that do not have any observed attribute. In the completion problem kalofolias2014matrix , every node may have partially observed attributes; in our task, the attributes of some nodes are completely unknown, so we call it node attribute generation. This task is related to numerous real-world challenges. For example, in citation networks, raw attributes and detailed descriptions of papers may be missing because of copyright protection. In an item co-occurrence graph, descriptive tags for items may be missing because tagging is labour-intensive and expensive. Node attribute generation not only retrieves unknown information, but also benefits many supervised-learning tasks, such as profiling and node classification. The generated node attributes can also serve as additional inputs to improve subsequent tasks.

Recently, a series of deep generative methods have been proposed to generate real-world data, such as variational auto-encoders (VAE) kingma2013auto and generative adversarial networks (GAN) goodfellow2014generative . However, neither VAE nor GAN can directly generate node attributes that are associated with complex and irregular graphs. The adversarial learning idea from GAN has been extended to many other generative approaches. For example, UNIT liu2017unsupervised performs unsupervised image-to-image translation based on the shared-latent space assumption, where all images come from the same latent space. However, simply extending this method to the graph domain is infeasible because: (1) the graph structures and node attributes come from two heterogeneous spaces; and (2) the node attributes can be either real-valued or categorical, which makes adversarial learning in data space infeasible.

To this end, we propose a latent-space adversarial learning based method to generate node attributes, called node attribute neural generator (NANG). NANG can generate both real-valued and categorical node attributes. Specifically, we assume that the attribute and structure information come from the same latent factor, which can be translated to both the attribute modality and the structure modality; we call this the shared latent factor assumption. We then use this latent factor as a bridge to convert information from one modality to another, and further generate node attributes based on graph structures. The main contributions of our work can be summarized as follows:

  • Under the shared latent factor assumption, we propose a novel node attribute neural generator (NANG), which is compatible with both real-valued and categorical attributes;

  • We introduce practical measures to evaluate the quality of generated node attributes;

  • Empirical results on four real-world datasets show that our method can handle the node classification and profiling tasks for those nodes whose attributes are completely inaccessible.

2 Related Work

2.1 Deep Generative Models

Deep generative models have attracted a lot of attention recently. One of the earliest deep generative models is the wake-sleep algorithm hinton1995wake , which separately trains a generative model in the wake phase and an inference model in the sleep phase. VAE kingma2013auto and its variants Zheng2018DegenerationIV ; Zheng2019UnderstandingVI ; zhao2017infovae learn the generative model and the inference model jointly by maximizing the variational lower bound with the reparameterization trick.

In recent years, GAN goodfellow2014generative , another family of deep generative models hu2017unifying , has emerged as a hot research topic. It contains a generator and a discriminator: the discriminator tries to distinguish real samples from fake samples, and the generator tries to confuse the discriminator. Several works goodfellow2014generative ; nowozin2016f ; chen2016infogan have pointed out that the adversarial loss in GAN actually minimizes a lower bound on the Jensen-Shannon divergence (JSD) between the data distribution and the generator distribution. The original GAN is prone to vanishing gradients and mode collapse. To handle this, many works improve the objective function, such as f-GAN nowozin2016f , LSGAN mao2017least and WGAN arjovsky2017wasserstein . The idea of adversarial learning from GAN is not limited to the generation setting; it can also be applied to many other applications such as domain adaptation ganin2016domain and domain translation isola2017image ; liu2017unsupervised .

2.2 Deep Representation Learning on Graphs

With the great success of deep neural networks in modeling speech signals, images and text, many researchers have started to apply deep neural networks to the graph domain. For example, DeepWalk perozzi2014deepwalk and Node2Vec grover2016node2vec learn node embeddings with random walks and the skip-gram model mikolov2013efficient . Graph convolutional networks (GCN) are proposed in defferrard2016convolutional ; kipf2016semi and successfully applied to semi-supervised node classification. A hierarchical and differentiable pooling is proposed in ying2018hierarchical to learn latent representations for graphs. GraphSAGE hamilton2017inductive introduces neighbour sampling and several aggregation functions to enable inductive graph convolution on large graphs.

In many real-world applications, graph structures may be missing, incomplete or inaccessible. To handle this issue, graph structure generation has been studied recently. A junction tree variational auto-encoder is proposed in jin2018junction to generate molecular graphs. MolGAN de2018molgan is an implicit, likelihood-free generative model for small molecular graphs. GraphRNN you2018graphrnn and NetGAN bojchevski2018netgan generate realistic graphs by combining random walks and RNNs. Despite the great potential in many applications, few works consider generating node attributes associated with graphs.

3 Node Attribute Neural Generator (NANG)

3.1 Problem Statement

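Let $\mathcal{G} = (\mathcal{V}, A, X)$ be a graph with node set $\mathcal{V} = \{v_1, \dots, v_N\}$, where $A \in \mathbb{R}^{N \times N}$ is the graph adjacency matrix and $X \in \mathbb{R}^{N \times F}$ is the node attribute matrix. Note that the elements of $X$ can be either categorical or real-valued. Let $\mathcal{V}^o \subset \mathcal{V}$ be the set of nodes with observed attributes; the corresponding node attribute matrix is $X^o \in \mathbb{R}^{N_o \times F}$. Let $\mathcal{V}^u \subset \mathcal{V}$ be the set of nodes with unobserved attributes; the corresponding node attribute matrix is $X^u \in \mathbb{R}^{N_u \times F}$, which is unknown. To be precise, we have $\mathcal{V} = \mathcal{V}^o \cup \mathcal{V}^u$, $\mathcal{V}^o \cap \mathcal{V}^u = \emptyset$ and $N = N_o + N_u$. Our task is to generate the unobserved node attributes $X^u$ from the observed node attributes $X^o$ and the graph adjacency matrix $A$.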
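As a concrete picture of this setup, the following minimal Python sketch builds a toy graph and splits its nodes into an attribute-observed set and an attribute-missing set. The adjacency matrix, the attribute values and all variable names are illustrative assumptions and do not come from the paper.

```python
# Toy instance of the problem setup (all values are illustrative assumptions).
# A: symmetric adjacency matrix of a 5-node graph.
A = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]

# X: node attribute matrix; None marks a node whose attributes are
# completely unobserved (not just partially missing).
X = [
    [1, 0, 1],
    None,
    [0, 1, 1],
    None,
    [1, 1, 0],
]

num_nodes = len(A)
observed = [i for i in range(num_nodes) if X[i] is not None]
unobserved = [i for i in range(num_nodes) if X[i] is None]

# Attribute matrix restricted to attribute-observed nodes.
X_observed = [X[i] for i in observed]

# Structure information of attribute-observed nodes: their rows of A.
A_observed = [A[i] for i in observed]
```

The generation task then amounts to predicting the rows of `X` indexed by `unobserved`, given `X_observed` and the full `A`.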

3.2 Model Formulation

Graph-structured data include two aspects: node attributes and graph structures. We assume that they share the same latent factor $z$, which can be translated to either the node attributes $x$ or the graph structures $a$. Based on this, the marginal distributions of attributes and structures satisfy:

$$p(x) = \int p_{\theta_x}(x \mid z)\, p(z)\, dz, \qquad p(a) = \int p_{\theta_a}(a \mid z)\, p(z)\, dz, \tag{1}$$

where the conditional distributions $p_{\theta_x}(x \mid z)$ and $p_{\theta_a}(a \mid z)$ are the probabilistic decoders that decode $z$ to $x$ and $a$, respectively, and $p(z)$ is the prior distribution of the latent factor $z$. To better use the prior to guide the learning process, we learn an aggregated distribution $q(z)$ for the posteriors $q_{\phi_x}(z \mid x)$ and $q_{\phi_a}(z \mid a)$ based on the observed data. Minimizing the distance between $q(z)$ and $p(z)$ encourages the shared latent factor to match the whole distribution of $p(z)$.

Therefore, our goal is to learn the aggregated distribution $q(z)$ and the probabilistic decoders $p_{\theta_x}(x \mid z)$ and $p_{\theta_a}(a \mid z)$ from the partially observed node attributes $X^o$ and the graph structure $A$. In the attribute generation stage, we use the structure information to generate the missing attributes. More specifically, we infer the attributes of attribute-missing nodes by encoding their structure information into the latent factor $z$, which is then fed into the attribute decoder $p_{\theta_x}(x \mid z)$ to predict their unobserved attributes. NANG achieves this goal by distribution matching makhzani2015adversarial ; zhao2017infovae . In the following parts, we detail distribution matching, followed by the objective function and the implementation.

3.2.1 Inference via Distribution Matching

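NANG learns by distribution matching. Specifically, two parameterized encoders $q_{\phi_x}(z_x \mid x)$ and $q_{\phi_a}(z_a \mid a)$ encode the attribute information and the structure information, respectively. We then introduce adversarial learning to encourage the distributions of $z_x$ and $z_a$ to match the prior $p(z)$. In this way, the latent variables $z_x$ and $z_a$ are supposed to come from the same aggregated posterior distribution $q(z)$ makhzani2015adversarial ; makhzani2018implicit . Besides, the distance between $q(z)$ and $p(z)$ is also minimized, encouraging the shared latent factor to match the whole distribution of $p(z)$. Let $Z_a$ be the latent codes encoded from structure information for all nodes and $A^o$ represent the structure information for attribute-observed nodes. Let $Z_x^o$ and $Z_a^o$ denote the latent codes encoded from attributes and structures, respectively, for attribute-observed nodes. To stay consistent with the shared latent factor assumption, the decoders $p_{\theta_x}$ and $p_{\theta_a}$ need to reconstruct $Z_x^o$ (resp. $Z_a$) to $X^o$ (resp. $A$), and translate $Z_a^o$ (resp. $Z_x^o$) to $X^o$ (resp. $A^o$). The architecture is shown in Figure 1. The objective function is formulated as:

$$\begin{aligned}
\min_{\Theta} \max_{\mathcal{D}} \mathcal{L} = {} & -\mathbb{E}_{z_x \sim q_{\phi_x}(z_x \mid x^o)}\left[\log p_{\theta_x}(x^o \mid z_x)\right] - \mathbb{E}_{z_a \sim q_{\phi_a}(z_a \mid a)}\left[\log p_{\theta_a}(a \mid z_a)\right] \\
& - \lambda_c\, \mathbb{E}_{z_a \sim q_{\phi_a}(z_a \mid a^o)}\left[\log p_{\theta_x}(x^o \mid z_a)\right] - \lambda_c\, \mathbb{E}_{z_x \sim q_{\phi_x}(z_x \mid x^o)}\left[\log p_{\theta_a}(a^o \mid z_x)\right] \\
& + \mathcal{L}_{adv}(z_x) + \mathcal{L}_{adv}(z_a),
\end{aligned} \tag{2}$$

where $\Theta$ denotes the network parameters of the encoders (generators) $q_{\phi_x}$ and $q_{\phi_a}$ and the decoders $p_{\theta_x}$ and $p_{\theta_a}$, and $\mathcal{D}$ is the shared discriminator used for adversarial learning. $z_p$ denotes true samples drawn from the prior distribution $p(z)$ for adversarial learning (we apply a standard Gaussian prior here).

$\lambda_c$ is a hyperparameter that weights the second two terms. The first two terms represent the self-reconstruction stream, in which information flows from attribute to attribute and from structure to structure. The second two terms represent the cross-reconstruction stream, in which information flows from structure to attribute and from attribute to structure. The last two terms are the adversarial loss defined in Eq. 3, which naturally introduces regularization to the objective function:

$$\mathcal{L}_{adv}(z) = \mathbb{E}_{z_p \sim p(z)}\left[\log \mathcal{D}(z_p)\right] + \mathbb{E}_{z \sim q(z)}\left[\log\left(1 - \mathcal{D}(z)\right)\right]. \tag{3}$$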
For the objective function in Eq. 2, the self-reconstruction stream and the cross-reconstruction stream, together with adversarial learning, form a coupling mechanism that lets the two modalities supplement each other and restores the original dependence between node attributes and structures. In this way, the aggregated distribution $q(z)$ and the probabilistic decoders $p_{\theta_x}(x \mid z)$ and $p_{\theta_a}(a \mid z)$ can be well captured, and we can generate the unobserved node attributes through the cross-reconstruction stream.
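To make the structure of the six-term objective tangible, here is a minimal Python sketch that combines two self-reconstruction terms, two weighted cross-reconstruction terms and two adversarial terms. It assumes binary cross-entropy as the reconstruction loss (suitable for multi-hot attributes and binary adjacency entries) and the standard GAN discriminator loss; the function names, the default weight `lam` and the flat-list inputs are hypothetical choices for illustration, not the authors' implementation.

```python
import math

def bce(target, prob, eps=1e-12):
    """Mean binary cross-entropy, i.e. negative Bernoulli log-likelihood per entry."""
    total = 0.0
    for t, p in zip(target, prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(target)

def adversarial_loss(d_prior, d_latent, eps=1e-12):
    """E[log D(z_p)] + E[log(1 - D(z))], given discriminator outputs in (0, 1)."""
    real = sum(math.log(max(d, eps)) for d in d_prior) / len(d_prior)
    fake = sum(math.log(max(1.0 - d, eps)) for d in d_latent) / len(d_latent)
    return real + fake

def objective_sketch(x, a, x_self, a_self, x_cross, a_cross,
                     d_prior, d_zx, d_za, lam=10.0):
    """Six-term objective: encoders/decoders minimize it, the discriminator maximizes it.

    x, a             : flattened targets (observed attributes, adjacency entries)
    x_self, a_self   : self-reconstructed probabilities
    x_cross, a_cross : cross-reconstructed probabilities
    d_prior, d_zx, d_za : discriminator outputs on prior samples and latent codes
    lam              : weight of the cross-reconstruction stream (assumed value)
    """
    self_rec = bce(x, x_self) + bce(a, a_self)
    cross_rec = bce(x, x_cross) + bce(a, a_cross)
    adv = adversarial_loss(d_prior, d_zx) + adversarial_loss(d_prior, d_za)
    return self_rec + lam * cross_rec + adv

# Toy call: every prediction and discriminator output is maximally uncertain.
value = objective_sketch(
    x=[1, 0], a=[1, 0],
    x_self=[0.5, 0.5], a_self=[0.5, 0.5],
    x_cross=[0.5, 0.5], a_cross=[0.5, 0.5],
    d_prior=[0.5], d_zx=[0.5], d_za=[0.5],
    lam=1.0,
)
```

In an actual training loop the two players would be updated in alternation, with the discriminator ascending and the encoders and decoders descending on this quantity.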

Figure 1: Model architecture of NANG. NANG first transforms the node attributes and the graph structures into the latent space, then aligns the latent representation via distribution matching, and finally decodes to the original attributes and structures. In this procedure, information captured from graph structure can be used to recover the node attributes by cross data stream after training.

3.2.2 Implementation

We formulate NANG in an adversarial learning manner that encourages the aggregated distribution $q(z)$ learned from observed data to match the whole prior distribution $p(z)$; the architecture is shown in Figure 1. NANG consists of three modules: (1) a self-reconstruction stream, (2) a cross-reconstruction stream and (3) adversarial regularization.

In the self-reconstruction stream, both the observed attributes $X^o$ and the structure information $A$ are encoded into the latent space by two encoders, the attribute encoder $E_X$ and the structure encoder $E_A$. The corresponding latent codes $Z_x^o$ and $Z_a$ are decoded as $\hat{X}^o$ and $\hat{A}$ by two decoders, the attribute decoder $D_X$ and the structure decoder $D_A$. This appears in Figure 1, in which $E_X$ is a two-layer MLP, $E_A$ is a two-layer GCN, $D_X$ is a two-layer MLP and $D_A$ is a two-layer MLP followed by $\sigma(H H^{\top})$ ($H$ represents the output of the two-layer MLP in $D_A$ and $\sigma$ denotes the sigmoid function). Let $A^o$ be the structure information for attribute-observed nodes and $Z_a^o$ be the corresponding latent codes encoded by $E_A$. In the cross-reconstruction stream, the latent codes $Z_a^o$ are decoded as $\hat{X}^o$ by $D_X$ and the latent codes $Z_x^o$ are decoded as $\hat{A}^o$ by $D_A$. In the adversarial regularization module, we apply adversarial learning between $Z_x^o$, $Z_a$ and samples from the standard Gaussian prior $p(z)$, sharing the same two-layer MLP discriminator network. These three modules work together to drive the learning of the aggregated distribution $q(z)$ and the two probabilistic decoders.
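The structure path described above can be sketched in plain Python as a two-layer GCN encoder followed by a sigmoid inner-product decoder. This is a simplified illustration under stated assumptions: it uses the symmetric normalization with self-loops from the GCN of kipf2016semi , feeds an identity matrix as input features, uses hypothetical fixed weights instead of trained ones, and applies the inner-product decoder directly to the latent codes, omitting the two-layer MLP that the structure decoder places before it.

```python
import math

def matmul(P, Q):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-loops."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    return [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def relu(M):
    return [[max(0.0, v) for v in row] for row in M]

def gcn_layer(A_norm, H, W, activate=True):
    """One graph convolution: A_norm @ H @ W, optionally followed by ReLU."""
    out = matmul(matmul(A_norm, H), W)
    return relu(out) if activate else out

def inner_product_decoder(H):
    """sigmoid(H H^T): one edge probability per node pair."""
    Ht = [list(col) for col in zip(*H)]
    logits = matmul(H, Ht)
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in logits]

# Toy 3-node path graph (illustrative assumption).
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
A_norm = normalize_adjacency(A)

# Identity input features and fixed weights are hypothetical placeholders.
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W1 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
W2 = [[0.7, 0.1], [-0.3, 0.9]]

hidden = gcn_layer(A_norm, features, W1)             # first GCN layer
Z_a = gcn_layer(A_norm, hidden, W2, activate=False)  # latent codes from structure
A_rec = inner_product_decoder(Z_a)                   # reconstructed edge probabilities
```

Because the decoder scores a pair of nodes by the inner product of their latent codes, the reconstructed matrix is symmetric by construction, which matches undirected graphs such as the citation networks used later.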

4 Experiments

4.1 Experimental Setup

Datasets. We evaluate the proposed NANG on four real-world datasets to quantify the performance.

  • Cora. Cora mccallum2000automating is a citation graph whose nodes are papers and whose edges are citation links. It contains 10,556 edges and 2,708 papers that are categorized into 7 classes. Each node attribute vector indicates whether the corresponding paper contains certain word tokens and is represented as a multi-hot vector of dimension 1,433.

  • Citeseer. Citeseer sen2008collective is also a citation graph which contains 9,228 edges and 3,327 papers that are categorized into 6 classes. After stemming and stop word removal operation of the content, 3,704 distinct words make up the attribute corpus. Each node attribute vector is formed from the corpus and represented as a multi-hot vector with dimension 3,704.

  • Steam. Steam is a dataset we collected from an official game website, containing the purchase history of users over 9,944 items and 352 labels for these items. We count the co-purchase frequency between every pair of games and build a sparse item co-purchase graph by binarizing the counts with a threshold of 10. This yields 533,962 edges. The label vocabulary defines a multi-hot attribute vector of dimension 352 for each item.

  • Pubmed. Pubmed namata2012query is a citation graph where nodes are categorized into 3 classes. There are 19,717 nodes and 88,651 edges in this graph. Each node attribute vector is described by a Term Frequency-Inverse Document Frequency (TF-IDF) vector from 500 distinct terms.
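The co-purchase graph construction described for Steam can be sketched as follows. The input format (a mapping from user to purchased items), the function name and the choice of keeping pairs whose count is greater than or equal to the threshold are assumptions; the text only states that counts are binarized with a threshold of 10.

```python
from collections import Counter
from itertools import combinations

def co_purchase_edges(purchases, threshold=10):
    """Build an undirected item graph from user purchase histories.

    purchases: dict mapping a user id to an iterable of purchased item ids
               (hypothetical input format).
    Returns the set of item pairs bought together by at least `threshold` users.
    """
    counts = Counter()
    for items in purchases.values():
        # Each unordered item pair is counted once per user.
        for u, v in combinations(sorted(set(items)), 2):
            counts[(u, v)] += 1
    return {pair for pair, c in counts.items() if c >= threshold}

# Toy usage with a small threshold.
toy = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b"],
    "u3": ["b", "c", "a"],
}
edges = co_purchase_edges(toy, threshold=3)
```

Thresholding removes pairs that co-occur only incidentally, which keeps the resulting graph sparse.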

Among these datasets, the attributes of Cora, Citeseer and Steam are categorical and represented as multi-hot vectors. For Pubmed, node attributes are real-valued, with each entry a scalar TF-IDF weight. The statistics of these datasets are summarized in Appendix A, Table 3.

Baselines of Generation Methods. To evaluate the generation performance of our method, we compare it with three baselines.

  • NeighAggre. NeighAggre directly aggregates the neighbors' attributes for nodes without attributes through mean pooling. When a neighbor's attributes are missing, we exclude that neighbor from the aggregation.

  • VAE. Although VAE kingma2013auto does not naturally suit this task, we consider it as a baseline with some adaptations. We train a standard VAE on the attribute-observed nodes, encoding the attributes of these nodes as latent codes. For the nodes without any attribute, we use neighbour aggregation, as in NeighAggre, in the latent space. Then, the VAE decoder is used to generate node attributes.

  • GCN. For GCN kipf2016semi as a baseline, only the structure information is used as input and encoded as latent embeddings. Then the latent embeddings are decoded by an additional two-layer MLP, being supervised by the observed attributes. In the test stage, we use the latent embeddings of test nodes to generate node attributes through the two-layer MLP decoder.
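The NeighAggre baseline above is simple enough to sketch directly. In this minimal Python version, `None` marks a node with no observed attributes, and the zero-vector fallback for nodes without any attribute-observed neighbor is an assumption, since the text does not say how that case is handled.

```python
def neigh_aggre(A, X):
    """Fill missing attribute rows with the mean of attribute-observed neighbors.

    A: adjacency matrix as nested lists (non-zero entry means an edge).
    X: list of attribute vectors, with None for attribute-missing nodes.
    """
    dim = len(next(row for row in X if row is not None))
    filled = []
    for i, row in enumerate(X):
        if row is not None:
            filled.append(list(row))
            continue
        # Neighbors whose attributes are themselves missing are excluded.
        neigh = [X[j] for j, w in enumerate(A[i]) if w and X[j] is not None]
        if not neigh:
            filled.append([0.0] * dim)  # assumption: zero vector when nothing to pool
        else:
            filled.append([sum(col) / len(neigh) for col in zip(*neigh)])
    return filled

# Toy usage: node 0 has no attributes and two attribute-observed neighbors.
A_toy = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
X_toy = [None, [1, 0], [0, 1]]
X_filled = neigh_aggre(A_toy, X_toy)
```

The output of mean pooling over multi-hot vectors is a fractional score per attribute dimension, which is what the ranking-based profiling metrics operate on.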

4.2 Applications and Evaluation Measures

Generative models, such as VAE-based kingma2013auto ; higgins2017beta and GAN-based isola2017image ; liu2017unsupervised methods, strive to generate realistic data that follow the same distribution as the true data. They mainly use reconstruction error or visual quality to evaluate generation performance. For GAN-based generative methods in particular, other objective functions such as MMD, total variation and Wasserstein distance have been proposed to measure the distance between the true data distribution and the generated data distribution hu2017unifying . In our problem, however, what matters more is whether the generated attributes benefit real-world applications. Consequently, we propose to measure the quality of generated node attributes at both the node level and the attribute level with two real-world applications.

  • Node classification. This task assesses whether the generated node attributes can serve as data augmentation and benefit the classification model. We use the generated attributes of test nodes to compare node classification performance across methods, taking accuracy as the evaluation metric. In other words, this task evaluates the overall quality of generated node attributes through classification methods, which we term node-level evaluation. We run this task on Cora, Citeseer and Pubmed since they have node class information.

  • Profiling. A profile provides a cognitive description of an object, such as key terms for papers on Cora and Citeseer and labels for items on Steam. Profiling aims to predict the likely profile of test nodes; we use Recall@k and NDCG@k as the evaluation metrics. In other words, this task evaluates the recall and ranking of generated node attributes, which we term attribute-level evaluation. For this task, we compare methods on Cora, Citeseer and Steam since the attributes of these datasets are categorical.
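For reference, the two profiling metrics can be computed per node as in the following sketch. It assumes the common binary-relevance definitions (recall normalized by the number of true attributes, and NDCG with a log2 rank discount and an ideal ranking of at most k true attributes); the paper does not spell out its exact normalization, so treat these formulas as one plausible reading.

```python
import math

def top_k(scores, k):
    """Indices of the k highest-scoring attribute dimensions."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

def recall_at_k(scores, truth, k):
    """Fraction of the node's true attributes that appear in the top-k predictions."""
    n_true = sum(truth)
    if n_true == 0:
        return 0.0
    hits = sum(1 for j in top_k(scores, k) if truth[j])
    return hits / n_true

def ndcg_at_k(scores, truth, k):
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(truth[j] / math.log2(rank + 2) for rank, j in enumerate(top_k(scores, k)))
    ideal_hits = min(k, sum(truth))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# Toy usage: generated scores for 4 attribute dimensions, 2 of which are true.
scores = [0.9, 0.1, 0.8, 0.3]
truth = [1, 0, 0, 1]
r = recall_at_k(scores, truth, k=2)
n = ndcg_at_k(scores, truth, k=2)
```

Dataset-level numbers would then be obtained by averaging these per-node values over the test nodes.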

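| Setting | Attribute generation method | Classification method | Cora | Citeseer | Pubmed |
|---|---|---|---|---|---|
| X | NeighAggre | MLP | 0.6248 | 0.5539 | 0.5150 |
| X | VAE | MLP | 0.2826 | 0.2551 | 0.4008 |
| X | GCN | MLP | 0.3943 | 0.3768 | 0.3992 |
| X | NANG-Cross | MLP | 0.7074 | 0.4976 | 0.4000 |
| X | NANG-Self | MLP | 0.3036 | 0.2289 | 0.4023 |
| X | NANG | MLP | 0.7644 | 0.6010 | 0.4652 |
| X | True atts. | MLP | 0.7618 | 0.7174 | 0.656 |
| A | - | DeepWalk+MLP | 0.7149 | 0.4802 | 0.6917 |
| A | - | Node2Vec+MLP | 0.6830 | 0.4422 | 0.6721 |
| A | - | GCN | 0.7631 | 0.5651 | 0.7125 |
| A+X | NeighAggre | GCN | 0.6494 | 0.5413 | 0.6564 |
| A+X | VAE | GCN | 0.3011 | 0.2663 | 0.4007 |
| A+X | GCN | GCN | 0.4387 | 0.4079 | 0.4203 |
| A+X | NANG-Cross | GCN | 0.7727 | 0.5358 | 0.4197 |
| A+X | NANG-Self | GCN | 0.3402 | 0.2698 | 0.4204 |
| A+X | NANG | GCN | 0.8327 | 0.6599 | 0.7537 |
| A+X | True atts. | GCN | 0.8493 | 0.7348 | 0.8723 |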
Table 1: Evaluation of generated attributes on node classification task. The first column with "X", "A" and "A+X" indicates three settings to do node classification, with only attributes, only structures and the fused one, respectively. NANG-Cross means only cross-reconstruction loss and GAN loss are applied. Similarly, NANG-Self represents only self-reconstruction loss and GAN loss are used. True atts. represents we use the ground truth attributes to do node classification.

4.3 Node Classification

In the task of node classification, the generated node attributes are split into 80% training data and 20% test data, with five-fold validation performed 10 times. We consider two classifiers, MLP and GCN, both of which use the class information as supervision. Three settings are designed for the comparison: the node-attribute-only approach, the graph-structure-only approach, and the fused approach. In the node-attribute-only approach, we directly use the generated attributes with a two-layer MLP as the classifier. In the graph-structure-only approach, we use only the graph structure without considering the node attributes, which has been studied by many methods such as DeepWalk perozzi2014deepwalk , Node2Vec grover2016node2vec and GCN kipf2016semi . DeepWalk and Node2Vec both learn node embeddings first and then apply an MLP classifier, whereas GCN is an end-to-end method that learns node embeddings supervised by the classification loss. In the fused approach, we combine the generated node attributes and the structure information with a GCN classifier.
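The evaluation protocol (80%/20% splits from five folds, repeated 10 times) can be sketched as a small generator. Shuffling with a seeded generator and assigning folds by stride are assumptions about details the text leaves open; the function name and arguments are hypothetical.

```python
import random

def repeated_kfold(num_nodes, folds=5, repeats=10, seed=0):
    """Yield (train, test) index lists: `folds` splits per repeat, `repeats` reshuffles.

    With folds=5, each split holds out 20% of the nodes for testing.
    """
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(num_nodes))
        rng.shuffle(idx)
        for f in range(folds):
            test = idx[f::folds]  # every `folds`-th shuffled index forms one fold
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield train, test

# Toy usage: 100 nodes with generated attributes give 50 train/test splits.
splits = list(repeated_kfold(100))
```

The reported accuracy would then be the mean over all 50 splits.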

Table 1 shows the classification performance, in which "X" indicates that only generated node attributes are used, "A" indicates that only structural information is used and "A+X" is the fused setting. We summarize that: (1) When only the generated attributes are used for node classification, the proposed NANG obtains a significant gain over the baselines NeighAggre, VAE and GCN on Cora and Citeseer. Compared to the most competitive baseline, NeighAggre, NANG gains nearly 14 and 5 percentage points in accuracy on Cora and Citeseer, respectively. On Pubmed, NeighAggre suits this dataset and setting well and outperforms NANG (0.5150 vs. 0.4652), but it deteriorates quickly when fewer attribute-observed nodes exist, as shown in Section 4.5. (2) The performance of NANG gets close to that of the true attributes. NANG even slightly outperforms the true attributes on Cora (0.7644 vs. 0.7618), mainly because the attributes generated by NANG may contain some graph structure information that is beneficial to the classification task. (3) Both NANG-Cross and NANG-Self perform worse than NANG, because an incomplete objective function cannot guarantee the shared latent factor assumption.

Apart from the generated attributes, we can also use graph structure information alone for node classification; the results are shown in the "A" rows of Table 1. Among these methods, GCN performs best, which is in accordance with recent works. This raises the question of whether the attributes generated by our method can improve GCN classification performance. Therefore, we conduct the fused "A+X" experiment. Comparing the "A" rows and the "A+X" rows in Table 1, we summarize that: (1) The attributes generated by NANG improve GCN classification accuracy by 6.96, 9.48 and 4.12 percentage points on Cora, Citeseer and Pubmed, respectively. In contrast, NeighAggre harms GCN performance, reducing accuracy by 11.37, 2.38 and 5.61 percentage points on Cora, Citeseer and Pubmed, respectively. (2) The attributes generated by other methods such as GCN and VAE are of inferior quality and hurt GCN performance considerably, because they cannot capture the complex translation pattern between the attribute and structure modalities.
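The gains quoted above are absolute differences in accuracy and can be checked directly against Table 1. The short sketch below recomputes them; the dictionary layout and the row labels are just a convenient illustrative encoding of the table values.

```python
# Accuracies copied from Table 1 (GCN classifier rows).
table1 = {
    "GCN (A only)":     {"Cora": 0.7631, "Citeseer": 0.5651, "Pubmed": 0.7125},
    "NANG + GCN":       {"Cora": 0.8327, "Citeseer": 0.6599, "Pubmed": 0.7537},
    "NeighAggre + GCN": {"Cora": 0.6494, "Citeseer": 0.5413, "Pubmed": 0.6564},
}

def gain_in_points(method, baseline="GCN (A only)"):
    """Absolute accuracy difference to the baseline, in percentage points."""
    return {
        dataset: round(100 * (table1[method][dataset] - table1[baseline][dataset]), 2)
        for dataset in table1[baseline]
    }

nang_gain = gain_in_points("NANG + GCN")
neigh_gain = gain_in_points("NeighAggre + GCN")
```

Positive values mean the generated attributes help the structure-only GCN, and negative values mean they hurt it.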

Cora

| Method | Recall@10 | Recall@20 | Recall@50 | NDCG@10 | NDCG@20 | NDCG@50 |
|---|---|---|---|---|---|---|
| NeighAggre | 0.0906 | 0.1413 | 0.1961 | 0.1217 | 0.1548 | 0.1850 |
| VAE | 0.0887 | 0.1228 | 0.2116 | 0.1224 | 0.1452 | 0.1924 |
| GCN | 0.1271 | 0.1772 | 0.2962 | 0.1736 | 0.2076 | 0.2702 |
| NANG-Cross | 0.1378 | 0.2018 | 0.3339 | 0.1931 | 0.2360 | 0.3052 |
| NANG-Self | 0.1224 | 0.1724 | 0.2823 | 0.1686 | 0.2023 | 0.2599 |
| NANG | 0.1508 | 0.2182 | 0.3429 | 0.2112 | 0.2546 | 0.3212 |

Citeseer

| Method | Recall@10 | Recall@20 | Recall@50 | NDCG@10 | NDCG@20 | NDCG@50 |
|---|---|---|---|---|---|---|
| NeighAggre | 0.0511 | 0.0908 | 0.1501 | 0.0823 | 0.1155 | 0.1560 |
| VAE | 0.0382 | 0.0668 | 0.1296 | 0.0601 | 0.0839 | 0.1251 |
| GCN | 0.0620 | 0.1097 | 0.2052 | 0.1026 | 0.1423 | 0.2049 |
| NANG-Cross | 0.0679 | 0.1163 | 0.2140 | 0.1167 | 0.1570 | 0.2209 |
| NANG-Self | 0.0564 | 0.1013 | 0.1963 | 0.0863 | 0.1238 | 0.1860 |
| NANG | 0.0764 | 0.1280 | 0.2377 | 0.1298 | 0.1729 | 0.2447 |

Steam

| Method | Recall@3 | Recall@5 | Recall@10 | NDCG@3 | NDCG@5 | NDCG@10 |
|---|---|---|---|---|---|---|
| NeighAggre | 0.0603 | 0.0881 | 0.1446 | 0.0955 | 0.1204 | 0.1620 |
| VAE | 0.0564 | 0.0820 | 0.1251 | 0.0902 | 0.1133 | 0.1437 |
| GCN | 0.2392 | 0.3258 | 0.4575 | 0.3366 | 0.4025 | 0.4848 |
| NANG-Cross | 0.2429 | 0.3116 | 0.4614 | 0.3414 | 0.3969 | 0.4889 |
| NANG-Self | 0.2382 | 0.3381 | 0.4611 | 0.3282 | 0.4057 | 0.4835 |
| NANG | 0.2527 | 0.3560 | 0.4933 | 0.3544 | 0.4332 | 0.5215 |
Table 2: Evaluation for generated attributes on profiling task. Note that the average non-zero attribute number for Cora and Citeseer is 18.17 and 31.6, respectively. Therefore, we use top 10, 20, 50 to evaluate the performance on these two datasets. Similarly, we use top 3, 5, 10 to evaluate the performance on Steam.

4.4 Profiling

For this profiling task, the generated attributes on Cora, Citeseer and Steam are the probabilities that a node has each attribute dimension. Good generated attributes should assign high probability to the dimensions in which the true attributes are non-zero. Accordingly, taking both recall and ranking into consideration, we use Recall@k and NDCG@k to evaluate attribute generation performance at the attribute level. The results are shown in Table 2. From this table, NeighAggre is among the weakest methods on all three datasets (only VAE, which also relies on neighbour aggregation, scores comparably or lower), and its gap to the learned structure-based methods is largest on Steam: it is not a learning algorithm and cannot generate reliable attributes for attribute-level profiling. In contrast, NANG generates attributes based on the translation knowledge from structure information to attribute information, which is more adaptable and flexible. Indeed, the results in Table 2 show that NANG outperforms the other methods on this attribute-level evaluation. Compared to GCN, NANG gains 4.67 and 3.25 percentage points of Recall@50 on Cora and Citeseer, respectively.

[Figure 2 panels: (a) X, Citeseer; (b) A+X, Citeseer; (c) Recall, Citeseer; (d) X, Pubmed; (e) A+X, Pubmed; (f) Recall, Steam]
Figure 2: Node classification and profiling performance with different ratios of training nodes. (a)(d) show the results for node classification in the "X" setting on Citeseer and Pubmed, respectively. (b)(e) show the results for node classification in the "A+X" setting on Citeseer and Pubmed, respectively. The dotted line is a reference for judging whether the generated attributes enhance the GCN classifier: it represents the case where GCN uses only the structure information "A" for classification. (c)(f) show the results for profiling on Citeseer and Steam, respectively.

4.5 Less Attribute-observed Nodes

The attribute-observed nodes provide the supervision for the node attribute generation task. In some scenarios this supervision is scarce, so it is necessary to see whether NANG can still generate reliable, high-quality node attributes when fewer attribute-observed nodes are available. We conduct an experiment to explore node classification and profiling performance under this condition. The results are shown in Figure 2.

In Figure 2 (a)(d), for node classification when only "X" is used, NANG performs much better than the other methods on Citeseer. The gap is more obvious when there are fewer attribute-observed nodes. On Pubmed, although NeighAggre performs better than NANG when there are more attribute-observed nodes, it is not robust when there are fewer. Figure 2 (b)(e) show the node classification performance when "A+X" is used. In these two panels, the dotted line represents a GCN classifier that uses only "A", which serves as a reference for judging whether the generated attributes enhance the GCN classifier. In the "A+X" setting, NANG clearly outperforms the other methods. Besides, it remains robust when there are fewer attribute-observed nodes, while NeighAggre fails to enhance the GCN classifier. Figure 2 (c)(f) show the profiling performance on Citeseer and Steam. NANG performs better than the other methods when there are fewer attribute-observed nodes, because it learns the translation knowledge between the attribute and structure modalities, which is more adaptive and flexible.

5 Conclusions

In this paper, we propose an adversarial learning method called NANG for the node attribute generation problem. In NANG, we assume that node attribute information and structure information share the same latent factor, which can be translated to different modalities. The implicit distribution of this latent factor is modeled in an auto-encoding Bayes and adversarial learning manner. We further evaluate the quality of generated node attributes at both the node level and the attribute level through practical applications. Empirical results validate the superiority of our method on node classification and profiling for those nodes whose attributes are completely inaccessible.


  • [1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
  • [2] Aleksandar Bojchevski, Oleksandr Shchur, Daniel Zügner, and Stephan Günnemann. Netgan: Generating graphs via random walks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 610–619, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
  • [3] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
  • [4] Nicola De Cao and Thomas Kipf. MolGAN: An implicit generative model for small molecular graphs. In ICML 2018 workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.
  • [5] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pages 3844–3852, 2016.
  • [6] Yanwei Fu, Timothy M Hospedales, Tao Xiang, and Shaogang Gong. Transductive multi-view zero-shot learning. IEEE transactions on pattern analysis and machine intelligence, 37(11):2332–2345, 2015.
  • [7] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • [8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [9] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.
  • [10] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
  • [11] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
  • [12] Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158–1161, 1995.
  • [13] Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, and Eric P. Xing. On unifying deep generative models. In International Conference on Learning Representations, 2018.
  • [14] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  • [15] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In Proceedings of the 35th International Conference on Machine Learning, 2018.
  • [16] Vassilis Kalofolias, Xavier Bresson, Michael Bronstein, and Pierre Vandergheynst. Matrix completion on graphs. arXiv preprint arXiv:1408.1717, 2014.
  • [17] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014.
  • [18] Thomas N Kipf and Max Welling. Variational graph auto-encoders. In NIPS Workshop on Bayesian Deep Learning, 2016.
  • [19] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
  • [20] Maosen Li, Siheng Chen, Xu Chen, Ya Zhang, Yanfeng Wang, and Qi Tian. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2019.
  • [21] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
  • [22] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
  • [23] Alireza Makhzani. Implicit autoencoders. arXiv preprint arXiv:1805.09804, 2018.
  • [24] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian Goodfellow. Adversarial autoencoders. In International Conference on Learning Representations, 2016.
  • [25] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2017.
  • [26] Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127–163, 2000.
  • [27] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In International Conference on Learning Representations, 2013.
  • [28] Galileo Namata, Ben London, Lise Getoor, and Bert Huang. Query-driven active surveying for collective classification. In 10th International Workshop on Mining and Learning with Graphs, 2012.
  • [29] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pages 271–279, 2016.
  • [30] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM, 2014.
  • [31] Victor Garcia Satorras and Joan Bruna Estrach. Few-shot learning with graph neural networks. In International Conference on Learning Representations, 2018.
  • [32] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008.
  • [33] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 974–983. ACM, 2018.
  • [34] Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L. Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in neural information processing systems, pages 4800–4810, 2018.
  • [35] Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. GraphRNN: Generating realistic graphs with deep auto-regressive models. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5708–5717, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
  • [36] Shengjia Zhao, Jiaming Song, and Stefano Ermon. Infovae: Information maximizing variational autoencoders. In Proceedings of the 33rd Association for the Advancement of Artificial Intelligence, 2019.
  • [37] Huangjie Zheng, Jiangchao Yao, Ya Zhang, and Ivor Wai-Hung Tsang. Degeneration in vae: in the light of fisher information loss. ArXiv, abs/1802.06677, 2018.
  • [38] Huangjie Zheng, Jiangchao Yao, Ya Zhang, Ivor Wai-Hung Tsang, and Jia Wang. Understanding vaes in fisher-shannon plane. In Proceedings of the 33rd Association for the Advancement of Artificial Intelligence, 2019.

Appendix A Supplementary Experimental Setup

Dataset Statistics. The statistics of the datasets used are shown in Table 3.

Statistic            Cora          Citeseer      Steam         Pubmed
#nodes               2,708         3,327         9,944         19,717
#edges               10,556        9,228         533,962       88,651
#graph sparsity      0.14%         0.08%         0.53%         0.02%
#atts dim            1,433         3,703         352           500
#avg nnz atts num    18.17         31.6          8.45          -
#class               7             6             -             3
atts form            categorical   categorical   categorical   real-valued
Table 3: Statistics of the four datasets. "atts form" is the attribute type, "avg nnz atts num" is the average number of non-zero attributes per node, and "#class" is the number of node classes in each dataset.
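The graph sparsity row appears to be the edge count divided by the squared node count. The quick check below is a sketch under that assumption, since the formula is not stated in the text. The helper names `sparsity_percent` and `truncate` are hypothetical, and truncating (rather than rounding) to two decimals is a guess at how the table values were produced.

```python
import math

# (number of nodes, number of edges) per dataset, copied from Table 3.
DATASETS = {
    "Cora": (2708, 10556),
    "Citeseer": (3327, 9228),
    "Steam": (9944, 533962),
    "Pubmed": (19717, 88651),
}


def sparsity_percent(num_nodes, num_edges):
    """Edges as a percentage of all N*N adjacency entries (assumed definition)."""
    return 100.0 * num_edges / (num_nodes * num_nodes)


def truncate(value, decimals=2):
    """Drop digits beyond `decimals` without rounding."""
    scale = 10 ** decimals
    return math.floor(value * scale) / scale


for name, (nodes, edges) in DATASETS.items():
    print(f"{name}: {truncate(sparsity_percent(nodes, edges)):.2f}%")
```

Under this assumption the truncated values agree with the four percentages in Table 3. With rounding instead of truncation, Steam would come out as 0.54% rather than 0.53%.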

Parameter Setting. For each dataset, we randomly sample a portion of the nodes with attributes as training data and a further portion as validation data; the remaining nodes are test data whose attributes must be generated. For NeighAggre, we directly use the one-hop neighbours as a node’s neighbours. For all learning-based methods (VAE, GCN, NANG), we set the latent dimension to 64 and the learning rate to 0.005. A dropout rate of 0.5 is used and the maximum number of iterations is 1,000. The Adam optimizer is used to learn the model parameters. Recall that different datasets may have different attribute forms, either categorical or real-valued. For the datasets with categorical attributes (Cora, Citeseer and Steam), a weighted Binary Cross Entropy (BCE) loss is applied, where the weight placed on non-zero samples is calculated from the training node attribute matrix. For Pubmed, which has real-valued attributes, a Mean Squared Error (MSE) loss is used.
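As a rough illustration of the weighted BCE loss described above, the sketch below derives a weight for non-zero targets from a binary training attribute matrix and applies it in the loss. The exact weight formula is not preserved in the text, so the ratio of zero to non-zero entries used here is an assumption, as are the helper names `positive_weight` and `weighted_bce`. Plain Python is used to keep the example self-contained; the paper's implementation uses PyTorch.

```python
import math


def positive_weight(train_attrs):
    """Assumed weight: (#zero entries) / (#non-zero entries) of the training matrix."""
    total = sum(len(row) for row in train_attrs)
    nonzero = sum(1 for row in train_attrs for value in row if value != 0)
    if nonzero == 0:
        raise ValueError("training attribute matrix has no non-zero entries")
    return (total - nonzero) / nonzero


def weighted_bce(probs, targets, pos_weight, eps=1e-12):
    """Mean binary cross entropy with positive targets scaled by pos_weight."""
    losses = []
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        losses.append(-(pos_weight * t * math.log(p) + (1 - t) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


# Toy 2 x 4 binary attribute matrix: 2 non-zero entries and 6 zeros.
train = [[1, 0, 0, 0], [0, 0, 1, 0]]
weight = positive_weight(train)
loss = weighted_bce([0.9, 0.1, 0.2, 0.1], train[0], weight)
print(f"pos_weight={weight:.1f}, loss={loss:.4f}")
```

Because categorical attribute matrices are sparse, up-weighting the non-zero entries keeps the loss from being dominated by the many zeros.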

For our adversarial NANG, we set the generation step to 2 and the discriminator step to 1 for Cora and Citeseer, and the generation step to 5 and the discriminator step to 1 for Steam and Pubmed. The hyper-parameter takes one value for Cora, Citeseer and Steam and a different value for Pubmed. Each experiment is run multiple times and the mean value is reported as the performance. The method is implemented in PyTorch on one computer with one Nvidia TitanX GPU.
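The generation and discriminator step counts can be read as an alternating update schedule. The sketch below is one possible interpretation, assuming the generation side is updated for its full number of steps before each discriminator update. The ordering within a cycle and the names `STEP_CONFIG` and `update_schedule` are hypothetical.

```python
from itertools import islice

# (generation steps, discriminator steps) per dataset, as stated in the text.
STEP_CONFIG = {
    "Cora": (2, 1),
    "Citeseer": (2, 1),
    "Steam": (5, 1),
    "Pubmed": (5, 1),
}


def update_schedule(gen_steps, disc_steps):
    """Yield 'G' or 'D' indefinitely: gen_steps generator updates, then disc_steps discriminator updates."""
    while True:
        for _ in range(gen_steps):
            yield "G"
        for _ in range(disc_steps):
            yield "D"


# First six updates for Cora under the assumed ordering.
first_six = list(islice(update_schedule(*STEP_CONFIG["Cora"]), 6))
print(first_six)
```

In a training loop, each yielded symbol would select which set of parameters receives the next optimizer step.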

Appendix B Latent Embedding Visualization

(a) VAE
(b) GCN
(c) NANG
Figure 3: The t-SNE visualization of test node embeddings on Cora. Each color represents one of seven classes. Note that all methods learn the node embeddings without class information.

The methods considered in this paper (VAE, GCN and NANG) encode node information into a low-dimensional embedding and decode it into node attributes. Good representation ability means that a method learns embeddings in which nearby nodes correspond to similar objects. We therefore visualize the learned node embeddings with t-SNE [22]. Specifically, we sample the latent embeddings of all test nodes on Cora and use t-SNE to reduce them to 2-D. Nodes of the same class are expected to cluster together. None of the methods uses label information during training, so the embeddings shown are learned without class supervision. Figure 3 shows the result.

For VAE with a Gaussian prior (Figure 3 (a)), nodes of different classes are clearly mixed together, meaning that the embedding cannot separate the classes. For GCN (Figure 3 (b)), the nodes appear to be encoded into a narrow, stream-like region in which different nodes overlap. Compared to VAE, this narrow, stream-like shape arises mainly because GCN has no prior assumption and therefore imposes no distributional constraint on the embedding. For NANG (Figure 3 (c)), nodes are clustered well according to their classes. Although a Gaussian prior is imposed on the coding space of both VAE and NANG, NANG lets the attribute modality and the structure modality supplement each other, and so captures more complex patterns in the latent space where VAE fails. NANG therefore yields a better t-SNE visualization.

(a) A+X - Cora
(b) A+X - Citeseer
(c) A+X - Pubmed

(d) Recall - Cora
(e) Recall - Citeseer
(f) Recall - Steam
Figure 4: NANG performance with different values of the cross-reconstruction hyperparameter on node classification under the "A+X" setting and on the profiling task. (a-c) show the node classification results under the "A+X" setting on Cora, Citeseer and Pubmed, respectively. The dotted line labelled "only A" indicates that only the structure information is used, with GCN as the classifier. (d-f) show the profiling results on Cora, Citeseer and Steam, respectively. The dotted line labelled "GCN" uses GCN as the generation method, since it is the most competitive baseline for the profiling task in this paper.

Appendix C Hyperparameter

In NANG, we introduce a hyperparameter that emphasizes the cross-reconstruction stream in the objective function. It is desirable to see how our method responds to this hyperparameter. We therefore evaluate node classification under the "A+X" setting and profiling performance for different values of it.

Figure 4 (a-c) shows the results for node classification under the "A+X" setting on Cora, Citeseer and Pubmed, respectively, and Figure 4 (d-f) shows the results for profiling on Cora, Citeseer and Steam, respectively. These figures show that the hyperparameter is important for our method, since we rely on the cross-reconstruction stream to generate node attributes. Figure 4 (a-c) shows that a large value is needed to generate node attributes of high enough quality to improve on a GCN classifier that uses only "A". In Figure 4 (d-f), NANG mostly outperforms GCN, the most competitive baseline in this paper. On Steam, however, too large a value deteriorates performance, because it weakens the importance of distribution matching. The hyperparameter should therefore be chosen per dataset.
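The trade-off described above can be illustrated numerically. The sketch assumes the objective is the sum of a self-reconstruction term, a weighted cross-reconstruction term, and an adversarial (distribution matching) term. This additive form, the name `lam`, and the loss magnitudes are illustrative assumptions and are not taken from the paper's objective.

```python
def total_objective(self_loss, cross_loss, adv_loss, lam):
    """Assumed additive objective with weight `lam` on the cross-reconstruction term."""
    return self_loss + lam * cross_loss + adv_loss


def adversarial_share(self_loss, cross_loss, adv_loss, lam):
    """Fraction of the total objective contributed by the adversarial term."""
    return adv_loss / total_objective(self_loss, cross_loss, adv_loss, lam)


# Made-up loss magnitudes, held fixed while the weight is varied.
SELF_LOSS, CROSS_LOSS, ADV_LOSS = 1.0, 1.0, 0.5

shares = {
    lam: adversarial_share(SELF_LOSS, CROSS_LOSS, ADV_LOSS, lam)
    for lam in (0.1, 1.0, 10.0, 100.0)
}
for lam, share in shares.items():
    print(f"lam={lam:>5}: adversarial share = {share:.3f}")
```

As `lam` grows, the adversarial term accounts for a shrinking fraction of the total, which is one way to read the claim that a very large weight weakens distribution matching.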

(a) Train joint loss
(b) Train GANs loss

(c) Validation metric
(d) MMD distance
Figure 5: Visualization of the training process for NANG on Cora. (a) The joint loss, i.e. the sum of the self-reconstruction and cross-reconstruction streams, during training. (b) The GAN loss during training. (c) Validation Recall@10 over the training steps. (d) The train and validation MMD distance between the learned latent distribution and the Gaussian prior.

Appendix D Learning Process Visualization

To understand the learning process of our method, we plot learning curves of the train joint loss, train GAN loss, validation metric and MMD distance. The results are shown in Figure 5.
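The validation metric tracked here is Recall@10. A minimal sketch of Recall@k for a single node follows, assuming binary ground-truth attributes and real-valued generated scores. The definition used (hits among the k highest-scored attributes divided by the number of true attributes) is a common one and is assumed rather than quoted from the paper; `recall_at_k` is a hypothetical helper name.

```python
def recall_at_k(scores, truth, k=10):
    """Share of a node's true attributes found among its k highest-scored attributes."""
    true_idx = {i for i, t in enumerate(truth) if t}
    if not true_idx:
        return 0.0  # no true attributes to recover
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if i in true_idx)
    return hits / len(true_idx)


# Toy node with four attributes, two of which are truly present.
generated = [0.9, 0.1, 0.8, 0.3]
ground_truth = [1, 0, 0, 1]
print(recall_at_k(generated, ground_truth, k=2))
```

A dataset-level score would then be the mean of this value over all test nodes.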

Figure 5 shows that both the train joint loss and the train GAN loss converge, and that the validation Recall@10 increases steadily before converging. The train and validation MMD distances are shown in Figure 5 (d). During training, the two encoders encode the input information and align it to the same latent factor, whose distribution is aggregated from the posteriors. The decreasing MMD distance indicates that this implicit distribution matches the Gaussian prior step by step.
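The MMD distance in Figure 5 (d) compares latent codes against samples from the Gaussian prior. The sketch below computes a biased estimate of the squared MMD with an RBF kernel in plain Python. The kernel choice, the bandwidth `sigma`, and the function names are assumptions, since the text does not specify how the MMD was computed.

```python
import math
import random


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel between two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs (x, y)."""
    total = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate: E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]."""
    return (mean_kernel(xs, xs, sigma) + mean_kernel(ys, ys, sigma)
            - 2.0 * mean_kernel(xs, ys, sigma))


rng = random.Random(0)


def gaussian_samples(n, dim, mean=0.0):
    """Draw n vectors with independent unit-variance Gaussian coordinates."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


prior = gaussian_samples(100, 2)            # stand-in for the Gaussian prior
close = gaussian_samples(100, 2)            # codes that already match the prior
far = gaussian_samples(100, 2, mean=3.0)    # codes far from the prior

print(f"matched: {mmd_squared(prior, close):.4f}")
print(f"shifted: {mmd_squared(prior, far):.4f}")
```

Codes drawn from the same distribution as the prior give a much smaller value than shifted codes, which is the behaviour a decreasing MMD curve reflects.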