Networks are ubiquitous in nature and society, including social networks, information networks, biological networks and various technological networks. The complex structure of networks poses a big challenge for data mining tasks dealing with networks. To combat this challenge, researchers resort to network embedding, i.e., learning a low-dimensional representation for each node that captures and preserves the network structure (Cui et al., 2018; Grover and Leskovec, 2016; Perozzi et al., 2014). With the learned representations of nodes, many downstream mining and prediction tasks on networks, e.g., node classification and link prediction, can be easily addressed using standard machine learning tools.
Many network embedding methods, unsupervised or supervised, have been proposed and successfully applied to node classification and link prediction (Cui et al., 2018; Tang et al., 2015; Ou et al., 2016; Cao et al., 2015). These methods learn representations of nodes by leveraging structural proximity or structural similarity among nodes. However, in many real-world networks, nodes are usually associated with rich attributes, e.g., the content of articles in citation networks (Le and Lauw, 2014) and user profiles in social networks (Qi et al., 2012). This motivates researchers to study the problem of attributed network embedding.
Attributed network embedding aims to learn a sole low-dimensional representation for each node by simultaneously considering the information manifested in both the network structure and the node attributes (Liao et al., 2018; Huang et al., 2017a, b). Existing methods for attributed network embedding mainly fall into two paradigms. Methods in the first paradigm learn separate representations for each node according to network structure and node attributes respectively, and then concatenate them into a single representation (Liao et al., 2018; Gao and Huang, 2018). Methods in the other paradigm attempt to directly obtain a single representation for each node by translating node attributes into network structure or vice versa (Yang et al., 2015; Liu et al., 2018). However, methods in both paradigms have their drawbacks: methods in the first paradigm neglect the correlation between these two types of information, while the second paradigm assumes a strong dependence between node attributes and network structure. Thus we still lack an effective method for attributed network embedding.
In this paper, we propose a novel perspective to address attributed network embedding. Unlike previous methods, we attempt to learn node representations by modeling the attributed local subgraph of each node. A node’s attributed local subgraph is defined as the subgraph centered at the target node together with the associated node attributes. This perspective transforms the problem of learning node representations into the problem of modeling the context information manifested in both network structure and node attributes. Motivated by this perspective, we propose a novel graph auto-encoder framework, namely GraphAE, for attributed network embedding. GraphAE consists of a graph encoder and a graph decoder. In the graph encoder, the target node aggregates the attributes diffused from the nodes in its local subgraph to generate its own representation, while in the graph decoder, each node diffuses its representation to the nodes in its local subgraph to help reconstruct their attribute information. Our proposed framework generates a node’s representation by capturing both the structural and attribute information manifested in its attributed local subgraph, and thus has a high capacity to learn good node representations for attributed networks.
To evaluate the performance of the proposed GraphAE framework, we conduct extensive experiments on two downstream tasks, i.e., node classification and link prediction. Experimental results on real-world datasets demonstrate that our proposed method outperforms state-of-the-art network embedding approaches on both tasks.
2. Related Work
Our proposed framework works in an encoder-decoder manner to learn better embeddings for attributed networks. To simultaneously capture the network structure and node attribute information manifested in the local attributed subgraph, graph convolutional networks are adopted in both the encoder and decoder layers. In this section, we provide a brief introduction to related work on network embedding and graph convolutional networks.
2.1. Network Embedding
Network embedding technology, which aims to learn low-dimensional embeddings for the nodes in a network, actually evolved from dimension reduction algorithms (Cui et al., 2018). Some early works first leverage feature similarity to build an affinity graph and then treat its eigenvectors as network representations, such as LLE (Roweis and Saul, 2000) and Isomap (Tenenbaum et al., 2000). Recently, more network embedding methods leveraging the structural proximity or structural similarity among nodes have been proposed. Structural-proximity-based methods try to preserve different orders of proximity among nodes when learning node embeddings, varying from first-order proximity (Man et al., 2016) and second-order proximity (Tang et al., 2015) to high-order proximity (Cao et al., 2015; Wang et al., 2017; Ou et al., 2016). Structural-similarity-based approaches (Henderson et al., 2012; Ribeiro et al., 2017; Donnat et al., 2018) take into account the structural roles of nodes, restricting nodes with similar structural roles to possess similar representations. Moreover, some deep models (Cao et al., 2016; Wang et al., 2016) have been proposed to account for more complex structural properties.
However, besides structural properties, nodes in real-world networks are usually associated with rich labels and attributes, which motivates the problem of attributed network embedding (Yang et al., 2015; Liao et al., 2018; Le and Lauw, 2014; Zhu et al., 2007; Qi et al., 2012). Some approaches (Tu et al., 2016; Huang et al., 2017b, a) only take label information into consideration, while others utilize more detailed attribute information. TADW (Yang et al., 2015) obtains node embeddings by decomposing the adjacency matrix, with the attribute matrix fixed as a factor. DANE (Gao and Huang, 2018) leverages two separate auto-encoders to learn structural and attribute representations of nodes respectively, and concatenates the two as the final representations, with consistency and complementarity regularization in the hidden layer.
2.2. Graph Convolutional Network
Graph convolutional neural networks (GCNNs), which generalize CNNs to non-Euclidean domains (Bronstein et al., 2017), have shown great success in a variety of tasks. Existing GCNNs can be roughly categorized into two kinds, i.e., spectral GCNNs (Bruna et al., 2014; Defferrard et al., 2016) and spatial GCNNs (Bronstein et al., 2017). Spectral GCNNs define the convolution in the spectral domain: they first transform the signal into the spectral domain and then apply filters on it (Bruna et al., 2014). Spatial GCNNs view graph convolution as a “patch operator” that constructs a new feature vector for each node using its neighborhood’s information. The GCN introduced by Kipf et al. (Kipf and Welling, 2017) uses first-order Chebyshev polynomials to approximate the filter of a spectral GCNN (Bruna et al., 2014). Under these circumstances, the convolution operator equals the weighted sum of neighboring nodes’ features, with the weights defined by normalized edge weights. As the weights in GCN are determined only by the network structure, Veličković et al. (Veličković et al., 2018) propose the Graph Attention Network (GAT) to learn the weights by structure-masked self-attention.
The majority of these methods do not scale to large graphs or are designed for whole graphs. GraphSAGE (Hamilton et al., 2017) instead learns a function that generates embeddings by sampling and aggregating features from a node’s local neighborhood, which enables inductive node embedding for large graphs. Recently, some methods optimize the sampling strategy (Chen et al., 2018b, a) so that GCNs can be better applied to large-scale networks.
3. The Proposed Framework
In this section, we first define the notations used in this paper and then introduce the architecture and detailed implementation of our proposed graph auto-encoder framework.
3.1. Problem Definition
We define a network as $G = (V, E)$, where $V$ denotes the node set with size $n$ and $E$ denotes the edge set. The network is represented by an adjacency matrix $A$, where $A_{ij} = 1$ if $(v_i, v_j) \in E$ and $A_{ij} = 0$ otherwise. Attributes of nodes in the network are represented by an attribute matrix $X \in \mathbb{R}^{n \times m}$, where $m$ is the dimension of node attributes. $x_i$, the $i$-th row of $X$, represents the attribute vector of node $v_i$. Attributed network embedding aims to learn low-dimensional representations from the adjacency matrix $A$ and the attribute matrix $X$, such that the learned representations preserve both network structure and node attributes.
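As a concrete illustration of the inputs, the adjacency matrix $A$ and attribute matrix $X$ of a toy attributed network might be built as follows (a minimal sketch; the edge list and attribute values are invented for illustration):

```python
import numpy as np

# Toy attributed network (values invented for illustration):
# n = 4 nodes, m = 3 attribute dimensions, undirected edge set E.
edges = [(0, 1), (1, 2), (2, 3)]
n, m = 4, 3

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0        # A_ij = 1 iff (v_i, v_j) in E

X = np.array([[1.0, 0.0, 0.0],     # row x_i: attribute vector of node v_i
              [0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
```

The embedding task then maps the pair $(A, X)$ to an $n \times d$ representation matrix with $d \ll m$.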
3.2. Graph Auto-Encoder Framework
Figure 1. a) The overall architecture: graph decoder reconstructs the attribute matrix from the hidden representation $Z$, which is generated by graph encoder with the network structure and node attributes as input; the intermediate outputs of the $k$-th encoder and decoder layers are also shown. b) The process of a single layer of graph encoder for two target nodes. c) The process of a single layer of graph decoder, in which nodes propagate their representations to help neighboring nodes reconstruct their attributes.
In this paper, we propose a graph auto-encoder framework for attributed network embedding. The framework consists of two main parts, i.e., graph encoder and graph decoder. Graph encoder generates the hidden representation $Z$ with the attribute matrix $X$ serving as input, while graph decoder tries to reconstruct the attribute matrix from the hidden representation $Z$. Both graph encoder and graph decoder characterize the diffusion of attribute information over the network. The whole architecture of the framework is shown in Figure 1a).
A single layer of graph encoder and graph decoder are shown in Figure 1b) and Figure 1c) respectively. Graph encoder learns a sole embedding for the target node by aggregating the attribute information from nodes in its attributed local subgraph. In graph decoder, each node tries to propagate its representation to nodes in its local subgraph to help them reconstruct their attribute information. The network embedding learned by such encoder-decoder framework naturally captures both the structural information and the attribute information manifested in each node’s attributed local subgraph, thus better serving applications on attributed networks.
3.2.1. Graph Encoder
In this section, we introduce the detailed implementation of graph encoder. Graph encoder consists of a stack of single encoder layers, each of which aggregates the attribute information from the neighboring nodes of a target node. By stacking multiple encoder layers, graph encoder is able to aggregate the attribute information from the multi-hop ego-network of the target node, which is taken as the target node’s attributed local subgraph in this paper.
A single encoder layer can be formalized as follows:

$$h_i^{(k)} = \sigma\Big(\sum_{j \in N(i)} \alpha_{ij} W^{(k)} h_j^{(k-1)}\Big)$$

where $h_i^{(k)}$ is the hidden representation of node $v_i$ in the $k$-th layer, whose dimension we denote by $d^{(k)}$. $N(i)$ is the set of node $v_i$’s neighbors, including node $v_i$ itself in our experiments. $\alpha_{ij}$ is the aggregation weight and measures how important node $v_j$ is to node $v_i$, and the transformation with weight matrix $W^{(k)}$ and nonlinear activation $\sigma$ is applied on every node to extract effective high-level features from the inputs. The key to designing a single encoder layer lies in the definition of the aggregation weight $\alpha_{ij}$, which is implemented via an attention mechanism in this paper.
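A minimal NumPy sketch of one such layer, under simplifying assumptions: the learned attention weights $\alpha_{ij}$ are stood in for by uniform edge-based weights $1/|N(i)|$, and `encoder_layer`, the toy graph and all dimensions are invented for illustration:

```python
import numpy as np

def encoder_layer(H, A, W, sigma=np.tanh):
    """One encoder layer h_i' = sigma(sum_{j in N(i)} alpha_ij W h_j),
    with the learned attention weights replaced by uniform edge-based
    weights alpha_ij = 1 / |N(i)|; N(i) includes node i itself."""
    A_hat = A + np.eye(A.shape[0])                     # add self-loops
    alpha = A_hat / A_hat.sum(axis=1, keepdims=True)   # row-normalized weights
    return sigma(alpha @ H @ W)                        # aggregate, then transform

# Tiny example: 4-node path graph, 3-d attributes, 2-d hidden layer.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
H0 = np.random.default_rng(0).normal(size=(4, 3))
W1 = np.random.default_rng(1).normal(size=(3, 2))
H1 = encoder_layer(H0, A, W1)
```

Stacking $K$ such layers lets each node aggregate attributes from its $K$-hop ego-network.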
Attention based aggregation mechanism. We adopt a shared attention mechanism to effectively measure the aggregation weight between two given nodes, with the node attributes serving as input. We use the same attention mechanism as in GAT (Veličković et al., 2018). This mechanism is parameterized by a weight vector $a$ (whose dimension is twice that of the input vector), followed by a nonlinear activation. For each node $v_i$, we only compute $\alpha_{ij}$ for the nodes $v_j \in N(i)$, i.e., the neighbors of node $v_i$ (Veličković et al., 2018). The attention mechanism can be expressed as:

$$\alpha_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}\big(a^{\top}[W h_i \,\|\, W h_j]\big)\big)}{\sum_{l \in N(i)} \exp\big(\mathrm{LeakyReLU}\big(a^{\top}[W h_i \,\|\, W h_l]\big)\big)}$$

where $^{\top}$ denotes transposition and $\|$ represents the concatenation operation. In our experiments, we adopt LeakyReLU with the negative-input slope used in (Veličković et al., 2018) as the nonlinear activation. We also employ multi-head attention to stabilize the learning process of self-attention and to capture multiple types of relationships between nodes. In our experiments, we concatenate the representations learned by different heads in the hidden layers, and average them in the final layer of graph encoder.
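The attention computation can be sketched as follows; this is a dense double-loop illustration rather than an efficient implementation, and `attention_weights`, the LeakyReLU slope and all shapes are assumptions of the sketch:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    # slope is a placeholder for the (unstated here) negative-input slope
    return np.where(x > 0, x, slope * x)

def attention_weights(H, A, W, a):
    """alpha_ij = softmax over j in N(i) of LeakyReLU(a^T [W h_i || W h_j])."""
    Z = H @ W                              # projected features W h_i
    n = Z.shape[0]
    A_hat = A + np.eye(n)                  # N(i) includes node i itself
    e = np.full((n, n), -np.inf)           # -inf masks non-neighbors
    for i in range(n):
        for j in range(n):
            if A_hat[i, j] > 0:
                e[i, j] = leaky_relu(a @ np.concatenate([Z[i], Z[j]]))
    e = e - e.max(axis=1, keepdims=True)   # numerically stable softmax
    exp_e = np.exp(e)                      # exp(-inf) = 0 outside N(i)
    return exp_e / exp_e.sum(axis=1, keepdims=True)

# Tiny example on a 4-node path graph.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
H = np.random.default_rng(0).normal(size=(4, 3))
W = np.random.default_rng(1).normal(size=(3, 2))
a = np.random.default_rng(2).normal(size=4)   # length 2 * d'
alpha = attention_weights(H, A, W, a)
```

Each row of `alpha` sums to one and is zero outside the corresponding node's neighborhood, matching the masked softmax above.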
Attention based aggregation mechanism can capture both the structural proximity and the attribute proximity between pairs of nodes, allowing better modeling for the attributed local subgraph of a target node.
3.2.2. Graph Decoder
Similar to graph encoder, graph decoder consists of multiple single decoder layers. The decoder layer in a conventional auto-encoder framework decompresses the hidden representations and makes the output closely match the input data, which regularizes the hidden representation to contain rich information about the raw input. In graph encoder, a node obtains its representation by aggregating the attribute information from its local subgraph, which makes it necessary for the node to propagate its representation to the nodes in its local subgraph to help reconstruct their attribute information. In fact, all nodes aggregating from their neighbors is identical to all nodes propagating to their neighbors from the global view of the whole network, which allows graph decoder to adopt the same architecture as graph encoder. Taking Figure 1c) as an example, a node propagates its representation to its neighbors with the corresponding attention weights; but from the view of a neighboring node, this operation is equivalent to aggregating the representation from the propagating node with the same attention weight. This motivates us to adopt the graph attention layer to build the graph decoder layer. In our experiments we stack the same number of graph attention layers as in graph encoder, and the hidden units are symmetric to those of graph encoder. The overall architecture of graph decoder is shown on the right side of Figure 1a), and a single layer of graph decoder is shown in Figure 1c).
3.3. Loss function
We directly measure the Euclidean distance between the reconstructed attribute matrix $\hat{X}$ (the output of graph decoder) and the original input attributes $X$ as the loss function, which is formalized as follows:

$$\mathcal{L}_{recon} = \sum_{i=1}^{n} \|\hat{x}_i - x_i\|_2^2$$

We add L2 regularization on the parameters, and the overall loss function of our model can be formalized as follows:

$$\mathcal{L} = \mathcal{L}_{recon} + \lambda \sum_{k} \|W^{(k)}\|_F^2$$

where $\lambda$ is the hyper-parameter used to control the weight of the L2 regularization of the parameters. All the parameters in our framework are trained by minimizing $\mathcal{L}$ using gradient descent.
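A sketch of the overall objective; the helper name `graphae_loss` is hypothetical, `weights` stands for the list of trainable weight matrices, and the default `lam` is only a placeholder for the dataset-dependent hyper-parameter:

```python
import numpy as np

def graphae_loss(X, X_hat, weights, lam=1e-4):
    """Reconstruction term sum_i ||x_hat_i - x_i||^2 plus
    lam * sum of squared Frobenius norms of the parameters."""
    recon = np.sum((X_hat - X) ** 2)                 # Euclidean reconstruction error
    reg = sum(np.sum(W ** 2) for W in weights)       # L2 regularization
    return recon + lam * reg
```

In practice this scalar would be minimized with a gradient-based optimizer such as Adam.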
3.4. Scalability to Large Graphs
A graph attention layer takes the whole graph (adjacency matrix $A$ and node attribute matrix $X$) as input, and both the memory and time cost of its computation are related to the size of the graph. As a result, this layer cannot be directly applied to large-scale networks. Mini-batch training avoids feeding in the whole graph, but it is still computationally costly, as the aggregation operator depends on many neighbors. Specifically, when we stack $K$ layers of graph attention networks as graph encoder, the $K$-order neighbors of a node must be attached to the input in order to learn its embedding, and in real-world networks the neighborhood soon covers the whole network as the order increases. In order to reduce the complexity of the computation (i.e., the number of neighbors), we randomly sample a fixed number of neighbors for each node to update its representation in each epoch. All nodes needed in the computation are sampled first, and their number is unrelated to the size of the graph, so our model is scalable to large graphs. The process is shown in Algorithm 1.
Lines 1-7 describe the process of sampling all nodes used in the computation, maintaining the set of nodes that are used at each layer $k$. We adopt the same sampling strategy as GraphSAGE (Hamilton et al., 2017): a uniform sampling function at layer $k$, i.e., for each node we randomly sample a number of its neighbor nodes from a uniform distribution. Furthermore, we sample nodes with replacement in cases where the sample size is larger than the node’s degree. Lines 8-13 describe the process of computing representations for all nodes given the input set, while lines 15-21 describe the process of reconstructing node attributes from the hidden representations.
Different from the algorithm mentioned above, the attention weights and the aggregation operator are only applied to the subset of a node’s neighbors that appear in the sampled set. As Algorithm 1 shows, the $k$-order proximity of target nodes is still kept, while the number of nodes used to learn the representations of target nodes is reduced to an acceptable scale.
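The sampling stage (roughly lines 1-7 of Algorithm 1) can be sketched as below; `sample_neighbors`, the adjacency-list input format and the sample size are assumptions of this sketch, not the paper's exact procedure:

```python
import random

def sample_neighbors(adj_list, targets, num_layers, sample_size, seed=0):
    """Uniform neighbor sampling in the spirit of GraphSAGE: B[k] holds the
    nodes needed at layer k. Sampling is with replacement, so it also works
    for nodes whose degree is smaller than `sample_size`."""
    rng = random.Random(seed)
    B = [set() for _ in range(num_layers + 1)]
    B[num_layers] = set(targets)               # top layer holds the target nodes
    for k in range(num_layers, 0, -1):
        B[k - 1] = set(B[k])
        for v in B[k]:
            sampled = [rng.choice(adj_list[v]) for _ in range(sample_size)]
            B[k - 1].update(sampled)           # nodes needed one layer earlier
    return B
```

Because the size of each `B[k]` is bounded by the number of targets times powers of `sample_size`, the cost is independent of the total graph size.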
4. Experiments
We evaluate our proposed model on real-world datasets on two commonly adopted tasks, i.e., link prediction and node classification. Link prediction tests the ability of the node representations to reconstruct the network structure and predict future links, while node classification verifies whether the node embeddings learned by our model are effective for downstream tasks. Moreover, we also provide a detailed analysis of the performance of our proposed model.
4.1. Datasets
We conduct experiments on four real-world datasets, i.e., Cora (Yang et al., 2015; Liu et al., 2018; Gao and Huang, 2018), Citeseer (Yang et al., 2015; Liu et al., 2018; Gao and Huang, 2018), Wiki (Yang et al., 2015; Liu et al., 2018; Gao and Huang, 2018) and Pubmed (Gao and Huang, 2018). Cora, Citeseer and Pubmed are citation networks where nodes are articles and edges indicate citations between articles. In these three datasets, citation relationships are viewed as undirected edges for simplicity. The attributes associated with the nodes are extracted from the title and abstract of each article and are represented as sparse bag-of-words vectors; stop words and low-frequency words are removed in preprocessing. The Wiki dataset is a web page network, where nodes represent web pages and edges are hyperlinks among web pages. The text on the web pages is processed in a similar way as in the other three datasets to extract the attributes. Each node in the four datasets has exactly one label, indicating which class the node belongs to. Statistics of these datasets, including the number of nodes (Nodes), number of edges (Edges), number of categories (Classes) and the dimension of the attributes (Features), are summarized in Table 1.
4.2. Experiments Set-up
4.2.1. Model set-up
In experiments, the number of layers in graph encoder is set to 2. The dimensions of the hidden representations in the two encoder layers are set to 128 and 64 respectively, and multi-head attention is used in both encoder layers. We stack two decoder layers for graph decoder: the first decoder layer has 128 hidden units with multi-head attention, while the dimension of the output of the second decoder layer is set to the dimension of the input attributes, with a single attention head. We also add dropout and L2 regularization to prevent overfitting, and train our models using Adam with a learning rate of 0.001. All weights are initialized with Glorot initialization (Glorot and Bengio, 2010), which brings substantially faster convergence.
We compare our model with the following baselines at both the link prediction and node classification tasks. All the baselines fall into three categories, namely “Attributes-only”, “Structure-only” and “Attributes+Structure”. Models in the “Attributes-only” group leverage node attribute information only to extract node representations, from which we select SVD and auto-encoder as our baselines. “Structure-only” models consider structure information only, i.e., preserving structural proximity in the embedding space while ignoring attribute information; in this group we choose DeepWalk and SDNE as our baselines. Methods in the “Attributes+Structure” group capture both node attributes and structural proximity simultaneously, and we consider several recent state-of-the-art algorithms as our baselines. A detailed description of our baselines follows:
Auto-encoder (AE): AE (Hinton and Salakhutdinov, 2006) is the conventional auto-encoder model with only node attributes as input. The number of hidden units is set the same as in GraphAE.
DeepWalk (DW): DW (Perozzi et al., 2014) learns embeddings using structural information only: it learns node embeddings from a collection of random walks using skip-gram with hierarchical softmax. As for the parameters, the number of random walks is 10, and the number of vertices per walk, the window size and the embedding dimension are set accordingly.
SDNE: SDNE (Wang et al., 2016) is a deep model that captures both the first-order and second-order proximity of nodes in the embeddings, with only structure information being considered. The structure of the hidden units in SDNE is set the same as in GraphAE, and its hyper-parameters are tuned by grid search on the validation set.
DW+SVD: DW+SVD concatenates the representations learned by DeepWalk and SVD.
TADW: TADW (Yang et al., 2015) utilizes both network structure and text information to learn embeddings; the dimension of the representations and the coefficient of the regularization term are set accordingly.
GAE/VGAE: GAE and VGAE (Kipf and Welling, 2016) can be viewed as weakened variants of our model, as they replace attention based aggregation with edge based aggregation and remove graph decoder, aiming only to reconstruct the network structure. Hyper-parameters are set the same as in their paper. We train the models for a maximum of 200 iterations using Adam (Kingma and Ba, 2015).
STNE: STNE (Liu et al., 2018) is a sequence translation model that translates the attributes associated with nodes into their identities, with structure information encoded in random walk paths. For all the datasets, we generate 10 random walks starting at each node, and the length of the walks is set to 10. For Cora, Citeseer and Wiki, which are used in (Liu et al., 2018), we use the same model architecture and hyper-parameters as in (Liu et al., 2018). For Pubmed, the neural network has 9 layers with dropout.
4.3. Link Prediction
In this section, we evaluate the ability of the learned embeddings to reconstruct the network and predict future connections via link prediction. We generate the dataset as many other works do (Grover and Leskovec, 2016; Wang et al., 2016; Kipf and Welling, 2016): we split the edges in the network according to the ratio of 85%, 5% and 10% as positive instances for training, validation and testing, respectively. For the test set, we add negative samples by randomly sampling unconnected node pairs, keeping the ratio of positive to negative samples at 1:1. After obtaining the embedding for each node, we compute the link probability as the inner product of the two embeddings on the test data. We adopt the area under the ROC curve (AUC) and the average precision from prediction scores (AP), as implemented by sklearn (Pedregosa et al., 2011), as the evaluation metrics. We report the AUC and AP measures of the baseline models and ours in Table 2, where we bold the best results and underline the next best results. We summarize the following observations and analyses:
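The scoring step can be sketched as follows; the paper uses sklearn's metric implementations, so the pairwise-ranking AUC below is only a self-contained stand-in, and both helper names are invented:

```python
import numpy as np

def link_scores(Z, pairs):
    # Score a candidate link (i, j) by the inner product z_i . z_j.
    return np.array([Z[i] @ Z[j] for i, j in pairs])

def auc(scores, labels):
    # AUC as the fraction of (positive, negative) score pairs ranked
    # correctly, with ties counted as half; equivalent in expectation
    # to sklearn's roc_auc_score.
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    hits = sum(float(p > q) + 0.5 * float(p == q) for p in pos for q in neg)
    return hits / (len(pos) * len(neg))
```

A higher inner product between two embeddings thus translates directly into a higher predicted link probability.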
“Attributes-only” methods, especially SVD, achieve comparable or better results than “Structure-only” methods on all datasets. The reason is that all these datasets are assortative networks, in which nodes with similar attributes are more likely to connect with each other.
We also observe that “Attributes+Structure” methods, which incorporate both node attributes and network structure information, improve the link prediction performance. TADW, GAE and VGAE obtain better results than the other “Attributes+Structure” methods. Their superiority may result from the fact that they take reconstructing the adjacency matrix as their objective and directly optimize it, which is highly related to the link prediction task.
Our GraphAE model achieves relatively significant improvements in AUC and AP over the baselines on all four datasets, as shown in Table 2. Our model incorporates node attributes and network structure in a unified way and captures high-order proximity in graph encoder, thus achieving better results.
4.4. Node Classification
In this section, we conduct experiments on node classification to demonstrate the effectiveness of the learned embeddings for downstream tasks. All node attributes and edges are observed when learning the embeddings. After learning the embedding for each node, a logistic regression classifier (LR) with L2 regularization is used to classify the nodes into the different labels. We use the LR package provided by sklearn (Pedregosa et al., 2011) with default parameters. We randomly sample a certain number of labeled nodes as training data and use the rest as the test set. We repeat the experiments 10 times and report the mean results. To conduct a comprehensive evaluation, we vary the percentage of labeled nodes used in training from 10% to 50%. Furthermore, we employ Macro-F1 and Micro-F1 as the metrics to evaluate the classification results. All hyper-parameters used for learning the embeddings are set the same as in the link prediction experiment in Section 4.3. The classification results are shown in Tables 3, 4, 5 and 6 respectively; the best results are boldfaced and the next best are underlined. From these results, we have the following observations and analysis:
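The two metrics can be computed from per-class and pooled counts; the sketch below (hypothetical `f1_scores` helper) mirrors what sklearn's `f1_score` with `average='macro'` and `average='micro'` reports for single-label data:

```python
import numpy as np

def f1_scores(y_true, y_pred, num_classes):
    """Macro-F1 (unweighted mean of per-class F1) and Micro-F1 (computed
    from pooled counts; equal to accuracy for single-label data)."""
    f1s = []
    tp_all = fp_all = fn_all = 0
    for c in range(num_classes):
        tp = int(np.sum((y_pred == c) & (y_true == c)))
        fp = int(np.sum((y_pred == c) & (y_true != c)))
        fn = int(np.sum((y_pred != c) & (y_true == c)))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    macro = sum(f1s) / num_classes
    micro = 2 * tp_all / (2 * tp_all + fp_all + fn_all)
    return macro, micro
```

Macro-F1 weights every class equally regardless of size, so the two metrics diverge on class-imbalanced datasets.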
“Structure-only” methods outperform “Attributes-only” methods on Cora, achieve comparable results on Pubmed, and perform worse on the other two datasets. The characteristics of the datasets are the main reason for this phenomenon. The documents in the Wiki network contribute more to the classification of nodes because its hyperlink relationships are loose: web pages belonging to different categories still have a high probability of being hyperlinked. Documents in both Citeseer and Wiki contain more words than those in Cora, so “Attributes-only” methods perform better on them. The Cora network has a high edge density and fewer words per document, so “Structure-only” methods obtain better results on this dataset.
Well-designed attributed network embedding methods (TADW, DANE, STNE) perform better than both “Attributes-only” and “Structure-only” methods, because these two kinds of information describe different aspects of the same node and provide complementary information. Unfortunately, simply concatenating these two kinds of information may not improve the performance: “SVD+DW” obtains worse node classification results than SVD on the Citeseer and Wiki datasets, which demonstrates that simple concatenation is not sufficient to capture the interaction between these two types of information. GAE and VGAE get poor results on Wiki, which can be explained by the design of these two models: the Wiki network has a higher edge density than the other networks, and since the aim of GAE and VGAE is to reconstruct the observed edges, they may overfit the observed edges and introduce noise.
Our model outperforms all compared baselines on Cora, Citeseer and Pubmed. On the Wiki network, our model achieves results comparable to STNE and still outperforms the other baselines. Our model captures the attribute information and structure information in a unified way and adopts the attention mechanism to enhance proximity modeling, thus better utilizing these two complementary types of information.
Our model GraphAE also outperforms all compared baselines when fewer labeled nodes are available for training. As we can see in Tables 2, 3, 4, 5 and 6, the results of almost all the baselines (except GAE and VGAE) drop quickly when fewer labeled nodes are used in training. The reason is that these baselines do not make good use of neighboring nodes to preserve proximities in the embedding space. By leveraging graph encoder, our model smooths the representation of each node with those of its neighbors, thus obtaining better results especially when label information is scarce. Although the edges in Wiki are not reliable, our model still gets better results on the Wiki data. This is because our model better leverages node attributes: it learns more accurate aggregation weights through the attention mechanism and preserves attribute proximity with graph decoder.
4.5. Analysis of our model
To comprehensively analyze the performance of our proposed model, we provide a detailed illustration of the differences and relations between our model and some other auto-encoder frameworks. Moreover, the effectiveness of the different components of our model is also analyzed in this section.
4.5.1. Relation to some baselines
AE, SDNE, GAE and DANE are four auto-encoder based frameworks used for network embedding, so we analyse their relationships with GraphAE in this section. The auto-encoder leverages only attribute information, while SDNE uses only structure information when learning embeddings. In contrast, GraphAE models network structure and node attributes simultaneously. GraphAE outperforms them in both the link prediction and node classification tasks, which demonstrates that both node attributes and network structure are essential information for attributed network embedding.
GAE and VGAE (Kipf and Welling, 2016) are special variants of our model, as they replace the attention based aggregation mechanism with an edge based one and take network structure reconstruction as their objective. We find that GraphAE gets better results than GAE and VGAE in all downstream tasks across all the datasets. This demonstrates that graph decoder and attention based aggregation help our model better capture the essential information of the inputs, which is useful for downstream tasks. DANE leverages two separate auto-encoders to learn structural representations and attribute representations of nodes respectively, and concatenates the two as the final representations, with consistency and complementarity regularization in the hidden layer. Our model also outperforms DANE in both downstream tasks, which demonstrates that GraphAE better captures the context information manifested in both node attributes and network structure.
Recently, Pan et al. proposed ARGA (Pan et al., 2018), which improves GAE by enforcing the latent representation to match a prior distribution via an adversarial training scheme. This idea could also be added to our model to further enhance the embeddings.
4.5.2. Influence of attention mechanism and graph decoder
In this section we compare GraphAE with two of its variants to verify the effectiveness of the attention based aggregation weighting mechanism and of graph decoder. For the first purpose, we replace the attention based aggregation weighting mechanism with an edge based one, in which the aggregation weight $\alpha_{ij}$ is the normalized edge weight. We mark this baseline as “-attention”. To inspect the effectiveness of graph decoder, we add another variant of our model that removes graph decoder and adopts a structure reconstruction based objective function. We name this model “-decoder”, and its loss function is formalized as follows:

$$\mathcal{L} = -\sum_{(v_i, v_j) \in E} \Big( \log \sigma\big(z_i^{\top} z_j\big) + Q \cdot \mathbb{E}_{v_l \sim P_n(v)} \log \sigma\big(-z_i^{\top} z_l\big) \Big)$$
where the number of negative samples per edge is a hyper-parameter and the negative nodes are drawn from a noise distribution; the uniform distribution is used in our experiments. Other hyper-parameters, e.g., the number of hidden units, are set the same as in GraphAE for a fair comparison. The experimental setup is the same as before and, limited by space, we only display the AUC results for link prediction and the Macro-F1 with 10% training labels for node classification. As shown in Tables 7 and 8, GraphAE performs consistently better than “-attention” in both link prediction and node classification. This phenomenon can be explained by the following two reasons. First, the edges in the four datasets do not contain rich information and only indicate whether two nodes are connected with each other; “-attention” therefore aggregates all neighboring nodes without distinguishing their importance, which limits the capacity of the model. Second, the attention based mechanism provides a more flexible way to capture the proximities in node attributes and network structure by reweighting the importance of neighbors.
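The structure reconstruction objective of the “-decoder” variant can be sketched as follows; the helper name, the default number of negatives and the uniform sampler are placeholders of this sketch, not the paper's exact settings:

```python
import numpy as np

def structure_loss(Z, edges, num_nodes, Q=5, seed=0):
    """Negative-sampling structure reconstruction loss: pull the embeddings
    of connected nodes together and push Q uniformly sampled negatives per
    edge apart (Q = 5 is only a placeholder value)."""
    rng = np.random.default_rng(seed)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    total = 0.0
    for i, j in edges:
        total -= np.log(sigmoid(Z[i] @ Z[j]))              # observed edge
        for l in rng.integers(0, num_nodes, size=Q):       # uniform negatives
            total -= np.log(sigmoid(-(Z[i] @ Z[l])))
    return total
```

Unlike the attribute reconstruction loss of GraphAE, this objective only rewards embeddings for reproducing observed links, which is why it can over-emphasize local proximity.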
In all downstream tasks, GraphAE outperforms "-decoder", especially when edges are not reliable; for example, GraphAE outperforms "-decoder" by a large margin on Wiki. This is mainly because the structure-reconstruction loss overlaps with the function of the graph encoder, namely modeling the proximity of nodes. It demonstrates that our graph decoder is better than a structure-reconstruction loss, which may over-emphasize local proximity.
4.5.3. Influence of dimension of hidden representation
The embedding dimension is an important parameter, so we examine how different embedding sizes affect the performance of downstream tasks. Due to space limitations, we only report results for node classification on the four datasets; we observe similar results for link prediction. We vary the embedding dimension over {8, 16, 32, 64, 128, 256}, with the number of units in the first hidden layer set to twice the embedding dimension. Other hyper-parameters are kept the same as described in Section 4.
As shown in Figure 2, the curves on the datasets follow a very similar trend: performance first increases as the embedding dimension grows, then decreases once the dimension exceeds a certain value. These results show that our model is somewhat sensitive to the embedding dimension. Fortunately, since the curves are "unimodal", it is easy to find a dimension that yields good results in downstream tasks.
In this paper, we propose a novel graph auto-encoder framework, GraphAE, for attributed network embedding. GraphAE uses a stack of graph convolutional networks to encode both network structure and node attributes into a single low-dimensional representation for each node. The graph decoder uses another stack of graph convolutional networks to reconstruct both network structure and node attributes from these representations. Our model leverages the complex interaction between network structure and node attributes through feature diffusion, and thus has high capacity to learn good node representations for attributed networks. Experimental results show that our model consistently outperforms all the benchmark algorithms in both downstream tasks. In the future, we will explore more powerful graph convolutional networks for the encoder layer of our framework.
- Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: a system for large-scale machine learning. In OSDI, Vol. 16. 265–283.
- Bronstein et al. (2017) Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. 2017. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34, 4 (2017), 18–42.
- Bruna et al. (2014) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Lecun. 2014. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR2014), CBLS, April 2014.
- Cao et al. (2015) Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2015. GraRep: Learning Graph Representations with Global Structural Information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 891–900.
- Cao et al. (2016) Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2016. Deep Neural Networks for Learning Graph Representations. In AAAI. 1145–1152.
- Chang et al. (2015) Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C Aggarwal, and Thomas S Huang. 2015. Heterogeneous network embedding via deep architectures. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 119–128.
- Chen et al. (2018a) Jie Chen, Tengfei Ma, and Cao Xiao. 2018a. FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. https://openreview.net/forum?id=rytstxWAW
- Chen et al. (2018b) Jianfei Chen, Jun Zhu, and Le Song. 2018b. Stochastic Training of Graph Convolutional Networks with Variance Reduction. In International Conference on Machine Learning. 941–949.
- Cui et al. (2018) Peng Cui, Xiao Wang, Jian Pei, and Wenwu Zhu. 2018. A survey on network embedding. IEEE Transactions on Knowledge and Data Engineering (2018).
- Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems. 3844–3852.
- Donnat et al. (2018) Claire Donnat, Marinka Zitnik, David Hallac, and Jure Leskovec. 2018. Learning Structural Node Embeddings via Diffusion Wavelets. In International ACM Conference on Knowledge Discovery and Data Mining (KDD), Vol. 24.
- Gao and Huang (2018) Hongchang Gao and Heng Huang. 2018. Deep Attributed Network Embedding. In IJCAI. 3364–3370.
- Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics. 249–256.
- Golub and Reinsch (1970) Gene H Golub and Christian Reinsch. 1970. Singular value decomposition and least squares solutions. Numerische mathematik 14, 5 (1970), 403–420.
- Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 855–864.
- Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034.
- Henderson et al. (2012) Keith Henderson, Brian Gallagher, Tina Eliassi-Rad, Hanghang Tong, Sugato Basu, Leman Akoglu, Danai Koutra, Christos Faloutsos, and Lei Li. 2012. RolX: Structural Role Extraction & Mining in Large Graphs. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1231–1239.
- Hinton and Salakhutdinov (2006) Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. science 313, 5786 (2006), 504–507.
- Huang et al. (2017a) Xiao Huang, Jundong Li, and Xia Hu. 2017a. Accelerated attributed network embedding. In Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, 633–641.
- Huang et al. (2017b) Xiao Huang, Jundong Li, and Xia Hu. 2017b. Label informed attributed network embedding. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 731–739.
- Huang and Mamoulis (2017) Zhipeng Huang and Nikos Mamoulis. 2017. Heterogeneous information network embedding for meta path based proximity. arXiv preprint arXiv:1701.05291 (2017).
- Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
- Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Variational Graph Auto-Encoders. NIPS Workshop on Bayesian Deep Learning (2016).
- Kipf and Welling (2017) Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations (ICLR).
- Le and Lauw (2014) Tuan MV Le and Hady W Lauw. 2014. Probabilistic latent document network embedding. In 2014 IEEE International Conference on Data Mining (ICDM). IEEE, 270–279.
- Liao et al. (2018) Lizi Liao, Xiangnan He, Hanwang Zhang, and Tat-Seng Chua. 2018. Attributed social network embedding. IEEE Transactions on Knowledge and Data Engineering (2018).
- Liu et al. (2018) Jie Liu, Zhicheng He, Lai Wei, and Yalou Huang. 2018. Content to node: Self-translation network embedding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1794–1802.
- Man et al. (2016) Tong Man, Huawei Shen, Shenghua Liu, Xiaolong Jin, and Xueqi Cheng. 2016. Predict Anchor Links across Social Networks via an Embedding Approach.. In IJCAI, Vol. 16. 1823–1829.
- Ou et al. (2016) Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. 2016. Asymmetric transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1105–1114.
- Pan et al. (2018) Shirui Pan, Ruiqi Hu, Guodong Long, Jing Jiang, Lina Yao, and Chengqi Zhang. 2018. Adversarially regularized graph autoencoder for graph embedding. arXiv preprint arXiv:1802.04407 (2018).
- Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
- Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online Learning of Social Representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 701–710.
- Qi et al. (2012) Guo-Jun Qi, Charu Aggarwal, Qi Tian, Heng Ji, and Thomas Huang. 2012. Exploring context and content links in social media: A latent space method. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 5 (2012), 850–862.
- Ribeiro et al. (2017) Leonardo FR Ribeiro, Pedro HP Saverese, and Daniel R Figueiredo. 2017. struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 385–394.
- Roweis and Saul (2000) Sam T Roweis and Lawrence K Saul. 2000. Nonlinear dimensionality reduction by locally linear embedding. science 290, 5500 (2000), 2323–2326.
- Tang et al. (2015) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale Information Network Embedding. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1067–1077.
- Tenenbaum et al. (2000) Joshua B Tenenbaum, Vin De Silva, and John C Langford. 2000. A global geometric framework for nonlinear dimensionality reduction. science 290, 5500 (2000), 2319–2323.
- Tu et al. (2016) Cunchao Tu, Weicheng Zhang, Zhiyuan Liu, Maosong Sun, et al. 2016. Max-Margin DeepWalk: Discriminative Learning of Network Representation. In IJCAI. 3889–3895.
- Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. International Conference on Learning Representations (2018). https://openreview.net/forum?id=rJXMpikCZ accepted as poster.
- Wang et al. (2016) Daixin Wang, Peng Cui, and Wenwu Zhu. 2016. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1225–1234.
- Wang et al. (2017) Xiao Wang, Peng Cui, Jing Wang, Jian Pei, Wenwu Zhu, and Shiqiang Yang. 2017. Community Preserving Network Embedding.. In AAAI. 203–209.
- Yang et al. (2015) Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Y Chang. 2015. Network representation learning with rich text information. In IJCAI. 2111–2117.
- Zhu et al. (2007) Shenghuo Zhu, Kai Yu, Yun Chi, and Yihong Gong. 2007. Combining content and link for classification using matrix factorization. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 487–494.