Flexible Attributed Network Embedding
Network embedding aims to encode a network by learning an embedding vector for each of its nodes. Networks often carry property information that is highly informative about a node's position and role in the network, yet most network embedding methods fail to use this information during representation learning. In this paper, we propose a novel framework, FANE, to integrate structure and property information in the network embedding process. In FANE, we design a network that unifies the heterogeneity of the two information sources, and define a new random walking strategy that leverages property information and lets the two sources complement each other. FANE is conceptually simple and empirically powerful. On the Cora classification task it improves over state-of-the-art methods by over 5%, and by more than 10% as the training size increases. Moreover, qualitative visualizations show that our framework is helpful for exploring network property information. In all, we present a new way to efficiently learn state-of-the-art task-independent representations in complex attributed networks. The source code and datasets of this paper can be obtained from https://github.com/GraphWorld/FANE.
Network embedding is an important and ubiquitous research problem with applications ranging from drug design to commodity or friendship recommendation [11, 9, 3, 31]. In most practical networks represented as graphs, nodes have more than one attribute, and these attributes greatly determine their roles in the system. For example, individuals in a social network have various properties, such as sexuality, educational background, and partisanship. Moreover, social science research [17, 19] has shown that node attributes can both reflect and affect community structures [8, 13, 29].
Current network embedding studies include structure-preserving methods and property-preserving methods. After decades of research, many structure-preserving approaches, such as DeepWalk, node2vec and struc2vec, have been proposed to learn network features based on structure information. However, these methods consider only network structure and fail to take advantage of node attributes during encoding.
Recently, a few property-preserving methods have been proposed. These methods can be further categorized into matrix factorization based and deep learning based methods. Matrix factorization based methods, such as TADW and HSCA, represent network properties in the form of a matrix and factorize this matrix to obtain node embeddings. These methods are time and space consuming. Deep learning based methods, such as SNE, DANE and DVNE, take inspiration from existing neural network models and (or) design new models to learn network features. These methods achieve high accuracy at the cost of long training time.
In response, we propose FANE, a scalable and flexible attributed network embedding framework that integrates both structure and property information to learn features. Briefly, we design a network that unifies the heterogeneity of the structure and property information sources, and define a new random walking strategy that leverages property information and makes the combination of the two sources complementary and flexible. Overall, our paper makes the following contributions:
We propose FANE, an efficient and flexible framework that integrates network attributes and structure for feature learning in networks.
We analyze and verify that FANE can learn features comparable to those of state-of-the-art structure-preserving and property-preserving methods.
We extend our method to attribute space. Relationships between attributes can also be explored under our framework.
We evaluate our framework on a multi-label classification task and conduct visual analysis on several real-world datasets.
One key problem in network embedding is what to preserve during learning. Here we discuss related work according to what each method aims to preserve.
Structure-preserving network embedding. DeepWalk generalizes the SkipGram language model to network embedding: it uses random walks to learn latent representations by treating walks as the equivalent of sentences. Instead of exploiting random walks to capture network structure, LINE learns vertex representations by explicitly modeling first-order and second-order proximity. In addition, struc2vec first encodes vertex structural role similarity into a multilayer network, where the weights of edges at each layer are determined by the structural role difference at the corresponding scale. Moreover, node2vec presents a random walking method that interpolates between breadth-first sampling and depth-first sampling. Inspired by these methods, we design a new random walking strategy that leverages property information and lets structure and property information complement each other in the embedding process.
Property-preserving network embedding. TADW extends DeepWalk to get a vertex-context matrix. HSCA integrates homophily, structural context, and vertex content to learn effective network representations. However, factorizing large real-world network matrices with millions of rows and columns is expensive and does not scale. SNE includes two similar deep neural network models that handle structure and attribute information separately in the embedding layer; the results are then processed by the same hidden layers to learn features. In contrast, DANE uses two separate deep neural network models for structure and attribute information, and uses a joint distribution to optimize the result. DVNE learns a Gaussian distribution in the Wasserstein space as the latent representation of each node. These methods achieve high accuracy at the cost of long training time.
Attributed network clustering. Network clustering aims to divide a given set of objects into groups of similar objects. Many related algorithms have been proposed, such as the Minimum Cut algorithm, Multi-way network Partition, the k-medoid and k-means algorithms, and Spectral Clustering. SA-cluster and its extended versions [4, 32] present a way to cluster attributed networks by extracting attributes as separate nodes. Inspired by these works, we construct a new network to integrate structure and attributes for network embedding.
In this section, we first give the problem definition of attributed network embedding, and then discuss our solutions for the key challenges.
An attributed network is formally denoted as $G = (V, E, W, A)$, where $V$ is the set of vertices, $E$ is the set of edges, $W$ is the weights of edges, and $A$ is the set of attributes associated with vertices in $V$ for describing vertex properties. Each vertex $v_i \in V$ is associated with an attribute vector $(a_{i1}, \ldots, a_{i|A|})$, where $a_{ij}$ is the attribute value of vertex $v_i$ on attribute $a_j$.
Our goal is to learn a mapping function $f: V \to \mathbb{R}^d$ from the network to feature representations. Here $d$ is the dimension of the feature representations. $f$ should support:
Scalability (T1). As a network compression approach, network embedding should inherently scale to large networks, whose size ranges from tens to millions or even billions of nodes in practice.
Integration (T2). Network structures and properties are the fundamental factors that need to be considered in network embedding. However, preserving them in a network embedding space is challenging due to the disparity and inhomogeneity between the network space and the embedding vector space.
Adaptation (T3). Different domains, data and applications require different network embedding methods: structure-preserving, property-preserving, or both. Is there a way to provide a framework that is flexible enough to learn features as needed?
The mapping function is learned by maximizing the conditional probability (objective function):
The key challenge is how to define the context nodes for attributed nodes, so that we are able to integrate property information effectively (T2). We deal with this by constructing a new network, as discussed below.
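As a point of reference, and assuming the objective takes the standard Skip-gram form used by DeepWalk and node2vec (an assumption, since the exact formula is not reproduced here), it can be sketched as follows. The symbols $f$ for the mapping function and $N(u)$ for the set of context nodes of $u$ collected by the random walks are illustrative names.

```latex
% Assumed Skip-gram style objective (illustrative notation, not taken from the source):
% f    : mapping from nodes to d-dimensional feature vectors
% N(u) : context nodes of node u sampled by the random walks
\max_{f} \; \sum_{u \in V} \log \Pr\bigl( N(u) \mid f(u) \bigr)
```

Under this reading, the choice of $N(u)$ is exactly where attribute information can enter: if walks may pass through attribute-related edges, nodes that share attributes appear in each other's context.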
As stated, the goal of network construction is to effectively integrate structure and property information. How do we get there? Let us go back to the original network and look at the problem from another angle. In a network, nodes are connected by edges, which reflect the connections between these nodes in structure space (structure information). Meanwhile, if two nodes have the same attribute, we can say that these nodes are connected in attribute space (property information). Attributes thus play the role of edges in attribute space, just like actual edges in the network. To integrate attribute information into network embedding, we make the connections in attribute space concrete by appending special virtual edges that represent them. The heterogeneous edges of the resulting network then reflect connections between nodes in both structure and property space.
Consequently, the simplest way is to add an edge between every pair of nodes that have the same attribute. However, doing so increases the worst-case number of edges from $O(|E|)$ to $O(|V|^2)$. We resolve this problem by introducing virtual attribute nodes. As shown in Figure 1, the attribute network is constructed from the attribute information of the raw network by taking attributes as special virtual nodes: if a node $v$ has attribute $a$, there is a virtual edge between $v$ and the attribute node of $a$. Thus, the worst-case number of edges of the resultant network is reduced to $O(|E| + |V||A|)$, which is generally less than $O(|V|^2)$.
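The edge-count argument above can be checked on a toy case. The following is a minimal sketch with hypothetical data and hypothetical function names (`pairwise_attribute_edges`, `virtual_node_edges`); it only counts the attribute-induced edges and is not part of the released FANE code.

```python
from itertools import combinations


def pairwise_attribute_edges(node_attrs):
    """Count edges needed when every pair of nodes sharing an attribute is linked directly."""
    by_attr = {}
    for node, attrs in node_attrs.items():
        for a in attrs:
            by_attr.setdefault(a, []).append(node)
    edges = set()
    for members in by_attr.values():
        # One edge per unordered pair; the set removes pairs that share several attributes.
        for u, v in combinations(sorted(members), 2):
            edges.add((u, v))
    return len(edges)


def virtual_node_edges(node_attrs):
    """Count edges needed when each attribute becomes one virtual node."""
    return sum(len(set(attrs)) for attrs in node_attrs.values())


# Toy data (hypothetical): 100 nodes that all share the single attribute "a0".
toy = {n: {"a0"} for n in range(100)}
print(pairwise_attribute_edges(toy))  # 4950 = 100 * 99 / 2
print(virtual_node_edges(toy))        # 100
```

With a single shared attribute the direct approach needs a full clique, while the virtual-node approach needs one edge per (node, attribute) pair.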
Let the constructed network be denoted as $G' = (V', E', W', A')$, where $V'$ is the set of raw nodes and virtual attribute nodes, $E'$ is the set of raw edges and virtual edges between nodes and their corresponding attribute nodes, and $A'$ is the same as $A$ apart from the node properties of the attribute nodes. Each attribute node is associated with an attribute vector in which only the entry of its own attribute is nonzero. $W'$ includes $W$ and the weights of the attribute edges, which can be defined as needed. The objective function evolves as:
So the aim is to compute the new mapping function. Interestingly, we get a side effect: network properties can be embedded too, so we can also learn features specifically on network properties. This is non-trivial, as shown by the experiments in Section 4.
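The construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `build_attributed_network`, the `("attr", name)` key convention and the uniform `attr_weight` are hypothetical choices, since the text only says that attribute-edge weights can be defined as needed.

```python
def build_attributed_network(edges, node_attrs, attr_weight=1.0):
    """Return (adjacency dict, set of virtual attribute nodes) for an undirected network.

    edges: iterable of (u, v, weight) raw edges.
    node_attrs: dict mapping each raw node to an iterable of attribute names.
    attr_weight: weight of every virtual edge (an assumption for this sketch).
    """
    adj = {}

    def add_edge(u, v, w):
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    for u, v, w in edges:
        add_edge(u, v, w)

    attr_nodes = set()
    for node, attrs in node_attrs.items():
        adj.setdefault(node, {})  # keep isolated raw nodes in the network
        for a in attrs:
            attr_node = ("attr", a)  # tuple key keeps attribute nodes distinct from raw nodes
            attr_nodes.add(attr_node)
            add_edge(node, attr_node, attr_weight)
    return adj, attr_nodes


raw_edges = [("u1", "u2", 1.0), ("u2", "u3", 1.0)]
attrs = {"u1": ["red"], "u2": [], "u3": ["red"]}
adj, attr_nodes = build_attributed_network(raw_edges, attrs)
```

In this toy network `u1` and `u3` are not adjacent in the raw graph, but both are neighbours of the virtual node `("attr", "red")`, so a walk can move between them in two steps.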
After network construction, we can integrate property information into various state-of-the-art structure-preserving network embedding methods. However, we confront the second key challenge: how to provide flexibility (T3) in various situations. That is, how can FANE let the user obtain structure-preserving features, property-preserving features, or both, as needed? Inspired by node2vec, we design a new random walking strategy that allows a continuous transition between attribute-preserving and structure-preserving network embedding.
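Given a source node $u$ and a fixed random walking length $l$, let $c_i$ denote the $i$-th node in the walk, which starts with $c_0 = u$. Node $c_i$ is generated by the following distribution:
$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z} & \text{if } (v, x) \in E' \\ 0 & \text{otherwise} \end{cases}$$
where $\pi_{vx}$ is the unnormalized transition probability between nodes $v$ and $x$, and $Z$ is the normalizing constant. Notice that the edge set $E'$ of the constructed network includes the constructed attribute edges.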
As illustrated in Figure 1, consider a random walk that has just traversed an edge to reach a node. Notice that two of the nodes share the same attribute, so the constructed network contains two more virtual edges that connect the virtual attribute node with each of them. One key question is how to define the probability of steps that involve attribute nodes. In the next section, we exhaustively discuss three possible random walking strategies and analyze their effectiveness with respect to our goal (T3).
After enumerating case by case, we find three different random walking strategies in the constructed attributed network:
Source Focused (FANE-sf). Consider the set of attribute nodes in the constructed network and the probability of the next step from the source node to the target node. In this strategy, the property-related probability is defined by a condition on the source node only. In practice, there are four conditions, as shown in Figure 2(a). Formally, the probability of the next step is designed as:
where denotes the shortest path distance between nodes and , and
where which is the previous node of .
Under our proposed framework, the random walk should be able to be biased toward the attribute nodes in order to be attribute-preserving. However, under this strategy the probability of walking from the source node to an attribute node depends, by definition, on the values of all the walk parameters. Consequently, it is hard to define the walking probability such that the walk is always biased toward the attribute nodes. So although strategy FANE-sf can be made structure-preserving by adjusting its parameters, it is unlikely to be attribute-preserving, which contradicts our objectives. Experiments confirm this analysis: as shown in Figure 3, strategy FANE-sf yields little separation between two subsets of nodes with different attributes.
In strategies FANE-tf and FANE-stf, the degree of property preservation is governed by the property parameter. By adjusting its value, it is possible to make the random walk structure-preserving or attribute-preserving. We discuss the effect of this parameter thoroughly in Section 3.5. This result satisfies our objectives. In Figure 3, we can see that when the property parameter is set smaller than the other two walk parameters, the biased random walk is attribute-preserving, and both strategies yield a distinct separation between two different attributes. We choose the FANE-tf biased random walking strategy in the following experiments.
While the two structural walk parameters control the likelihood of walking to local and structure nodes respectively, the property parameter decides the bias between structure- and property-preserving. Let us consider two cases:
Case 1: When the property parameter is large, the walk is unlikely to move from the source node to the attribute nodes. Thus, the random walk is more biased toward preserving the local and structural information of the network, according to the values of the two structural parameters.
Case 2: When the property parameter is small, the probability of walking from the source nodes to the attribute nodes increases. Thus, nodes sharing similar attribute information are more likely to be linked together through the attribute nodes, and the resulting embedding is more property-preserving. Figure 4 validates the embedding effectiveness of FANE with different values of the property parameter on real datasets. As we can see, the embedding result shows more property homophily as the parameter decreases.
We can see that this hyper-parameter functions as a slider between structure- and attribute-preserving embedding. In Section 4, we conduct an experiment to show how our method integrates structure-preserving and property-preserving feature learning (T2) and makes a continuous transition between the two (T3).
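The slider behaviour of the two cases can be illustrated with a deliberately simplified walk. The sketch below is not the FANE-tf rule: it assumes a hypothetical first-order bias in which the weight of any edge leading into an attribute node is divided by a parameter `r`, and it ignores the two structural parameters. All names (`transition_probs`, `random_walk`, `r`) are illustrative.

```python
import random


def transition_probs(adj, current, attr_nodes, r):
    """Normalized next-step distribution from `current`.

    Assumed rule for this sketch: edges into attribute nodes have their weight
    divided by r, so a large r discourages attribute hops and a small r favours them.
    """
    weights = {}
    for nxt, w in adj[current].items():
        weights[nxt] = w / r if nxt in attr_nodes else w
    total = sum(weights.values())
    return {nxt: w / total for nxt, w in weights.items()}


def random_walk(adj, start, length, attr_nodes, r, rng):
    """Sample a walk of `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        probs = transition_probs(adj, walk[-1], attr_nodes, r)
        if not probs:  # dead end: no outgoing edges
            break
        nodes = list(probs)
        walk.append(rng.choices(nodes, weights=[probs[n] for n in nodes])[0])
    return walk


# Toy network: two raw nodes joined by an edge, both attached to attribute node "A".
toy_adj = {
    "u1": {"u2": 1.0, "A": 1.0},
    "u2": {"u1": 1.0, "A": 1.0},
    "A": {"u1": 1.0, "u2": 1.0},
}
toy_attr_nodes = {"A"}

small_r = transition_probs(toy_adj, "u1", toy_attr_nodes, 0.25)  # favours the attribute hop
large_r = transition_probs(toy_adj, "u1", toy_attr_nodes, 4.0)   # favours the structural hop
walk = random_walk(toy_adj, "u1", 10, toy_attr_nodes, 0.25, random.Random(0))
```

From `u1`, the probability of stepping to the attribute node is 0.8 with `r = 0.25` and 0.2 with `r = 4.0`, which mirrors Case 2 and Case 1 respectively.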
The pseudocode of FANE is shown in Algorithm 1. We first construct a property-enhanced network. By introducing an in-out hyperparameter for attribute nodes, we can control the weights of structure-preserving and property-preserving random walking. The source code and datasets of this paper can be obtained from https://github.com/GraphWorld/FANE.
The proposed method is flexible enough for embedding of network datasets from different domains as discussed below.
We evaluate our method on several datasets from different domains, as listed in Table 1.
Adjnoun: Nodes represent the most commonly occurring adjectives and nouns in the novel David Copperfield. Edges connect any pair of words that occur in adjacent positions. The property attributes, adjective and noun, are used as node classification information.
WebKB: Nodes and edges represent web pages and their citation network, respectively. Attributes are described as 0/1-valued word vectors indicating the absence/presence of the corresponding word from the dictionary.
Cora: Nodes represent scientific publications and edges reflect the citation relationships. As with WebKB, attributes are described as 0/1-valued word vectors indicating the absence/presence of the corresponding word from the dictionary.
CiteSeer: Nodes represent scientific publications and edges reveal the citation network. Similarly, attributes are described as 0/1-valued word vectors indicating the absence/presence of the corresponding word from the dictionary.
ego-Facebook: Nodes represent Facebook survey participants. Edges are given by their friend lists. Attributes are anonymized personal information of the participants.
ego-Gplus: Nodes represent Google+ users who decided to share their circles. Edges are again given by their friend lists. Attributes are their personal information.
ego-Twitter: Nodes are participants from Twitter. Edges represent their following lists. Attributes are hashtags or the users themselves on Twitter.
We compare FANE with several state-of-the-art network embedding methods. The implementations of these methods are from the original authors.
node2vec: This approach provides a way to integrate breadth-first sampling and depth-first sampling in random walking, and uses the Skip-Gram algorithm to learn the node representation vectors.
struc2vec: This approach first encodes the vertex structural role similarity into a multilayer network. The weights of edges at each layer are determined by the structural role difference at the corresponding scale.
HSCA: This approach simultaneously integrates homophily, structural context, and vertex content to learn effective network representations.
Unlike existing methods, FANE can realize both structure homophily and attribute homophily flexibly. First, we empirically compare our embedding results with structure homophily methods, node2vec ($p = q = 1$) and struc2vec, and with attribute homophily methods, TADW (embedding dimension 8, textRank = 20, lambda = 0.2, train_ratio = 0.5, and one further parameter set to 5) and HSCA (embedding dimension 8, textRank = 20, lambda = 0.2, train_ratio = 0.5, mu = 0.1, and one further parameter set to 5), as shown in Figure 5. Notice that we set the embedding dimension and textRank to be low for TADW and HSCA given the small number of nodes in the Adjnoun dataset.
To begin with, our method can be made mainly structure-homophilous or mainly attribute-homophilous by tuning the value of the property parameter. In Figure 5(a) to 5(c), we can see that with a large value, FANE yields results similar to those of node2vec and struc2vec in terms of preserving structural relationships. In addition, in Figure 5(d) to 5(f), with a small value, the embedding result of FANE is property-preserving like those of TADW and HSCA. Moreover, FANE can extract features reflecting both structure and attribute homophily by adjusting this hyper-parameter, as shown in Figure 5(g) to 5(i) (T2). With smaller values, the embedding result integrates relatively more attribute information, while with larger values relatively more structural information is preserved. To further explain the effect of FANE in integrating structure and attribute homophily, we take the setting shown in Figure 5(h) as an example: the resulting classes (shown with different colors) are mostly consistent with the property attributes (shown with different shapes). Notice that in this situation, there is still one node (the green circle), which represents the adjective “first”, that is classified with the nouns. The reason is that this word is closely connected with nouns in structure. This shows that FANE preserves structure homophily even in this extreme setting (T3).
The Word network is an effective example demonstrating the functionality of our proposed method. In the following section, we will conduct additional experiments to evaluate the effectiveness of our embedding method on network classification.
Network classification is one of the main applications of network embedding. We test FANE on several popular network datasets with ground truth: WebKB, Cora, and Citeseer. We use a Support Vector Machine (SVM) for classification and let the training ratio vary from 10% to 90%. Other parameters are set as follows.
For node2vec, we did a grid search over $p \in \{0.25, 0.50, 1.0, 2.0, 4.0\}$ and $q \in \{0.25, 0.50, 1.0, 2.0, 4.0\}$ to find the combination that yields the best results. The selected values of $(p, q)$ are (0.25, 2.0) for WebKB, (4.0, 2.0) for Cora, and (4.0, 0.25) for Citeseer. For struc2vec, we use the default parameters in its code. For TADW, as instructed in its code, we set textRank = 200 and lambda = 0, with the remaining parameter set to 5, for Cora and Citeseer. For WebKB, we tried all values of that parameter that appear in its paper and finally set it to 15, which yields the best TADW results. Similarly, for HSCA, we set textRank = 200, lambda = 0.2 and mu = 0.1, with the remaining parameter set to 5, for Cora and Citeseer, following the instructions given in the code. For WebKB, we tried all common values of mu and of that parameter that appear in its paper and finally used mu = 0.1 and 15. We set the embedding dimension to 8 for all methods that have an embedding dimension parameter, including FANE.
For FANE, we set the parameter values for each dataset as shown in Table 2. Note that $C$ is the penalty parameter in SVM training. Figure 6 shows the Micro-F and Macro-F results on the datasets. There are three main observations:
FANE consistently outperforms all other benchmark methods on all datasets and at all training ratios. On Cora, FANE performs on average 10% to 20% better than HSCA and node2vec, which give the second highest results, under the same training ratio. It is worth mentioning that even at a very low training ratio, FANE still yields high accuracy. For example, on WebKB and Cora, our method with 10% training data matches or even beats the other benchmark methods with 90% training data.
FANE's classification performance increases significantly as the number of training samples increases. Notice that the benchmark methods generally reach a plateau after a certain number of training samples. For example, on Cora, the classification accuracy of the benchmark methods starts to fluctuate beyond a 60% training ratio. Similarly, on Citeseer, the fluctuation already starts at 10% training data. In contrast, FANE still yields a significant increase in classification accuracy as the number of training samples increases.
FANE shows a more significant improvement in Macro-F on datasets such as WebKB and Cora. As shown in Figure 6, FANE outperforms HSCA, which gives the second best results, by nearly 20% on average. Macro-F treats each class equally and averages the per-class results. Thus, a high Macro-F result implies that FANE classifies nodes accurately for every class.
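The difference between the two scores can be made concrete with a small computation. The sketch below assumes that Micro-F and Macro-F are the usual micro- and macro-averaged F1 scores and that each node has a single label (both are simplifying assumptions); the function name `f1_scores` and the toy labels are hypothetical.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[t] += 1  # true class gets a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy labels (hypothetical): a majority-class predictor on an imbalanced set.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
micro, macro = f1_scores(y_true, y_pred)
print(round(micro, 3), round(macro, 3))  # 0.8 0.444
```

Predicting only the majority class still gives a micro score of 0.8, but the macro score drops to about 0.444 because class `b` is never recovered. This is why a large Macro-F gain suggests good performance on every class, not only the frequent ones.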
Figure 7 shows a visual illustration of the classification results. The embedding results are grouped with K-Means. As we can see, the different classes are reasonably separated with FANE. Furthermore, recall that our main purpose is to provide a flexible framework that realizes both the transition and the integration between structure-preserving and attribute-preserving embedding. The property parameter can be freely decreased or increased depending on which information we want to preserve.
Network visualization has been widely used in network interaction and analysis. It is a powerful tool for revealing the content of a network in an easily interpretable way by finding patterns, marking connections, and showing clustering results. Besides supporting visual analysis of networks as existing methods do, our method is inherently suited to the visual analysis of network attributes.
The visual analysis technique helps us design, evaluate, and explore our proposed framework. To be more specific, Figure 3 to Figure 4 help us visualize our different random walking strategies and their effects on network embedding. Thus, we can directly see how changing strategies or parameters alters the embedding results. This offers great help in the designing process. Moreover, Figure 5 and Figure 7 provide another viable way to evaluate our results: we can now straightforwardly view how the embedded nodes are mixed, separated, and classified as different groups.
Figure 8 also offers some insights to explore. For example, Figure 8(a), Figure 8(d), and Figure 8(g) show the embedding results for the ego-Facebook, ego-Gplus, and ego-Twitter datasets. As we can see, nodes are grouped into different areas. Recall that FANE treats attributes as nodes during the embedding process. Therefore, we can adopt the same visual analysis techniques to explore information about attributes in a 2D figure, as shown in Figure 8(b), Figure 8(e), and Figure 8(h). The visualized patterns of these attributes are worth discussing.
For attributes in Facebook and Twitter, we find that attributes are also separated into different groups, just like the embedded nodes in Figure 8(a) and 8(g). Thus, we enlarge the small selected region in Figure 8(b) to see if we can uncover some relationships between those attributes. In Figure 8(c), attributes 163 (work; end_date), 65 (education; year), 24 (education; school), 164 (work; start_date), 148 (work; employer), 214 (education; concentration), 213 (education; concentration), and 143 (work; employer) are closely grouped together. Here 213 and 214 represent two different concentrations (majors), while 143 and 148 represent two different employers. This result is not surprising, because one's major is related to one's school and influences which type of enterprise one enters later. Besides, the year of graduation also affects the years in which people start or end their jobs. Even though all attributes in the Facebook dataset are anonymous, it is interesting to see that attributes are grouped based on their information.
Nevertheless, the embedded attribute pattern for Google+ differs from those of Facebook and Twitter: most of the attributes are grouped together, with only a few attributes placed far away from them. We also enlarge some of the isolated attributes to see whether attributes in those isolated areas are still grouped based on their meanings. There are four attributes in the selected area in Figure 8(f): 893 (job_title: music), 298 (job_title: dj), 872 (job_title: dj), and 238 (job_title: producer); here 893 and 298 denote two different attributes with the same type of job. There is a strong correlation among working as a DJ, working in the field of music, and working as a producer. This example confirms that attributes are grouped based on their meaning even in this isolated area.
We continue by exploring the attribute information for Twitter. Notice that unlike those of Facebook or Google+, attributes in Twitter are actually hashtags and users. Consequently, we noticed patterns in which hashtags and users are clustered according to their contents and personal interests. In Figure 8(i), all attributes are related to the BBC, in terms of BBC News and BBC Sport. For example, 7004 (@bbcsport) is the official BBC Sport channel on Twitter, 7030 (@georgeyboy) is a famous British sports broadcaster who used to work for BBC 5 Live, and 6990 (@bbc5live) is the official channel of BBC 5 Live. Thus, different types of attributes, whether personal information or the users themselves, can be embedded and analyzed as long as there exists some shared information between them.
From the above examples, we showed that it is possible to uncover non-trivial information about attributes if we also embed and visualize them. Thus, our proposed framework, FANE, is not limited to flexibly integrating the attribute and structural information of nodes; it can also be extended to reflect more information about the attributes themselves.
To test the sensitivity to the embedding dimension and the SVM penalty parameter, we fixed the values of the other parameters for datasets Cora, Citeseer, and Wiki. We let the embedding dimension vary from 20 to 100 and the penalty parameter vary from 0.1 to 1.0. The result is shown in Figure 9, where the x-axis represents the values of one parameter and the different lines represent the values of the other.
As we can see, the testing accuracy is relatively stable with respect to the penalty parameter in most cases when the embedding dimension is kept unchanged. However, in some cases the penalty parameter also heavily impacts the result. For example, on WebKB, the testing result fluctuates by over 15% as the penalty parameter changes. Moreover, the embedding dimension has a relatively high impact on the testing accuracy in Figure 9. Overall, our testing results are relatively stable with regard to the penalty parameter, while they are heavily impacted by the embedding dimension.
To evaluate the scalability of FANE, we learn representations from Erdos-Renyi graphs with an increasing number of nodes at an average degree of 10. As we can see in Figure 10, the computational time scales linearly with the number of nodes. Here the two size parameters are the number of nodes and the number of attributes, respectively. This result indicates that our proposed framework is scalable with respect to the number of nodes.
Since we treat attributes as nodes during the network embedding process, we also test the scalability with respect to the number of attributes per node. Figure 10 also plots computational time versus the number of attribute nodes. Based on an Erdos-Renyi network, we fix the number of nodes to 1000 and increase the number of attributes of each node. As we can see, the computational time also varies linearly with the number of attribute nodes. Together, these two experiments indicate that our proposed framework scales with both the number of nodes and the number of attributes. Thus, our method can handle large-scale network embedding within a controllable amount of time.
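The synthetic inputs for such a scalability test can be generated with the standard library alone. The following is a minimal sketch under stated assumptions: the helper names, the attribute pool size and the rule of sampling a fixed number of attributes per node are all hypothetical, and the embedding step itself is not included.

```python
import random


def erdos_renyi_edges(n, avg_degree, rng):
    """Sample a G(n, p) graph with p chosen so that the expected degree is avg_degree."""
    p = avg_degree / (n - 1)
    return [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p]


def random_attribute_edges(n, attrs_per_node, pool_size, rng):
    """Attach attrs_per_node distinct attributes (drawn from a pool) to every node."""
    return [
        (u, ("attr", a))
        for u in range(n)
        for a in rng.sample(range(pool_size), attrs_per_node)
    ]


rng = random.Random(0)
n = 1000
edges = erdos_renyi_edges(n, 10, rng)
attr_edges = random_attribute_edges(n, 5, 50, rng)

avg_degree = 2 * len(edges) / n
print(len(attr_edges))  # 5000 virtual edges: one per (node, attribute) pair
```

Timing an embedding run on such graphs for growing `n`, and then for a growing number of attributes per node at fixed `n`, reproduces the two axes of the experiment described above. The number of virtual edges grows as the product of the node count and the attributes per node, which is consistent with the reported linear trend in each variable separately.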
In this paper, we proposed an attributed network embedding framework which could flexibly integrate structure information and attribute information. Thus, it could learn features based on structure, attributes or both, and could provide a smooth transition between attribute-preserving and structure-preserving embedding.
Experiments confirm that our proposed method outperforms the listed state-of-the-art methods on network classification. To the best of our knowledge, FANE is the first method to provide flexible adjustment between attribute-preserving and structure-preserving embedding. Under our proposed framework, we can actively intervene in the embedding process to determine which type of information or which kind of integration we want. Moreover, we provide a visual analysis approach to design, optimize, and evaluate our method. This intuitive approach is non-trivial in network analysis and interaction.
In this paper, we restrict our discussion to undirected attributed graphs, but our method can easily be extended to directed attributed graphs. In addition, we assume that every attribute has the same importance; the attribute parameter could be extended to reflect the relative importance of different attributes. Moreover, we treat attributes as normal nodes; it would be interesting to process attributes further, for example by classification or clustering. This may be useful in problems such as attribute compression, which is important when processing networks with thousands of attributes.
Journal of Machine Learning Research, 9:2579–2605, 2008.
Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 2079–2085, 2018.
Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.