1 Introduction
Learning a useful feature representation from graph data lies at the heart of many machine learning tasks, such as node classification [Neville and Jensen2000, Akoglu et al.2015] and link prediction [Al Hasan and Zaki2011], among others [Koyutürk et al.2006, Ng et al.2002]. Motivated by the success of word embedding models, such as the skip-gram model [Mikolov et al.2013], recent works extended word embedding models to learn graph embeddings [Perozzi et al.2014, Goyal and Ferrara2017]. The primary goal of these works is to model the conditional probabilities that relate each input vertex to its context, where the context is a set of other vertices surrounding and/or topologically related to the input vertex. Many variants of graph embedding methods use random walks to generate the context vertices [Perozzi et al.2014, Grover and Leskovec2016, Ribeiro et al.2017, Cavallari et al.2017]. For instance, DeepWalk [Perozzi et al.2014] initiates random walks from each vertex to collect sequences of vertices (similar to sentences in language). Then, the skip-gram model is used to fit the embeddings by maximizing the conditional probabilities that relate each input vertex to its surrounding context. In this case, vertex identities are used as words in the skip-gram model, and the embeddings are tied to the vertex ids.
In language, the foundational idea is that words with similar meanings will be surrounded by a similar context [Harris1954]. As such, in language models, the context of a word is defined as the surrounding words. However, this foundation does not directly translate to graphs. Unlike words in languages, which are universal, with semantics and meaning independent of the corpus of documents, vertex ids obtained by random walks on graphs are not universal and are only meaningful within a particular graph. This key limitation has two main disadvantages. First, these embedding methods are inherently transductive, dealing essentially with isolated graphs, and are unable to generalize to unseen nodes. Consequently, they are unsuitable for graph-based transfer learning tasks such as across-network classification [Kuwadekar and Neville2011, Getoor and Taskar2007] and graph similarity/comparison [Goldsmith and Davenport1990, Zager and Verghese2008]. Second, by using this traditional definition of random walks, there is no general way to integrate vertex attributes/features into the network representation.
There is no guarantee that similar vertices are surrounded by similar context (obtained using random walks on graphs). Recent empirical analysis shows that using random walks in graph embeddings primarily captures proximity among the vertices (see [Goyal and Ferrara2017]), so that vertices that are close to one another in the graph are embedded together, e.g., vertices that belong to the same community are embedded similarly. Proximity among the vertices does not guarantee their similarity, however, and the idea of a network position or a role [Lorrain and White1977, Rossi and Ahmed2015a, Henderson et al.2011] is more suitable to represent the similarity and structural relatedness among vertices. Roles represent vertex connectivity patterns such as hubs, star-centers, star-edge nodes, near-cliques, or vertices that act as bridges to different regions of the graph. Intuitively, two vertices belong to the same role if they are structurally similar. Random walks will likely visit nearby vertices first, which makes them suitable for finding communities rather than roles (structural similarity) (see Sec. 3 for theoretical analysis).
To overcome the above problems, we propose the Role2Vec framework, which serves as a basis for generalizing many existing methods that use traditional random walks. Role2Vec utilizes the flexible notion of attributed random walks, which is not tied to vertex identity and is instead based on a function $\Phi$ that maps a vertex attribute vector to a type, such that two vertices belong to the same type if they are structurally similar. The proposed framework provides a number of important advantages to any method generalized using it. First, it is naturally inductive, as the learned features generalize to new nodes and across graphs, and can therefore be used for transfer learning tasks. Second, it captures structural similarity (roles) better. Third, it is inherently space-efficient, since embeddings are learned for types (as opposed to vertices), and therefore requires significantly less space than existing methods. Fourth, it naturally supports graphs with attributes (if available/given as input). Furthermore, our approach is shown to be effective, improving AUC on average while requiring less space on average than existing methods on a variety of graphs from different application domains.
2 Framework
We consider an (un)directed input graph $G = (V, E)$, where $N_v = |V|$ is the number of vertices in $G$, and $N_e = |E|$ is the number of edges in $G$. For any vertex $v_i \in V$, let $\Gamma_i$ be the set of direct neighbors of $v_i$, and let $d_i = |\Gamma_i|$ be the vertex degree. In addition, we consider a matrix $X$ of attributes/features, where each $x_i$ is a $K$-dimensional vector for vertex $v_i$. For example, for graphs without attributes, $x_i$ could simply be an indicator vector for vertex $v_i$, with $K$ equal to the number of vertices (i.e., having $x_{ij} = 1$ if $i = j$, and $0$ otherwise) [Perozzi et al.2014, Grover and Leskovec2016]. For attributed graphs, $x_i$ may include observed attributes, topological features, and/or node types for heterogeneous graphs. The goal of an embedding method is to derive useful features of particular graph elements (e.g., vertices, edges) by learning a model that maps each graph element to a latent $D$-dimensional space. While the approach remains general for any graph element, this paper focuses on vertex embeddings.
To achieve this, an embedding is usually defined with three components: (1) the context function, which specifies a set of other vertices called the context $c_i$ for any given vertex $v_i$, such that the context vertices are surrounding and/or topologically related to the given vertex. Each vertex is associated with two latent vectors, an embedding vector $\alpha_i$ and a context vector $\beta_i$. (2) the conditional distribution, which specifies the statistical distribution used to combine the embedding and context vectors. More specifically, the conditional distribution of a vertex combines its embedding and the context vectors of its surrounding vertices. (3) the model parameters (i.e., embedding and context vectors) and how these are shared across the conditional distributions. Thus, an embedding method models the conditional probability that relates each vertex to its context as follows: $P(c_i \mid x_i)$, where $c_i$ is the set of context vertices for vertex $v_i$, $x_i$ is its feature/attribute vector, and $P$ is the conditional distribution.
Our goal is to model $P(c_i \mid x_i) = \prod_{v_j \in c_i} P(v_j \mid x_i)$, assuming the context vertices are conditionally independent. The most commonly used conditional distribution is the categorical distribution (see [Rudolph et al.2016] for a generalization). In this case, a softmax function parameterized with the two latent vectors (i.e., embedding and context vectors) is used. Thus, for each input-context vertex pair $(v_i, v_j)$,
$$P(v_j \mid x_i) = \frac{\exp(\alpha_i^\top \beta_j)}{\sum_{v_k \in V} \exp(\alpha_i^\top \beta_k)} \tag{1}$$
For sparse graphs, the summation in Eq. 1 contains many zero entries, and thus can be approximated by subsampling those zero entries (using negative sampling similar to language models [Mikolov et al.2013]). Finally, the objective function of the embedding method is the sum of the log-likelihood values of each vertex, i.e., $\mathcal{L} = \sum_{v_i \in V} \log P(c_i \mid x_i)$.
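The softmax parameterization and log-likelihood objective described above can be illustrated with a minimal NumPy sketch. The function names, the toy embedding matrices, and the `contexts` dictionary are assumptions made for illustration only; the sketch evaluates the full softmax and does not implement negative sampling.

```python
import numpy as np

def context_probabilities(alpha_i, beta):
    """Softmax over all candidate context vectors for one input vertex."""
    scores = beta @ alpha_i            # one inner product per candidate vertex
    scores = scores - scores.max()     # shift for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def log_likelihood(alpha, beta, contexts):
    """Sum of log-probabilities of the observed context vertices."""
    total = 0.0
    for i, ctx in contexts.items():
        probs = context_probabilities(alpha[i], beta)
        total += float(np.log(probs[list(ctx)]).sum())
    return total

# Toy data (hypothetical): 5 vertices, 3-dimensional latent vectors.
rng = np.random.default_rng(0)
alpha = rng.normal(size=(5, 3))    # embedding vectors, one row per vertex
beta = rng.normal(size=(5, 3))     # context vectors, one row per vertex
contexts = {0: [1, 2], 3: [4]}     # vertex index -> indices of its context
ll = log_likelihood(alpha, beta, contexts)
```

The full softmax costs one inner product per vertex for every input vertex, which is the cost that negative sampling is meant to avoid on large sparse graphs.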
Clearly, there is a class of possible embedding methods where each of the three components (discussed above) is considered a modeling choice with various alternatives. Recent work proposed random walks to sample/collect the context vertices [Perozzi et al.2014, Grover and Leskovec2016].
2.1 Mapping Vertices to Vertex-Types
Given a matrix $X$ of attributes and/or structural features, the Role2Vec framework starts by locating sets of vertices that are placed similarly with respect to all other sets of vertices, regardless of how large or small the shortest distance between any two vertices in a set is. Thus, two vertices belong to the same set if they are similar in terms of attributes and/or structural features. We achieve this by learning a function $\Phi$ that maps the $N_v$ vertices to a set of $M$ vertex-types, where $M$ is often much smaller than $N_v$, i.e., $M \ll N_v$,
$$\Phi : x \rightarrow y \tag{2}$$
Thus, $\Phi$ is a function mapping vertices to vertex-types based on the attribute matrix $X$. Clearly, the function $\Phi$ is a modeling choice, which could be learned automatically or defined manually by the user. We explore two general classes of functions for mapping vertices to their types. The first class consists of simple functions taking the form:
$$\Phi(x) = x_1 \circ x_2 \circ \cdots \circ x_K \tag{3}$$
where $x = (x_1, \ldots, x_K)$ is an attribute vector and $\circ$ is a binary operator such as concatenation or sum, among others. The second class consists of functions learned by solving an objective function. This includes functions based on a low-rank factorization of the matrix $X$ having the form $X \approx \phi\langle U W^\top \rangle$ with factor matrices $U \in \mathbb{R}^{N_v \times F}$ and $W \in \mathbb{R}^{K \times F}$, where $F$ is the rank and $\phi$ is a linear or nonlinear function. More formally,
$$\arg\min_{(U, W) \in \mathcal{C}} \Big\{ \mathcal{D}\big(X, \phi\langle U W^\top \rangle\big) + \mathcal{R}(U, W) \Big\} \tag{4}$$
where $\mathcal{D}$ is the loss, $\mathcal{C}$ is the set of constraints (e.g., non-negativity constraints $U \geq 0$, $W \geq 0$), and $\mathcal{R}$ is a regularization penalty. Then, we partition the rows $u_i$ of $U$ into $M$ disjoint sets of nodes (one for each of the $M$ vertex-types) $V_1, \ldots, V_M$, where $V_j$ is the set of vertices mapped to vertex-type $j$, by solving the k-means objective:
$$\min_{\{V_j\}_{j=1}^{M}} \sum_{j=1}^{M} \sum_{u_i \in V_j} \lVert u_i - c_j \rVert^2, \qquad c_j = \frac{1}{|V_j|} \sum_{u_i \in V_j} u_i \tag{5}$$
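As a minimal sketch of the first (simple) class of mapping functions, the snippet below derives a vertex-type for each vertex by concatenating its attribute values. It assumes the attributes are already discrete (for example, binned counts); the function name `map_to_types` and the toy matrix are hypothetical and not taken from the paper.

```python
import numpy as np

def map_to_types(X):
    """Assign an integer type id to each row of X by concatenating its values."""
    type_ids = {}                      # concatenated key -> integer type id
    types = []
    for row in X:
        key = "-".join(str(int(v)) for v in row)   # concatenation operator
        if key not in type_ids:
            type_ids[key] = len(type_ids)
        types.append(type_ids[key])
    return np.array(types), type_ids

# Toy attribute matrix (hypothetical): 5 vertices, 2 discrete attributes each.
X = np.array([[1, 0], [2, 3], [1, 0], [2, 3], [0, 0]])
types, type_ids = map_to_types(X)
```

Vertices with identical attribute vectors share a type, so the number of types is at most the number of distinct rows of the attribute matrix.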
2.2 Attributed Random Walks
Recently, random walks have received much attention in learning network embeddings [Perozzi et al.2014, Grover and Leskovec2016], in particular to generate the context vertices. Consider a random walk of length $L$ starting at a vertex $v_{i_0}$ of the input graph $G$: if at time $t$ we are at vertex $v_{i_t}$, then at time $t+1$ we move to a neighbor of $v_{i_t}$ with probability $1/d_{i_t}$, where $d_{i_t}$ is the degree of $v_{i_t}$. Thus, the resulting randomly chosen sequence of vertex indices $i_0, i_1, \ldots, i_L$ is a Markov chain. However, a key limitation of these methods is that the embeddings learned based on random walks are fundamentally tied to vertex ids. By using this traditional definition of random walks, there is no general way to integrate vertex attributes and structural features into the network representation. On the other hand, vertex attributes and structural features can easily be represented by differentiating the edges according to the types of their endpoints, which leads to the definition of attributed random walks.
Definition (Attributed walk)
Let $x_i$ be a $K$-dimensional vector for vertex $v_i$. An attributed walk of length $L$ is a sequence of adjacent vertex-types,
$$\Phi(x_{i_0}), \Phi(x_{i_1}), \ldots, \Phi(x_{i_L}) \tag{6}$$
induced by a randomly chosen sequence of indices $i_0, i_1, \ldots, i_L$ generated by a random walk of length $L$ starting at $v_{i_0}$, and a function $\Phi$ that maps an input vector $x$ to a vertex-type $\Phi(x)$.
The induced vertex-type sequence in the above definition is called an attributed random walk and is also a Markov chain.
The Role2Vec framework uses vertex mapping and attributed random walks to learn the embeddings. Thus, our goal is to model the conditional probability that relates each vertex-type to the types of its context,
$$P\big(\Phi(x_{c_i}) \mid \Phi(x_i)\big) = \prod_{v_j \in c_i} P\big(\Phi(x_j) \mid \Phi(x_i)\big) \tag{7}$$
where $c_i$ denotes the set of context vertices of $v_i$. Hence, the embedding structure (i.e., the embedding and context vectors) is shared among the vertices with the same vertex-type. Specifically, we learn one embedding vector and one context vector for each partition $V_j$ of vertices mapped to vertex-type $j$. Note that Role2Vec learns an embedding for an aggregated network, where detailed relations among individual vertices are aggregated to total relations among vertex-types.
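The definition of an attributed walk can be sketched in a few lines of Python: run an ordinary uniform random walk over vertex ids, then replace each visited vertex by its type. The adjacency list, the type labels, and the function names below are hypothetical and serve only to illustrate the idea.

```python
import random

def random_walk(adj, start, length, rng):
    """Uniform random walk: at each step move to a random neighbor."""
    walk = [start]
    for _ in range(length):
        neighbors = adj[walk[-1]]
        if not neighbors:              # dead end (isolated vertex)
            break
        walk.append(rng.choice(neighbors))
    return walk

def attributed_walk(adj, vertex_type, start, length, rng):
    """Sequence of vertex-types induced by a random walk over vertex ids."""
    return [vertex_type[v] for v in random_walk(adj, start, length, rng)]

# Toy graph (hypothetical): a triangle 0-1-2 with a pendant vertex 3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
vertex_type = {0: "a", 1: "a", 2: "hub", 3: "leaf"}
rng = random.Random(7)
walk = attributed_walk(adj, vertex_type, start=0, length=5, rng=rng)
```

Because the walk emits types rather than ids, the same type sequence can be produced on a different graph, which is what allows the learned embeddings to transfer.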
2.3 Role2Vec Algorithm
The Role2Vec algorithm is shown in Alg. 1. Alg. 1 takes the following inputs: (1) graph $G$, (2) attribute matrix $X$, (3) embedding dimension $D$, (4) walks per vertex $R$, (5) walk length $L$, (6) context window size $\omega$. In Line 3, if $X$ is not available, we derive structural features using the graph structure itself. For instance, in this paper, we use small subgraphs called motifs as structural features. Counts of motif patterns were shown to be useful for a variety of network analysis tasks and can be computed quickly and efficiently with parallel algorithms [Ahmed et al.2015, Ahmed et al.2016, Benson et al.2016]. Since many graph properties, including motifs, exhibit power law distributions, we preprocess $X$ using logarithmic binning, similar to [Henderson et al.2011] (Line 4). In Line 5, vertices are mapped to vertex-types using a function $\Phi$ as discussed in Section 2.1. Then, we precompute the random walk transition probabilities, which could be uniform or weighted (Line 6). Lines 8–13 initiate random walks from each vertex using the notion of attributed random walks in Lines 17–24. Finally, Role2Vec learns the embeddings using stochastic gradient descent in Line 14.
Recall that $N_v$ is the number of nodes, $M$ is the number of types, and $M \ll N_v$. Role2Vec has the following properties.
Property
Role2Vec is space-efficient with space complexity $O(MD + N_v)$.
Proof. To store the learned embeddings of the $M$ vertex-types, Role2Vec takes $O(MD)$ space. Also, Role2Vec takes $O(N_v)$ space for a hash table mapping vertices to their corresponding types. Thus, the total space used by Role2Vec is $O(MD + N_v)$, which is less than the $O(N_v D)$ space required by the baselines.
As $M \rightarrow N_v$, Role2Vec converges to the baseline random walk methods [Perozzi et al.2014, Grover and Leskovec2016], since each vertex is mapped to a new type that uniquely identifies it from other vertices, i.e., $\Phi$ is a one-to-one function from $V$ onto itself.
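The space argument in the proof above can be made concrete with a small back-of-the-envelope calculation. The sketch counts stored values (not bytes): one embedding vector per vertex-type plus one type id per vertex, compared against one embedding vector per vertex. The graph size, number of types, and dimension are hypothetical numbers chosen only for illustration.

```python
def role2vec_storage(num_vertices, num_types, dim):
    """Stored values: one dim-sized vector per type plus one type id per vertex."""
    return num_types * dim + num_vertices

def per_vertex_storage(num_vertices, dim):
    """Stored values when every vertex has its own embedding vector."""
    return num_vertices * dim

# Hypothetical sizes: one million vertices, 500 types, 128 dimensions.
n, m, d = 1_000_000, 500, 128
ratio = per_vertex_storage(n, d) / role2vec_storage(n, m, d)
```

When the number of types is small, the vertex-to-type table dominates the type-level storage, so the saving approaches a factor equal to the embedding dimension.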
3 Theoretical Analysis
In a graph $G$, the sequence of vertices visited by a random walk of length $L$ is represented by a directed path on the graph. In this section, we analyze the properties and parameters of random walks that affect the embedding methods. The three lemmas below analyze the constraints and bounds on vertex reachability, expected access time, and representation of vertices/edges in random walks, respectively.
We consider a random walk of length $L$ starting at a vertex of $G$: if at time $t$ we are at vertex $v_i$, then at time $t+1$ we move to a neighbor of $v_i$ with probability $1/d_i$, where $d_i$ is the degree of $v_i$. Clearly, the randomly chosen sequence of vertex indices $(i_t : t = 0, 1, \ldots)$ is a Markov chain. We denote by $p_t$ the distribution of $i_t$, where $p_t(i) = \Pr(i_t = i)$ is the probability that the random walk visits vertex $v_i$ at time $t$. Similarly, we denote by $P_{ij}$ the transition probability from vertex $v_i$ to vertex $v_j$ in one step, where $\sum_j P_{ij} = 1$. Thus, the Markov property implies that this Markov chain is uniquely defined by its one-step transition matrix $P$,
$$P_{ij} = \begin{cases} 1/d_i & \text{if } (v_i, v_j) \in E \\ 0 & \text{otherwise} \end{cases} \tag{8}$$
Let $P^t$ be the transition matrix whose entries are the $t$-step transition probabilities, such that
$$(P^t)_{ij} = \Pr(i_{s+t} = j \mid i_s = i) \tag{9}$$
is the probability that the walk moves from vertex $v_i$ to vertex $v_j$ in exactly $t$ steps. Finally, we denote by $f_{ij}^{(t)}$ the probability that, starting at vertex $v_i$, the first transition to vertex $v_j$ occurs at time $t$,
$$f_{ij}^{(t)} = \Pr(i_t = j,\; i_s \neq j \text{ for } 0 < s < t \mid i_0 = i) \tag{10}$$
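The quantities defined above (one-step transition matrix, multi-step transition probabilities, and first-passage probabilities) can be computed directly for a small graph. The sketch below assumes an unweighted graph with no isolated vertices given as a dense adjacency matrix; the function names and the three-vertex path graph are hypothetical.

```python
import numpy as np

def transition_matrix(A):
    """Row-normalize the adjacency matrix: each neighbor gets probability 1/degree."""
    deg = A.sum(axis=1)
    return A / deg[:, None]

def first_passage(P, j, t_max):
    """f[t, i] = probability that a walk from i first reaches j at step t."""
    n = P.shape[0]
    f = np.zeros((t_max + 1, n))
    f[1] = P[:, j]                     # reach j in one step
    Q = P.copy()
    Q[:, j] = 0.0                      # forbid visiting j before the final step
    for t in range(2, t_max + 1):
        f[t] = Q @ f[t - 1]            # condition on the first transition
    return f

# Path graph 0 - 1 - 2 (hypothetical toy example).
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
P = transition_matrix(A)
P2 = np.linalg.matrix_power(P, 2)      # two-step transition probabilities
f = first_passage(P, j=2, t_max=4)     # first-passage probabilities into vertex 2
```

On this path graph, a walk from vertex 0 can first reach vertex 2 only at even times, which the recursion reproduces.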
Lemma
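If $v_i$ and $v_j$ are two non-adjacent vertices in a connected graph $G$, then there is at least one neighbor $v_k$ of $v_i$ where $f_{kj}^{(t)} \geq f_{ij}^{(t+1)}$ for $t \geq 1$, with $f_{ij}^{(t)}$ denoting the probability that a walk starting at $v_i$ first reaches $v_j$ at time $t$.
Proof. Let $d_i$ be the degree of vertex $v_i$, and denote by $\Gamma_i$ the set of neighbors of $v_i$. For each neighbor $v_k \in \Gamma_i$, start a random walk at $v_k$, and let $f_{kj}^{(t)}$ be the probability that the first transition from $v_k$ to $v_j$ occurs at time $t$. Now begin a random walk at $v_i$ and let $f_{ij}^{(t+1)}$ be the probability that the first transition from $v_i$ to $v_j$ occurs at time $t+1$. By conditioning on the first transition, we have
$$f_{ij}^{(t+1)} = \frac{1}{d_i} \sum_{v_k \in \Gamma_i} f_{kj}^{(t)}.$$
Thus the probability $f_{ij}^{(t+1)}$ is the mean of the probabilities $f_{kj}^{(t)}$ of $v_i$'s neighbors, for $v_k \in \Gamma_i$ and $t \geq 1$. This implies that there is at least one neighbor $v_k$ where $f_{kj}^{(t)} \geq f_{ij}^{(t+1)}$ for $t \geq 1$, and the lemma is proved.
The lemma above shows that the probability $f_{ij}^{(t+1)}$ is upper bounded by the probability of at least one of $v_i$'s neighbors (i.e., $f_{ij}^{(t+1)} \leq \max_{v_k \in \Gamma_i} f_{kj}^{(t)}$).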
Lemma
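If $v_i$ and $v_j$ are two non-adjacent vertices in a connected graph $G$, where $H_{ij}$ is the expected access time from $v_i$ to $v_j$ and $\bar{H}_{ij} = \frac{1}{d_i} \sum_{v_k \in \Gamma_i} H_{kj}$ is the average neighbor access time for $v_i$, then with probability at most $1/2$, a random walk starting at $v_i$ takes at least $2(1 + \bar{H}_{ij})$ time to reach $v_j$.
Proof. Let $f_{ij}^{(t)}$ be the probability that, starting at $v_i$, the random walk first visits $v_j$ at time $t$; then the expected access time from $v_i$ to $v_j$ is $H_{ij} = \sum_{t \geq 1} t \, f_{ij}^{(t)}$. By conditioning on the first transition, we have
$$H_{ij} = 1 + \frac{1}{d_i} \sum_{v_k \in \Gamma_i} H_{kj},$$
where $d_i$ is the degree of $v_i$, $\Gamma_i$ is its set of neighbors, and $H_{kj}$ is the expected access time for a neighbor vertex $v_k$. Let $T_{ij}$ denote the (random) access time, so that $H_{ij} = \mathbb{E}[T_{ij}]$. Since $T_{ij} \geq 0$, by Markov's inequality, for any $a > 0$,
$$\Pr(T_{ij} \geq a) \leq \frac{H_{ij}}{a}.$$
Let $\bar{H}_{ij}$ be the average neighbor access time for vertex $v_i$, so that $H_{ij} = 1 + \bar{H}_{ij}$. Then, with $a = 2 H_{ij}$, a random walk starting at $v_i$ takes at least $2(1 + \bar{H}_{ij})$ time to reach $v_j$ with probability at most $1/2$.
The lemma above relates the access time of a random walk from $v_i$ to $v_j$ to twice the average neighbor access time for vertex $v_i$. Together, the two lemmas above verify the intuition that a random walk starting at any vertex will likely visit nearby vertices before visiting distant vertices, and that a distant vertex is reachable from some neighbor in fewer steps. This makes random walks more suitable for capturing communities rather than global structural similarities among all the vertices.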
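Expected access times can be checked numerically on a small graph by conditioning on the first transition, which gives a linear system in the access times to a fixed target. The sketch below is an illustration under stated assumptions (connected, unweighted graph; uniform transitions); the function name `access_times` and the path graph are hypothetical.

```python
import numpy as np

def access_times(P, j):
    """Expected number of steps to first reach vertex j from every start vertex."""
    n = P.shape[0]
    keep = [i for i in range(n) if i != j]
    Q = P[np.ix_(keep, keep)]          # transitions among non-target vertices
    # First-step conditioning: h = 1 + Q h, i.e. (I - Q) h = 1.
    h = np.linalg.solve(np.eye(n - 1) - Q, np.ones(n - 1))
    H = np.zeros(n)
    H[keep] = h                        # access time from j to itself is set to 0
    return H

# Path graph 0 - 1 - 2 with uniform transition probabilities.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
P = A / A.sum(axis=1)[:, None]
H = access_times(P, j=2)
neighbor_avg = H[1]                    # vertex 0 has the single neighbor 1
```

Here the access time from vertex 0 equals one plus the average access time of its neighbors, so reaching the distant vertex always takes longer from vertex 0 than from its neighbor.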
Lemma
Suppose we start random walks of length from any vertex in . For a given edge , let denote the total number of random walks containing
. Then, the expectation of the random variable
is upper bounded by , i.e., .Proof. Recall that the probability of a random walk starting at visits at time is , where is the indicator vector for vertex , which equals in coordinate and otherwise. Then, for a given edge , the probability that the random walk visits at time and at time is (since the transition probability from to is ). Suppose we start random walks of length from , let denote the total number of random walks containing , then the expectation of is the sum of the probabilities that there exists a random walk visiting as follows
where is the degree matrix with the th diagonal entry is the vertex degree , is the unit vector with all entries equal to , and .
4 Experiments
In this section, we investigate the effectiveness of the proposed framework using a variety of graphs. Unless otherwise mentioned, all experiments use logarithmic binning (see Footnote 1), and the bin size $\alpha$ is chosen by a search over candidate values. In these experiments, we use a simple function $\Phi$ that represents a concatenation of the attribute values in the node attribute vector $x$. We searched over 10 subsets of the motif features of size 2–4 nodes shown in Figure 1. We evaluate the Role2Vec approach presented in Section 2.3, which leverages the attributed random walk framework (Section 2), against a number of popular methods including node2vec [Grover and Leskovec2016], DeepWalk [Perozzi et al.2014], struc2vec [Ribeiro et al.2017], and LINE [Tang et al.2015]. For our approach and node2vec, we use the same hyperparameters and grid search as mentioned in [Grover and Leskovec2016]. We use logistic regression (LR) with an L2 penalty. The model is selected using 10-fold cross-validation on a portion of the labeled data. Experiments are repeated for 10 random seed initializations. All results are statistically significant. We use AUC to evaluate the models. Data was obtained from NetworkRepository [Rossi and Ahmed2015b].
Footnote 1: Logarithmic binning assigns the fraction $\alpha$ of nodes with the smallest attribute values to bin 0 (where $0 < \alpha < 1$), then assigns the fraction $\alpha$ of the remaining unassigned nodes with the smallest values to bin 1, and so on.
4.1 Comparison
This section compares the proposed approach to other embedding methods for link prediction. Given a partially observed graph $G$ with a fraction of missing edges, the link prediction task is to predict these missing edges. We generate a labeled dataset of edges as done in [Grover and Leskovec2016]. Positive examples are obtained by removing a fraction of the edges at random, whereas negative examples are generated by randomly sampling an equal number of node pairs that are not connected by an edge, i.e., each node pair $(v_i, v_j) \notin E$. For each method, we learn features using the remaining graph, which consists of only positive examples. Using the learned embeddings from each method, we then learn a model to predict whether a given edge in the test set exists in $E$ or not. Notice that node embedding methods such as DeepWalk and node2vec require that each node in $G$ appear in at least one edge in the training graph (i.e., the graph remains connected); otherwise these methods are unable to derive features for such nodes. This is a significant limitation that prohibits their use in many real-world applications.
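The construction of the labeled edge dataset can be sketched as follows. This is an illustrative approximation, not the exact experimental protocol: the function name, the `frac_removed` parameter, and the toy cycle graph are assumptions, and the sketch does not check that the training graph stays connected after edge removal.

```python
import random

def make_link_prediction_split(edges, num_nodes, frac_removed, seed=0):
    """Hold out a fraction of edges as positives and sample as many non-edges."""
    rng = random.Random(seed)
    edges = [tuple(sorted(e)) for e in edges]      # undirected: canonical order
    edge_set = set(edges)
    shuffled = edges[:]
    rng.shuffle(shuffled)
    n_pos = int(frac_removed * len(shuffled))
    positives = shuffled[:n_pos]                   # removed (held-out) edges
    train_edges = shuffled[n_pos:]                 # remaining training graph
    negatives = set()
    # Assumes the graph has at least n_pos unconnected node pairs.
    while len(negatives) < n_pos:
        u, v = rng.sample(range(num_nodes), 2)
        pair = (min(u, v), max(u, v))
        if pair not in edge_set:
            negatives.add(pair)
    return train_edges, positives, sorted(negatives)

# Toy graph (hypothetical): a cycle on 6 vertices.
cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
train_edges, positives, negatives = make_link_prediction_split(cycle, 6, 0.5)
```

Embeddings would then be learned on `train_edges` only, and a classifier trained to separate `positives` from `negatives`.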
Graph  R2V  R2VDW  N2V  DW  LINE  S2V 

0.627  0.627  0.672  0.669  
0.731  0.716  0.716  0.691  0.729  
0.846  0.813  0.811  0.709  0.858  
0.768  0.735  0.620  0.791  
0.656  0.655  0.627  0.660  0.623  
0.847  0.756  0.745  0.769  0.857  
0.960  0.854  0.848  0.850  0.883  
0.597  0.580  0.498  0.551  0.590  
0.742  0.728  0.763  0.758  
0.925  0.804  0.738  0.768  0.861 
Graph  R2V  R2VDW  N2V  DW  LINE  S2V 

0.681  0.621  0.621  0.494  0.662  
0.715  0.715  0.562  0.736  
0.838  0.796  0.793  0.498  0.834  
0.738  0.678  0.660  0.533  0.699  
0.644  0.673  0.631  0.516  0.599  
0.821  0.746  0.731  0.471  0.843  
0.730  0.728  0.618  0.798  
0.593  0.508  0.549  0.553  
0.906  0.912  0.904  0.784  0.905  
0.885  0.808  0.797  0.650  0.841 
For comparison, we use the same set of binary operators as [Grover and Leskovec2016] to construct features for the edges by combining the learned embeddings of their endpoints. The AUC results are provided in Tables 1 and 2. Moreover, the AUC scores from our method are all significantly better than those of the other methods. Note that we also used the Role2Vec framework to generalize DeepWalk (DW) by using the notion of attributed random walks; we call this variant R2VDW. We summarize the gain/loss in predictive performance over the other methods in Figure 2. In all cases, our method achieves better predictive performance than the other methods across a wide variety of graphs with different characteristics. Overall, both the mean and product binary operators give an average gain in predictive performance over all graphs.
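The binary operators used to turn two endpoint embeddings into an edge feature vector can be sketched as below. The mean and product operators are the ones named in the text; the two additional operators (absolute and squared difference) are included on the assumption that the set matches the one described in the node2vec paper. The dictionary and function names are hypothetical.

```python
import numpy as np

# Element-wise binary operators combining two endpoint embeddings.
EDGE_OPERATORS = {
    "mean": lambda a, b: (a + b) / 2.0,
    "product": lambda a, b: a * b,             # Hadamard product
    "weighted_l1": lambda a, b: np.abs(a - b),
    "weighted_l2": lambda a, b: (a - b) ** 2,
}

def edge_features(emb_u, emb_v, op="mean"):
    """Feature vector for the edge (u, v) from the embeddings of its endpoints."""
    a = np.asarray(emb_u, dtype=float)
    b = np.asarray(emb_v, dtype=float)
    return EDGE_OPERATORS[op](a, b)

# Toy 2-dimensional embeddings (hypothetical values).
feat_mean = edge_features([1.0, 2.0], [3.0, 6.0], op="mean")
feat_prod = edge_features([1.0, 2.0], [3.0, 6.0], op="product")
```

With type-level embeddings, `emb_u` and `emb_v` would be looked up through each vertex's type, so two vertices of the same type contribute identical vectors.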
We also investigated learning types using low-rank matrix factorization (Eq. 4) with squared loss. No regularization or constraints were used. Eq. 5 was used to partition nodes into types. Results are provided in Table 3 and are comparable to previous results that use concatenation to derive types. Due to space constraints, we report results for only a few graphs using the mean operator.
R2V  0.710  0.748  0.867  0.926 

R2VFac.  0.707  0.761  0.848  0.905 
4.2 Space-efficient Embeddings
We now investigate the space-efficiency of the learned embeddings from the proposed framework and intermediate representation. Observe that any embedding method that implements the proposed attributed random walk framework (and intermediate representation) learns an embedding for each distinct node type. As described earlier in Sec. 2.3, in the worst case, an embedding is learned for each of the vertices in the graph and we recover the baseline methods [Perozzi et al.2014, Grover and Leskovec2016] as a special case. In general, the best embedding most often lies between such extremes, and therefore the embedding learned from a method implementing Role2Vec is often orders of magnitude smaller in size, since the number of types is typically much smaller than the number of vertices.
Given an attribute vector $x$ of motif counts (Figure 1) for an arbitrary node in $G$, we derive embeddings using each of the following:
(11)  
(12)  
(13)  
(14) 
where $\Phi$ is a function that maps the attribute vector $x$ to a type. In these experiments, we use logarithmic binning (applied to each motif feature) and define $\Phi$ as the concatenation of the logarithmically binned attribute values. Embeddings are learned using the different subsets of attributes in Eq. (11)–(14). For instance, Eq. (11) indicates that vertex types are derived using the (logarithmically binned) number of 2-stars and triangles incident to the given vertex (Figure 1). We measure the space (in bytes) required to store the embedding learned by each method. In Figure 3, we summarize the reduction in space from our approach compared to the other methods. In all cases, the embeddings learned from our approach require significantly less space and are thus more space-efficient. Specifically, averaged across all graphs, the embeddings from our approach require less space than those of the best competing method.
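Logarithmic binning, as described in the experimental setup, can be sketched as follows: repeatedly assign a fixed fraction of the not-yet-assigned nodes with the smallest values to the next bin. The parameter name `alpha`, its default value, and the toy feature vector are assumptions for illustration; the sketch also rounds bin sizes up and may split tied values across bins, which a careful implementation might handle differently.

```python
import numpy as np

def log_binning(values, alpha=0.5):
    """Assign bin 0 to the smallest alpha fraction, bin 1 to the next, and so on."""
    values = np.asarray(values)
    order = np.argsort(values, kind="stable")      # node indices, smallest first
    bins = np.zeros(len(values), dtype=int)
    start, b = 0, 0
    while start < len(values):
        remaining = len(values) - start
        size = max(1, int(np.ceil(alpha * remaining)))
        bins[order[start:start + size]] = b
        start += size
        b += 1
    return bins

# Toy motif counts for 8 nodes (hypothetical values with a heavy tail).
counts = [5, 1, 3, 100, 2, 7, 50, 4]
bins = log_binning(counts, alpha=0.5)
```

Binning each motif feature this way and concatenating the resulting bin ids per vertex yields discrete type labels, with heavy-tailed counts compressed into a small number of bins.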
5 Related Work
Recent embedding methods for graphs have largely been based on the popular skip-gram model [Mikolov et al.2013, Cheng et al.2006], originally introduced for learning vector representations of words in text. In particular, DeepWalk [Perozzi et al.2014] used this approach to embed the nodes such that the co-occurrence frequencies of pairs in short random walks are preserved. Node2vec [Grover and Leskovec2016] introduced hyperparameters to DeepWalk that tune the depth and breadth of the random walks. These approaches are becoming increasingly popular and have been shown to outperform a number of existing methods. These methods (and many others) are all based on simple random walks and thus are well-suited for generalization using the attributed random walk framework. While most network representation learning methods use only the graph [Perozzi et al.2014, Tang et al.2015, Cao et al.2015, Grover and Leskovec2016], our framework exploits both the graph and structural features (e.g., motifs).
While most work has focused on transductive (within-network) learning, there has been some recent work on graph-based inductive approaches. Yang et al. [Yang et al.2016] proposed an inductive approach called Planetoid. However, Planetoid is an embedding-based approach for semi-supervised learning and does not use any structural features. Rossi et al. [Rossi et al.2017] proposed an inductive approach for (attributed) networks called DeepGL that learns (inductive) relational functions representing compositions of one or more operators applied to an initial set of graph features. Recently, Hamilton et al. [Hamilton et al.2017] proposed a similar approach, GraphSage, that also aggregates features from node neighborhoods. However, these approaches are not based on random walks. Heterogeneous networks [Shi et al.2014] have also been recently considered [Chang et al.2015, Dong et al.2017], as well as attributed networks [Huang et al.2017b, Huang et al.2017a]. Huang et al. [Huang et al.2017b] proposed an approach for attributed networks with labels, whereas Yang et al. [Yang et al.2015] used text features to learn node representations. Liang et al. [Liang et al.2017] proposed a semi-supervised approach for networks with outliers. Bojchevski et al. [Bojchevski and Günnemann2017] proposed an unsupervised rank-based approach. Coley et al. [Coley et al.2017] introduced a convolutional approach for attributed molecular graphs that learns graph embeddings as opposed to node embeddings. Duran et al. [Duran and Niepert2017] proposed an embedding propagation method to learn node representations. However, most of these approaches are neither inductive nor space-efficient.
6 Conclusion
This work proposed a flexible framework based on the notion of attributed random walks. The framework serves as a basis for generalizing existing techniques (that are based on random walks) for use with attributed graphs, unseen nodes, and graph-based transfer learning tasks, and it allows significantly larger graphs to be handled due to the inherent space-efficiency of the approach. Instead of learning individual embeddings for each node, our approach learns embeddings for each type based on functions that map feature vectors to types. This allows for both inductive and transductive learning.
References
 [Ahmed et al.2015] Nesreen K. Ahmed, Jennifer Neville, Ryan A. Rossi, and Nick Duffield. Efficient graphlet counting for large networks. In ICDM, page 10, 2015.
 [Ahmed et al.2016] Nesreen K. Ahmed, Jennifer Neville, Ryan A. Rossi, Nick Duffield, and Theodore L. Willke. Graphlet decomposition: Framework, algorithms, and applications. KAIS, pages 1–32, 2016.
 [Akoglu et al.2015] Leman Akoglu, Hanghang Tong, and Danai Koutra. Graph based anomaly detection and description: a survey. DMKD, 29(3):626–688, 2015.
 [Al Hasan and Zaki2011] Mohammad Al Hasan and Mohammed J Zaki. A survey of link prediction in social networks. In Social Network Data Analytics, pages 243–275. 2011.
 [Benson et al.2016] Austin R Benson, David F Gleich, and Jure Leskovec. Higher-order organization of complex networks. Science, 353(6295):163–166, 2016.
 [Bojchevski and Günnemann2017] Aleksandar Bojchevski and Stephan Günnemann. Deep gaussian embedding of attributed graphs: Unsupervised inductive learning via ranking. arXiv:1707.03815, 2017.
 [Cao et al.2015] Shaosheng Cao, Wei Lu, and Qiongkai Xu. Grarep: Learning graph representations with global structural information. In CIKM, pages 891–900. ACM, 2015.
 [Cavallari et al.2017] Sandro Cavallari, Vincent W Zheng, Hongyun Cai, Kevin Chen-Chuan Chang, and Erik Cambria. Learning community embedding with community detection and node embedding on graphs. In CIKM, pages 377–386, 2017.
 [Chang et al.2015] Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C Aggarwal, and Thomas S Huang. Heterogeneous network embedding via deep architectures. In SIGKDD, pages 119–128, 2015.

 [Cheng et al.2006] Winnie Cheng, Chris Greaves, and Martin Warren. From n-gram to skip-gram to conc-gram. Int. J. of Corp. Linguistics, 11(4):411–433, 2006.
 [Coley et al.2017] Connor W Coley, Regina Barzilay, William H Green, Tommi S Jaakkola, and Klavs F Jensen. Convolutional embedding of attributed molecular graphs for physical property prediction. J. Chem. Info. & Mod., 2017.
 [Dong et al.2017] Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. metapath2vec: Scalable representation learning for heterogeneous networks. In SIGKDD, pages 135–144, 2017.
 [Duran and Niepert2017] Alberto Garcia Duran and Mathias Niepert. Learning graph representations with embedding propagation. In NIPS, pages 5119–5130, 2017.
 [Getoor and Taskar2007] L. Getoor and B. Taskar, editors. Intro. to SRL. MIT Press, 2007.
 [Goldsmith and Davenport1990] Timothy E Goldsmith and Daniel M Davenport. Assessing structural similarity of graphs. 1990.
 [Goyal and Ferrara2017] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. arXiv preprint arXiv:1705.02801, 2017.
 [Grover and Leskovec2016] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In SIGKDD, pages 855–864, 2016.
 [Hamilton et al.2017] William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. arXiv:1706.02216, 2017.
 [Harris1954] Zellig S Harris. Distributional structure. Word, 10(2-3):146–162, 1954.
 [Henderson et al.2011] Keith Henderson, Brian Gallagher, Lei Li, Leman Akoglu, Tina Eliassi-Rad, Hanghang Tong, and Christos Faloutsos. It’s who you know: graph mining using recursive structural features. In SIGKDD, pages 663–671. ACM, 2011.
 [Huang et al.2017a] Xiao Huang, Jundong Li, and Xia Hu. Accelerated attributed network embedding. In SDM, 2017.
 [Huang et al.2017b] Xiao Huang, Jundong Li, and Xia Hu. Label informed attributed network embedding. In WSDM, 2017.
 [Koyutürk et al.2006] Mehmet Koyutürk, Yohan Kim, Umut Topkara, Shankar Subramaniam, Wojciech Szpankowski, and Ananth Grama. Pairwise alignment of protein interaction networks. JCB, 13(2):182–199, 2006.

 [Kuwadekar and Neville2011] Ankit Kuwadekar and Jennifer Neville. Relational active learning for joint collective classification models. In ICML, pages 385–392, 2011.
 [Liang et al.2017] Jiongqian Liang, Peter Jacobs, and Srinivasan Parthasarathy. Seano: Semi-supervised embedding in attributed networks with outliers. arXiv:1703.08100, 2017.
 [Lorrain and White1977] Francois Lorrain and Harrison C White. Structural equivalence of individuals in social networks. In Social Networks, pages 67–98. Elsevier, 1977.
 [Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
 [Neville and Jensen2000] Jennifer Neville and David Jensen. Iterative classification in relational data. In AAAI SRL Workshop, pages 13–20, 2000.

 [Ng et al.2002] Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, pages 849–856, 2002.
 [Perozzi et al.2014] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In SIGKDD, pages 701–710, 2014.
 [Ribeiro et al.2017] Leonardo F.R. Ribeiro, Pedro H.P. Saverese, and Daniel R. Figueiredo. Struc2vec: Learning node representations from structural identity. In SIGKDD, pages 385–394, 2017.
 [Rossi and Ahmed2015a] R.A. Rossi and N.K. Ahmed. Role discovery in networks. TKDE, 27(4):1112–1131, 2015.
 [Rossi and Ahmed2015b] Ryan A. Rossi and Nesreen K. Ahmed. The network data repository with interactive graph analytics and visualization. In AAAI, pages 4292–4293, 2015.
 [Rossi et al.2017] Ryan A. Rossi, Rong Zhou, and Nesreen K. Ahmed. Deep feature learning for graphs. In arXiv:1704.08829, 2017.
 [Rudolph et al.2016] Maja Rudolph, Francisco Ruiz, Stephan Mandt, and David Blei. Exponential family embeddings. In Advances in Neural Information Processing Systems, pages 478–486, 2016.
 [Shi et al.2014] Chuan Shi, Xiangnan Kong, Yue Huang, Philip S Yu, and Bin Wu. HeteSim: A General Framework for Relevance Measure in Heterogeneous Networks. TKDE, 26(10):2479–2492, 2014.
 [Tang et al.2015] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale Information Network Embedding. In WWW, pages 1067–1077, 2015.
 [Yang et al.2015] Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Y Chang. Network representation learning with rich text information. In IJCAI, pages 2111–2117, 2015.
 [Yang et al.2016] Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. arXiv:1603.08861, 2016.
 [Zager and Verghese2008] Laura A Zager and George C Verghese. Graph similarity scoring and matching. Applied mathematics letters, 21(1):86–94, 2008.