## 1. Introduction

Recent advances in deep learning and convolutional neural networks (CNNs) (Krizhevsky et al., 2012) have led to remarkable breakthroughs in many fields, such as machine translation (Wu et al., 2016) and reading comprehension (Yu et al., 2018) in natural language processing (NLP), and object detection (Szegedy et al., 2013) and image classification (Marino et al., 2017) in computer vision (CV). In addition to text, audio, image and video data, information networks (or graphs) are another type of natural and complex data structure, representing a set of entities and their relationships. A wide variety of real-world data in business, science and engineering domains are best captured as information networks, such as protein interaction networks, citation networks, and social media networks like Facebook and LinkedIn, to name a few.

Network representation learning (NRL), also known as network embedding, trains a neural network to represent an information network as a collection of node-embedding vectors in a latent space such that the desired network features are preserved, which enables the trained NRL model to perform network analytics, such as link prediction or node clustering, as shown in Fig. 1. The goal of NRL is to employ deep learning algorithms to encode useful network information into latent semantic representations, which can be deployed for popular network analytics, such as node classification, link prediction and community detection, and for domain-specific network mining, such as social recommendation (Fan et al., 2019; Wang et al., 2019), protein-to-protein interaction prediction (Fout et al., 2017), disease-gene association identification (Han et al., 2019), automatic molecule optimization (Fu et al., 2020) and Bitcoin transaction forecasting (Wei et al., 2020). Different from traditional feature engineering, which relies heavily on handcrafted statistics to extract structural information, NRL introduces a new data-driven deep learning paradigm to capture, encode and embed structural features along with non-structural features into a latent space represented by dense and continuous vectors. By embedding edge semantics into node vectors, a variety of network operations can be carried out efficiently, e.g., computing the similarity between a pair of nodes or visualizing a network in a 2-dimensional space. Moreover, parallel processing on large-scale networks can be naturally supported with the node embeddings learned from NRL.

Most existing network representation learning efforts target learning node embeddings of a homogeneous network, in which all nodes are of one type and all edges belong to a single type of node relationship, e.g., a social network is considered homogeneous when we only consider users and their friendship relationships (Perozzi et al., 2014). A heterogeneous information network consists of nodes and edges of heterogeneous types, corresponding to different types of entities and different kinds of relations, respectively. Knowledge graphs (Guo et al., 2018; Ji et al., 2020) and RDF graphs (Yuan et al., 2013) are well-known examples of heterogeneous information networks.

DeepWalk (Perozzi et al., 2014) is the first node embedding algorithm that learns to encode the neighborhood features of each node in a homogeneous graph through learning the encoding of its scoped random walk properties using the autoencoder algorithms in conjunction with node2vec
(Grover and Leskovec, 2016). Inspired by DeepWalk design, dozens of node embedding algorithms have been proposed (Grover and Leskovec, 2016; Ribeiro et al., 2017; Donnat et al., 2018; Rozemberczki et al., 2019; Zhang et al., 2018a; Epasto and Perozzi, 2019; Liu et al., 2019; Tang et al., 2015b; Wang et al., 2018; Cao et al., 2015; Yang et al., 2015; Wang et al., 2016; Cao et al., 2016; Qu et al., 2017; Tsitsulin et al., 2018; Hamilton et al., 2017b; Bojchevski and Günnemann, 2018; Guo et al., 2019; Rossi et al., 2018; Sun et al., 2019b; Deng et al., 2020; Kipf and Welling, 2017; Wu et al., 2019a; Zhang and Chen, 2018; Cen et al., 2019; Liao et al., 2018; Perozzi et al., 2017). Although most of them focus on learning node embeddings for homogeneous networks, they differ in terms of the specific encoding schemes and the specific types of node semantics captured and used for learning node embedding. This survey paper mainly reviews the design principles and the different node embedding techniques developed for network representation learning over homogeneous networks. To facilitate the comparison of different node embedding algorithms, we introduce a unified reference framework to divide and generalize the node embedding learning process on a given network into preprocessing steps, node feature extraction steps and node embedding model that can be used for link prediction and node clustering. 
With this unifying reference framework, we highlight the most representative methods, models, and techniques used at different stages of the node embedding model learning process. We argue that an in-depth understanding of different node embedding methods, models and techniques is also essential for other types of network representation learning approaches that are built on top of node embedding techniques, such as edge embedding (Abu-El-Haija et al., 2017; Gao et al., 2019), subgraph embedding (Bordes et al., 2014; Cao et al., 2018) and entire-graph embedding (Bai et al., 2019; Narayanan et al., 2017). For example, an edge can be represented by the Hadamard product of its two adjacent nodes’ vectors. Similarly, graph coarsening mechanisms (Chen et al., 2018b; Yu et al., 2019) may create a hierarchy by successively clustering the nodes in the input graph into smaller graphs connected in a hierarchical manner, which can be used to generate representations for subgraphs and even for the entire graph.
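To make the edge-embedding remark concrete, the following minimal sketch computes the Hadamard (element-wise) product of two node vectors. The function name `hadamard_edge_embedding` and the toy 3-dimensional embeddings are illustrative assumptions, not taken from any of the cited methods.

```python
def hadamard_edge_embedding(u_vec, v_vec):
    """Element-wise (Hadamard) product of two node embedding vectors."""
    if len(u_vec) != len(v_vec):
        raise ValueError("node vectors must have the same dimension")
    return [a * b for a, b in zip(u_vec, v_vec)]


# Hypothetical 3-dimensional embeddings of two adjacent nodes.
node_embeddings = {
    "u": [0.5, -1.0, 2.0],
    "v": [4.0, 0.5, 0.0],
}

edge_uv = hadamard_edge_embedding(node_embeddings["u"], node_embeddings["v"])
print(edge_uv)  # [2.0, -0.5, 0.0]
```

Because the product is symmetric, the same edge vector is obtained regardless of the order of the two endpoints, which suits undirected edges.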

We conjecture that this survey paper not only helps researchers and practitioners to gain an in-depth understanding of different network representation learning techniques, but also provides practical guidelines for designing and developing the next generation of network representation learning algorithms and systems.

Current surveys (Cai et al., 2018; Hamilton et al., 2017a; Cui et al., 2019) primarily focus on presenting a taxonomy to review the existing work on network representation learning. Concretely, Cai et al. (2018), which first appeared on arXiv in 2017 and was published in 2018, propose two taxonomies of graph embedding based on problem settings and techniques, respectively. Cui et al. (2019) propose a taxonomy of network embedding according to the types of information preserved. Hamilton et al. (2017a), which appeared in 2017 in the IEEE Data Eng. Bulletin, describe a set of conventional node embedding methods with a focus on pairwise proximity methods and neighborhood aggregation based methods. In contrast, our unified reference framework provides a broader and more comprehensive comparative review of the state of the art in network representation learning. In our three-stage reference framework, each stage serves as a categorization of the technical solutions dedicated to the tasks of that stage. For example, we not only review node embedding models using the unified framework, but also describe a set of optimization techniques that are commonly used in different node embedding methods and give an overview of recent advances in NRL.

We list the mathematical notations used throughout the paper in Table 1 for reference convenience.

The remainder of this survey is structured as follows. In Section 2, we describe the basic steps of network representation learning to generate node embeddings using the autoencoder approach. In Section 3, we present an overview of the unifying three-stage reference framework for NRL, and discuss representative methods, models, and optimization techniques used at each stage. In Section 4, we review recent advances in conventional NRL, distributed NRL, multi-NRL, dynamic NRL and knowledge graph representation learning by using the proposed reference framework and discuss several open challenges. In Section 5, we conclude our survey.

| Notation/Symbol | Meaning | Notation/Symbol | Meaning |
| --- | --- | --- | --- |
| $\mathbf{W}$, $\mathbf{U}$, $\mathbf{M}$ | bold capital letters represent matrices | $\mathbf{h}$ | bold lowercase letters represent vectors |
| | lowercase letters represent nodes | | italic lowercase letters represent vector elements |
| | mapping function, i.e., encoder | | decoder |
| $\mathbf{A}$ | adjacency matrix | $\mathbf{L}$ | Laplacian matrix |
| | neighborhood of a node | | persona graph |
| | ego-network of a node | | $k$-step transition matrix |
| | additional information matrix | | empirical probability |
| | filter of a convolution operation | | Chebyshev polynomials |
| | matrix of filter parameters | | aggregation function in a given layer |
| | set of negative samples | | learning rate |
| | first loss, second loss, loss | | bias vector |
| | an instance and its corrupted form | | matrix of representations in a given layer of the network |
| | rooted PageRank matrix | | timestamp before the current event |
| | representations of head, translation and tail | | distance function of a triple |

## 2. Network Representation Learning: What and How

To establish a common ground for introducing design principles of NRL methods and techniques, we first provide a walkthrough example to illustrate how network representation learning works from the autoencoder perspective. We first briefly describe DeepWalk (Perozzi et al., 2014) as it will be used as the reference NRL model in this section.

DeepWalk (Perozzi et al., 2014) generalizes advances in language modeling, e.g., word2vec (Mikolov et al., 2013b), to networks. In a language model, the corpus is built by collecting sentences from many documents. If we regard node traveling paths as sentences, we can build a corpus for network representation learning. Given an input network, we use the random walk method to generate multiple node traveling paths, controlled by three parameters: the number of random walks issued from each node, the path length, and the probability of stopping a walk and restarting from the initial node.
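The corpus construction can be sketched as follows. This is a minimal illustration under stated assumptions rather than DeepWalk's reference implementation: the parameter names `walks_per_node`, `walk_length` and `restart_prob` are ours, the restart is interpreted as jumping back to the start node and continuing the same path, and the 8-node graph is a hypothetical example.

```python
import random


def random_walk(adj, start, walk_length, restart_prob, rng):
    """One truncated random walk that may jump back to its start node."""
    walk = [start]
    current = start
    while len(walk) < walk_length:
        neighbors = adj[current]
        if not neighbors:
            break  # dead end: stop the walk early
        if rng.random() < restart_prob:
            current = start  # restart from the initial node
        else:
            current = rng.choice(neighbors)
        walk.append(current)
    return walk


def build_corpus(adj, walks_per_node, walk_length, restart_prob, seed=0):
    """Issue `walks_per_node` walks from every node and collect them."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        nodes = list(adj)
        rng.shuffle(nodes)  # visit start nodes in a random order each pass
        for node in nodes:
            corpus.append(
                random_walk(adj, node, walk_length, restart_prob, rng)
            )
    return corpus


# Hypothetical undirected graph with 8 nodes, stored as adjacency lists.
adj = {
    1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3, 5],
    5: [4, 6, 7], 6: [5, 7, 8], 7: [5, 6, 8], 8: [6, 7],
}

corpus = build_corpus(adj, walks_per_node=4, walk_length=10, restart_prob=0.1)
print(len(corpus))  # 32 walks: 4 per node for 8 nodes
```

Each walk plays the role of a sentence, and the list of walks plays the role of the corpus.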

In the language processing field, skip-gram and continuous bag-of-words (CBOW) (Mikolov et al., 2013b) are two commonly used models for estimating the likelihood of co-occurrence among words in a training set. For network representation learning, training instances are extracted from node traveling paths based on a sliding window. An instance consists of a target node and its context located within a fixed window size, e.g., (3, (1,7)). In the meantime, we also need to build a vocabulary to index and sort all nodes by their frequency in the corpus, and then build a Huffman tree based on the frequency for hierarchical softmax.
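The sliding-window extraction of training instances can be illustrated with a short sketch. The function name `extract_instances` and the example walk are assumptions made for illustration; the window here counts positions on each side of the target node.

```python
def extract_instances(walk, window):
    """Return (target, context) pairs for one node traveling path.

    The context of a target holds the nodes located at most `window`
    positions before or after it in the walk.
    """
    instances = []
    for i, target in enumerate(walk):
        left = walk[max(0, i - window):i]
        right = walk[i + 1:i + 1 + window]
        context = tuple(left + right)
        if context:
            instances.append((target, context))
    return instances


walk = [1, 3, 7, 5]  # hypothetical node traveling path
print(extract_instances(walk, window=1))
# [(1, (3,)), (3, (1, 7)), (7, (3, 5)), (5, (7,))]
```

With a window of one position per side, a target node has at most two context nodes, which reproduces the instance (3, (1,7)) for the path above.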

The learning model shown in Fig. 2 contains three layers and follows a typical autoencoder paradigm. The input vectors are encoded into latent representation vectors by means of a matrix W, which is known as encoding, and then the latent representation vectors are reconstructed into output vectors by means of a matrix U, which is known as decoding. Given a training instance, the skip-gram model predicts the context given a target node, whereas the CBOW model predicts the target node given its context. The skip-gram model is widely adopted in network representation learning, since the conditional probability can be decomposed into multiple simple conditional probabilities under the independence assumption,

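$$\Pr\big(\{v_{i-w}, \ldots, v_{i+w}\} \setminus v_i \mid \Phi(v_i)\big) = \prod_{j=i-w,\; j \neq i}^{i+w} \Pr\big(v_j \mid \Phi(v_i)\big) \tag{1}$$

where $w$ is the window size and $\Phi$ is a mapping function that embeds nodes into the low-dimensional vector space, i.e., $\Phi(v_i)$ refers to the target node $v_i$’s learned representation. Here $\Phi$ acts as the encoder from the autoencoder perspective. Given the embedding of the target node, the conditional probability acts as the decoder that captures the reconstruction of the target node and its context nodes in the original network. A loss function is defined accordingly to measure the reconstruction error and the learning model is trained by minimizing the loss. To be specific, for each training instance, the target node $v_i$ is initially encoded as a one-hot vector $\mathbf{x}$ whose dimension equals the vocabulary size $|V|$, and $\mathbf{W}$ is a $|V| \times d$ matrix whose entries are initialized randomly within a range. One-hot encoding implies that $\mathbf{h} = \mathbf{W}^{\top}\mathbf{x}$ is a $d$-dimensional vector that simply copies the row of $\mathbf{W}$ associated with the target node $v_i$. A $d \times |V|$ matrix $\mathbf{U}$ is set to decode the encoded vector $\mathbf{h}$, where the conditional probability of an output node $v_o$ is obtained by applying softmax, i.e., $\Pr(v_o \mid \Phi(v_i)) = \exp(\mathbf{u}_o^{\top}\mathbf{h}) / \sum_{k=1}^{|V|} \exp(\mathbf{u}_k^{\top}\mathbf{h})$, with $\mathbf{u}_k$ the column of $\mathbf{U}$ associated with node $v_k$. But softmax is not scalable, because for each training instance it requires repeating the vector multiplication $|V|$ times to obtain the denominator.

To improve the computation efficiency of the decoder, DeepWalk uses hierarchical softmax (Mikolov et al., 2013a; Morin and Bengio, 2005) instead of softmax to implement the conditional probability factorization. The hierarchical softmax model builds a binary tree and places all network nodes on the leaf layer. Then there will be $|V| - 1$ branch nodes and each of them has an associated $d$-dimensional vector. The output node $v_o$ in a training instance corresponds to a leaf node in the tree, representing the probability of $v_o$ in the output layer given target node $v_i$. It is easy to identify a unique path from the root node to node $v_o$, and the conditional probability can be computed based on the path, i.e.,

$$\Pr\big(v_o \mid \Phi(v_i)\big) = \prod_{j=1}^{l} \Pr\big(b_j \mid \Phi(v_i)\big) \tag{2}$$

where $l$ is the length of the path toward $v_o$ and $b_j$ denotes the branch taken at the $j$-th branch node on that path. DeepWalk uses a Huffman tree to implement hierarchical softmax due to its optimal property on average path length. On the path toward leaf node $v_o$, a binary classifier is used to compute the probability of going left or right at each branch node, i.e.,

$$\Pr\big(b_j \mid \Phi(v_i)\big) = \begin{cases} \sigma(\mathbf{u}_j^{\top}\mathbf{h}) & \text{if the path goes left}, \\ 1 - \sigma(\mathbf{u}_j^{\top}\mathbf{h}) & \text{if the path goes right}, \end{cases} \tag{3}$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function and $\mathbf{u}_j$ is the vector of the $j$-th branch node on the path.

The model’s goal is to maximize the conditional probability, which is equivalent to minimizing the following loss function

$$\mathcal{L} = -\log \Pr\big(v_o \mid \Phi(v_i)\big) = -\sum_{j=1}^{l} \log \Pr\big(b_j \mid \Phi(v_i)\big) \tag{4}$$

To this end, the back-propagation method is used to update the two weight matrices $\mathbf{W}$ and $\mathbf{U}$ with gradient descent. Firstly, we take the derivative of the loss with regard to each $\mathbf{u}_j^{\top}\mathbf{h}$ on the path toward the context node and obtain

$$\frac{\partial \mathcal{L}}{\partial (\mathbf{u}_j^{\top}\mathbf{h})} = \sigma(\mathbf{u}_j^{\top}\mathbf{h}) - t_j \tag{5}$$

where $t_j = 1$ if the path goes left and $t_j = 0$ otherwise. The corresponding vectors in matrix $\mathbf{U}$ are updated by

$$\mathbf{u}_j \leftarrow \mathbf{u}_j - \eta \big(\sigma(\mathbf{u}_j^{\top}\mathbf{h}) - t_j\big)\, \mathbf{h} \tag{6}$$

where $\eta$ is the learning rate. Then for each context node in an instance, we take the derivative of the loss with regard to the hidden vector $\mathbf{h}$ and obtain

$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}} = \sum_{j=1}^{l} \big(\sigma(\mathbf{u}_j^{\top}\mathbf{h}) - t_j\big)\, \mathbf{u}_j \tag{7}$$

The vector in matrix $\mathbf{W}$, i.e., the row $\mathbf{h}$ associated with the target node, is updated accordingly by

$$\mathbf{h} \leftarrow \mathbf{h} - \eta\, \frac{\partial \mathcal{L}}{\partial \mathbf{h}} \tag{8}$$

The model learns the latent representation for every node by updating the matrices iteratively until they eventually stabilize. The hidden vectors are the node representations learned from the network.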
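The update procedure can be sketched in code. This is a minimal sketch assuming the standard hierarchical-softmax formulation: the probability of going left at a branch node is the sigmoid of the dot product between that branch node's vector and the hidden vector, the loss is the negative log of the path probability, and both the branch vectors and the hidden vector are moved against their gradients. The function names, the toy path and the learning rate `lr` are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def path_probability(h, path, branch_vectors):
    """Probability of reaching a leaf: product of the decisions on its path.

    `path` is a list of (branch_id, go_left) pairs from the root to the leaf.
    """
    prob = 1.0
    for branch_id, go_left in path:
        p_left = sigmoid(dot(branch_vectors[branch_id], h))
        prob *= p_left if go_left else 1.0 - p_left
    return prob


def hs_update(h, path, branch_vectors, lr):
    """One gradient-descent step on loss = -log(path_probability).

    Branch vectors are updated in place; the updated hidden vector is returned.
    """
    grad_h = [0.0] * len(h)
    for branch_id, go_left in path:
        u = branch_vectors[branch_id]
        t = 1.0 if go_left else 0.0
        err = sigmoid(dot(u, h)) - t      # d loss / d (u . h)
        for k in range(len(h)):
            grad_h[k] += err * u[k]       # uses the branch vector before its update
            u[k] -= lr * err * h[k]       # update the branch vector
    return [hk - lr * g for hk, g in zip(h, grad_h)]


# Toy setting: a 2-dimensional hidden vector and three zero-initialised
# branch vectors on the path root -> left -> right -> left.
h = [0.2, -0.1]
branch_vectors = {0: [0.0, 0.0], 1: [0.0, 0.0], 2: [0.0, 0.0]}
path = [(0, True), (1, False), (2, True)]

before = path_probability(h, path, branch_vectors)
h = hs_update(h, path, branch_vectors, lr=0.5)
after = path_probability(h, path, branch_vectors)
print(before, after)
```

With zero-initialised branch vectors every decision starts at probability 0.5, so the path probability before the step is 0.5 cubed, and a single step raises it.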

Fig. 2 shows a simple network with 8 nodes. We build a corpus by running the random walk 4 times for each node and obtain 32 node sequences. We also generate multiple training instances by means of a sliding window whose size allows a target node to have at most 2 context nodes. Instance (3, (1,7)) is the first one used for the following training, and the target node’s input vector is node 3’s one-hot vector (0,0,1,0,0,0,0,0). After encoding by weight matrix W, we obtain node 3’s hidden vector, which is the third row of W. In order to obtain the conditional probability, we build a Huffman tree in the output layer. The weight matrix U is a collection of all branch nodes’ vectors, and they are initialized to be zero vectors so as to make sure that the probabilities of going left and going right at each branch are initially identical, i.e., both equal 0.5. Two unique paths lead from the root to leaf nodes 1 and 7, respectively, and the conditional probability of each context node is then obtained based on Eq. (2). In the following process, we update the correlated vectors in the weight matrices backward to minimize the loss: the vectors of the branch nodes on the two paths are updated based on Eq. (6), and then node 3’s hidden vector is updated based on Eq. (8). After we finish the training against all instances, the representation learned from the model is the hidden vector. Since the input layer uses one-hot encoding, the network representation is the weight matrix W. We plot these representations in a two-dimensional coordinate system for visualization, where community structure can be observed.

## 3. The Reference Framework for Network Representation Learning

We present a unified reference framework, illustrated in Fig. 3, to capture the workflow of network representation learning; it contains three consecutive stages. The first stage is network data preprocessing, which is responsible for obtaining the desired network structure information from the original input network. During this stage, the prime aim is to employ a preprocessing method suited to the learning task to transform the input network into a set of internal data structures that are more suitable for structural feature extraction in the next stage. Node state and edge state provide additional context information beyond the network topological structure, which is useful and can be leveraged for learning network representations. Different NRL algorithms tend to make different design choices on which additional information is utilized to augment the node context information in addition to the network topological structure. The second stage is network feature extraction, which is responsible for sampling training instances from the input network structure. Prior to sampling, it should choose the source of raw features that helps preserve the expected network properties. These properties may optionally be inferred from a specific learning task. The source of raw features can be classified into local structure (e.g., node degree, node neighbors, etc.) and global structure (e.g., multi-hop neighborhoods, node rank, etc.) with respect to every node in the raw input graph. Different sampling methods are used for extracting features from different structures. The third stage is learning node embeddings over the training set. Different embedding models can be leveraged to learn hidden features for node embedding, such as matrix factorization, probabilistic models and graph neural networks. These representative embedding models are often coupled with optimization techniques, such as hierarchical softmax, negative sampling, and attention mechanisms, for better embedding quality.

### 3.1. Network Data Preprocessing

Network data preprocessing is the data preparation stage for network representation learning. When end-users have different applications in mind for deploying NRL models, different data preprocessing methods should be employed. Hence, specific learning tasks are discussed in this section on network data preprocessing. For example, suppose a trained NRL model is used for node classification or node clustering that categorizes new nodes based on their node state information and edge neighbor information. Then the preprocessing stage should employ techniques that process the raw graph input to obtain the required node state properties and node linkage properties for deep feature extraction (the second stage) before NRL model training (the third stage of the workflow). However, if the end-users prefer to perform node clustering or classification based only on network topology and traversal patterns over the network structure rather than node state information, then the pairwise node relationships over the entire network and their hop counts are critical in the preprocessing stage, in order to learn the node distance features in terms of graph traversal semantics in stage 2. These two stages ensure that the NRL model training in stage 3 delivers a high-quality NRL model for end-users to perform their task-specific node classification or node clustering, which is network-data and domain specific in real-world applications.

#### 3.1.1. Network Preprocessing Methods

To effectively capture useful features, the network structure is usually preprocessed before feature extraction. We categorize current preprocessing methods into two types:

Matrix-based Processing. In most cases, an adjacency matrix $\mathbf{A}$ is used to represent a network, where each entry directly describes the connection between a pair of nodes; the connection is also the basic unit of network structure. According to the hypothesis in DeepWalk that nodes with similar contexts are similar, a node’s contexts are defined as the set of nodes reached from it. In order to reflect transitions between nodes, a transition matrix $\mathbf{P} = \mathbf{D}^{-1}\mathbf{A}$ is proposed in (Cao et al., 2015), where $\mathbf{D}$ is the diagonal matrix such that $D_{ii} = \sum_j A_{ij}$, and $P_{ij}$ refers to the one-step transition probability from node $v_i$ to node $v_j$. Accordingly, the transition matrix can be generalized to multiple steps, and the transition matrix within a given number of steps (Qiu et al., 2018) is obtained from the powers of the one-step transition matrix. Yang and Yang (2018) defined two proximity matrices: the first-order proximity matrix is the adjacency matrix, and each entry of the second-order proximity matrix is computed from the corresponding pair of rows of the first-order proximity matrix. As an important branch of embedding, spectral methods require converting the adjacency matrix $\mathbf{A}$ into the Laplacian matrix $\mathbf{L}$ before applying Laplacian Eigenmaps (Belkin and Niyogi, 2001), where $\mathbf{L} = \mathbf{D} - \mathbf{A}$.
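The matrix-based preprocessing steps can be sketched with plain nested lists. The sketch assumes the common definitions: the one-step transition matrix is the row-normalised adjacency matrix, a multi-step transition matrix is a power of the one-step matrix, and the (unnormalised) Laplacian is the degree matrix minus the adjacency matrix. Function names and the 3-node path graph are illustrative.

```python
def transition_matrix(A):
    """Row-normalise A so that P[i][j] = A[i][j] / degree(i)."""
    P = []
    for row in A:
        d = sum(row)
        P.append([a / d if d else 0.0 for a in row])
    return P


def matmul(X, Y):
    """Plain matrix product of two nested-list matrices."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def k_step_transition(P, k):
    """k-th power of the one-step transition matrix (k >= 1)."""
    result = P
    for _ in range(k - 1):
        result = matmul(result, P)
    return result


def laplacian(A):
    """Unnormalised graph Laplacian: degree matrix minus adjacency matrix."""
    n = len(A)
    return [
        [(sum(A[i]) if i == j else 0) - A[i][j] for j in range(n)]
        for i in range(n)
    ]


# Path graph 0 - 1 - 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]

P = transition_matrix(A)
print(P)                        # [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]
print(k_step_transition(P, 2))  # [[0.5, 0.0, 0.5], [0.0, 1.0, 0.0], [0.5, 0.0, 0.5]]
print(laplacian(A))             # [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
```

For graphs of realistic size a sparse matrix library would be the natural choice; nested lists are used here only to keep the sketch dependency-free.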

Graph Decomposition. Graph decomposition is another important type of data preprocessing method, by which the original network is decomposed into multiple graphs, each consisting of a subset of nodes and a group of edges that either correspond to connections in the original network or are added based on certain rules. These graphs may be connected together to form a new network in order to extract features for a specific network property.

To capture the structural proximity, the context graph (Ribeiro et al., 2017) is proposed by leveraging the decomposition idea. The context graph is a multi-layer weighted graph and each layer is a complete graph of all nodes. In particular, layer $k$ compares nodes through their $k$-hop neighborhoods, and the weight of the edge between nodes $u$ and $v$ in layer $k$ is defined as $w_k(u, v) = e^{-f_k(u, v)}$, where $f_k(u, v)$ represents the $k$-structural distance between nodes $u$ and $v$ calculated based on their $k$-hop neighborhoods. For adjacent layers, the corresponding nodes are connected by directed edges, and the edge weights in the upward and downward directions are defined as $w(u_k, u_{k+1}) = \log(\Gamma_k(u) + e)$ and $w(u_k, u_{k-1}) = 1$, respectively, where $\Gamma_k(u)$ denotes the number of edges incident to node $u$ in layer $k$ whose weights are higher than the average weight of layer $k$.

In social networks, the ego-network induced on the neighborhood of a user $u$ is often used to denote the user’s social circle. If the interactions of every node in the original network are divided into several semantic subgroups, these subgroups will capture different components of the user’s network behavior. Based on this idea, Epasto and Perozzi (2019) propose a method to convert a network $G$ into its persona graph $G_P$, where each original node corresponds to multiple personas. Formally, given the original network $G = (V, E)$ and a clustering algorithm $\mathcal{A}$, the method consists of three steps: 1) for each node $u \in V$, its ego-network is partitioned into $t_u$ disjoint components via $\mathcal{A}$, denoted by $N_u^1, \ldots, N_u^{t_u}$; 2) collect a set of personas, where each original node $u$ induces $t_u$ personas $u_1, \ldots, u_{t_u}$; 3) add a connection between personas $u_i$ and $v_j$ if and only if $(u, v) \in E$, $v \in N_u^i$ and $u \in N_v^j$. Since social users often participate in different communities, the persona graph obtained via the above procedure presents a much better community structure for further embeddings.
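The three steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: connected components of the ego-network stand in for the clustering algorithm (which the method leaves open), the ego-network excludes the ego node itself, personas are named as `(node, component_index)` pairs, and the bow-tie graph is a hypothetical example.

```python
def ego_network(adj, u):
    """Subgraph induced on u's neighbours (u itself excluded)."""
    nbrs = adj[u]
    return {v: adj[v] & nbrs for v in nbrs}


def connected_components(sub_adj):
    """Stand-in clustering algorithm: connected components of a subgraph."""
    seen, components = set(), []
    for start in sorted(sub_adj):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(sub_adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components


def persona_graph(adj):
    # Step 1: partition every node's ego-network into components.
    partitions = {u: connected_components(ego_network(adj, u)) for u in adj}
    # Step 2: each component of u's ego-network induces one persona of u;
    # record which persona of u "owns" each neighbour v.
    persona_of = {}
    for u, comps in partitions.items():
        for idx, comp in enumerate(comps):
            for v in comp:
                persona_of[(u, v)] = (u, idx)
    # Step 3: for every original edge (u, v), connect the persona of u
    # that owns v with the persona of v that owns u.
    edges = set()
    for u in adj:
        for v in adj[u]:
            if u < v:
                edges.add((persona_of[(u, v)], persona_of[(v, u)]))
    return partitions, edges


# Bow-tie graph: node 0 belongs to two triangles, {0, 1, 2} and {0, 3, 4}.
adj = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4}, 4: {0, 3}}

partitions, persona_edges = persona_graph(adj)
print(len(partitions[0]))    # 2 personas for node 0, one per triangle
print(sorted(persona_edges))
```

In this example node 0 is split into two personas, so the two triangles become separate components of the persona graph.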

For very large graphs whose scale exceeds the capability of a single machine, the graph must be decomposed into multiple partitions before further execution. Lerer et al. (2019) propose a block decomposition method that first splits entity nodes into $P$ parts and then divides edges into $P^2$ buckets based on the partitions of their source and destination entity nodes. For example, for an edge $(u, v)$, if source $u$ and destination $v$ are in partitions $p_1$ and $p_2$, respectively, then it is placed into bucket $(p_1, p_2)$. For each bucket $(p_1, p_2)$, the source and destination partitions $p_1$ and $p_2$ are swapped in from disk, and the edges are loaded accordingly for training. To ensure that embeddings in all partitions are aligned in the same latent space, an 'inside-out' ordering is adopted, which requires that each bucket has at least one previously trained embedding partition.
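The bucketing step can be illustrated with a short sketch. The partitioning rule below (integer node id modulo the number of partitions) is a hypothetical placeholder chosen for simplicity, and the function names are ours; the sketch covers only the assignment of edges to buckets, not the disk swapping or the bucket ordering.

```python
def partition_of(node, num_partitions):
    """Hypothetical partitioning rule: integer node ids assigned round-robin."""
    return node % num_partitions


def bucket_edges(edges, num_partitions):
    """Place each edge in the bucket indexed by (source partition, destination partition)."""
    buckets = {
        (p1, p2): []
        for p1 in range(num_partitions)
        for p2 in range(num_partitions)
    }
    for u, v in edges:
        key = (partition_of(u, num_partitions), partition_of(v, num_partitions))
        buckets[key].append((u, v))
    return buckets


edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
buckets = bucket_edges(edges, num_partitions=2)
for key in sorted(buckets):
    print(key, buckets[key])
# (0, 0) [(0, 2)]
# (0, 1) [(0, 1), (2, 3)]
# (1, 0) [(1, 2), (3, 0)]
# (1, 1) []
```

Training one bucket at a time only requires the embeddings of its two partitions to be in memory, which is what makes the decomposition useful for graphs that do not fit on a single machine.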

#### 3.1.2. Learning task

Node classification aims to assign each node to a suitable group such that nodes in a group have some similar features. Link prediction aims to find pairs of nodes that are most likely to be connected. Node classification and link prediction are the two most fundamental tasks for network analytics. Both tasks can be further instantiated into many practical applications such as social recommendation (Fan et al., 2019), knowledge completion (Lin et al., 2015), disease-gene association identification (Han et al., 2019), etc. Therefore, we mainly focus on these two basic tasks. For both tasks, a node’s topological structure is an important basis for classification and prediction. For example, nodes with more common neighbors often have a higher probability of being assigned to the same group or of being connected. This type of structure can be generalized to multi-hop neighborhoods (Zhang and Chen, 2018), which requires computing the transition matrix of the network.
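The common-neighbor intuition can be shown with a small sketch that scores every unconnected node pair by the number of neighbors the two nodes share. The function name and the 5-node graph are illustrative assumptions; this is a classical structural heuristic, not an NRL model.

```python
def common_neighbor_scores(adj):
    """Score every unconnected node pair by its number of shared neighbours."""
    nodes = sorted(adj)
    scores = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if v not in adj[u]:
                scores[(u, v)] = len(adj[u] & adj[v])
    return scores


# Hypothetical undirected graph stored as neighbour sets.
adj = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {2, 3, 5},
    5: {4},
}

scores = common_neighbor_scores(adj)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])  # (1, 4) 2
```

Nodes 1 and 4 share two neighbors (2 and 3), so the heuristic ranks the missing edge between them first.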

In addition to pure structural information, other useful information is available for learning tasks, e.g., node state information and edge state information. In many real-world networks, a node itself may contain state information such as node attributes and node labels, and this information may be essential for some tasks. For example, in social networks, besides embedding social connections, we can also encode user attributes to obtain more comprehensive representations for individuals (Zhang et al., 2018b; Liao et al., 2018). Node attributes are also an important source of features to be aggregated for inductive embeddings (Hamilton et al., 2017b; Bojchevski and Günnemann, 2018). Nodes with similar attributes and/or structures are more likely to be connected or classified together. Meanwhile, some node labels are usually fed into supervised models to boost the task of node classification (Kipf and Welling, 2017; Zhang et al., 2019). As the most common edge state information, edge weights can be integrated with topological structure to achieve more accurate classification and prediction. Besides, edges in some networks may have signs. Take Epinions and Slashdot as examples: users of these two social network sites are allowed to connect to other users with either positive or negative edges, where positive edges represent trust and like while negative edges convey distrust and dislike. For link prediction on such a signed network, we have to predict not only possible edges but also the signs of those edges (Wang et al., 2017a; Kim et al., 2018).

### 3.2. Network Feature Extraction

The main task of NRL is to find the hidden network features and encode them as node embedding vectors in a low-dimensional space. Network properties are used to analyze and compare different network models. For an NRL task, the learned hidden features should preserve the original network properties so that advanced network analytics can be performed accurately. For example, nodes that are closer in the original network should also be closer in the latent representation space in which the node embedding vectors are defined. Most NRL methods focus on preserving topological structures of nodes, such as in-degree or out-degree neighbors (degree proximity), first-order proximity, random walk distance, and so forth. We categorize the node structural properties into local structure and global structure.

Embeddings from local structure focus on the preservation of local network properties such as degree proximity, first-order proximity, etc. In comparison, global structure provides rich choices of sources of raw features to be extracted so as to preserve even more network properties. The classification of source of raw features as well as network properties and sampling methods are summarized in Table 2.

#### 3.2.1. Local Structure Extraction

Local structure reflects a node’s local view of the network, which includes node degree (in-degree and out-degree), neighbors (in-degree neighbors and out-degree neighbors), node state and adjacent edge (in-edge and out-edge) state. To preserve degree proximity, Ribeiro et al. (2017) define a proximity function between two nodes where each node’s neighbors are sorted based on their degrees and the proximity is measured by the distance between the sorted degree sequences. The first-order proximity (Tang et al., 2015b) assumes that nodes that are neighbors of each other (i.e., connected via an edge) are similar in vector space, while the degree of similarity may depend on the edge state. For example, in a signed social network, neighbors with positive and negative edges are often called friends and foes, respectively. From the perspective of social psychology (Wang et al., 2017a; Kim et al., 2018), when we take edge sign into account, a node should be more similar to its friends than to its foes in the representation space. Hamilton et al. (2017b) present an inductive representation learning method that aggregates features extracted from neighboring node states.

Sampling Methods. For sources of raw features like degree and neighbors, training instances can be calculated or fetched directly from the adjacency matrix. For adjacent edges, training instances can be generated by edge sampling. Simple edge sampling also fetches entries from the adjacency matrix. When it is applied to weighted networks where the pairwise proximity is closely related to edge weights, a high variance in edge weights causes the learning model to suffer from exploding or vanishing gradients. To address the problem, LINE (Tang et al., 2015b) designs an optimized edge sampling method that fetches edges with probabilities proportional to their weights. For node state like node attributes, features can be extracted by leveraging existing embedding techniques, e.g., Word2vec (Mikolov et al., 2013b). If node state is given as a node label, it usually works as the supervision signal to train the embedding model.

#### 3.2.2. Global Structure Extraction

Structures that transcend local views can be considered global, such as multi-hop neighborhoods, community, connectivity pattern, etc. Considering that the first-order proximity matrix may not be dense enough to model the pairwise proximity between nodes, a global view (Cao et al., 2015) generalizes the pairwise proximity to a high-order form by using the $k$-step transition matrix. Instead of preserving a fixed high-order proximity, Zhang et al. (2018c) define the arbitrary-order proximity by calculating the weighted sum of the proximities of all orders. Community structure is an important network property with dense intra-community connections and sparse inter-community connections, and it has been observed in many domain-specific networks, e.g., social networks, co-authoring networks, language networks, etc. Wang et al. (2017b) introduce a community representation matrix by means of the modularity-driven community detection method, and use it to make each node’s representation similar to the representation of its community. Considering that a node may belong to multiple communities, Sun et al. (2017) define a basis matrix to reveal nodes’ community memberships, whose rows correspond to nodes and whose columns correspond to communities, so that each entry indicates the propensity of a node to a community. The basis matrix is learned and preserved during the representation learning process.

Network nodes usually play various structural roles (Grover and Leskovec, 2016; Ahmed et al., 2018) and exhibit different connectivity patterns such as hubs, star-edge nodes, bridges connecting different clusters, etc. Node role proximity assumes that nodes with similar roles have similar vector representations. It is a global structural property that is different from community structure, since it primarily focuses on the connection patterns between nodes and their neighbors.

As a global structural property, node rank is often used to denote a node’s importance in the network. PageRank (Page et al., 1998) is a well-known approach to evaluate the rank of a node by means of its connections to others. Specifically, the ranking score of a node is measured by the probability of visiting it, and this probability is accumulated from the ranking scores of its direct predecessors, each weighted by the reciprocal of that predecessor’s out-degree. Lai et al. (Lai et al., 2017) demonstrate that node representations with global ranking preserved can potentially improve the results of both ranking-oriented tasks and classification tasks. Node degree distribution is also a global inherent network property. For example, the scale-free property refers to the fact that node degrees follow a heavy-tailed distribution, and it has been observed across many real networks, e.g., the Internet, social networks, etc. Representation learning for scale-free networks is explored in (Feng et al., 2018).
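The PageRank rule described above can be sketched as a short power iteration. This is a minimal illustration rather than the formulation of any cited work: the damping factor `0.85`, the function name `pagerank`, and the toy graph are assumptions, and the sketch assumes every node has at least one out-link.

```python
def pagerank(out_links, damping=0.85, iters=100):
    """Power iteration for PageRank on a directed graph.

    out_links maps each node to the list of nodes it points to.
    Assumptions (not from the text): damping factor 0.85, and every
    node has at least one out-link (no dangling nodes).
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Teleportation term shared by all nodes.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            # u's score is split evenly over its out-degree.
            share = rank[u] / len(out_links[u])
            for v in out_links[u]:
                new_rank[v] += damping * share
        rank = new_rank
    return rank


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
```

Node `c` collects score from both `a` and `b`, so it ends up ranked above `b`, which only receives half of `a`’s score.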

Another global structural property is proposed in Struc2vec (Ribeiro et al., 2017), which considers structural similarity from a network equivalence perspective without requiring two nodes to be nearby, i.e., independent of the nodes’ network positions. To reflect this property, Struc2vec presents a notion of node structural identity that describes a node’s role in a global sense. Struc2vec uses the multi-layer graph output from the data preprocessing stage to measure node similarity at different scales. In the bottom layer, the similarity depends exclusively on node degrees. In the top layer, the similarity depends on the entire network. Tu et al. (Tu et al., 2018) propose a similar concept, i.e., regular equivalence, to describe the similarity between nodes that may neither be directly connected nor have common neighbors. According to its recursive definition, neighbors of regularly equivalent nodes are also regularly equivalent. To ensure the property of regular equivalence, each node’s embedding is approximated by aggregating its neighbors’ embeddings. After updating the learned representations iteratively, the final node embedding is capable of preserving the property in a global sense.

Different types of global structure reflect different network properties. Multi-hop neighborhoods, node connectivity pattern and node identity are used to preserve pairwise proximity, which reflects a pairwise relationship between nodes, including high-order proximity, node role proximity, and node identity proximity. Node community membership reflects a relationship between a node and a group of nodes that share some common network properties. Furthermore, node rank and node degree distribution are used to preserve distribution-based network properties, i.e., a relationship between a node and the entire network, such as node importance ranking or the scale-free property of networks whose degree distribution follows a power law.

Sampling Methods Raw features such as multi-hop neighborhoods can be obtained by the matrix power operation, i.e., $A^k$, but the computation suffers from high complexity. Random walk and its variants are widely explored to capture the desirable network properties with high confidence. For example, DeepWalk (Perozzi et al., 2014) presents a truncated random walk to generate node traveling paths. It uses the co-occurrence frequencies between a node and its multi-hop neighborhoods along these paths to reflect their similarity and capture the high-order proximity accordingly.
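A minimal sketch of truncated random walks and the co-occurrence counts derived from them is given below. The function names, the window size, and the toy path graph are illustrative assumptions; DeepWalk itself feeds such walks to a skip-gram model rather than materializing the counts.

```python
import random
from collections import Counter


def truncated_random_walk(adj, start, length, rng):
    """Uniform random walk of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk


def cooccurrence_counts(walks, window):
    """Count ordered node pairs that appear within `window` steps of each other."""
    counts = Counter()
    for walk in walks:
        for i, u in enumerate(walk):
            lo, hi = max(0, i - window), min(len(walk), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(u, walk[j])] += 1
    return counts


# Hypothetical path graph 0 - 1 - 2 - 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
rng = random.Random(0)
walks = [truncated_random_walk(adj, v, 6, rng) for v in adj for _ in range(10)]
counts = cooccurrence_counts(walks, window=2)
```

With a window of two steps, pairs that are two hops apart in the graph also receive counts, which is how the walks expose proximity beyond direct edges.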

From the perspective of community structure, due to the dense intra-community connections, nodes within the same community have a higher probability to co-occur on the traveling paths than nodes in different communities. Hence random walk can also be used to capture the community structure. When we consider the hierarchy of communities, different communities may have different scales. The regular random walk yields a training set with more entries from $A^i$ than from $A^j$ ($i < j$), and is therefore biased towards preserving small-scale community structure. Walklets (Perozzi et al., 2017) presents a skipped random walk to sample multi-scale node pairs by skipping over a fixed number of steps in each traveling path.

Another drawback of random walk is that it requires too many steps or restarts to cover a node’s neighborhoods. To improve its coverage, Diff2Vec (Rozemberczki and Sarkar, 2020) presents a diffusion-based node sequence generating method that consists of two steps: 1) Diffusion graph generation, which is in charge of generating a diffusion graph $\tilde{G}_v$ for each node $v$. $\tilde{G}_v$ is initialized with $\{v\}$; then a node $u$ is randomly fetched from $\tilde{G}_v$ and a node $w$ from $u$’s neighborhood in the original graph, and the two nodes and the edge $(u, w)$ are appended to $\tilde{G}_v$. The above process is repeated until $\tilde{G}_v$ grows to the predefined size. 2) Node sequence sampling, which generates an Euler walk from $\tilde{G}_v$ as the node sequence. To make sure $\tilde{G}_v$ is Eulerian, it is converted to a multi-graph by doubling every edge into two edges.

Real-world networks often exhibit a mixture of multiple network properties. In order to capture both community structure and node role proximity, node2vec (Grover and Leskovec, 2016) designs a flexible biased random walk that generates traveling paths in an integrated fashion of BF (breadth-first) sampling and DF (depth-first) sampling. To this end, two parameters $p$ and $q$ are introduced to smoothly interpolate between the two sampling methods, where $p$ decides the probability of re-fetching a node in the path while $q$ allows the sampling to discriminate between inward and outward nodes.

In scale-free networks, a tiny fraction of “big hubs” usually attracts most edges. Considering that connecting to “big hubs” does not imply proximity as strong as connecting to nodes with mediocre degrees, a degree penalty based random walk (Feng et al., 2018) is proposed. For a pair of connected nodes ($v_i$, $v_j$), its principle is to reduce the likelihood of $v_j$ being sampled as $v_i$’s context when $v_j$ has a high degree and the two nodes do not share many common neighbors. To this end, the jumping probability from $v_i$ to $v_j$ is defined as a function of the first- and second-order proximity between the two nodes, their degrees, and a tunable parameter (see Feng et al., 2018 for the exact form).
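The biased walk of node2vec can be sketched as follows, assuming the weighting published by Grover and Leskovec (2016): a candidate next node gets unnormalized weight `1/p` if it is the previous node, `1` if it is also a neighbor of the previous node, and `1/q` otherwise. The function names and the toy graph are illustrative, and the sketch assumes an undirected graph in which every node has at least one neighbor.

```python
import random


def transition_weights(adj, prev, cur, p, q):
    """Unnormalized node2vec weights for moving from `cur`, having come from `prev`."""
    weights = {}
    for x in sorted(adj[cur]):
        if x == prev:
            weights[x] = 1.0 / p   # return to the previous node
        elif x in adj[prev]:
            weights[x] = 1.0       # stay close to prev (breadth-first flavour)
        else:
            weights[x] = 1.0 / q   # move outward (depth-first flavour)
    return weights


def node2vec_walk(adj, start, length, p, q, rng):
    """Second-order biased walk; assumes length >= 2 and no isolated nodes."""
    walk = [start, rng.choice(sorted(adj[start]))]
    while len(walk) < length:
        w = transition_weights(adj, walk[-2], walk[-1], p, q)
        nodes = list(w)
        walk.append(rng.choices(nodes, weights=[w[x] for x in nodes], k=1)[0])
    return walk


# Hypothetical undirected graph given as neighbor sets.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
rng = random.Random(42)
walk = node2vec_walk(adj, 0, 10, p=4.0, q=0.5, rng=rng)
```

A large `p` discourages immediate backtracking, while `q < 1` pushes the walk away from the previous node’s neighborhood.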

As an anonymous version of random walk, anonymous walk (Ivanov and Burnaev, 2018) provides a flexible way to reconstruct a network. For a random walk $w = (v_1, v_2, \dots, v_k)$, its corresponding anonymous walk is the sequence of the nodes’ first occurrence positions, i.e., $a = (f(v_1), f(v_2), \dots, f(v_k))$, where $f(v_i) = \min \mathrm{pos}(w, v_i)$. For an arbitrary node $v$, a known distribution over anonymous walks of length $l$ is sufficient to reconstruct a subgraph of $G$ that is induced by all nodes located within $r$ hops away from $v$. Therefore, anonymous walk can be used to preserve global network properties like high-order proximity and community structure by approximating the distributions.
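As a small illustration, the sketch below converts a random walk into an anonymous walk. It uses the common convention of numbering distinct nodes in the order of their first appearance (so the labels are 1, 2, 3, ... without gaps), which is an assumption of this sketch; the function name is illustrative.

```python
def anonymous_walk(walk):
    """Replace each node by the order index of its first appearance in the walk."""
    first_seen = {}
    result = []
    for node in walk:
        if node not in first_seen:
            first_seen[node] = len(first_seen) + 1
        result.append(first_seen[node])
    return result


# Two walks over different nodes that share the same anonymous pattern.
pattern_1 = anonymous_walk(["a", "b", "a", "c", "b"])
pattern_2 = anonymous_walk(["x", "y", "x", "z", "y"])
```

Both walks map to the same pattern, which shows that node identities are discarded and only the visiting structure is kept.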

One of the baseline approaches to extracting global structure is to use random walk as the sampling method. For complex types of global structure, e.g., multi-hop neighborhoods and node community membership, an integrated sampling method is often recommended, which combines random walk with other types of graph traversal methods, such as anonymous walk. An advantage of using anonymous walk as the sampling approach is that it is sufficient to reconstruct the topology around a node by utilizing distribution of anonymous walks of a single node, because anonymous walk captures richer semantics than random walk.

| Category | Source of raw features | Network property | Sampling method |
|---|---|---|---|
| Local structure | Degree/in-degree/out-degree (Ribeiro et al., 2017) | Degree proximity | Directly from adjacency matrix |
| | Neighbors/in-degree neighbors/out-degree neighbors (Tang et al., 2015b) | First-order proximity | Directly from adjacency matrix |
| | Node state (Hamilton et al., 2017b) | Node state proximity | Directly from node state information |
| | Adjacent edge state (Wang et al., 2017a; Kim et al., 2018) | First-order proximity | Weighted edge sampling |
| Global structure | Multi-hop neighborhoods (Zhang et al., 2018c) | High-order proximity | Random/Anonymous walk |
| | Node community membership (Wang et al., 2017b; Sun et al., 2017) | Community structure | Random/Anonymous walk |
| | Node connectivity pattern (Grover and Leskovec, 2016; Ahmed et al., 2018) | Node role proximity | Random walk |
| | Node degree distribution (Feng et al., 2018) | Scale-free | Degree penalty based random walk |
| | Node rank (Lai et al., 2017) | Node importance ranking | Random walk |
| | Node identity (Ribeiro et al., 2017; Tu et al., 2018) | Node identity proximity | Random walk |

### 3.3. Node Embedding

In recent years, many efforts have been devoted to the design of node embedding models. We review this work from the unified perspective of the autoencoder. In general, node embedding is equivalent to an optimization problem that encodes network nodes into latent vectors by means of an encoding function $\mathrm{enc}(\cdot)$. Meanwhile, the objective is to ensure that the results decoded from the vectors preserve the network properties we intend to incorporate, where the decoder is represented by a function of the encoding results, i.e., $\mathrm{dec}(\mathrm{enc}(v_1), \dots, \mathrm{enc}(v_m))$, where $m$ denotes the number of input arguments. The output vectors are the latent representations of hidden features learned by $\mathrm{enc}(\cdot)$, and they are also the expected node representations.

#### 3.3.1. Embedding Models

We classify the current embedding models into the following three types.

Matrix Factorization Given the matrix of input network, matrix factorization embedding factorizes the matrix to get a low-dimensional matrix as the output collection of node representations. Its basic idea can be traced back to the matrix dimensionality reduction techniques (Chen et al., 2018a). According to the matrix type, we categorize the current work into relational matrix factorization and spectral model.

*(1) Relational Matrix Factorization* Matrix analysis often requires figuring out the inherent structure of a matrix from a fraction of its entries, and it is usually based on the assumption that a matrix $M \in \mathbb{R}^{n \times n}$ admits an approximation of low rank $d \ll n$. Under this assumption, the objective of matrix factorization corresponds to finding a matrix $Z \in \mathbb{R}^{d \times n}$ such that $Z^\top Z$ approximates $M$ with the lowest loss $\mathcal{L}(M, Z^\top Z)$, where $\mathcal{L}$ is a user-defined loss function. In the autoencoder paradigm, the encoder is defined by the matrix $Z$, e.g., $\mathrm{enc}(v_i) = z_i$, where each column $z_i$ represents a node vector. The decoder is defined by the inner product of two node vectors, e.g., $\mathrm{dec}(z_i, z_j) = z_i^\top z_j$, so as to infer the reconstruction of the proximity between nodes $v_i$ and $v_j$. When we focus on preserving the first-order proximity, i.e., $M = A$, and let $O$ be the training set sampled from $A$, the objective is to find $Z$ that minimizes the reconstruction loss, i.e.,

$$\min_{Z} \sum_{(i,j) \in O} \mathcal{L}\left(a_{ij}, z_i^\top z_j\right), \tag{9}$$

where the Frobenius norm $\|\cdot\|_F$ is often used to define the loss function. Singular value decomposition (SVD) (Golub and Reinsch, 1970) is a well-known matrix factorization approach that can find the optimal rank-$d$ approximation of the proximity matrix $M$. If the network property focuses on high-order proximity, the proximity matrix can be replaced by a power matrix, e.g., $A^2$, $A^3$, $\dots$, $A^k$. For example, GraRep (Cao et al., 2015) implements node embedding by factorizing $A^k$ into two matrices $Z$ and $C$,

$$\min_{Z, C} \left\|A^k - Z^\top C\right\|_F^2, \tag{10}$$

where $Z$ denotes the representation matrix and $C$ denotes the parameter matrix in the decoder, e.g., the unary decoder $\mathrm{dec}(z_i) = z_i^\top C$ computes the reconstructed proximities between $v_i$ and other nodes.
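The reconstruction objective of Eq. 9 can be illustrated with a small stochastic gradient descent loop over sampled entries, using the squared error as the loss and the inner product of node vectors as the decoder. This is only a sketch under those assumptions: the learning rate, number of epochs, function names, and toy entries are all hypothetical, and practical systems typically rely on SVD or a dedicated solver.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def factorize(entries, n, dim, lr=0.05, epochs=200, seed=0):
    """Fit node vectors so that dot(Z[i], Z[j]) approximates each sampled entry.

    entries: list of (i, j, a_ij) triples sampled from the proximity matrix.
    Returns the vectors and the squared reconstruction loss after each epoch.
    """
    rng = random.Random(seed)
    Z = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n)]

    def loss():
        return sum((a - dot(Z[i], Z[j])) ** 2 for i, j, a in entries)

    history = [loss()]
    for _ in range(epochs):
        for i, j, a in entries:
            err = dot(Z[i], Z[j]) - a
            zi, zj = list(Z[i]), list(Z[j])  # copies so both updates use old values
            for k in range(dim):
                Z[i][k] -= lr * 2.0 * err * zj[k]
                Z[j][k] -= lr * 2.0 * err * zi[k]
        history.append(loss())
    return Z, history


# Hypothetical training set: a triangle on nodes 0, 1, 2 and an unconnected node 3.
entries = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
           (0, 3, 0.0), (1, 3, 0.0), (2, 3, 0.0)]
Z, history = factorize(entries, n=4, dim=2)
```

Because the decoder is a symmetric inner product, only proximity patterns that a low-rank positive semidefinite matrix can express are reconstructed well; the toy entries above are such a case.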

In the autoencoder paradigm, the equivalence between DeepWalk and matrix factorization can be shown by making the following analogies: 1) Define the pairwise proximity as the co-occurrence probability inferred from the training set. 2) Let the representation matrix and the decoder parameter matrix of Eq. 10 correspond to the two embedding matrices in DeepWalk, where each column of them refers to the representation of a node acting as target node and context node, respectively.

In addition, matrix factorization can also incorporate additional information into node embeddings. For example, given a text feature matrix $T \in \mathbb{R}^{f \times n}$, where $n$ denotes the node set size and $f$ denotes the feature vector dimension, TADW (Yang et al., 2015) applies inductive matrix completion to the node embedding and defines the following matrix factorization problem:

$$\min_{W, H} \left\|M - W^\top H T\right\|_F^2 + \frac{\lambda}{2}\left(\|W\|_F^2 + \|H\|_F^2\right), \tag{11}$$

where $\lambda$ is a harmonic factor. The output matrices $W$ and $HT$ factorized from the transition matrix $M$ can be regarded as the collection of node embeddings and concatenated as a representation matrix.

*(2) Spectral Model*
In the spectral model, a network is mathematically represented by a Laplacian matrix, i.e., $L = D - A$, where the adjacency matrix $A$ acts as the proximity matrix, and the entry $D_{ii} = \sum_j A_{ij}$ of the diagonal matrix $D$ describes the importance of node $v_i$. $L$ is a symmetric positive semidefinite matrix that can be thought of as an operator on functions defined on the original network. Let $Z \in \mathbb{R}^{d \times n}$ be the representation matrix; it also acts as the encoder that maps the network to a $d$-dimensional space. The representations should ensure that neighboring nodes stay as close as possible. As a result, the decoder can be defined as $\mathrm{dec}(z_i, z_j) = \|z_i - z_j\|^2 A_{ij}$ according to Laplacian Eigenmaps (Belkin and Niyogi, 2003), where $z_i$ is the $i$th column of $Z$. Then the node embedding problem is defined as follows:

$$\min_{Z} \; \mathrm{tr}\left(Z L Z^\top\right) \quad \text{s.t.} \quad Z D Z^\top = I, \tag{12}$$

where $Z D Z^\top = I$ is imposed as a constraint to prevent collapse onto a subspace of dimension less than $d$. The solution $Z$ consists of the eigenvectors corresponding to the lowest eigenvalues of the generalized eigenvalue problem $L y = \lambda D y$. Furthermore, given an additional information matrix $X$ describing node attribute features, the node embedding can be derived by following the idea of locality preserving projection (LPP) (He and Niyogi, 2003), which introduces a transformation matrix $P$ to realize the mapping $z_i = P^\top x_i$, where $z_i$ and $x_i$ are the $i$th columns of $Z$ and $X$. Similarly, the solution is obtained by computing the mapping vectors corresponding to the lowest eigenvalues of the problem $X L X^\top p = \lambda X D X^\top p$, and the node embedding incorporates the additional information by $Z = P^\top X$.

Probabilistic Model The probabilistic model is specifically designed for node embedding by preserving the pairwise proximity measured in a flexible probabilistic manner.

*(1) Skip-gram Model*
Skip-gram is the classic embedding model that converts pairwise proximity to the conditional probability between node and its context. For example, DeepWalk (Perozzi et al., 2014) relies on the co-occurrence probability derived from random traveling paths to preserve the second-order proximity during node embedding.

*(2) Edge Probabilistic Model*
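The edge probabilistic model learns node embeddings designed primarily for network reconstruction. For example, LINE (Tang et al., 2015b) relies on the idea of edge probability reconstruction without the assistance of random walk sampling. It focuses on preserving both first-order and second-order proximities by defining two decoders, i.e.,

$$p_1(v_i, v_j) = \frac{1}{1 + \exp(-z_i^\top z_j)}, \qquad p_2(v_j \mid v_i) = \frac{\exp(z_j'^\top z_i)}{\sum_{k=1}^{|V|} \exp(z_k'^\top z_i)}, \tag{13}$$

where $z_j'$ refers to the representation of $v_j$ acting as a context node. The objectives to be optimized are based on the loss functions derived from the distance between two distributions, i.e.,

$$O_1 = -\sum_{(v_i, v_j) \in E} w_{ij} \log p_1(v_i, v_j), \qquad O_2 = -\sum_{(v_i, v_j) \in E} w_{ij} \log p_2(v_j \mid v_i), \tag{14}$$

which follow from the KL-divergence between the empirical probabilities $\hat{p}_1$, $\hat{p}_2$ and the decoders $p_1$, $p_2$ after omitting constants, where $w_{ij}$ is the weight of edge $(v_i, v_j)$.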

Graph Neural Networks It is easy to see that each node’s embedding via the above models requires the participation of its neighboring nodes, e.g., building the $k$-order proximity matrix with nodes and their $k$-step neighborhoods in the matrix factorization model, and counting a node’s co-occurrence frequency with its context nodes in the skip-gram model. This common idea can be generalized to a more general model, graph neural networks (GNNs), which follow a recursive neighborhood aggregation or message passing scheme. Graph convolutional network (GCN) (Bruna et al., 2014) is a very popular variant of GNNs, where each node’s representation is generated by a convolution operation that aggregates its own features and its neighboring nodes’ features. Graph isomorphism network (GIN) (Xu et al., 2019) is another recently proposed variant with a relatively strong expressive power.

*(1) Graph Convolutional Network*
In terms of the definition of the convolution operation (Wu et al., 2019b), GCNs are often grouped into two categories: spectral GCNs and spatial GCNs. Spectral GCNs (Defferrard et al., 2016; Bruna et al., 2014) define the convolution in the Fourier domain by means of the eigendecomposition of the normalized Laplacian matrix $L = I_n - D^{-1/2} A D^{-1/2}$. An intuitive way to explain spectral convolution is to regard it as an operation that uses a filter $g$ to remove noise from a graph signal $x \in \mathbb{R}^n$. The graph signal denotes a feature vector of the graph with entry $x_i$ representing node $v_i$’s value. Let $U$ be the matrix of eigenvectors of $L$; the graph Fourier transform is defined as a function $\mathcal{F}$ that projects the input signal $x$ to the orthonormal space formed by $U$, i.e., $\mathcal{F}(x) = U^\top x$. Let $\hat{x}$ be the transformed signal; the entries of $\hat{x}$ correspond to the coordinates of the input signal in the newly generated space, and the inverse Fourier transform is defined as $\mathcal{F}^{-1}(\hat{x}) = U \hat{x}$. When the filter is parameterized by $\theta$ and defined as $g_\theta = \mathrm{diag}(U^\top g)$, the convolution operation against signal $x$ with filter $g_\theta$ is represented by

$$x * g_\theta = \mathcal{F}^{-1}\left(\mathcal{F}(x) \odot \mathcal{F}(g)\right) = U g_\theta U^\top x, \tag{15}$$

where $\odot$ represents the Hadamard product. Nevertheless, the eigendecomposition of the Laplacian matrix is expensive for large graphs, and the complexity of multiplication with $U$ is $O(n^2)$. To address the complexity problem, Hammond et al. (Hammond et al., 2011) suggest that the filter $g_\theta$ can be approximated by a truncated expansion based on Chebyshev polynomials $T_k(\cdot)$ up to the $K$th order. The convolution is redefined as

$$x * g_\theta \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{L})\, x, \tag{16}$$

where $\tilde{L} = \frac{2}{\lambda_{max}} L - I_n$ and $T_k(y) = 2y\,T_{k-1}(y) - T_{k-2}(y)$ with $T_0(y) = 1$ and $T_1(y) = y$. The above convolution operation is $K$-localized since it requires the participation of the neighboring nodes within $K$ hops of the central node. A spectral GCN model consists of multiple convolution layers of the form of Eq. 16, where each layer is followed by a point-wise non-linearity. From the perspective of the autoencoder, each node’s feature embedding is encoded by the convolution operation, so the graph convolution actually acts as the encoder, and we call it the spectral convolution encoder or filter encoder here. In order to deal with multi-dimensional input graph signals, the spectral convolution encoder is generalized to account for the signal matrix $X \in \mathbb{R}^{n \times c}$ with $c$ input channels and $f$ filters; under the first-order approximation of Kipf and Welling (2017), i.e.,

$$Z = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X \Theta, \tag{17}$$

where $\tilde{A} = A + I_n$, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ and $\Theta \in \mathbb{R}^{c \times f}$ is the matrix of filter parameters. The signal matrix $X$ is often initialized by the input graph information such as node attributes. Note that the filter parameters can be shared over the whole graph, which significantly decreases the number of parameters and improves the efficiency of the filter encoder.
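The Chebyshev recurrence behind Eq. 16 avoids the eigendecomposition entirely: each additional order only costs one more multiplication with the rescaled Laplacian. The sketch below illustrates this in plain Python, assuming the rescaled Laplacian is already given as a dense list of rows; the function names and the tiny matrix are hypothetical.

```python
def matvec(M, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def chebyshev_filter(L_tilde, x, theta):
    """Compute sum_k theta[k] * T_k(L_tilde) x with the three-term recurrence."""
    t_prev = list(x)                                   # T_0(L) x = x
    out = [theta[0] * v for v in t_prev]
    if len(theta) == 1:
        return out
    t_cur = matvec(L_tilde, x)                         # T_1(L) x = L x
    out = [o + theta[1] * v for o, v in zip(out, t_cur)]
    for k in range(2, len(theta)):
        # T_k(L) x = 2 L T_{k-1}(L) x - T_{k-2}(L) x
        t_next = [2.0 * a - b for a, b in zip(matvec(L_tilde, t_cur), t_prev)]
        out = [o + theta[k] * v for o, v in zip(out, t_next)]
        t_prev, t_cur = t_cur, t_next
    return out


# Hypothetical 2-node example; the matrix is an arbitrary symmetric stand-in.
L_tilde = [[0.0, 1.0], [1.0, 0.0]]
signal = [1.0, 2.0]
first_order = chebyshev_filter(L_tilde, signal, [0.0, 1.0])
second_order = chebyshev_filter(L_tilde, signal, [0.0, 0.0, 1.0])
```

For this matrix, $T_2(\tilde{L}) = 2\tilde{L}^2 - I$ is the identity, so the second-order term returns the signal unchanged, while the first-order term swaps the two entries.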

Similar to convolutional neural networks on images, spatial GCNs (Hamilton et al., 2017b) consider graph nodes as image pixels and directly define the convolution operation in the graph domain as the feature aggregation from neighboring nodes. To be specific, the convolution acts as the encoder that aggregates the central node’s representation and its neighbors’ representations to generate an updated representation, and here we call it the spatial convolution encoder or aggregation encoder. In order to explore the depth and breadth of a node’s receptive field, spatial GCNs usually stack multiple convolution layers. For a $K$-layer spatial GCN, its aggregation encoder is defined as follows:

$$h_v^{(k)} = \sigma\left(W^{(k)} \cdot \mathrm{AGG}^{(k)}\left(h_v^{(k-1)}, \{h_u^{(k-1)} : u \in N(v)\}\right)\right), \quad k = 1, \dots, K, \tag{18}$$

where $h_v^{(k)}$ is node $v$’s representation (also called hidden state in some other literature) in the $k$th layer with $h_v^{(0)} = x_v$. $\mathrm{AGG}^{(k)}$ is an aggregation function responsible for assembling a node’s neighborhood information in the $k$th layer, and the parameter matrix $W^{(k)}$ specifies how to aggregate from neighborhoods; like the filters in spectral GCNs, $W^{(k)}$ is shared across all graph nodes for generating their representations. As the layers deepen, the node representations contain more information aggregated from a wider coverage of nodes. The final node representations are output as the normalized matrix of node representations on layer $K$, i.e., $z_v = h_v^{(K)} / \|h_v^{(K)}\|_2$.
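One layer of such an aggregation encoder can be sketched as follows. The mean aggregator, the ReLU non-linearity, the identity weight matrix, and the toy graph are assumptions chosen for illustration; other aggregators (e.g., pooling or LSTM based) fit the same template.

```python
import math


def aggregate_layer(adj, H, W):
    """One spatial convolution layer: mean over {v} and its neighbors, then W and ReLU."""
    new_H = {}
    for v, h_v in H.items():
        group = [h_v] + [H[u] for u in adj[v]]
        mean = [sum(col) / len(group) for col in zip(*group)]
        new_H[v] = [max(0.0, sum(w * m for w, m in zip(row, mean))) for row in W]
    return new_H


def l2_normalize(h):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in h))
    return [x / norm for x in h] if norm > 0 else list(h)


# Hypothetical path graph 0 - 1 - 2 with 2-dimensional input features.
adj = {0: [1], 1: [0, 2], 2: [1]}
H0 = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
W = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, for readability only

H1 = aggregate_layer(adj, H0, W)
Z = {v: l2_normalize(h) for v, h in H1.items()}
```

Stacking the layer $K$ times by feeding `H1` back into `aggregate_layer` widens each node’s receptive field to its $K$-hop neighborhood.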

| Category | Sub-category | Encoder | Decoder | Optimization objective | Publication |
|---|---|---|---|---|---|
| Matrix Factorization | Relational matrix factorization | Representation matrix | Inner product of node vectors; unary decoder with a parameter matrix | Eq. 9, Eq. 10, Eq. 11 | (Cao et al., 2015; Yang et al., 2015) |
| | Spectral model | Representation matrix; transformation matrix (LPP) | Weighted distance between neighboring nodes’ vectors | Eq. 12 | (Belkin and Niyogi, 2003; He and Niyogi, 2003) |
| Probabilistic Model | Skip-gram model | | Eq. 2 | Eq. 4 | (Perozzi et al., 2014) |
| | Edge probabilistic model | | Eq. 13 | Eq. 14 | (Tang et al., 2015b) |
| Graph Neural Networks | Spectral Graph Convolutional Network | Eq. 17 | Task-driven, e.g., sigmoid | Eq. 20 | (Hammond et al., 2011; Kipf and Welling, 2017, 2016) |
| | Spatial Graph Convolutional Network | Eq. 18 | Task-driven, e.g., sigmoid | Eq. 20 | (Hamilton et al., 2017b) |
| | Graph Isomorphism Network | Eqs. 18, 19 | Task-driven, e.g., sigmoid | Eq. 20 | (Xu et al., 2019) |

*(2) Graph Isomorphism Network*
The representational power of a GNN primarily depends on whether it maps two different multisets to different representations, where a multiset is a generalized concept of a set that allows multiple instances of its elements. The aggregation operation can be regarded as a class of functions over multisets that the underlying neural networks can represent. To maximize the representational power of a GNN, its multiset functions must be injective. GCN has been shown to be unable to distinguish certain simple structural features, as its aggregation operation is inherently not injective. To model injective multiset functions for the neighbor aggregation, GIN (Xu et al., 2019) presents a theory of deep multisets that parameterizes universal multiset functions with neural networks. With the help of multi-layer perceptrons (MLPs), GIN updates node representations as follows:

$$h_v^{(k)} = \mathrm{MLP}^{(k)}\left(\left(1 + \epsilon^{(k)}\right) \cdot h_v^{(k-1)} + \sum_{u \in N(v)} h_u^{(k-1)}\right), \tag{19}$$

where $\epsilon^{(k)}$ is a learnable parameter or a fixed scalar. The aggregation encoder can be defined in the same way as Eq. 18.
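The role of the sum aggregator in GIN can be seen in a tiny example: two nodes whose neighbors carry identical features but differ in number are indistinguishable under mean aggregation, yet receive different representations under the GIN-style update. In this sketch the MLP is passed in as a callable and replaced by the identity for readability; the function names and the toy graph are assumptions.

```python
def gin_update(adj, H, eps, mlp):
    """GIN-style update: mlp((1 + eps) * h_v + sum of neighbor representations)."""
    new_H = {}
    for v in H:
        combined = [(1.0 + eps) * x for x in H[v]]
        for u in adj[v]:
            combined = [c + x for c, x in zip(combined, H[u])]
        new_H[v] = mlp(combined)
    return new_H


def mean_neighbors(adj, H, v):
    """Mean of neighbor representations, shown only for contrast."""
    cols = zip(*[H[u] for u in adj[v]])
    return [sum(col) / len(adj[v]) for col in cols]


# Hypothetical graph: node 0 has two neighbors, node 3 has one; all features equal.
adj = {0: [1, 2], 1: [0], 2: [0], 3: [4], 4: [3]}
H = {v: [1.0] for v in adj}

identity_mlp = lambda x: x  # stand-in for a trained MLP
out = gin_update(adj, H, eps=0.0, mlp=identity_mlp)
```

Here `out[0]` and `out[3]` differ because the sum preserves the multiplicity of neighbors, while `mean_neighbors` returns the same vector for both nodes.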

For both GCN and GIN, the decoder can be designed in any form of the previously discussed decoders. Recently, many efforts have been devoted to designing task-driven GCNs, so that the decoder has to incorporate supervision from a specific task. For instance, assume that each node $v$ has a label $y_v$ chosen from a label set $Y$. A function like sigmoid, $\hat{y}_v = \sigma(\theta^\top z_v)$, can be defined as a decoder to realize the mapping from a node representation $z_v$ to its corresponding label $y_v$, where the parameters $\theta$ are learnable throughout the embedding process. Using a function $\mathcal{L}$ to measure the loss between the true label $y_v$ and the predicted one $\hat{y}_v$, the objective is

$$\min \sum_{v \in V} \mathcal{L}\left(y_v, \hat{y}_v\right). \tag{20}$$

We summarize the specific classification of node embedding models outlined in Table 3 and make comparisons among them as shown in Table 4.

| Category | Advantage | Disadvantage |
|---|---|---|
| Matrix Factorization | 1) has the solid theoretical foundation available to enhance its interpretability. 2) additional information, e.g., node state and edge state, can be easily incorporated into the matrix factorization model. | 1) high computation cost. 2) hard to be applied to dynamic embedding. |
| Probabilistic Model | relatively efficient compared to other models especially when it was designed for network reconstruction. | hard to be applied to dynamic embedding. |
| Graph Neural Networks | 1) naturally supports embedding of both structural and additional information. 2) easy to be applied to dynamic embedding. | high computation cost. |

#### 3.3.2. Optimization Techniques

Many of the above models have achieved non-trivial embedding results. The success of these models inevitably relies on some optimization techniques that have been proposed to assist the node embedding from various aspects such as the computational complexity reduction, the acceleration of embedding process and the enhancement of training efficiency and effectiveness. In this section, we will show several popular optimization techniques with a focus on how they work at different stages of NRL, and summarize these techniques in Table 5.

Hierarchical Softmax As a variant of softmax, hierarchical softmax (Morin and Bengio, 2005; Perozzi et al., 2014) has been proposed to speed up the training process. To this end, hierarchical softmax leverages a well-designed tree structure with multiple binary classifiers to compute the conditional probabilities for each training instance (see Sec. 2 for details). Therefore, it helps the decoder reduce the complexity from $O(|V|)$ to $O(\log |V|)$.

Negative Sampling As an alternative to hierarchical softmax, noise contrastive estimation (NCE) (Gutmann and Hyvärinen, 2012) suggests that logistic regression can be used to help models distinguish data from noise. Negative sampling (NEG) is a simplified version of NCE that was first leveraged by Word2vec (Mikolov et al., 2013b) to help the skip-gram model deal with the high computational complexity of model training. Specifically, NEG needs a noise distribution $P_n$ to generate negative samples for every positive one. $P_n$ can be arbitrarily chosen, and it is most often set empirically by raising the frequency to the power $3/4$ for the best quality of embeddings. For a target node $v_i$ with vector $z_i$ and an output (context) node $v_o$, the training loss of NEG is formally defined as

$$\mathcal{L} = -\log \sigma\left(z_o'^\top z_i\right) - \sum_{v_j \in N_{neg}} \log \sigma\left(-z_j'^\top z_i\right), \tag{21}$$

where $N_{neg}$ is the set of negative samples and $z_j'$ is the output vector of node $v_j$ that corresponds to the $j$th column of the output matrix in the skip-gram model. The training objective is thus simplified to discriminating the output node $v_o$ from the nodes drawn from $P_n$ using logistic regression. To update node embeddings under NEG, the derivative of the loss with regard to the input of output node $v_j$ is given by

$$\frac{\partial \mathcal{L}}{\partial \left(z_j'^\top z_i\right)} = \sigma\left(z_j'^\top z_i\right) - t_j, \tag{22}$$

where $t_j$ is a binary indicator of samples, i.e., $t_j = 0$ if $v_j$ is a negative sample, otherwise $t_j = 1$. The output vector is updated by

$$z_j'^{(new)} = z_j'^{(old)} - \eta \left(\sigma\left(z_j'^\top z_i\right) - t_j\right) z_i, \tag{23}$$

where $\eta$ is the learning rate. By using NEG, we just need to update the output vectors of nodes from $\{v_o\} \cup N_{neg}$ instead of the entire node set $V$. The computational effort is therefore saved significantly. Finally, the node vector is updated accordingly by backpropagating the error to the hidden layer, i.e.,

$$z_i^{(new)} = z_i^{(old)} - \eta \sum_{v_j \in \{v_o\} \cup N_{neg}} \left(\sigma\left(z_j'^\top z_i\right) - t_j\right) z_j'. \tag{24}$$
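The updates in Eqs. 22 to 24 can be sketched as one stochastic gradient step on a single training instance. The helper names, learning rate, and toy vectors below are hypothetical, and the noise distribution helper assumes the common choice of raising frequencies to the power 3/4.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def noise_distribution(freqs, power=0.75):
    """Normalize frequencies raised to `power` into a sampling distribution."""
    weights = {v: f ** power for v, f in freqs.items()}
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}


def neg_sampling_step(z_in, out_vecs, positive, negatives, lr):
    """One NEG update for a target vector z_in.

    Only the output vectors of the positive node and the sampled negatives
    are modified (in place). Returns the new target vector and the loss
    measured before the update.
    """
    grad_in = [0.0] * len(z_in)
    loss = 0.0
    for j, t in [(positive, 1.0)] + [(k, 0.0) for k in negatives]:
        prob = sigmoid(sum(a * b for a, b in zip(out_vecs[j], z_in)))
        loss -= math.log(prob) if t == 1.0 else math.log(1.0 - prob)
        err = prob - t                              # Eq. 22
        for d in range(len(z_in)):
            grad_in[d] += err * out_vecs[j][d]      # accumulated for Eq. 24
            out_vecs[j][d] -= lr * err * z_in[d]    # Eq. 23
    new_in = [z - lr * g for z, g in zip(z_in, grad_in)]
    return new_in, loss


# Hypothetical vectors: node 0 is the true context, nodes 1 and 2 are negatives,
# node 3 is never sampled and therefore never touched.
out_vecs = {0: [0.3, 0.1], 1: [-0.1, 0.2], 2: [0.2, 0.2], 3: [0.5, 0.5]}
z = [0.1, -0.2]
losses = []
for _ in range(50):
    z, step_loss = neg_sampling_step(z, out_vecs, positive=0, negatives=[1, 2], lr=0.5)
    losses.append(step_loss)

probs = noise_distribution({"a": 16, "b": 1})
```

The cost of each step depends only on the number of negatives, not on the size of the node set, which is the source of the savings described above.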

| Technique | Working stage | Technical principle | Optimization goal | Publication |
|---|---|---|---|---|
| Hierarchical Softmax | Node embedding | Leverage a tree structure to minimize the computation of conditional probabilities | Computational complexity reduction, acceleration of embedding process | (Morin and Bengio, 2005; Perozzi et al., 2014) |
| Negative Sampling | Feature extraction | Reduce the number of output vectors that need to be updated | Computational complexity reduction, acceleration of embedding process | (Gutmann and Hyvärinen, 2012; Mikolov et al., 2013b) |
| Attention Mechanism | (1) Feature extraction; (2) Node embedding | (1) Replace previously fixed hyperparameters with trainable ones; (2) Distinguish neighborhood’s importance via trainable weights | Enhancement of training efficiency and effectiveness | (Abu-El-Haija et al., 2018; Velickovic et al., 2018) |

Attention Mechanism Ever since attention mechanism was proposed, it has become an effective way to help models focus on the most important part of data. NRL also benefits from attention mechanism by conducting attention-guided random walks in feature extraction stage, and designing attention-based encoder in node embedding stage.

In the stage of feature extraction, the attention mechanism (Abu-El-Haija et al., 2018) is borrowed to lead the random walk to optimize an upstream objective. Let $\mathcal{T}$ be the transition matrix, $\tilde{P}^{(0)}$ be the initial positions matrix with $\tilde{P}^{(0)}_{vv}$ set to the number of walks starting at node $v$, and $C$ be the walk length; the context distribution can be represented by a $C$-dimensional vector $Q = (Q_1, Q_2, \dots, Q_C)$. To obtain the expectation on the co-occurrence matrix, $\mathbb{E}[D]$, $Q_k$ needs to be assigned to $\mathcal{T}^k$ as a coefficient. An attention model is proposed to learn $Q$ automatically, where $Q = \mathrm{softmax}(q_1, q_2, \dots, q_C)$, and all $q_k$ can be trained by backpropagation. The attention model aims to guide the random surfer on “where to attend to” as a function of distance. To this end, the model on $Q$ is trained according to the expectation on the random walk matrix, i.e.,

$$\mathbb{E}\left[D; q_1, q_2, \dots\right] = \tilde{P}^{(0)} \lim_{C \to \infty} \sum_{k=1}^{C} \mathrm{softmax}(q_1, q_2, \dots)_k \, \mathcal{T}^k. \tag{25}$$

Many random walk based methods like DeepWalk are special cases of the above equation where $C$ is not infinite and $Q_k$ are fixed a priori, i.e., $Q_k = 1 - \frac{k-1}{C}$.

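In the stage of node embedding, the graph attention network (GAT) (Velickovic et al., 2018) proposes to incorporate the attention mechanism into a spatial GCN to provide differentiated weights of neighborhoods. Specifically, GAT defines a graph attention layer parametrized by a weight vector $a$ and builds a graph attention network by stacking the layers. The attention function is defined to measure the importance of neighbor $v_j$ to the central node $v_i$,

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^\top [W h_i \,\|\, W h_j]\right)\right)}{\sum_{k \in N(v_i)} \exp\left(\mathrm{LeakyReLU}\left(a^\top [W h_i \,\|\, W h_k]\right)\right)}, \tag{26}$$

where $h_i$ refers to the feature of input node $v_i$, $\|$ denotes concatenation, and $W$ denotes the weight matrix of a shared linear transformation which is applied to every node. The convolution operation (aggregation encoder) is defined as

$$h_i' = \sigma\left(\sum_{j \in N(v_i)} \alpha_{ij} W h_j\right). \tag{27}$$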
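A single-head graph attention layer in the spirit of Eqs. 26 and 27 can be sketched in plain Python. The sketch assumes that each neighborhood list already contains the node itself, uses ReLU as the output non-linearity, and takes hypothetical weights; none of these choices are prescribed by the text above.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, x):
    return [dot(row, x) for row in M]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_layer(neighbors, H, W, a):
    """Single-head attention layer; returns new features and attention weights."""
    Wh = {v: matvec(W, h) for v, h in H.items()}
    alpha, out = {}, {}
    for i in H:
        # Score each neighbor from the concatenation [W h_i || W h_j].
        scores = [leaky_relu(dot(a, Wh[i] + Wh[j])) for j in neighbors[i]]
        top = max(scores)                       # subtract max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha[i] = {j: e / total for j, e in zip(neighbors[i], exps)}
        agg = [sum(alpha[i][j] * Wh[j][d] for j in neighbors[i])
               for d in range(len(Wh[i]))]
        out[i] = [max(0.0, x) for x in agg]     # ReLU as the non-linearity
    return out, alpha


# Hypothetical path graph 0 - 1 - 2; neighborhoods include the node itself.
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
H = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
W = [[1.0, 0.0], [0.0, 1.0]]

out, alpha = gat_layer(neighbors, H, W, a=[0.5, -0.5, 1.0, 0.0])
_, uniform = gat_layer(neighbors, H, W, a=[0.0, 0.0, 0.0, 0.0])
```

With an all-zero attention vector every neighbor receives the same weight, so the layer reduces to a plain mean aggregator; a trained vector `a` is what lets the layer weight neighbors differently.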

## 4. Recent Advances in Network Representation Learning

Here we review existing NRL studies with a focus on recent methods that have achieved significant advances in machine learning and/or data mining. These studies are classified into four categories and the detailed taxonomy is shown in Fig. 4 which enables us to classify current NRL methods based on whether the learning is: 1) performed in centralized or distributed fashion; 2) using single network or multiple networks; 3) applied to static network or dynamic evolving network. Besides, we also include some works on knowledge graph as one of important extensions of NRL. We review these works according to the proposed reference framework and present a brief summary as shown in Table 6.

### 4.1. Conventional NRL Methods

Methods in this category share some common features such as performing in a centralized fashion, applying to only single network representation learning and only learning from a static network. We introduce recent advances in this category according to two main types of network properties, i.e., pairwise proximity and community structure.

#### 4.1.1. Pairwise Proximity

MVE (Qu et al., 2017) presents a multi-view embedding framework where pairs of nodes with different views are sampled as instances. Each view corresponds to a type of proximity between nodes, for example, the following-followee, reply, retweet relationships in many social networks. These views are usually complementary to each other. MVE uses skip-gram model to yield the view-specific representations that preserve the first-order proximities encoded in different views. Moreover, an attention-based voting scheme is proposed to identify important views by learning different weights of views.

SDNE (Wang et al., 2016) presents a semi-supervised learning model to learn node representations by capturing both local and global structure. The model architecture, illustrated in Fig. 5, consists of two components: an unsupervised component and a supervised component. The former is designed to learn node representations by preserving the second-order proximity, and the latter utilizes node connections as the supervised information to exploit the first-order proximity and refine node representations. Specifically, given the adjacency matrix $A$, the first component utilizes a deep autoencoder to reconstruct the neighborhood structure of each node. The encoder relies on multiple non-linear functions to encode each node $v_i$ into a vector representation. The hidden representation in the $k$th layer is defined as

$$h_i^{(1)} = \sigma\left(W^{(1)} x_i + b^{(1)}\right), \qquad h_i^{(k)} = \sigma\left(W^{(k)} h_i^{(k-1)} + b^{(k)}\right), \; k = 2, \dots, K, \tag{28}$$

where $x_i$ is the input vector of $v_i$, and $W^{(k)}$ and $b^{(k)}$ are the weights and biases, respectively, in the $k$th layer. The decoder reconstructs the input vectors (e.g., $\hat{x}_i$) from the most hidden vectors (e.g., $h_i^{(K)}$) by means of non-linear functions. Note that the number of non-zero elements in $A$ is far less than that of zero elements, so the autoencoder is prone to reconstruct the zero elements. To avoid this situation, SDNE imposes more penalty on the reconstruction error of non-zero elements, and the loss function is defined as

$$\mathcal{L}_{2nd} = \sum_{i=1}^{n} \left\|\left(\hat{x}_i - x_i\right) \odot c_i\right\|_2^2, \tag{29}$$

where $\odot$ denotes the Hadamard product and $c_i$ is a penalty vector whose $j$th entry is $1$ if $a_{ij} = 0$ and $\beta > 1$ otherwise. The second component enhances the first-order proximity by borrowing the idea of Laplacian Eigenmaps to incur a penalty once neighboring nodes are embedded far away. Consequently, the loss function is defined as

$$\mathcal{L}_{1st} = \sum_{i,j=1}^{n} a_{ij} \left\|h_i^{(K)} - h_j^{(K)}\right\|_2^2. \tag{30}$$

In addition, in order to avoid falling into local optima in the parameter space, SDNE leverages a deep belief network to pretrain the parameters first.

DNGR (Cao et al., 2016) designs a deep denoising autoencoder method to capture the non-linearities of network features. Its basic idea is to learn node representations from the positive pointwise mutual information (PPMI) matrix of nodes and their contexts. As illustrated in Fig. 6, DNGR consists of three components: random surfing, calculation of PPMI and a stacked denoising autoencoder (SDAE). In the first component, random surfing is used to generate the probabilistic co-occurrence (PCO) matrix, which corresponds to a transition matrix by nature. Then the second component calculates the PPMI matrix based on the PCO matrix by following (Bullinaria and Levy, 2007). After that, SDAE is used to learn highly non-linear abstractions. In order to recover the complete matrix under certain assumptions, SDAE partially corrupts the training sample $x$ by randomly assigning some of $x$’s entries to zero with a certain probability. As a result, the objective becomes minimizing the reconstruction loss, i.e.,

$$\min_{\theta_1, \theta_2} \sum_{i=1}^{n} \mathcal{L}\left(x^{(i)}, g_{\theta_2}\left(f_{\theta_1}\left(\tilde{x}^{(i)}\right)\right)\right), \tag{31}$$

where $f_{\theta_1}$ denotes an encoding function, $g_{\theta_2}$ denotes a decoding function, and the $i$th instance and its corrupted form are denoted by $x^{(i)}$ and $\tilde{x}^{(i)}$ respectively.
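The PPMI step of DNGR can be illustrated with the standard definition of positive pointwise mutual information applied to a non-negative co-occurrence matrix. The function name and the toy matrices are hypothetical; the exact normalization used by DNGR follows (Bullinaria and Levy, 2007) and may differ in details.

```python
import math


def ppmi(counts):
    """Positive pointwise mutual information of a non-negative co-occurrence matrix.

    PMI(i, j) = log( c_ij * total / (row_i * col_j) ), clipped below at zero;
    zero co-occurrences are mapped to zero.
    """
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    result = []
    for i, row in enumerate(counts):
        out_row = []
        for j, c in enumerate(row):
            if c > 0:
                pmi = math.log(c * total / (row_sums[i] * col_sums[j]))
                out_row.append(max(0.0, pmi))
            else:
                out_row.append(0.0)
        result.append(out_row)
    return result


# Two hypothetical co-occurrence matrices.
associated = ppmi([[0.0, 2.0], [2.0, 0.0]])   # nodes only co-occur with each other
independent = ppmi([[1.0, 1.0], [1.0, 1.0]])  # co-occurrence carries no information
```

Entries that co-occur more often than chance get a positive score, while pairs at or below chance level are clipped to zero, which keeps the input of the autoencoder sparse and non-negative.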

Struc2vec (Ribeiro et al., 2017) insists that node identity similarity should be independent of network position and neighborhoods’ labels. To well preserve the property, the input network is firstly preprocessed into a context graph which is a multi-layer weighted graph described in Sec. 3.1.1. Then a biased random walk process is conducted to produce node traveling sequences. Each walker chooses its next step on the same layer or across different layers and the choosing probabilities are proportional to edge weights, so that the structurally similar nodes are more likely to be visited. After having samples, Struc2vec uses skip-gram to train the learning model and hierarchical softmax is leveraged to minimize the complexity.

#### 4.1.2. Community Structure

MRF (Jin et al., 2019) proposes a structured pairwise Markov random field framework. To ensure a globally coherent community structure, MRF adopts the Gibbs distribution to measure the posterior probability $P(C \mid A)$ of the community partition $C$ given the network adjacency matrix $A$, and maximizes $P(C \mid A)$ by optimizing a well-designed energy function $E(C; A)$.