I Introduction
The last decade has witnessed tremendous success of deep learning models, most of which rely heavily on the availability of large amounts of training data [1, 2]. Semi-supervised learning (SSL) [3, 4], which is essentially close to the recently popular scheme of few-shot learning [5, 6, 7], naturally aims at alleviating such reliance on training data. Arguably the most classic model for SSL is based on transductive inference on graphs, i.e., label propagation (LP) [3, 4]. Given a mixed set of labeled and unlabeled data points (e.g., images), LP first constructs a homogeneous affinity network (e.g., a nearest-neighbor adjacency matrix), and then propagates labels on the network. Due to its simplicity and effectiveness, LP has found numerous industrial applications [8, 9] and attracted various follow-up research [10, 11, 12].

While SSL is well studied on homogeneous networks, in the real world data are often multi-typed and multi-relational, and can be better modeled by heterogeneous networks [13, 14]. For example, in a movie recommendation dataset [15], the basic units can be users, movies, actors, genres and so on, whereas in a place recommendation dataset [16, 17] objects can be users, places, categories, locations, etc. Moreover, knowledge bases such as Freebase and YAGO can also be naturally modeled by heterogeneous networks, due to their inherently rich types of objects and links. As a powerful model, the heterogeneous network enables various tasks such as the classification and recommendation of movies and places, as well as relational inference in knowledge bases, where SSL is highly desired due to the lack of labeled data. However, trivial adoptions of LP on heterogeneous networks that suppress the type information are not ideal, since they do not differentiate the functionalities of multiple types of objects and links.
To leverage the multi-relational nature of heterogeneous networks, the concept of meta-paths (or meta-graphs, as a more general notion) has been proposed and widely used by existing models on heterogeneous networks [18, 19, 20]. For the particular problem of SSL, [14, 21, 22, 13] have also leveraged meta-paths to capture the different semantics among targeted types of objects. However, as pointed out in [23, 16], the assumption that all useful meta-paths can be predefined by humans is often not valid, and exhaustive enumeration and selection over the exponential number of all possible ones is impractical. Therefore, existing methods that consider a fixed set of meta-paths cannot effectively capture and differentiate the various object interactions on heterogeneous networks.
In this work, to address the limitations of existing works, we propose a novel NEP (Neural Embedding Propagation) framework for SSL over heterogeneous networks. Figure 1 gives a running toy example of NEP, which is a powerful yet efficient neural framework that coherently combines an object encoder [16, 17] and a modular network [24, 25]. It leverages the compositional nature of meta-paths and trivially generalizes to attributed networks.
In Figure 1, the object encoder can be simply implemented as an embedding lookup table. Trained together with a parametric predictive model (e.g., a multi-layer perceptron (MLP)), it computes a mapping between object representations in a latent embedding space and the object labels given as supervision. Such embeddings capture the correlations among object labels and alleviate their inherent sparsity and noise. We find it also important to allow the embeddings of labeled objects to change during training, which indicates that unlabeled data may even help improve the modeling of labeled data.
One step further, to capture the complex interactions on different types of links, we cast each of them as a unique differentiable neural network module (e.g., also an MLP). Different meta-paths then correspond to unique modular networks, which are dynamically composed by stacking the corresponding neural network layers w.r.t. the particular link types along the paths. During the training of NEP, each time starting from a particular object, to mimic the process of LP, we propagate its label along a particular sampled path, by feeding its object embedding into the corresponding modular network. An $\ell_2$ loss is computed between the propagated embedding and the original embedding of the end object, to require proper smoothness between the connected objects. Then the gradients are back-propagated along the path to update both the corresponding neural modules and the object encoder.

Due to the expressiveness of neural networks, NEP is able to automatically discover the functionalities of different types of links and dynamically model their common compositions (i.e., meta-paths) on-the-fly based on uniform random walks, which allows us to abandon the explicit consideration of a limited set of meta-paths and rather model them in a data-driven way. Finally, as non-linearity can be easily added into the MLP-based neural network modules, NEP can be more flexible with complex object interactions.
To further improve the efficiency of NEP, we design a series of intuitive and effective training strategies. Firstly, in most scenarios, we only care about the labels of certain targeted types of objects. This allows us to compute embeddings only for those objects and sample random paths only among them. Secondly, to fully leverage training labels, we reversely sample the random paths from labeled objects, which makes sure the propagation paths all end on labeled objects, so the propagated embeddings can directly encode high-quality label information. Finally, to boost training efficiency, we design a two-step path sampling approach, which essentially groups instances of the same meta-paths into mini-batches, so that the same modular network is instantiated and trained in each mini-batch, leading to a 300+ times gain in efficiency as well as a slight gain in effectiveness.
Our experiments are done on three real-world heterogeneous networks with millions of objects and links, where we comprehensively study the effectiveness, efficiency and robustness of NEP. NEP achieves substantial relative gains in classification accuracy over the average scores of all baselines across the three datasets, which indicates the importance of properly modeling complex object interactions on heterogeneous networks. Besides, NEP is also shown to be the most efficient regarding the leverage of training data and computational resources, while being robust towards hyper-parameters in large ranges. All code will be released upon the acceptance of this work.
II Related Work and Preliminaries
II-A Heterogeneous Network Modeling
Networks are widely adopted as a natural and generic model for interacting objects. Most recent network models focus on higher-order object interactions, since few interactions are independent of others. Arguably, the most popular ones include personalized PageRank [26] and DeepWalk [27] based on random walks, LINE [28] and graph convolutional networks [12] leveraging direct node neighborhoods, as well as higher-order graph cut [29] and graph kernel methods [30] considering small network motifs with exact shapes. All of them have stimulated various follow-up works, the discussion of which is beyond the scope of this work.
In the real world, objects have multiple types and interact in different ways, which leads to the invention of heterogeneous networks [31]. Due to their capacity of retaining rich representations of objects and links, heterogeneous networks have drawn increasing research attention in the past decade and facilitated various downstream applications including link prediction [32], classification [33], clustering [34], recommender systems [35, 36] and so on.

Since objects and links in heterogeneous networks have multiple types, the interaction patterns are much more complex. To capture such complex interactions, the tool of meta-path has been proposed and leveraged by most existing models on heterogeneous networks [18]. Traditional object proximity models measure the total strength of various interactions by counting the number of instances of different meta-paths between objects and adding up the counts with predefined or learned weights [18, 23, 19, 37, 22, 21, 38, 14, 39], whereas the more recent network representation learning methods leverage meta-path-guided random walks to jointly model multiple interactions in a latent embedding space [20, 40, 15, 41, 13, 42]. However, the consideration of a fixed set of meta-paths, while helping regulate the complex interactions, largely relies on the quality of the meta-paths under consideration, and limits the flexibility of the model, which is unable to handle any interactions not directly captured by those meta-paths.
II-B Semi-Supervised Learning
Semi-supervised learning (SSL) aims at leveraging both labeled and unlabeled data to boost the performance of various machine learning tasks. Among many SSL methods, the most classic and influential one might be label propagation (LP) [3, 4]. Its original version assumes the input of a small amount of labeled data and a data affinity network, either computed based on the distances among attributed objects or derived from external data. To predict the labels of unlabeled data, it propagates the labels from labeled data based on the topology of the affinity network, under the smoothness assumption that nearby objects on the network tend to have similar labels. Due to the simplicity and effectiveness of LP, many follow-up works have been proposed to improve it, especially in the homogeneous network setting [10, 11, 12, 43].
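To make the classic scheme concrete, the following is a minimal numpy sketch of LP in the spirit of [3, 4]; the chain graph, the clamping of seed labels, and the parameter values are our own illustrative assumptions, not the exact formulation of those works:

```python
import numpy as np

def label_propagation(W, Y, labeled_mask, alpha=0.9, iters=100):
    """Iteratively propagate label distributions over an affinity matrix W,
    pulling toward (and clamping) the labeled seed rows at each step."""
    P = W / W.sum(axis=1, keepdims=True)       # row-normalized transition matrix
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = alpha * (P @ F) + (1 - alpha) * Y  # smooth along edges, keep seeds
        F[labeled_mask] = Y[labeled_mask]      # clamp the known labels
    return F.argmax(axis=1)

# A 4-node chain 0-1-2-3 with the two endpoints labeled (classes 0 and 1):
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Y = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], dtype=float)
pred = label_propagation(W, Y, np.array([True, False, False, True]))
# pred → [0, 0, 1, 1]: each unlabeled node takes its nearer seed's label
```

By symmetry of the chain, node 1 ends up closer to seed 0 and node 2 closer to seed 3, which is exactly the smoothness assumption at work.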
SSL has also been studied in the heterogeneous network setting. The uniqueness of heterogeneous networks is their accommodation of multi-typed objects and relations, which leads to complex object interactions and propagation functions. Therefore, existing SSL models on heterogeneous networks leverage a given set of meta-paths to regulate and capture the complex object interactions. For example, [21, 22] both use a set of meta-paths to derive multiple homogeneous networks and optimize the label propagation process on all of them, whereas [37, 14, 44, 45, 46] jointly optimize the weights of different meta-paths. [13, 42] simultaneously preserve the object proximities w.r.t. multiple meta-paths to learn a unique network embedding. However, besides the limitation of a given set of meta-paths, these methods still only consider simple interaction patterns with linear propagation functions.
III Neural Embedding Propagation
In this section, we describe our NEP (Neural Embedding Propagation) algorithm, which coherently combines embedding learning and modular networks into a powerful yet efficient SSL framework over heterogeneous networks.
III-A Motivations and Overview
In this work, we study SSL over heterogeneous networks. Therefore, the input of NEP is a heterogeneous network $G=\{V, E\}$, where $V$ and $E$ are the sets of multi-typed objects and links, respectively. In general, $G$ can be associated with object labels $Y$ and object attributes $A$, where $Y$ is often only available for a small subset of objects $V_l \subset V$, and $A$ can be available for all objects, part of the objects, or none of the objects at all. In this work, we focus on predicting the labels of all objects in $V_u = V \setminus V_l$ based on both $G$ and $Y$.
Before formally introducing the heterogeneous network setting, let us first consider SSL over homogeneous networks. Particularly, we aim to explain why LP is sufficiently effective in that situation.
In homogeneous networks, since all objects share a single type, it is legitimate for LP to directly propagate any label to any object in the network. Also, labels on a single type of objects are often mutually exclusive and can thus be considered disjointly without a predictive model. Moreover, since all links share a single type, the only thing that can differ across links is their weight, which can be easily modeled by simple linear propagation functions.
To understand the unique challenges of SSL in the heterogeneous network setting, we first briefly review the definition of heterogeneous networks as follows.
Definition III.1.
A heterogeneous network [31, 18] is a network $G=\{V, E\}$ with multiple types of objects and links. Within $G$, $V$ is the set of objects, where each object $v \in V$ is associated with an object type $\phi(v) \in \mathcal{O}$, and $E$ is the set of links, where each link $e \in E$ is associated with a link type $\psi(e) \in \mathcal{R}$. It is worth noting that a link type automatically defines the object types on its two ends.
Our toy example of GitHub data can be seen as a heterogeneous network, where the basic object types include user, repository and organization. The particular network schema is shown in Figure 2.
According to Definition III.1, in the heterogeneous network setting, due to the existence of multiple object types, labels of different types of objects cannot be directly propagated; rather, they interact implicitly. For example, in our GitHub network in Figure 1, directly assigning a user label like “ios developer” to a repository object does not make much sense, but such a user label does indicate that the linked repositories might be more likely to be associated with labels like “written in objective c”.
To capture such latent semantics and interactions of labels, as well as to address their inherent noise and sparsity, we design an object encoder that maps various labels into a common embedding space (Section III.B). As a consequence, we propagate object embeddings instead of labels on the network, and a parametric predictive model is applied to map the embeddings back to labels upon prediction. Moreover, as we will show in more detail later, this object encoder can be easily extended to incorporate the rich information in various object attributes.
In heterogeneous networks, different types of objects can interact in various ways, which obviously cannot be sufficiently modeled by simple weighted links. Consider our GitHub network in Figure 1, where users can “belong to” organizations and “create” repositories. The links derived from the “belong to” and “create” relations should thus determine different label propagation functions. For example, the labels of organizations might be something like “stanford university” or “google inc.”, whereas those of repositories might be “written in objective c” or “tensorflow application”. In this case, although the labels of both organizations and repositories can influence users’ labels regarding “skills” and “interests”, the mappings of such influences should be quite different. Moreover, consider the links even between the same types of objects, say, users and repositories. Since users can “create” or “watch” repositories, the different types of links should have different functions regarding label propagation. For example, when a user “creates” a repository, her labels regarding “skills” like “fluent in python” might strongly indicate the labels of the “created” repository like “written in python”, but when she “watches” a repository, her labels regarding “interests” like “deep learning fan” will more likely indicate the labels of the “watched” repository like “tensorflow application”.

To model the multi-typed relations among objects, we propose to cast each type of links as a unique neural network module (Section III.C). The same module is reused over all links of the same type, so the number of parameters to be learned is independent of the size of the network, making the model efficient in memory usage and easy to train. These link-wise neural networks are jointly trained with the object encoders, so that the complex semantics in object labels (and possibly object attributes) can be well modeled to align with the various object interactions and propagation functions determined by different types of links.
One step further, as pointed out by various existing works, we notice that the higher-order semantics in heterogeneous networks can be regulated by the tool of meta-path, defined as follows.
Definition III.2.
A meta-path [18] is a path defined on the network schema of a heterogeneous network, denoted in the form $o_1 \xrightarrow{r_1} o_2 \xrightarrow{r_2} \cdots \xrightarrow{r_K} o_{K+1}$, where $o_i \in \mathcal{O}$ and $r_i \in \mathcal{R}$, which describes a composite relation between object types $o_1$ and $o_{K+1}$.

Each meta-path thus captures a particular aspect of semantics. Continuing with our example on the GitHub network in Figure 1, the meta-path of user-repository-user carries quite different semantics from user-organization-user. Thus, the two pairs of users at the ends of these two paths are similar in different ways, which are composed by the modular links along the paths and should imply different label propagation functions.
To fully incorporate the higher-order complex semantics in heterogeneous networks, we leverage the compositional nature of paths and propose to jointly train our link-wise neural network modules by randomly sampling paths on heterogeneous networks and dynamically constructing the neural modular networks corresponding to their underlying meta-paths (Section III.D). In this way, we do not require the input of a given set of useful meta-paths, nor do we need to enumerate all legitimate ones up to a certain size. Instead, we let the random walker compose arbitrary meta-paths during training, and automatically estimate their importance and functionalities regarding LP on-the-fly.
Finally, although NEP is powerful yet light in parameters, we deliberately design a series of training techniques to further improve its efficiency (Section III.E). We also systematically analyze the connections between NEP and various popular SSL algorithms, and briefly discuss several straightforward extensions of NEP left as future work (Section III.F).
III-B Object Encoder of Labels and Beyond
Standard LP directly propagates labels on the whole network by assigning each object a label probability distribution. In this way, besides the label-object incompatibility discussed before, it also ignores the complex label semantics and correlations. Moreover, labels in real-world datasets are often sparse and noisy, due to the high expense of high-quality label generation, which leads to the build-up of errors during propagation.
To overcome these problems, instead of propagating labels, we propose to first encode various object labels into a common latent space, and then propagate the object embeddings on the network. To this end, we leverage the power of neural representation learning by jointly training an object embedding function and a label prediction function for object encoding.
Particularly, we have the embedding of object $v$ as $\mathbf{x}_v = f(v)$. In the simplest case, $f$ can be implemented as a randomly initialized learnable embedding lookup table, i.e., $\mathbf{x}_v = \mathbf{E}^\top \mathbf{u}_v$, where $\mathbf{E} \in \mathbb{R}^{n \times d}$ holds the embeddings of the total $n$ objects on the network in a $d$-dimensional latent space, and $\mathbf{u}_v \in \{0, 1\}^n$ is the one-hot vector representing the identity of $v$.

To encode various labels into a common latent space, we apply an MLP $g$ on the object embedding as a parametric label prediction model and impose a supervised loss in terms of cross-entropy on softmax classification w.r.t. ground-truth labels on labeled objects:
$$\mathcal{L}_{sup} = -\frac{1}{n_l} \sum_{i=1}^{n_l} \log \frac{\exp\left(g(\mathbf{x}_{v_i})_{y_i}\right)}{\sum_{c \in \mathcal{C}} \exp\left(g(\mathbf{x}_{v_i})_c\right)}, \quad (1)$$
where $n_l$ is the number of labeled objects, $y_i$ is the ground-truth label of object $v_i$, and $\mathcal{C}$ is the set of all distinct labels on the network. It is trivial to encode multiple labels for a single object, by summing the loss over all of its labels.
Moreover, we have
$$g(\mathbf{x}) = \mathbf{h}_{K}, \quad (2)$$
where
$$\mathbf{h}_k = \mathrm{ReLU}\left(\mathbf{W}_k \mathbf{h}_{k-1} + \mathbf{b}_k\right), \quad (3)$$
$K$ is the number of layers in the MLP, $\mathbf{W}_k$ and $\mathbf{b}_k$ are the parameters of the $k$-th layer, and $\mathbf{h}_0 = \mathbf{x}$. We use $\Theta_g$ to denote all parameters in the MLP-based parametric prediction model. As motivated above, such an MLP is useful in capturing the complex label semantics and correlations, while at the same time addressing label noise and sparsity.
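As a concrete (and deliberately tiny) illustration, the lookup-table encoder and the supervised loss of Eq. 1 can be sketched in numpy as follows; the layer sizes, initialization scales, and two-layer depth are illustrative assumptions rather than the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_classes = 5, 8, 3          # toy sizes: objects, embedding dim, labels

# Embedding lookup table E: one learnable row per object (Section III.B).
E = rng.normal(scale=0.1, size=(n, d))

# A 2-layer MLP predictive model g mapping embeddings to label scores.
W1, b1 = rng.normal(scale=0.1, size=(d, d)), np.zeros(d)
W2, b2 = rng.normal(scale=0.1, size=(d, n_classes)), np.zeros(n_classes)

def encode(v):
    """Object encoder f: identity -> embedding (a plain table lookup)."""
    return E[v]

def predict(x):
    """Label predictor g: MLP with a ReLU hidden layer and softmax output."""
    h = np.maximum(0.0, x @ W1 + b1)
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max())
    return p / p.sum()

def supervised_loss(labeled):
    """Cross-entropy of Eq. (1), averaged over the labeled objects."""
    return -np.mean([np.log(predict(encode(v))[y]) for v, y in labeled])

loss = supervised_loss([(0, 1), (2, 0)])   # two hypothetical labeled objects
```

In training, both `E` and the MLP parameters would receive gradients, which is exactly why the embeddings of labeled objects can keep changing during training.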
Note that what we discussed above is a basic object encoder that only considers object labels. As will be shown in Section III.F, it is straightforward to extend this object encoder to consider the rich information of available attributes associated with objects.
III-C Type-Aware Link-Wise Modules
Now we consider the process of embedding propagation on heterogeneous networks, where multiple types of objects interact in rather complex ways. Our key insight here is that if we regard each link in the network as an influence propagation channel that allows the connected objects to influence each other, then different link types should naturally determine different propagation functions. To explicitly leverage this insight, we use a unique neural network to model the propagation function of each type of links, which acts as a reusable module on the whole network.
Particularly, for each module, we still resort to the MLP of feed-forward neural networks, due to its compatibility with the object encoder, its representation power to model the complex label-link interactions, as well as its model simplicity. For each link type $r$, we have
$$f_r(\mathbf{x}) = \mathbf{h}^r_{K_r}, \quad (4)$$
where
$$\mathbf{h}^r_k = \mathrm{ReLU}\left(\mathbf{W}^r_k \mathbf{h}^r_{k-1} + \mathbf{b}^r_k\right), \quad (5)$$
$K_r$ is the number of layers in the MLP, $\mathbf{W}^r_k$ and $\mathbf{b}^r_k$ are the parameters of the $k$-th layer, and $\mathbf{h}^r_0 = \mathbf{x}$. We use $\Theta_f$ to denote all parameters in all of the MLP-based link-wise neural network modules. Note that we use $\mathcal{R}$ to denote the set of all link types, and each link type is counted twice by considering the two propagation directions. As we will show in the experiments, compared with a linear MLP, a non-linear MLP allows the modeling of object interactions to be more flexible and effective.
Equipped with such link-wise propagation functions, to mimic the process of LP from object $v_i$ to object $v_j$ through link $e_{ij}$, we simply input the object embedding $\mathbf{x}_{v_i}$ into the neural network module corresponding to the link type $\psi(e_{ij})$, and get $f_{\psi(e_{ij})}(\mathbf{x}_{v_i})$ as the propagated embedding. An unsupervised loss (e.g., an $\ell_2$ loss) is then computed between the propagated embedding of $v_i$ on $v_j$ and the current embedding of $v_j$, to require label smoothness between $v_i$ and $v_j$, conditioned on their particular link type. Specifically, for all linked pairs of objects on the network, we have
$$\mathcal{L}_{un} = \sum_{e_{ij} \in E} \left\| f_{\psi(e_{ij})}(\mathbf{x}_{v_i}) - \mathbf{x}_{v_j} \right\|_2^2. \quad (6)$$
Multiple links among the same pair of objects can also be trivially considered with our model by adding up all corresponding losses.
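The link-wise modules and the unsupervised smoothness loss of Eq. 6 can be sketched as below; the link-type names, the one-hidden-layer module shape, and the toy embeddings are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8                                       # toy embedding dimension

def make_module(d, rng):
    """One reusable MLP module f_r per link type (Section III.C)."""
    W1 = rng.normal(scale=0.3, size=(d, d))
    W2 = rng.normal(scale=0.3, size=(d, d))
    def f(x):
        return np.maximum(0.0, x @ W1) @ W2  # one hidden ReLU layer
    return f

# One module per (directed) link type; reused across all links of that type.
modules = {"user->repo:create": make_module(d, rng),
           "user->repo:watch": make_module(d, rng)}

def smoothness_loss(links, X):
    """Eq. (6): squared L2 distance between the propagated embedding of v_i
    and the current embedding of v_j, summed over all typed links."""
    total = 0.0
    for i, j, r in links:                    # link (v_i, v_j) with type r
        total += np.sum((modules[r](X[i]) - X[j]) ** 2)
    return total

X = rng.normal(size=(4, d))                  # toy object embeddings
loss = smoothness_loss([(0, 1, "user->repo:create"),
                        (2, 3, "user->repo:watch")], X)
```

Because the module dictionary is keyed by link type rather than by link, the parameter count stays independent of the network size, as argued above.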
By combining the supervised loss in Eq. 1 and the unsupervised loss in Eq. 6, we arrive at the overall loss function of NEP, which implements SSL over a heterogeneous network as follows:
$$\mathcal{L} = \mathcal{L}_{un} + \lambda \mathcal{L}_{sup}, \quad (7)$$
where $\lambda$ is a trade-off weight. This shares the identical form with the general objective function of LP [3, 4] and various other SSL algorithms. By properly optimizing $\mathcal{L}$, we can jointly train our neural object encoder and link-wise modules, so that the embedding propagation along each link is jointly decided by both the end objects (particularly the propagated labels in our current model) and the link type.
III-D Comprehensive Semantics with Path Sampling
We notice that most existing models on heterogeneous networks, including the recent works on SSL [14, 13], leverage the tool of meta-paths to capture fine-grained semantics regarding the higher-order interactions involving multiple object and link types. However, all of them explicitly model a limited set of meta-paths, which covers only part of the complex interactions.
In this work, we leverage the compositional nature of paths, and propose to dynamically sample uniform random walks on heterogeneous networks and compose the corresponding modular neural networks on-the-fly during model training. In this sense, our consideration of meta-paths is truly data-driven: the sampled paths, while naturally preferring the more common and important underlying meta-paths in a particular heterogeneous network, can actually reveal any possible meta-path. Therefore, we are able to avoid the explicit consideration of any limited set of meta-paths and capture the comprehensive higher-order semantics in arbitrary heterogeneous networks.
When training NEP, instead of limiting the embedding propagation along direct links, we consider it along paths consisting of multiple links. Particularly, we revise the unsupervised loss function in Eq. 6 into
$$\mathcal{L}_{un} = \sum_{p \in \mathcal{P}} \left\| f_p(\mathbf{x}_{v_0}) - \mathbf{x}_{v_K} \right\|_2^2, \quad (8)$$
where $p$ is a path sampled with uniform random walks on the heterogeneous network from its source object $v_0$ to its destination object $v_K$, and $\mathcal{P}$ is the set of all randomly sampled paths. Correspondingly, we have $\mathcal{L} = \mathcal{L}_{un} + \lambda \mathcal{L}_{sup}$.
Definition III.3.
A uniform random walk in heterogeneous networks is a random walk that ignores object types. Particularly, on object $v$, the random walker picks the next object $u$ to go to based on the uniform link distribution $p(u \mid v) = 1/\deg(v)$, where $\deg(v)$ is the total number of links of all types that connect to $v$. We do not consider self-loops or restarts.
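A uniform random walker per Definition III.3 can be sketched as follows; the toy GitHub-style adjacency lists (users, repositories, organizations) are hypothetical:

```python
import random

def uniform_random_walk(adj, start, length, rng=random.Random(7)):
    """Type-ignoring uniform random walk (Definition III.3): at each object,
    pick the next neighbor uniformly over all incident links; no restarts."""
    path = [start]
    for _ in range(length):
        path.append(rng.choice(adj[path[-1]]))
    return path

# Toy multi-typed neighborhood lists; object types are simply ignored here.
adj = {"u1": ["r1", "o1"], "r1": ["u1", "u2"], "o1": ["u1"], "u2": ["r1"]}
walk = uniform_random_walk(adj, "u1", 4)   # a length-4 path (5 objects)
```

Note that because the walker ignores types, the sequence of link types it happens to traverse is precisely what induces a meta-path in a data-driven way.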
Next we discuss the construction of $f_p$, starting with the definition of a path $p$.
Definition III.4.
A path is an ordered list $p = (v_0, e_1, v_1, e_2, \ldots, e_K, v_K)$, where $v_0$ and $v_K$ are the source and destination objects, respectively, and $e_1, \ldots, e_K$ are the links along the path. $K$ is the number of links in the path, and a path with $K$ links is called a length-$K$ path.
With the link-wise neural network modules defined in Section III.C, we further leverage the idea of modular neural networks from visual question answering [24, 25], by dynamically constructing $f_p$ w.r.t. the underlying meta-path of $p$ as follows:
$$f_p = f_{\psi(e_K)} \circ \cdots \circ f_{\psi(e_2)} \circ f_{\psi(e_1)}. \quad (9)$$
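Functionally, Eq. 9 is plain composition of the per-link-type modules in path order, as this sketch shows; the scalar stand-in modules are illustrative, whereas in NEP each would be an MLP:

```python
def compose_path_module(link_modules, path_link_types):
    """Eq. (9): stack the link-wise modules along a path's underlying
    meta-path, in traversal order, into one propagation function f_p."""
    def f_p(x):
        for r in path_link_types:   # apply the first link's module first
            x = link_modules[r](x)
        return x
    return f_p

# Toy scalar "modules" standing in for the per-link-type MLPs.
link_modules = {"create": lambda x: 2 * x, "belong_to": lambda x: x + 1}
f_p = compose_path_module(link_modules, ["create", "belong_to", "create"])
# f_p(x) = 2 * ((2 * x) + 1), e.g. f_p(3) = 14
```

Since each composed `f_p` only references the shared modules, gradients flowing through any one path update the modules reused by every other path with the same link types.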
As we can see, by stacking the corresponding link-wise modules in the correct order, each meta-path now corresponds to a unique neural network model, whose components can be jointly trained and reused. As a consequence, each meta-path determines a unique learnable embedding propagation function, which further depends on the propagation functions of all of its component links. On one hand, the dynamically composed path-wise models capture the complex fine-grained higher-order semantics in heterogeneous networks; on the other hand, the learning of the link-wise modules is enhanced throughout training based on various paths. As a result, NEP can be efficiently trained to deeply capture the comprehensive semantics and importance of arbitrary meta-paths regarding the embedding propagation functions, which totally breaks free of the requirement of a given set of meta-paths and of explicit search or learning of linear importance weights [40, 14, 22].
III-E Further Efficiency Improvements
To further improve the efficiency of NEP, we design a series of intuitive and effective strategies.
Focusing on Targeted Types of Objects.
Our NEP framework is designed to model heterogeneous networks with multiple types of objects, which naturally can be associated with multiple sets of labels. However, in some real-world scenarios, we only care about the labels of some particular targeted types of objects. For example, when we aim to classify repositories on GitHub, we are not explicitly interested in user and organization labels.

Due to this observation, we can aggressively simplify NEP by only computing the embeddings of targeted types of objects and subsequently constraining the random paths to be sampled only among them. We call this model NEP-target. Compared with NEP-basic, NEP-target allows us to significantly reduce the size of the embedding lookup table in the object encoder, which accounts for most of the memory consumption, since the other model parameters are irrelevant to the network sizes. Moreover, since we focus on the embedding propagation among targeted types of objects, NEP-target effectively saves the time of learning the encodings of non-targeted types of objects, which leads to considerably shorter runtimes until convergence compared with NEP-basic. Finally, it also helps to alleviate the build-up of noises and errors when propagating through multiple poorly encoded intermediate objects, which results in a relative performance gain compared with NEP-basic.
Note that ignoring the embeddings of non-targeted objects does not actually contradict our model motivation, which is to capture the complex interactions among different types of objects. This simplification only works in particular scenarios like the ones we consider in this work, where we only care about and have access to the labels of particular types of objects, and the identities of non-targeted objects are less useful without the consideration of their labels and attributes. In this case, the only information that matters for the non-targeted types of objects is their types, which is sufficiently captured by our type-aware link-wise modules.
Fully Leveraging Labeled Data.
By focusing on targeted types of objects, we have saved a lot of training time otherwise spent learning the embeddings of non-targeted types of objects. However, on real-world large-scale networks, learning the embeddings of unlabeled targeted objects can still be rather inefficient. This is because the embeddings of most objects (i.e., the unlabeled ones) are meaningless at the beginning, and therefore the modeling of their interactions is also wasteful.
Our first insight here is that, to fully leverage labeled data, we should focus on paths that include at least one labeled object, whose embedding directly encodes label information. Since our modular neural networks are reused everywhere in the network, the propagation functions of different links and paths captured around labeled objects are automatically applied to those among unlabeled objects. Moreover, due to the small-diameter property of real-world networks [47], we assume that moderately long (e.g., length-4) paths with at least one labeled object can reach most unlabeled objects for proper learning of their embeddings.
One step further, we find it useful to only focus on paths ending on labeled objects. The insight here is that, according to Eq. 8, the loss is computed between the propagated embedding $f_p(\mathbf{x}_{v_0})$ of the start object $v_0$ and the current embedding $\mathbf{x}_{v_K}$ of the end object $v_K$, so training is more efficient if at least one of the two embeddings is “clean” in the sense of directly encoding the label information. In this case, $\mathbf{x}_{v_K}$ is clean if $v_K$ is labeled, but $f_p(\mathbf{x}_{v_0})$ is not clean even if $v_0$ is labeled. Therefore, we apply reverse path sampling, i.e., we always sample paths from labeled objects, and use them in the reverse direction, to make sure the end objects are always labeled.
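Reverse path sampling can be sketched as follows: walk forward from a labeled object, then flip the path so the destination is the labeled one; the toy undirected graph and label set are hypothetical:

```python
import random

def reverse_labeled_path(adj, labeled, length, rng=random.Random(3)):
    """Sample a walk starting at a labeled object, then reverse it so the
    path always *ends* on a labeled object whose embedding is 'clean'."""
    start = rng.choice(sorted(labeled))
    path = [start]
    for _ in range(length):
        path.append(rng.choice(adj[path[-1]]))
    return path[::-1]            # the destination (last element) is labeled

adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
p = reverse_labeled_path(adj, {"a"}, 3)   # length-3 path ending on "a"
```

On an undirected network this reversal is itself a valid walk, so the same composed modular network applies, only trained against a clean target embedding.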
We call this further improved model variant NEP-label. In our experiments, we observe that NEP-label leads to further shortened runtimes until convergence and another relative performance gain compared with NEP-target.
Training with Two-Step Path Sampling.
As discussed in Section III.D, one major advantage of NEP over existing SSL models on heterogeneous networks is the flexibility of considering arbitrary meta-paths and training the corresponding modular networks on-the-fly with path sampling based on uniform random walks, by leveraging the compositional nature of paths. However, since the modular networks composed for different paths have different neural architectures, this poses unique challenges for the efficient batch training of NEP, especially on GPUs.
We notice that, according to Eq. 9, paths sharing the same underlying meta-path correspond to the same composed modular network. Therefore, although paths sampled by uniform random walks can be arbitrary, we can always group them into smaller batches according to their underlying meta-paths. However, path grouping itself is time-consuming, and it leads to different group sizes, which is still not ideal for efficient batch training.
To address these challenges, we design a novel two-step path sampling approach for the efficient training of NEP, as depicted in Algorithm 1. Specifically, in order to sample a total number of $M = M_1 \times M_2$ paths (e.g., 100K), we first sample a smaller set of $M_1$ paths (e.g., 100). Then for each of these paths, we find its underlying meta-path $\mathcal{M}$, and sample $M_2$ paths (e.g., 1K) that follow $\mathcal{M}$. Therefore, the particular modular network corresponding to $\mathcal{M}$ can be composed only once and efficiently trained with standard gradient back-propagation on the batch of $M_2$ samples.
To sample random paths guided by particular meta-paths in Step 13, we follow the standard way in [20, 40]. However, different from them, our meta-paths are also sampled from the particular network, rather than given by domain experts or exhaustively enumerated. Assuming $M_1$ is sufficiently large compared with $M_2$, the total $M_1 \times M_2$ paths sampled by our two-step approach only differ in order from paths sampled by the original approach, which corresponds to the special case of $M_2 = 1$. Therefore, our path sampling approach is purely data-driven, and our model automatically learns the importance and complex propagation functions of meta-paths on-the-fly.
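The two-step scheme can be sketched as below, assuming small toy sizes; for simplicity this sketch matches meta-paths in the second step by rejection sampling, whereas the actual Step 13 uses meta-path-guided walks as in [20, 40]:

```python
import random
from collections import defaultdict

def two_step_sample(adj, obj_type, m1, m2, length, rng=random.Random(5)):
    """Step 1: draw m1 seed walks uniformly. Step 2: for each seed's
    underlying meta-path, collect m2 walks sharing that meta-path, so each
    batch can be trained with a single composed modular network."""
    def walk(start):
        p = [start]
        for _ in range(length):
            p.append(rng.choice(adj[p[-1]]))
        return p

    def metapath(p):
        return tuple(obj_type[v] for v in p)   # the walk's type sequence

    nodes = sorted(adj)
    batches = defaultdict(list)
    for _ in range(m1):
        seed = walk(rng.choice(nodes))
        mp, batch = metapath(seed), [seed]
        while len(batch) < m2:                 # reject walks off the meta-path
            cand = walk(rng.choice(nodes))
            if metapath(cand) == mp:
                batch.append(cand)
        batches[mp].extend(batch)
    return batches

adj = {"u1": ["r1"], "u2": ["r1"], "r1": ["u1", "u2"]}
obj_type = {"u1": "user", "u2": "user", "r1": "repo"}
batches = two_step_sample(adj, obj_type, m1=2, m2=3, length=2)
```

Every batch returned shares one meta-path key, so the corresponding modular network is composed once per batch rather than once per path.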
In Section IV, we show that our two-step path sampling approach can significantly reduce the runtimes of NEP, while also slightly boosting its performance, due to more stable training and faster convergence.
Training Algorithm.
Algorithm 1 gives an outline of our overall training process, which is based on NEP with two-step path sampling. In the inner loop starting from Line 6, it samples a path completely at random, without consideration of metapaths. Then it samples a batch of additional path instances under the same metapath. This strategy is crucial to eliminating the explicit consideration of a limited set of metapaths, which makes NEP different from all existing SSL algorithms on heterogeneous networks. It is also crucial for leveraging batch-wise gradient backpropagation, which utilizes the power of dynamically composed modular networks. In this way, our model is data-driven and able to consider any possible metapath underlying uniform random walks on heterogeneous networks.
Complexity Analysis.
In terms of memory, the number of parameters in NEP is O(n_t d + c d + r d^2), where d is the embedding dimension, n_t is the number of objects of the targeted types in the network, and c and r are the number of classes and the number of link types, respectively, which are independent of the network size. The n_t d term is due to the embedding lookup table, which can be further reduced to a constant if we replace the table with an MLP given available object attributes. The c d and r d^2 terms are due to the parameters of the predictive model and of the link-wise neural network modules, respectively.
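As a sanity check on this accounting, the following sketch counts parameters under our stated assumptions (single-layer d x d modules with bias per link type, and a single-layer d -> c predictor); the function name and the example sizes are illustrative, not taken from the paper.

```python
def nep_param_count(n_t, d, c, r):
    """Illustrative parameter count: an n_t x d embedding lookup table,
    one d x d linear module (with bias) per link type, and a single
    d -> c prediction layer (with bias)."""
    embedding = n_t * d           # lookup table, grows with data size
    modules = r * (d * d + d)     # link-wise modules, size-independent
    predictor = c * d + c         # prediction model, size-independent
    return embedding + modules + predictor

total = nep_param_count(n_t=1000, d=64, c=4, r=6)
```

Only the embedding term grows with the data; the module and predictor terms depend only on c, r, and d.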
In terms of runtime, training NEP theoretically takes O(n l) time, where n is the number of sampled paths and l is the path length. This is the same as the state-of-the-art unsupervised heterogeneous network embedding algorithms [40, 20], and can be largely improved in practice based on the series of strategies we develop here.
III-F Connections and Extensions
We show that NEP is a principled and powerful SSL framework by studying its connections to various existing graph-based SSL algorithms and promising extensions towards further improvements on modeling heterogeneous networks.
To better understand the mechanism of NEP, we analyze it in the well-studied context of graph signal processing [48, 49]. Specifically, we decompose NEP into three major components: embedding, propagation, and prediction, which can be mathematically formulated as E = f(.), E' = g(E), and Y = h(E'), where E is the graph embedding with f as the embedding function, E' is the propagated embedding with g as the propagation function, and Y is the label prediction with h as the prediction function.
In NEP, we implement the embedding function f as an embedding lookup table to directly capture the training labels, while it is straightforward to instead implement f as an MLP to also incorporate object attributes, which we leave as future work. Such embedding allows us to further explore the complex interactions of labels and attributes among different types of objects. Our major technical contribution is then the propagation function g, which leverages modular networks to properly propagate object embeddings along different paths. Finally, due to the appropriate embedding and propagation functions, we are able to jointly learn a powerful parametric prediction function h as an MLP based on very few labeled samples.
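The three components can be sketched end to end as follows. This is a toy numpy illustration of the decomposition, with random (untrained) weights, hypothetical link-type names, and single-layer sigmoid modules; it shows the data flow, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 5, 8, 3                        # objects, embedding dim, classes

E = rng.normal(size=(n, d))              # f: embedding lookup table
W = {'writes': rng.normal(size=(d, d)),  # g: one module per link type
     'cites': rng.normal(size=(d, d))}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def propagate(obj, link_types):
    """g: push an object's embedding through the modules composed for
    the link types along a sampled path."""
    e = E[obj]
    for t in link_types:
        e = sigmoid(W[t] @ e)
    return e

W_pred = rng.normal(size=(c, d))         # h: prediction layer

def predict(e):
    z = W_pred @ e
    return np.exp(z - z.max()) / np.exp(z - z.max()).sum()  # softmax

probs = predict(propagate(0, ['writes', 'cites']))
```

Swapping the path `['writes', 'cites']` recomposes g without touching f or h, which is the modularity the framework relies on.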
Table I. Existing graph-based SSL algorithms as special cases of NEP: LP [3], MR [50], Planetoid [11], and GCN [12].
In fact, we find that various existing graph-based SSL algorithms fit well into this three-component paradigm, and they naturally boil down to special cases of NEP. As summarized in Table I, the classic LP algorithm [3] directly propagates object labels on the graph based on a deterministic autoregressive function g(E) = (I + αL)^{-1} E, where I is the identity matrix and L is the graph Laplacian matrix [51]. Prediction in LP is done by picking out the propagated labels with the largest values. To jointly leverage object attributes and the underlying object links, MR [50] trains a parametric prediction model based on SVM with a graph Laplacian regularizer; to some extent, it can be viewed as propagating the object attributes on the graph. The recently proposed graph neural network models like Planetoid [11] and GCN [12] leverage object embedding to integrate object attributes, links, and labels. Although the two works adopt quite distinct models, they essentially differ only in the implementations of the propagation function g: Planetoid leverages random path sampling on networks to approximate the effect of the graph Laplacian [17], whereas GCN sums up the embeddings of neighboring objects in each of its convolutional layers through a different smoothing function g(E) = (I - L_s) E, where L_s is the normalized graph Laplacian with self-loops.

In concept, due to the expressiveness of neural networks, NEP can learn arbitrary propagation functions given proper training data, thus generalizing all of the algorithms discussed above. Moreover, since the propagation functions of most SSL algorithms on homogeneous networks are constructed to act as deterministic low-pass filters that encourage smoothness among neighboring objects, it is interesting to extend NEP by designing proper constraints on our modular networks to construct learnable low-pass filters on heterogeneous networks.
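For concreteness, the LP special case can be reproduced in a few lines. The closed form f = (I + αL)^{-1} y below is the standard Laplacian-regularized propagation (our reconstruction, with α an assumed smoothing weight), applied to a toy chain graph with one positive and one negative labeled node.

```python
import numpy as np

# Chain graph 0-1-2-3; node 0 labeled +1, node 3 labeled -1.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A           # unnormalized graph Laplacian
y = np.array([1.0, 0.0, 0.0, -1.0])      # known labels, zeros elsewhere

alpha = 1.0                              # smoothing weight (assumed)
f = np.linalg.solve(np.eye(4) + alpha * L, y)  # f = (I + aL)^{-1} y
# f decays smoothly from the positive to the negative labeled node.
```

Prediction then amounts to taking the sign (or, with multiple classes, the argmax) of the propagated values, matching the "picking out the propagated labels with the largest values" step above.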
IV Experiments
In this section, we comprehensively evaluate the performance of NEP for SSL on three massive real-world heterogeneous networks. The implementation of NEP is publicly available on GitHub (https://github.com/JieyuZ2/NEP).
IV-A Experimental Settings
Datasets.
We describe the datasets we use for our experiments as follows with their statistics summarized in Table II.


DBLP
We use the public Arnetminer dataset V8 collected by [52]. It contains four types of objects, i.e., authors (A), papers (P), venues (V), and years (Y). The link types include authors writing papers, papers citing papers, papers published in venues, and papers published in years.

YAGO
We use the public knowledge graph derived from Wikipedia, WordNet, and GeoNames [53]. There are seven types of objects in the network, such as person (P), location (L), and organization (O), as well as twenty-four types of links.
GitHub
We use an anonymous social network dataset derived from the GitHub community by DARPA. It contains three types of objects: repository (R), user (U) and organization (O), and six types of links, as depicted in Figure 2.
Dataset  #object  #edge  #class  %labeled 

DBLP  4,925,160  44,931,742  4  
YAGO  545,792  3,517,663  15  
GitHub  2,078,030  61,332,330  16  
subDBLP  333,160  2,620,736  4  
subYAGO  15,672  357,312  15 
subGitHub  32,792  347,768  16 
In order to compare with some state-of-the-art graph SSL algorithms that cannot scale to large networks with millions of objects, we create a smaller subgraph of each dataset by keeping only the labeled objects (both training and testing labels) and their direct neighbors. Their statistics are also summarized in Table II.
Compared Algorithms.
We compare NEP with the following graph-based SSL algorithms and network embedding algorithms:
LP [3]: Classic graph-based SSL algorithm that propagates labels on homogeneous networks. To run LP on heterogeneous networks, we suppress the type information of all objects.
GHE [13]: The state-of-the-art SSL algorithm on heterogeneous networks through path augmented and task guided embedding.
SemiHIN [14]: Another recent SSL algorithm with promising results on heterogeneous networks by ensemble of metagraph guided random walks.
ZooBP [42]: Another recent SSL algorithm on heterogeneous networks by performing fast belief propagation.
Metapath2vec [20]: The state-of-the-art heterogeneous network embedding algorithm through heterogeneous random walks and negative sampling.
ESim [40]: Another recent heterogeneous network embedding algorithm with promising results through metapath-guided path sampling and noise-contrastive estimation.
Hin2vec [41]: Another recent heterogeneous network embedding algorithm that exploits different types of links among nodes.
Evaluation Protocols.
We study the efficacy of all algorithms on the standard task of semi-supervised node classification. The labels are semantic classes of objects not directly captured by the networks. For DBLP, we use the manual labels of authors from four research areas, i.e., database, data mining, machine learning, and information retrieval, provided by [18]. For YAGO, we extract the top 15 locations connected by the edge "wasBornIn" as labels for the person objects, and remove all "wasBornIn" links from the network. For GitHub, we manually select 16 high-quality tags of repositories as labels, such as security, machine learning, and database.
We randomly select 20% of the labeled objects as testing data and evaluate all algorithms on them. For the SSL algorithms (i.e., LP, GHE, SemiHIN, ZooBP, and NEP), we provide the remaining 80% of labeled objects as training data. For the unsupervised embedding algorithms (i.e., Metapath2vec, ESim, and Hin2vec), we compute the embeddings without training data. For all algorithms that output a network embedding (i.e., Metapath2vec, ESim, Hin2vec, and NEP), we train a subsequent MLP with the same architecture on the learned embeddings, using the same 80% of labeled training data, to predict the object classes. Besides the standard classification accuracy, we also record the runtimes of all algorithms, measured on a server with four GeForce GTX 1080 GPUs and a 12-core 2.2GHz CPU. For the sake of fairness, we run all algorithms with a single thread.
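The split-and-score protocol above can be sketched as follows; this is a hedged utility illustrating the 80/20 split and accuracy metric, not the exact evaluation code, and the function names are our own.

```python
import numpy as np

def split_labeled(num_labeled, test_frac=0.2, seed=0):
    """Randomly hold out test_frac of the labeled objects for testing,
    keeping the rest (80% by default) as training data."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(num_labeled)
    n_test = int(test_frac * num_labeled)
    return ids[n_test:], ids[:n_test]    # train ids, test ids

def accuracy(pred, gold):
    # Standard classification accuracy on the held-out objects.
    return float(np.mean(np.asarray(pred) == np.asarray(gold)))

train_ids, test_ids = split_labeled(10)
```

Repeating this with different seeds gives the averaged scores reported below.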
Parameter Settings.
For all datasets, we set the batch size to 1,000 and the learning rate to 0.001. For different datasets, the number of sampled patterns and the maximum path length are set differently, as summarized in Table III. For all embedding algorithms, we set the embedding dimension to 128 for full graphs and 64 for subgraphs. Other parameters of the baseline algorithms are set to the default values suggested in the original works. For algorithms that require given sets of metapaths (i.e., GHE, SemiHin, Metapath2vec, and ESim), since the schemas of our experimented heterogeneous networks are relatively simple, we compose and provide the commonly used metapaths (DBLP: APA, APPA, APAPA, APVPA, APYPA; YAGO: PP, PWP, PRP, PSP, POP, PDP, PDDP, PDEDP; GitHub: RR, RUR, RUUR, RURUR, RUOUR), with uniform weights. For NEP, we use single-layer MLPs (one fully-connected layer plus one Sigmoid activation layer) of the same size for all modules. We have also done a comprehensive study of the impacts of major hyperparameters.
Dataset  DBLP  YAGO  GitHub  subDBLP  subYAGO  subGitHub 

#sampled patterns  9000  7000  6000  6000  2000  2000  
max path length  5  6  6  7  5  4 
Research Problems.
Our experiments are designed to answer the following research questions:


Q1. Effectiveness
Given limited labeled data, how much does NEP improve over the stateoftheart graphbased SSL and network embedding algorithms?

Q2. Efficiency
How efficient is NEP regarding the leverage of labeled data and computational resources?

Q3. Robustness
How robust is NEP regarding different settings of model hyperparameters?
IV-B Q1. Effectiveness
We quantitatively evaluate NEP against all baselines on the standard node classification task. Table IV shows the performance of all algorithms on the six datasets. All algorithms are trained and tested over 10 runs on differently split labeled data to compute the average classification accuracy. The performance gains of NEP over the baselines all passed significance tests with p-value 0.005. The performance of the baselines varies across datasets, while NEP consistently outperforms all of them by significant margins, demonstrating its general and substantial advantages.
Algorithm  subDBLP  subYAGO  subGitHub 
LP  0.783±0.000  0.597±0.008  0.374±0.003 
GHE  0.778±0.014  0.501±0.002  0.353±0.011 
SemiHIN  0.787±0.000  0.630±0.000  0.306±0.000 
ZooBP  0.680±0.000  0.382±0.000  0.312±0.000 
Metapath2vec  0.851±0.003  0.604±0.003  0.384±0.003 
ESim  0.824±0.005  0.563±0.004  0.342±0.006 
Hin2vec  0.856±0.005  0.628±0.005  0.341±0.003 
NEP-linear  0.885±0.003  0.648±0.003  0.400±0.180 
NEP  0.888±0.005  0.651±0.002  0.425±0.007 
Algorithm  DBLP  YAGO  GitHub 
LP  0.811±0.033  0.612±0.004  0.340±0.006 
GHE  0.759±0.048  0.447±0.020  0.351±0.019 
SemiHIN  0.724±0.000  0.457±0.000  0.348±0.000 
ZooBP  0.610±0.000  0.561±0.000  0.302±0.000 
Metapath2vec  0.790±0.005  0.590±0.005  0.320±0.007 
ESim  0.647±0.010  0.607±0.004  0.305±0.003 
Hin2vec  0.836±0.001  0.609±0.004  0.338±0.006 
NEP-linear  0.865±0.003  0.629±0.002  0.384±0.002 
NEP  0.880±0.006  0.634±0.005  0.392±0.011 
Taking a closer look at the scores, we observe that NEP is much better than the baselines on the subgraphs of YAGO and GitHub, where the graphs have relatively complex links but small sizes. For example, in GitHub, a repository and a user can have "watched by" and "created by" links, while in YAGO, a person and a location can have "lives in", "died in", "is citizen of", and other types of links. On these graphs, NEP easily benefits from its capability of distinguishing and leveraging different types of direct interactions through the individual neural network modules. When evaluated on the full graphs, however, the performance gain of NEP is larger on DBLP. This is because direct interactions are simpler in DBLP, where only a single type of link exists between any pair of objects, so higher-order interactions matter more. For example, a path of "APVPA" exactly captures pairs of authors within the same research communities. The full graphs, compared with the subgraphs, provide many more instances of such longer paths, and NEP effectively captures these particular higher-order interactions by learning the dynamically composed modular networks.
The advantage of NEP mainly roots in two perspectives: the capability of modeling the compositional nature of metapaths and the flexibility of nonlinear propagation functions. To verify the latter, we also implement a linear version of NEP by simply removing all nonlinear activation functions. As we can clearly see, the performance consistently drops slightly after removing the nonlinearity.
IV-C Q2. Efficiency
In this subsection, we study the efficiency of NEP regarding the leverage of both labeled data and computational resources.
One of the major motivations of NEP is to leverage limited labeled data, so as to alleviate the deficiency of deep learning models when training data are hard to get. Therefore, we are interested in how NEP performs when different amounts of training data are available. To this end, we vary the fraction of all labeled data used for training, while the remaining labeled data are held out as testing data. We repeat the same process with 10 random splits and report the average scores.
As we can observe in Figure 3, NEP quickly captures the simple semantics in DBLP and reaches stable performance given only a small fraction of the labeled data. Although YAGO and GitHub appear to be more complex and require relatively more training data, NEP maintains the best performance compared with all baselines. Such results clearly demonstrate the efficiency of NEP in leveraging limited labeled data.
Another major advantage of neural network models is that they can usually be efficiently trained on powerful computational resources like GPUs with well-developed optimization methods like batch-wise gradient backpropagation. Particularly for NEP, as we have discussed in Section III.E, since our modular networks are dynamically composed according to randomly sampled paths, we have developed a novel training strategy based on two-step path sampling to fully leverage the computational resources and standard optimization methods. Here we closely study its effectiveness.
Figure 4 shows how the strategy of training with two-step path sampling influences the performance and runtime of NEP under different settings on the three datasets. To present a comprehensive study, we set the total number of sampled paths to 100K, 500K, and 1M, respectively, and then simultaneously vary the number of patterns and the number of paths per pattern. As we can observe from the results, when only one path is sampled per pattern, which amounts to not using two-step sampling at all, the runtimes are quite high; as we increase the number of paths per pattern, the runtimes rapidly drop, while the performance is not much influenced. Sometimes the performance actually increases, probably due to better convergence of the loss with batch training. Setting the number of paths per pattern to overly large values does hurt the performance, but in practice we can safely avoid this by choosing a moderate value, which leads to a satisfactory trade-off between effectiveness and efficiency across different datasets.
IV-D Q3. Robustness
We comprehensively study the robustness of NEP regarding different hyperparameter settings.
We firstly look at the maximum path length l. As shown in Table V, the performance of NEP does not differ significantly as we vary l from 2 to 7. This is because shorter paths are usually more useful, and when l is large, Algorithm 1 often automatically stops the sampling process at Line 9 upon reaching targeted types of objects, before the actual path length reaches l. Therefore, in practice, the rule of thumb is to simply set l to larger values like 5 or 6.
Then we look at the embedding size, by comparing NEP with GHE, Metapath2vec, ESim, and Hin2vec, which also compute object embeddings. As shown in Figure 5, too small embedding sizes often lead to poor performance. As the embedding size grows, NEP quickly reaches its peak performance, and it maintains the best performance without overfitting the data as the embedding size grows further.
Finally, we look at the total number of sampled paths, by comparing NEP with Metapath2vec, ESim, and Hin2vec, which are also trained with path sampling. As shown in Figure 6, the improvement of NEP over the compared baselines is more significant given fewer sampled paths, indicating the power of NEP to rapidly capture useful information in the networks.
Dataset / path length  2  3  4  5  6  7 

DBLP  0.750  0.872  0.869  0.880  0.873  0.875 
YAGO  0.631  0.631  0.633  0.629  0.634  0.631 
GitHub  0.381  0.374  0.378  0.386  0.392  0.379 
subDBLP  0.763  0.875  0.880  0.886  0.881  0.888 
subYAGO  0.646  0.641  0.641  0.651  0.637  0.648 
subGitHub  0.425  0.412  0.414  0.412  0.405  0.407 
V Conclusions
In this work, we develop NEP (Neural Embedding Propagation), a powerful yet efficient neural framework for semi-supervised learning over heterogeneous networks, which coherently combines an object encoder and a modular network to model the complex interactions among multi-typed, multi-relational objects. Unlike existing heterogeneous network models, NEP does not assume a given set of useful metapaths, but rather dynamically composes modular networks and estimates the different importance and functions of arbitrary metapaths regarding embedding propagation on the fly. At the same time, the model is easy to learn, since the parameters modeling each type of link are shared across all underlying metapaths. For future work, it is straightforward to extend NEP to incorporate various object attributes, and to enable fully unsupervised training by recovering different types of links.
Acknowledgements
Research was sponsored in part by U.S. Army Research Lab. under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), DARPA under Agreement No. W911NF-17-C-0099, National Science Foundation IIS 16-18481, IIS 17-04532, and IIS 17-41317, DTRA HDTRA1-18-1-0026, and grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov).
References
 [1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
 [2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
 [3] X. Zhu and Z. Ghahramani, “Learning from labeled and unlabeled data with label propagation,” Tech. Rep. CMU-CALD-02-107, Carnegie Mellon University, 2002.
 [4] X. Zhu, Z. Ghahramani, J. Lafferty et al., “Semi-supervised learning using Gaussian fields and harmonic functions,” in ICML, vol. 3, 2003.
 [5] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., “Matching networks for one shot learning,” in NIPS, 2016, pp. 3630–3638.
 [6] C. Finn, P. Abbeel, and S. Levine, “Modelagnostic metalearning for fast adaptation of deep networks,” in ICLR, 2017.
 [7] S. Ravi and H. Larochelle, “Optimization as a model for fewshot learning,” in ICLR, 2017.
 [8] S. Baluja, R. Seth, D. Sivakumar, Y. Jing, J. Yagnik, S. Kumar, D. Ravichandran, and M. Aly, “Video suggestion and discovery for youtube: taking random walks through the view graph,” in WWW, 2008.
 [9] J. Weston, F. Ratle, H. Mobahi, and R. Collobert, “Deep learning via semisupervised embedding,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 639–655.
 [10] B. Lin, J. Yang, X. He, and J. Ye, “Geodesic distance function learning via heat flow on vector fields,” in ICML, 2014, pp. 145–153.
 [11] Z. Yang, W. Cohen, and R. Salakhutdinov, “Revisiting semi-supervised learning with graph embeddings,” in ICML, 2016.
 [12] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in ICLR, 2017.
 [13] T. Chen and Y. Sun, “Taskguided and pathaugmented heterogeneous network embedding for author identification,” in WSDM, 2017.
 [14] H. Jiang, Y. Song, C. Wang, M. Zhang, and Y. Sun, “Semi-supervised learning over heterogeneous information networks by ensemble of meta-graph guided random walks,” in AAAI, 2017, pp. 1944–1950.
 [15] C. Yang, Y. Feng, P. Li, Y. Shi, and J. Han, “Metagraph based hin spectral embedding: methods, analyses, and insights,” in ICDM, 2018.
 [16] C. Yang, M. Liu, F. He, X. Zhang, J. Peng, and J. Han, “Similarity modeling on heterogeneous networks via automatic path discovery,” in ECML-PKDD, 2018.
 [17] C. Yang, L. Bai, C. Zhang, Q. Yuan, and J. Han, “Bridging collaborative filtering and semi-supervised learning: a neural approach for POI recommendation,” in KDD, 2017, pp. 1245–1254.
 [18] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, “PathSim: Meta path-based top-k similarity search in heterogeneous information networks,” VLDB, vol. 4, no. 11, pp. 992–1003, 2011.
 [19] Y. Shi, P.-W. Chan, H. Zhuang, H. Gui, and J. Han, “PReP: Path-based relevance from a probabilistic perspective in heterogeneous information networks,” in KDD, 2017, pp. 425–434.
 [20] Y. Dong, N. V. Chawla, and A. Swami, “metapath2vec: Scalable representation learning for heterogeneous networks,” in KDD, 2017.
 [21] M. Ji, Y. Sun, M. Danilevsky, J. Han, and J. Gao, “Graph regularized transductive classification on heterogeneous information networks,” in ECML-PKDD, 2010, pp. 570–586.
 [22] C. Luo, R. Guan, Z. Wang, and C. Lin, “Hetpathmine: A novel transductive classification algorithm on heterogeneous information networks,” in ECIR, 2014, pp. 210–221.
 [23] C. Wang, Y. Song, H. Li, M. Zhang, and J. Han, “Knowsim: A document similarity measure on structured heterogeneous information networks,” in ICDM, 2015, pp. 1015–1020.
 [24] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Neural module networks,” in CVPR, 2016, pp. 39–48.
 [25] ——, “Learning to compose neural networks for question answering,” in NAACL, 2016, pp. 1545–1554.
 [26] G. Jeh and J. Widom, “Scaling personalized web search,” in WWW, 2003, pp. 271–279.
 [27] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in KDD, 2014, pp. 701–710.
 [28] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “Line: Large-scale information network embedding,” in WWW, 2015, pp. 1067–1077.
 [29] A. R. Benson, D. F. Gleich, and J. Leskovec, “Higherorder organization of complex networks,” Science, vol. 353, no. 6295, pp. 163–166, 2016.
 [30] N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and K. M. Borgwardt, “Weisfeiler-Lehman graph kernels,” JMLR, 2011.
 [31] Y. Sun and J. Han, “Mining heterogeneous information networks: principles and methodologies,” SLKDD, vol. 3, no. 2, pp. 1–159, 2012.
 [32] Z. Liu, V. W. Zheng, Z. Zhao, F. Zhu, K. Chang, M. Wu, and J. Ying, “Semantic proximity search on heterogeneous graph by proximity embedding.” in AAAI, 2017, pp. 154–160.
 [33] S. Hou, Y. Ye, Y. Song, and M. Abdulhayoglu, “Hindroid: An intelligent android malware detection system based on structured heterogeneous information network,” in KDD, 2017, pp. 1507–1515.
 [34] Y. Sun, B. Norick, J. Han, X. Yan, P. S. Yu, and X. Yu, “Integrating metapath selection with userguided object clustering in heterogeneous information networks,” TKDD, vol. 7, no. 3, p. 11, 2013.
 [35] H. Zhao, Q. Yao, J. Li, Y. Song, and D. L. Lee, “Metagraph based recommendation fusion over heterogeneous information networks,” in KDD, 2017, pp. 635–644.
 [36] H. Zhuang, J. Zhang, G. Brova, J. Tang, H. Cam, X. Yan, and J. Han, “Mining querybased subnetwork outliers in heterogeneous information networks,” in ICDM, 2014, pp. 1127–1132.
 [37] M. Wan, Y. Ouyang, L. Kaplan, and J. Han, “Graph regularized metapath based transductive regression in heterogeneous information network,” in SDM, 2015, pp. 918–926.
 [38] X. Li, B. Kao, Y. Zheng, and Z. Huang, “On transductive classification in heterogeneous information networks,” in CIKM, 2016, pp. 811–820.
 [39] Y. Fang, W. Lin, V. W. Zheng, M. Wu, K. Chang, and X.-L. Li, “Semantic proximity search on graphs with metagraph-based learning,” in ICDE, 2016, pp. 277–288.
 [40] J. Shang, M. Qu, J. Liu, L. M. Kaplan, J. Han, and J. Peng, “Meta-path guided embedding for similarity search in large-scale heterogeneous information networks,” arXiv preprint arXiv:1610.09769, 2016.
 [41] T.-y. Fu, W.-C. Lee, and Z. Lei, “Hin2vec: Explore meta-paths in heterogeneous information networks for representation learning,” in CIKM, 2017, pp. 1797–1806.
 [42] D. Eswaran, S. Günnemann, C. Faloutsos, D. Makhija, and M. Kumar, “Zoobp: Belief propagation for heterogeneous networks,” VLDB, vol. 10, no. 5, pp. 625–636, 2017.
 [43] C. Yang, L. Zhong, L.-J. Li, and L. Jie, “Bi-directional joint inference for user links and attributes on large social graphs,” in WWW, 2017, pp. 564–573.
 [44] F. Serafino, G. Pio, and M. Ceci, “Ensemble learning for multi-type classification in heterogeneous networks,” TKDE, vol. 30, no. 12, pp. 2326–2339, 2018.
 [45] C. Yang, C. Zhang, X. Chen, J. Ye, and J. Han, “Did you enjoy the ride: Understanding passenger experience via heterogeneous network embedding,” in ICDE, 2018.
 [46] Y. Shi, X. He, N. Zhang, C. Yang, and J. Han, “Userguided clustering in heterogeneous information networks via motifbased comprehensive transcription.”
 [47] D. J. Watts and S. H. Strogatz, “Collective dynamics of small-world networks,” Nature, vol. 393, no. 6684, p. 440, 1998.

 [48] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” ISPM, vol. 30, no. 3, pp. 83–98, 2013.
 [49] B. Girault, P. Gonçalves, E. Fleury, and A. S. Mor, “Semi-supervised learning for graph to signal mapping: A graph signal wiener filter interpretation,” in ICASSP, 2014, pp. 1115–1119.
 [50] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” JMLR, vol. 7, no. Nov, pp. 2399–2434, 2006.
 [51] D. A. Spielman, “Spectral graph theory and its applications,” in FOCS, 2007, pp. 29–38.
 [52] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, “Arnetminer: extraction and mining of academic social networks,” in KDD, 2008.
 [53] F. M. Suchanek, G. Kasneci, and G. Weikum, “Yago: a core of semantic knowledge,” in WWW, 2007, pp. 697–706.