1. Introduction
Graph data is prevalent and models entities (nodes) and interactions between them (edges). The edges may correspond to explicitly provided interactions between entities (likes, purchases, messages, hyperlinks) or may be derived from metric data, for example, by connecting each point to its nearest neighbors.
Node embeddings, which are representations of graph nodes in the form of low-dimensional vectors, are an important component in graph analysis pipelines. They are used as task-agnostic representations with downstream tasks that include node classification, node clustering for community detection, and link prediction for recommendations (deepwalk:KDD2014; node2vec:kdd2016; DBLP:journals/tkde/CaiZC18). Embeddings are computed with the qualitative objective of preserving structure, so that nodes that are more connected get assigned closer embedding vectors (imagenet:CVPR2009; Koren:kdd2008; Mikolov:NIPS13; deepwalk:KDD2014; BERRY95; PinSage:KDD2018). The optimization objective has the general form of a weighted sum, over examples (pairs of nodes), of a per-example loss function, and the optimization is commonly performed using stochastic gradient descent (SGD) (SGDbook:1971; Koren:kdd2008; Salakhutdinov:ICML2007; Gemulla:KDD2011; Mikolov:NIPS13).
1.1. Node embeddings via random walks
A first attempt to obtain positive training examples (node pairs) from the input graph is to use the provided set of edges (Koren:IEEE2009). A highly effective approach, pioneered by DeepWalk (deepwalk:KDD2014), is to instead select examples based on co-occurrence of pairs in short random walks performed on the input graph. These methods weight and greatly expand the set of positive examples. DeepWalk treats random walks on the graph as "sentences" of nodes and applies the popular word embedding framework Word2Vec (Mikolov:NIPS13). Node2Vec (node2vec:kdd2016) further refined the method by extending the family of random walks with hyperparameters that tune the depth and breadth of the walk. Prolific follow-up work (see the summaries in (DBLP:journals/tkde/CaiZC18; Wu_2020)) further extended the family of random walks but retained the general structure of producing "node sentences."
1.2. Loss-guided training
Random-walk based methods were studied in settings where the distribution of random walks, and thus the distribution of training examples, remains static in the course of training. A prolific research thread proposed methods that accelerate training or improve accuracy by dynamically modifying the distribution of examples in the course of training (curriculumlearning:ICML2009; AlainBengioSGD:2015; ZhaoZhang:ICML2015; Shrivastava:CVPR2016; facenet:cvpr2015; LoshchilovH:ICLR2016; shalev:ICML2016). These approaches include curriculum/self-paced learning (curriculumlearning:ICML2009), where the selection is altered to mimic human learning: the algorithm first learns over "easy" examples and then moves on to "hard" examples, where margin is used as a measure of difficulty. A related approach guides the example selection process by the current magnitude of the gradient or the loss value. One proposed method applies importance sampling according to loss or gradient (AlainBengioSGD:2015; ZhaoZhang:ICML2015), which preserves the expected value of the stochastic gradient updates but spreads them differently. Other methods focus on higher-loss examples in a biased fashion that essentially alters the objective: mining hard examples for image training (Shrivastava:CVPR2016; facenet:cvpr2015), selecting examples by a moving average of the loss (LoshchilovH:ICLR2016), or focusing entirely on the highest-loss examples (shalev:ICML2016), the latter with a compelling theoretical underpinning. Overall, these methods were studied for supervised learning and, as far as we know, were not explored for computing node embeddings.
1.3. Our contribution
We propose and study methods that incorporate dynamic training, in particular example selection that is focused on higher-loss examples, in the context of popular random-walk based example selection methods for node embedding. The hope is to obtain gains in performance similar to those observed in other domains.
The application of loss-guided training to random-walk based methods poses some methodological and computational challenges. First, the methods used in other domains are not directly applicable. They were considered in supervised settings where the input data has the form of (example, label) pairs, which are available explicitly and make the loss computation straightforward. In our setting, examples are produced during training using random walks: the potential number of examples can be quadratic in the number of nodes even when the input graph is sparse, and the example set is implicit in the graph representation. Thus, per-example state or loss evaluation over all potential examples cannot be efficiently maintained, which rules out approaches such as (shalev:ICML2016; LoshchilovH:ICLR2016).
Second, dynamic example selection, and in particular loss-guided example selection, tends to be computation heavy and trades off the efficiency of training (performing the gradient updates) against the efficiency of preprocessing (the computation needed to generate the training sequence) (AlainBengioSGD:2015; ZhaoZhang:ICML2015; LoshchilovH:ICLR2016). Even with the baseline random walk methods, the computational resources needed increase with the graph size, the length and type of the random walk, the number of examples generated from each walk, and the dimension of the embedding vectors. In practice, the cost of embedding computation tends to be a significant part of the overall downstream pipeline. We aim to enhance random-walk based methods without compromising their scalability.
The training and preprocessing costs typically draw on different resources (e.g., gradient updates may need to be communicated). We aim for efficiency and design loss-guided training methods that provide tunable trade-offs. Our most effective approaches work with the same random walk processes as the respective baseline methods but assign loss scores to walks (each generating a set of examples) instead of to individual examples. At each selection phase we generate a set of random walks according to the baseline model, assign loss scores to these walks (via methods detailed later on), and choose a sample of the walks for training that is weighted by their loss scores. We empirically show that across a variety of datasets, our loss-guided methods provide a dramatic reduction in training cost with a very small increase in preprocessing cost compared with the baseline methods that use a static distribution of training examples.
1.4. Related work
Graph Neural Networks (GNNs) are an emerging approach for graph learning tasks (see the survey (GNNsurvey:2020)). Notably, Graph Convolutional Networks (AtwoodT:NIPS2016; DefferrardBV:NIPS2016; KipfW:ICLR2017) work with node features and create representations in terms of node features (PinSage:KDD2018). Variational autoencoders (kipf2016variational) produce node embeddings in an unsupervised fashion but perform similarly to prior methods. Random-walk based methods remain a viable alternative that obtains state-of-the-art results for node representations computed from graph structure alone.
1.5. Overview
The paper is organized as follows. In Section 2 we provide necessary background on the baseline node embedding methods DeepWalk (deepwalk:KDD2014) and Node2Vec (node2vec:kdd2016) and the Word2Vec SGNS framework (Mikolov:NIPS13) that they build on. In Section 3 we present our methods that dynamically modify the distribution of training examples according to loss. We provide details on our experimental setup in Section 4. We illustrate the benefits of loss-guided training using a synthetic example network in Appendix A. The real-life datasets and tasks used in our experiments are described in Section 5, and results are reported and discussed in Section 6 and Appendices B through D.
2. Preliminaries
We consider graph datasets of the form G = (V, E, w), with a set V of nodes that represent entities, a set E of edges that represent pairwise interactions, and an assignment of positive scalar weights w_e to edges that correspond to the strength of interactions. Entities may be of different types (for example, users and videos, with edges corresponding to views) or of the same type (words in a text corpus, with edges corresponding to co-occurrences, or users in a social network, with edges corresponding to interactions). A node embedding is a mapping of nodes to d-dimensional vectors, where typically d << |V|.
2.1. Overview of baseline methods
The node embedding methods we consider here are based on the popular DeepWalk (deepwalk:KDD2014) and its refinement Node2Vec (node2vec:kdd2016). Algorithm LABEL:alg:baseline provides a high-level view of the baseline methods. These methods build on the Word2Vec (Mikolov:NIPS13) Skip-Gram with Negative Sampling (SGNS) method. SGNS was originally designed for learning embeddings of words in a text corpus. The method generates short sequences (referred to as sentences) of consecutive words from the text corpus and uses these sentences for training (more details are provided below). The node embedding methods instead generate sequences of nodes using short random walks on the graph and apply the SGNS framework to these node "sentences" in a black-box fashion. The node embedding methods differ in the distribution over node sentences. Both our baselines specify distributions W_{u,t} of random walks of length t that start from a node u. DeepWalk conducts a simple random walk, where the next node is selected independently of history according to the weights of outgoing edges; that is, if the walk is at node v, then the probability of continuing to node u is w_{vu} / sum_z w_{vz}. Node2Vec uses two hyperparameters to control the "breadth" and "depth" of the walk, in particular, to what extent it remains in the neighborhood of the origin node. The method initializes the embedding vectors randomly and updates them according to sentences. Sentences for training are generated by selecting a start node uniformly. With both baseline methods, the distribution over sentences is static, that is, it remains the same in the course of training. To streamline the presentation, we will use the baselines as black boxes that, given an input graph, a node u, and a length t, provide samples from the baseline-specific distribution W_{u,t}.
2.2. Overview of SGNS
For completeness, we provide more details on SGNS (Mikolov:NIPS13). SGNS trains two vectors for each entity v: a focus vector f_v and a context vector c_v. SGNS takes as hyperparameters a "skip window" W and a ratio k of negative to positive examples, and takes a sentence (v_1, ..., v_t) as input. A sentence is processed by generating a randomized set P of pairs that are then used as positive training examples:
(i) A skip length w_i is selected for each position i, independently and uniformly at random from {1, ..., W}.
(ii) P then includes all ordered pairs (v_i, v_j) with j != i where v_j is within skip length w_i from v_i.
For each positive example, k random negative examples are drawn with the same focus and a randomly selected context, chosen according to entity frequencies in positive examples raised to a fixed power. Intuitively, negative examples (HuKorenV:2008) provide an "anti-gravity" effect that prevents all embeddings from collapsing into the same vector. We denote by p^+_{uv} the probability that the positive example pair (u,v) is generated and by p^-_{uv} the probability that the negative example pair (u,v) is generated. The optimization objective when using this distribution over examples has the general form:
(1)   - sum_{(u,v)} [ p^+_{uv} log sigma(f_u . c_v)  +  p^-_{uv} log sigma(-f_u . c_v) ]
At a high level, a gradient update on a positive example (u,v) increases the inner product f_u . c_v, and an update on a negative example decreases that inner product. The SGNS objective is designed to maximize the log-likelihood over all examples. This is the case when the probability of a positive example is modeled by a sigmoid sigma(f_u . c_v) of the inner product and that of a negative example by a sigmoid sigma(-f_u . c_v) of the negated inner product. The logarithm of the likelihood function then has the form (1).
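To make the example-generation process concrete, the following sketch implements the randomized skip-window pair generation and the per-example SGNS loss described above. It is a minimal illustration, not the Gensim implementation; the function names and the plain-list vector representation are our own.

```python
import math
import random

def positive_pairs(sentence, window, rng):
    """Generate SGNS positive (focus, context) pairs from one sentence.

    For each position i, a skip length w_i is drawn uniformly from
    {1, ..., window}; all pairs within that distance become positive examples.
    """
    pairs = []
    for i, focus in enumerate(sentence):
        w = rng.randint(1, window)
        for j in range(max(0, i - w), min(len(sentence), i + w + 1)):
            if j != i:
                pairs.append((focus, sentence[j]))
    return pairs

def sgns_loss(f, c, positive):
    """Per-example SGNS loss: -log sigma(f.c) for a positive pair,
    -log sigma(-f.c) for a negative one."""
    dot = sum(a * b for a, b in zip(f, c))
    s = 1.0 / (1.0 + math.exp(-dot))
    return -math.log(s) if positive else -math.log(1.0 - s)
```

With window size 1 every pair is adjacent, and at identical focus/context vectors a positive example has lower loss than a negative one, matching the push/pull intuition above.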
To streamline the presentation, we will treat SGNS as a closed module for computing embedding vectors. The module takes as input the embedding dimension d, the length parameter t, and the window size W. It has a procedure to initialize the embedding vectors. It then enters a training phase that takes input sentences and updates the embedding vectors.
3. Loss-guided training methods
We first discuss the challenges and design goals for loss-guided training in the SGNS-based node embedding domain. Methods in prior work were designed for supervised learning, where examples are labeled. In our setting, the SGNS loss (Equation 1) has both positive examples (generated from pairs co-occurring in random walks) and negative examples (selected randomly according to the distribution of positive examples). The negative example distribution is therefore determined by the positive example distribution. Hence, in our setting the only knob we modify is the distribution of positive examples.
Most methods in prior work compute (or track) approximate loss values for all examples. In our setting, the set of potential positive examples is very large, can be quadratic in the size of the input graph representation, and these examples are generated rather than provided explicitly. Therefore, maintaining even approximate loss values for all potential positive examples is not feasible and would severely impact efficiency. We instead aim to draw subsets of examples and select from these subsets according to current loss values.
Finally, the baseline methods we build on do not work with examples individually but instead generate random walks and multiple examples from each walk. Using random walks rather than individual edges proved to be hugely beneficial, and we do not want to lose that advantage in our loss-guided methods. Therefore, our loss-guided selection methods stick to the paradigm of generating random walks and training with the examples they generate.
3.1. Loss-guided random walks
Perhaps the most natural method to consider is to incorporate the loss values of edges into the random walks. As in the baseline methods, the start node is selected uniformly at random. A walk of length t is then computed so that the probability of continuing from node v to node u is
(4)   Pr[v -> u]  proportional to  w_{vu} . loss(v,u)^alpha ,
where alpha >= 0 is a hyperparameter that tunes the dependence on loss. A choice of alpha = 0 provides the basic random walks used in (deepwalk:KDD2014). A large value of alpha will result in always selecting the highest-loss outgoing edge. A value of alpha = 1 will select an edge proportionally to the product of its weight and loss value. A drawback of this method is that it is computationally less efficient: when the random walk distribution is static, we can preprocess the graph so that walk generation is very efficient, whereas here we need to recompute loss values and edge probabilities of all outgoing edges while generating the walk. Moreover, we observed empirically that its performance in terms of training computation (per number of walks used for training) is generally unstable and poor on almost all datasets. This prompted us to consider instead loss-guided selection of walks, where the candidate random walks for training are generated as in the baseline method but the selection of walks is made according to assigned loss scores.
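A minimal sketch of one step of such a loss-guided walk (Equation 4). The neighbor list, weight map, and loss callback are illustrative placeholders, not part of the original method's interface; alpha = 0 recovers the plain weighted step of DeepWalk.

```python
import random

def loss_guided_step(v, neighbors, weights, loss, alpha, rng):
    """Choose the next node of the walk with probability proportional to
    w(v,u) * loss(v,u)**alpha (Equation 4).

    alpha = 0 ignores the loss (DeepWalk's simple weighted walk);
    a very large alpha concentrates on the highest-loss outgoing edge."""
    scores = [weights[(v, u)] * loss(v, u) ** alpha for u in neighbors]
    total = sum(scores)
    r = rng.random() * total
    acc = 0.0
    for u, s in zip(neighbors, scores):
        acc += s
        if r <= acc:
            return u
    return neighbors[-1]  # guard against floating-point round-off
```

Note that the scores of all outgoing edges must be recomputed at every step, which is exactly the efficiency drawback discussed above.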
3.2. Loss-score based selection of walks
We propose a design that addresses the general issues and those specific to our setting. At a high level, we use the same random walk distribution and the same update and training procedures as the baseline methods (see Algorithm LABEL:alg:baseline), but we modify the selection of walks for training. Algorithm LABEL:alg:walkscore is a meta-algorithm for our loss-guided walk selection methods. The pseudocode treats the following components as "black boxes": (i) the random walk distribution W_{u,t}, generated from a graph according to a random process with a specified start node u and length t; (ii) a training algorithm Train (such as a variant of SGNS) that includes an initialization method for the embedding vectors and an update method that takes input sentences (walks), generates from them positive and negative examples (according to the example generation parameters, the skip window and the negatives ratio), and performs the respective parameter updates. A component that is used only with the loss-guided methods is a loss scoring function of walks. Our choice of functions will be detailed later, but they depend on a specified power and may also depend on the specifics of example generation from walks (see Section 2.2).
For training, we initialize the embedding vectors and then repeat the following rounds: we draw random walks, one generated from each node u, and compute loss scores for each of those walks. We then select a subset of these walks for training in a way that is biased towards the walks with higher loss scores. Specifically, we use an integer parameter m and select for training a 1/m fraction of the scored walks. The selection within each round is done using a weighted sampling without replacement method, according to the loss scores of the walks. The weighted sampling can be implemented very efficiently in a single distributed pass over the walks using any of a variety of known order/bottom-k/varopt sampling methods (e.g., (Rosen1997a; Ohlsson_SPS:1998; Cha82; bottomk07:ds; bottomk07:est; DLT:jacm07; varopt_full:CDKLT10)). Finally, the selected walks from the round are handed to the training algorithm.
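The per-round selection can be sketched with order sampling, one of the weighted without-replacement schemes in the family cited above: each walk receives a random key that depends on its loss score, and the walks with the largest keys are kept. This is an illustrative implementation choice under the assumption of strictly positive scores; the framework admits any of the cited sampling methods.

```python
import random

def weighted_sample_without_replacement(walks, scores, k, rng):
    """Order sampling: each walk gets key u**(1/score) with u uniform in
    (0,1); the k largest keys form a weighted without-replacement sample.

    A single pass over (walk, score) pairs suffices, so the selection
    parallelizes well over a distributed stream of scored walks."""
    keyed = [(rng.random() ** (1.0 / s), w) for w, s in zip(walks, scores)]
    keyed.sort(key=lambda t: t[0], reverse=True)
    return [w for _, w in keyed[:k]]
```

Walks with much larger loss scores receive keys close to 1 and are selected with correspondingly higher probability, while every walk retains a nonzero chance of selection.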
The meta-procedure selects a set of walks in each round. In order to compare with the baseline methods that select n walks per epoch (one from each node), we use the term epoch to refer to providing n walks for training; with m rounds per epoch, the procedure selects a total of n #epochs walks for training. The pseudocode lists parameters that are used in the various "black box" components: the length t of the generated walks, the window size W used to generate positive examples from a walk, and a power which we will use later as a parameter in the scoring of walks.
3.3. Loss scoring of walks
We consider several ways to assign a loss score S(w) to a walk w and its respective positive example set P(w). All methods use a power parameter r. Our first scoring function uses the average loss of all positive examples generated from walk w:
(5)   S_all(w) = ( (1/|P(w)|) sum_{(u,v) in P(w)} loss(u,v) )^r .
The second function heuristically scores a walk w = (v_1, v_2, ...) by its first s edges:
(6)   S_s(w) = ( (1/s) sum_{i=1}^{s} loss(v_i, v_{i+1}) )^r .
With s = 1, the walk is scored by its first edge alone.
The advantage of the loss score (6) over (5) is that we can compute the loss score of a candidate walk from a node without explicitly computing the full walk: it suffices to draw only its first s edges. Only if the node is selected to the sample do we sample the remaining edges of the walk, conditioned on its prefix. Since we compute loss scores for m times as many walks as we actually train with, this is a considerable saving in our per-round preprocessing cost. The disadvantage of the loss score (6) is that we use only examples from the prefix to determine the loss score, so we can expect a reduction in effectiveness.
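The two scoring functions (5) and (6) can be sketched as follows. For simplicity this sketch generates positive pairs deterministically with a fixed window rather than with randomized skip lengths, and the pair-loss callback is a placeholder; both simplifications are ours.

```python
def score_all(walk, pair_loss, window, power):
    """Score (5): average loss over all positive pairs generated from the
    walk (here, deterministically, all pairs within `window`), raised to
    `power`."""
    losses = []
    for i in range(len(walk)):
        for j in range(max(0, i - window), min(len(walk), i + window + 1)):
            if j != i:
                losses.append(pair_loss(walk[i], walk[j]))
    return (sum(losses) / len(losses)) ** power

def score_prefix(walk, pair_loss, s, power):
    """Score (6): average loss over only the first s edges, so a candidate
    walk can be scored after drawing just its s-edge prefix."""
    losses = [pair_loss(walk[i], walk[i + 1]) for i in range(s)]
    return (sum(losses) / s) ** power
```

Only `score_all` touches every generated pair; `score_prefix` needs just the walk prefix, which is the computational saving discussed above.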
The power r in the computation of loss scores has the role of a hyperparameter: high values of r focus the selection more on walks with high-loss examples, whereas lower values allow for a broader representation of walks in the training. Interestingly, Shalev-Shwartz and Wexler (shalev:ICML2016) considered the more extreme objective of minimizing the maximum per-example loss. This objective is very sensitive to outliers (persistent high-loss examples) and in some cases can divert all the training effort to be futilely spent on erroneous examples. Note that our setting is not as exposed, because the pool of walks we select from in each round is randomized and we use a without-replacement sample to select a 1/m fraction of that pool for training.
3.4. Complexity analysis
The per-epoch training cost with both the baseline and the loss-guided selection methods amounts to computing the gradients of the loss functions (for positive and negative examples) and applying gradient updates. The training cost is proportional to the total number of examples generated from walks. The expected number of positive examples per walk, E[|P|], depends on the walk length t and window W. The total number also depends on the negatives-to-positives ratio k (see Section 2.2). Therefore, each walk generates in expectation (1+k) E[|P|] training examples. We train on n walks in each epoch, and thus the per-epoch training cost is:
(7)   T_train = n (1+k) E[|P|] .
We next consider the per-epoch total computation cost, which includes the preprocessing cost. For the baseline methods, the preprocessing cost corresponds to generating n random walks (t edge traversals each; Node2Vec requires keeping large state for efficient walk generation, but this does not much affect our comparative analysis of baseline versus loss-guided methods). The total cost is dominated by the gradient computations of the training cost and is:
(8)   C_base = n (1+k) E[|P|] + n t .
For the loss-guided methods, the preprocessing cost involves evaluations of the loss on positive examples. With the prefix loss score (6), in each round we generate the first s steps of a random walk from each of the n nodes. We then evaluate the loss score for each of the n walks, which amounts to evaluating the loss on s n pairs (only n/m of the walks are selected for training). The total number of loss evaluations per epoch (m rounds) is m s n. With the full loss score (5) we generate in each round a complete walk from each node and evaluate the loss for each pair in P(w), for a total of m n E[|P|] loss evaluations per epoch. The total computation cost combines the training and preprocessing costs and is measured by the number of loss or gradient evaluations; note that loss and gradient evaluations have similar complexity. Summarizing, we have
(9)   C_(6) = n (1+k) E[|P|] + m s n
(10)  C_(5) = n (1+k) E[|P|] + m n E[|P|] .
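A small sanity check of the cost accounting in Equations (7), (9), and (10), under the symbols used in this subsection (n walks per epoch, expected positive pairs per walk, negatives ratio k, m rounds per epoch, prefix length s). The function merely evaluates the formulas.

```python
def per_epoch_costs(n, expected_pairs, k, m, s):
    """Per-epoch gradient/loss-evaluation counts.

    Training cost (Eq. 7): n walks, each yielding (1+k)*E[|P|] examples.
    Preprocessing: m*n prefix walks scored on s pairs each for the prefix
    score (Eq. 9), or m*n full walks scored on E[|P|] pairs each for the
    full-walk score (Eq. 10).
    """
    train = n * (1 + k) * expected_pairs
    total_prefix = train + m * s * n
    total_all = train + m * n * expected_pairs
    return train, total_prefix, total_all
```

The gap between the last two values is the extra preprocessing paid for scoring complete walks rather than short prefixes.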
4. Empirical Evaluation Setup
As baselines, and for walk generation with our methods, we used DeepWalk (deepwalk:KDD2014) and Node2Vec (node2vec:kdd2016). These methods define the random walk distributions W_{u,t}. When evaluating our methods, we fit hyperparameters to the respective baseline method. With Node2Vec we searched over a grid of values for its walk hyperparameters.
We trained models using the Gensim package (rehurek_lrec), which builds on a classic implementation of SGNS (Mikolov:github2013). We used the default parameters, which generally perform well, for the length t of the walk (sentence), the window size W, and the number k of negative examples generated for each positive example. With these values, each walk generates a fixed expected number of positive examples, and a corresponding total number of training examples, when training (see Section 3.4).
In our implementation we applied the baseline method (Algorithm LABEL:alg:baseline) for the first epoch and applied the loss-guided methods (Algorithm LABEL:alg:walkscore) starting from the second epoch. This is because we expect scoring by loss not to be helpful initially, with random initialization. We used m rounds per epoch and a power r for the loss value. Each experiment is repeated several times, and we report the average quality and standard error. We fit parameters on one dataset from each collection using one set of repetitions and use the same parameters with all datasets and a fresh set of repetitions.
As mentioned, SGNS determines the distribution of negative examples according to the frequencies of words in the provided positive examples. With the baseline methods, the distribution of random walks, and hence the frequencies of words in positive examples, remain fixed throughout training and are approximated by maintaining historic counts from the beginning of training. With our loss-guided selection, the distribution of positive examples changes over time. We experimented with variations that use a recent positive distribution (per round or over a few recent epochs) to guide the negative selection. We did not observe a significant effect on performance and report results with respect to frequencies collected since the start of training.
4.1. Tasks and metrics
We evaluated the quality of the embeddings on the following tasks, using corresponding quality metrics:
Clustering: the goal is to partition the nodes into clusters. The embedding vectors are used to compute a k-means clustering of the nodes. We used sklearn.cluster.KMeans from the scikit-learn package (scikitlearn) with default parameters. Our quality measure is the modularity score (Modularity:PhysRev2004) of the clustering.
Multi-class (or multi-label) classification: nodes have associated classes (or labels) from a set C. The class (or the set of labels) is provided for some nodes, and the goal is to learn the class/labels of the remaining nodes. An embedding is computed for all nodes (in an unsupervised fashion). Following that, a supervised learning algorithm is trained on (embedding, class/label) pairs. We used one-vs-rest logistic regression (sklearn.multiclass.OneVsRestClassifier) from the scikit-learn package with default parameters (scikitlearn). For multi-label classification we used the multinomial option. In a multi-class setting, we obtain a class prediction from the embedding vector of each of the remaining nodes and report the fraction of correct predictions. In a multi-label setting, we provide the number of labels and the embedding vector and obtain a set of predicted labels for each node. We report the micro-averaged F1 score.
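The modularity quality measure used for the clustering task can be computed directly from a partition. Below is a self-contained sketch for an unweighted graph (it is not part of the scikit-learn pipeline above):

```python
def modularity(edges, community):
    """Newman-Girvan modularity of a partition of an unweighted graph:
    Q = sum over communities c of (e_c/m - (d_c/(2m))^2), where e_c is the
    number of intra-community edges and d_c the total degree within c."""
    m = len(edges)
    intra, deg = {}, {}
    for u, v in edges:
        deg[community[u]] = deg.get(community[u], 0) + 1
        deg[community[v]] = deg.get(community[v], 0) + 1
        if community[u] == community[v]:
            intra[community[u]] = intra.get(community[u], 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in deg.items())
```

For two disjoint triangles split into their natural communities this gives Q = 0.5, while the trivial single-community partition gives Q = 0.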
4.2. Measuring gain
Across our datasets, the peak accuracy with loss-guided selection was equal to or higher than the baseline's. We thus consider efficiency, which we measure using E*, the average number of training epochs over repetitions needed for the method to reach a fixed fraction of its peak accuracy. We can now express the training and computation costs and the respective gains. With the parameter values we use, the per-epoch training cost is the same for all methods, and the per-epoch computation costs follow Equations (8) to (10). Accordingly, we express the gain of a loss-guided method with a given scoring function with respect to the baseline:
Training gain: the relative decrease in the number of training epochs (recall that the training cost per epoch is similar for all methods):
(11)   TrainingGain = 100 (1 - E*_method / E*_baseline) .
When reporting the training gain, we report the error over repetitions: we compute the (sample) standard deviation of the number of epochs used by the method to reach peak (over repetitions) and normalize it by dividing by E*_baseline.
Computation gain: the relative decrease in computation cost:
(12)   ComputationGain = 100 (1 - (C_method E*_method) / (C_baseline E*_baseline)) .
The per-epoch computation cost C of each method follows Equations (9) and (10).
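The two gain measures, Equations (11) and (12), amount to the following arithmetic (E denotes epochs-to-target accuracy and C a per-epoch computation cost; the percentage convention follows the tables):

```python
def training_gain(epochs_method, epochs_baseline):
    """Relative decrease (%) in epochs to reach the target accuracy
    (Equation 11)."""
    return 100.0 * (1.0 - epochs_method / epochs_baseline)

def computation_gain(epochs_method, cost_method,
                     epochs_baseline, cost_baseline):
    """Relative decrease (%) in total computation, weighting each method's
    epoch count by its per-epoch cost (Equation 12)."""
    return 100.0 * (1.0 - (epochs_method * cost_method)
                    / (epochs_baseline * cost_baseline))
```

A method that reaches the target in 8 epochs instead of 10, at a 10% higher per-epoch cost, has a 20% training gain but only a 12% computation gain.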
5. Datasets and tasks
We evaluate our methods on three collections of real-world datasets, summarized in Table 1. The datasets have different learning tasks (see Section 4.1):
Facebook page networks (clustering): the collection represents mutual "like" networks among verified Facebook pages. There are seven datasets for different communities (TV shows, athletes, and more) (DBLP:conf/asunam/RozemberczkiDSS19). The task (following (DBLP:conf/asunam/RozemberczkiDSS19)) is to compute embeddings and cluster the data into a given number of clusters.
Citation networks (multi-class): the collection has three networks (Citeseer, Cora, and Pubmed) (DBLP:journals/aim/SenNBGGE08). Networks are formed by having a node for each document and an (undirected, unweighted) edge for each citation link. Each document has a class label. Following (node2vec:kdd2016; Yang:ICML2016), we train a d-dimensional embedding and use a random selection of 20 nodes per class as labeled training examples.
Protein-Protein Interactions (PPI) (multi-label): the dataset is a graph of human protein-protein interactions (node2vec:kdd2016). Each protein (node) has multiple labels, and the goal is to predict this set of labels. Following (node2vec:kdd2016), we use 50% of the nodes (selected uniformly at random) for training.
dataset  
(DBLP:conf/asunam/RozemberczkiDSS19) Facebook pages  clustering  
Athletes  20  
Company  20  
Government  20  
New Sites  20  
Politicians  20  
Public Figures  20  
TV Shows  20  
(DBLP:journals/aim/SenNBGGE08) Citation networks  Multiclass  
Pubmed  
Cora  
Citeseer  
(node2vec:kdd2016) Protein Interactions  Multilabel  
PPI 
6. Empirical Results
We evaluate our methods using three key metrics: quality, training gain, and computation gain. We use figures to show quality in the course of training: we plot average performance over repetitions and provide error bars that correspond to one standard deviation. We use tables to report training and computation gains for different methods and hyperparameter settings. In Appendix B we provide parameter sweeps on the number m of rounds per epoch, the loss power r, and the prefix length s in the loss score (6). In this section we report results for the number of rounds per epoch that seems to be a sweet spot for the training cost (see Appendix B for other values of m). We use both the DeepWalk and Node2Vec baselines (additional results are reported in Appendix C). For each loss scoring function we used the best performing overall power r: the full-walk score (5) performed well with a high power (selecting the highest-loss walks in each round), and the prefix score (6) performed well with a lower power (weighted sampling that is biased towards higher loss). Interestingly, the loss-guided random walks of Section 3.1 did not perform better even in terms of training cost (and since that method is computation heavy, there is also no improvement in computation cost); we show its performance in plots but do not report it in tables. Appendix D provides additional exploration of the loss patterns of loss-guided versus baseline training.
6.1. Facebook networks (clustering task)
Representative results are reported in Table 2 for both baselines. Figure 1 shows the modularity score in the course of training for representative datasets and methods. We fitted the Node2Vec parameters on the Athletes dataset and applied them with all datasets and methods. We see that loss-based selection obtained a 13%-25% reduction in training and a 6%-20% reduction in computation for both baselines. We can also see that on almost all datasets in this collection the full-walk score (5) outperformed the prefix score (6) in terms of training cost but in most cases had a lower overall gain in computation cost.
Training  Comp  Training  Comp  
dataset  ,  %gain  %SD  %gain  %gain  %SD  %gain 
DeepWalk baseline  Node2Vec baseline  
Athletes  ,  12  1.8  9.8  12.91  2.40  10.7 
,  18.2  3.10  0.70  18.2  2.33  0.14  
Company  ,  18.2  1.86  16.0  20.0  2.21  17.7 
,  22.3  1.71  5.5  22.6  1.50  5.38  
Government  ,  10.7  2.67  8.47  10.4  2.10  8.13 
,  21.9  1.90  5.60  20.3  2.10  2.61  
New Sites  ,  15.2  3.5  10.1  12.5  3.63  10.5 
,  4.21  9.58  17.7  7.08  8.70  14.5  
Politicians  ,  17.9  2.19  15.8  18.2  2.51  16.0 
,  24.6  1.60  9.44  24.1  1.73  8.16  
Public  ,  10.2  5.08  7.93  7.10  3.81  4.9 
Figures  ,  24.3  2.97  7.69  23.3  1.84  6.05 
TV Shows  ,  21.76  1.14  19.65  21.6  1.63  20.0 
,  25.2  1.51  9.88  24.4  1.3  8.88 
6.2. Citation networks (multi-class)
Representative results are reported in Table 3. Figure 2 shows performance in the course of training for the Pubmed dataset. The Node2Vec parameters were fitted on the Cora dataset. We can see that the loss-guided methods had training gains of 8%-12% on the Pubmed and Citeseer datasets, but due to large error bars there is no significance to the improvements on Cora. One of the loss scores outperformed the others also in terms of training cost.
Training  Comp  Training  Comp  
dataset  ,  %gain  %SD  %gain  %gain  %SD  %gain 
DeepWalk baseline  Node2Vec baseline  
Citation Networks,  
Pubmed  ,  9.07  3.91  7.38  9.06  3.55  8.28 
,  5.21  7.02  10.5  6.14  5.96  10.7  
Cora  ,  1.80  8.60  0.00  4.08  7.45  4.23 
,  5.20  8.12  12.4  8.27  6.24  9.84  
Citeseer  ,  7.64  6.60  5.81  11.57  5.43  9.6 
,  5.73  6.20  11.2  7.90  8.37  9.90  
Protein Interaction Network,  
PPI  ,  12.7  3.91  3.90  10.4  7.82  10.7 
,  20.7  3.77  14.75  21.4  3.73  11.8  
,  22.2  3.38  4.06  22.2  3.90  4.50 
6.3. PPI network (multi-label)
Representative results with several loss-score settings are reported in Table 3 and Figure 3. The Node2Vec parameters were fitted on this dataset. We observe that the training cost improves as more of the walk is used for scoring: the training gain with full-walk scoring is significantly higher than with scoring by the first edge alone, but most of that gain is already attained by an intermediate prefix length. The computation gain is largest with the intermediate setting, which attains nearly the same training gain as full-walk scoring but at a lower per-epoch computation. Overall, we see training gains of 22% and computation gains of 12%-15%.
7. Conclusion
We study loss-guided example selection, known to accelerate training in some domains, for methods such as DeepWalk and Node2Vec that learn node embeddings using random walks. The random-walk-based methods use a static distribution over an implicitly-represented extended set of training examples, and seem less amenable to dynamic loss-guided example selection. We propose efficient methods that facilitate loss-based dynamic example selection while retaining the highly effective structure of random walks and their scalability. We demonstrate empirically the effectiveness of the proposed methods. An interesting open question is to explore such benefits with other frameworks that generate training examples on-the-fly from an implicit representation, such as example augmentation, or together with methods that work with feature representations of nodes, such as PinSage (PinSage:KDD2018).
Acknowledgements This research is partially supported by the Israel Science Foundation (Grant No. 1595/19). We thank the anonymous GRADES-NDA '20 reviewers for many helpful comments.
References
Appendix A Synthetic communities graph
We start with a simple synthetic network that demonstrates the benefits of loss-guided selection. The example network structure is illustrated in Figure 4. We have three communities (red, green, and blue) of the same size. The red and green communities are interconnected and the blue community is isolated. The goal is to reconstruct the ground-truth community affiliations from the learned embedding. Our construction is inspired by random G(n,p) graphs (each community is a G(n,p) graph) and the planted partition model (DBLP:journals/rsa/CondonK01). We generated intra-community edges so that each pair of same-community nodes has a connecting edge with a fixed probability, and each inter-community pair from the red and green communities has a connecting edge with a second, smaller probability.
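The construction can be sketched as follows; the community size and edge probabilities below are placeholder values, since the exact parameters are not reproduced here.

```python
import random

def communities_graph(n=100, p_in=0.3, p_out=0.05, seed=0):
    """Three equal-size communities: red (0), green (1), blue (2).
    Same-community pairs are connected with probability p_in;
    red-green pairs with probability p_out; blue stays isolated."""
    rng = random.Random(seed)
    community = {v: v // n for v in range(3 * n)}
    edges = []
    for u in range(3 * n):
        for v in range(u + 1, 3 * n):
            cu, cv = community[u], community[v]
            if cu == cv and rng.random() < p_in:
                edges.append((u, v))
            elif {cu, cv} == {0, 1} and rng.random() < p_out:
                edges.append((u, v))
    return edges, community
```

With p_in well above p_out, the red and green communities remain harder to separate than the isolated blue community, which is the regime the experiment targets.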
We trained node embeddings using DeepWalk and using loss-guided selection with DeepWalk as a baseline. The baseline method DeepWalk selects the start node of a random walk uniformly, and hence the distribution of training examples remains balanced among the three communities through the course of training. The loss-guided selection will focus more training on walks with a higher loss score. We expect the isolated community to separate out early in training and the two interconnected communities to require more training to "separate" from each other. The loss of a same-community pair will be lower earlier for the isolated community. A loss-guided method, after the initial stage of training, is therefore more likely to select training examples from the two interconnected communities and thus be more productive. The benefit is further boosted by the corresponding selection of negative examples: a community not selected for positive examples also does not participate in negative examples.

The quality was measured by treating the problem as a 3-class classification problem, as explained in Section 4.1, with classes assigned according to community. Half the nodes (selected randomly) were used as labeled examples for the supervised component. We used repetitions for each method and report representative results for the chosen embedding dimension. Figure 5 shows the fraction of correct classifications as a function of training epochs. We observe that the different methods behave the same in the initial phase of training, until the blue community separates out from the other two, but after that the loss-guided methods are more effective. The loss-score function that uses the first edge of the walk attains the full advantage of the loss-guided methods; this is because the first edge already identifies the community. Figure 6 reports the fraction of training spent at each community.

We can see that in the initial phase all methods are balanced but, as expected, the baseline DeepWalk remains balanced whereas the loss-guided variant spends an increasing fraction of training resources on the green and red communities, where it is more helpful.
Appendix B Hyperparameter sensitivity
We explore the dependence of the performance of our loss-guided methods on the following parameters: the number of walk edges that are used in the walk loss scoring function, the number of rounds per epoch, and the loss power, which determines how we weigh the loss of examples when we compute loss scores of walks.
B.1. Loss scoring methods of walks
We proposed (see Section 3.3) several loss scoring functions of walks: one that uses the average loss of the first edges of the walk, and one that uses the average loss of all positive training examples generated from the walk. We observed empirically that the one rarely outperformed the other, even in terms of training cost. We note that for technical reasons we used the expected loss on the selected walk (under random draws) instead of the precise evaluation on the pairs generated from the selected walk. This could have adversely impacted its reported performance.
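The two scoring styles can be sketched roughly as follows; the function names, the per-pair loss callback, and the context-window convention are illustrative, not the paper's exact definitions.

```python
def first_edges_score(walk, pair_loss, k=1):
    """Score a walk by the average loss of its first k edges."""
    losses = [pair_loss(walk[i], walk[i + 1])
              for i in range(min(k, len(walk) - 1))]
    return sum(losses) / len(losses)

def all_pairs_score(walk, pair_loss, window=2):
    """Score a walk by the average loss over all positive pairs,
    i.e. node pairs co-occurring within the context window."""
    pairs = [(walk[i], walk[j])
             for i in range(len(walk))
             for j in range(i + 1, min(i + window + 1, len(walk)))]
    return sum(pair_loss(u, v) for u, v in pairs) / len(pairs)
```

The first variant touches only k pairs per walk, while the second touches every generated example, which is the source of the per-epoch cost difference discussed below.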
We explore the training and computation cost as we vary . Representative results are reported in Table 4 (we report results for datasets for which the error bars are small compared with the gain and its variation). We see a general trend of improved training cost as we increase , but the extent of this improvement varies widely between datasets. For example, the improvement is small for the TV Shows dataset, moderate for the Politicians dataset, and significant for the Government and PPI datasets. Note that the per-epoch computation cost also increases with (see the analysis in Section 3.4 and Section 4.2). The overall computation gain as we increase reflects both the decrease in the number of epochs and the increase in per-epoch computation. We can see that the computation gain is often maximized at lower values than the value that maximizes the training gain.
                 DeepWalk baseline
                 Training      Comput
dataset          %gain   %SD   %gain
PPI              12.7    3.91  10.4
                 20.7    3.77  14.8
                 23.3    3.90  14.0
                 25.9    3.20  8.50
Pubmed           9.07    3.91  7.38
                 7.31    5.77  4.11
                 1.49    9.00  6.83
                 10.5    19.8  29.9
Athletes         12.0    1.89  9.82
                 13.5    2.34  9.54
                 16.8    2.60  7.66
                 15.5    4.49  2.62
Company          18.2    1.86  16.0
                 18.4    2.06  14.5
                 20.6    1.47  11.8
                 20.1    3.34  2.87
Government       10.7    2.67  8.47
                 12.7    3.17  8.67
                 19.7    2.99  10.7
                 23.6    2.53  7.11
Politicians      17.9    2.19  15.8
                 19.5    2.21  15.8
                 23.6    1.75  15.6
                 24.8    1.38  9.72
Public Figures   10.2    5.08  7.93
                 12.2    4.93  8.10
                 23.0    1.84  14.2
                 26.9    3.41  10.9
TV Shows         21.8    1.14  19.7
                 22.6    1.78  18.9
                 23.8    1.43  15.7
                 24.4    1.55  8.82
B.2. Rounds per epoch
The parameter controls the number of rounds per epoch. Recall that in each round we score walks and select of these walks for training. The setting corresponds to the baseline method. In Table 5 we report training and computation gains over the DeepWalk baseline. We report on all datasets for which the standard deviation of the gain allowed for meaningful comparisons; gains are reported for two configurations. We highlight the value that maximizes the training gain or the computation gain for each configuration. We can see a trend where the training gain increases with the number of rounds. The computation gain is often maximized at a lower value than the value that maximizes the training gain; this is because the per-epoch computation also increases with the number of rounds (see Section 3.4 and Equation (12) in Section 4.2). At some value we reach a sweet spot that balances the benefit of reduced training (which increases with the number of rounds) against the higher per-epoch preprocessing computation (which also increases with it).
Qualitatively, more rounds per epoch mean that the training is more focused on high-loss examples and that the loss values used are more current. This is helpful up to a point, but with enough rounds we might direct all the training to outliers or erroneous examples. In the table we do not see a point where the training cost starts increasing with the number of rounds, but we do see that there is almost no gain between the two largest values.
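A minimal sketch of one such round, assuming candidate walks are scored by an arbitrary walk_score function and selected with probability proportional to the score raised to the loss power p (all names here are illustrative):

```python
import random

def run_round(gen_walk, walk_score, num_candidates, num_train, p=1.0, rng=random):
    """One loss-guided round: generate candidate walks, then sample
    num_train of them with probability proportional to walk_score**p."""
    walks = [gen_walk() for _ in range(num_candidates)]
    weights = [walk_score(w) ** p for w in walks]
    return rng.choices(walks, weights=weights, k=num_train)
```

An epoch then consists of repeating this round; more rounds per epoch means scores are refreshed more often, at the cost of scoring more candidate walks per selected training walk.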
                 Training      Comp       Training      Comp
dataset          %gain   %SD   %gain      %gain   %SD   %gain
PPI              10.4    5.30  9.71       16.7    5.20  12.4
                 13.2    5.30  11.8       20.0    4.80  10.4
                 12.7    3.91  10.4       22.3    3.38  4.06
                 10.1    7.50  5.70       21.5    3.40  14.7
Athletes         7.04    4.30  6.46       10.4    4.80  6.35
                 11.7    2.57  10.5       17.5    3.13  8.45
                 12.0    1.89  9.82       18.2    3.09  0.70
                 9.93    2.51  5.85       18.6    2.11  15.8
Company          12.0    2.93  11.3       13.2    2.68  9.14
                 19.2    2.20  17.8       20.0    2.11  11.1
                 18.2    1.86  16.1       22.3    1.71  5.54
                 18.0    1.69  14.1       22.2    1.26  10.8
Government       6.93    3.19  6.35       15.3    4.06  11.3
                 10.5    3.31  9.27       20.3    2.47  11.4
                 10.7    2.67  8.47       22.0    1.89  5.06
                 9.75    2.64  5.60       22.4    1.86  10.7
New Sites        13.3    2.95  12.6       15.6    5.65  11.4
                 17.1    3.48  15.7       8.30    7.58  2.40
                 15.2    3.53  12.86      4.21    9.58  17.8
                 11.1    4.60  6.81       4.33    10.6  39.5
Politicians      11.2    4.55  10.6       14.6    2.00  10.8
                 16.8    2.38  15.6       21.9    2.14  13.7
                 17.9    2.19  15.8       24.6    1.60  9.44
                 19.1    1.83  15.4       25.2    0.88  4.08
Public Figures   5.63    7.39  5.07       15.5    4.58  11.5
                 10.8    4.20  9.56       21.6    4.19  12.7
                 10.2    5.08  7.93       24.3    2.97  7.69
                 7.66    3.12  3.41       26.3    1.98  5.30
TV Shows         10.5    3.09  9.88       17.4    3.59  13.6
                 20.7    1.40  19.4       23.3    1.72  15.1
                 21.8    1.14  19.7       25.2    1.51  9.90
                 21.8    1.20  18.2       26.1    1.34  3.40
                 Training      Comp       Training      Comp
dataset          %gain   %SD   %gain      %gain   %SD   %gain
PPI              6.02    4.33  3.71       10.9    3.80  9.70
                 12.1    4.10  9.8        22.3    3.38  4.06
                 12.7    3.91  10.4       22.8    3.00  5.00
Pubmed           7.31    4.16  5.62       6.70    6.79  8.70
                 8.80    4.92  7.11       5.21    7.02  10.5
                 9.07    3.91  7.38       10.5    19.8  29.9
Athletes         8.73    2.81  6.56       12.6    2.88  6.18
                 12.1    1.89  9.88       18.2    3.09  0.70
                 12.0    1.89  9.88       15.5    4.49  2.62
Company          16.8    2.43  14.6       17.9    1.75  0.13
                 19.4    1.83  17.2       22.3    1.71  5.54
                 18.2    1.86  16.1       20.1    3.34  2.87
Government       10.3    2.43  8.07       15.4    3.06  2.98
                 11.7    2.75  9.45       22.0    1.89  5.06
                 10.7    2.67  8.47       23.6    2.53  7.11
New Sites        16.5    3.70  14.2       11.3    3.80  8.70
                 16.6    3.70  14.3       9.70    5.60  10.7
                 15.2    3.50  12.9       4.00    6.90  17.7
Politicians      15.7    2.52  13.6       18.5    1.60  1.98
                 17.2    2.19  15.1       24.6    1.60  9.44
                 17.9    2.19  15.8       24.8    1.40  9.72
Public Figures   10.5    3.48  8.25       17.5    3.21  0.70
                 9.51    4.09  7.29       24.3    2.97  7.69
                 10.2    5.08  7.93       26.9    3.41  10.9
TV Shows         18.3    1.61  16.2       19.3    1.89  2.63
                 21.3    1.40  19.1       25.2    1.55  9.88
                 21.8    1.14  19.7       24.4    1.55  8.82
B.3. Loss power
The loss power is used in the walk loss scoring function (see Section 3.3). Its value determines to what extent the training selection made in each round is biased towards examples with higher loss. A lower loss power allows for a broader selection of walks into a round, and a higher one focuses the training more on the highest loss-score walks. In particular, a very high loss power means that we essentially select the walks with the highest-loss examples, whereas a loss power of 1 means that we select walks for training proportionately to the loss values of their examples. The loss power selection should be dependent on the number of rounds, because a lower number of rounds also allows for a broader selection of walks. It also needs to be fitted to the scoring function we use. Table 6 reports training and computation gains as we vary the loss power, for both loss scoring functions. We can see that higher values perform better overall than the lowest value considered, and that the improvements are fairly robust to the particular choice.
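The effect of the loss power can be illustrated with a toy computation; the helper below is hypothetical and simply normalizes loss**p weights over a set of candidate walks.

```python
def selection_probs(losses, p):
    """Probability of selecting each candidate walk when selection
    weights are the walk loss scores raised to the power p."""
    weights = [loss ** p for loss in losses]
    total = sum(weights)
    return [w / total for w in weights]
```

At p = 0 selection is uniform, at p = 1 it is proportional to the loss scores, and as p grows the mass concentrates on the highest-loss walk, which mirrors the trade-off described above.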
Appendix C Results for Node2Vec baseline
Plots of the quality in the course of training with the Node2Vec baseline for representative datasets are provided in Figure 7. Recall that the respective training and computation gains of the loss-guided methods were reported in Table 2 and Table 3.
Figure 8: Patterns of per-example loss values: the ratio of edge loss versus background loss (left), the 90% quantile versus average edge loss (middle), and quality in the course of training (right), for loss-guided methods and the baseline method DeepWalk.
Appendix D Loss behavior
We observed that loss-guided walk selection improves the performance of the different downstream tasks. To obtain insights into the behavior of the loss-guided versus the baseline methods, we consider properties of the per-example loss values. Figure 8 provides side-by-side plots of these properties and plots of quality in the course of training. For our purposes here, we treat the graph edges as an approximate set of strong positive examples. These examples tend to be weighted higher in the distribution generated from random walks. We consider two qualities of the distribution of the loss values on these edges:

The ratio of the average edge loss to the background loss. The average edge loss is the average over graph edges, and the background loss is measured by the average loss over random non-edge pairs. We observed that the loss scale shifts significantly during training; in particular, both of these average loss values decrease by orders of magnitude. The ratio serves as a normalized measure of separation between edge and background loss, and we expect it to be lower (more separation) when the training is more effective.

The ratio of the 90% quantile of the edge loss values to the average edge loss. This ratio is a measure of spread and indicates how well the training method balances its progress across the positive examples. A ratio closer to 1 means a smaller spread and a better balance.
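The two diagnostics above can be computed from per-example loss values roughly as follows; the helper name and the quantile convention are assumptions, not the paper's exact procedure.

```python
def loss_diagnostics(edge_losses, background_losses):
    """Return (separation, spread):
    separation = mean edge loss / mean background (non-edge) loss;
    spread     = 90% quantile of edge loss / mean edge loss."""
    mean_edge = sum(edge_losses) / len(edge_losses)
    mean_bg = sum(background_losses) / len(background_losses)
    # nearest-rank 90% quantile over the sorted edge losses
    q90 = sorted(edge_losses)[int(0.9 * (len(edge_losses) - 1))]
    return mean_edge / mean_bg, q90 / mean_edge
```

A lower separation value indicates that edge loss has dropped further below the background loss, while a spread near 1 indicates that training progress is balanced across the positive examples.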
Results on representative datasets are reported in Figure 8. We can see that across datasets, in the training regime before performance peaks, the loss-guided methods have a lower spread than the baseline DeepWalk method: the ratio of the 90% quantile to the average loss on edges is uniformly lower. Moreover, the loss-guided method that scores a walk by all of its examples has a lower spread than the one that uses only the first edges. Overall, this is consistent with what we expect from loss-guided training, where more training is directed to higher-loss examples and the former score better represents the current loss.
Interestingly, the baseline method DeepWalk has a lower edge-to-background loss ratio, which corresponds to a stronger separation of edge loss and background loss. The lower ratio of the baseline appears early on and, perhaps surprisingly, on some of the datasets (PPI and the citation networks) persists even in regimes where DeepWalk is outperformed by the loss-guided methods.
These patterns showcase the advantage of loss-guided selection, which is geared more towards minimizing spread than average loss. The average loss seems indeed to be effectively minimized by the baseline methods, but on its own it does not fully reflect quality.