Network data often exhibits underlying community structure, and there is a vast literature devoted to uncovering communities in complex networks; see, for example, [36, 48, 41, 46]. In many applications, one community in the network is of particular interest to the researcher. For example, in neuroscience connectomics, researchers might want to identify the region of the brain responsible for a particular neurological function; in a social network, a marketing company might want to find a group of users with similar interests; in an internet hyperlink network, a journalist might want to find blogs with a certain political leaning or subject matter. If we are given a few vertices known to be from the community of interest, and perhaps a few vertices known to not be from the community of interest, the task of vertex nomination is to order the remaining vertices in the network into a nomination list, with the aim of having a concentration of vertices from the community of interest at the top of the list; for alternate formulations of the vertex nomination problem, see [39, 28].
In , three novel vertex nomination schemes were introduced: the canonical vertex nomination scheme, the likelihood maximization vertex nomination scheme, and the spectral partitioning vertex nomination scheme.
Under mild model assumptions, the canonical vertex nomination scheme, which is the vertex nomination analogue of the Bayes’ classifier, was proven to be the optimal vertex nomination scheme according to a mean average precision metric (see Definition 3). Unfortunately, the canonical scheme is not practical to implement on graphs with more than a few tens of vertices. The likelihood maximization vertex nomination scheme utilizes novel graph matching machinery, and is shown to be highly effective on both simulated and real data sets. However, it is not practical to implement on graphs with more than a few thousand vertices. The spectral partitioning vertex nomination scheme is less effective than the canonical and the likelihood maximization vertex nomination schemes on the small and moderately sized networks where those two schemes can respectively be implemented in practice. Nonetheless, the spectral partitioning vertex nomination scheme has the significant advantage of being practical to implement on graphs with up to tens of millions of vertices.
1.1 Extending the canonical and spectral partitioning schemes
In this paper we present extensions of the canonical and spectral partitioning vertex nomination schemes. Our extension of the canonical vertex nomination scheme, which we shall call the canonical sampling vertex nomination scheme, is an approximation of the canonical scheme that can be practically computed for graphs with hundreds of thousands of vertices. Our extension of the spectral partitioning vertex nomination scheme, which we shall call the extended spectral partitioning vertex nomination scheme, can be practically computed for graphs with close to one hundred thousand vertices, with significantly increased precision over the spectral partitioning scheme on moderately sized networks.
Although both the canonical sampling scheme and the extended spectral partitioning scheme are practical to implement on very large graphs, the former has the important theoretical advantage of directly approximating the provably optimal canonical vertex nomination scheme, with this approximation improving as more sampling is used (and converging to the canonical scheme in the limit). However, as with the canonical scheme, the canonical sampling scheme can be held back by the need to know or estimate the parameters of the underlying graph model before implementation. While this may be impractical in settings where these estimates are infeasible, the canonical sampling scheme allows us to approximately compute optimal performance in a larger array of synthetic models, thereby allowing us to better assess the performance of other, more feasibly implemented, procedures. Indeed, given unlimited computational resources (for sampling purposes), when the model parameters are known a priori or estimated to a suitable precision, the canonical sampling scheme would outperform every vertex nomination scheme other than the canonical scheme.
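In contrast, the extended spectral partitioning scheme is implemented without needing to estimate the underlying graph model parameters; indeed, including known parameter estimates into the extended spectral partitioning framework is nontrivial. This can lead to superior performance of the extended spectral partitioning scheme versus the canonical sampling scheme, especially in the setting where parameter estimates are necessarily highly variable. Additionally, given equal computational resources (i.e., when limiting the sampling allowed in the canonical sampling scheme), the extended spectral partitioning scheme often outperforms the canonical sampling scheme, even when the model parameters are well estimated.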
See Figure 1 for an informal visual representation that succinctly compares the various vertex nomination schemes on the basis of performance and also computational practicality, as the scale of the number of vertices changes. The colors red, blue, green, purple, and orange correspond respectively to the canonical, canonical sampling, likelihood maximization, extended spectral partitioning, and spectral partitioning vertex nomination schemes. The lines dim to reflect increased computational burden. The red line on top represents the canonical vertex nomination scheme; it quickly dims out at a few tens of vertices, since at this point it is no longer practical to compute. Otherwise, the red line would have extended in a straight line across the figure, above all of the other lines, since it is the optimal nomination scheme, and is thus the benchmark for comparison of all of the other nomination schemes. Next, the dark/light/lighter blue regions correspond to the canonical sampling vertex nomination scheme; it isn’t a single line, but rather layers of lines for the different amounts of sampling that could be performed. As the number of vertices grows, it requires more sampling (i.e., more computational burden) to be more effective, hence the blue color lightens upwards in the figure as it approaches the red line, or where the red line would have extended to. For graphs with few vertices the dark blue line is just below the red line; indeed, the canonical sampling scheme is effortlessly just as effective as the canonical scheme. Even with more vertices, enough sampling would have it converge to the canonical scheme, but with an ever increasing computational burden, hence the dimming of the blue towards the top. Next, the green line corresponds to the likelihood maximization vertex nomination scheme; the green color dims out at a few thousand vertices, since at this point it is no longer practical to compute.
Finally, the purple and orange lines, respectively, correspond to the extended spectral partitioning and spectral partitioning vertex nomination schemes, the former being uniformly more effective than the latter. When there are only a few vertices the spectral methods are essentially useless, and these methods only become effective when there are a moderate number of vertices. The extended spectral partitioning scheme is practical to compute until there are close to a hundred thousand vertices, while the spectral partitioning scheme is practical to compute for several hundred thousand vertices.
The paper is laid out as follows. In Section 3.1, we describe the canonical vertex nomination scheme, and prove its theoretical optimality in a slightly different model setting than considered in . In Section 3.2, we use Markov chain Monte Carlo methods to extend the canonical vertex nomination scheme to the canonical sampling vertex nomination scheme . In Section 3.3, we describe the spectral partitioning nomination scheme. In Section 3.4, we extend the spectral partitioning vertex nomination scheme to the extended spectral partitioning vertex nomination scheme , utilizing a more sophisticated clustering methodology than in , without an inordinately large sacrifice in scalability. In Section 4, we demonstrate and compare the performance of and on both simulated and real data sets.
We develop our vertex nomination schemes in the setting of the stochastic block model, a random graph model extensively used to model networks with underlying community structure. See, for example, [20, 52, 1]. The stochastic block model is a very simple random graph model that provides a principled approximation for more complicated network data (see, for example, [38, 53, 22]), with the advantage that the theory associated with the stochastic block model is quite tractable.
The stochastic block model random graph is defined as follows; let be a fixed positive integer.
A random graph is an SBM graph if
The vertex set is the disjoint union of sets such that, for each , it holds that . (For each , is called the th block.)
The block membership function is such that, for all and all , it holds that if and only if .
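To make the model concrete, the following minimal sketch samples a graph from a stochastic block model. It assumes the standard edge rule, in which each pair of distinct vertices is joined independently with a probability that depends only on the blocks of the two endpoints. The names `sample_sbm`, `block_sizes`, and `edge_prob` are illustrative and do not come from the paper.

```python
import random


def sample_sbm(block_sizes, edge_prob, seed=None):
    """Sample an undirected, loop-free graph from a stochastic block model.

    block_sizes[k] is the number of vertices in block k + 1.
    edge_prob[k][l] is the (symmetric) probability of an edge between a
    vertex of block k + 1 and a vertex of block l + 1.
    Returns (adjacency matrix as nested lists, block labels in 1..K).
    """
    rng = random.Random(seed)
    # Vertices are numbered so that each block is a contiguous range.
    blocks = [k + 1 for k, size in enumerate(block_sizes) for _ in range(size)]
    n = len(blocks)
    adj = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            p = edge_prob[blocks[u] - 1][blocks[v] - 1]
            if rng.random() < p:  # independent Bernoulli(p) trial per pair
                adj[u][v] = adj[v][u] = 1
    return adj, blocks


# Degenerate probabilities make the outcome deterministic: two cliques.
adj, blocks = sample_sbm([3, 2], [[1.0, 0.0], [0.0, 1.0]], seed=0)
```

With probabilities strictly between 0 and 1 the same call produces a random graph whose block structure is only visible statistically, which is the setting the nomination schemes address.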
In the setting of vertex nomination, we assume that is only partially observed. Specifically, is partitioned into two disjoint sets, (the set of seeds) and (the set of ambiguous vertices), and we assume that the values of are known only on . We denote the restriction of to as . For each , we denote , , , then we define , and . Of course, and .
Given an SBM model where the parameters are unknown, these parameters can be approximated in all of the usual ways utilizing a graph realized from SBM. First, can be consistently estimated by spectral methods (such as in [12, 51]). Alternatively, since is observed, we would be observing if we knew that was a surjective function. Given , and assuming that the vertex memberships were realized via a multinomial distribution, then can be estimated by , for each . Then, for any such that , we can estimate by the number of edges in the bipartite subgraph induced by , divided by ; i.e.,
For , we can estimate by the number of edges in the subgraph induced by , divided by ; i.e.,
In simulations, when it is useful or simplifying to do so, we assume that the model parameters , , are known. Else, they are estimated as above.
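The seed-based estimates described above can be sketched as follows. This is only an illustration under stated assumptions: block sizes are estimated by scaling the seed proportions up to the whole graph, and each edge probability is estimated as the observed edge density among the relevant seeds. The function name `estimate_sbm_parameters` and its arguments are hypothetical.

```python
from itertools import combinations


def estimate_sbm_parameters(adj, seed_blocks, n_blocks):
    """Estimate block sizes and edge probabilities from the seeds.

    adj: adjacency matrix as nested lists.
    seed_blocks: dict {vertex index: block label in 1..n_blocks}.
    Returns (size_hat, prob_hat); prob_hat omits block pairs for which
    the seeds provide no vertex pairs to count.
    """
    n = len(adj)
    m = len(seed_blocks)
    seeds_in = {k: [v for v, b in seed_blocks.items() if b == k]
                for k in range(1, n_blocks + 1)}
    # Assumption: block proportions among seeds mirror the whole graph.
    size_hat = {k: n * len(seeds_in[k]) / m for k in seeds_in}
    prob_hat = {}
    for k in range(1, n_blocks + 1):
        for l in range(k, n_blocks + 1):
            if k == l:
                pairs = list(combinations(seeds_in[k], 2))  # within block
            else:
                pairs = [(u, v) for u in seeds_in[k] for v in seeds_in[l]]
            if pairs:
                edges = sum(adj[u][v] for u, v in pairs)
                prob_hat[(k, l)] = prob_hat[(l, k)] = edges / len(pairs)
    return size_hat, prob_hat


# Toy graph on 6 vertices; vertex 5 is ambiguous (not a seed).
edges = [(0, 1), (1, 2), (3, 4), (0, 3)]
adj = [[0] * 6 for _ in range(6)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1
size_hat, prob_hat = estimate_sbm_parameters(
    adj, {0: 1, 1: 1, 2: 1, 3: 2, 4: 2}, n_blocks=2)
```

With few seeds these estimates are highly variable, which is the situation in which the text notes that the need for parameter estimates can hold the canonical sampling scheme back.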
Next, the most general inference task here would be, given observed from SBM and a partially observed block membership function , to estimate the parameter ; that is, to estimate the remaining unobserved block memberships. Indeed, there are a host of graph clustering algorithms that could be used for this purpose; see, for example, [41, 40, 45, 5, 36, 48] among others. However, in the vertex nomination [32, 8, 44, 7, 11] setting of this manuscript, the task of interest is much more specialized. We assume that there is only one block “of interest”—without loss of generality it is —and we want to prioritize ambiguous vertices per the possibility of being from . Specifically, our task is, given an observed and a partially observed block membership function , to order the ambiguous vertices into a list such that there would be an abundance of vertices from that appear as near to the top of the list as can be achieved. More formally:
Given and , a vertex nomination scheme is a function where is the set of all graphs on vertex set , and is the set of all orderings of the set . For any given , denote the ordering of as ; this ordering is also called the nomination list associated with and .
The effectiveness of a vertex nomination scheme is quantified in the following manner. Given a realization of SBM and the partially observed block membership function , and for any integer , define the precision at depth of the list to be
that is, the fraction of the first vertices on the nomination list that are in the block of interest, . The average precision of the list is defined to be the average of the precisions at depths ; that is, it is equal to
Of course, average precision is defined for a particular instantiation of , and hence does not capture the behavior of as varies in the SBM model. To account for this, we define the mean average precision, the metric by which we will evaluate our vertex nomination schemes:
Let SBM. The mean average precision of a vertex nomination scheme is defined to be
where the expectation is taken over the underlying probability space, the sample space being .
It is immediate that, for any given vertex nomination scheme , the mean average precision satisfies , with values closer to indicating a more successful nomination scheme; i.e., a higher concentration of vertices from near the top of the nomination list.
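The following short sketch illustrates precision at a given depth and average precision for a single nomination list. It assumes that the average is taken over depths 1 up to the number of ambiguous vertices that truly belong to the block of interest, so that a perfect list scores 1; the function and argument names are illustrative. The mean average precision would then be approximated by averaging this quantity over many simulated graphs.

```python
def precision_at_depth(nomination_list, in_block_of_interest, depth):
    """Fraction of the first `depth` nominees in the block of interest."""
    top = nomination_list[:depth]
    return sum(1 for v in top if in_block_of_interest[v]) / depth


def average_precision(nomination_list, in_block_of_interest, max_depth=None):
    """Average of the precisions at depths 1, ..., max_depth.

    Assumption: by default max_depth is the number of listed vertices
    that truly belong to the block of interest.
    """
    if max_depth is None:
        max_depth = sum(1 for v in nomination_list if in_block_of_interest[v])
    if max_depth == 0:
        return 0.0  # nothing of interest to find
    total = sum(precision_at_depth(nomination_list, in_block_of_interest, i)
                for i in range(1, max_depth + 1))
    return total / max_depth


truth = {"a": True, "b": False, "c": True, "d": False}
ap_mixed = average_precision(["a", "b", "c", "d"], truth)    # (1 + 1/2) / 2
ap_perfect = average_precision(["a", "c", "b", "d"], truth)  # (1 + 1) / 2
ap_worst = average_precision(["b", "d", "a", "c"], truth)    # (0 + 0) / 2
```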
In the literature, mean average precision is often defined as the integral of the precision over recall, i.e., it is defined as
Herein, we focus on the definition of mean average precision provided in Definition 3 because, in the vertex nomination setting, recall is not as important as precision; the goal is explicitly to have an abundance of vertices of interest at the top of the list, and less explicitly about wanting all the vertices of interest to be high in the list.
3 Extending the vertex nomination schemes
In this section, we extend the canonical vertex nomination scheme (described in Section 3.1) to a “sampling” version (defined in Section 3.2), and we extend the spectral partitioning vertex nomination scheme (described in Section 3.3) to (defined in Section 3.4).
3.1 The canonical vertex nomination scheme
The canonical vertex nomination scheme , introduced in the paper , is defined to be the vertex nomination scheme which orders the ambiguous vertices of according to the order of their conditional probability—conditioned on —of being members of the block of interest . Indeed, it is intuitively clear why this would be an excellent (in fact, optimal) nomination scheme. However, since is a parameter, this conditional probability is not yet meaningfully defined. We therefore expand the probability space of the SBM model given in Section 2, and construct a probability measure for which the canonical vertex nomination scheme can be meaningfully defined, with its requisite conditional probabilities. The probability measure is constructed as follows:
Define to be the collection of functions such that for all , and such that for all . Also, recall that is the set of all graphs on . The probability measure has sample space , and it is sampled from by first choosing discrete-uniform randomly and then, conditioned on , is chosen from the distribution SBM. So, for all , ,
where is defined as the number of edges in such that of one endpoint is and of the other endpoint is , and we define if , and . This probability measure, with uniform marginal distribution on , reflects our situation where we have no prior knowledge of specific block membership for the ambiguous vertices (beyond block sizes). Note that is an intermediate measure used to show that is optimal as stated in Theorem 5.
The first step in the canonical nomination scheme is to update this uniform distribution over block membership functions to reflect what is learned from the realization of the graph. Indeed, conditioning on any , the conditional sample space of collapses to become and, for any , we have by Bayes’ Rule that
In all that follows in this subsection, let respectively denote the random graph and the random function, together distributed as ; in particular, the random is -valued, and the random is -valued. For each , the event is the event and, by Bayes’ Rule,
The canonical vertex nomination scheme is then defined as ordering the vertices in by decreasing value of (with ties broken arbitrarily);
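For very small graphs the canonical scheme can be computed by brute force, which may help make the definition concrete. The sketch below is an illustration under stated assumptions and not the authors' implementation: it enumerates every block membership function that agrees with the seeds and has the prescribed number of ambiguous vertices per block, weights each by the likelihood of the observed graph (the uniform prior cancels), and ranks ambiguous vertices by their posterior probability of lying in block 1. All names are hypothetical.

```python
from itertools import combinations


def assignments(vertices, counts):
    """Yield every dict vertex -> block with counts[k] vertices in block k."""
    blocks = sorted(counts)

    def rec(remaining, idx):
        if idx == len(blocks):
            yield {}
            return
        k = blocks[idx]
        for chosen in combinations(remaining, counts[k]):
            rest = [v for v in remaining if v not in chosen]
            for tail in rec(rest, idx + 1):
                out = dict(tail)
                out.update({v: k for v in chosen})
                yield out

    return rec(list(vertices), 0)


def likelihood(adj, phi, edge_prob):
    """Probability of the observed graph given block memberships phi."""
    n = len(adj)
    value = 1.0
    for u in range(n):
        for v in range(u + 1, n):
            p = edge_prob[phi[u] - 1][phi[v] - 1]
            value *= p if adj[u][v] else 1.0 - p
    return value


def canonical_nomination(adj, seed_blocks, ambiguous_counts, edge_prob):
    """Rank ambiguous vertices by posterior probability of block 1."""
    ambiguous = [v for v in range(len(adj)) if v not in seed_blocks]
    weight_in_1 = {v: 0.0 for v in ambiguous}
    total = 0.0
    for assign in assignments(ambiguous, ambiguous_counts):
        w = likelihood(adj, {**seed_blocks, **assign}, edge_prob)
        total += w
        for v in ambiguous:
            if assign[v] == 1:
                weight_in_1[v] += w
    posterior = {v: weight_in_1[v] / total for v in ambiguous}
    ranking = sorted(ambiguous, key=lambda v: -posterior[v])
    return ranking, posterior


# Seeds: vertex 0 in block 1, vertex 1 in block 2; edges 0-2 and 1-3.
adj = [[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
ranking, posterior = canonical_nomination(
    adj, {0: 1, 1: 2}, {1: 1, 2: 1}, [[0.9, 0.1], [0.1, 0.9]])
```

The number of membership functions enumerated grows combinatorially with the number of ambiguous vertices, which is why this direct computation is only feasible for graphs with a few tens of vertices.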
In  it is proved that the canonical vertex nomination scheme is an optimal vertex nomination scheme, in the sense of Theorem 5. We include the proof of Theorem 5 for completeness, and to reflect minor changes in our setting from the setting in .
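For any stochastic block model and any vertex nomination scheme, it holds that the mean average precision of the canonical vertex nomination scheme is at least as large as the mean average precision of that scheme.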
Proof of Theorem 5: A first (of two) preliminary observation: For each , define and then, for each of , define . Note that the sequence of ’s is nonnegative and nonincreasing. Thus, for any other nonnegative and nonincreasing sequence of real numbers and any rearrangement of the sequence , it is straightforward to see that
A second preliminary observation: Suppose that are isomorphic via the isomorphism from the vertices of to the vertices of , such that is the identity function on . It is easy to see that the canonical vertex nomination scheme ranks vertices as if they were unlabeled; that is, for all . Thus, by symmetry in and the fixed values of , we have, for any , that
Without loss of generality, we may assume that the vertex nomination scheme also satisfies Equation (9).
Now, to the main line of reasoning in the proof:
From this, we have
which completes the proof of Theorem 5. ∎
3.2 The canonical sampling vertex nomination scheme
The formula in Equation 6 can be directly used to compute for all , to obtain the ordering that defines the canonical vertex nomination scheme , but due to the burgeoning number of summands in the numerator and in the denominator of Equation 6, this direct approach is computationally intractable, feasible only when the number of vertices is on the order of a few tens. We next introduce an extension of the canonical vertex nomination scheme called the canonical sampling vertex nomination scheme . The purpose of the canonical sampling vertex nomination scheme is to provide a computationally tractable estimate of , for all . The nomination list for consists of the vertices ordered by nonincreasing values of , exactly as the nomination list for consists of the vertices ordered by nonincreasing values of .
Given the realized graph instance of the random graph, we obtain the approximation of the conditional probability, for every ambiguous vertex, by sampling block membership functions from the probability space conditioned on the realized graph; then, for each ambiguous vertex, the approximation is defined as the fraction of the sampled functions that map that vertex to the block of interest. The formula for the conditional probability distribution is given in Equation 5; unfortunately, straightforward sampling from this distribution is hampered by the intractability of directly computing the denominator of Equation 5. Fortunately, sampling in this setting can be achieved via Metropolis-Hastings Markov chain Monte Carlo. For relevant background on Markov chain Monte Carlo, see, for example, [17, Chapter 11] or [2, Chapter 11].
The base chain that we employ in our Markov chain Monte Carlo approach is the well-studied Bernoulli-Laplace diffusion model . The state space for the Markov chain is the set , and the one-step transition probabilities, denoted P, for this chain are defined, for all , as
where . In other words, if at state , a move transpires as follows. A pair of vertices is chosen discrete-uniformly at random, conditional on the fact that , and then the move is to state , which is defined as agreeing with , except that and are defined respectively as and . We will see shortly that the simplicity of this base chain greatly simplifies the computations needed to employ Metropolis-Hastings with target distribution .
The Metropolis-Hastings chain has state space , and one-step transition probabilities, defined for all as
In other words, if at state , a candidate state is proposed according to and is independently accepted as the next state of the Markov chain with probability . It is immediate that the stationary distribution for is and that the chain is reversible with respect to ; that is, for any , .
Note that the simplicity of the underlying base chain greatly aids in the speedy computation of during the computation of transition probabilities . Indeed, since and for which we might want to compute are such that , we would have that and differ only on two vertices, call them ,, and say that and are such that and . Then
This reduces the number of operations to compute from the naive down to . As an implementation note, in practice we would utilize a logarithm to convert Equation 10 from a multiplicative expression into an additive expression, which will greatly reduce roundoff error that can arise when working with numbers that are orders of magnitude different from each other.
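The local update described above can be illustrated with a short sketch that works directly in the log domain. It assumes a symmetric edge probability matrix with entries strictly between 0 and 1, and the function names are hypothetical. When two vertices swap block labels, only the pairs that involve exactly one of them change their contribution, so the log-likelihood difference needs a single pass over the remaining vertices rather than a pass over all pairs.

```python
import math


def pair_loglik(edge, k, l, edge_prob):
    """Log-probability of one vertex pair given its two block labels."""
    p = edge_prob[k - 1][l - 1]
    return math.log(p) if edge else math.log(1.0 - p)


def full_loglik(adj, phi, edge_prob):
    """Naive log-likelihood: one term for each unordered vertex pair."""
    n = len(adj)
    return sum(pair_loglik(adj[u][v], phi[u], phi[v], edge_prob)
               for u in range(n) for v in range(u + 1, n))


def swap_delta_loglik(adj, phi, u, v, edge_prob):
    """Change in log-likelihood when u and v exchange block labels.

    The pair (u, v) itself contributes nothing to the difference because
    edge_prob is assumed symmetric.
    """
    k, l = phi[u], phi[v]
    delta = 0.0
    for w in range(len(adj)):
        if w == u or w == v:
            continue
        b = phi[w]
        delta += (pair_loglik(adj[u][w], l, b, edge_prob)
                  - pair_loglik(adj[u][w], k, b, edge_prob))
        delta += (pair_loglik(adj[v][w], k, b, edge_prob)
                  - pair_loglik(adj[v][w], l, b, edge_prob))
    return delta


edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 4)]
adj = [[0] * 5 for _ in range(5)]
for a, b in edges:
    adj[a][b] = adj[b][a] = 1
phi = [1, 1, 2, 2, 1]        # current block labels
swapped = [1, 2, 1, 2, 1]    # labels after swapping vertices 1 and 2
edge_prob = [[0.7, 0.2], [0.2, 0.6]]
delta = swap_delta_loglik(adj, phi, 1, 2, edge_prob)
```

The acceptance probability of the proposed swap is then the minimum of 1 and the exponential of `delta`, so no product of many small probabilities is ever formed.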
Now, the canonical sampling vertex nomination scheme is defined in the exact same manner as , except that, for all , the value is approximated as follows. Denoting the Metropolis-Hastings Markov chain by , we set Uniform). After evolving the chain past a “burn-in” period, , we approximate via
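for a predetermined number of Metropolis-Hastings steps. For fixed , we then have, as an immediate consequence of the Ergodic Theorem (see, for example, [2, Chapter 2, Theorem 1]), that the approximation converges to the exact conditional probability as the number of steps tends to infinity, for each ambiguous vertex.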
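Putting the pieces of this section together, a minimal sketch of the sampling procedure might look as follows. It is an illustration under stated assumptions rather than the authors' code: the edge probabilities are taken as known, symmetric, and strictly between 0 and 1; the proposal swaps the labels of two ambiguous vertices chosen uniformly among pairs with different labels (a symmetric proposal, since the number of such pairs is the same in every state); and all names and default step counts are hypothetical.

```python
import math
import random


def _pair_loglik(edge, k, l, edge_prob):
    p = edge_prob[k - 1][l - 1]
    return math.log(p) if edge else math.log(1.0 - p)


def _swap_delta(adj, phi, u, v, edge_prob):
    """Log-likelihood change when u and v exchange block labels."""
    k, l = phi[u], phi[v]
    delta = 0.0
    for w in range(len(adj)):
        if w != u and w != v:
            b = phi[w]
            delta += (_pair_loglik(adj[u][w], l, b, edge_prob)
                      - _pair_loglik(adj[u][w], k, b, edge_prob)
                      + _pair_loglik(adj[v][w], k, b, edge_prob)
                      - _pair_loglik(adj[v][w], l, b, edge_prob))
    return delta


def canonical_sampling(adj, seed_blocks, ambiguous_counts, edge_prob,
                       burn_in=1000, n_steps=5000, seed=None):
    """Estimate P(vertex in block 1 | graph) by Metropolis-Hastings."""
    rng = random.Random(seed)
    ambiguous = [v for v in range(len(adj)) if v not in seed_blocks]
    labels = [k for k, c in sorted(ambiguous_counts.items()) for _ in range(c)]
    if len(set(labels)) < 2:
        raise ValueError("need ambiguous vertices in at least two blocks")
    rng.shuffle(labels)  # uniformly random initial state
    phi = dict(seed_blocks)
    phi.update(zip(ambiguous, labels))
    hits = {v: 0 for v in ambiguous}
    for step in range(burn_in + n_steps):
        while True:  # uniform over pairs with different labels
            u, v = rng.sample(ambiguous, 2)
            if phi[u] != phi[v]:
                break
        delta = _swap_delta(adj, phi, u, v, edge_prob)
        if delta >= 0 or rng.random() < math.exp(delta):
            phi[u], phi[v] = phi[v], phi[u]  # accept the proposed swap
        if step >= burn_in:
            for w in ambiguous:
                hits[w] += phi[w] == 1
    estimate = {v: hits[v] / n_steps for v in ambiguous}
    ranking = sorted(ambiguous, key=lambda v: -estimate[v])
    return ranking, estimate


# Seeds: vertex 0 in block 1, vertex 1 in block 2; edges 0-2 and 1-3.
adj = [[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
ranking, estimate = canonical_sampling(
    adj, {0: 1, 1: 2}, {1: 1, 2: 1}, [[0.9, 0.1], [0.1, 0.9]], seed=0)
```

Because every state of the chain places the same number of ambiguous vertices in block 1, the estimates always sum to that number; increasing `n_steps` trades computation for accuracy.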
3.3 The spectral partitioning vertex nomination scheme
As in Section 2, we assume here that the graph is realized from an SBM distribution, where is known. Furthermore, we assume that the values of the block membership function are known only on the set of seeds , and are not known on the set of ambiguous vertices . In contrast to Section 3.1, here we do not need to assume that and are explicitly known or estimated, except that is known, or an upper bound for is known. As before, say that is the “block of interest.”
The spectral partitioning vertex nomination scheme is computed in three stages: first is the adjacency spectral embedding of the graph, then clustering of the embedded points, and then ranking the ambiguous vertices into the nomination list. (The first two of these stages are collectively called adjacency spectral clustering; for a good reference, see .) We begin by describing the first stage, adjacency spectral embedding:
Let graph have adjacency matrix , and suppose has eigendecomposition
and the diagonal of is composed of the greatest eigenvalues of in nonincreasing order. The -dimensional adjacency spectral embedding of is then given by In particular, for each , the row of corresponding to , denoted , is the embedding of into .
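As an illustration of the embedding stage, the sketch below computes an adjacency spectral embedding with NumPy. It assumes the common convention in which the embedding is the matrix of leading eigenvectors scaled by the square roots of the corresponding eigenvalues; taking the absolute value before the square root is a guard added here for the case of a negative selected eigenvalue and is not part of the text. The function name is hypothetical.

```python
import numpy as np


def adjacency_spectral_embedding(adj, d):
    """Embed each vertex of a symmetric adjacency matrix into R^d.

    Uses the d greatest eigenvalues (in nonincreasing order) and their
    eigenvectors; row i of the result is the embedding of vertex i.
    """
    A = np.asarray(adj, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d]    # indices of the d greatest
    scale = np.sqrt(np.abs(eigvals[top]))  # guard: abs() is an assumption
    return eigvecs[:, top] * scale


# Complete graph on 4 vertices: top eigenvalue 3 with a constant eigenvector.
A = np.ones((4, 4)) - np.eye(4)
X = adjacency_spectral_embedding(A, d=1)
```

The sign of each embedding column is arbitrary, as with any eigenvector computation; distances between embedded points, which are all that the later stages use, are unaffected.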
After the adjacency spectral embedding, the second stage is to cluster the embedded vertices—i.e. the associated points in —using the -means clustering algorithm . The clusters so obtained are estimates of the different blocks, and the cluster containing the most vertices from is an estimate of the block of interest ; let denote the centroid of this cluster. (Note that this clustering step, as described here for , is fully unsupervised, not taking advantage of the observed memberships of the vertices in . In Section 3.4, incorporating these labels into a semi-supervised clustering step is a natural way to extend and improve performance.)
The third stage is ranking the ambiguous vertices into the nomination list; the vertices are nominated based on their Euclidean distance from , the centroid of the cluster which is the estimate for the block of interest. Specifically, define:
For definiteness, any ties in the above procedure should be broken by choosing uniform-randomly from the choices. This concludes the definition of the spectral partitioning vertex nomination scheme.
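The clustering and ranking stages can be sketched as follows, starting from already embedded points. This is a simplified illustration and not the authors' implementation: the clustering is a plain Lloyd iteration with a deterministic farthest-point initialization, ties are broken by vertex order rather than at random, and at least one seed from the block of interest is assumed. All names are hypothetical.

```python
import numpy as np


def kmeans(X, K, n_iter=100):
    """Plain Lloyd iteration with farthest-point initialization."""
    centroids = [X[0]]
    for _ in range(1, K):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[int(np.argmax(d2))])
    C = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_C = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                          else C[k] for k in range(K)])
        if np.allclose(new_C, C):
            break
        C = new_C
    return labels, C


def spectral_nomination(X, seed_blocks, K):
    """Rank ambiguous vertices by distance to the block-of-interest centroid.

    X: embedded points, one row per vertex.
    seed_blocks: dict {vertex index: block label}; block 1 is of interest.
    """
    X = np.asarray(X, dtype=float)
    labels, centroids = kmeans(X, K)
    seeds_of_interest = [v for v, b in seed_blocks.items() if b == 1]
    # The cluster holding the most block-1 seeds estimates the block.
    votes = np.bincount(labels[seeds_of_interest], minlength=K)
    centroid = centroids[int(votes.argmax())]
    ambiguous = [v for v in range(len(X)) if v not in seed_blocks]
    dist = np.linalg.norm(X[ambiguous] - centroid, axis=1)
    order = np.argsort(dist, kind="stable")
    return [ambiguous[i] for i in order]


# Two well-separated groups; vertex 0 seeds block 1, vertex 4 seeds block 2.
X = [[0, 0], [1, 0], [0, 3], [3, 1], [20, 20], [21, 20], [20, 22], [18, 19]]
nomination_list = spectral_nomination(X, {0: 1, 4: 2}, K=2)
```

Note that the seed labels are used only to pick which cluster estimates the block of interest; the clustering itself ignores them, which is the limitation the semi-supervised extension addresses.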
If is unknown, singular value thresholding can be used to estimate from a partial scree plot . We note that the results of suggest that there will be little performance lost if is moderately overestimated. Additionally, an unknown can be estimated by optimizing the silhouette width of the resulting clustering . A key advantage of the spectral nomination scheme is that, unlike , and need not be estimated before applying the scheme.
3.4 The extended spectral partitioning vertex nomination scheme
In this section, we extend the spectral partitioning vertex nomination scheme (described in the previous section) to the extended spectral partitioning vertex nomination scheme. Just like the spectral partitioning scheme, computing the extended spectral partitioning vertex nomination scheme starts with adjacency spectral embedding. Whereas the next stage of the spectral partitioning scheme is unsupervised clustering using the k-means algorithm, the extended scheme will instead utilize a semi-supervised clustering procedure which we describe below.
There are numerous ways to incorporate the known block memberships of the seeds into the clustering step of adjacency spectral clustering (see, for example, [49, 56]). The results of suggest that, for each vertex of the graph, the distribution of the vertex’s embedding is approximately normal, with parameters that depend only on which block the vertex is a member of, and this normal approximation gets closer to exact as the number of vertices grows. We thus model the graph’s embedded vertices as independent draws from a -component Gaussian mixture model (except for the seed vertices, where the Gaussian component is specified); i.e., there exists a fixed nonnegative vector satisfying , and for each , there exists and such that, independently for each vertex , the block of is with respective probabilities , and then, conditioning on model block membership (say the block of is ), the distribution of is Normal. If denotes the sequence of mean vectors , denotes the sequence of covariance matrices , and (random) denotes the Gaussian mixture model block membership function (i.e., for each and , it holds that precisely when the Gaussian mixture model places in block ), then the complete data log-likelihood function can be written as
which meaningfully incorporates the seeding information contained in .
If is known (indeed, it was assumed to be known in the formulation of , but was not assumed to be known in the formulation of ) then, for each , we would substitute in place of .
With this model in place, it is natural to cluster the rows of using a (semi-supervised) Gaussian mixture model (GMM) clustering algorithm rather than the (unsupervised) -means employed by the spectral partitioning scheme. We now return to the description of the extended spectral partitioning vertex nomination scheme after the first stage (adjacency spectral embedding) has been performed. The next stage, clustering, can be cast as the problem of uncovering the latent ’s as are present in the log-likelihood in Equation 12. We employ a semi-supervised modification of the model-based Mclust Gaussian mixture model methodology of [13, 14]; we call this modification ssMclust. Note that ssMclust first appeared in , and we include a brief outline of its implementation below for the sake of completeness.
As in , ssMclust uses the expectation-maximization (EM) algorithm to approximately find the maximum likelihood estimates of Equation 12, denote them by . For each , the cluster of (which is an estimate for the block of ) is then set to be
Details of the implementation of the semi-supervised EM algorithm can be found in [34, 55, 56], and are omitted here for brevity. We note here that we initialize the class assignments in the EM algorithm by first running the semi-supervised -means++ algorithm of  on . This initialization, in practice, has the effect of greatly reducing the running time of the EM step in ssMclust; see .
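To convey the idea of semi-supervised EM, the sketch below fits a Gaussian mixture in which the responsibilities of the seeds are clamped to their known blocks. It is a deliberately simplified stand-in and not ssMclust: it assumes spherical covariances with one variance per component, initializes the means from the seeds instead of from a semi-supervised k-means++ run, performs no model selection, and requires at least one seed in every block. All names are hypothetical.

```python
import numpy as np


def semi_supervised_gmm(X, seed_blocks, K, n_iter=50, var_floor=1e-6):
    """EM for a spherical Gaussian mixture with seed labels held fixed.

    X: embedded points, one row per vertex.
    seed_blocks: dict {vertex index: block label in 1..K}.
    Returns (cluster label in 1..K for every vertex, component means).
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    seeds = np.array(sorted(seed_blocks))
    seed_lab = np.array([seed_blocks[v] - 1 for v in seeds])
    # Initialization from the seeds alone (an assumption of this sketch).
    mu = np.array([X[seeds[seed_lab == k]].mean(axis=0) for k in range(K)])
    var = np.full(K, max(X.var(axis=0).mean(), var_floor))
    pi = np.full(K, 1.0 / K)

    def e_step():
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_r = np.log(pi) - 0.5 * d * np.log(2 * np.pi * var) - 0.5 * sq / var
        log_r -= log_r.max(axis=1, keepdims=True)  # numerical stability
        R = np.exp(log_r)
        R /= R.sum(axis=1, keepdims=True)
        R[seeds] = 0.0                 # clamp seeds to their known block
        R[seeds, seed_lab] = 1.0
        return R

    for _ in range(n_iter):
        R = e_step()
        Nk = R.sum(axis=0)             # positive: each block has a seed
        pi = Nk / n
        mu = (R.T @ X) / Nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = np.maximum((R * sq).sum(axis=0) / (d * Nk), var_floor)
    return e_step().argmax(axis=1) + 1, mu


# Two well-separated groups; vertex 0 seeds block 1, vertex 4 seeds block 2.
X = [[0, 0], [1, 0], [0, 3], [3, 1], [20, 20], [21, 20], [20, 22], [18, 19]]
clusters, means = semi_supervised_gmm(X, {0: 1, 4: 2}, K=2)
```

Because the seed responsibilities never change, component k stays tied to block k throughout, so the fitted clusters can be read directly as block estimates without a separate matching step.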
|Model|Volume|Shape|Orientation|
|---|---|---|---|
|EII|Equal|Equal, spherical|Coordinate axes|
|VII|Varying|Equal, spherical|Coordinate axes|
|EEI|Equal|Equal, ellipsoidal|Coordinate axes|
|VEI|Varying|Equal, ellipsoidal|Coordinate axes|
|EVI|Equal|Varying, ellipsoidal|Coordinate axes|
|VVI|Varying|Varying, ellipsoidal|Coordinate axes|