1 Introduction and background
Given a pair of graphs and vertices of interest in the first graph, the aim of the vertex nomination (VN) problem is to rank the vertices of the second graph into a nomination list, with the vertices corresponding to the vertices of interest concentrating at the top of the list.
In recent years, a host of VN procedures have been introduced (see, for example, [coppersmith2012vertex, marchette2011vertex, LeePri2012, FisLyzPaoChePri2015, patsolic2017vertex, yoder2018vertex]) that have proven to be effective information retrieval tools in both synthetic and real data applications.
Moreover, recent work establishing a fundamental statistical framework for VN has led to a novel understanding of the limitations of VN efficacy in evolving network environments [lyzinski2017consistent].
Herein, we consider a general statistical model for adversarial contamination in the context of vertex nomination, in which the adversary can randomly add or remove edges and/or vertices in the network, and we examine the effect of both this contamination and subsequent data regularization (effectively, removing outlier nodes) on VN performance. To motivate our mathematical and statistical results, we first consider an illustrative real data example in Section 1.1, in which we demonstrate the following: a VN scheme that works effectively; network contamination adversely impacting the performance of our VN scheme; and network regularization successfully mitigating the impact of the contamination. We provide a more thorough background of the relevant literature after the motivating example, in Section 1.2.
1.1 Motivating example
Consider the pair of high school friendship networks in [mastrandrea2015contact]: in the first, each vertex represents a student, and two vertices are adjacent if the two students made contact with each other at school in a given time period; in the second, each vertex again represents a student, and two vertices are adjacent if the two students are friends on Facebook. A subset of the students appears in both networks, and we pose the VN problem here as follows: given a student-of-interest in the first network, can we nominate the corresponding student (if they exist) in the second? We note that the vertex nomination approach outlined below easily adapts to the multiple vertices of interest (v.o.i.) scenario (i.e., given several students-of-interest in the first network, can we nominate the corresponding students, if they exist, in the second?), and we will provide the necessary details for handling both single and multiple v.o.i. below.
In an idealized data setting, all students would appear in both graphs, as this would potentially maximize the signal present in the correspondence of labels across graphs. This bears out in the following illustrative VN experiment. Consider the following simple VN scheme: given a vertex (or vertices) of interest in the first graph and a set of seeded vertices (seeds here represent vertices whose identity across networks is known a priori), proceed as follows (see Section LABEL:sec:asegmm for full detail):
1. Use Adjacency Spectral Embedding (ASE) [sussman2014consistent] to separately embed and into a common Euclidean space ;
2. Solve the orthogonal Procrustes problem [schonemann1966generalized] to find an orthogonal transformation aligning the seeded vertices across graphs; use this transformation to align the embeddings of and in ;
3. Use model-based Gaussian mixture modeling (GMM; e.g., the R package MClust [fraley1999mclust]) to simultaneously cluster the vertices of the embedded graphs. For a pair of clustered points in this embedding, with respective covariance matrices given by their components of the GMM, compute the sum of the two Mahalanobis distances between the points, each taken with respect to one of the two component covariances.
4. In the single v.o.i. setting, rank the candidate v.o.i. in by increasing value of (so that the smallest are ranked first). In the multiple v.o.i. setting, we rank the candidate v.o.i. in by increasing value of .
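The pipeline above can be sketched in a few lines. This is a minimal illustration, not the implementation used in our experiments: it ranks candidates by plain Euclidean distance in the aligned embedding space rather than the GMM-based Mahalanobis distance of Step 3, and all function names are our own.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes
from scipy.sparse.linalg import eigsh

def ase(A, d):
    """Adjacency spectral embedding: top-d scaled eigenvectors of A."""
    vals, vecs = eigsh(A.astype(float), k=d, which="LA")
    return vecs * np.sqrt(np.abs(vals))

def nominate(A1, A2, seeds1, seeds2, voi, d=3):
    """Rank the vertices of graph 2 as candidate matches for the
    vertex of interest `voi` in graph 1 (best candidate first)."""
    X1, X2 = ase(A1, d), ase(A2, d)
    # Step 2: orthogonal Procrustes alignment on the seed pairs
    Q, _ = orthogonal_procrustes(X2[seeds2], X1[seeds1])
    X2 = X2 @ Q
    # Rank candidates by distance to the embedded v.o.i.
    return np.argsort(np.linalg.norm(X2 - X1[voi], axis=1))
```

Here `seeds1[i]` and `seeds2[i]` index the same underlying vertex in the two graphs; the returned array lists the vertices of the second graph in nomination order.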
Figure 1: (a) Mean number of v.o.i. achieving a given rank versus rank; (b) chance-normalized mean number of v.o.i. achieving a given rank versus rank.
We can consider running the above procedure in the idealized data setting where we only consider the induced subgraphs of and containing the common vertices across graphs (call these graphs and ), and we can also consider running the procedure in the setting where the vertices in without matches across graphs are added to as a form of contamination. These unmatchable vertices can have the effect of obfuscating the correspondence amongst the common vertices across graphs, and thus can diminish VN performance. Indeed, we see this play out in Figure 1.
In Figure 1, we plot the performance of the scheme averaged over random seed sets. In the left panel, the x-axis shows the rank in the nomination list, and the y-axis shows the mean (± 2 s.e.) number of vertices that, when viewed as the lone v.o.i., had their corresponding vertex of interest ranked at or above that rank. The right panel shows the same results normalized by chance performance, plotting the ratio of achieved to chance performance versus rank. The gold line represents performance in the idealized networks, and the red line represents performance in the contaminated network pair. We see that the contamination detrimentally affects performance at all levels: at every rank, the number of v.o.i. with their corresponding v.o.i. ranked at or above that rank in the second graph is larger in the idealized pair than in the contaminated pair.
How can we mitigate the effect of the contamination? Network regularization is a natural solution, and here we consider as a regularization strategy the network analogue of the classical trimmed mean estimator. To wit, we consider the regularization procedure in Algorithm 1, inspired by the network trimming procedure in [edge2018trimming]; see also [le2017concentration] for the impact of trimming regularization on random graph concentration.
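Algorithm 1 itself is not reproduced in this excerpt; as a rough sketch of the kind of degree-based trimming it describes (in the spirit of [edge2018trimming]), one can drop vertices whose degrees fall outside chosen quantiles of the degree sequence. The quantile parameterization below is our own illustrative choice.

```python
import numpy as np

def trim_graph(A, low_q=0.0, high_q=1.0):
    """Degree-trimming regularization: keep only vertices whose degree
    lies between the low_q and high_q quantiles of the degree sequence.
    Returns the trimmed adjacency matrix and the indices kept."""
    deg = A.sum(axis=1)
    lo, hi = np.quantile(deg, [low_q, high_q])
    keep = np.flatnonzero((deg >= lo) & (deg <= hi))
    return A[np.ix_(keep, keep)], keep
```

Trimming both tails mirrors the trimmed mean: low-degree vertices are often disconnected junk, while very high-degree vertices can act as hubs that blur community structure.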
The two trimming parameters appearing in Algorithm 1 are unknown a priori; to choose them data-adaptively, we sweep over possible values and select the pair that leads to the maximum network modularity in the trimmed graph when clustering its vertices (i.e., embed the trimmed graph using ASE and cluster the embedding using a model-based GMM procedure). Given a clustering $\mathcal{C}$, the modularity is defined as usual via
$$Q(\mathcal{C}) = \frac{1}{2m}\sum_{i,j}\left(A_{ij} - \frac{d_i d_j}{2m}\right)\mathbb{1}\{c(i) = c(j)\},$$
where $m$ is the number of edges in the graph; $A_{ij}$ is the $(i,j)$-th element of its adjacency matrix; $d_i$ is the degree of vertex $i$; and $c(i)$ is the cluster containing vertex $i$ in $\mathcal{C}$.
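As a sketch, the usual Newman modularity of a clustering can be computed directly from the adjacency matrix; `labels[i]` below plays the role of the cluster assignment of vertex i.

```python
import numpy as np

def modularity(A, labels):
    """Newman modularity: (1/2m) * sum_ij (A_ij - d_i d_j / 2m)
    over pairs (i, j) assigned to the same cluster."""
    deg = A.sum(axis=1)
    two_m = deg.sum()                       # twice the edge count
    same = labels[:, None] == labels[None, :]
    return ((A - np.outer(deg, deg) / two_m) * same).sum() / two_m
```

As a sanity check, two disjoint triangles split into their two natural communities have modularity 1/2.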
In the left panel of Figure 2, we plot the modularity of the GMM clustering of the trimmed graph as a function of the trimming parameters. Note that we average the modularity values over seed sets (the same seed sets as used in Figure 1). The color indicates the value of the modularity, with darker red indicating lower values and lighter yellow-to-white indicating larger values. From the figure, we can see that modularity is maximized when no large-degree vertices are trimmed. We note that this trimming process can cut core vertices as well as junk vertices, and core vertices cut from the graph can never be recovered by the nomination scheme. This is demonstrated in the right panel of Figure 2, where the horizontal asymptote for each trimming value indicates the maximum number of core vertices that are recoverable after regularization.
In Figure 1, we see the effect of regularization play out. Indeed, mean performance in the regularized setting increases relative to the contaminated setting over an initial range of ranks, whereas mean regularized performance decreases thereafter. From the right panel, we observe that mean performance in the regularized setting is significantly better than chance, while over-regularizing induces worse-than-chance performance (the pink line in panel b of Figure 1). While over-regularizing can adversely affect performance, this data-adaptive regularization, though not fully recovering the performance of the idealized setting, nonetheless effectively mitigates the impact of the contamination on our algorithm.
1.1.1 The role of seeds
Figure 1 shows performance of averaged over randomly chosen seed sets of size . While performance, on the whole, increases with proper regularization, the story can vary wildly from seed set to seed set. To demonstrate this, we plot the performance of over two particular seed sets (out of the total used in Figure 1) in Figure 3. In the top panels, we plot performance in the setting of “bad” seeds; i.e., those seeds for which the regularization is unable to effectively mitigate the performance loss due to contamination. In the bottom panels, we plot performance in the setting of “good” seeds; i.e., those seeds for which the contamination negatively impacts performance, but subsequent regularization is able to effectively mitigate this performance loss. These two figures (and their respective chance normalizations in the right panels) point to the primacy of seed selection and of understanding what differentiates “good” versus “bad” seeds. While a full exploration of this is beyond the scope of the present text, this is an active area of our work.
Figure 3: (a) Number of v.o.i. achieving a given rank versus rank and (b) its chance normalization, for a “bad” seed set; (c) number of v.o.i. achieving a given rank versus rank and (d) its chance normalization, for a “good” seed set.
In modern statistics and machine learning, graphs are a common way to take into account the complex relationships between data objects, and graphs have been used in applications across the biological (see, for example, [neu1, neu2, neu3, bio1, bio2, bio3]) and social sciences (see, for example, [socnet1, socnet2, resp1, resp2]). In addition to more traditional statistical inference tasks such as clustering [rohe2011spectral, qin2013dcsbm, networks08:_v, newman2006modularity], classification [vogelstein2013graph, chen2016robust, neu3], and estimation [bickel2013asymptotic, BicChe2009, sussman2014consistent], there has been significant work on more network-specific inference tasks such as graph matching [ConteReview, foggia2014graph, yan2016short] and vertex nomination [marchette2011vertex, coppersmith2014vertex, FisLyzPaoChePri2015].
Loosely speaking, the vertex nomination problem can be stated as follows: given a pair of graphs and vertices of interest in the first, rank the vertices of the second into a nomination list with the corresponding vertices of interest concentrating at the top of the nomination list (see Definition LABEL:def:VN for full detail). While vertex nomination has found applications in a number of different areas, such as social networks [patsolic2017vertex] and data associated with human trafficking [FisLyzPaoChePri2015], there are relatively few results establishing the statistical properties of vertex nomination. In [FisLyzPaoChePri2015], consistency is developed within the stochastic blockmodel random graph framework, where interesting vertices are defined via community membership. In [lyzinski2017consistent], the authors develop the concepts of consistency and Bayes optimality for a very general class of random graph models and a very general definition of what makes the v.o.i. interesting. In this paper, we further develop the ideas in [lyzinski2017consistent], with the aim of building a theoretical regime in which to ground the notion of adversarial contamination in VN.
There has been significant recent attention towards better understanding the impact of adversarial attacks on machine learning methodologies (see, for example, [huang2011adversarial, cai2015robust, papernot2016limitations, adv1, adv2]). Herein, we define an adversarial attack on a machine learning algorithm to be a mechanism that changes the data distribution in order to negatively affect algorithmic performance; see Definition LABEL:def:Adv. From a practical standpoint, adversarial attacks model the very real problem of having data compromised; if an intelligent agent has access to the data and algorithm, the agent may want to modify the data or the algorithm to give the wrong prediction/inferential conclusion. Although there has been much work on adversarial modeling in machine learning, there has been less theory developed for adversarial attacks from a statistical perspective.
The adversarial framework we consider is similar to the model in [cai2015robust], and it is motivated by the example in the previous section, in which the addition of vertices without correspondences across graphs negatively impacted VN performance.
Suppose that we are interested in performing vertex nomination on a graph pair,
but an adversary randomly adds and deletes some edges and/or vertices in the second graph.
For example, suppose we are trying to find influencers on Instagram by vertex matching to Facebook.
An influencer that has knowledge of our procedure may attempt to make our algorithm fail in its nominations, perhaps by friending and de-friending people on Facebook.
Even if our vertex nomination scheme was working well prior to encountering the adversary, it may not be after modification by the adversary.
However, if the adversary adds edges/vertices to a graph with some probability and deletes edges/vertices with another probability, it may be possible to partially recover the structure of the original graph by removing vertices with unusual degree behavior [edge2018trimming]. Such a modification is the graph analogue of the “trimmed mean” estimator [stigler1973asymptotic] from classical statistics.
Empirically, if we assume the adversary is modifying the data randomly, can we still predict whether our VN scheme will perform well on the regularized graph? From a statistical standpoint, what can we say about the statistical consistency of our original vertex nomination rule? Our motivating example suggests that it may be possible to recover performance after regularization, but theory is needed both to explain why that may be the case and to properly frame the problem. Hence, to answer these questions, we further develop the theory in [lyzinski2017consistent] to situate the notion of adversarial contamination within the idea of maximal consistency classes for a given VN rule (Section LABEL:sec:CC). In this framework, the goal of an adversary is to move a model out of a rule’s consistency class, while regularization enlarges the consistency class to (hopefully) thwart the adversary. While we are unable to rigorously establish this for the VN rule considered herein, we demonstrate with real and synthetic data examples that countering such an adversarial attack via network regularization can effectively mitigate the degradation in VN performance (Section LABEL:sec:data).
Notation: The following notation will be used throughout. For a positive integer , we will let denote the set of -vertex labeled graphs, and we will let .
2 Vertex Nomination and Consistency
We will now rigorously define the VN problem and consistency within the VN framework. Combined with the results on consistency classes in Section LABEL:sec:CC, this will allow us to provide a statistical basis for understanding adversarial attacks in VN.
As in our motivating work in [lyzinski2017consistent], we will situate our analysis of the VN problem in the very general framework of nominatable distributions.
For a given , the set of Nominatable Distributions of order , denoted , is the collection of all families of distributions of the following form
where is a distribution on parameterized by satisfying:
The vertex sets and satisfy for . We refer to as the core vertices. These are the vertices that are shared across the two graphs and imbue the model with a natural notion of corresponding vertices.
Vertices in and satisfy . We refer to and as junk vertices; these are the vertices in each graph that have no corresponding vertex in the other graph.
The induced subgraphs and are conditionally independent given .
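To make the core/junk structure concrete, the following toy generator is our own illustrative sketch, not the general nominatable-distribution model: the first n_core vertices are shared core vertices whose edges are rho-correlated Bernoulli(p) across the two graphs, while each graph's junk vertices attach independently; p and rho are assumed parameters.

```python
import numpy as np

def sample_pair(n_core, n_junk1, n_junk2, p=0.3, rho=0.8, seed=None):
    """Toy pair of graphs sharing n_core core vertices (listed first);
    each graph also gets its own junk vertices with no counterpart."""
    rng = np.random.default_rng(seed)
    n1, n2 = n_core + n_junk1, n_core + n_junk2

    def sym(M):
        U = np.triu(M, 1)                  # drop diagonal and lower half
        return (U | U.T).astype(float)

    M1 = rng.random((n1, n1)) < p          # every edge is Bernoulli(p)
    M2 = rng.random((n2, n2)) < p
    # Re-draw the core-core block of graph 2 so it is rho-correlated
    # with graph 1 (the marginal edge probability stays p).
    flip = rng.random((n_core, n_core))
    core1 = M1[:n_core, :n_core]
    M2[:n_core, :n_core] = np.where(core1,
                                    flip < p + rho * (1 - p),
                                    flip < p * (1 - rho))
    return sym(M1), sym(M2)
```

With rho = 1 and no junk vertices the two graphs coincide; with rho = 0 the core blocks are independent.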
The vertices in are those that have a corresponding paired vertex in each graph, where correspondence can be defined very generally. Corresponding vertices need not represent the same person/user/account; rather, corresponding vertices are understood as those that share a desired property across graphs. In particular, we will assume that the vertices of interest in the first graph correspond to the vertices of interest in the second. Having access to the vertex labels would render the VN problem trivial; to model the uncertainty often present in data applications, where the vertex labels (or correspondences) are unknown a priori, we adopt the notion of obfuscation functions from [lyzinski2017consistent].