Being Corrupt Requires Being Clever, But Detecting Corruption Doesn't

09/27/2018 ∙ by Yan Jin, et al. ∙ MIT

We consider a variation of the problem of corruption detection on networks posed by Alon, Mossel, and Pemantle '15. In this model, each vertex of a graph can be either truthful or corrupt. Each vertex reports about the types (truthful or corrupt) of all its neighbors to a central agency, where truthful nodes report the true types they see and corrupt nodes report adversarially. The central agency aggregates these reports and attempts to find a single truthful node. Inspired by real auditing networks, we pose our problem for arbitrary graphs and consider corruption through a computational lens. We identify a key combinatorial parameter of the graph, m(G), which is the minimal number of corrupted agents needed to prevent the central agency from identifying a single truthful node. We give an efficient (in fact, linear time) algorithm for the central agency to identify a truthful node that is successful whenever the number of corrupt nodes is less than m(G)/2. On the other hand, we prove that for any constant α > 1, it is NP-hard to find a subset of nodes S in G such that corrupting S prevents the central agency from finding one truthful node and |S| ≤ α·m(G), assuming the Small Set Expansion Hypothesis (Raghavendra and Steurer, STOC '10). We conclude that being corrupt requires being clever, while detecting corruption does not. Our main technical insight is a relation between the minimum number of corrupt nodes required to hide all truthful nodes and a certain notion of vertex separability for the underlying graph. Additionally, this insight lets us design an efficient algorithm for a corrupt party to decide which nodes to corrupt, up to a multiplicative factor of O(√n).


1 Introduction

1.1 Corruption Detection and Problem Set-up

We study the problem of identifying truthful nodes in networks, in the model of corruption detection on networks posed by Alon, Mossel, and Pemantle [AMP15]. In this model, we have a network represented by a (possibly directed) graph. Nodes can be truthful or corrupt. Each node audits its outgoing neighbors to see whether they are truthful or corrupt, and sends reports of their identities to a central agency. The central agent, who is not part of the graph, aggregates the reports and uses them to identify truthful and corrupt nodes. Truthful nodes report truthfully (and correctly) on their neighbors, while corrupt nodes have no such restriction: they can assign arbitrary reports to their neighbors, regardless of whether their neighbors are truthful or corrupt, and coordinate their efforts with each other to prevent the central agency from gathering useful information.

In [AMP15], the authors consider the problem of recovering the identities of almost all nodes in a network in the presence of many corrupt nodes; specifically, when the fraction of corrupt nodes can be very close to 1/2. They call this the corruption detection problem. They show that the central agency can recover the identity of most nodes correctly even in certain bounded-degree graphs, as long as the underlying graph is a sufficiently good expander. The required expansion properties are known to hold for a random graph or Ramanujan graph of sufficiently large (but constant) degree, which yields undirected graphs that are amenable to corruption detection. Furthermore, they show that some level of expansion is necessary for identifying truthful nodes, by demonstrating that the corrupt nodes can stop the central agency from identifying any truthful node when the graph is a very bad expander (e.g. a cycle), even if the corrupt nodes make up only a small fraction of the network.

This establishes that very good expanders are very good for corruption detection, and very bad expanders can be very bad for corruption detection. This raises the question of how effective graphs that fall into neither category are for corruption detection. In the setting of [AMP15], we could ask the following: given an arbitrary undirected graph, what is the smallest number of corrupt nodes that can prevent the identification of almost all nodes? When there are fewer than this number, can the central agency efficiently identify almost all nodes correctly? Alon, Mossel, and Pemantle study these questions for the special cases of highly expanding graphs and poorly expanding graphs, but do not address general graphs.

Additionally, [AMP15] considers corruption detection when the corrupt agencies can choose their locations and collude arbitrarily, with no bound on their computational complexity. This is perhaps overly pessimistic: after all, it is highly unlikely that corrupt agencies can solve NP-hard problems efficiently and if they can, thwarting their covert operations is unlikely to stop their world domination. We suggest a model that takes into account computational considerations, by factoring in the computation time required to select the nodes in a graph that a corrupt party chooses to control. This yields the following question from the viewpoint of a corrupt party: given a graph, can a corrupt party compute the smallest set of nodes it needs to corrupt in polynomial time?

In addition to being natural from a mathematical standpoint, these questions are also well-motivated socially. It would be naïve to assert that we can weed out corruption in the real world by simply designing auditing networks that are expanders. Rather, these networks may already be formed, and infeasible to change in a drastic way. Given this, we are less concerned with finding certain graphs that are good for corruption detection, but rather discerning how good existing graphs are; specifically, how many corrupt nodes they can tolerate. In particular, since the network structure could be out of the control of the central agency, algorithms for the central agency to detect corruption on arbitrary graphs seem particularly important.

It is also useful for the corrupt agency to have an algorithm with guarantees for any graph. Consider the following example of a corruption detection problem from the viewpoint of a corrupt organization. Country A wants to influence policy in country B, and wants to figure out the most efficient way to place corrupted nodes within country B to make this happen. However, if the central government of B can confidently identify truthful nodes, they can weight those nodes’ opinions more highly, and thwart country A’s plans. Hence, the question country A wants to solve is the following: given the graph of country B, can country A compute the optimal placement of corrupt nodes to prevent country B from finding truthful nodes? We note that in this question, too, the graph of country B is fixed, and hence, country A would like to have an algorithm that takes as input any graph and computes the optimal way to place corrupt nodes in order to hide all the truthful nodes.

We study the questions above for a variant of the corruption detection problem in [AMP15], in which the goal of the central agency is to find a single truthful node. While this goal is less ambitious than the goal of identifying almost all the nodes, we think it is a very natural question in the context of corruption. For one, if the central agency can find a single truthful node, they can use the trusted reports from that node to identify more truthful and corrupt nodes that it might be connected to. The central agency may additionally weight the opinions of the truthful nodes more when making policy decisions (as alluded to in the example above), and can also incentivize truthfulness by rewarding truthful nodes that it finds and giving them more influence in future networks if possible (by increasing their out-degrees). Moreover, our proofs and results extend to finding a larger number of truthful nodes, as we discuss below.

Our results stem from a tie between the problem of finding a single truthful node in a graph and a measure of vertex separability of the graph. This tie not only yields an efficient and relatively effective algorithm for the central agency to find a truthful node, but also allows us to relate the corrupt party’s strategy to the problem of finding a good vertex separator for the graph. Hence, by analyzing the purely graph-theoretic problem of finding a good vertex separator, we can characterize the difficulty of finding a good set of nodes to corrupt. Similar notions of vertex separability have been studied previously (e.g. [Lee17, ORS07, BMN15]), and we prove NP-hardness for the notion relevant to us assuming the Small Set Expansion Hypothesis (SSEH). The Small Set Expansion Hypothesis was posed by Raghavendra and Steurer [RS10] and is closely related to the famous Unique Games Conjecture of Khot [Khot02]. In fact, [RS10] shows that the SSEH implies the Unique Games Conjecture. The SSEH yields hardness results that are not known to follow directly from the UGC, especially for graph problems like sparsest cut and treewidth ([RST12] and [APW12] respectively), among others.

1.2 Our Results

We now outline our results more formally. We analyze the variant of corruption detection where the central agency’s goal is to find a single truthful node. First, we study how effectively the central agency can identify a truthful node on an arbitrary graph, given a set of reports. Given an undirected graph G (unless explicitly specified, all graphs are undirected by default), we let m(G) denote the minimal number of corrupted nodes required to stop the central agency from finding a truthful node, where the minimum is taken over all strategies of the corrupt party (not just computationally bounded ones). We informally call m(G) the “critical” number of corrupt nodes for the graph G. Then, we show the following:

Theorem 1.

Fix a graph G and suppose that the corrupt party has a budget b < m(G)/2. Then the central agency can identify a truthful node, regardless of the strategy of the corrupt party, and without knowledge of either b or m(G). Furthermore, the central agency’s algorithm runs in linear time (in the number of edges in the graph G).

Next, we consider the question from the viewpoint of the corrupt party: can the corrupt party efficiently compute the most economical way to allocate nodes to prevent the central agency from finding a truthful node? Concretely, we focus on a natural decision version of the question: given a graph G and an upper bound b on the number of corrupted nodes, can the corrupt party prevent the central agency from finding a truthful node?

We actually focus on an easier question: can the corrupt party accurately compute m(G), the minimum number of nodes that they need to control to prevent the central agency from finding a truthful node? Not only do we give evidence that computing m(G) exactly is computationally hard, but we also provide evidence that m(G) is hard to approximate. Specifically, we show that approximating m(G) to within any constant factor is NP-hard under the Small Set Expansion Hypothesis (SSEH); or in other words, that it is SSE-hard.

Theorem 2.

For every constant α ≥ 1, there is a constant c > 0 such that the following is true. Given a graph G, it is SSE-hard to distinguish between the case where m(G) ≤ c·n and the case where m(G) ≥ α·c·n. Or in other words, the problem of approximating the critical number of corrupt nodes for a graph to within any constant factor is SSE-hard.

This Theorem immediately implies the following Corollary 1.

Corollary 1.

Assume the SSE Hypothesis and that P ≠ NP. Fix any constant α ≥ 1. There does not exist a polynomial-time algorithm that takes as input an arbitrary graph G and outputs a set of nodes S with size |S| ≤ α·m(G), such that corrupting S prevents the central agency from finding a truthful node.

We note that in Corollary 1, the bad party’s input is only the graph G: specifically, they do not have knowledge of the value of m(G).

Our proof for Theorem 2 is similar to the proof of Austrin, Pitassi, and Wu [APW12] for the SSE-hardness of approximating treewidth. This is not a coincidence: in fact, the “soundness” in their reduction involves proving that their graph does not have a good vertex separator, where the notion of vertex separability (from [BGHK95]) is closely related to the version we use to characterize the problem of hiding a truthful vertex. We give the proof of Theorem 2 in Section 3.2.

However, if one allows for an approximation factor of O(√n), then m(G) can be approximated efficiently. Furthermore, this yields an approximation algorithm that the corrupt party can use to find a placement that hinders detection of a truthful node.

Theorem 3.

There is a polynomial-time algorithm that takes as input a graph G and outputs a set of nodes S with size |S| ≤ O(√n)·m(G), such that corrupting S prevents the central agency from finding a truthful node.

The proof of Theorem 3, given in Section 3.3, uses a bi-criteria approximation algorithm for the k-vertex separator problem given by [Lee17]. As alluded to in Section 1.1, Theorems 2 and 3 both rely on an approximate characterization of m(G) in terms of a measure of vertex separability of the graph G, which we give in Section 3.

Additionally, we note that we can adapt Theorems 1 and 2 (as well as Corollary 1) to a more general setting, where the central agency wants to recover some arbitrary number of truthful nodes, where the number of nodes can be proportional to the size of the graph. We describe how to modify our proofs to match this more general setting in Section 5.

Together, Theorems 1 and 2 uncover a surprisingly positive result for corruption detection: it is computationally easy for the central agency to find a truthful node when the number of corrupted nodes is only somewhat smaller than the “critical” number for the underlying graph, but it is in general computationally hard for the corrupt party to hide the truthful nodes even when they have a budget that far exceeds the “critical” number for the graph.

Results for Directed Graphs

As noted in [AMP15], it is unlikely that real-world auditing networks are undirected. For example, it is likely that the FBI has the authority to audit the Cambridge police department, but it is also likely that the reverse is untrue. Therefore, we would like the central agency to be able to find truthful nodes in directed graphs in addition to undirected graphs. We notice that the algorithm we give in Theorem 1 extends naturally to directed graphs.

Theorem 4.

Fix a directed graph G and suppose that the corrupt party has a budget b < m(G)/2. Then the central agency can identify a truthful node, regardless of the strategy of the corrupt party, and without knowledge of either b or m(G). Furthermore, the central agency’s algorithm runs in linear time.

The proof of Theorem 4 is similar to the proof of Theorem 1, and effectively relates the problem of finding a truthful node on directed graphs to a similar notion of vertex separability, suitably generalized to directed graphs.

Results for Finding An Arbitrary Number of Good Nodes

In fact, the problem of finding one good node is just a special case of finding an arbitrary number k of good nodes on the graph G. We define the critical number for this setting as the minimal number of bad nodes required to prevent the identification of k good nodes on the graph G. We relate it to an analogous vertex separation notion, and prove the following two theorems, which are extensions of Theorems 1 and 2 to this setting.

Theorem 5.

Fix a graph G and a number k of good nodes to recover. Suppose that the corrupt party has a budget b. If b is less than half the critical number of corrupt nodes for hiding k good nodes, then the central agency can identify k truthful nodes, regardless of the strategy of the corrupt party, and without knowledge of either b or that critical number. Furthermore, the central agency’s algorithm runs in linear time.

Theorem 6.

For every k and every constant α ≥ 1, there is a constant c > 0 such that the following is true. Given a graph G, it is SSE-hard to distinguish between the case where the critical number of corrupt nodes for hiding k good nodes is at most c·n and the case where it is at least α·c·n. Or in other words, the problem of approximating, to within any constant factor, the critical number of corrupt nodes such that it is impossible to find k good nodes is SSE-hard.

The proof of Theorem 5 is similar to the proof of Theorem 1, and the hardness of approximation proof for Theorem 6 relies on the same graph reduction and SSE conjecture as Theorem 2. Proofs are presented in Section 5.

1.3 Related Work

The model of corruptions posed by [AMP15] is identical to a model first suggested by Preparata, Metze, and Chien [PMC67], who introduced the model in the context of detecting failed components in digital systems. This work (as well as many follow-ups, e.g. [Kameda75, KR80]) looked at the problem of characterizing which networks can detect a certain number of corrupted nodes. Xu and Huang [XH95] give necessary and sufficient conditions for identifying a single corrupted node in a graph, although their characterization is not algorithmically efficient. There are many other works on variants of this problem (e.g. [Sull84, DM84]), including recovering node identities with one-sided or two-sided error probabilities in the local reports [MH76] and adaptively finding truthful nodes [HA74].

We note that our model of a computationally bounded corrupt party and our stipulation that the graph is fixed ahead of time rather than designed by the central agency, which are our main contributions to the model, seem more naturally motivated in the setting of corruptions than in the setting of designing digital systems. Even the question of identifying a single truthful node could be viewed as more naturally motivated in the setting of corruptions than in the setting of diagnosing systems. We believe there are likely more interesting theoretical questions to be discovered by approaching the PMC model through a corruptions lens.

The identifiability of a single node in the corruptions setting was studied in a recent paper of Mukwembi and Mukwembi [MM17]. They give a linear time greedy algorithm to recover the identity of a single node in many graphs, provided that corrupt nodes always report other corrupt nodes as truthful. Furthermore, this assumption allows them to reduce identifying all nodes to identifying a single node. They argue that such an assumption is natural in the context of corruptions, where corrupt nodes are selfishly incentivized not to out each other. However, in our setting, corrupt nodes can not only betray each other, but are in fact incentivized to do so for the good of the overarching goal of the corrupt party (to prevent the central agency from identifying a truthful node). Given [MM17], it is not a surprise that the near-optimal strategies we describe for the corrupt party in this paper crucially rely on the fact that the nodes can report each other as corrupt.

Our problem of choosing the best subset of nodes to corrupt bears intriguing similarities to the problem of influence maximization studied by [KKT15], where the goal is to find an optimal set of nodes to target in order to maximize the adoption of a certain technology or product. It is an interesting question to see if there are further similarities between these two areas. Additionally, social scientists have studied corruption extensively (e.g. [Fjeldstad03, Nielsen03]), though to the best of our knowledge they have not studied it in the graph-theoretic way that we do in this paper.

1.4 Comparison to Corruption in Practice

Finally, we must address the elephant in the room. Despite our theoretical results, corruption is prevalent in many real-world networks, and yet in many scenarios it is not easy to pinpoint even a single truthful node. One reason for this is that some of our assumptions do not seem to hold in some real-world networks. For example, we assume that audits from the truthful nodes are not only non-malicious, but also perfectly reliable. In practice this assumption is unlikely to be true: many truthful nodes could be non-malicious but simply unable to audit their neighbors accurately. Further assumptions that may not hold in some scenarios include the notion of a central agency that is both uncorrupted and has access to reports from every agency, and possibly even the assumption that the corrupt nodes make up less than half of the network. In addition, networks may have very low critical numbers in practice. For example, there could be a triangle (say, “President”, “Congress”, and “Houses”) that is entirely corrupt and cannot be audited by any agent outside the triangle. It is thus plausible that a corrupt party could use the structure of realistic auditing networks for their corruption strategy, overcoming our worst-case hardness result.

While this points to some shortcomings of our model, it also points out ways to change policy that would potentially bring the real world closer to our idealistic scenario, where a corrupt party has a much more difficult computational task than the central agency. For example, we can speculate that perhaps information should be gathered by a transparent centralized agency, that significant resources should go into ensuring that the centralized agency is not corrupt, and that networks ought to have good auditing structure (without important agencies that can be audited by very few nodes).

2 Preliminaries

2.1 General Preliminaries

We denote undirected graphs by G = (V_G, E_G), where V_G is the vertex set of the graph and E_G is the edge set. We denote directed graphs by G = (V_G, E_G), where E_G is a set of ordered pairs. When the underlying graph is clear, we may drop the subscripts. Given a vertex v in an undirected graph G, we let N(v) denote the neighborhood (set of neighbors) of the vertex v in G. Similarly, given a vertex v in a directed graph G, let N_out(v) denote the set of outgoing neighbors of v: that is, vertices u such that (v, u) ∈ E_G.

2.1.1 Vertex Separator

Definition 1.

(k-vertex separator) ([ORS07], [BMN15]) For any k ∈ ℕ, we say a subset of vertices S ⊆ V is a k-vertex separator of a graph G if, after removing S and its incident edges, the remaining graph forms a union of connected components, each of size at most k.

Furthermore, let sep_k(G) denote the size of the minimal k-vertex separator of the graph G.
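For small graphs, sep_k(G) can be computed directly from Definition 1 by brute force over vertex subsets. The following sketch (plain Python with graphs as adjacency-set dicts; the function names are ours, not from the paper) illustrates the definition:

```python
from itertools import combinations

def components(vertices, adj):
    """Connected components of the subgraph induced on `vertices`."""
    vertices = set(vertices)
    seen, comps = set(), []
    for v in vertices:
        if v in seen:
            continue
        stack, comp = [v], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend((adj[u] & vertices) - comp)
        seen |= comp
        comps.append(comp)
    return comps

def sep(k, adj):
    """Size of a minimum k-vertex separator (exponential time; small graphs only)."""
    V = set(adj)
    for size in range(len(V) + 1):
        for S in combinations(V, size):
            rest = V - set(S)
            if all(len(c) <= k for c in components(rest, adj)):
                return size
    return len(V)

# Star graph on 5 nodes: center 0, leaves 1..4.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
print(sep(1, star))  # 1: removing the center leaves only isolated vertices
```

For the star, removing the single center vertex already breaks the graph into components of size 1, so sep_1 = 1.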

2.1.2 Small Set Expansion Hypothesis

In this section we define the Small Set Expansion (SSE) Hypothesis introduced in [RS10]. Let G = (V, E) be an undirected d-regular graph.

Definition 2 (Normalized edge expansion).

For a set S ⊆ V of vertices, denote by φ(S) the normalized edge expansion of S,

φ(S) = |E(S, V ∖ S)| / (d·|S|),

where |E(S, V ∖ S)| is the number of edges between S and V ∖ S.

The Small Set Expansion Problem with parameters η and δ, denoted SSE(η, δ), asks whether G has a small set which does not expand, or whether all small sets of G are highly expanding.

Definition 3 (SSE(η, δ)).

Given a regular graph G = (V, E), distinguish between the following two cases:

  • Yes: There is a set S of vertices with |S| = δ·|V| and φ(S) ≤ η.

  • No: For every set S of vertices with |S| = δ·|V|, it holds that φ(S) ≥ 1 − η.
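The quantity φ(S) in these definitions can be computed directly for a concrete regular graph. The snippet below (our own encoding, with graphs as adjacency-set dicts as in the earlier examples) is a sanity check on the definition:

```python
def edge_expansion(S, adj, d):
    """Normalized edge expansion of S in a d-regular graph:
    (# edges leaving S) / (d * |S|)."""
    S = set(S)
    boundary = sum(1 for u in S for v in adj[u] if v not in S)
    return boundary / (d * len(S))

# 4-cycle (2-regular): a single vertex has expansion 1,
# while an adjacent pair has expansion 1/2.
cycle4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(edge_expansion({0}, cycle4, 2))     # 1.0
print(edge_expansion({0, 1}, cycle4, 2))  # 0.5
```

As the cycle example shows, poor expanders have small sets whose expansion is far below 1, which is exactly the Yes case of SSE(η, δ).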

The Small Set Expansion Hypothesis is the conjecture that deciding SSE(η, δ) is NP-hard.

Conjecture 1 (Small Set Expansion Hypothesis [RS10]).

For every η > 0, there is a δ > 0 such that SSE(η, δ) is NP-hard.

We say that a problem is SSE-hard if it is at least as hard to solve as the SSE problem. The form of the conjecture most relevant to our proof is the following “stronger” form of the SSE Hypothesis: [RST12] showed that the SSE problem can be reduced to a quantitatively stronger form of itself. In order to state this version, we first need to define Gaussian noise stability.

Definition 4.

(Gaussian Noise Stability) Let ρ ∈ (−1, 1). Define Λ_ρ : [0, 1] → [0, 1] by

Λ_ρ(μ) = Pr[X ≤ t and Y ≤ t], where t = Φ^(−1)(μ),

and X and Y are jointly normal random variables with mean 0 and covariance matrix [[1, ρ], [ρ, 1]].

The only fact that we will use for stating the stronger form of the SSEH is the asymptotic behavior of Λ_ρ(μ) when μ is close to 0 and 1 − ρ is bounded away from 0.

Fact 1.

There is a constant c > 0 such that Λ_ρ(μ) ≤ μ^(1+c) for all sufficiently small μ and all ρ bounded away from 1. (Note that the upper bound on ρ can be taken arbitrarily close to 1, so the statement holds, with c depending on that bound, for any constant ρ < 1.)
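The definition of Λ_ρ(μ) can be sanity-checked by Monte Carlo sampling of correlated Gaussians (our own sketch, not code from the paper; for ρ = 0 the two coordinates are independent, so Λ_0(μ) = μ² exactly):

```python
import random
from statistics import NormalDist

def noise_stability(rho, mu, trials=200_000, seed=0):
    """Monte Carlo estimate of Lambda_rho(mu) = Pr[X <= t and Y <= t],
    where t = Phi^{-1}(mu) and (X, Y) are standard normals with correlation rho."""
    rng = random.Random(seed)
    t = NormalDist().inv_cdf(mu)
    c = (1.0 - rho * rho) ** 0.5
    hits = 0
    for _ in range(trials):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + c * rng.gauss(0.0, 1.0)  # standard normal, correlation rho with x
        if x <= t and y <= t:
            hits += 1
    return hits / trials

# Independent coordinates (rho = 0): Lambda_0(0.3) should be close to 0.3^2 = 0.09.
print(noise_stability(0.0, 0.3))  # ≈ 0.09
```

Increasing ρ increases Λ_ρ(μ) toward μ, matching the intuition that highly correlated coordinates land below the threshold together more often.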

Conjecture 2 (SSE Hypothesis, Equivalent Formulation [RST12]).

For every integer q ≥ 1 and every ε, γ > 0, it is NP-hard to distinguish between the following two cases for a given regular graph G = (V, E):

  • Yes: There is a partition of V into q equi-sized sets S_1, …, S_q such that φ(S_i) ≤ 2ε for every 1 ≤ i ≤ q.

  • No: For every S ⊆ V, letting μ = |S|/|V|, it holds that φ(S) ≥ 1 − (Λ_ρ(μ) + γ)/μ with ρ = 1 − ε,

where Λ_ρ is the Gaussian noise stability.

We present two remarks about Conjecture 2 from [APW12], which are relevant to our proof of Theorem 2.

Remark 1.

[APW12] The Yes instance of Conjecture 2 implies that the number of edges leaving each S_i is at most 2ε·d·n/q, so the total number of edges not contained in one of the S_i is at most ε·d·n.

Remark 2.

[APW12] The No instance of Conjecture 2 implies that, for ε sufficiently small, there exists some constant c > 0 such that φ(S) ≥ 1 − 2μ^c, provided that γ ≤ μ^(1+c) and setting μ = |S|/|V|. (Recall that Fact 1 is true for any constant ρ < 1; therefore, Remark 2 can be strengthened accordingly. This will be a useful fact for proving hardness of approximation for finding many truthful nodes in Section 5.)

Remark 1 follows from the definition of normalized edge expansion and the fact that the sum of degrees is twice the number of edges. Remark 2 follows from Fact 1. The strong form of the SSE Hypothesis (Conjecture 2), Remark 1, and Remark 2 will be particularly helpful for proving our SSE-hardness of approximation result (Theorem 2).

2.2 Preliminaries for Corruption Detection on Networks

We model networks as directed or undirected graphs, where each vertex in the network can be one of two types: truthful or corrupted. At times, we will informally call truthful vertices “good” and corrupt vertices “bad.” We say that the corrupt party has budget b if it can afford to corrupt at most b nodes of the graph. Given a vertex set V and a budget b, the corrupt entity will choose to control a subset of nodes B ⊆ V under the constraint |B| ≤ b. The rest of the graph remains as truthful vertices, i.e., T = V ∖ B. We assume that there are more truthful than corrupt nodes (b < n/2). It is easy to see that in the case where b ≥ n/2, the corrupt nodes can prevent the identification of even one truthful node, by simulating truthful nodes (see e.g. [AMP15]).

Each node audits and reports its (outgoing) neighbors’ identities. That is, each vertex v will report the type of each u ∈ N(v), which is a vector in {truthful, corrupt}^|N(v)|. Truthful nodes always report the truth, i.e., a truthful node v reports its neighbor u as truthful if u is truthful, and as corrupt if u is corrupt. The corrupt nodes report their neighbors’ identities adversarially. In summary, a strategy of the bad agents is composed of a strategy to take over at most b nodes on the graph, and reports on the nodes that neighbor them.

Definition 5 (Strategy for a corrupt party).

A strategy for the corrupt party is a function that maps a graph G and budget b to a subset of nodes B with size |B| ≤ b, and a set of reports that each node in B gives about its neighboring nodes.

Definition 6 (Computationally bounded corrupt party).

We say that the corrupt party is computationally bounded if its strategy can only be a polynomial-time computable function.

The task for the central agency is to find a good node on this corrupted network, based on the reports. It is clear that the larger the corrupt party’s budget, the harder the task of finding one truthful node becomes. It was observed in [AMP15] that, for any graph, it is not possible to find one good node if b ≥ n/2. If b = 0, it is clear that the entire vertex set is truthful. Therefore, given an arbitrary graph G, there exists a critical number m(G), such that if the bad party has budget lower than m(G), it is always possible to find a good node, while if the bad party has budget greater than or equal to m(G), it may not be possible to find a good node. In light of this, we define the critical number of bad nodes on a graph G. First, we formally define what we mean when we say it is impossible to find a truthful node on a graph G.

Definition 7 (Impossibility of finding one truthful node).

Given a graph G, the bad party’s budget b, and a set of reports, we say that it is impossible to identify one truthful node if for every vertex v there is a configuration of the identities of the nodes in which v is bad, the configuration is consistent with the given reports, and the configuration consists of at most b bad nodes.

Definition 8 (Critical number of bad nodes on a graph G, m(G)).

Given an arbitrary graph G, we define m(G) as the minimum number b such that there is a way to distribute b corrupt nodes and set their corresponding reports such that it is impossible to find one truthful node on the graph G, given G, the reports, and the fact that the bad party’s budget is at most b.

For example, for a star graph S with at least five nodes, the critical number of bad nodes is m(S) = 2. If there is at most 1 corrupt node on S, the central agency can always find a good node, thus m(S) ≥ 2. If there are at most 2 bad nodes on S, then the bad party can control the center node and one of the leaves, and it is then impossible for the central agency to find one good node.
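Definition 7 can be checked mechanically on small examples: a placement and set of reports makes it impossible to find a truthful node exactly when every vertex is bad in some report-consistent configuration within budget. The sketch below (our own encoding: `reports[u][v]` is True when u reports neighbor v as corrupt) verifies the star-graph example above by brute force:

```python
from itertools import combinations

def consistent(bad, adj, reports):
    """A configuration `bad` is consistent if every good node's reports are correct."""
    return all(
        reports[u][v] == (v in bad)
        for u in adj if u not in bad
        for v in adj[u]
    )

def impossible_to_find_truthful(adj, reports, budget):
    """True iff every vertex is bad in some consistent configuration
    of at most `budget` bad nodes (Definition 7)."""
    V = list(adj)
    configs = [
        set(c)
        for size in range(budget + 1)
        for c in combinations(V, size)
        if consistent(set(c), adj, reports)
    ]
    return all(any(v in c for c in configs) for v in V)

# Star with center 0 and leaves 1..4; the bad party controls {0, 1}.
# Truthful leaves 2,3,4 correctly report the corrupt center; the corrupt
# center accuses every leaf, and corrupt leaf 1 accuses the center.
adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
reports = {u: {v: True for v in adj[u]} for u in adj}
print(impossible_to_find_truthful(adj, reports, budget=2))  # True
print(impossible_to_find_truthful(adj, reports, budget=1))  # False
```

With budget 2, every vertex sits in some consistent configuration (the center plus any one leaf), so no vertex can be declared truthful; with budget 1 only the center can be bad, so every leaf is provably good. This matches m(S) = 2.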

Given a graph G, by definition there exists some set of m(G) nodes that can make it impossible to find a good node if they are corrupted. However, this does not mean that the corrupt party can necessarily find this set in polynomial time. Indeed, Theorem 2 establishes that they cannot always find this set in polynomial time if we assume the SSE Hypothesis (Conjecture 2) and that P ≠ NP.

3 Proofs of Theorems 1, 2, and 3

In the following section, we state our main results by first presenting the close relation of our problem to the k-vertex separator problem. Then we use this characterization to prove Theorem 1. This characterization will additionally be useful for the proofs of Theorems 2 and 3, which we will give in Section 3.2 and Section 3.3.

3.1 2-Approximation by Vertex Separation

Lemma 1 (2-Approximation by Vertex Separation).

The critical number of corrupt nodes for a graph G, m(G), can be bounded by the minimal sum of the k-vertex separator size and k, min_k (sep_k(G) + k), up to a factor of 2, i.e.,

(1/2) · min_k (sep_k(G) + k) ≤ m(G) ≤ min_k (sep_k(G) + k).

Proof of Lemma 1.

The direction m(G) ≤ min_k (sep_k(G) + k) follows simply. Fix any k and let s = sep_k(G). If the corrupt party is given s + k nodes to corrupt on the graph, it can first assign s nodes to the separator, so that the remaining nodes are partitioned into components of size at most k. Then it arbitrarily assigns one of the components to be all bad nodes. The bad nodes in the connected components report the nodes in the same component as good, and report any node in the separator as bad. The nodes in the separator can effectively report however they want (e.g. report all neighboring nodes as bad). It is impossible to identify even one single good node, because every connected component, of size at most k, can potentially be all bad, and all vertices in the separator are bad.

The direction m(G) ≥ (1/2) · min_k (sep_k(G) + k) can be proved as follows. When there are b = m(G) corrupt nodes distributed optimally in G, it is impossible to find a single good node by definition, and therefore, in particular, the following algorithm (Algorithm 1) cannot always find a good node:

Input: Undirected graph G and the nodes’ reports.

  • If the reports on an edge (u, v) do not both equal “good”, remove both u and v and all incident edges. Remove such a pair of nodes in each round, until there are no bad reports left.

  • Call the remaining graph G'. Declare the largest component of G' as good.

Algorithm 1: Finding one truthful vertex on an undirected graph
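Algorithm 1 translates directly into code. The sketch below is our own unoptimized (quadratic rather than linear time) rendering, with `reports[u][v]` True when u reports neighbor v as corrupt; it removes both endpoints of any disputed edge and returns the largest surviving component:

```python
def find_truthful_component(adj, reports):
    """Algorithm 1 (sketch): repeatedly delete both endpoints of any edge
    carrying a 'corrupt' report, then declare the largest remaining
    connected component good."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for u in list(alive):
            for v in adj[u]:
                if v in alive and (reports[u][v] or reports[v][u]):
                    alive -= {u, v}  # a disputed edge: drop both endpoints
                    changed = True
                    break
            if changed:
                break
    # Largest connected component of the surviving graph.
    best, seen = set(), set()
    for s in alive:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(w for w in adj[u] if w in alive and w not in comp)
        seen |= comp
        best = max(best, comp, key=len)
    return best

# Complete graph K6 with a single corrupt vertex 0 that accuses everyone.
K6 = {i: {j for j in range(6) if j != i} for i in range(6)}
reports = {u: {v: (u == 0 or v == 0) for v in K6[u]} for u in K6}
good = find_truthful_component(K6, reports)
print(0 not in good, len(good))  # True 4
```

In the K6 example the single corrupt node has budget 1 < m(K6)/2 = 3/2, and indeed the surviving 4-node component is entirely truthful, as Theorem 1 guarantees.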

Run Algorithm 1 on G, and suppose the first step terminates in r rounds; then:

  • No remaining node reports a neighbor as corrupt.

  • n − 2r nodes remain in the graph G'.

  • At most b − r bad nodes remain in the graph, because each time we remove an edge with a bad report, at least one of the endpoints must be a corrupt vertex.

Note that if two nodes report each other as good, they must be of the same type (either both truthful, or both corrupt). Since the graph G' only contains good reports, nodes within a connected component of G' have the same type. If there exists a component of size larger than b − r, it exceeds the bad party’s remaining budget, and must be all good. Therefore, Algorithm 1 would successfully find a good node.

Since Algorithm 1 cannot find a good node, the bad party must have the budget to corrupt the largest component of G', which means it has size at most b − r. Hence, the 2r nodes removed in the first step form a (b − r)-vertex separator, i.e., sep_{b−r}(G) ≤ 2r. Plugging in k = b − r we get that

min_k (sep_k(G) + k) ≤ sep_{b−r}(G) + (b − r) ≤ 2r + (b − r) = b + r ≤ 2b = 2·m(G),

where the first inequality comes from taking the minimum over k, and the last inequality uses r ≤ b. ∎

Furthermore, the upper bound in Lemma 1 additionally tells us that if the corrupt party’s budget is b < m(G)/2, the set output by Algorithm 1 is guaranteed to be good.

Theorem 1.

Fix a graph G and suppose that the corrupt party has a budget b < m(G)/2. Then the central agency can identify a truthful node, regardless of the strategy of the corrupt party, and without knowledge of either b or m(G). Furthermore, the central agency's algorithm runs in linear time (in the number of edges of the graph G).

Proof of Theorem 1.

Suppose the corrupt party has budget b < m(G)/2. Run Algorithm 1. We remove 2r nodes in the first step, and separate the remaining graph into connected components. Notice each time we remove an edge with a bad report, at least one of its endpoints is a corrupt vertex. So we have removed at most 2b nodes. Therefore, the graph G′ is nonempty, and the nodes in any connected component of G′ have the same type. Let c be the size of the maximum connected component of G′. We can conclude that s_c(G) ≤ 2r, since the 2r removed nodes form a possible c-vertex separator of G.

Notice there are at most b − r bad nodes in G′, by the same fact that at least one bad node is removed each round. By the upper bound in Lemma 1,

m(G) ≤ min_k {s_k(G) + k} ≤ s_c(G) + c ≤ 2r + c,

so c ≥ m(G) − 2r > 2b − 2r ≥ b − r. Since the connected component of size c exceeds the bad party's remaining budget b − r, it must be all good.

Algorithm 1 is linear time because it loops over all edges and removes any "bad" edge whose two reports are not both good (this takes O(|E|) time when we use a list with the "bad" edges at the front), and then counts the sizes of the remaining components (O(|V| + |E|) time), and thus is linear in the size of G. ∎

Remark 3.

Both bounds in Lemma 1 are tight. For the lower bound, consider a complete graph with an even number of nodes. For the upper bound, consider a complete bipartite graph with one side smaller than the other.

To elaborate on Remark 3, for the lower bound, in a complete graph with n nodes (n even), the critical number of bad nodes is m(G) = n/2, while s_k(G) + k = (n − k) + k = n for every k, so min_k {s_k(G) + k} = n = 2m(G).

For the upper bound, consider a complete bipartite graph G. The vertex set is partitioned into two sets A and B, where the induced subgraphs on A and B consist of isolated vertices, and every vertex in A is connected with every vertex in B. Without loss of generality, let |A| < |B|. The smallest sum s_k(G) + k is obtained with k = 1, i.e., min_k {s_k(G) + k} = |A| + 1, since removing all of A leaves only isolated vertices. We argue that |A| + 1 is also the minimal number of bad nodes needed to corrupt the graph. If the bad party controls all of A plus one node in B, it can prevent the identification of a good node. On the other hand, if the bad party controls at most |A| nodes, then we can always identify a good node. Specifically, we are in one of the following cases:

  1. The bad party does not control all of A. Since |B| > |A|, it also cannot control all of B, so there is a truthful node on each side. All the truthful nodes then form a single connected component whose members report each other as good, because any induced subgraph of a complete bipartite graph with at least one node on each side is connected. This component has size at least |B| > |A|, larger than any component the bad party can produce.

  2. The bad party controls all of A. In this case, the largest connected component of nodes that all report each other as good has size only 1. However, in this case, we conclude that the bad nodes must control all of A and no other node (having exhausted their budget of |A|). Hence, any node in B is good.

We end by noting that the efficient algorithm given in this section does not address the regime where the budget b of the bad party falls in [m(G)/2, m(G)). By the definition of m(G), the central agency can find at least one truthful node as long as b < m(G), for example by enumerating all possible assignments of good/bad nodes consistent with the reports, and outputting a node that is good in every consistent assignment. However, it is not clear that the central agency has a polynomial-time algorithm for doing this. Of course, one can always run Algorithm 1, check whether the output set is larger than b, and conclude that the output set is truthful if that is the case. However, there is no guarantee that the output set will be larger than b if b ≥ m(G)/2. We propose the following conjecture:

Conjecture 3.

Fix a graph G and suppose that the corrupt party has a budget b such that m(G)/2 ≤ b < m(G). The problem of finding one truthful node, given the graph G, the bad party's budget b, and the reports, is NP-hard.

3.2 SSE-Hardness of Approximation for m(G)

In this section, we show a hardness of approximation result for m(G) within any constant factor under the Small Set Expansion (SSE) Hypothesis [RS10]. Specifically, we prove Theorem 2.

Theorem 2.

For every constant α > 1, the following is true. Given a graph G and a threshold c, it is SSE-hard to distinguish between the case where m(G) ≤ c and the case where m(G) ≥ α·c. Or in other words, the problem of approximating the critical number of corrupt nodes for a graph to within any constant factor is SSE-hard.

In order to prove Theorem 2, we construct a reduction similar to [APW12], and show that the bad party can control the auxiliary graph of the Yes case of SSE with few nodes, while it cannot control the auxiliary graph of the No case of SSE without many more nodes.

Given an undirected d-regular graph G = (V, E) on n vertices, construct an auxiliary undirected graph H in the following way [APW12]. Let ℓ = d/2. For each vertex v ∈ V, make ℓ copies of v and add them to the vertex set of H; denote the copies of v by V_v. Denote the resulting set of vertices V' = ⋃_{v ∈ V} V_v. Each edge e ∈ E of G becomes a vertex in H, denoted u_e. Denote this set of vertices E'. In other words, V(H) = V' ∪ E'. There exists an edge between a vertex u_e ∈ E' and a vertex of V_v in H if e and v were an adjacent edge and vertex pair in G. Note that H is a bipartite d-regular graph with dn vertices.

Lemma 2.

Suppose that V can be partitioned into equi-sized sets S_1, …, S_t such that every S_i has small edge expansion. Then the bad party can control the auxiliary graph H with correspondingly few nodes.

Proof of Lemma 2.

Notice that by Remark 1, the total number of edges in G not contained in one of the S_i is small, bounded by the expansion guarantee.

This implies the following strategy for the bad party to control graph H. Control the vertex u_e if the edge e is not contained in any of the S_i's in the partition; call the set of such vertices W. For each i, let T_i be the set that contains all copies of nodes in S_i. Control one of the T_i, say T_1. Corrupt nodes in T_1 report their neighbors outside W as good, and report their neighbors in W as bad. Nodes in W can effectively report however they want; suppose they report every neighboring node as bad. Then, it is impossible to identify even one truthful node, since assigning any single T_i (together with W) as corrupt is consistent with the reports and within the bad party's budget.

If , this strategy amounts to controlling nodes on Notice, this number is guaranteed to be smaller than as long as , because the bad party controls less than of all the "edge vertices" , and controls less than of all the "vertex vertices" . ∎

Note that, different from the argument in [APW12], we cannot take the number of copies ℓ to be arbitrarily large. This is because when ℓ is large, |V'| and |E'| will not be comparable with the quantities in Lemma 3.

Lemma 3.

Let G be an undirected d-regular graph with the property that every sufficiently small vertex subset expands. If the bad party controls sufficiently few nodes on the auxiliary graph H constructed from G, we can always find a truthful node on H.

Proof of Lemma 3.

Assume towards contradiction that the bad party controls this many vertices of graph H and yet we cannot identify a truthful node.

Claim 1.

If the bad party controls this many vertices of graph H and it is impossible to identify a truthful node, then there exists a small set Z that separates H into sets B_1, …, B_q, each of bounded size.

Proof of Claim 1.

Since the bad nodes can control H with this many vertices, we obtain an upper bound on m(H). By the lower bound in Lemma 1, this bounds min_k {s_k(H) + k} as well. By the definition of s_k, there then exists a set of the claimed size whose removal separates the remainder of the graph into connected components of the claimed size. ∎

Let and be the sets guaranteed by Claim 1. Note we have taken , and thus . In other words, half of the are “vertex” vertices , and half are “edge” vertices . Therefore, with sufficiently small , , , for every . Therefore, we can merge the different s in Claim 1, and have two sets and such that and . Furthermore, and are disjoint, and , and cover .

Similar to the proof of Lemma 5.1 in [APW12], we let (resp. ) be the set of vertices such that some copy of appears in (resp. ). Let be the set of vertices such that all copies of appear in . Since both Furthermore, we observe that , which follows since . Now we can lower bound as follows.

where the first equality uses the fact that and that is disjoint from , and the following inequality uses the fact that , which follows by definition.

Since n is sufficiently large, we can find a balanced partition of the relevant vertex set into sets of comparable sizes. From the property of G assumed in Lemma 3 and the fact that G is d-regular, we know that

for some constant . In the first equality we use the fact that form a partition of . Thus

Note that since the two sides of the partition are both large and have no edge between them in G, the crossing edges all have to land as "edge vertices" in the separating set. In other words, for any crossing edge, the corresponding vertex has to be included in the separating set, inflating its size.

This contradicts the upper bound on the number of vertices in the separating set.

Combining Lemma 2 and Lemma 3, Theorem 2 follows in a standard fashion. We give a proof here for completeness.

Proof of Theorem 2.

Suppose for contradiction that there exists some constant α > 1 such that there is a polynomial-time algorithm that does the following: given an arbitrary graph H and a threshold c, it can distinguish between the case where m(H) ≤ c and the case where m(H) ≥ α·c. Then we can use this algorithm to decide the SSE problem as follows.

Fix the SSE parameters, with ε sufficiently small. Let G be an arbitrary input to the resulting instance of the SSE decision problem (from Conjecture 2). Construct the graph H from G as done at the beginning of Section 3.2.

If G was from the YES case of Conjecture 2, then the bad party can control H with few nodes (Lemma 2). If G was from the NO case of Conjecture 2, then it cannot (Lemma 3). We can invoke our algorithm to distinguish these two cases, since the two bounds are separated by a factor of α by design, which would decide the problem in Conjecture 2 in polynomial time. ∎

Now, we can obtain the following Corollary 1 from Theorem 2.

Corollary 1.

Assume the SSE Hypothesis and that P ≠ NP. Fix any constant α > 1. There does not exist a polynomial-time algorithm that takes as input an arbitrary graph G and outputs a set of nodes S with |S| ≤ α·m(G), such that corrupting S prevents the central agency from finding a truthful node.

In summary, the analysis in this section tells us that, given an arbitrary graph, it is hard for the bad party to corrupt the graph with near-minimal resources. On the other hand, if the budget of the bad party is a factor of two smaller than m(G), a good node can always be detected with an efficient algorithm, e.g. Algorithm 1.

3.3 An Approximation Algorithm for m(G)

In light of the SSE-hardness of approximating m(G) within any constant factor, and the close relation of m(G) to the k-vertex separator, we leverage the best known approximation result for the k-vertex separator to propose an approximation algorithm for m(G). It is useful as a test for central authorities to measure how corruptible a graph is. Notably, it is also a potential algorithm for a (computationally restricted) bad party to decide which nodes to corrupt.

The paper [Lee17] presents a bicriteria approximation algorithm for the k-vertex separator, with the guarantee that for each k, the algorithm finds a subset S such that |S| is within a logarithmic factor of the optimum s_k(G), and the induced subgraph on the remaining vertices is divided into connected components each of size at most O(k) vertices.

Proposition 1 (Theorem 1.1, [Lee17]).

For any fixed ε > 0, there is a polynomial-time bicriteria approximation algorithm for the k-vertex separator, with the guarantees described above.

Interested readers can refer to [Lee17] Section 3 for the description of the algorithm. Leveraging this algorithm for -vertex separator, we can obtain a polynomial-time algorithm for seeding corrupt nodes and preventing the identification of a truthful node.

Theorem 3 (Approximation Algorithm for m(G)).

There is a polynomial-time algorithm that takes as input a graph G and outputs a set of nodes S of size O(log n)·m(G), such that corrupting S prevents the central agency from finding a truthful node.

Proof.

The algorithm is as follows. Call the bicriteria algorithm for approximating the k-vertex separator in [Lee17] n times, once for each k in {1, …, n}. Each time the algorithm outputs a set S_k of vertices that divides the remaining graph into connected components with maximum size c_k. Choose the k for which the algorithm outputs the smallest value of |S_k| + c_k. The bad party can control S_k and one of the remaining connected components (the size of which is at most c_k), and be sure to prevent the identification of one good node, by the same argument that led to the upper bound in Lemma 1.

We now prove that this quantity approximates m(G). For each k, denote the algorithm's value by a_k = |S_k| + c_k. Then by the guarantee given in Proposition 1, we know

a_k = |S_k| + c_k ≤ O(log n)·s_k(G) + O(k) ≤ O(log n)·(s_k(G) + k).

Thus

m(G) ≤ min_k a_k ≤ O(log n)·min_k {s_k(G) + k} ≤ O(log n)·2m(G).

The first inequality holds because a budget of a_k nodes always suffices to prevent the identification of a good node, so m(G) ≤ a_k for every k. The last inequality follows from the fact that min_k {s_k(G) + k} ≤ 2m(G) by Lemma 1, taking ε to be a fixed constant, e.g. ε = 1/2. So min_k a_k provides an O(log n) approximation of m(G). The algorithm consists of n calls of the polynomial-time algorithm in Proposition 1, so it is also polynomial-time. ∎
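The search over k in the proof above can be sketched as follows. The bicriteria separator algorithm of [Lee17] is not reproduced here; `separator_oracle` is a hypothetical stand-in for Proposition 1 that we assume returns a separator together with a bound on the size of the largest remaining component:

```python
def approx_corruption_budget(n, separator_oracle):
    """Sketch of the algorithm in Theorem 3. `separator_oracle(k)` is an
    assumed stand-in for the bicriteria k-vertex-separator algorithm of
    [Lee17]: it returns (S, c), where S is the separator found and c
    bounds the largest remaining component. The bad party can corrupt S
    plus one remaining component, so |S| + c nodes always suffice; we
    return the best such budget over all k."""
    best = None  # (budget, separator, k)
    for k in range(1, n + 1):
        S, c = separator_oracle(k)
        budget = len(S) + c
        if best is None or budget < best[0]:
            best = (budget, S, k)
    return best
```

For a quick sanity check, one can pass a toy oracle such as `lambda k: (set(range(10 // k)), k)` and observe that the returned budget is the minimum of `10 // k + k` over k.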

4 Directed Graphs

Here we present the variant of our problem on directed graphs. As discussed in [AMP15], this is motivated by the fact that in various auditing situations, it may not be natural to assume that v can inspect u whenever u inspects v.

Given a directed graph G, we are asked to find m(G), the minimal number of corrupted agents needed to prevent the identification of a single truthful agent. Firstly, since undirected graphs are special cases of directed graphs, it is clear that the worst-case hardness of approximation results still hold. In this section, we define an analogous notion of vertex separator relevant to corruption detection for directed graphs, and state the version of Theorem 1 for directed graphs.

Definition 9 (Reachability Index).

On a directed graph G, say a vertex u can reach a vertex v if there exists a sequence of adjacent vertices (i.e. a directed path) which starts with u and ends with v. Let R(v) be the set of vertices that can reach the vertex v. Define the reachability index of v as |R(v)|, or in other words, as the total number of nodes that can reach v.
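Concretely, the reachability index can be computed with a breadth-first search over reversed edges. A minimal sketch (the function name and the convention of not counting v itself are ours; add 1 if the definition is meant to include v):

```python
from collections import defaultdict, deque

def reachability_index(edges, v):
    """Number of vertices that can reach v along directed edges.
    Convention choice (ours): v itself is not counted; add 1 if the
    intended definition includes the trivial empty path."""
    rev = defaultdict(set)
    for a, b in edges:
        rev[b].add(a)  # b is reachable from a in one step
    seen = {v}
    queue = deque([v])
    while queue:
        x = queue.popleft()
        for y in rev[x]:
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return len(seen) - 1  # exclude v itself
```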

Based on the notion of reachability index, we design the following algorithm, Algorithm 2, for detecting one good node on directed graphs:

Input: Directed graph

  • If node u reports node v as corrupt, remove both u and v and any incident edges (incoming and outgoing). Remove one such pair of nodes in each round. Continue until there are no bad reports left.

  • Call the remaining graph G′. Declare a vertex in G′ with maximum reachability index as good.

Algorithm 2 Finding one truthful vertex on a directed graph
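A self-contained sketch of Algorithm 2, under the same assumed encoding of reports as before (`reports[(u, v)]` is u's report on its out-neighbor v, with `True` meaning "good"):

```python
from collections import defaultdict, deque

def find_truthful_directed(nodes, edges, reports):
    """Sketch of Algorithm 2 on a directed graph (names and report
    encoding are ours)."""
    alive = set(nodes)
    # Step 1: while some surviving node reports an out-neighbor as
    # corrupt, remove both endpoints of that edge.
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            if u in alive and v in alive and not reports[(u, v)]:
                alive.discard(u)
                alive.discard(v)
                changed = True
    # Step 2: among surviving vertices, return one with maximum
    # reachability index in the surviving graph G'.
    rev = defaultdict(set)
    for a, b in edges:
        if a in alive and b in alive:
            rev[b].add(a)

    def reach(v):
        seen, queue = {v}, deque([v])
        while queue:
            x = queue.popleft()
            for y in rev[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        return len(seen)

    return max(alive, key=reach)
```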

Run Algorithm 2 on directed graph G, and suppose the first step terminates in r rounds. Then:

  • No remaining node reports an out-neighbor as corrupt

  • n − 2r nodes remain in graph G′

  • At most b − r bad nodes remain in the graph, where b is the bad party's budget, because each round in step 1 removes at least one bad node.

The main idea is that, if there exists a node v with reachability index larger than the number of remaining bad nodes, then more nodes claim (possibly indirectly) that v is good than there are bad nodes, which means at least one good node also reports v as good, and thus v must be good. In the rest of the section, we use this observation to generalize Theorem 1.

We define a notion similar to the k-vertex separator for directed graphs, show that it provides a 2-approximation for m(G) when G is a directed graph, and show that the equivalent of Theorem 1 also holds in the directed case.

Definition 10 (-reachability separator).

We say a set of vertices S is a k-reachability separator of a directed graph G if, after the removal of S and any adjacent edges, all vertices in the remaining graph have reachability index at most k.

Since in an undirected graph any pair of vertices can reach each other if and only if they belong to the same connected component, one can check that a k-reachability separator of an undirected graph is exactly equivalent to a k-vertex separator. Thus we use a similar notation, s_k(G), to denote the size of the minimal k-reachability separator of G.

Lemma 4 (2-Approximation Lemma on Directed Graphs).

For a directed graph G, min_k {s_k(G) + k} / 2 ≤ m(G) ≤ min_k {s_k(G) + k}, where s_k(G) denotes the size of the minimal k-reachability separator of G.
Proof.

The direction m(G) ≤ min_k {s_k(G) + k} is proved as follows. Let k* be the minimizing k. If the corrupt party is given s_{k*}(G) + k* nodes to allocate on G, it can first assign s_{k*}(G) nodes to a k*-reachability separator S, so that the remaining nodes have reachability index at most k*. Then it arbitrarily picks a remaining vertex v with maximum reachability index and assigns v plus its ancestor set R(v) as bad. The bad nodes in R(v) ∪ {v} report any neighbor in the separator as bad and any other neighbor as good. The nodes in the separator can effectively report however they want (e.g. report all neighboring nodes as bad).

It is impossible to detect a single good node, because every remaining node u can only be reached by R(u) and the separator S. For every remaining u, being assigned as corrupt or as good is consistent with the reports. If u is corrupt, R(u) is also assigned as corrupt; thus all nodes in R(u) ∪ {u} receive good reports from R(u) ∪ {u}, receive bad reports from S, and give bad reports to S. If u is truthful, all nodes still receive and give the same reports. So for every remaining u, assigning R(u) ∪ {u} as bad and the rest (besides S) as good is consistent with the observed reports. It is impossible to find a good node, by definition.

The proof of the direction min_k {s_k(G) + k} ≤ 2m(G) is given by Algorithm 2. Let there be m(G) bad nodes distributed optimally on the graph. By definition, these nodes prevent the identification of a good node. Run Algorithm 2, and suppose the first step terminates in r rounds. This means we have removed at least r bad nodes, and there are at most m(G) − r bad nodes left on G′. If there existed a node on G′ with reachability index larger than m(G) − r, then this node would have to be truthful, since there are not enough bad nodes left to corrupt all the nodes that can reach it, and all the reports in the remaining graph are good. Thus the reachability index is at most m(G) − r for every node in G′, and the set of 2r removed nodes must be an (m(G) − r)-reachability separator. Hence, we can bound min_k {s_k(G) + k} as follows:

min_k {s_k(G) + k} ≤ s_{m(G)−r}(G) + (m(G) − r) ≤ 2r + (m(G) − r) = m(G) + r ≤ 2m(G),

where the first inequality follows from taking the particular value k = m(G) − r, and the last from the fact that r ≤ m(G). ∎

Theorem 4.

Fix a directed graph G and suppose that the corrupt party has a budget b < m(G)/2. Then the central agency can identify a truthful node, regardless of the strategy of the corrupt party, and without knowledge of either b or m(G). Furthermore, the central agency's algorithm runs in linear time.

Proof of Theorem 4.

Suppose the corrupt party has budget b < m(G)/2. Run Algorithm 2, and suppose the first step terminates in r rounds. Notice each time we remove an edge with a bad report, at least one of its endpoints is a corrupt vertex. So we have removed at most 2b nodes. Therefore, the graph G′ is nonempty. Let c be the maximum reachability index in G′. The 2r removed nodes form a c-reachability separator, so by Lemma 4, m(G) ≤ s_c(G) + c ≤ 2r + c, and hence c ≥ m(G) − 2r > 2b − 2r ≥ b − r. On the other hand, since at most b − r bad nodes remain and there are no bad reports in G′, the reachability index of a bad node in graph G′ is at most b − r.

Then a vertex with reachability index c > b − r must be found by Algorithm 2, and it must be a truthful node. The linear runtime follows from the same analysis as in the proof of Theorem 1. ∎

5 Finding an Arbitrary Fraction of Good Nodes on a Graph

Being able to detect one good node may seem limited, but in fact, the same arguments and constructions can be adapted to show that approximating the critical number of bad nodes needed to prevent the detection of an arbitrary fraction of good nodes is also SSE-hard. In this section, we propose the definition of the t-remainder k-vertex separator, a vertex separator notion related to identifying an arbitrary number of good nodes, present a 2-approximation result, and prove hardness of approximation with arguments similar to the proof of Theorem 2 in Section 3.2.

We abuse notation and define m(G, t) to be the minimal number of bad nodes needed to prevent the identification of t good nodes.

Definition 11 (m(G, t)).

We define m(G, t) as the minimal number of bad nodes such that it is impossible to find t good nodes in G. In particular, m(G, 1) = m(G).

Definition 12 (-remainder -vertex Separator).

Consider the following separation property: after the removal of a vertex set S, the remaining graph is a union of connected components, where the connected components of size larger than k sum up to size less than t. We call such a set S a t-remainder k-vertex separator of G.

For any integer t, we denote the minimal size of such a set as s_k(G, t). In particular, a minimal k-vertex separator is a 1-remainder k-vertex separator, i.e., s_k(G, 1) = s_k(G).

Theorem 5.

Fix a graph G and the number of good nodes to recover, t. Suppose that the corrupt party has a budget b. If b < m(G, t)/2, then the central agency can identify t truthful nodes, regardless of the strategy of the corrupt party, and without knowledge of either b or m(G, t). Furthermore, the central agency's algorithm runs in linear time.

Input: Undirected graph G, the reports, and the number of good nodes to identify, t

  • If the pair of reports on some edge (u, v) is not (good, good), remove both u and v together with any incident edges. Remove one such pair of nodes in each round, until there are no bad reports left.

  • Suppose the previous step terminates in r rounds. In the remaining graph G′, rank the connected components from largest to smallest by size. Declare the largest component as good and remove it, continuing until we have declared at least t nodes as good.

Algorithm 3 Finding t truthful vertices on an undirected graph
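The declaration step of Algorithm 3 can be sketched as follows; the components would come from the same dispute-removal step as in Algorithm 1, and the function name is ours:

```python
def declare_good_nodes(components, t):
    """Sketch of step 2 of Algorithm 3: rank the surviving connected
    components by size and declare the largest ones good until at least
    t nodes have been declared."""
    declared = []
    for comp in sorted(components, key=len, reverse=True):
        if len(declared) >= t:
            break
        declared.extend(comp)
    return declared
```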
Proof of Theorem 5.

We claim that the central agency can use Algorithm 3, and output at least