With the rise of social media, where both positive and negative interactions take place, signed graphs, whose study was initiated by Heider, Cartwright, and Harary [8, 19, 18], have become prevalent in graph mining. A key graph mining problem is the edge sign prediction problem, which aims to predict whether an interaction between a pair of nodes will be positive or negative [22, 23]. Recent works have developed numerous heuristics for this task that perform relatively well in practice [22, 23].
In this work we propose a theoretical model for the edge sign prediction problem that is inspired by active learning and the famous balance theory: “the friend of my enemy is my enemy”, or “the enemy of my enemy is my friend” [8, 15, 19, 36]. Specifically, we model the edge sign prediction problem as a noisy correlation clustering problem [5, 25, 24], where we are able to query a pair of nodes to test whether they belong to the same cluster (edge sign +) or not (edge sign −). The query fails to return the correct answer with some probability. Correlation clustering is a basic data mining primitive with a large number of applications, ranging from social network analysis [18, 22] to computational biology. The details of our model follow.
Model. Let V be a set of n items that belong to two clusters. Let R and B be the sets/groups of red and blue nodes respectively, where V = R ∪ B and R ∩ B = ∅. For any pair of nodes u, v define σ(u, v) = +1 if u and v belong to the same cluster, and σ(u, v) = −1 otherwise (i.e., σ(u, v) = −1 if u is in a different cluster than v). The coloring function is unknown and we wish to recover the two sets by querying pairs of items. (We need not recover the labels, just the clusters.) The noise in the edge observations is iid, with corruption probability q < 1/2 for all pairs. The oracle returns the correct sign σ(u, v) with probability 1 − q, and the flipped sign −σ(u, v) with probability q.
Equivalently, for each query we receive the correct answer with probability 1 − q, where q < 1/2 is the corruption probability. Our goal is to answer the following question.
Can we recover the clusters efficiently with high probability by performing a small number of queries?
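Before turning to the results, it may help to see the query model concretely. The following is a minimal Python sketch of the oracle (the names `make_oracle`, `color`, and the seed are our own illustrative choices, not part of the formal model); `q` plays the role of the corruption probability:

```python
import random

def make_oracle(color, q, seed=0):
    """One-shot noisy oracle over a ground-truth 2-coloring.

    color: dict node -> 0/1 cluster label (hidden from the algorithm).
    q: corruption probability, assumed q < 1/2.
    Answers are cached so that querying the same pair twice returns the
    same (possibly corrupted) sign, matching the one-query constraint.
    """
    rng = random.Random(seed)
    cache = {}

    def oracle(u, v):
        key = (min(u, v), max(u, v))
        if key not in cache:
            true_sign = +1 if color[u] == color[v] else -1
            cache[key] = true_sign if rng.random() > q else -true_sign
        return cache[key]

    return oracle

# Toy instance: nodes 0-3 in one cluster, 4-7 in the other, 20% corruption.
color = {i: (0 if i < 4 else 1) for i in range(8)}
oracle = make_oracle(color, q=0.2)
ans = oracle(0, 5)
assert ans in (+1, -1)
assert oracle(0, 5) == ans  # repeated queries give the same answer
```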
The constraint of querying a pair of nodes only once in the presence of noise appears not only in settings where a repeated query is constrained to give the same answer but naturally in more complex settings. For example, in crowd-sourcing applications repeated querying does not help much in reducing errors [26, 27, 38], and in biology testing for one out of several millions of potential interactions in the human protein-protein interaction network involves both experimental noise, and a high cost.
Main results. Our two theoretical results show that we can recover the clusters with high probability (whp; an event holds whp if it occurs with probability 1 − o(1)) in polynomial time. Our first result is stated as the next theorem.
There exists a polynomial-time algorithm with the above query complexity that returns both clusters whp.
Our algorithm improves the current state of the art due to Mazumdar and Saha. Specifically, their information-theoretically optimal algorithm requires quasi-polynomial runtime, which is unlikely to be improved assuming the planted clique conjecture. On the other hand, their computationally efficient algorithms require more queries. Our algorithm is optimal for all but tiny values of the bias, i.e., as long as the first term of our query complexity dominates (asymptotically) the second term.
We also provide an additional algorithm that is sub-optimal with respect to both the number of queries and the runtime. Nonetheless, we believe that our algorithm is of independent interest (i) for the novelty of the techniques we develop, and (ii) for the insights that suggest the use of signed edge-disjoint paths as features for predicting whether an interaction between two agents in an online social network will be positive or negative. Our second algorithm is non-adaptive, i.e., it performs all queries upfront, in contrast to our first algorithm. Also, the algorithm itself is simple, using breadth first search as its main algorithmic primitive. Our result is stated as Theorem 2.
Let , and . There exists a polynomial time algorithm that performs edge queries and recovers the clustering whp for any bias .
Notice that when the bias is constant, the stated number of queries asymptotically suffices to recover the clustering whp. Our algorithm is path based, i.e., in order to predict the sign of an edge, it carefully creates sufficiently many paths between its endpoints. While our algorithm (see Section 3 for the details) is intuitive, its analysis involves mathematical arguments that may be of independent interest. Our analysis significantly improves a previous result by the first two authors.
Inspired by our path-based algorithm, we use edge-disjoint paths of short length in a heuristic way to predict the sign of an edge in a given signed network. Specifically, we perform logistic regression using edge-disjoint paths of short length as a class of features, in addition to the features introduced by Leskovec et al., to predict positive and negative links in online social networks. Our experimental findings across a wide variety of real-world signed networks suggest that such paths provide additional useful information to the classifier, with paths of length three being the most informative. The improvement we observe is especially pronounced for edges with no common neighbors.
2 Related Work
Clustering with Noisy Queries. Closest to our work is the recent work of Mazumdar and Saha. Specifically, the authors also study Problem 1, as well as the more general version where the number of clusters may be larger than two. Each oracle query provides a noisy answer on whether two nodes belong to the same cluster or not. They provide an algorithm that recovers all clusters above a certain size, but whose runtime is quasi-polynomial and hence impractical, and is unlikely to be improved under the planted clique hardness assumption. They also design a computationally efficient algorithm that runs in polynomial time at the cost of more queries. Finally, they provide a non-adaptive algorithm for this setting as well.
Signed graphs. Fritz Heider introduced the notion of a signed graph in the context of balance theory. The key subgraph in balance theory is the triangle: any set of three fully interconnected nodes whose product of edge signs is negative is not balanced. A complete graph is balanced if every one of its triangles is balanced. Early work on signed graphs focused on graph-theoretic properties of balanced graphs. Harary proved the famous balance theorem, which characterizes balanced graphs as graphs whose nodes split into two groups, with positive edges within each group and negative edges across.
Predicting signed edges. Since the rise of social media, there has been surging interest in understanding how users interact with each other. Leskovec, Huttenlocher, and Kleinberg formulate the edge sign prediction problem as follows: given a social network with signs on all its edges except for the sign on the edge from node u to node v, how reliably can we infer this missing sign from the rest of the network? In their original work, Leskovec et al. proposed a machine learning framework to solve the edge sign prediction problem. They trained a logistic regression classifier using 23 features in total. Specifically, the first seven features are the following: the positive and negative out-degrees of node u, the positive and negative in-degrees of node v, the total out- and in-degrees of nodes u and v respectively, and the number of common neighbors (forgetting directions of edges) between u and v. The latter quantity was referred to as the embeddedness of the edge, and we will follow the same terminology. In addition to these seven features, Leskovec et al. used a 16-dimensional count vector, with one coordinate for each possible triad configuration between u and v. Given a directed edge (u, v) and a third node w connected to both, there are two directions for the edge between u and w and two possible signs for this edge, and similarly for v and w, giving 16 possible triads. The 16 possible triads are shown in Table 1.
In the original work of Leskovec et al., the classifier is evaluated only on edges whose endpoints have embeddedness at least 25. However, these kinds of thresholds on the embeddedness discard a non-negligible fraction of edges in a graph. For instance, the fraction of edges with zero embeddedness is 29.83% and 6.23% in the Slashdot and Wikipedia online social networks respectively (see Table 2). Edges with small embeddedness are “hard” to classify, because triads tend to be a significant feature for sign prediction. The lack of common neighbors, and therefore of triads, raises the importance of degree-based features for these edges, and these features are known to introduce some damaging bias (see prior work for an explanation).
We will see in Section 4 (perhaps against intuition) that edge-disjoint paths of length three may be even more informative than triads. For example, in the Wikipedia social network, if we train a classifier using only triads we obtain 57% accuracy, whereas if we train a classifier using only paths of length 3, we obtain 74.06% accuracy.
Correlation Clustering. Bansal et al. studied Correlation Clustering: given an undirected signed graph, partition the nodes into clusters so that the total number of disagreements is minimized. This problem is NP-hard [3, 35]. Here, a disagreement can be either a positive edge between vertices in two clusters or a negative edge between two vertices in the same cluster. Note that in Correlation Clustering the number of clusters is not specified as part of the input. The case when the number of clusters is constrained to be at most two is known as 2-Correlation-Clustering.
We remark that the notion of imbalance studied by Harary is the 2-Correlation-Clustering cost of the signed graph. Mathieu and Schudy initiated the study of noisy correlation clustering. They develop various algorithms for the case where the graph is complete, both for a random and for a semi-random model. Later, Makarychev, Makarychev, and Vijayaraghavan proposed an algorithm for general graphs under a semi-random model. For more information on Correlation Clustering, see the recent survey by Bonchi et al.
Planted bisection model. The following well-studied bisection model is closely connected to our model. Suppose that there are two groups (clusters) of nodes. A graph is generated as follows: each edge within a cluster appears with one probability, and each edge across the two clusters with another, smaller probability. The goal is to recover the two clusters given such a graph. If the two clusters are balanced, i.e., each cluster has half of the nodes, then one can recover the clusters whp; see [28, 39, 2]. Hajek, Wu, and Xu proved that under perfect balance the average degree has to scale logarithmically in the number of nodes for exact recovery. Also, they showed that using semidefinite programming (SDP), exact recovery is achievable at this threshold.
Notice that if (i) we have two balanced clusters, and (ii) we remove all negative edges from a signed graph generated according to our model, then one can apply such techniques to recover the clusters. We observe that in this regime the lower bound of Hajek et al. still applies. The techniques we develop in Section 3 work independently of cluster size constraints.
Other Techniques. Chen et al. [10, 11] also consider a closely related model and provide a method that can reconstruct the clustering for random binomial graphs with sufficiently many edges. Their method exploits low-rank properties of the cluster matrix, and requires certain conditions, including conditions on the imbalance between clusters; see [11, Theorem 1, Table 1]. Their method is based on a convex relaxation of a low-rank problem. Mazumdar and Saha similarly study clustering with an oracle in the presence of side information, such as a Jaccard similarity matrix.
Cesa-Bianchi et al. take a learning-theoretic perspective on the problem of predicting signs. They use the correlation clustering objective as their learning bias, and show that the risk of the empirical risk minimizer is controlled by the correlation clustering objective. Chiang et al. point out that the work of Candès and Tao can be used to predict signs of edges, and also provide various other methods, including singular value decomposition based methods, for the sign prediction problem. The incoherence is the key parameter that determines the number of queries, and is equal to the group imbalance. The number of queries needed for exact recovery under our model is therefore prohibitive when clusters are imbalanced.
3 Proposed Method
Pythia2Truth, Theorem 1. We describe the algorithm Pythia2Truth that achieves the guarantees of Theorem 1. The algorithm arbitrarily chooses two disjoint sets A and B of appropriate size. Then, it performs all possible queries between the nodes of A ∪ B. The algorithm then uses the set of noisy labels to make a guess for the true relative sign of each pair u, v ∈ A. This works as follows: for any given pair u, v, each w ∈ B casts a vote equal to the product of the observed labels for (u, w) and (w, v); this vote is +1 if the two labels agree, and −1 if they differ. The prediction is +1 if the majority of votes is +1, and −1 otherwise.
The aforementioned steps ensure that the predicted signs are correct for all pairs in A whp. Clearly, at least half of the nodes of A come from one of the two clusters. This set of nodes is found by computing the largest connected component (which is in fact a clique) of the graph induced by the positive predicted edges within A. This set serves as a seed set. For each remaining node we perform all queries between that node and the seed set. If the majority of the oracle answers is positive, we add the node to the seed set's cluster. The procedure outputs this cluster and its complement as the true clusters. Now we prove the correctness of our proposed algorithm. First, we prove the following lemma.
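The voting rule described above can be sketched as follows (a hypothetical sketch; `predict_sign` and the `answers` dictionary are our own names, with `answers` holding the one-shot noisy oracle replies):

```python
def predict_sign(u, v, voters, answers):
    """Majority-vote estimate of whether u and v share a cluster.

    answers[(a, b)] (with a < b) holds the one-shot noisy oracle reply
    for the pair {a, b}.  Each voter w casts answers[u,w] * answers[w,v];
    this vote is correct with probability (1-q)^2 + q^2 > 1/2, where q is
    the corruption probability, so the majority is correct whp once the
    number of voters is large enough.
    """
    def a(x, y):
        return answers[(min(x, y), max(x, y))]

    votes = sum(a(u, w) * a(w, v) for w in voters if w not in (u, v))
    return +1 if votes >= 0 else -1
```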
Consider any pair of nodes, and let the estimate of their relative sign be the majority of the votes described above, cast by a sufficiently large set of voters. Then, the estimate is correct with probability at least the stated bound.
Consider any pair of nodes u, v, and for each voter w let X_w be an indicator random variable that is equal to 1 if the product of the two noisy labels for the pairs (u, w) and (w, v) equals the true label. This happens precisely when both labels are correct or both are flipped, i.e., with probability (1 − q)^2 + q^2, where q denotes the corruption probability. For notational simplicity, let X be the sum of these indicators, and notice that the majority vote fails iff X is at most half the number of voters. Using Chernoff bounds, we obtain that the probability of misclassification is bounded by
A straightforward corollary of Lemma 1, derived by taking a union bound over all pairs of nodes, is that our algorithm predicts the labels of all such interactions correctly whp. Using Lemma 1 we are also able to prove the correctness of our algorithm.
Proof of Theorem 1.
Using Lemma 1, we obtain that all pairwise interactions within the queried set are correctly labeled with high probability. By the pigeonhole principle, one of the two clusters contributes at least half of the nodes of this set. This set can easily be found: since within each cluster all predicted labels are positive, disregarding the negative labels results in at most two connected cliques. We can find the largest such clique in linear time (since one step of BFS finds all other nodes); this clique is the seed set.
Let u be any node outside the seed set. We perform all possible queries between u and the seed set, and we decide that u belongs to the seed set's cluster if the majority of the oracle answers is positive. Define an indicator random variable for each such query that is equal to 1 if the oracle answer for the corresponding pair is correct, and 0 otherwise, and let X be the sum of these indicators. The probability of failure is bounded by
By combining the above results with a union bound, our proposed algorithm succeeds whp in recovering both clusters. ∎
The total runtime of our method is that simplifies to .
A path-based approach, Theorem 2. Before we go into the mathematical details (cf. Section 5), we describe how the algorithm behind Theorem 2 works. We perform queries uniformly at random and predict all possible edge signs under our model, as stated in Theorem 2. To predict the sign of a node pair, our algorithm performs, at a high level, two steps. First, we construct a subgraph of the resulting graph. This subgraph is constructed using breadth-first search (BFS), and consists of two isomorphic trees, each one rooted at one of the two nodes of the pair. Pairs of nodes that map to each other under the isomorphism are treated as matched pairs; the matched leaves of the two trees are connected by edge-disjoint paths. More details are given in Section 5; the subgraph is shown in Figure 1.
Given the subgraph, our algorithm estimates the relative coloring of pairs of nodes recursively, working from the leaves of the trees up to the roots. That is, we first estimate the relative coloring of the matched leaves based on the path between them, and then, moving toward the roots, we estimate each matched pair based on a majority vote derived from its children. More formally, the estimate for each matched pair is given by the algorithm below. (Formally, this algorithm defines these random variables.)
Base case: For a pair of matched leaf nodes, we define the estimate to be the product of the observed signs along the edge-disjoint path connecting them; that is, the estimate is based just on the observations from that path.
Induction on depth: For nodes at depth in , let be children of (at depth ). Then, define
Our inductive approach collapses the path between each pair of matched children into a single edge, whose sign we estimate based on our previous estimates. Then, in this “collapsed” graph, we take the majority vote over all (disjoint) paths. At the end, we output the estimate computed at the roots. Using Fourier-analytic techniques, we prove in Section 5 that
A union bound over all pairs yields Theorem 2. Observe that, algorithmically, we do not need to perform all predictions to recover the two clusters; any set of correctly predicted pairs that forms a spanning tree on the nodes suffices.
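The recursion above can be sketched in a few lines; the nested-dictionary encoding of matched pairs is our own illustrative representation, not the paper's notation:

```python
def estimate(node):
    """Recursive sign estimate for a matched pair of tree nodes.

    A leaf pair carries {'path': [s1, ...]}: the observed noisy signs on
    the edge-disjoint path joining the two matched leaves.  An internal
    pair carries {'children': [(s_left, s_right, child), ...]}, where
    s_left and s_right are the observed signs of the two tree edges down
    to the matched child pair.  Leaves multiply the signs on their path;
    internal pairs take a majority vote over their children.
    """
    if 'path' in node:
        prod = 1
        for s in node['path']:
            prod *= s
        return prod
    votes = sum(sl * sr * estimate(child)
                for sl, sr, child in node['children'])
    return +1 if votes >= 0 else -1
```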
A machine learning formulation. Our algorithm relies heavily on paths to predict the sign of an edge. Inspired by this result, we use paths as an informative feature in the context of predicting positive and negative links in online social networks. Specifically, we enrich the machine learning formulation proposed by Leskovec et al. by adding four new global features as follows: for each edge, we find a number of edge-disjoint paths of length three that connect its endpoints, and similarly we find edge-disjoint paths of length four. We calculate the product of the weights of each path and tally the number of positive and negative products for each path length. We add these four counts as four new dimensions. (We also tried paths of length five, but they are not as informative and are also more computationally expensive, so we do not study such paths henceforth.) We ignore directions of edges both for computational efficiency and in order to avoid introducing too many features, as a path has many possible directed versions, exponentially many in its length. We describe some key elements of the framework for completeness.
Features: In addition to our four new global features, we use the 23 local features of Leskovec et al. to predict the sign of an edge: the degree-based features, the embeddedness, i.e., the number of common neighbors of the two endpoints (in an undirected sense), and a 16-dimensional count vector, with one coordinate for each possible configuration of a triad.
We train a logistic regression classifier that learns a standard logistic model over our 27-dimensional feature vector.
We create balanced datasets so that random guessing results in 50% accuracy. We perform 10-fold cross validation, i.e., we create 10 disjoint folds, each consisting of 10% of the total number of edges. For each fold, we use the remaining 90% of the edges as the training dataset for the logistic regression. We report average accuracies over these 10 folds.
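The evaluation protocol can be sketched with Scikit-learn as follows; the features here are random stand-ins for the 27 real features (so the printed accuracy is not comparable to the reported numbers), but the 10-fold cross-validation structure matches the text:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 27))        # stand-in for the 27-dimensional features
w = rng.normal(size=27)             # hidden linear signal
y = (X @ w + rng.normal(scale=2.0, size=n) > 0).astype(int)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=10)  # 10 disjoint folds, accuracy
print(round(scores.mean(), 3))
```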
4 Experimental Results
4.1 Experimental Setup
Experimental setting. Since finding the maximum number of edge-disjoint paths of short length is NP-hard, we implement a fast greedy heuristic: to find edge-disjoint paths of length ℓ (ℓ = 3, 4 in our experiments) between two nodes, we discard edge directionality and start BFS from one endpoint. As soon as we find a path of length ℓ to the other endpoint, we check whether its edges have been removed from the graph using a hash table; if not, we add the path to our collection, remove its edges from the graph, add them to the hash table, and continue. At termination, we count how many positive and negative paths exist in our collection. To train a classifier, we use logistic regression. For this purpose we use Scikit-learn.
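The greedy heuristic can be sketched as follows. For simplicity this sketch uses a bounded-depth search in place of BFS and a Python set in place of the hash table, so it is an illustration of the idea rather than the exact implementation:

```python
def greedy_disjoint_paths(adj, u, v, length):
    """Greedily collect edge-disjoint u-v paths with exactly `length` edges.

    adj: undirected adjacency dict, node -> set of neighbors (edge
    directions discarded, as in the text).  A bounded-depth search stands
    in for the BFS described above; each accepted path's edges are marked
    used (the role of the hash table) so later paths must avoid them.
    """
    used = set()

    def edge(a, b):
        return (min(a, b), max(a, b))

    def search(node, path):
        if len(path) - 1 == length:
            return path if node == v else None
        for nxt in adj[node]:
            if edge(node, nxt) in used or nxt in path:
                continue
            found = search(nxt, path + [nxt])
            if found:
                return found
        return None

    paths = []
    while True:
        p = search(u, [u])
        if p is None:
            break
        for a, b in zip(p, p[1:]):
            used.add(edge(a, b))
        paths.append(p)
    return paths
```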
Datasets. Table 2 shows various publicly available online social networks (OSN) we use in our experiments together with the number of nodes and the number of edges . We present in detail our findings for the first two datasets described in the following. The results for the other graphs are very similar.
Slashdot is a news website. Nodes correspond to users, and edges to their interactions. A positive sign means that a user likes another user’s comments.
Wikipedia is a free online encyclopedia, created and edited by volunteers around the world. Nodes correspond to editors, and a signed link indicates a positive or negative vote by one user on the promotion of another.
Machine specs. All experiments run on a laptop with 1.7 GHz Intel Core i7 processor and 8GB of main memory.
Code. Our code was written in Python. A demo of our code is available as a Python notebook online at github/Prediction.ipynb.
Dataset              Nodes      Edges      Type
Slashdot (Feb. 21)   82 144     549 202    OSN
Wikipedia            7 118      103 747    OSN
Epinions             119 217    841 200    OSN
Slashdot (Nov. 6)    77 350     516 575    OSN
Slashdot (Feb. 16)   81 867     545 671    OSN
Highlands tribes     16         58         SN
4.2 Empirical findings
We experiment with various combinations of the 27 features described in Section 3. All refers to using all 27 features; Triads to the 16-dimensional vector of triad counts; Deg to the degree features; Tr. (short for triangles) to the number of common neighbors; Leskovec et al. to the 23 features used in their work; and P3, P4 to the numbers of negative and positive edge-disjoint paths of length 3 and 4 respectively. A combination of the form P3+P4 means using the union of these features, for example, counts of positive and negative edge-disjoint paths of lengths 3 and 4.
Figures 2(a), (b) show the performance of our classifier using different combinations of features, broken down by a lower bound on the embeddedness. For the Slashdot dataset, we observe that when we classify all edges (embeddedness at least 0), P3 performs better than Triads, i.e., 68.8% vs. 57.8%. Also, the performance of a Triads-based classifier is not monotone as a function of the embeddedness lower bound. For example, when the embeddedness is at least 10 the accuracy is 88.9%, whereas when it is at least 25 it becomes 86.1%. In general, however, the prediction problem becomes easier as the embeddedness increases. Also, using all features, i.e., adding the four new features P3, P4 to the existing Leskovec et al. features, results in the best possible performance. Finally, paths of length 3 are more informative than paths of length 4. This is clearly seen from the logistic regression coefficients shown in Figure 3(c). We also observe that different types of triads can have significantly different regression coefficients, and that the coefficients depend significantly on the graph, as seen in Figures 3(c) and 3(f).
Figures 3(a), (d) show the average accuracy of predicting edge signs for edges with embeddedness equal to zero for the Slashdot and Wikipedia datasets respectively. When we use Triads, the predictive accuracy is only about as good as random guessing, i.e., 50%. P3 results in 65.74% and 71.96% accuracy, and P4 in 50.78% and 69.90% accuracy, for Slashdot and Wikipedia respectively. We observe that using all features leads to the best possible performance: 78.63% and 80.92% accuracy respectively for the two datasets. The importance of paths of length 3 and 4 for edges with zero embeddedness is seen from the logistic regression coefficients in Figures 3(b), (e).
5 Algorithmic Analysis
We use the following notation: n denotes the number of nodes, and d denotes the average degree of the query graph. We perform in total the number of queries stated in Theorem 2, and for simplicity we let the bias be a constant, independent of n. Finally, we use the fact that the diameter of the resulting random graph is small whp.
5.1 Subgraph construction
The next lemma follows from standard Chernoff bounds (and a union bound over vertices).
Let be a random binomial graph. Then whp all vertices have degree greater than .
Now we proceed to our construction of sufficiently many almost edge-disjoint paths. Our construction is based on standard techniques in random graph theory [6, 14, 16, 37]; we include the full proofs for completeness.
Let where . Fix and . Then, whp there does not exist a subset , such that and .
Let be a rooted tree of depth at most and let be a vertex not in . Then with probability , has at most neighbors in , i.e., .
Let be a rooted tree of depth at most and let consist of , the neighbors of in plus the ancestors of these neighbors. Set . Then and . It follows from Lemma 3 with and , that we must have with probability . ∎
We show that by growing trees iteratively we can construct sufficiently many edge-disjoint paths for sufficiently large.
For all pairs of vertices there exists whp a subgraph of the query graph as shown in Figure 1. The subgraph consists of two isomorphic vertex-disjoint trees rooted at the two vertices, each of the same depth, and both with the same branching factor. The leaves of the two trees are matched under a natural isomorphism, and between each pair of matched leaves there is a short path. The paths are edge-disjoint.
Because we have to carry out this construction for all pairs of vertices, we note without further comment that likely (resp. unlikely) events will be shown to occur with probability sufficiently close to one (resp. to zero) to survive a union bound.
To find the subgraph shown in Figure 1 we grow tree structures as shown in Figure 4. Specifically, we first grow a tree from using BFS until it reaches depth . Then, we grow a tree starting from again using BFS until it reaches depth . Finally, once trees have been constructed, we grow trees from the leaves of and using BFS for depth . We analyze these processes, explaining in detail for and outlining the differences for the other trees. We use the notation for the number of vertices at depth of the BFS tree rooted at .
First we grow the tree rooted at the first vertex. As we grow the tree via BFS from a vertex at one depth to vertices at the next, certain bad edges may point to vertices already in the tree. Lemma 4 shows that with high probability there are at most 10 bad edges emanating from any vertex.
Hence, we obtain the recursion
Therefore the number of leaves satisfies
We can make the branching factors exact by pruning. We do this so that the trees are isomorphic to each other. A similar argument applies to the second tree; the only difference is that now we also call an edge bad if its other endpoint lies in the first tree. This immediately gives
and the required conclusion.
Similarly, from each leaf of the two trees we grow trees of the required depth, using the same procedure and arguments as above. Lemma 4 implies that there are at most 20 edges from the vertex being explored to vertices in any of the trees already constructed (at most 10 to the first tree plus any trees rooted at its leaves, and another 10 for the second). The number of leaves of each such tree now satisfies
The result is similar for .
Observe next that BFS does not condition on the edges between the leaves of the trees grown in the last phase. That is, we do not need to look at these edges in order to carry out our construction. On the other hand, we have conditioned on the occurrence of certain events to imply a certain growth rate. We handle this technicality as follows. We go through the above construction and halt if we ever find that we cannot expand by the required amount. Consider the event that we do not halt the construction, i.e., that we never fail the conditions of Lemmas 3 or 4. This event holds whp, and so,
We conclude that whp there is always an edge between each and thus a path of length at most between each . ∎
Using elementary data structures, our algorithm runs in total expected run time .
5.2 Algorithm Correctness
Recall the recursive definition of the estimates from Section 3, and note that at any level of the tree, the estimates are independent across the nodes of that level. (This is true in the base case by path-disjointness, and is preserved by the induction.) The key Lemma 6 follows. In simple terms, it shows that the bias of our estimator improves by a multiplicative factor at each level.
Suppose that for all at depth , we have
Then, for all at depth , we have
for some universal .
The proof invokes the Majority Bias Lemma (see Lemma 8) that we prove at the end of this section.
It is more convenient to work with the bias
By the recursive definition,
For any single child the estimate has the stated bias. Then by Lemma 8, taking the majority over the children's estimates amplifies the bias, as desired. ∎
To conclude the analysis, we show in Lemma 7 that performing this amplification over all the levels of the tree boosts the bias to a constant. Then we are done, because the root takes the majority of many independent coins, each with constant bias, and so the estimate is correct with high probability. The following amplification result holds:
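The amplification effect is easy to verify numerically. The parameters below (15 votes per level, initial bias 0.05) are illustrative choices, not the values from the analysis:

```python
from math import comb

def majority_correct_prob(d, p):
    """Probability that the majority of d independent votes, each correct
    with probability p, is correct (d should be odd to avoid ties)."""
    return sum(comb(d, k) * p**k * (1 - p)**(d - k)
               for k in range((d + 1) // 2, d + 1))

# A small bias 1/2 + 0.05 is driven to near-certainty by a few levels of
# 15-way majorities.
p = 0.5 + 0.05
for level in range(4):
    p = majority_correct_prob(15, p)
print(p > 0.99)
```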
For nodes that are levels up from the leaves, we have that
Note that at a leaf , the bias is
where is the path from , of length at most .
Then we apply the amplification lemma inductively for levels, starting with this bias at the leaves. It suffices to show that
This gives the required inequality; solving for the depth, it holds for our choice of parameters as long as the bias is a constant. ∎