The contagious spread of online social behaviors can often be attributed to peer influence, whereby the likelihood of adopting a behavior increases with exposure to past adoptions by friends in a network. Examples include the adoption of opinions (Muchnik et al., 2013), communication technologies (Karsai et al., 2014), memes (Yang and Leskovec, 2010), and the use of non-standard language (Goel et al., 2016).
Computational analysis of social influence is usually motivated by one of two distinct goals: prediction and explanation. To predict the future trajectory of event cascades, it is necessary to accurately model the underlying social processes, including peer influence (e.g., Bakshy et al., 2011). In some formulations of this problem, the goal is to predict the ultimate size of the cascade, using features of the cascade, its early adopters, and the overall social network (e.g., Cheng et al., 2014). In other cases, the goal is to predict who will adopt the behavior next, by estimating peer-to-peer influence parameters and overall susceptibility (e.g., Du et al., 2013). In general, these methods are applied in cases in which social influence is assumed to be present.
Another line of work focuses on explanation: estimating the causal effect of peer influence, and determining whether peer influence is implied by the evidence of a cascade or set of cascades. The detection of peer influence is best suited to an experimental framework, in which some individuals are randomly assigned to a “treatment condition” in which they are exposed to peer influence (Bond et al., 2012; Salganik et al., 2006). However, there are many contexts in which experimental methods cannot be applied, due to expense, lack of scalability, or the difficulty of replicating socially meaningful situations in an experiment: for example, it is not possible to measure the effect of peer influence on the passage of real legislation by a randomized experiment. These challenges motivate the development of observational techniques for detecting and measuring peer influence. However, influence is confounded with homophily — the tendency of individuals with similar properties to establish social network connections. Even when peer influence is absent, the presence of homophily can give rise to network-correlated behaviors which are indistinguishable from the effects of social influence (Shalizi and Thomas, 2011). This concern can be partially addressed by techniques from causal inference, which create comparisons that approximate a randomized experiment, under various limiting assumptions (e.g., Anagnostopoulos et al., 2008; Aral et al., 2009; La Fond and Neville, 2010). A key remaining question is whether such methods are robust to real-world phenomena such as missing data and misspecification.
In this paper, we present a new method that unifies predictive and explanatory accounts of social influence. The core of our approach is a discriminatively trained ranker, which learns to assign a high rank to nodes that are likely to host the next event in a cascade. Note that this is a simpler problem than modeling the probability of a cascade as a series of time-stamped events. This simplicity enables the application of powerful methods from supervised machine learning, which can account for homophily through social network node embeddings (Tang et al., 2015). To detect social influence, we train two ranking models: a base ranker, the most accurate ranker that we can build without including features that measure social influence, and an augmented ranker, which contains all features of the base ranker, plus features that measure social influence. We then compare the two rankers on held-out data: if the augmented ranker is more accurate, then this is evidence for social influence, under the usual assumption of no unobserved confounds. We validate this approach in a battery of synthetic data experiments, showing that it is well-calibrated (Type I error rate at or below the desired significance level) in the presence of homophily, self-excitation, missing data, and model misspecification. The method of ranker comparison obtains higher statistical power than a classical shuffle test across all scenarios, and outperforms a Hawkes process goodness-of-fit test under conditions of missing data and misspecification. Indeed, the Type I error rate of the Hawkes process test approaches one in some settings involving missing data or misspecification of the temporal kernel.
We also use our method to analyze two real-world cascade datasets. First, we consider the United States House of Representatives, which can be characterized by a social network in which legislators are linked when they share a major campaign donor, and by cascades in which events correspond to decisions to cosponsor a piece of legislation. We show that the campaign finance social network improves the ability to predict the trajectory of these cascades, demonstrating the influence of campaign finance on the early stages of the legislative process. Second, we consider a large-scale cascade of “scientific rumors” about the Higgs boson discovery. This dataset demonstrates the ability of our method to detect social influence and predict cascade trajectories in large networks.
2. Related Work
Social influence induces a systematic temporal pattern in the order of adoptions. This insight is leveraged by Anagnostopoulos et al. (2008), who propose the shuffle test to detect social influence. In this test, a statistic measuring social correlation is calculated from the observed sequence of adoptions, and is compared to the distribution of the same statistic calculated after repeatedly shuffling the order of adoptions. Shuffling eliminates any systematic patterns due to social influence, while preserving patterns that are due to homophily. If the observed social correlation differs significantly from the social correlation after shuffling, then social influence is a likely cause. The shuffle test assumes a static underlying network, but it can be generalized to dynamic networks (La Fond and Neville, 2010). The shuffle test offers a simple method to detect social influence, without assumptions about the data generation process. However, it makes no attempt to incorporate fine-grained temporal information or node-level covariates. Our method incorporates these features, and is found to be a more powerful statistical test on a range of synthetic data scenarios. The shuffle test is also purely explanatory, whereas our discriminative learning approach can also make predictions about future adoptions.
An alternative to shuffling is proposed by Aral et al. (2009), who divide the nodes in a network into control and treatment groups, depending on whether they have an adopter friend in their ego network. To account for homophily, units from the two groups are matched using propensity scores, which are based on demographic covariates. The average adoption rates of the matched groups are then compared. As we show in § 3, our discriminative learning model can capture homophily in the network structure. Our method can also incorporate personal attributes, but otherwise uses structural similarity as a proxy for homophily. As with the shuffle test, the method proposed by Aral et al. is primarily explanatory and lacks predictive ability.
An alternative approach is to try to recover a parametric influence network by modeling timestamped event cascades through the Hawkes Process (HP; Du et al., 2013; Yang and Zha, 2013; Zhao et al., 2015). If the excitation parameters are estimated to be non-zero, this can be viewed as evidence of social influence (Xu et al., 2016); alternatively, a goodness-of-fit test can be used to compare nested parametric Hawkes Process models, with and without access to features of the social network (Goel et al., 2016). Our discriminative learning model is similar in some respects: it is a learning based model, and also builds on the assumption that past events modulate the rate of future events. However, the Hawkes process is a generative model of events in a cascade, while our proposed approach is discriminative. As we show, violations of the assumptions of the HP generative model lead to incorrect inferences, with high Type-I error rates. In contrast, our discriminative modeling approach is agnostic to the event generation process, and is robust to missing data and relatively insensitive to misspecification. While there are targeted solutions for misspecification and missing data (e.g., He et al., 2016; Duong et al., 2011; Lokhov, 2016; Xu et al., 2017), there is no general approach that makes the Hawkes process immune to all such concerns.
Finally, our method is closely related to the concept of Granger Causality, which relates causation to the improvement of predictive accuracy (Granger, 1969). Granger Causality has been applied to Hawkes process models, with the goal of identifying non-zero dyad parameters (Xu et al., 2016). In contrast, our approach is aimed at testing for the presence of social influence throughout a large network, rather than on individual dyads. This makes it possible to formulate our approach through the relatively simpler framework of model comparison, rather than attempting to induce group-sparse excitation parameters.
3. Discriminative Model
Our objective is to detect social influence given a cascade of timestamped events from individual nodes in a social network. We do this by solving an auxiliary task: predicting which node will be the next to be activated in the cascade. This auxiliary task is solved by learning a discriminative ranking function, which scores each node by weights on features of the individual nodes and the cascade history. Observed confounds — variables that predict both the presence of social network connections and participation in the event cascade — can be incorporated into the ranker as features on individual nodes. (The usual assumption of no unobserved confounds is required (Shalizi and Thomas, 2011).) The question of whether there is social influence is then transformed into the proxy question of whether the prediction task is aided by the inclusion of features that measure social influence. This is similar to the principle of Granger causality (Granger, 1969), which states that X Granger-causes Y if prediction of Y is aided by knowledge of X.
3.1. Features for discriminative ranking
The ranking task is to order all nodes in the network by their likelihood of being the next node to be activated in the cascade. We compute a scoring function, which depends on static features of the node as well as the previous history. The static features can account for confounders, which relate to the base propensity of each node to participate in the cascade; the history features can account for both self-excitation and social influence. The scoring function is:

    ψ_t(i) = f(x_i; θ) + Σ_{(j, t_e): t_e < t} β · φ_{j→i} κ(t − t_e),    (1)

where x_i are static node-specific features and φ_{j→i} are dyadic features for node i. The features x_i capture intrinsic properties of the node; these features are static in all subsequent evaluations, but the model generalizes trivially to dynamic features. The function f transforms the features x_i into a scalar value, and is parameterized by θ; this function could be, e.g., a simple inner product of features and weights, or a multilayer neural network.

The features φ_{j→i} capture the properties of each dyad. Let (j, t_e) indicate an event whose source is j and whose time is t_e. The features for the ordered dyad (j → i) are written φ_{j→i}, and the associated weights are written β. As in the Hawkes process, these features are weighted by a temporal kernel κ(·), with bandwidth parameter b.
The ranker is applied in an online fashion, recomputing scores for each node at the time of every new event. While the scoring function is closely related to the intensity function of the Hawkes process, there is a key difference: a discriminative ranker is not a probabilistic model over event cascades, and there is therefore no need to compute integrals over the intensity function over inter-event periods.
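To make the scoring function concrete, the following is a minimal Python sketch. The function names, the inner-product form of f, and the dyadic-feature interface are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def exp_kernel(dt, b=1.0):
    """Exponential decay kernel kappa(dt) = exp(-b * dt); b is the bandwidth."""
    return np.exp(-b * dt)

def score(i, t, x, theta, history, dyad_feats, beta, b=1.0):
    """Score node i at time t: a static term f(x_i; theta) plus a
    kernel-weighted sum of dyadic features over all past events (j, t_e).
    Here f is a simple inner product; it could equally be a neural network."""
    static = float(np.dot(theta, x[i]))
    dynamic = 0.0
    for j, t_e in history:  # events observed so far
        if t_e < t:
            dynamic += float(np.dot(beta, dyad_feats(j, i))) * exp_kernel(t - t_e, b)
    return static + dynamic
```

As in the text, the ranker would recompute these scores for every node each time a new event arrives, and rank nodes in decreasing order of score.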
3.1.1. Intrinsic features.
The intrinsic features can act as a proxy for homophily between the nodes, which might otherwise confound the detection of social influence. In some cases, specific covariates are available: for example, the political party of each legislator, or the age of members of a social network. In other cases, the relevant covariates are unknown; we then assume that the latent confounds that drive the event cascades are related to the properties that drive the formation of social network ties. These properties can be captured by computing node embeddings, in which neighbors in the network are nearby in the embedding space. There are various ways of calculating node embeddings (Grover and Leskovec, 2016; Tang et al., 2015); in the analyses that follow, we use spectral embeddings obtained from the graph Laplacian (Donetti and Munoz, 2004). To increase the expressive power of the ranker, these embeddings are transformed into a scalar by a feedforward network f, with parameters θ.
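As an illustration of the embedding step, here is a sketch of spectral embeddings computed from the unnormalized graph Laplacian; the function name and the choice of the unnormalized Laplacian are our own assumptions:

```python
import numpy as np

def spectral_embeddings(adj, dim=2):
    """Embed nodes with eigenvectors of the unnormalized graph Laplacian
    L = D - A. The eigenvectors with the smallest nonzero eigenvalues place
    network neighbors close together in the embedding space."""
    adj = np.asarray(adj, dtype=float)
    degrees = adj.sum(axis=1)
    laplacian = np.diag(degrees) - adj
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    # Skip the constant eigenvector (eigenvalue ~ 0) and keep the next `dim`.
    return eigvecs[:, 1:dim + 1]
```

On a connected graph, the first retained column is the Fiedler vector, whose sign pattern already separates loosely connected communities.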
3.1.2. Dyadic features.
In addition to the intrinsic node features, the ranker also utilizes dyadic features extracted from past events. As in the Hawkes process, these features are multiplied by a decay kernel and aggregated over the entire history of past events. The decay kernel captures the intuition that events in the distant past should have less impact on the score than more recent events. In our experiments, we use a simple exponential decay kernel κ(Δt) = exp(−b Δt), with b acting as a bandwidth parameter. However, the decay kernel can be a more complex function, such as a linear combination of simple kernels, or a Gaussian process (Bacry et al., 2012). We incorporate two dyadic features: self-excitation, which is the tendency of a node to repeatedly activate after its first activation; and social influence, which is the tendency of a node to be activated by exposure from its neighbors. The complete set of features used in our model and their descriptions are given in Table 1.
3.2. Learning

To learn the ranking weights, we apply the WSABIE algorithm for minimizing an approximation to the WARP (Weighted Approximate Rank Pairwise) ranking loss (Usunier et al., 2009; Weston et al., 2011). Let A_t be the set of nodes which have an event at time t, let rank_t(i) be the rank of node i according to the scoring function ψ_t, and let the margin-penalized rank be,

    rank̄_t(i) = Σ_{j ∉ A_t} 1[ψ_t(j) + 1 ≥ ψ_t(i)].    (2)

This is equal to the number of margin violations for active node i.
Using these terms, we can compute the WARP loss at time t:

    L_t = Σ_{i ∈ A_t} err(rank̄_t(i)),    (3)

where err(·) is a ranking error function. We use,

    err(r) = 1[r ≥ 1],    (4)

which means that the error is one whenever the node is not top-ranked. However, the formalism enables a flexible class of alternative error functions (Usunier et al., 2009).
Calculating the loss in (3) naively is inefficient when the number of nodes is large. The WSABIE algorithm approximates this loss by repeatedly drawing inactive nodes at random until finding a node against which the activated node is not ranked higher by a margin of one. The rank of the activated node is approximated by the inverse of the number of samples required to find such a node. The overall loss function used by the WSABIE algorithm is given by,

    ℓ_t(i) = err(⌊(N − 1)/Q⌋) · max(0, 1 − ψ_t(i) + ψ_t(v)),    (5)

where N is the total number of nodes, Q is the number of trials required to find a margin violation, and v is the violating node. This loss is calculated at each time t, and the parameters of the scoring function are updated by taking a gradient step to minimize the loss.
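The sampling step described above can be sketched as follows for a single active node. The function name, the return convention, and the hinge-style surrogate are illustrative assumptions consistent with the loss in (5):

```python
import random

def wsabie_loss(scores, active, rng=None, max_trials=None):
    """Approximate the WARP loss for one active node by WSABIE sampling:
    draw inactive nodes at random until one violates the margin, i.e.
    scores[v] + 1 >= scores[active]. The rank is then approximated as
    floor((N - 1) / Q), where Q is the number of trials used.
    Returns (loss, violator); violator is None when no violation is found."""
    rng = rng or random.Random(0)
    n = len(scores)
    inactive = [j for j in range(n) if j != active]
    limit = max_trials if max_trials is not None else len(inactive)
    trials = 0
    while trials < limit:
        v = rng.choice(inactive)
        trials += 1
        if scores[v] + 1 >= scores[active]:
            approx_rank = (n - 1) // trials
            err = 1.0 if approx_rank >= 1 else 0.0   # the 0/1 error of (4)
            hinge = max(0.0, 1.0 - scores[active] + scores[v])
            return err * hinge, v
    return 0.0, None
```

A gradient step would then be taken on the returned loss with respect to the parameters of the scoring function.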
| Feature | Type | Description |
|---|---|---|
| emb | Node | The spectral embeddings for node i |
| self | Dyad | Self-excitation; active for node i if i itself has a previous event |
| social | Dyad | Social influence; active for node i if a network neighbor j of i has a previous event |
3.3. Model Comparison
To test for social influence, we compare the performance of two ranking models that are identical in all respects except one: the augmented ranker includes features that are activated under the condition of social influence, and the base ranker does not. For example, given a social network, the augmented ranker includes a feature that fires for node i at time t if any of the social network neighbors of i have an event before time t. These features are included in the dyadic feature vector in Equation 1. Each ranker is then applied to held-out data, and scored according to a ranking metric. In most cases, we use mean reciprocal rank (this includes all cases in the synthetic data experiments). In the legislative co-sponsorship setting (subsection 5.1), there can be multiple events at a single time point, so we use mean average precision instead.
To determine whether the performance difference between the two rankers is unlikely to have arisen by chance alone, we must apply a statistical significance test to compare their performance on a held-out test set. Of the various statistical significance tests that have been proposed for rankers (e.g., Cormack and Lynam, 2006; Sakai, 2006; Wilbur, 1994), we select the non-parametric permutation test (Smucker et al., 2007). In this test, the predictions of the two rankers are repeatedly exchanged (permuted), to create an empirical distribution of the difference in ranker performance under the null hypothesis that the two rankers are identical.
The right-tailed p-value is the fraction of permutations in which the difference in ranker performance was greater than the observed difference in the original unpermuted data. This tests the null hypothesis that adding social influence features does not improve ranking accuracy on held-out data, which is a proxy for the null hypothesis of no social influence. In the following section, we demonstrate the reliability of this proxy through a series of experiments on synthetic data.
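A minimal sketch of the paired permutation test on per-event metrics (e.g., reciprocal ranks) might look like the following; the function name and the mean-difference test statistic are illustrative assumptions:

```python
import random

def permutation_test(metric_a, metric_b, n_permutations=10000, seed=0):
    """Right-tailed paired permutation test. metric_a and metric_b hold a
    per-event score (e.g., reciprocal rank) for the same held-out events
    under rankers A and B. Under the null that the rankers are identical,
    each pair of scores is exchangeable, so we randomly swap pairs and
    count how often the permuted mean difference reaches the observed one."""
    assert len(metric_a) == len(metric_b)
    rng = random.Random(seed)
    observed = sum(a - b for a, b in zip(metric_a, metric_b)) / len(metric_a)
    hits = 0
    for _ in range(n_permutations):
        total = 0.0
        for a, b in zip(metric_a, metric_b):
            if rng.random() < 0.5:   # exchange the pair's labels
                a, b = b, a
            total += a - b
        if total / len(metric_a) >= observed:
            hits += 1
    return hits / n_permutations
```

When ranker A is clearly better, the observed difference is rarely matched by permuted data and the p-value is small; when the two rankers produce identical scores, the p-value is 1.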
4. Synthetic Data Evaluation
To test the validity and efficacy of our proposed test for social influence, we now describe a set of synthetic data experiments. The experiments are performed on a real network, using synthetic cascades. We generate some cascades without social influence, to test the Type I error rate of our method; other cascades are generated with social influence, to test the power. In both cases, we consider the impact of homophily, self-excitation, missing data, and misspecification of the model. The use of synthetic data makes it possible to quantify these characteristics precisely, before moving to real data in the next section.
We generate cascades of events using a multivariate Hawkes process (HP), a special kind of inhomogeneous Poisson process in which the rate of generating events is modulated by the history of past events (Hawkes, 1971). Like any Poisson process, the HP is described by an intensity function for every node in the network. The intensity of every node represents the instantaneous rate at which it is activated, and is given by,

    λ_i(t) = μ_i + Σ_{j=1}^{N} Σ_{t_e ∈ H_j(t)} α_{ji} κ(t − t_e),    (6)

where μ_i and α_{ji} are the parameters of the intensity function; N is the total number of nodes in the network; and H_j(t) is the set of events from node j before time t. The base rate μ_i is the rate at which node i generates events spontaneously, whereas α_{ji} is the pairwise influence of node j on node i in the network.¹

¹ The code can be found at https://github.com/sandeepsoni/MHP.
To generate cascades under conditions of social influence and various confounds, we specify the parameters in (6) as follows:

    μ_i = σ(c · u_i),    α_{ji} = a · A_{ji} + s · 1[j = i],    (7)

where u_i is the i-th entry of the second eigenvector of the Laplacian matrix of the network; σ is the sigmoid function, used to keep the base rate positive; and A is the adjacency matrix of the network. We can then generate cascades under various conditions of interest:
- Social influence: By setting a = 0, we generate cascades without social influence. These cascades are used to test the Type I error rate of our test. As a increases, so does the impact of social influence.
- Homophily: The similarity of nodes i and j is captured by the similarity of the second-eigenvector entries u_i and u_j. By conditioning the base adoption rates of the nodes on this eigenvector, it is possible to generate cascades in which events are strongly correlated with the network, even without social influence. This corresponds to the case in which nodes i and j form a friendship because they both share an interest, and then participate in cascades that reflect that same interest. By varying the parameter c, the importance of homophily in shaping the cascades can be increased or decreased.
- Self-excitation: For some behaviors, nodes will participate in a cascade repeatedly. Self-excitation occurs when a node's own participation in the cascade spurs further participation in the future. For instance, after learning a new slang term or hashtag and using it once, a user is likely to repeat its usage. This tendency can also be controlled, by setting s.
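Given intensity functions of the form in (6) with an exponential kernel, cascades can be simulated with Ogata's thinning algorithm. The following sketch is our own illustration, not the released simulation code:

```python
import math
import random

def simulate_hawkes(mu, alpha, b=1.0, horizon=50.0, seed=0):
    """Simulate a multivariate Hawkes process by Ogata's thinning.
    lambda_i(t) = mu[i] + sum over past events (j, t_e) of
    alpha[j][i] * exp(-b * (t - t_e)). Returns a list of (node, time)."""
    rng = random.Random(seed)
    n = len(mu)
    events = []
    t = 0.0

    def intensities(t):
        lam = list(mu)
        for j, t_e in events:
            decay = math.exp(-b * (t - t_e))
            for i in range(n):
                lam[i] += alpha[j][i] * decay
        return lam

    while t < horizon:
        lam_bar = sum(intensities(t))      # valid bound: intensity only decays
        t += rng.expovariate(lam_bar)      # candidate time for the next event
        if t >= horizon:
            break
        lam = intensities(t)
        total = sum(lam)
        if rng.random() <= total / lam_bar:   # accept with prob total / lam_bar
            u = rng.random() * total          # attribute the event to a node
            acc, node = 0.0, n - 1
            for i in range(n):
                acc += lam[i]
                if u <= acc:
                    node = i
                    break
            events.append((node, t))
    return events
```

The bound lam_bar is recomputed after every candidate, which keeps it valid because the intensity can only decay between accepted events.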
4.1. Network

Rather than generating a synthetic network, we use a real “mention” network from Twitter. This static, directed network was constructed as follows. First, we select all individuals who used a partisan political hashtag (e.g., #clintonkaine2016) between October 1-15, 2016. Then we identify all individuals who were mentioned by someone in this initial set. A directed edge (i, j) indicates that i mentioned j on Twitter during this time period. We then select the giant weakly connected component as the underlying network for all synthetic cascades.
4.2. Evaluation Metrics
We evaluate the performance of the ranking test along two criteria: validity and power.
A test is statistically valid if its p-values are well-calibrated: at the threshold α, the test should reject a true null hypothesis with probability less than or equal to α. To establish validity, we evaluate the test on multiple cascades where the null hypothesis is known to be true. The rate at which the test rejects the null hypothesis is the Type I error rate.
Conversely, a test has high statistical power if it consistently rejects a false null hypothesis. Failure to reject a false null hypothesis leads to a Type II error, and a powerful test should make few Type II errors. To establish power, we evaluate the test on multiple cascades where the null hypothesis is known to be false. The rate at which the test fails to reject the null hypothesis is the Type II error rate. An ideal test should be valid under all conditions and should have high statistical power.
4.3. Baseline Tests

We compare the performance of the ranking test against two well-known alternatives: (1) the shuffle test of Anagnostopoulos et al. (2008), and (2) a test that compares the goodness of fit between two parametric Hawkes processes (HP): one with access to social influence parameters, and one in which these parameters are clamped to zero. We now describe these tests in detail.
4.3.1. Shuffle test
The shuffle test infers social influence by calculating a measure of social correlation from an observed cascade, and comparing it with the distribution of the same measure after repeatedly shuffling the order of events in the cascade. This shuffling breaks the effect of social influence, providing an estimate of the amount of social correlation that can be attributed to factors other than influence. We use the infection risk as our measure of social correlation, which is calculated as the ratio of adopters to innovators:
- adopters are nodes that are activated only after at least one of their network neighbors was activated;
- innovators are nodes that are activated before any of their network neighbors.
To calculate the infection risk, only the first activation of each node is considered. The infection risk has been used as a measure of social correlation in other studies (e.g., Aral et al., 2009). We refer to this test as Shuffle.
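The shuffle test with the infection-risk statistic can be sketched as follows; the function names and the add-one smoothing of the p-value are our own assumptions:

```python
import random

def infection_risk(order, neighbors):
    """Ratio of adopters (first activated after a network neighbor) to
    innovators (first activated before any neighbor). `order` lists each
    node once, in order of first activation."""
    seen = set()
    adopters = innovators = 0
    for node in order:
        if any(nb in seen for nb in neighbors.get(node, ())):
            adopters += 1
        else:
            innovators += 1
        seen.add(node)
    return adopters / max(innovators, 1)

def shuffle_test(order, neighbors, n_shuffles=1000, seed=0):
    """Compare the observed infection risk against a null distribution
    obtained by shuffling the adoption order; returns a right-tailed
    p-value (with add-one smoothing so it is never exactly zero)."""
    rng = random.Random(seed)
    observed = infection_risk(order, neighbors)
    hits = 0
    for _ in range(n_shuffles):
        shuffled = list(order)
        rng.shuffle(shuffled)
        if infection_risk(shuffled, neighbors) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)
```

On a cascade that closely tracks the network structure, the observed infection risk is rarely matched by shuffled orders, so the p-value is small.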
4.3.2. HP goodness-of-fit test
The Hawkes process (HP) can be used to test for social influence by comparing the goodness of fit between two nested models: an HP that includes social influence parameters, and an HP that does not (e.g., Goel et al., 2016). To perform this test, we estimate the parameters of the HP described in equations 6 and 7 from training data. To test whether the goodness of fit is significantly improved by the addition of the social influence parameters, we use the likelihood ratio test (Wilks, 1938). Note that maximum likelihood estimation of the parameters effectively reverses the data generation process, and we therefore expect this baseline to perform well, except when there is missing data or misspecification in the data generation process. We refer to this test as HP.
4.4. Results

4.4.1. Full data
We first consider the case when every test has access to full data: all the events in the cascade and all the edges in the network are known to each test. We also assume that the generative process is correctly specified, meaning that the exponential decay kernel and the bandwidth parameter that modulate the rate of generation of future events are known during learning and testing. We will relax these assumptions later.
To check for validity, 100 cascades of 5000 events each are generated under conditions of no social influence (a = 0), varying homophily (c), and varying self-excitation (s). The null hypothesis of no social influence is true by design for every such cascade. Figure 1 shows the calibration of the tests when there is high homophily and low (Figure 1a) or high (Figure 1b) self-excitation. Both the ranking and shuffle tests are well-calibrated. The goodness-of-fit test of the HP produces conservative p-values, but satisfies the condition of validity, which is that the Type I error rate is bounded by the significance threshold. All three tests have low Type I error rates even under different amounts of homophily and self-excitation.
To check for statistical power, 100 cascades of 5000 events each are generated by varying social influence at fixed values of homophily and self-excitation. As shown in Figure 2, the power increases with social influence for all tests, as expected. The HP test is the most powerful across these conditions, and the shuffle test is the least powerful. The higher statistical power of the HP test should not come as a surprise: the data was generated in exactly the reverse manner to the estimation procedure, and in the absence of any noise, sampling, or misspecification, it should have optimal performance; in contrast, the ranking test is agnostic to the generative process. Because the ranking test outperforms the shuffle test, in the subsequent evaluations we compare the ranking test only to the HP test.
4.4.2. Model misspecification
Thus far, we have assumed that the generative process is correctly specified. Though this assumption helps gauge the optimal ability of the tests under favorable modeling scenarios, it is almost always false when modeling real-world phenomena. It is therefore important to determine how the tests fare under misspecification. One way this can happen is if the bandwidth parameter (b) of the temporal decay kernel is misspecified. We generated 100 cascades of 5000 events each for the influence (a > 0) and no-influence (a = 0) conditions, both with the highest settings of homophily and self-excitation. During generation, the bandwidth was fixed to a single true value, but we assumed that neither the HP learner nor the ranker knew this true value. We then varied the value of b used by each test, to understand its effect on validity and statistical power. Varying the bandwidth parameter has a natural interpretation: as b decreases, the scope of the history is effectively widened; increasing the parameter has the opposite effect. The results for both validity and power are shown in Figure 3.
For the HP test, increasing the bandwidth parameter causes the temporal kernel to decay sharply, nullifying the effect of spurious activations of neighbors from the past; this limits the Type I error rate, but also the power. Conversely, as the bandwidth parameter decreases, the impact of past events on each node's activation grows: irrelevant activations of neighbors in the distant past may be mistakenly deemed consequential, leading the test to overestimate the role of social influence and inflating the Type I error rate. The HP model is quite sensitive to this parameter, and misspecification severely undermines the validity of the test for practical purposes.
In contrast, the ranking test is robust to misspecification: for a wide range of bandwidth values, the ranker maintains a low Type I error rate and has considerable power. This is because the ranking objective requires only that the relative order of each node is maintained. Spurious events in the distant past affect many nodes, and the resulting changes in rank are not significant. While there is research on augmenting the Hawkes Process with nonparametric kernels that are learned from data (Zhou et al., 2013), such methods are complex to implement, and require large amounts of training data. In contrast, the ranking test can easily incorporate multiple kernels into the discriminative ranking function, thereby learning complex triggering patterns.
4.4.3. Missing data
We have thus far assumed that all events in the cascade are available. There are various reasons that this assumption can be violated in real data: full data collection is often too expensive; there are rate limits on the number of events that can be collected from public APIs of sites such as Twitter; individuals may erase past events in their history; data may be lost accidentally, as can happen when a server crashes during data acquisition; or data collection may have begun only after the cascade was initiated. Incomplete data, however, can diminish statistical power, giving the appearance of unprompted innovations to events that were in fact socially motivated. To quantify this phenomenon, we generate cascades with two different types of missing events: events missing at random, and events missing in contiguous blocks.²

² Other types of missing data, such as missing edges or missing nodes, should also be considered (Duong et al., 2011; Linderman et al., 2017). We leave the validation of our test against these types of missing data for future work.
Random missing events
We generate 100 cascades of varying lengths, and randomly drop events at a rate of 99%, matching a published estimate that Twitter's streaming API includes only about 1% of tweets (Morstatter et al., 2013). Figure 4 shows the power of the HP and ranking tests, as a function of cascade length. While both tests increase in power with the cascade length, the HP test is marginally less powerful across all cascade lengths, particularly on short cascades. The Type I error rate for both tests remains around 0.05 as cascade length increases, and is therefore not shown.
Doubly censored events
Instead of missing events at random, a contiguous block of events in a cascade can also be missing. This scenario was introduced by Xu et al. (2017), who considered sequences with a block of missing events at both the beginning and the end, resulting in what they call short doubly censored (SDC) sequences. The censoring scenario is also practically important, since for any given cascade, all events – from the first to the final – are not always available, or are expensive to obtain.³ Although Xu et al. provide a creative modeling strategy of “sampling and stitching” events onto SDC sequences to recover the generative model parameters, the adverse impact of such censoring on our tests, if any, is not obvious.

³ Again, for instance, the Twitter API places restrictions on collecting past data.
We generated 100 cascades of different lengths and dropped 99% of events, with half of the dropped events in a contiguous block at the beginning and the remaining half in a contiguous block at the end of the cascade. On censored cascades, the goodness-of-fit test for the HP suffered from a high Type I error rate, severely limiting its validity. This bias toward overestimating social influence is consistent with the findings of Xu et al. (2017), who note that maximizing the likelihood of an HP can lead to overfitting in this scenario. Conversely, the validity of the ranking test was not affected by censoring, because the ranking test does not rely on explaining the temporal distribution of events. However, the loss of information due to censoring does affect the power of the ranking test, especially when the cascade is short. On cascades of 5000 events censored at the 99% rate, the power of the ranker is 0.22, roughly one-fifth of its power when full data is available.
We summarize the main findings from the synthetic data experiments as follows:
The shuffle test consistently has low power compared to the HP goodness-of-fit test and the ranking test. This can be attributed to its use of only the first activation of each node, ignoring the time between activations.
The goodness-of-fit test has a low Type I error rate and high power in scenarios where (a) the model is correctly specified and (b) complete data is available.
However, the test is highly sensitive to misspecification of the time kernel, and to missing events. These cases can make the test statistically invalid or hurt its power.
The ranking test is robust across all scenarios: it is valid in every setting; it nearly matches the goodness-of-fit test’s power when complete data is available; model misspecification has little impact on its validity or power; and it remains reasonably powerful under different types of missing data.
5. Real Data Evaluation
The empirical evidence from experiments on synthetic data shows that the ranking test is a flexible and robust test for detecting social influence. While the synthetic experiments demonstrate the explanatory ability of our approach, in this section we highlight its predictive ability by applying it to two real-world datasets: cosponsorship of bills in the U.S. House of Representatives (§ 5.1) and the spread of rumors around the discovery of the Higgs boson (§ 5.2).
Table 2. Network statistics.

| Statistic | Legislator Network | Friends-Follower Network |
|---|---|---|
| Giant component size (% nodes) | 98.1 | 78.9 |
5.1. Legislative co-sponsorship and political finance networks
A key step in the legislative process occurs when a bill introduced by a sponsor gains endorsements from other legislators, called cosponsors. Cosponsorship decisions are important markers of wider support: they signal a legislator’s own expertise, and provide information about a bill’s content and the party, ideological, or constituency base for whom the bill advocates. However, cosponsorship is also a low-cost means of position taking, reflective of favor-trading, vote-buying, and special interest politics (Kessler and Krehbiel, 1996). An open question is whether these decisions are influenced by campaign donations, for example by facilitating special access to legislators (Kalla and Broockman, 2016). To test this question, we construct an affiliation network that links legislators who share common donors, and apply our discriminative ranking test to sequences of cosponsorship decisions.
5.1.1. Cascade data
We collect cosponsorship sequences on bills introduced by representatives in the 115th U.S. House of Representatives, using ProPublica’s Congress API (https://projects.propublica.org/api-docs/congress-api/). We only consider bills from the House of Representatives, and ignore resolutions, since these are not presented to the President to become law. We filter out bills with fewer than five or more than 200 cosponsors, resulting in a total of 1022 bills. The cosponsorship sequence for each bill – a sequence of events with a legislator as the source and the date as the time – is treated as a separate cascade.
5.1.2. Social network data
For every representative, we collect a list of their top 20 campaign donors from public sources (https://www.opensecrets.org/). We construct a legislator network in which a pair of legislators are connected by an edge if they share a donor. Statistics for this network are given in Table 2. For every legislator, we also collect their party affiliation and the state they represent; this information is included as node-level covariates for the ranking model.
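This network construction can be sketched as follows, assuming hypothetical donor sets keyed by legislator id (the paper uses each member's top 20 donors).

```python
from itertools import combinations

def shared_donor_network(donors_by_legislator):
    """Build an undirected edge set: two legislators are connected
    if their donor sets have any donor in common."""
    edges = set()
    for a, b in combinations(sorted(donors_by_legislator), 2):
        if donors_by_legislator[a] & donors_by_legislator[b]:
            edges.add((a, b))
    return edges

# Illustrative donor lists (hypothetical, not real data):
donors = {
    "A": {"acme_pac", "union_x"},
    "B": {"union_x", "firm_y"},
    "C": {"firm_z"},
}
edges = shared_donor_network(donors)  # A and B share union_x
```

Note this pairwise scan is quadratic in the number of legislators, which is fine at the scale of the House (435 members) but would need an inverted donor-to-legislator index at larger scales.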
5.1.3. Problem setup and evaluation
To obtain evidence of network influence, we use the ranking test to compare the predictive performance of two rankers. We randomly divide the set of bills into a training and test set of equal sizes. All events from bills in the training set are used for estimating the parameters of the rankers, and both rankers make predictions for every event from bills in the test set. One distinctive feature of this data is the temporal resolution of the cascades: events are timestamped only by date, and there are often multiple new cosponsors on the same date. This creates a problem during training, since our discriminative ranker is based on the mean reciprocal rank (MRR), which assumes a single event at any time. (It is possible to formulate the WARP loss with an alternative error function, allowing multiple events at each “query”, or time point. We leave this for future work.) We resolve this by adding a small amount of random “noise” to the time of each event during training, separating simultaneous events while maintaining the order in which they occurred. We evaluate predictive performance by calculating the mean average precision (MAP), rather than MRR, for each bill in the test set.
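The tie-breaking jitter can be sketched as follows; `jitter_times` is a hypothetical helper, and the noise scale is assumed to be small relative to the one-day spacing between distinct timestamps, so jittered events never overtake the next date.

```python
import random

def jitter_times(events, scale=1e-3, seed=0):
    """Break timestamp ties by adding cumulative positive noise within
    each tied group, preserving the original event order."""
    rng = random.Random(seed)
    jittered = []
    prev_time = None
    offset = 0.0
    for node, t in events:
        if t == prev_time:
            offset += rng.uniform(0, scale)  # push later ties further out
        else:
            offset = 0.0                     # new distinct timestamp
            prev_time = t
        jittered.append((node, t + offset))
    return jittered

# Three cosponsors on day 1 and one on day 2 (toy data):
events = [("a", 1.0), ("b", 1.0), ("c", 1.0), ("d", 2.0)]
separated = jitter_times(events)
```

After jittering, every event has a distinct timestamp and the within-day ordering of the original sequence is unchanged.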
The features from Table 1 are not all applicable to cosponsorship cascades. For instance, self-excitation is not applicable, since representatives do not cosponsor the same bill twice. However, cosponsorship decisions can be affected by party or state affiliation. For the baseline ranker, we use two dyadic features for every node, which are activated if past cosponsors are from the same party or the same state respectively. We also learn per-representative parameters, which capture the general tendency of each legislator to cosponsor legislation. The social feature from Table 1 is then added to the baseline feature set to construct our socially-augmented ranker.
The campaign finance network significantly improved ranking performance overall, by a paired t-test on the mean average precisions across cascades. As shown in Figure 5, the social network features improve ranking performance on out-of-sample data. The improvements are particularly strong for the earliest cosponsors, who may be more likely to be motivated by campaign finance considerations. The improvement was also statistically significant on 16.4% of individual bills, despite the fact that the typical cascade length was only slightly greater than eight.
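A paired t-test on per-bill scores can be sketched as follows; the MAP values below are illustrative placeholders, not the actual experimental results, and `paired_t` is a hypothetical stdlib-only helper.

```python
import math
from statistics import mean, stdev

def paired_t(xs, ys):
    """Paired t-statistic for matched per-cascade scores: mean of the
    per-cascade differences over its standard error."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Illustrative per-bill MAP scores for the two rankers:
social_map = [0.36, 0.29, 0.45, 0.30, 0.41, 0.27, 0.38, 0.30]
base_map   = [0.30, 0.25, 0.40, 0.28, 0.35, 0.22, 0.31, 0.27]
t_stat = paired_t(social_map, base_map)
```

Pairing by bill matters here: per-bill difficulty varies widely, and the paired test removes that shared variance from the comparison.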
5.2. Scientific rumors
On July 4, 2012, the discovery of the Higgs boson was announced. Before the official announcement, rumors had begun circulating on social media that the newly discovered particle might indeed be the Higgs boson. These “scientific rumors” on Twitter were collected by De Domenico et al. (2013) to study the dynamics behind their spread. We applied our discriminative ranker to predict the users most likely to spread the rumor.
5.2.1. Cascade and social network data

De Domenico et al. collected tweets from users who mentioned any phrase from a set they identified as being about the Higgs boson’s discovery. These tweets span July 1 to July 7, 2012. They also collected the friends-follower, retweet, reply, and mention networks between the users of these tweets. Network statistics for this data (available at http://snap.stanford.edu/data/higgs-twitter.html) are given in Table 2. For our work, we used the strongly connected giant component of the friends-follower network and the cascade of the first 5000 retweets, spanning approximately the first two days of the spread.
5.2.2. Problem setup and evaluation
To predict the users who spread the rumor, we train the discriminative ranker on all but the last 100 events of the cascade, using all features described in Table 1. The ranker then predicts which users produce the remaining 100 events of the cascade, and the quality of the ranking is evaluated with average precision.
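Average precision for this setup can be computed as in the following sketch, where `relevant` stands in for the set of users who actually produce the held-out events (toy inputs, not the real ranker output).

```python
def average_precision(ranked_users, relevant):
    """Average precision of a ranked list: mean of precision@k over
    the ranks k at which a relevant user appears."""
    hits = 0
    precision_sum = 0.0
    for k, user in enumerate(ranked_users, start=1):
        if user in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / max(len(relevant), 1)

# Toy example: relevant users "a" and "c" are found at ranks 1 and 3.
ap = average_precision(["a", "b", "c", "d"], {"a", "c"})
```

Average precision rewards placing the eventual adopters near the top of the ranking, which matches the prediction task here more closely than a single-rank metric like MRR.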
We set up two simple baselines. The first baseline compares the quality of predictions with chance: users are simply ranked randomly. In the second baseline, users are ranked by their number of past events, falling back to their degree when they have no past events, with the intuition that degree is a proxy for the rate of tweeting.
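The second baseline's ranking rule can be sketched as follows (hypothetical inputs; here ties among active users are also broken by degree, one plausible reading of the rule).

```python
def baseline_rank(users, past_counts, degree):
    """Rank users by past event count, descending; users with no past
    events are ordered among themselves by network degree."""
    return sorted(
        users,
        key=lambda u: (past_counts.get(u, 0), degree.get(u, 0)),
        reverse=True,
    )

# Toy data: "a" and "b" have retweeted before; "c" and "d" have not.
users = ["a", "b", "c", "d"]
past = {"a": 3, "b": 1}
deg = {"a": 2, "b": 10, "c": 7, "d": 4}
ranking = baseline_rank(users, past, deg)
```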
Table 3. Average precision results.

| Method | Average precision |
|---|---|
| Baseline 1 (random) | 0.0002 |
| Baseline 2 (by tweeting rate) | 0.0008 |
The results are shown in Table 3: the discriminative ranker is not only better than chance, but also learns about the social behavior, improving over the second baseline, which uses only the past activity of the users. Although the improvement is small, it is still notable considering the size of the network.
6. Discussion

The experiments on both synthetic and real data demonstrate the utility of discriminative ranking for the prediction of cascades and the detection of social influence. Despite the strong empirical evidence in favor of the ranking test, it is not without limitations. First, the ranker comparison is based on a permutation test, which is a non-parametric, distribution-free approach. If there is little variance between the predictions of the two rankers, the permutation distribution of the test statistic has its mass concentrated at only a few points. This problem becomes severe in at least three conditions: when the network is small; when there are few events in the heldout data; or when the base ranker has very high predictive performance, leaving little room for improvement from the addition of the social features. While the conservative nature of the permutation test does not affect its general validity (Edgington and Onghena, 2007), it can reduce its power. Future work may explore alternative methods for comparing rankers.
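A paired sign-flip permutation test of the kind described here can be sketched as follows (a minimal illustration, not the paper's exact procedure). Note how, when the two rankers' predictions are identical, the null distribution collapses onto a single point and the p-value is 1.

```python
import random

def permutation_test(scores_a, scores_b, n_permutations=10000, seed=0):
    """Two-sided paired permutation (sign-flip) test on per-cascade
    score differences between two rankers; returns a p-value."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    count = 0
    for _ in range(n_permutations):
        # Under the null, each paired difference is equally likely
        # to have either sign; flip signs at random and re-sum.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            count += 1
    return (count + 1) / (n_permutations + 1)

# Identical predictions: every sign-flip reproduces the observed statistic.
p_degenerate = permutation_test([0.3, 0.5, 0.7], [0.3, 0.5, 0.7])

# A consistent per-cascade advantage yields a small p-value.
p_clear = permutation_test([0.9] * 10, [0.1] * 10)
```

The degenerate case illustrates the concentration problem above: with no variance in the paired differences, the test cannot reject regardless of sample size.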
Second, any claim about the detection of influence in the presence of confounds should be properly situated. Identifying social influence in the presence of homophily using only observational data has been proven to be impossible in general, unless all homophily-related confounds are observed (Shalizi and Thomas, 2011). We make this assumption as well when we use the structural similarity captured by node embeddings as a proxy for homophily. While this impossibility result limits the utility of any test based on observational data, the ranking test still has advantages over other such tests: the discriminative nature of the ranker allows a variety of features to be added easily to improve predictive strength, and the ranking test makes no assumptions about the distribution of the test statistic, making it attractive for hypothesis testing. Recently, researchers have argued that methods for studying social systems should have predictive strength in addition to explanatory power (Hofman et al., 2017). We argue that the discriminative ranker, and the ranking test for detecting social influence, are an example of that vision.
Finally, there are several avenues for future research. We motivated the synthetic data experiments by practical constraints on the data; more complex scenarios can occur in the real world, and their impact on our test needs to be tested empirically. For example, missing data can itself be a function of homophily, if similar users decide to make some of their social media posts private. It would be interesting to study such scenarios in the future. Another direction for future work is to adapt our discriminative ranker to dynamic networks, instead of the static networks we considered here. In our discriminative approach, this should be possible by using node embeddings at the input layer, which can then be updated during learning as the network changes over time.
- Anagnostopoulos et al. (2008) Aris Anagnostopoulos, Ravi Kumar, and Mohammad Mahdian. 2008. Influence and correlation in social networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 7–15.
- Aral et al. (2009) Sinan Aral, Lev Muchnik, and Arun Sundararajan. 2009. Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences 106, 51 (2009), 21544–21549.
- Bacry et al. (2012) Emmanuel Bacry, Khalil Dayri, and Jean-François Muzy. 2012. Non-parametric kernel estimation for symmetric Hawkes processes. Application to high frequency financial data. The European Physical Journal B 85, 5 (2012), 157.
- Bakshy et al. (2011) Eytan Bakshy, Jake M Hofman, Winter A Mason, and Duncan J Watts. 2011. Everyone’s an influencer: quantifying influence on twitter. In Proceedings of the fourth ACM international conference on Web search and data mining. ACM, 65–74.
- Bond et al. (2012) Robert M Bond, Christopher J Fariss, Jason J Jones, Adam DI Kramer, Cameron Marlow, Jaime E Settle, and James H Fowler. 2012. A 61-million-person experiment in social influence and political mobilization. Nature 489, 7415 (2012).
- Cheng et al. (2014) Justin Cheng, Lada Adamic, P Alex Dow, Jon Michael Kleinberg, and Jure Leskovec. 2014. Can cascades be predicted?. In Proceedings of the 23rd international conference on World wide web. ACM, 925–936.
- Cormack and Lynam (2006) Gordon V Cormack and Thomas R Lynam. 2006. Statistical precision of information retrieval evaluation. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM.
- De Domenico et al. (2013) Manlio De Domenico, Antonio Lima, Paul Mougel, and Mirco Musolesi. 2013. The anatomy of a scientific rumor. Scientific reports 3 (2013), 2980.
- Donetti and Munoz (2004) Luca Donetti and Miguel A Munoz. 2004. Detecting network communities: a new systematic and efficient algorithm. Journal of Statistical Mechanics: Theory and Experiment 2004, 10 (2004), P10012.
- Du et al. (2013) Nan Du, Le Song, Manuel Gomez Rodriguez, and Hongyuan Zha. 2013. Scalable influence estimation in continuous-time diffusion networks. In Advances in neural information processing systems. 3147–3155.
- Duong et al. (2011) Quang Duong, Michael P Wellman, and Satinder Singh. 2011. Modeling information diffusion in networks with unobserved links. In Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third Inernational Conference on Social Computing (SocialCom), 2011 IEEE Third International Conference on. IEEE, 362–369.
- Edgington and Onghena (2007) Eugene Edgington and Patrick Onghena. 2007. Randomization tests. CRC Press.
- Goel et al. (2016) Rahul Goel, Sandeep Soni, Naman Goyal, John Paparrizos, Hanna Wallach, Fernando Diaz, and Jacob Eisenstein. 2016. The Social Dynamics of Language Change in Online Networks. In International Conference on Social Informatics. Springer International Publishing, 41–57.
- Granger (1969) Clive WJ Granger. 1969. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society (1969), 424–438.
- Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 855–864.
- Hawkes (1971) Alan G Hawkes. 1971. Spectra of some self-exciting and mutually exciting point processes. Biometrika 58, 1 (1971), 83–90.
- He et al. (2016) Xinran He, Ke Xu, David Kempe, and Yan Liu. 2016. Learning Influence Functions from Incomplete Observations. In Advances in Neural Information Processing Systems. 2065–2073.
- Hofman et al. (2017) Jake M Hofman, Amit Sharma, and Duncan J Watts. 2017. Prediction and explanation in social systems. Science 355, 6324 (2017), 486–488.
- Kalla and Broockman (2016) Joshua L Kalla and David E Broockman. 2016. Campaign contributions facilitate access to congressional officials: A randomized field experiment. American Journal of Political Science 60, 3 (2016), 545–558.
- Karsai et al. (2014) Márton Karsai, Gerardo Iniguez, Kimmo Kaski, and János Kertész. 2014. Complex contagion process in spreading of online innovation. Journal of The Royal Society Interface 11, 101 (2014), 20140694.
- Kessler and Krehbiel (1996) Daniel Kessler and Keith Krehbiel. 1996. Dynamics of cosponsorship. American Political Science Review 90, 3 (1996), 555–566.
- La Fond and Neville (2010) Timothy La Fond and Jennifer Neville. 2010. Randomization tests for distinguishing social influence and homophily effects. In Proceedings of the 19th international conference on World wide web. ACM, 601–610.
- Linderman et al. (2017) Scott Linderman, Yixin Wang, and David Blei. 2017. Bayesian Inference for Latent Hawkes Processes. In Advances in Neural Information Processing Systems.
- Lokhov (2016) Andrey Lokhov. 2016. Reconstructing parameters of spreading models from partial observations. In Advances in Neural Information Processing Systems.
- Morstatter et al. (2013) Fred Morstatter, Jürgen Pfeffer, Huan Liu, and Kathleen M Carley. 2013. Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose.. In Proceedings of ICWSM.
- Muchnik et al. (2013) Lev Muchnik, Sinan Aral, and Sean J Taylor. 2013. Social influence bias: A randomized experiment. Science 341, 6146 (2013), 647–651.
- Sakai (2006) Tetsuya Sakai. 2006. Evaluating evaluation metrics based on the bootstrap. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 525–532.
- Salganik et al. (2006) Matthew J Salganik, Peter Sheridan Dodds, and Duncan J Watts. 2006. Experimental study of inequality and unpredictability in an artificial cultural market. Science 311, 5762 (2006), 854–856.
- Shalizi and Thomas (2011) Cosma Rohilla Shalizi and Andrew C Thomas. 2011. Homophily and contagion are generically confounded in observational social network studies. Sociological Methods & Research 40, 2 (2011), 211–239.
- Smucker et al. (2007) Mark D Smucker, James Allan, and Ben Carterette. 2007. A comparison of statistical significance tests for information retrieval evaluation. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management. ACM, 623–632.
- Tang et al. (2015) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1067–1077.
- Usunier et al. (2009) Nicolas Usunier, David Buffoni, and Patrick Gallinari. 2009. Ranking with ordered weighted pairwise classification. In Proceedings of the 26th annual international conference on machine learning. ACM, 1057–1064.
- Weston et al. (2011) Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI, Vol. 11. 2764–2770.
- Wilbur (1994) W John Wilbur. 1994. Non-parametric significance tests of retrieval performance comparisons. Journal of Information Science 20, 4 (1994), 270–284.
- Wilks (1938) Samuel S Wilks. 1938. The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics 9, 1 (1938).
- Xu et al. (2016) Hongteng Xu, Mehrdad Farajtabar, and Hongyuan Zha. 2016. Learning Granger causality for Hawkes processes. In International Conference on Machine Learning.
- Xu et al. (2017) Hongteng Xu, Dixin Luo, and Hongyuan Zha. 2017. Learning Hawkes Processes from Short Doubly-Censored Event Sequences. In ICML.
- Yang and Leskovec (2010) Jaewon Yang and Jure Leskovec. 2010. Modeling information diffusion in implicit networks. In Proceedings of ICDM.
- Yang and Zha (2013) Shuang-Hong Yang and Hongyuan Zha. 2013. Mixture of mutually exciting processes for viral diffusion. In International Conference on Machine Learning.
- Zhao et al. (2015) Qingyuan Zhao, Murat A Erdogdu, Hera Y He, Anand Rajaraman, and Jure Leskovec. 2015. Seismic: A self-exciting point process model for predicting tweet popularity. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1513–1522.
- Zhou et al. (2013) Ke Zhou, Hongyuan Zha, and Le Song. 2013. Learning Social Infectivity in Sparse Low-rank Networks Using Multi-dimensional Hawkes Processes. In Proceedings of AISTATS.