On the Complexity of Opinions and Online Discussions

02/19/2018 ∙ by Utkarsh Upadhyay, et al. ∙ Max Planck Institute for Software Systems ∙ Oath Inc.

In an increasingly polarized world, demagogues who reduce complexity down to simple arguments based on emotion are gaining in popularity. Are opinions and online discussions falling into demagoguery? In this work, we aim to provide computational tools to investigate this question and, by doing so, explore the nature and complexity of online discussions and their space of opinions, uncovering where each participant lies. More specifically, we present a modeling framework to construct latent representations of opinions in online discussions which are consistent with human judgements, as measured by online voting. If two opinions are close in the resulting latent space of opinions, it is because humans think they are similar. Our modeling framework is theoretically grounded and establishes a surprising connection between opinion and voting models and the sign-rank of a matrix. Moreover, it also provides a set of practical algorithms to both estimate the dimension of the latent space of opinions and infer where opinions expressed by the participants of an online discussion lie in this space. Experiments on a large dataset from Yahoo! News, Yahoo! Finance, Yahoo! Sports, and the Newsroom app suggest that unidimensional opinion models may often be unable to accurately represent online discussions, provide insights into human judgements and opinions, and show that our framework is able to circumvent language nuances such as sarcasm or humor by relying on human judgements instead of textual analysis.


1 Introduction

People join online discussions to, on the one hand, express their own opinions and, on the other hand, approve or disapprove of the opinions expressed by others. In this context, there is a wide variety of online platforms that enable their users to approve or disapprove of each other's comments explicitly using, e.g., upvotes and downvotes. Here, whenever a user upvotes or downvotes a comment in an online discussion, she reveals the relative position of her opinion with respect to the opinion expressed in the comment in a latent space of opinions. By leveraging this observation across multiple comments, upvotes and downvotes, our goal is to investigate the nature and complexity of an online discussion and its space of opinions, uncovering where each participant lies.

There is a long history of theoretical models [4, 17, 18, 31, 32] and empirical studies [8, 9, 10, 15, 16, 24, 26] of opinions. However, most of this previous work has reduced (potentially) complex opinions down to real-valued numbers, i.e., it has assumed that opinions lie on the real line. While a unidimensional space of opinions may be sufficient to coarsely characterize people into, e.g., left leaning vs right leaning or liberals vs conservatives, it may fall short of accurately representing complex, multisided opinions in an online discussion. In fact, if such opinions exist, a unidimensional representation of opinions will be unable to explain many common voting patterns under some of the most popular voting models, as we will show in Section 2.

Figure 1: Our modeling framework. From left to right, given a (toy) online discussion with a set of comments $\mathcal{C}$ and voters $\mathcal{V}$, our modeling framework maps the upvotes and downvotes cast by the voters into a partially observed sign matrix $S$. Within the matrix, each row corresponds to a comment and each column corresponds to a voter. In each entry, $+1$ indicates that the voter upvoted the comment, $-1$ indicates that she downvoted the comment, and the sign $?$ indicates that the voter did not vote. Then, it represents the opinions expressed in the comments and those held by the voters as $d$-dimensional real-valued vectors lying in the same latent space of opinions. Finally, it provides a set of practical algorithms to both estimate the dimension $d$ of the latent space of opinions as well as infer the vectors of opinions which are consistent with the partially observed sign matrix $S$, i.e., the upvotes and downvotes.

Current work. Given an online discussion consisting of a set of comments, which are upvoted and downvoted by a set of voters, we first introduce a latent multidimensional representation of the opinions expressed in the comments and the opinions held by the voters. Then, we propose two voting models, one deterministic and another probabilistic, which leverage the above multidimensional representation to characterize the voting patterns within an online discussion. Under this characterization, it becomes apparent that the dimension of the latent space of opinions is a measure of complexity of the online discussion: it counts along how many different axes the opinions expressed in the comments and the opinions held by the voters can differ. Moreover, it endows opinions in this latent space with a remarkable property: if two opinions are close (far away) in the latent space, it is because the voters (the crowd) think that they are similar (dissimilar). Motivated by these observations, we develop:

  • An algorithm to determine an upper bound on the minimum dimension that a latent space of opinions needs to have so that the opinions are able to explain a particular voting pattern under the deterministic voting model.

  • An inference method based on quantifier elimination to recover the latent opinions from the observed voting patterns under the deterministic voting model.

  • An inference method based on maximum likelihood estimation to recover the latent opinions from the observed voting patterns under the probabilistic voting model.

Finally, we experiment with a large dataset from Yahoo! News, Yahoo! Finance, Yahoo! Sports, and the Newsroom app, which consists of one day of online discussions about a wide variety of topics. Our analysis yields several interesting insights. We find that only % of the online discussions we analyzed can be explained using a unidimensional representation of opinions, % of them require a two dimensional representation, and the remaining ones require a greater number of dimensions. This provides empirical evidence that, to provide opinion representations that are coherent with human judgements, it may often be necessary to move beyond one dimension. The presence of multisided opinions, which are recognized by the crowd, is an indication that the discussion is not falling prey to demagoguery. Whenever an online discussion has low complexity, i.e., it can be represented using one dimensional opinions, the deterministic model, which assumes voters cast a vote following a deterministic rule, achieves higher predictive performance than the probabilistic model, which allows for noisy voting. However, for discussions with higher complexity, the probabilistic model provides more accurate predictions. This suggests that, whenever humans face more complex discussions, their judgements become less predictable. Moreover, by looking at particular examples of online discussions, we show that our modeling framework, by relying on human judgements, is able to circumvent language nuances like sarcasm and humor, which are often difficult to detect using natural language processing. Lastly, by comparing the estimated opinions in the online discussions, we find that the higher the dimension of the latent space of opinions, the lower the agreement between comments.

2 Modeling Opinions and Votes

At the very outset, the underlying mechanism behind voting on online discussions is fairly commonplace and straight-forward. Every time a user expresses an opinion by posting a new comment in an online discussion, other users can upvote (downvote) the comment to indicate that they agree (disagree) with the expressed opinion.

In this context, whenever a user upvotes or downvotes a comment, she reveals the relative position of her opinion with respect to the opinion expressed in the comment. By leveraging this observation across multiple comments, upvotes and downvotes, our modeling framework will be able to infer the relative positioning between comments in an online discussion, as judged by the crowd. Moreover, by doing so, it will also find a meaningful joint latent representation for the opinions expressed in an online discussion as well as the opinions held by the users who voted. In the remainder of the section, we formally introduce our modeling framework, starting from the data it is designed for.

Online voting data. We observe an online discussion consisting of a set of comments $\mathcal{C}$ which are upvoted and downvoted by a set of voters $\mathcal{V}$. Here, we keep track of who voted what, i.e., whether voter $j$ upvoted, downvoted, or did not vote on comment $i$. Then, we define a (partially) observed sign matrix $S$, where each $(i,j)$-th entry is given by

$$s_{ij} = \begin{cases} +1 & \text{if voter } j \text{ upvoted comment } i, \\ -1 & \text{if voter } j \text{ downvoted comment } i, \\ ? & \text{if voter } j \text{ did not vote on comment } i, \end{cases} \qquad (1)$$

where the sign $?$ indicates that the voter did not vote and thus we cannot know whether she agrees (or disagrees) with the comment, and we denote by $\Omega$ the set of indexes where we have observations, i.e., $\Omega = \{(i,j) : s_{ij} \neq\, ?\}$. Figure 1 particularizes the above definitions for a given toy example.
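As a concrete illustration of the definitions above, the following minimal Python sketch builds a partially observed sign matrix and its set of observed indexes from a handful of votes. The function name `build_sign_matrix`, the use of `None` to stand for a missing vote, and the toy votes are assumptions made for illustration only; they are not taken from the dataset.

```python
def build_sign_matrix(num_comments, num_voters, votes):
    """Build a partially observed sign matrix.

    votes maps (comment index, voter index) to +1 (upvote) or -1 (downvote).
    Missing votes are stored as None (an illustrative choice).
    """
    S = [[None] * num_voters for _ in range(num_comments)]
    omega = set()  # indexes (i, j) where a vote was observed
    for (i, j), vote in votes.items():
        if vote not in (1, -1):
            raise ValueError("votes must be +1 or -1")
        S[i][j] = vote
        omega.add((i, j))
    return S, omega


# Hypothetical toy discussion: 3 comments (rows) and 2 voters (columns).
toy_votes = {(0, 0): 1, (0, 1): -1, (1, 1): 1, (2, 0): -1}
S, omega = build_sign_matrix(3, 2, toy_votes)
```

Rows index comments and columns index voters, matching the layout described for Figure 1.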

Next, we introduce our multidimensional representation of the opinions expressed in the comments and those held by the voters and then elaborate on our voting model, which relates these opinions to the observed voting data.

Opinion representation. Unidimensional (scalar) real-valued representations of opinions have been used most commonly in the literature, following the example set by the seminal works by DeGroot [12] and Rowley [28]. Thus, we could think of using such unidimensional representation of opinions in our work. However, under that choice, we would be unable to explain certain voting patterns illustrated below, which are common in many online discussions.

Given an online discussion, assume we represent the opinion expressed in each comment $i$ as $c_i \in \mathbb{R}$ and the opinion held by each voter $j$ as $v_j \in \mathbb{R}$. Now, we elaborate separately on two of the most popular voting models in the literature [25]: the proximity model and the directional model. Under the proximity model, the voters use the Euclidean distance as a similarity measure and decide to cast an upvote if $|c_i - v_j| \le \tau$, where $\tau$ is a threshold, and a downvote otherwise. Now consider the voting pattern 1 in Figure 4(a). It is easy to show that there are no real-valued scalar opinions $c_i$ and $v_j$ leading to such a voting pattern: assume that (as the pattern is symmetric, we can always relabel the voters and comments to make this true) and . Then , and . This contradicts the assumption that . Similarly, we arrive at a contradiction with the assumption .

Under the directional model, the voters use the dot product as a similarity measure and decide to cast an upvote if $c_i v_j > 0$ and a downvote otherwise. Here, consider the voting pattern 2 in Figure 4(b). Again, it is easy to show that there are no real-valued non-zero scalar opinions $c_i$ and $v_j$ leading to such a voting pattern. The first row requires and , which implies . However, the second row requires and , which implies and this leads to a contradiction.

Motivated by the above examples, given an online discussion, we represent the opinions expressed in the comments and those held by the voters as $d$-dimensional real-valued vectors lying in the same latent space. More formally, we represent the opinion expressed in each comment $i$ as $c_i \in \mathbb{R}^d$ and we stack all these opinions into a matrix $C$, in which the $i$-th row corresponds to the opinion $c_i$. Similarly, we represent the opinion held by each voter $j$ as $v_j \in \mathbb{R}^d$ and stack all these opinions into a matrix $V$, in which the $j$-th row corresponds to the opinion $v_j$. Here, one can think of the dimension $d$ as a measure of the complexity of the online discussion, i.e., along how many different axes the opinions expressed in the comments and the opinions held by the voters can differ. Figure 1 illustrates the above definitions using a toy example.

Figure 4: Examples of infeasible voting patterns under the proximity and directional voting models with unidimensional (scalar) real-valued representation of opinions. Panels: (a) Voting pattern 1; (b) Voting pattern 2.

Voting model. Given a comment $i$ with an opinion $c_i$ and a voter $j$ who holds an opinion $v_j$, we introduce two voting models, one deterministic and another probabilistic, inspired by the directional model of voting discussed above.

— Deterministic voting model: In this model, we can uniquely determine each vote $s_{ij}$ from the comment's opinion $c_i$ and the voter's opinion $v_j$ by means of the following deterministic rule:

$$s_{ij} = \mathrm{sign}(c_i^\top v_j) \qquad (2)$$

In the above rule, the vote depends on the angle between the opinion vectors $c_i$ and $v_j$: if the angle is less (greater) than $\pi/2$, i.e., $c_i$ and $v_j$ lie in the same (different) half-plane in the latent space, then $s_{ij} = +1$ ($s_{ij} = -1$).
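The deterministic rule can be sketched in a few lines of Python. This is a minimal illustration, assuming the rule is the sign of the dot product between the two opinion vectors, as in the directional model; the helper names and the toy opinion vectors are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def deterministic_vote(c, v):
    """Return +1 (upvote) or -1 (downvote) from the sign of the dot product.

    A zero dot product lies exactly on the decision boundary; returning 0
    there is an illustrative choice, not something the model prescribes.
    """
    score = dot(c, v)
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0


# Hypothetical 2-dimensional opinions for 3 comments and 2 voters.
C = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
V = [[1.0, 0.5], [-0.5, 1.0]]
predicted = [[deterministic_vote(c, v) for v in V] for c in C]
```

Here the third comment points away from both voters, so both of them downvote it.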

Under this voting model and a partially observed sign matrix $S$ derived from votes using Eq. 1, two natural questions emerge:

  • What is the minimum dimension $d$ of the latent space needed to recover the observed entries in $S$ from the above decision rule without errors?

  • Once we know the minimum dimension $d$, can we infer the opinion vectors $c_i$ and $v_j$?

We will address these two questions in Sections 3 and 4, respectively.

— Probabilistic voting model: In the definition of our deterministic model, we have implicitly assumed that voters do not make any errors while casting their votes. However, this assumption might be rather restrictive in some scenarios. To overcome this, we also propose a probabilistic voting model in which votes are binary random variables $s_{ij} \in \{-1, +1\}$, and,

$$P(s_{ij} \mid c_i, v_j) = \sigma(c_i^\top v_j)^{y_{ij}} \, \big(1 - \sigma(c_i^\top v_j)\big)^{1 - y_{ij}}, \qquad \sigma(x) = \frac{1}{1 + \exp(-x)}, \qquad (3)$$

where $y_{ij} = 1$ if $s_{ij} = +1$ and $y_{ij} = 0$ if $s_{ij} = -1$. Similarly, as in the case of the deterministic model, we will propose a method to infer the opinion vectors $c_i$ and $v_j$ under this model in Section 4. In doing so, we will make the mild assumption that $\|C\|_\infty \le \alpha$ and $\|V\|_\infty \le \alpha$, as a way of modeling a “Region of Acceptability”, which is a common assumption.
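To make the probabilistic model concrete, the sketch below evaluates vote probabilities and the log-likelihood of a set of observed votes. It assumes a logistic link on the dot product, which is the form implied by the log-likelihood used in Section 4; the function names and the toy vectors are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def vote_probability(s, c, v):
    """Probability of vote s (+1 or -1) given opinions c and v (assumed logistic link)."""
    return sigmoid(s * dot(c, v))


def log_likelihood(S, C, V):
    """Sum of log-probabilities over observed votes; None marks a missing vote."""
    total = 0.0
    for i, row in enumerate(S):
        for j, s in enumerate(row):
            if s is not None:
                total += math.log(vote_probability(s, C[i], V[j]))
    return total


c, v = [1.0, 0.0], [2.0, 1.0]          # dot product is 2.0
p_up = vote_probability(1, c, v)       # upvote is the more likely outcome
p_down = vote_probability(-1, c, v)
```

Unlike the deterministic rule, a voter whose opinion is aligned with a comment can still downvote it, only with lower probability.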

Remark. In the above model definitions, we opt for a similarity metric based on dot products because the euclidean distance, used in the proximity model, does not scale well with increasing dimensionality: the relative volume of the opinion space where a voter will cast an upvote is proportional to where is the threshold for the user, is the dimension of the latent space and is the “Region of Acceptability”.

3 Complexity of Online Discussions

In this section, we present an algorithm which can determine an upper bound on the minimum dimension $d$ that a latent space of opinions needs to have so that $C$ and $V$ are able to explain the voting patterns exhibited by the voters, which result in a particular vote matrix $S$ under the deterministic voting model, i.e., $s_{ij} = \mathrm{sign}(c_i^\top v_j)$, where $(i,j) \in \Omega$. To this aim, we will first introduce the notion of sign-rank of a sign matrix. Then, we will show that the problem of determining $d$ reduces to finding the sign-rank of a partially observed sign matrix. Finally, we will present an efficient algorithm to estimate the sign-rank.

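Sign-rank of a sign matrix. Paturi et al. [27] introduced the classical notion of sign-rank of a sign matrix, which is closely related to the VC dimension of concept classes [3], and it goes as follows: Let $M$ be a real matrix and let $\mathrm{sign}(M)$ denote the matrix such that $\mathrm{sign}(M)_{ij} = \mathrm{sign}(M_{ij})$. Then, the sign-rank of a sign matrix $S$ is defined as:

$$\text{sign-rank}(S) = \min \{ \mathrm{rank}(M) : \mathrm{sign}(M) = S \}.$$

Here, we extend the above definition to partially observed sign matrices as follows: The sign-rank of a partially observed sign matrix $S$ is defined as:

$$\text{sign-rank}(S) = \min \{ \mathrm{rank}(M) : \mathrm{sign}(M_{ij}) = s_{ij} \ \ \forall (i,j) \in \Omega \}.$$

It is easy to see that, if the rank of a matrix $M$ is $d$, then we can decompose the matrix into two components of the form $M = C V^\top$, where $C$ and $V$ have $d$ columns each, using, e.g., the singular value decomposition (SVD). Hence, the problem of determining $d$ reduces to the problem of finding $\text{sign-rank}(S)$.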

Note that the sign-rank of a sign matrix can be much lower than its actual rank, as was noticed by Hsieh et al. [20] in the context of signed graph models. For example, consider the sign-rank of the matrix $2 I_n - \mathbf{1}_n$, where $I_n$ is the identity matrix and $\mathbf{1}_n$ is the matrix of all ones of size $n \times n$. For $n \ge 4$, the sign-rank remains $3$ though the matrix itself is always of full rank $n$. Moreover, note that, in our setting, the sign-rank does not merely correspond to the number of topics being discussed in an online discussion. Instead, the complexity may be manifest in the combination of the topics under discussion: the voters may agree with some opinions in a comment while disagreeing with others.

Estimating the sign-rank of a partially observed sign matrix. The problem of determining whether $\text{sign-rank}(S)$ is $1$ can be solved by a simple breadth-first search (BFS). We first create a signed bipartite graph of comments and voters with adjacency matrix $S$. Then, for each connected component in the graph, we pick one node, set its value to $+1$, and fill in the remaining values using BFS by multiplying the source node value with the sign of the edge to arrive at the destination node value. If a consistent assignment of $\pm 1$ to all the nodes is possible, then $\text{sign-rank}(S) = 1$.
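The BFS check described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the matrix is stored as a list of rows with `None` for missing votes, and the function name `is_sign_rank_one` is hypothetical.

```python
from collections import deque


def is_sign_rank_one(S):
    """Check whether +1/-1 labels for rows and columns can reproduce every observed sign."""
    n = len(S)
    m = len(S[0]) if n else 0
    row_val = [0] * n  # 0 means "not assigned yet"
    col_val = [0] * m
    for start in range(n):
        if row_val[start] != 0:
            continue
        row_val[start] = 1  # seed of a new connected component
        queue = deque([("row", start)])
        while queue:
            kind, idx = queue.popleft()
            if kind == "row":
                for j in range(m):
                    s = S[idx][j]
                    if s is None:
                        continue
                    expected = row_val[idx] * s
                    if col_val[j] == 0:
                        col_val[j] = expected
                        queue.append(("col", j))
                    elif col_val[j] != expected:
                        return False  # inconsistent assignment
            else:
                for i in range(n):
                    s = S[i][idx]
                    if s is None:
                        continue
                    expected = col_val[idx] * s
                    if row_val[i] == 0:
                        row_val[i] = expected
                        queue.append(("row", i))
                    elif row_val[i] != expected:
                        return False
    return True
```

Each observed entry is visited a constant number of times per endpoint, so the check is cheap even for large, sparse discussions.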

However, this algorithm does not generalize to multiple dimensions. To estimate the sign-rank of a partially observed sign matrix, we adapt the algorithm for (fully observed) sign matrices proposed recently by Alon et al. [3]. First, we explain the main ideas behind the original algorithm and then describe the necessary, non-trivial modifications we propose.

The original algorithm upper-bounds the sign-rank of a (fully observed) sign matrix by the number of sign changes in the columns of the matrix. More formally, define the function $\mathrm{SC}(M)$ as the maximum number of sign changes in any column of the matrix $M$, i.e., $\mathrm{SC}(M) = \max_j |\{ i : M_{ij} \neq M_{(i+1)j} \}|$, $\Pi$ as the set of all possible row permutations of $M$, and the function $\mathrm{SC}^*(M)$ as:

$$\mathrm{SC}^*(M) = \min_{\pi \in \Pi} \mathrm{SC}(\pi(M)).$$

Then, the following lemma establishes the relationship between the sign-rank of a matrix and $\mathrm{SC}^*$, which the original algorithm exploits [2]: For a sign matrix $S$, $\text{sign-rank}(S) \le \mathrm{SC}^*(S) + 1$. To use the above result, our algorithm needs to do two tasks:

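  • Find a matrix $\hat{S}$ such that it is a completion of $S$, i.e.,

    $$\hat{s}_{ij} = s_{ij} \quad \forall (i,j) \in \Omega. \qquad (4)$$
  • Find a row permutation $\pi$ of $\hat{S}$ such that it minimizes the maximum number of sign changes in its columns.

After it is able to do so, it will output $\mathrm{SC}(\pi(\hat{S})) + 1$ as the estimated sign-rank of the matrix.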
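For a fully observed sign matrix, the bound based on sign changes can be illustrated with a short Python sketch. The function names are hypothetical, and the brute-force search over all row permutations is only meant for tiny examples; it is not the algorithm used in this work.

```python
from itertools import permutations


def max_sign_changes(S, order=None):
    """Maximum number of sign changes between consecutive rows, over all columns."""
    rows = S if order is None else [S[i] for i in order]
    worst = 0
    for j in range(len(rows[0])):
        changes = sum(1 for a, b in zip(rows, rows[1:]) if a[j] != b[j])
        worst = max(worst, changes)
    return worst


def sign_rank_upper_bound(S, order=None):
    """Upper bound on the sign-rank: maximum sign changes plus one."""
    return max_sign_changes(S, order) + 1


def best_row_order(S):
    """Exhaustive search for a row order minimizing the maximum sign changes."""
    return min(permutations(range(len(S))), key=lambda p: max_sign_changes(S, p))


# Toy matrix: the original row order has 2 sign changes per column,
# but placing the two identical rows next to each other leaves only 1.
S_toy = [[1, 1], [-1, -1], [1, 1]]
order = best_row_order(S_toy)
```

The example shows why the row permutation matters: the same matrix yields a bound of 3 in its original order and a bound of 2 after reordering.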

Our algorithm does both tasks together while it computes an estimation of $\mathrm{SC}^*$ using an algorithm by Welzl et al. [30, see Ex. 4]. In a nutshell, we construct a graph in which each node corresponds to a row of the partially observed matrix and the weight of each edge between nodes $i$ and $k$ is given by the number of columns where the signs of the corresponding rows disagree. Then, we extract a spanning tree from the completely connected graph which minimizes the number of sign changes between pairs of vertices connected by an edge, as shown in Algorithm 1. In the process of creating the spanning tree, we also fill the matrix. Finally, we derive a permutation $\pi$ of the rows based on the tree to construct $\pi(\hat{S})$.

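Input:  A sign-matrix $S$ with $n$ rows and $m$ columns
Output:  Spanning-Tree $T$ of the rows.
1:  $T \leftarrow \emptyset$; $E \leftarrow \{(i,k) : i < k\}$; $\hat{S} \leftarrow S$
2:  while $|T| < n - 1$ do
3:     for $e = (i,k) \in E$; $T \cup \{e\}$ is cycle-free do
4:        $w(e) \leftarrow |\{ j : \hat{s}_{ij} \neq\, ?,\ \hat{s}_{kj} \neq\, ?,\ \hat{s}_{ij} \neq \hat{s}_{kj} \}|$
5:     end for
6:     $e^* \leftarrow \arg\min_e w(e)$
7:     $\hat{S} \leftarrow$ Update($e^*$, $\hat{S}$)
8:     $T \leftarrow T \cup \{e^*\}$; $E \leftarrow E \setminus \{e^*\}$
9:  end while
10:  return  $T$
Algorithm 1 It constructs a sign-minimizing spanning tree over the rows of the partially observed sign matrix $S$.
Input:  $e = (i,k)$ edge chosen; and $\hat{S}$ sign-matrix;
Output:  Updated $\hat{S}$
1:  // $Q$ initialized as an empty set and persisted across calls
2:  for $j \in \{1, \dots, m\}$ do
3:     if $\hat{s}_{ij} \neq\, ?$ and $\hat{s}_{kj} =\, ?$ then
4:        $\hat{s}_{kj} \Leftarrow \hat{s}_{ij}$
5:     else if $\hat{s}_{ij} =\, ?$ and $\hat{s}_{kj} \neq\, ?$ then
6:        $\hat{s}_{ij} \Leftarrow \hat{s}_{kj}$
7:     else if $\hat{s}_{ij} =\, ?$ and $\hat{s}_{kj} =\, ?$ then
8:        $Q \leftarrow Q \cup \{\hat{s}_{ij} \equiv \hat{s}_{kj}\}$
9:     end if
10:  end for
11:  return  $\hat{S}$
Algorithm 2 Update procedure used in Algorithm 1.
Operation $\Leftarrow$ assigns the RHS to all positions which were declared equivalent to the LHS in line 8.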

More in detail, to understand how Algorithm 1 fills the matrix as it computes the spanning tree, we distinguish two different cases:

Case 1: When, given a column, only one of the rows has a missing entry. Consider, for example, that we have two rows $\mathbf{s}_i = (+1, +1, -1)$ and $\mathbf{s}_k = (-1, +1, ?)$. To calculate the weight of the edge between $i$ and $k$, $w_{ik}$, in line 4 of Algorithm 1, we ignore the second column, as $s_{i2} = s_{k2}$, and the third column, as it has a missing entry $s_{k3} =\, ?$. Hence, we report $w_{ik} = 1$, because the first column is the only column where the signs of $\mathbf{s}_i$ and $\mathbf{s}_k$ certainly differ. Now, if this edge was chosen in line 6 of Algorithm 1, we modify $\mathbf{s}_k$ such that this weight indeed is the true weight of the edge: we replace $?$ with the corresponding value in $\mathbf{s}_i$, i.e., $s_{k3} = s_{i3} = -1$, via line 4 in Algorithm 2.

Case 2: When, given a column, both entries are unknown. If $\mathbf{s}_i = (+1, +1, ?)$ and $\mathbf{s}_k = (-1, +1, ?)$, we would still calculate the weight the same way as above. Hence, $w_{ik} = 1$. However, if this edge was chosen in line 6 of Algorithm 1, we could keep the weight the same by merely ensuring that both $\mathbf{s}_i$ and $\mathbf{s}_k$ have the same value in the third column, i.e., $s_{i3} = s_{k3}$. Hence, we create and save the constraint that the third column of $\mathbf{s}_i$ and $\mathbf{s}_k$ must always have the same value in line 8 of Algorithm 2. Now, say a few steps into the creation of the spanning tree, we find that the missing value in $\mathbf{s}_i$ has to be set to $+1$ as it was being picked as part of an edge under Case 1 above. Then, we can also set the same value in the third column of $\mathbf{s}_k$, i.e., $s_{k3} = +1$, via line 4 in Algorithm 2. Note that since each column in $S$ contains at least one entry which is not $?$, as each voter has voted at least once, we will eventually hit Case 1 and fill in all missing entries.

Note that we are conservative and greedy while filling in the missing entries, i.e., the edge selected at the $t$-th step will have the same weight at the $t$-th iteration if the algorithm was to be run on the filled matrix $\hat{S}$ and it will be the minimum weight amongst all valid edges at step $t$ (though the edge may not be unique in having that weight). Additionally, our algorithm ensures that the weight of the edge selected at iteration $t$ is the minimum possible, given the history of selections. After obtaining a spanning tree, one can walk the tree starting from any source node and create a permutation $\pi$ of the rows by dropping the duplicate nodes in the walk. Hence, we can obtain $\pi(\hat{S})$ and report $\mathrm{SC}(\pi(\hat{S})) + 1$ as the estimated sign-rank.
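The greedy completion and the tree walk can be sketched in Python as follows. This is a simplified reading of the procedure under several assumptions: missing votes are stored as `None`, ties between equally light edges are broken by taking the first candidate rather than randomly, the walk is a depth-first traversal from row 0, and every column has at least one observed vote. All function names are hypothetical.

```python
from itertools import combinations


def max_sign_changes(S, order):
    rows = [S[i] for i in order]
    return max(sum(1 for a, b in zip(rows, rows[1:]) if a[j] != b[j])
               for j in range(len(rows[0])))


def spanning_tree_completion(S):
    """Greedily build a light spanning tree over rows while filling missing entries."""
    S = [row[:] for row in S]
    n, m = len(S), len(S[0])
    parent = list(range(n))  # union-find over rows, used for the cycle check

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tied = {}  # (row, col) -> shared set of positions constrained to be equal

    def assign(pos, value):
        for r, c in tied.get(pos, {pos}):
            S[r][c] = value

    def weight(i, k):
        # Count only columns where both signs are known and differ.
        return sum(1 for a, b in zip(S[i], S[k])
                   if a is not None and b is not None and a != b)

    tree = {i: [] for i in range(n)}
    for _ in range(n - 1):
        candidates = [(i, k) for i, k in combinations(range(n), 2)
                      if find(i) != find(k)]
        i, k = min(candidates, key=lambda e: weight(e[0], e[1]))
        for j in range(m):  # update step for the chosen edge
            a, b = S[i][j], S[k][j]
            if a is not None and b is None:
                assign((k, j), a)
            elif a is None and b is not None:
                assign((i, j), b)
            elif a is None and b is None:
                merged = tied.get((i, j), {(i, j)}) | tied.get((k, j), {(k, j)})
                for pos in merged:
                    tied[pos] = merged
        parent[find(i)] = find(k)
        tree[i].append(k)
        tree[k].append(i)

    order, stack, seen = [], [0], set()  # depth-first walk without duplicates
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        stack.extend(reversed(tree[node]))
    return S, order


S_partial = [[1, None, -1], [1, 1, None], [-1, None, 1], [None, -1, 1]]
completed, order = spanning_tree_completion(S_partial)
bound = max_sign_changes(completed, order) + 1
```

On this toy matrix every missing entry is filled from a neighbouring row in the tree, and the resulting row order has a single sign change per column, so the reported bound is 2.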

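Then, we can establish the upper bound on the dimension $d$ by using the following series of self-evident inequalities: $d = \text{sign-rank}(S) \le \text{sign-rank}(\hat{S}) \le \mathrm{SC}^*(\hat{S}) + 1 \le \mathrm{SC}(\pi(\hat{S})) + 1$.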

Finally, we would like to highlight that the spanning tree algorithm presented above minimizes the average number of sign changes in the columns of $\pi(\hat{S})$. Welzl et al. [30] also describe a variant of the algorithm which produces guarantees on the worst case number of sign changes in the columns of $\pi(\hat{S})$; the way the weight is calculated is more involved in the variant. This variant was used by Alon et al. [3] to design the first polynomial time algorithm with approximation guarantees for the sign-rank of the matrix $S$. Remarkably, the Update procedure in Algorithm 2 can be ported to that variant without any changes to complete a partially observed matrix with the algorithm with worst case guarantees as well. However, that version is computationally more expensive, more complex, and does not offer significantly better results in practice in our dataset. Hence, for ease of exposition, we have described the simpler of the two versions.

Remark. In our implementation, we do a (non-exhaustive) search over walks with different sources to improve our estimate of $\mathrm{SC}^*$ and break ties in calculating the $\arg\min$ randomly. Also, as $\text{sign-rank}(S) = \text{sign-rank}(S^\top)$, we run the algorithm on both matrices and report the smaller value.

4 Multisided Opinion Estimation

Given an online discussion, we infer the corresponding $d$-dimensional opinions $C$ and $V$ for the deterministic and probabilistic voting models introduced in Section 2 as follows.

Deterministic voting model. By definition, under the deterministic voting model, we know that the corresponding $d$-dimensional opinions $c_i$ and $v_j$ and the partially observed sign matrix $S$ need to satisfy the following inequalities:

$$s_{ij} \sum_{k=1}^{d} c_{ik} v_{jk} > 0 \quad \forall (i,j) \in \Omega, \qquad (5)$$

where $c_{ik}$ and $v_{jk}$ are the $k$-th entries of the opinions $c_i$ and $v_j$, respectively, and $\Omega$ is the set of observed entries in $S$. However, we also know from Section 3 that, for each voting pattern, there will be a minimum dimension below which such an opinion embedding will not exist.

This reduces the problem of finding the opinions $C$ and $V$ to the existential theory of the reals [19] and, for small values of $d$ and a moderate number of comments and voters, $|\mathcal{C}|$ and $|\mathcal{V}|$, this problem can be solved via quantifier elimination using, e.g., the solver Z3 [11]. Here, note that, if $d < \text{sign-rank}(S)$, the solver will conclude that the problem is unsatisfiable. Hence, by iteratively increasing $d$ and checking for satisfiability of Eq. 5, one could determine the true sign-rank of any matrix $S$. However, as the most efficient method known for quantifier elimination is doubly exponential in the number of variables, calculating the minimum dimension in this way would be computationally more expensive than using the polynomial algorithm introduced in Section 3.

Probabilistic voting model. Given a partially observed matrix $S$, under the probabilistic voting model, we estimate $C$ and $V$ by solving the following constrained maximum likelihood estimation (MLE) problem:

$$\underset{C,\, V}{\text{maximize}} \ \ -\sum_{(i,j) \in \Omega} \log\big(1 + \exp(-s_{ij}\, c_i^\top v_j)\big) \quad \text{subject to} \quad \|C\|_\infty \le \alpha, \ \ \|V\|_\infty \le \alpha.$$

Remarkably, the structure of the above problem allows us to adapt an efficient 1-bit matrix completion method based on gradient descent with theoretical guarantees [7]. Finally, note that, unlike in the deterministic model, for each voting pattern and dimension $d$, there will always exist opinions $C$ and $V$ that are able to explain the pattern.
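A rough sketch of the estimation step is given below. It is not the 1-bit matrix completion method of [7]: as a simplification, it runs plain projected gradient ascent directly on the factors, clipping every entry to the interval $[-\alpha, \alpha]$ after each step. The function names, the step size, the number of steps and the initialization are all assumptions chosen for a tiny toy example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_opinions(S, d, alpha=2.0, lr=0.1, steps=500, seed=0):
    """Projected gradient ascent on the logistic log-likelihood (illustrative)."""
    rng = random.Random(seed)
    n, m = len(S), len(S[0])
    C = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(n)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(m)]
    observed = [(i, j, S[i][j]) for i in range(n) for j in range(m)
                if S[i][j] is not None]

    def clip(x):
        return max(-alpha, min(alpha, x))

    for _ in range(steps):
        grad_C = [[0.0] * d for _ in range(n)]
        grad_V = [[0.0] * d for _ in range(m)]
        for i, j, s in observed:
            score = sum(a * b for a, b in zip(C[i], V[j]))
            # Derivative of log(sigmoid(s * score)) with respect to score.
            g = s * sigmoid(-s * score)
            for k in range(d):
                grad_C[i][k] += g * V[j][k]
                grad_V[j][k] += g * C[i][k]
        C = [[clip(C[i][k] + lr * grad_C[i][k]) for k in range(d)] for i in range(n)]
        V = [[clip(V[j][k] + lr * grad_V[j][k]) for k in range(d)] for j in range(m)]
    return C, V


def training_accuracy(S, C, V):
    """Fraction of observed votes whose sign matches the sign of the fitted score."""
    hits, total = 0, 0
    for i, row in enumerate(S):
        for j, s in enumerate(row):
            if s is not None:
                score = sum(a * b for a, b in zip(C[i], V[j]))
                hits += (score > 0) == (s > 0)
                total += 1
    return hits / total


toy_S = [[1, -1], [-1, 1]]
C_hat, V_hat = fit_opinions(toy_S, d=1)
```

The clipping step plays the role of the norm constraints: without it, the entries would keep growing on patterns that can be explained without errors.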

Remark. In both the deterministic and the probabilistic model, the estimated opinions are unique up to orthogonal transformations, since the inequalities in Eq. 5 and the likelihood of the probabilistic model only depend on the entries of $C V^\top$, i.e., $(CQ)(VQ)^\top = C Q Q^\top V^\top = C V^\top$ for any orthogonal matrix $Q$.
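This invariance is easy to check numerically in two dimensions, where every rotation is an orthogonal transformation. The snippet below is a small sanity check with hypothetical opinion vectors and an arbitrary rotation angle.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rotate(vec, theta):
    """Rotate a 2-dimensional vector by theta radians."""
    x, y = vec
    return [x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta)]


c, v = [1.0, 0.5], [-0.3, 2.0]   # hypothetical comment and voter opinions
theta = 0.7                       # arbitrary rotation angle
before = dot(c, v)
after = dot(rotate(c, theta), rotate(v, theta))
```

Since every dot product is preserved, both the predicted votes and the likelihood are unchanged, so the absolute orientation of the estimated opinions carries no meaning; only their relative positions do.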

5 Experiments

In this section, we will apply our modeling framework to one day of online discussions gathered from the websites Yahoo! News, Yahoo! Finance, Yahoo! Sports, and the Newsroom app. We will consider all comments made on a post on these websites as one discussion. First, we will estimate the complexity of the discussions and show that, in most (daily) discussions, the opinions can be explained using less than dimensions. Then, for each online discussion, we will infer the corresponding $d$-dimensional opinions and show that, on the one hand, these opinion representations can be used to accurately predict upvotes and downvotes and, on the other hand, they can circumvent language nuances like sarcasm and humor.

Dim. Discuss. Patterns
Table 1: Number of comments, voters and unique voting patterns seen in the dataset for discussions with different dimensions. The numbers in each column are the mean values ± the standard deviation. Marked dimensions were determined using Z3 and are the true dimensions of the discussions. Our algorithm was used to estimate the dimension of the other discussions. It can be seen that, though the dimension of discussions is positively correlated with the size and the number of participants, discussions of different complexity can be found on the entire spectrum.

Data description. Our dataset contains online discussions, each associated to an article from Yahoo! News (including contributed articles), Yahoo! Finance, Yahoo! Sports, and the Newsroom app, which contain million votes, cast by voters on comments, posted by users. These votes were randomly sampled from all votes which were cast on comments made by users in the US on August 8, 2017.

As a pre-processing step, we discard discussions with fewer than 10 comments, as they contain too little data to provide meaningful results. After this step, our dataset consists of discussions, with million votes, cast by voters, on comments, posted by users. Figures 9(a) and 9(b) show the richness of the data in the votes gathered in the online discussions by means of the sparsity of $S$ and the number of unique columns of $S$, which we refer to as voting patterns.

Figure 9: Distribution of the number of observed elements and fraction of unique voting patterns in matrix $S$ (Panels a and b) and performance of our algorithm for dimension estimation (Panels c and d). Panels: (a) Sparsity of votes; (b) Unique voting patterns; (c) Discussions of dim. 2; (d) Discussions of dim. 3. Panel (a) shows that for most discussions $S$ is very sparse and Panel (b) shows that many have overlapping voting patterns. Panels (c) and (d) show that for discussions whose opinions can be explained using two (three) dimensions, our algorithm recovers the true dimension for (%) of the discussions and is off by one for () of them.

Figure 14: Panels: (a) Discussions of dim. 1; (b) Discussions of dim. 2; (c) Agreement vs upvotes; (d) Distribution of Agreement/upvote. Panels (a) and (b) show vote prediction accuracy for the deterministic voting model (DVM), the probabilistic voting model (PVM), and a baseline that just outputs as its vote prediction the most common vote in the discussion. Both for discussions with dimension 1 and 2, the performance for the DVM and PVM uniformly increases as the number of unique voting patterns increases. In contrast, the performance of the baseline remains nearly constant. Panels (c) and (d) show agreement and percentage of upvotes among all votes in online discussions. Agreement is measured in terms of percentage of comment pairs for which . The higher the dimension of the latent space of opinions, the lower the agreement between comments; however, this finding would not be apparent directly from the fraction of upvotes, which remains relatively constant.

Complexity of discussions. In this section, we compute the complexity of the discussions, i.e., the dimension of the latent space of opinions, for the online discussions in our dataset as follows. For each online discussion, we determine whether it can be explained using a unidimensional space of opinions using the linear time algorithm presented at the beginning of Section 3. If it cannot be explained using one dimension, we determine whether it can be explained using a two- or three-dimensional space of opinions via quantifier elimination, following Section 4. (Footnote 1: In practice, we found quantifier elimination to be sufficiently scalable to test whether an online discussion can be explained using up to two dimensions.) Finally, if it cannot be explained using two or three dimensions, we resort to the algorithm presented in Section 3, which provides an upper bound on the true dimension.

Table 1 summarizes the results, which show that the opinions of about (%) of the discussions can be explained using one dimension, (%) of the discussions require two dimensions, while the remaining discussions (%) require a higher number of dimensions. This allows us to conclude that the opinions in most of the online daily discussions (%) can be explained using a latent space of relatively low dimension, i.e., . Moreover, while discussions with a higher number of participants () and richness (i.e., higher number of voting patterns and lower sparsity) require, in general, a latent space of opinions with a larger number of dimensions, there is a large variability spanning the entire spectrum of online discussions.

Next, we evaluate how tight the upper bound on the true dimension provided by our algorithm is, since we used this bound above for discussions whose dimension we could not find using quantifier elimination. To this aim, we run our algorithm on discussions whose true dimension we could find using quantifier elimination and compare the upper bound with the true dimension. Figure 9 summarizes the results, which show that, for discussions whose opinions can be explained using two (three) dimensions, our algorithm recovers the true dimension for (%) of the discussions and is off by one for () of them.

Opinions in online discussions. In this section, we first evaluate both quantitatively and qualitatively the quality of the estimated $d$-dimensional opinions in the online discussions and then leverage the estimated opinions to shed some light on the level of controversy in online discussions. Here, we used the opinion estimation method for the probabilistic voting model introduced in Section 4, which scales gracefully with the dimension $d$.

In terms of quantitative evaluation, we assess to which extent the deterministic voting model (DVM) and the probabilistic voting model (PVM) can predict whether a voter will upvote or downvote a comment from the estimated opinions, in comparison with a baseline method that just outputs as its vote prediction the most common vote (be it an upvote or downvote) in the discussion. To this aim, for each discussion, we hold out some of the observed upvotes and downvotes, estimate the opinions from the remaining votes, and then predict the votes from the held-out set. However, since our data is very sparse, as shown in Figure 9(a), and even holding out a small fraction of votes may change the underlying dimension of the latent space of opinions, we resort to leave-one-out validation. Moreover, we randomly select discussions to tune the hyperparameters of the probabilistic voting model and these discussions are excluded from the validation set. Figure 14 summarizes the results, which show that:

  • The performance of both voting models increases as the number of unique voting patterns in the dataset increases. In contrast, the performance of the baseline method remains nearly constant.

  • For discussions whose opinions can be explained using two dimensions, the deterministic and probabilistic models achieve comparable performance, whereas for discussions whose opinions require only one dimension, the deterministic model beats the probabilistic one. A potential explanation for this behavior is that, whenever humans face simpler decisions, i.e., their opinions can be explained using a single dimension, they become more predictable.
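The leave-one-out protocol and the majority-vote baseline described above can be sketched as follows. This is a minimal illustration, not the evaluation code used in the paper: the function names, the `(voter, comment, vote)` tuple layout, and the toy votes are all assumptions made for the example. Any opinion-based predictor (such as the DVM or PVM) could be plugged in through the hypothetical `fit_predict` callback.

```python
from collections import Counter


def majority_baseline(train_votes):
    """Predict the most common vote (+1 or -1) among the training votes."""
    counts = Counter(vote for (_, _, vote) in train_votes)
    # Ties are broken in favour of an upvote (an arbitrary choice for this sketch).
    return 1 if counts[1] >= counts[-1] else -1


def leave_one_out_accuracy(votes, fit_predict):
    """Hold out each vote in turn, fit on the rest, and score the prediction.

    votes: list of (voter, comment, vote) tuples with vote in {+1, -1}.
    fit_predict: hypothetical callback (train_votes, voter, comment) -> vote.
    """
    correct = 0
    for k, (voter, comment, vote) in enumerate(votes):
        train = votes[:k] + votes[k + 1:]  # all votes except the k-th
        if fit_predict(train, voter, comment) == vote:
            correct += 1
    return correct / len(votes)


# Toy discussion: four upvotes and two downvotes (made-up data).
toy_votes = [
    ("u1", "c1", 1), ("u1", "c2", -1), ("u2", "c1", 1),
    ("u2", "c2", 1), ("u3", "c1", -1), ("u3", "c3", 1),
]

baseline_accuracy = leave_one_out_accuracy(
    toy_votes, lambda train, voter, comment: majority_baseline(train)
)
print(f"baseline leave-one-out accuracy: {baseline_accuracy:.3f}")
```

Because the baseline ignores who votes on what, its accuracy depends only on the upvote/downvote balance, which is consistent with it staying nearly constant as the number of voting patterns grows.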

Figure 16, panel (a): Estimated opinions. Comments shown in the figure:
  • […] [Donald Trump] has […] Enquirers [which] he considers a treasure trove of information.
  • He should change his name to Donald J Dubious.
  • […] Trump can be an #$%$, and Islam can be cancer […] they are not mutually exclusive […]
  • Why not? Try anything. Terrorism has got to stop now!
  • It is a great idea
  • Trump family motto-“It’s not a lie if you believe it.”
Footnote 2: National Enquirers is a well-known entertainment magazine in the US.
Figure 16: A subset of comments and estimated opinions for an online discussion about politics. Two pairs of comments, (, ) and (, ), express a similar opinion; however, the lexical overlap between comments within each pair is low. Remarkably, our method is able to identify that they are similar, as a human would do, by leveraging the judgements of the voters, and their estimated opinions lie close to each other in the latent space of opinions. Moreover, the estimated opinion of a comment expressing an opposite view to the ones above, , lies in an orthogonal direction.

In terms of qualitative evaluation, we take a close look at the comments and inferred opinions of a discussion about politics, shown in Figure 16, and a discussion about finance, shown in Figure 18. The discussion about politics shows that, even if the lexical overlap between comments that express a similar opinion is low, , and or and , our opinion estimation method is able to identify that they are similar, as a human would do, by leveraging the judgements of the voters. Note that, due to their low lexical overlap, it would be difficult to identify such similarity using methods based on textual analysis. The discussion about the price of Twitter stock (see Figure 18) shows that our method is able to capture objective opinions about the price (whether it stays at $16 or goes up, , or stays down, and ) along one axis and subjective opinions questioning the reason behind the price drop along a different axis (suggests management is the reason, , or biased media/corruption in Wall Street, ). Note that suggests both that the price should go up and that the reason for the decrease is the management.

Figure 18, panel (a): Estimated opinions. Comments shown in the figure:
  • You can forget about $16 for a while.
  • Bye-Bye twitter sweet 16.
  • […] when world leaders speak, they turn to Twitter first. […] How is it trading at $16? […] How come Dorsey can’t monetize this instantaneous platform?
  • […] It’s about time [for] more positive news to get it […] up again. […] Seems to have support $16ish. […]
  • [Wall Street/CNBC] only want to pump [selective] stocks […] Twitter of China, Weibo, is selling for $88.00 a share […]
Footnote 3: Jack Dorsey, CEO of Twitter.
Figure 18: A subset of comments and estimated opinions for an online discussion about finance (price of Twitter stock). There are two distinct issues being discussed: (i) the objective price of the stock (, , , ) and (ii) a subjective discussion about reasons for the suppressed price (, ). and say that the price will stay below $16 (using some humor), while and suggest that the price may rise. suggests Wall Street/media bias against the stock and is neutral about the price of the stock, while questions the management of the company instead.

Finally, we assess to what extent comments in online discussions agree (or disagree) by analyzing the estimated opinions of the comments. More specifically, for each online discussion, we compute the percentage of comment pairs for which and compare this quantity with the percentage of upvotes among all votes (upvotes and downvotes). Figure (c)c summarizes the results, which show that the higher the dimension of the latent space of opinions, the lower the agreement between comments, as one may have expected. Remarkably, this finding would not be apparent directly from the fraction of upvotes, which remains relatively constant.
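The comparison above can be sketched in a few lines. The agreement criterion for a comment pair is not recoverable from the text here, so the sketch below assumes, purely for illustration, that two comments agree when the inner product of their estimated opinion vectors is positive. The function names, the criterion, and the toy data are all hypothetical.

```python
import itertools

import numpy as np


def agreement_fraction(opinions, agree=lambda x, y: float(np.dot(x, y)) > 0.0):
    """Fraction of comment pairs whose estimated opinions 'agree'.

    opinions: array of shape (num_comments, dimension).
    agree: pairwise criterion; a positive inner product is an ASSUMPTION
           made for this sketch, not the criterion stated in the paper.
    """
    pairs = list(itertools.combinations(range(len(opinions)), 2))
    agreeing = sum(agree(opinions[i], opinions[j]) for i, j in pairs)
    return agreeing / len(pairs)


def upvote_fraction(votes):
    """Fraction of upvotes (+1) among all observed votes (+1 / -1)."""
    votes = np.asarray(votes)
    return float(np.mean(votes == 1))


# Toy two-dimensional opinions for four comments (made-up numbers).
toy_opinions = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.2]])
toy_votes = [1, 1, -1, 1]

print("pairwise agreement:", agreement_fraction(toy_opinions))
print("upvote fraction:   ", upvote_fraction(toy_votes))
```

Passing the criterion as a parameter keeps the sketch usable with whatever pairwise condition is actually intended.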

6 Related work

Our work builds upon ideas from several areas of research, viz. (i) opinion mining, (ii) voting models, and (iii) sign-rank estimation. We will discuss related work in each area separately.

Opinion mining. One of the first theoretical models of opinion dynamics and opinion formation is due to DeGroot [12], who represented opinions as single real numbers. Since then, this has been a common modeling choice in many theoretical models of opinions [4, 17, 18, 31, 32]. However, this modeling choice is in conflict with voting data, which requires multidimensional models of opinions, as argued previously.

Empirical studies of opinions have often relied on sentiment analysis based on feature extraction methods to measure the polarity of expressed opinions and then determine the level of agreement between users [10, 26]. However, it has also been argued that feature extraction methods may not be sufficient to reliably determine the polarity of expressed opinions in several scenarios [26]. In our work, we overcome this limitation by leveraging the input of voters, acting as oracles, to determine the level of agreement (disagreement) between users.

The voting data we work with (upvotes and downvotes between voters and comments) can also be represented as signed graphs, which have been the focus of several empirical studies in the opinion mining literature [1, 15, 22, 23]. However, such studies did not consider the existence of underlying multidimensional, multi-sided opinions. In this context, the work by Hsieh et al. [20] is perhaps more closely related to ours since they aim to uncover the low-rank structure of signed graphs via matrix factorization. That being said, our work differs from theirs in two key aspects: (i) it explores a novel connection between the factorization of a signed graph and the sign-rank of a matrix and (ii) it considers that there is an underlying opinion model which is responsible for generating the signed graph.

Voting models. As voting is one of the basic tenets of a democratic society, there is a long history of voting models in the political science and law literature. However, the majority of voting models in this literature use a single real number to represent political orientation (left and right leanings) [26, see references in Section 4.1.4]. In this context, multidimensional representations of policies have been argued against in favor of the desirable properties of unidimensional spatial voting models [28]. That being said, the Downsian Proximity model and the Directional model [13, 25] can be extended to multidimensional representations. However, this literature assumes the (policy) dimensions to be known and primarily aims to uncover the relative weight that voters give to various policy dimensions. In contrast, in our work, we address the inverse problem of determining the dimension of the multidimensional representations of opinions given the votes cast by the voters.

Matrix sign-rank. The notion of sign-rank of a matrix was introduced in the seminal work by Paturi and Simon [27] in the context of unbounded error communication complexity of protocols given by binary matrices. Since then, there has been a rich history of work on matrix sign-ranks [3, 5, 6, 14, 21]. In terms of sign-rank estimation, Alon et al. [3] have developed a polynomial time multiplicative approximation algorithm, which has been a source of inspiration for our minimum sign-rank estimation algorithm in Section 3. In terms of sign-rank complexity, Basri et al. [5] and Bhangale and Kopparty [6] have shown that, while one can determine in polynomial time whether the sign-rank of a given sign matrix is at most two, determining whether it is at most k, for k ≥ 3, is NP-hard. However, in contrast to our own work, where we consider the sign-rank of partially observed matrices, all previous work has considered the sign-rank of fully observed matrices.
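To make the notion concrete: the sign-rank of a sign matrix is the smallest rank of any real matrix whose entries have the same signs. The sketch below is a standard textbook-style illustration and is not taken from the paper. It builds a 5 × 5 "threshold" sign matrix that has full rank as a real matrix, yet is reproduced entrywise in sign by a rank-2 matrix, so its sign-rank is at most two. The matrix size and the variable names `S`, `U`, `V`, and `M` are choices made for this example.

```python
import numpy as np

n = 5
idx = np.arange(n)

# Threshold sign matrix: S[i, j] = +1 if j >= i, else -1.
S = np.where(idx[None, :] >= idx[:, None], 1, -1)

# Two-dimensional vectors for rows and columns:
# row i -> (0.5 - i, 1), column j -> (1, j).
U = np.column_stack([0.5 - idx, np.ones(n)])
V = np.column_stack([np.ones(n), idx])

# M[i, j] = (0.5 - i) + j, which is positive exactly when j >= i.
M = U @ V.T

print("rank of S:", np.linalg.matrix_rank(S))
print("rank of M:", np.linalg.matrix_rank(M))
print("sign(M) matches S:", np.array_equal(np.sign(M), S))
```

The two-dimensional rows of `U` and `V` play the role of low-dimensional vectors whose inner products explain every sign in `S`, which is the kind of factorization that sign-rank measures.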

Finally, sign-rank is only one of many measures of matrix complexity, such as rank, trace-norm or max-norm. Srebro and Shraibman [29] give an excellent overview of various other matrix complexity measures.

7 Conclusion

In this work, we have proposed a modeling framework to generate latent representations of opinions using human judgements, as measured by online voting. As a consequence, such representations exhibit a remarkable semantic property: if two opinions are close in the latent space of opinions, it is because the voters (the crowd) think that they are similar. Our modeling framework is theoretically grounded and establishes an unexplored, surprising connection between opinion and voting models and the sign-rank of a matrix. Moreover, it also provides a set of practical algorithms to both estimate the dimension of the latent space of opinions and infer where opinions expressed in comments and held by voters lie in this space. Finally, we applied our framework to a large dataset from Yahoo! News, which consists of one day of online discussions about a wide variety of real-world news. Our experiments question the ability of unidimensional opinion models to accurately represent online discussions, show that many discussions are multisided and avoid falling prey to demagoguery, provide insights into human judgements and opinions, and show that our framework is able to circumvent language nuances such as sarcasm and humor by relying on human judgements.

Our work also opens up many interesting avenues for future work. For example, our measure of complexity, the dimension of the latent space of opinions, may be a good starting point to develop theoretically grounded measures of polarization [8, 9, 24] and controversy [15, 16], which have been lacking in the literature. Moreover, it would be very interesting to augment our modeling framework to also incorporate the textual information in the comments, in addition to the voting data. Our algorithm for determining the minimum dimension under which the opinion space is able to explain the voting data exhibits weak theoretical guarantees, though it performs well on real data. It would be interesting to develop an exact algorithm by adapting recent advances in exact sign-rank estimation [5, 6].

Acknowledgements. We thank Mounia Lalmas, Dmitry Chistikov and Rupak Majumdar for useful discussions.

References

  • [1] L. Akoglu. Quantifying political polarity based on bipartite opinion networks. In ICWSM, 2014.
  • [2] N. Alon, P. Frankl, and V. Rödl. Geometrical realization of set systems and probabilistic communication complexity. In FOCS, 1985.
  • [3] N. Alon, S. Moran, and A. Yehudayoff. Sign rank versus VC dimension. In Conference on Learning Theory, 2016.
  • [4] R. Axelrod. The dissemination of culture: A model with local convergence and global polarization. Journal of Conflict Resolution, 41(2):203–226, 1997.
  • [5] R. Basri, P. F. Felzenszwalb, R. B. Girshick, D. W. Jacobs, and C. J. Klivans. Visibility constraints on features of 3D objects. In CVPR, 2009.
  • [6] A. Bhangale and S. Kopparty. The complexity of computing the minimum rank of a sign pattern matrix. arXiv preprint arXiv:1503.04486, 2015.
  • [7] S. A. Bhaskar and A. Javanmard. 1-bit matrix completion under exact low-rank constraint. In CISS, 2015.
  • [8] Y. Choi, Y. Jung, and S.-H. Myaeng. Identifying controversial issues and their sub-topics in news articles. In Pacific-Asia Workshop on Intelligence and Security Informatics, pages 140–153. Springer, 2010.
  • [9] M. Conover, J. Ratkiewicz, M. R. Francisco, B. Gonçalves, F. Menczer, and A. Flammini. Political polarization on Twitter. In ICWSM, 2011.
  • [10] A. De, I. Valera, N. Ganguly, S. Bhattacharya, and M. G. Rodriguez. Learning and forecasting opinion dynamics in social networks. In Advances in Neural Information Processing Systems, pages 397–405, 2016.
  • [11] L. De Moura and N. Bjørner. Z3: An efficient SMT solver. Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340, 2008.
  • [12] M. H. DeGroot. Reaching a consensus. Journal of the American Statistical Association, 69(345):118–121, 1974.
  • [13] J. M. Enelow and M. J. Hinich. The spatial theory of voting: An introduction. Cambridge University Press, 1984.
  • [14] J. Forster. A linear lower bound on the unbounded error probabilistic communication complexity. Journal of Computer and System Sciences, 65(4):612–625, 2002.
  • [15] K. Garimella, G. D. F. Morales, A. Gionis, and M. Mathioudakis. Quantifying controversy on social media. ACM Transactions on Social Computing, 1(1):3, 2018.
  • [16] P. H. C. Guerra, W. Meira Jr, C. Cardie, and R. Kleinberg. A measure of polarization on social media networks based on community boundaries. In ICWSM, 2013.
  • [17] R. Hegselmann and U. Krause. Opinion dynamics and bounded confidence models, analysis, and simulation. Journal of Artificial Societies and Social Simulation, 5(3), 2002.
  • [18] P. Holme and M. E. Newman. Nonequilibrium phase transition in the coevolution of networks and opinions. Physical Review E, 74(5):056108, 2006.
  • [19] H. Hong et al. Comparison of several decision algorithms for the existential theory of the reals. Research Institute for Symbolic Computation, 1991.
  • [20] C.-J. Hsieh, K.-Y. Chiang, and I. S. Dhillon. Low rank modeling of signed networks. In KDD, 2012.
  • [21] T. Lee and A. Shraibman. An approximation algorithm for approximation rank. In Computational Complexity, 2009. CCC’09. 24th Annual IEEE Conference on, pages 351–357. IEEE, 2009.
  • [22] J. Leskovec, D. Huttenlocher, and J. Kleinberg. Predicting positive and negative links in online social networks. In WWW, 2010.
  • [23] J. Leskovec, D. Huttenlocher, and J. Kleinberg. Signed networks in social media. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 1361–1370. ACM, 2010.
  • [24] Y. Mejova, A. X. Zhang, N. Diakopoulos, and C. Castillo. Controversy and sentiment in online news. arXiv preprint arXiv:1409.8152, 2014.
  • [25] S. Merrill and B. Grofman. A unified theory of voting: Directional and proximity spatial models. Cambridge University Press, 1999.
  • [26] B. Pang, L. Lee, et al. Opinion mining and sentiment analysis. Foundations and Trends® in Information Retrieval, 2(1–2):1–135, 2008.
  • [27] R. Paturi and J. Simon. Probabilistic communication complexity. In FOCS, 1984.
  • [28] C. K. Rowley. The relevance of the median voter theorem. Zeitschrift für die gesamte Staatswissenschaft/Journal of Institutional and Theoretical Economics, (H. 1):104–126, 1984.
  • [29] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In COLT, 2005.
  • [30] E. Welzl. Partition trees for triangle counting and other range searching problems. In Proceedings of the fourth annual symposium on Computational geometry, pages 23–33. ACM, 1988.
  • [31] E. Yildiz, A. Ozdaglar, D. Acemoglu, A. Saberi, and A. Scaglione. Binary opinion dynamics with stubborn agents. ACM Transactions on Economics and Computation, 1(4):19, 2013.
  • [32] M. E. Yildiz, R. Pagliari, A. Ozdaglar, and A. Scaglione. Voting models in random networks. In Information Theory and Applications Workshop, pages 1–7, 2010.