Social dynamics is a multidisciplinary subject that, especially in the last two decades, has attracted strong interest owing to the exponential growth of digital data, in which trends, opinions, and relationships are nowadays continuously collected and analyzed. In this context, Twitter data related to political orientations are particularly appealing. In fact, a vast literature analyzing Twitter messages (Twitts) about politics has been produced since Twitter’s creation in 2006 GermanTumasjan ; Spain2016 ; India2014 ; Hauff ; Choy ; UK2010 ; Brazil ; Brazil-2 ; see also Avello and references therein. In some cases, the analysis of the Twitts collected prior to a particular electoral event, in which candidates compete for the position of president or prime minister, has made it possible to predict the actual electoral result, while in others it has not, even though fair statistics are lacking Avello . In these works, the main hypothesis is that the set of Twitts collected, say, from day -30 to day -1 is representative of the offline voting intention and contains the information about the candidate elected on day 0, the election day. The content of each message is then analyzed and, according to criteria based on specific dictionaries of key words, the Twitt is interpreted as being in favor of (mention), or possibly against (sentiment), one of the candidates. Any poll, however, is partial, and predictions must be taken with care even when the best criteria for interpreting the messages have been used. The problem with polls is in fact twofold, as they are limited in both “space” and “time”: on the one hand, samples are finite and the statistics suffer from finite-size effects; on the other hand, samples look only at specific times and, of course, contain no actual information about the future election day.
If a system is somehow at a critical point, the above limitations will dramatically affect the predictions even under the hypothesis that the system is approximately in a stationary regime.
In order to justify typical macroscopic patterns observed in social dynamics, several microscopic models have been proposed Santo . For example, on the basis of elementary laws among neighbors, such as replication of opinion, competition, and external noise, different versions of the Voter model Voter , as well as of the Sznajd dynamics Sznajd , have been studied and a large literature has been produced (see also references in Santo ). Despite important progress in this direction, especially with respect to certain observed universal properties, such as the spread of rumors Marsili and the vote distribution for electoral events characterized by a large number of candidates (as in the case of elections of legislators and city councilors) Bernardes ; Bernardes2 ; Travesso ; Santo2 ; Crokidakis , most of these studies have remained mainly qualitative (or spectral quantitative), the major problem being the lack of connection between the model parameters and the actual system parameters. This issue is in fact one of the most serious for any microscopic approach to social dynamics, since the couplings and the external fields that a real social system obeys, if any, depend on an infinite number of different uncontrolled sources, which makes it impossible to use the model for making predictions.
In this paper, we consider a different perspective. We do not aim at finding the most reliable microscopic model from “first principles”. Rather, we consider effective models which, on the basis of a limited number of macroscopic data, are able to incorporate the complex source of the collective behavior, and yet are simple enough to allow their actual parameters to be determined via an inverse problem, which in turn can reveal surprisingly interesting facts. To track the dynamics of a collection of Twitter users posting Twitts about a specific electoral event with $q$ candidates, in which, at any time $t$, we have access to the “volume” (or number of mentions), i.e., the number of Twitts in favor of a specific candidate and, in some cases, also to the “sentiment”, i.e., the number of Twitts in favor minus the number of Twitts against a specific candidate, we use a $q$-state Potts model Wu . The choice of the model is not arbitrary. In fact, as we shall show, a generalized $q$-state Potts model with suitable couplings and external fields, in principle time-dependent, is the model that maximizes the entropy of a set of data constituted by the mean number of mentions and the mention-to-mention correlations (more precisely, auto-correlations and two-point correlations). In other words, the generalized $q$-state Potts model is maximally random under the above
constraints. Unfortunately (in the present study), we do not have direct access to the mention-to-mention instantaneous correlations (microscopic level, or single-Twitt level). However, under the stationary hypothesis, where no impacting news reaches the Twitter network, the system can be considered as isolated. As a consequence, during a stationary regime, any apparent dynamics can be seen as solely due to stochastic fluctuations originating from a time-independent probability distribution, and we can derive the macroscopic (or coarse-grained) correlations from the time series of the mean number of mentions. Quite interestingly, these macroscopic correlations always scale as , where is an effective number of messages that can also be calculated from the data and that turns out to be a small fraction of the mean number of messages (where the mean is calculated over the given time interval of observation, ranging from minutes to days). In turn, this implies that the microscopic correlations scale at most as , i.e., they are effectively mean-field like. Remarkably, by assuming a mean-field Potts model (MFP) characterized by one single coupling , we find that, in all the electoral events that we have considered, the inverse problem is solved for a close to its critical value. However, other mean-field models could work as well. In fact, by comparing in detail the structure of the measured macroscopic correlations with the model macroscopic correlations, we see that, whereas they are both mean-field like (decay as ), in some cases the multinomial distribution (MD) turns out to be the best statistical model, while in others it is the MFP. Note that, in the MD case, microscopic correlations are zero, and macroscopic correlations arise only as a consequence of the constraint that the total number of messages is fixed, while in the MFP case microscopic correlations are present.
In both cases, the macroscopic connected correlation functions decay as , but their forms are quite different and allow for a clear distinction between the two corresponding scenarios, MD or MFP. Note also that, formally, the MD for candidates and messages corresponds to a $q$-state Potts model with zero coupling and suitable external fields, while the MFP is characterized by a coupling that scales as
and no external field. Clearly, one could look for other mean-field Potts models with more general features able to interpolate between the MD and the MFP, as well as to take into account the scale-free character of the underlying Twitter network Neville ; TwitterSF . However, the MD and the MFP, which constitute the two simplest and somehow opposite mean-field models, turn out to reproduce the measured macroscopic correlations well. In fact, as we shall show, the use of the effective number of messages makes it possible to take into account replicas of messages, which can be seen as strong, rigid correlations. This aspect makes the MD and MFP models rather nontrivial and effective.
The MD versus MFP scenario is very interesting. Our analysis shows that only in the MD case can one use the Twitts as a poll for predictions. In this case, in fact, during an approximately stationary regime, one can simply track the average trends and extrapolate the future election result by making use of standard regression methods. In the case of an MFP model, instead, the fact that the model lies at its critical point should alarm the statistician attempting to make predictions about the future electoral result. In fact, for $q=2$, where the Potts model is equivalent to the Ising model, at the critical point the system undergoes a second-order phase transition and fluctuations are unbounded in the thermodynamic limit. For $q>2$, instead, the system undergoes a first-order phase transition where fluctuations are bounded but sudden jumps occur. In any case, our analysis shows that, in several electoral events, the system tends to settle, over large time intervals, at its critical point, lying then between the two phases: a liquid/paramagnetic one where each candidate tends to receive, on average, the same percentage of votes, and a crystal/magnetized one where one of the candidates breaks the liquid/paramagnetic symmetry, gaining a net advantage over the others. The passages between the two phases appear to be rather unpredictable due to the permanent vicinity to the critical point, which therefore prevents any forecast of the trends. At least within the context of democratic elections, this scenario of intrinsic unpredictability seems to be due to the controversial spread of political and economic power that takes place among a few symmetrically powerful parties and respective candidates, the strongest of which rarely overcome . Strong candidates are quite able to compete with each other. Typically, a strong candidate publicly speaking at time, say, , has a good chance of weakening the impact of the speeches that her/his strong opponents gave at prior times .
This process can result in a continuous rank reversal of the trends that, statistically, translates into a system that remains on the boundary between the two phases. In such a scenario of symmetrically strong candidates, the winner of the election is just the result of a random large fluctuation.
The paper is organized as follows. In Sec. II, we introduce and develop the general inverse problem (IIA), whose formal solution is given in terms of a generalized Potts model having suitable couplings and external fields (IIB), with important simplifications arising in the homogeneous case (IIC) which, in turn, includes the sub-case of zero correlations where the inverse problem is solved by the MD. In Sec. III, we review the MFP in detail (IIIA). Although the MFP can be seen as a particular solution of the inverse problem, for pedagogical reasons we prefer to describe it in a separate section where, in particular, we calculate the connected correlation functions (IIIB). In Sec. IV, we apply the theory to several Twitter data sets of trends (time series of the mean number of mentions, and also sentiment in a specific case). In doing this, we first stress the peculiarities of the MD and MFP models (IVA), then we show how to measure the macroscopic correlations in a stationary regime (IVB), and define the effective number of messages, which provides further interesting information on the tendency of the users to copy each other (IVC). Finally, we apply the theory to several electoral events. At the end, some conclusions are drawn.
2 Inverse Problem in Social Sciences
Here we introduce a general unbiased approach that, in principle, allows one to derive the actual system parameters, although at a high computational cost and also a high data cost. In fact, it is worth recalling that only certain sets of Twitter data are freely available. We first illustrate the procedure in an idealized context where we assume that data are known at the user-microscopic level. Next, we shall relax this assumption and make the naive mean-field approximation in order to cope with the incompleteness of the available data and for the sake of simplicity.
We assume that there exist nodes/users/agents/messages and that each agent owns a status variable , the “spin”, that can take possible different values labeled as . For example, the state can represent an opinion on a given topic where only different answers are possible, or the state can represent the agent’s preference over a set of political candidates. The status must take into account two different tendencies of the spin: a tendency to be influenced by its neighbors, and a tendency to follow a behavior independently of the others. The latter in turn can be either due to an intrinsic tendency of the user to follow her/his beliefs, or due to news and events whose sources are external to the set of the agents. In the framework of equilibrium statistical mechanics, the natural way to model such a system consists in using the $q$-state Potts model, where the couplings quantify the tendency of the spins to be influenced by their neighbors, whereas the external fields quantify the tendency of the spins to follow a behavior independently of the other spins (similar ideas have been applied in Rumor for the case to analyze the interaction between two communities). The use of the Potts model in an attempt to describe social interactions is perhaps already available in the literature, but not abundant for . However, most of the works so far produced in this direction turn out to be quite disconnected from real-world phenomena. The very concepts of Hamiltonian, coupling, temperature, and external field, which are well-established physical entities, too often assume an abstract meaning in social sciences and seem to have no direct connection with real data. In this way, the use of statistical mechanics in social science remains a merely pictorial description of the real phenomena under study, with no predictive power. This criticism applies also to other tools which traditionally do not belong to statistical mechanics (e.g. community detection algorithms, voter-like models, and so on).
We show here that it is instead possible to introduce in social science a rigorous use of statistical mechanics, where the connection with the real observed data is exact, and where the probabilistic predictive power of statistical mechanics is recovered, though at the expense of high computational costs as well as expensive data. The following probabilistic method is certainly not original (see for example the same method applied to graph theory to derive the probability distribution in a canonical ensemble of graphs Newman ), but it is perhaps new in the present context and highlights several points that are important to our aim.
2.1 The General Problem and its Formal Solution
Let $\boldsymbol{\sigma}$ be the vector of the spin configurations: $\boldsymbol{\sigma}=(\sigma_1,\ldots,\sigma_N)$. Let us assume that at any instant there exists the probability distribution . Suppose that all we know about the real phenomena under consideration is a set of independent averages over the distribution of certain observables (in the following we shall omit the dependency on for brevity):
We can “derive” as the distribution that maximizes the information entropy under the constraints (1). In other words, we have to find the distribution that maximizes the following functional
where is a trial probability distribution. The maximum of must be found with respect to and the Lagrangian multipliers, and . The latter Lagrangian multiplier ensures the normalization of while the others ensure that the constraints (1) are satisfied. The general formal solution of the above problem is the following Gibbs-like distribution
and the Lagrangian multipliers solve the following system of Eqs.
Notice that the normalization condition of is automatically satisfied ( is absorbed in ). We stress that, in general, does not coincide with the target unknown distribution . In fact, only in the limit of an infinite number of independent averages can we have equality between the two distributions (formally, for ). For finite, is simply the distribution that, under the constraints (1), is maximally random with respect to the distribution .
More precisely, the use of Shannon’s entropy ensures that: i) is maximally random under the constraints (there are in principle other functionals that satisfy the same requirement, like Rényi’s and Tsallis’ entropies), ii) if for some data we have, for example, , this means that , a property which is consistent only with an additive (and then extensive) entropy, like Shannon’s or Rényi’s entropy, iii) the chain rule for the conditional probability, (which is a natural requirement in information theory), is satisfied only by Shannon’s entropy.
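The maximum-entropy construction described above can be illustrated numerically on a toy system. The sketch below is not the procedure used in this paper; it assumes a system small enough that all configurations can be enumerated, and it fits the Lagrange multipliers of a Gibbs-like distribution by plain gradient ascent on the concave dual problem. The function name `fit_maxent`, the two example observables, and the target averages are hypothetical choices made only for illustration.

```python
import itertools
import numpy as np

def fit_maxent(observables, targets, n_states, n_spins, lr=0.5, steps=5000):
    """Fit multipliers mu so that P(s) ~ exp(sum_k mu_k f_k(s)) matches the targets.

    Brute-force enumeration: only feasible for very small n_spins.
    """
    configs = list(itertools.product(range(n_states), repeat=n_spins))
    # F[c, k] = value of observable k on configuration c
    F = np.array([[f(c) for f in observables] for c in configs], dtype=float)
    mu = np.zeros(len(observables))
    for _ in range(steps):
        logw = F @ mu
        p = np.exp(logw - logw.max())   # subtract the max for numerical stability
        p /= p.sum()                    # normalization (the extra multiplier)
        mu += lr * (targets - p @ F)    # move model averages toward the targets
    # Recompute p so that it corresponds to the returned multipliers
    logw = F @ mu
    p = np.exp(logw - logw.max())
    p /= p.sum()
    return mu, p, F

# Hypothetical example: 3 two-state spins, two constrained averages
observables = [
    lambda s: float(s[0] == 1),      # probability that spin 0 is in state 1
    lambda s: float(s[0] == s[1]),   # probability that spins 0 and 1 agree
]
targets = np.array([0.7, 0.6])
mu, p, F = fit_maxent(observables, targets, n_states=2, n_spins=3)
```

In this toy case the fitted distribution reproduces the two imposed averages and leaves the unconstrained third spin uniformly distributed, which is what "maximally random under the constraints" means in practice. For realistic system sizes the enumeration is impossible, which is one reason the text later resorts to mean-field models.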
2.2 Most general means and correlations
Let us now consider a system of agents where, at any instant, can take one of possible values . We assume that, at any instant , we have the following independent data [Footnote 1: The sets of data are of course not independent since for any . Formally, we can deal with this issue by making use of another Lagrange multiplier to be used in Eq. (2.1) that takes into account this constraint. However, it is easy to see that can be absorbed in with no effect on the final result for , Eqs. (3-2.1) being the same.]:
Notice that in defining the correlation , we have imposed since the information corresponding to is already contained in the data of . According to Eqs. (3)-(2.1), we have the following general Gibbs-Boltzmann form
Eqs. (10) and (2.2) establish our inverse problem: having the sets of the means and the set of the correlations , we have the necessary and sufficient conditions to find, at least numerically, the set of the external fields , and the set of the couplings .
2.3 Homogeneous case
Above, we have discussed a very general case in which the user’s orientation and the user-user correlations depend both on the user and on the orientation. However, in a sufficiently large system, we expect the following homogeneity to take place for the two sets of data:
The meanings of and are
where we have used
, and introduced the “frequency” random variable
where the sets of external fields and couplings must satisfy
Homogeneous uncorrelated case - Multinomial Distribution
This case applies when the ’s are independent and uniformly distributed random variables. As a consequence, trivial correlations exist only for the frequency random variables, Eqs. (18). In fact, due to the constraint , it is easy to see that
and the probability distribution for the ’s, , turns out to be the multinomial distribution [Footnote 2: With an abuse of terminology, we shall say that the probability distributions for both the ’s and the ’s are MDs.]
where , proportional to the random variable , is a new random variable counting how many users have status , in terms of which, the connected correlation functions are
It is possible to derive the multinomial distribution (25) by using the previous approach as follows. From Eq. (19) we see that, since the ’s have no correlations, we can look for solutions in which and we find
The fields ’s can be expressed in terms of the means ’s by applying Eq. (16), which provides the following formal system of equations
On noting that the last expression depends only on the multiplicity random variables ’s, we can finally obtain the distribution for the latter, , by multiplying by the number of ways in which we can arrange the vector among spins, , and the result is the multinomial distribution (24). Later on, we shall make use of Eqs. (23) for a crucial benchmark that distinguishes between interacting and non interacting models.
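As a numerical illustration of the trivial correlations of the multinomial distribution, the sketch below compares the standard multinomial covariance, Cov(n_a, n_b) = N (p_a delta_ab - p_a p_b), with the covariance of simulated counts. The probabilities, the number of messages, and the function name `multinomial_cov` are hypothetical values and names chosen for the example; they are not taken from the data sets analyzed in this paper.

```python
import numpy as np

def multinomial_cov(p, n):
    """Exact covariance matrix of multinomial counts with n trials and probabilities p."""
    p = np.asarray(p, dtype=float)
    return n * (np.diag(p) - np.outer(p, p))

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # hypothetical preferences for 3 candidates
n = 200                         # hypothetical number of messages per sample

# Each row is one realization of the counts (n_1, n_2, n_3), with n_1+n_2+n_3 = n
counts = rng.multinomial(n, p, size=50000)
sample_cov = np.cov(counts, rowvar=False)

exact_cov = multinomial_cov(p, n)
# Covariance of the frequencies x_a = n_a / n decays as 1/n
freq_cov = exact_cov / n**2
```

Every row of `exact_cov` sums to zero, which reflects the fact that these correlations come only from the constraint on the total number of counts. The off-diagonal entries are all negative, a signature that can be contrasted with the correlations of an interacting model.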
3 The mean-field Potts model
When a set of data displays a symmetry (or some approximate symmetry) among or of its components, the distribution in which corresponds to the mean-field Potts model turns out to be a good candidate for solving the inverse problem of the previous Section. Below, we review the traditional $q$-state mean-field Potts model, emphasizing some points not often stressed in the literature. We shall evaluate, in particular, the connected correlation functions for finite (to the best of our knowledge, this constitutes a new result). A posteriori, we can say that the latter allow one to establish rigorously under which hypotheses on the data the inverse problem is solved by the distribution associated with the mean-field Potts model. Indeed, since the complete analytical solution (direct problem) of the mean-field Potts model is rather nontrivial, it is pedagogically much more convenient to devote an entire Section to the solution of the direct problem.
3.1 Uniform coupling and homogeneous fields
The mean-field Potts model with a uniform coupling and homogeneous fields is defined through the following Hamiltonian built on the fully connected (or complete) graph
where is the Kronecker delta function. Let us rewrite as (up to terms negligible for )
From Eq. (30) we see that, if , by introducing independent Gaussian variables, , , we can evaluate the partition function, , as
From Eq. (31), for , we get immediately the following system of equations for the saddle point
while the free energy density is given by
to be evaluated in correspondence of the solution of the system (32). For each , coincides with the thermal average of , i.e., the probability to find any spin in the state . For Eqs. (32) are symmetric under permutation of the components . Moreover, for all the possible solutions of Eqs. (32) can be found by setting components equal to each other and solving one single equation. If is any permutation of , then we set
where , and satisfies the equation
Eqs. (33)-(35) give rise to a well-known phase transition scenario Wu : a second-order mean-field Ising phase transition sets in only for $q=2$ at the critical coupling , while for any $q>2$ there is a first-order phase transition at the following critical value
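The phase-transition scenario can be explored numerically with a short sketch. It assumes a specific and common normalization, H = -(J/N) sum_{i<j} delta(sigma_i, sigma_j) - sum_i h(sigma_i), which may differ from the convention of this paper by constant factors. Under that assumption the saddle-point equations take the softmax form x_a = exp(K x_a + b_a) / sum_b exp(K x_b + b_b), with K = beta J and b_a = beta h_a, and the textbook critical values are K_c = 2 for q = 2 and K_c = 2 (q-1)/(q-2) ln(q-1) for q > 2. The names `solve_saddle_point` and `critical_coupling` are hypothetical.

```python
import numpy as np

def solve_saddle_point(K, q, h=None, x0=None, tol=1e-12, max_iter=10000):
    """Fixed-point iteration of x_a = softmax(K * x + h)_a (assumed normalization)."""
    h = np.zeros(q) if h is None else np.asarray(h, dtype=float)
    x = np.full(q, 1.0 / q) if x0 is None else np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        a = K * x + h
        w = np.exp(a - a.max())   # subtract the max for numerical stability
        x_new = w / w.sum()
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x

def critical_coupling(q):
    """Critical K = beta*J at zero field, in the normalization assumed above."""
    if q == 2:
        return 2.0
    return 2.0 * (q - 1) / (q - 2) * np.log(q - 1)

q = 3
x_start = [0.5, 0.25, 0.25]   # slightly biased start, to allow symmetry breaking

x_para = solve_saddle_point(K=2.0, q=q, x0=x_start)    # below K_c: symmetric
x_ferro = solve_saddle_point(K=3.5, q=q, x0=x_start)   # above K_c: one winner
K_c = critical_coupling(q)
```

Below the critical coupling the iteration relaxes to the symmetric solution with all components equal to 1/q, while above it the biased start converges to a solution with one winner and q-1 equal losers. Plain fixed-point iteration only finds locally stable solutions, so near the first-order transition the free energies of the competing solutions would have to be compared to decide which one dominates.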
3.2 Covariances and correlations in the case of uniform coupling and homogeneous fields
In order to calculate correlations in the case of homogeneous fields, we first have to generalize the previous calculation to include site- and status-dependent external fields: . The generalized Hamiltonian is
and the generalized partition function is
The saddle point of the integral in Eq. (38) corresponds to the system of Eqs.
and the function evaluated at the saddle point, , is
which provides the partition function as
The one-point and two-point connected correlation functions are related to via
The partial derivatives of can be obtained by solving the following system derived from the saddle point Eqs. (41)
from which we see that
As a consequence, as expected, we get
Notice that the last Eq. holds only for , while for
which, plugged into Eq. (3.2) gives
It is easy to check that, if , for it is , so that, in the paramagnetic phase, up to a positive constant, the correlations of the mean-field Potts model have the same formal structure of the correlations of a multinomial distribution where for any , see Eqs. (23).
In general, Eq. (54) can be put in a matrix form by
where is the identity matrix,
and stands for the dyadic product between the vector and itself.
4 Application to Twitter data for political trends
Twitter is a microblogging environment where users post short messages, or Twitts, expressing their likes and dislikes toward a certain topic, e.g. the candidates in the next political elections. In this Section, we apply the previous general approach to Twitter data related to political trends.
We have considered a number of electoral events and used Twitter data collected and published in papers and/or websites Brazil ; Brazil-2 ; UK2010 ; Spain2016 ; India2014 . In Refs. Brazil ; Brazil-2 ; UK2010 ; Spain2016 ; India2014 , the following general scheme is used: for each electoral event there are main candidates, each candidate is represented by a label , and the activity of a large number of users (order or more) is followed for days (or minutes) prior to or after the election day (or election minute), or partly before and partly after. The days (minutes) of observation are represented by the integer , while the corresponding collected number of messages (order or more) is represented by . At each , the activity of the -th message, , is then analyzed by looking for key words providing the political preference of the user toward one specific candidate among the ones: . The percentages , or normalized mean number of mentions, as in Eq. (16), are then tracked until the final time . Taking into account that , these data constitute a set of independent pieces of information. Consistently with the general approach derived in Sec. II, the statistical model to be used for these data should correspond to the generalized Potts model via Eq. (8) where, besides being totally general, the couplings and external fields are also time dependent. Notice that, whereas in Sec. II, represents the number of users (or agents), here (at each time ) refers to the number of messages. We stress that, from a probabilistic point of view, the role of in Sec. II, and the role of here, are the same. Here, the advantage of working with the number of messages rather than the number of users lies in the fact that each user, in general, at each time is free to post an arbitrary number of messages (whereas in Sec. II each user/agent was supposed to take a unique status at each ).
It is then clear that in Twitter, in general, does not represent the mean offline preference toward the candidate since a user of Twitter could repeat her/his preference many times while an offline person can express her/his vote just once. On the other hand, it is also possible that each Twitter user repeats her/his preference the same number of times on average. In such a case, the normalized mean number of mentions would be representative of the offline preference. At any rate, the main aim of this work is to look for the models that best reproduce the observed Twitter data, independently of the actual offline will of vote.
4.1 Statistical Models: Multinomial Distribution (MD) vs Mean-field Potts (MFP)
Unfortunately (currently) the percentages constitute the only data available to us. In particular, we have no direct access to the single message activity , nor to the instantaneous correlation functions. As a consequence, even under the very mild assumption of homogeneity, where the Potts model takes the form (19), we have no way to determine the involved general couplings and external fields . As mentioned in the Introduction, we overcome this impasse by assuming a stationary regime and by using the empirical observation that the macroscopic correlations (see below) decay inversely with an effective number of messages . The stationary hypothesis allows us to measure the macroscopic correlations by using the data, while the weak correlations lead us to assume one of the two possible statistical models: MD and MFP. In summary, the two possible statistical models are characterized as follows.
Multinomial distribution (MD): the ’s are independent random variables taking possible values according to arbitrary probabilities normalized to 1; as a consequence, weak (order ) and trivial (due to the constraint ) macroscopic correlations are present only in the frequency random variables ’s (not in the ’s ), as given by Eqs. (23). This model reproduces an urn of elements, each taking a color among possible ones, biased by the probabilities . As will be made clear soon, the MD model works well also when the messages are strongly correlated via replication.
Mean-field Potts model (MFP): the ’s are weakly (order ) correlated random variables taking possible values, and the only parameter is a uniform coupling . This model reproduces a kind of self-organized system where the users do not have arbitrary preferences, since these are determined by the coupling . In particular, there exists a critical value below which the system is symmetric with , and above which the system has a winner and equal losers, as determined by Eqs. (34-35). The nontrivial correlations are given by Eqs. (56), to be solved in combination with Eqs. (58-60). Also in this case, the MFP model works well even if the messages are strongly correlated via replication.
The above two models are both mean-field like and constitute two opposite limits. We have also considered some intermediate cases involving both a coupling and external fields, for which Eqs. (32) apply. However, as will be evident in the next examples, such attempts fail because the coupling and external fields strongly fluctuate, contradicting the stationary hypothesis. We shall come back to this issue in the Conclusions.
4.2 Sampled Means and Correlation functions - Stationary regimes
For both models, MD and MFP, the correlation functions of the random variables ’s (model macroscopic correlation functions) can be expressed in terms of the means ’s alone. On the other hand, within the stationary assumption (which means that the ’s originate from a time-independent probability distribution) we can also measure the macroscopic correlations from the time series. In fact, such a condition certainly applies when no news arrives and the users keep or change their status only due to closed interactions within the Twitter community. Under such circumstances, we can evaluate the sampled means and the correlation functions as
However, to properly take into account the fact that the number of messages strongly fluctuates, we shall rather use the following weighted means and correlations
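Weighted estimators of this kind can be sketched numerically as follows. The weighting scheme used here, in which each time step is weighted in proportion to its number of messages, is an assumption of this sketch and may differ in detail from the definitions adopted in this paper. The function name `weighted_moments` and the small data array are hypothetical.

```python
import numpy as np

def weighted_moments(x, n_msgs):
    """Weighted means and connected correlations of a time series of fractions.

    x      : array of shape (T, q), fraction of mentions per candidate at each time
    n_msgs : array of shape (T,), number of messages collected at each time
    Assumption of this sketch: weights are proportional to n_msgs.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(n_msgs, dtype=float)
    w = w / w.sum()                  # normalized weights
    mean = w @ x                     # weighted mean of each component
    dx = x - mean                    # fluctuations around the weighted mean
    cov = (dx * w[:, None]).T @ dx   # weighted connected correlations
    return mean, cov

# Hypothetical time series: 4 time steps, 3 candidates, rows sum to 1
x = np.array([
    [0.50, 0.30, 0.20],
    [0.60, 0.25, 0.15],
    [0.40, 0.35, 0.25],
    [0.55, 0.30, 0.15],
])
mean_eq, cov_eq = weighted_moments(x, [10, 10, 10, 10])    # equal weights
mean_w, cov_w = weighted_moments(x, [100, 10, 400, 50])    # fluctuating volumes
```

With equal weights the function reduces to the ordinary sample mean and the biased sample covariance. Because the fractions at each time sum to one, every row of the resulting correlation matrix sums to zero for any choice of weights.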