1 Introduction
The Web is universally recognised as a huge source of textual information in a wide variety of languages. The information may be expository in nature, factual, or may consist of opinions. Due to the sheer volume of information available on any given subject in any given language, the use of natural language processing tools to extract this information is a common strategy. The extraction of opinions, evaluations and attitudes from free text, which goes by the name of sentiment analysis, is frequently implemented with the use of lexicons, or dictionaries, which assign a polarity (positive, negative or neutral) to the adjectives which appear in the text. Alternatively, machine learning approaches, which require manually classified training data, can be used.
Manual classification, which is often considered the gold standard, requires a large corpus of texts with manual annotation. In languages such as Portuguese such a resource is not available; in a recent review of statistical methods for sentiment analysis [21], these languages are referred to as resource-poor languages. Even in languages such as English, for which many annotated texts exist, problems can arise due to the divergence of opinions between different annotators studying the same text [16]. In [16] a variety of tweets in different languages were annotated by multiple annotators, including 9 separate annotators for English, and several of the English-language tweets were analyzed by more than one annotator. In particular, there are 1144 tweets analyzed by both annotator 78 and annotator 80 (notation of [16]). Each annotator assigns to each tweet one of three labels: positive, negative or neutral. The combined results of their analysis are shown in the contingency table below.
Table 1: Annotation of the same 1144 English tweets by annotators 78 and 80.

         N78   Ne78   P78
N80      236    76     35
Ne80      50   295    113
P80       16    58    265
The presence of non-zero off-diagonal elements in the contingency table demonstrates a lack of agreement between these two annotators. Similar results can be obtained from other pairs of annotators. In addition to the lack of agreement between different annotators, the authors of [16] show that the same annotator may, on different occasions, assign different labels to the same tweet. These results complicate the use of manual annotation as a guideline for training a classifier. Given this variability in manual annotation, it is worthwhile to have a framework to directly compare algorithms without any reference to manual annotation. This is the main motivation for this paper. The statistical methodology which we discuss later will be used to show, in a real dataset, that within the same lexicon the degree of concordance between algorithms is label dependent, and that for the same labels the degree of concordance between algorithms is lexicon dependent. In particular, it will be shown that it is exceedingly unlikely that these differences are due to chance variation.
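As an illustration of the kind of agreement measure discussed in Section 2, the agreement between annotators 78 and 80 can be quantified directly from the counts above. The following is a minimal Python sketch (the variable names are ours) computing the observed proportion of agreement and Cohen's kappa.

```python
import numpy as np

# Contingency table for annotators 78 (columns) and 80 (rows),
# labels ordered N, Ne, P (Table 1).
table = np.array([
    [236,  76,  35],
    [ 50, 295, 113],
    [ 16,  58, 265],
])

n = table.sum()                      # 1144 tweets in total
p_obs = np.trace(table) / n          # observed proportion of agreement
# expected agreement if the two annotators labelled independently
p_exp = (table.sum(axis=1) @ table.sum(axis=0)) / n**2

kappa = (p_obs - p_exp) / (1 - p_exp)
print(f"observed agreement = {p_obs:.4f}, kappa = {kappa:.4f}")
```

The two annotators agree on roughly 70% of the tweets, well above chance but far from perfect agreement.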
2 Statistical Methodology
In order to motivate the choice of statistical methodology, we will consider the same set of texts analysed by two different algorithms. The output of each algorithm on any one of the texts is a label: positive (P), neutral (Ne) or negative (N). The output of both algorithms can be summarised in a contingency table of the form below.

Table 2: Generic contingency table for the comparison of two algorithms (rows: algorithm 1; columns: algorithm 2).

        N          Ne         P
N    $n_{11}$   $n_{12}$   $n_{13}$
Ne   $n_{21}$   $n_{22}$   $n_{23}$
P    $n_{31}$   $n_{32}$   $n_{33}$
In Table 2, the diagonal entry $n_{ii}$ is the number of texts for which both algorithms assign label $i$; the off-diagonal entry $n_{ij}$ ($i \neq j$) is the number of texts for which one algorithm assigns label $i$ and the other label $j$, and $n_{ji}$ describes the same situation but with the labels of the algorithms switched. The $n_{ij}$ notation is the more common notation for the entries of this kind of table and will be used later on. The presence of non-vanishing off-diagonal values indicates less than perfect agreement between the algorithms. However, it is still desirable to check the degree of concordance between the methods, more specifically to see whether the degree of concordance is greater than would be expected by pure chance. The comparison of the performance of different evaluators arises in various contexts, for example when evaluating the degree of concordance between two radiologists, each of whom assigns one of four qualitative labels to the same set of slides of ovarian tumors. Among the methods suggested in the statistical literature for the comparison between experts are marginal homogeneity testing and log linear modelling; we will describe each of these methods in turn and later describe their use. Cohen's kappa [8], a single parameter which measures the degree of agreement between algorithms over and above that expected by pure chance, has frequently been used for this purpose. However, there are well known deficiencies associated with this measure: "In summarizing a contingency table by a single number the reduction of information can be severe" [2]. Thus we will make use of other statistical methods as well. None of the methods we will describe assume knowledge of the true labels. Thus methods which rely on ranking classifier performance, such as the Friedman test [11], whose use is suggested in [9] and [7], will not be considered in this paper. In addition, we will not make use of the methods in [21] which assume known class labels.
2.1 Marginal Homogeneity Tests
The first category of tests which will be used are those which help compare:

- the difference in the proportions of texts labelled N by the two algorithms,

- the difference in the proportions of texts labelled Ne by the two algorithms, and

- the difference in the proportions of texts labelled P by the two algorithms.
These are marginal homogeneity tests; for $3 \times 3$ contingency tables all three differences can be tested simultaneously for statistical significance via the Stuart–Maxwell test [22],[15]. If the p-value obtained from this test is significant, then at least one of the three differences is statistically significantly different from zero. The Stuart–Maxwell test used here is a more general version of the McNemar test, whose use in the comparison of binary classifiers was suggested in [6].
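A minimal implementation of the Stuart–Maxwell statistic is sketched below in Python (numpy only; the function and variable names are ours). For a $k \times k$ table the statistic is $d' S^{-1} d$, where $d$ contains $k-1$ of the differences between row and column marginal totals and $S$ is the estimated null covariance matrix of $d$; under marginal homogeneity the statistic is asymptotically $\chi^2$ with $k-1$ degrees of freedom.

```python
import math
import numpy as np

def stuart_maxwell(table):
    """Stuart-Maxwell test of marginal homogeneity for a square table.

    Returns (statistic, degrees of freedom); the statistic is
    asymptotically chi-squared with k - 1 degrees of freedom.
    """
    t = np.asarray(table, dtype=float)
    k = t.shape[0]
    # Differences between row and column marginal totals (one is redundant).
    d = (t.sum(axis=1) - t.sum(axis=0))[: k - 1]
    # Estimated covariance matrix of d under the null hypothesis:
    # S_ii = n_i. + n_.i - 2 n_ii,  S_ij = -(n_ij + n_ji) for i != j.
    s = -(t + t.T)[: k - 1, : k - 1]
    np.fill_diagonal(s, (t.sum(axis=1) + t.sum(axis=0) - 2 * np.diag(t))[: k - 1])
    stat = float(d @ np.linalg.solve(s, d))
    return stat, k - 1

# Annotators 78 and 80 from Table 1 (rows/columns ordered N, Ne, P)
stat, df = stuart_maxwell([[236, 76, 35], [50, 295, 113], [16, 58, 265]])
# For df = 2 the chi-squared survival function has the closed form exp(-x/2)
p_value = math.exp(-stat / 2)
print(f"statistic = {stat:.2f}, df = {df}, p = {p_value:.2e}")
```

Applied to the annotator table of the Introduction, the test strongly rejects equality of the two annotators' marginal label frequencies.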
2.2 Log Linear Models
In addition to marginal homogeneity tests we will also analyze contingency tables using log linear models. The use of log linear models to study agreement among observers has been discussed before [1]; the same methodology can be applied to study the agreement between algorithms. We will compare two kinds of models, the independence model and quasi independence models, to test the concordance between the algorithms in their analysis of polarity. The independence model may be written as
$$\log \mu_{ij} = \lambda + \lambda_i^{A} + \lambda_j^{B} \qquad (1)$$

where $\mu_{ij}$ is the expected count in cell $(i,j)$ of the contingency table and $\lambda_i^{A}$, $\lambda_j^{B}$ are the main effects associated with the two algorithms. In this model the level of agreement between the algorithms is exactly the level expected by pure chance.
If we assume that the algorithms show a degree of concordance which exceeds that expected by pure chance, then the simplest model we can consider is

$$\log \mu_{ij} = \lambda + \lambda_i^{A} + \lambda_j^{B} + \delta\, I(i=j) \qquad (2)$$

where $I(i=j)$ equals 1 on the diagonal of the table and 0 elsewhere.
In Equation 2 the parameter $\delta$ enters only in the diagonal elements and describes the extent to which the values of the diagonal elements differ from the values expected by pure chance. However, Equation 2 assigns the same parameter to all three labels, describing a common rate at which the expected diagonal counts differ from those expected under chance agreement between the two algorithms. In practice, the algorithms may have different degrees of concordance for different labels. To describe this situation we will use a specific type of quasi independence model [5] which can be written as
$$\log \mu_{ij} = \lambda + \lambda_i^{A} + \lambda_j^{B} + \delta_i\, I(i=j) \qquad (3)$$
The parameter $\delta_i$ thus describes the contribution to the rate at which both algorithms assign label $i$ which cannot be explained by pure chance. There are three such parameters, $\delta_N$, $\delta_{Ne}$ and $\delta_P$, and the difference between the sizes of these parameters is an indication of the difference in the degree of concordance between the algorithms for each label separately. The use of quasi independence models to distinguish the degree of concordance and disagreement in human observers for each separate label has been suggested in [4].
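To make the model concrete, Equation 3 can be fitted as a Poisson log-linear regression on the nine cell counts. The sketch below (Python with numpy; the helper is ours, not the R code used later in Section 3) builds a design matrix with dummy-coded main effects plus one diagonal indicator per label and fits it by iteratively reweighted least squares; run on the LIWC table of Section 4 it should reproduce, up to rounding, the $\delta_i$ estimates and residual deviance reported there.

```python
import numpy as np

def fit_quasi_independence(table, n_iter=50):
    """Fit log mu_ij = lam + lam_i^A + lam_j^B + delta_i * 1[i = j]
    by IRLS for Poisson regression. Returns (delta estimates, deviance)."""
    t = np.asarray(table, dtype=float)
    k = t.shape[0]
    rows, cols = np.indices((k, k))
    rows, cols, y = rows.ravel(), cols.ravel(), t.ravel()
    blocks = [
        np.ones((k * k, 1)),                               # intercept
        (rows[:, None] == np.arange(1, k)).astype(float),  # row effects (ref: row 0)
        (cols[:, None] == np.arange(1, k)).astype(float),  # column effects (ref: col 0)
        ((rows == cols)[:, None]
         & (rows[:, None] == np.arange(k))).astype(float), # one indicator per diagonal cell
    ]
    x = np.hstack(blocks)
    # Start close to the observed counts to keep Newton's method stable
    beta, *_ = np.linalg.lstsq(x, np.log(y + 0.5), rcond=None)
    for _ in range(n_iter):
        mu = np.exp(x @ beta)
        beta = beta + np.linalg.solve(x.T @ (mu[:, None] * x), x.T @ (y - mu))
    mu = np.exp(x @ beta)
    deviance = 2.0 * np.sum(y * np.log(y / mu) - (y - mu))
    return beta[-k:], deviance

# Words (rows) vs Adj (columns) under LIWC, labels ordered N, Ne, P (Section 4)
deltas, dev = fit_quasi_independence([[55, 4, 97], [49, 637, 1009], [36, 24, 322]])
print("delta estimates:", np.round(deltas, 4), " residual deviance:", round(dev, 4))
```

Note that because each diagonal cell has its own parameter, the fitted diagonal counts equal the observed ones, and all remaining lack of fit lives in the off-diagonal cells (one residual degree of freedom for a $3 \times 3$ table).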
One question relevant to the analysis of concordance against disagreement which can be addressed through the use of log linear modelling is the following: if we have two observations which both algorithms classify as being different, for example one being labelled $a$ and the other $b$, what are the odds that both algorithms agree rather than disagree on which observation is $a$ and which is $b$? Questions of this nature can easily be answered within the framework of a quasi independence model. If we denote by $\pi_{ab}$ the probability that for a given observation one algorithm chooses label $a$ and the other label $b$, then as shown in [2] the log of these odds can be estimated by

$$\tau_{ab} = \log \frac{\pi_{aa}\,\pi_{bb}}{\pi_{ab}\,\pi_{ba}} = \delta_a + \delta_b \qquad (4)$$

A large value of Equation 4 suggests a strong tendency to agree rather than to disagree on what is $a$ and what is $b$. The value of the log odds is 0 if the tendency to agree is comparable with the tendency to disagree. The same analysis can be repeated for all possible combinations of labels, and for different lexicons, leading to pairwise measures of agreement between different labels.
We will also consider the logarithm of the odds ratio, which in the framework of the quasi independence model is

$$\log \theta = \tau_{ab} - \tau_{cb} = \delta_a - \delta_c \qquad (5)$$

This quantity may be used to see whether the tendency to agree on what is $a$ and what is $b$ is comparable with the tendency to agree on what is $c$ and what is $b$.
Due to finite sample sizes, however, model parameters cannot be estimated exactly. We will estimate these parameters and their confidence intervals using maximum likelihood methods. In addition, we will make use of the residual deviance to compare the goodness of fit of the models discussed earlier with that of the saturated model. We will also make use of AIC scores to compare models and choose the most suitable model among those discussed.
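For reference, the residual deviance and AIC of a fitted Poisson log-linear model can be computed as follows (a minimal Python sketch; the names `y` for observed counts, `mu` for fitted means and `k` for the number of free parameters are our notation).

```python
import math

def poisson_deviance(y, mu):
    """Residual deviance: twice the log-likelihood gap to the saturated model."""
    return 2 * sum(yi * math.log(yi / mi) - (yi - mi)
                   for yi, mi in zip(y, mu) if yi > 0) \
         + 2 * sum(mi for yi, mi in zip(y, mu) if yi == 0)

def poisson_aic(y, mu, k):
    """AIC = 2k - 2 log L, using the full Poisson log-likelihood."""
    loglik = sum(yi * math.log(mi) - mi - math.lgamma(yi + 1)
                 for yi, mi in zip(y, mu))
    return 2 * k - 2 * loglik

# A saturated fit (mu = y) has zero residual deviance by construction.
y = [55.0, 637.0, 322.0]
print(poisson_deviance(y, y))  # 0.0
```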
3 Data Sets and Algorithms
To illustrate the use of these methods we will extend the results presented in [14]. In that paper, a well known Portuguese book review corpus [10] was analyzed with different algorithms relying on word polarities (which we denote by Words), adjective polarities (which we denote by Adj), among others. Sentiment was assigned to the output through the use of three separate lexicons: OpLex [20], SentiLex [19] and LIWC [3]. Additional details about the algorithms and lexicons can be obtained from [14]. For 2233 texts from the corpus of [10], all the algorithms discussed in [14] were used with all dictionaries to assign a sentiment in the form of a categorical label taking one of three possible values: N, Ne or P. These outputs can be used to perform various pairwise comparisons between the algorithms. We will focus on the comparison between Words and Adj. The comparison is performed using the methods described in Section 2 with each lexicon separately. All analyses are performed using the R programming language [17] and the Inter-Rater Reliability (irr) package [12].
4 Results
4.1 Marginal Homogeneity Tests
Using the LIWC lexicon, the contingency table obtained from the cross comparison of the results of the Words and Adj algorithms is shown below (Table 3).
Table 3: Words vs. Adj under the LIWC lexicon.

                 Adj
            N     Ne      P
Words  N   55      4     97
       Ne  49    637   1009
       P   36     24    322
In Table 3 the presence of numerous non-vanishing off-diagonal elements shows that the algorithms do not agree perfectly. The corresponding tables for OpLex and SentiLex share this feature and will not be shown here.
In all three tables, the presence of relatively large diagonal elements suggests that the degree of concordance between the algorithms is greater than would be expected by chance alone. In order to check this we compute the unweighted Cohen's kappa for all three lexicons. The results are shown below (Table 4).
Table 4: Unweighted Cohen's kappa with 95% confidence intervals.

Lexicon     kappa (95% CI)              p-value
LIWC        0.1731 (0.1513, 0.1949)
OpLex       0.5440 (0.5148, 0.5731)
SentiLex    0.5820 (0.5537, 0.6103)
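As a consistency check, the LIWC value of kappa above can be reproduced directly from the counts in Table 3 (a short Python sketch; the label order assumed is N, Ne, P).

```python
import numpy as np

# Words (rows) vs Adj (columns) under the LIWC lexicon (Table 3)
table = np.array([[55, 4, 97], [49, 637, 1009], [36, 24, 322]])

n = table.sum()
p_obs = np.trace(table) / n
p_exp = (table.sum(axis=1) @ table.sum(axis=0)) / n**2
kappa = (p_obs - p_exp) / (1 - p_exp)
print(round(kappa, 4))  # 0.1731, matching the LIWC entry above
```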
In order to test for differences in marginal frequencies we implement the Stuart–Maxwell test for all three lexicons, as mentioned earlier. The middle column shows the value of the test statistic, the final column the p-values.
Table 5: Stuart–Maxwell test for all three lexicons.

Lexicon     Test statistic   p-value
LIWC        1001
OpLex        165
SentiLex     158
The p-values for the Stuart–Maxwell test are highly significant, which is evidence against equal marginal frequencies of the labels (N, Ne and P) as obtained from the two algorithms.
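The LIWC test statistic above can likewise be reproduced from Table 3 using the Stuart–Maxwell statistic of Section 2.1 (a self-contained Python sketch, numpy only).

```python
import numpy as np

# Words vs Adj under LIWC (Table 3)
t = np.array([[55, 4, 97], [49, 637, 1009], [36, 24, 322]], dtype=float)

d = (t.sum(axis=1) - t.sum(axis=0))[:2]   # marginal differences (one dropped)
s = -(t + t.T)[:2, :2]                    # null covariance matrix of d
np.fill_diagonal(s, (t.sum(axis=1) + t.sum(axis=0) - 2 * np.diag(t))[:2])
stat = float(d @ np.linalg.solve(s, d))
print(round(stat))  # about 1001, in agreement with the LIWC entry in Table 5
```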
4.2 Log Linear Models
To begin with, we analyze the independence model (Equation 1) for all three lexicons. Rather than specify the estimates and standard errors for all parameters, we show the deviance statistic and AIC score for all three lexicons. We also include the p-values obtained from a $\chi^2$ test with 4 degrees of freedom.
Table 6: Fit of the independence model (Equation 1).

Dictionary   AIC      Resid. Dev.   p-value
LIWC         439.73   373.36
OpLex        1433.2   1363.4
SentiLex     1594.4   1524.4
From the p-values in this table we see that the independence model is unsatisfactory. In order to check whether a model of the form of Equation 2 might be a better fit, we analyze the table of Pearson residuals (Table 7).
Table 7: Pearson residuals of the independence model (SentiLex).

        N        Ne        P
N     27.86   -15.32    -4.010
Ne   -14.76    31.69   -20.74
P     -8.00   -19.96    24.88
In Table 7 we find that the Pearson residuals have large positive values for the diagonal elements. This suggests that the independence model underestimates the degree of concordance between the algorithms. The tables of Pearson residuals for LIWC and OpLex share this feature and will not be shown here.
The underestimation of the degree of concordance between the algorithms suggests the use of a model such as Equation 2 to obtain a better fit to the data. The results from fitting Equation 2 are presented for all three dictionaries. Rather than report all parameter estimates and standard errors, the estimate of the parameter $\delta$ in Equation 2 is shown along with the AIC score, the residual deviance and the corresponding p-value, which is now obtained by comparison with the quantiles of $\chi^2_3$.

Table 8: Fit of the model of Equation 2.

Dictionary   Estimate of $\delta$   AIC      Resid. Dev.   Resid. p-value
LIWC         1.236                  169.54   101.17
OpLex        1.708                  230.6    158.53
SentiLex     1.828                  187.78   115.86
For all three dictionaries, the p-values associated with the parameter $\delta$ are small, which is evidence against the null hypothesis $\delta = 0$. However, for all three dictionaries the p-value associated with the residual deviance is significant: this model does not provide an adequate fit when compared with the saturated model. The next step is to implement the quasi independence model (Equation 3) in the hope of obtaining a better fit. The AIC and deviance statistics for the quasi independence model are shown below.

Table 9: Fit of the quasi independence model (Equation 3).

Dictionary   AIC     Resid. Dev.   Resid. p-value
LIWC         72.53   0.1600        0.6891
OpLex        75.83   0.9493
SentiLex     77.56   1.636         0.2008
Rather than report the estimates and standard errors for all parameters, we report the sample estimates of the parameters $\delta_N$, $\delta_P$ and $\delta_{Ne}$ (corresponding to labels N, P and Ne) along with their 95% confidence intervals obtained from the profile likelihood.
Table 10: Estimates of the diagonal parameters of the quasi independence model.

Lexicon    Parameter       Estimate   95% CI                p-value
SentiLex   $\delta_N$      2.9741     (2.586, 3.386)
SentiLex   $\delta_P$      3.1175     (2.766, 3.492)
SentiLex   $\delta_{Ne}$   0.06139    (-0.3227, 0.4268)     0.7475
OpLex      $\delta_N$      3.005      (2.610, 3.426)
OpLex      $\delta_P$      3.269      (2.905, 3.659)
OpLex      $\delta_{Ne}$   -0.4004    (-0.800, -0.02277)    0.0429
LIWC       $\delta_N$      2.4371     (2.012, 2.865)
LIWC       $\delta_P$      2.9125     (2.405, 3.449)
LIWC       $\delta_{Ne}$   -0.8019    (-1.224, -0.3776)     0.0002
Next we present, for each dictionary, the results for the log odds defined in Equation 4. We present not only the sample estimates but also the 95% Normal confidence intervals.
Table 11: Log odds of Equation 4 for each pair of labels.

Lexicon    Label pair   Estimate   95% CI
LIWC       N and P      5.350      (4.606, 6.093)
LIWC       N and Ne     1.635      (1.159, 2.111)
LIWC       Ne and P     2.111      (1.707, 2.514)
SentiLex   N and P      6.092      (5.456, 6.727)
SentiLex   N and Ne     3.036      (2.681, 3.390)
SentiLex   Ne and P     3.179      (2.915, 3.443)
OpLex      N and P      6.2744     (5.610, 6.939)
OpLex      N and Ne     2.605      (2.246, 2.964)
OpLex      Ne and P     2.869      (2.614, 3.124)
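Under the quasi independence model the log odds of Equation 4 are sums of diagonal parameters, $\tau_{ab} = \delta_a + \delta_b$, and the log odds ratio of Equation 5 is their difference. As a quick arithmetic check, the point estimates of Tables 11 and 12 can be recovered from those of Table 10 (Python; the numbers are copied from the tables).

```python
# delta estimates from Table 10, per lexicon: (delta_N, delta_P, delta_Ne)
deltas = {
    "LIWC":     (2.4371, 2.9125, -0.8019),
    "SentiLex": (2.9741, 3.1175,  0.06139),
    "OpLex":    (3.005,  3.269,  -0.4004),
}
# corresponding (N,P), (N,Ne), (Ne,P) log odds from Table 11
tau = {
    "LIWC":     (5.350, 1.635, 2.111),
    "SentiLex": (6.092, 3.036, 3.179),
    "OpLex":    (6.2744, 2.605, 2.869),
}
# log odds ratios from Table 12 (Equation 5: delta_N - delta_P)
lor = {"LIWC": -0.4754, "SentiLex": -0.1434, "OpLex": -0.2637}

for lex, (d_n, d_p, d_ne) in deltas.items():
    recovered = (d_n + d_p, d_n + d_ne, d_p + d_ne)
    print(lex, [round(x, 3) for x in recovered])
    assert all(abs(r - t) < 0.01 for r, t in zip(recovered, tau[lex]))
    assert abs((d_n - d_p) - lor[lex]) < 0.01
```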
Finally we show for each lexicon the log of the odds ratio as defined by Equation 5. Once again, we use 95% Normal confidence intervals.
Table 12: Log odds ratio of Equation 5.

Lexicon    Estimate   95% CI
LIWC       -0.4754    (-1.0683, 0.1175)
SentiLex   -0.1434    (-0.5646, 0.2777)
OpLex      -0.2637    (-0.6800, 0.1525)
5 Discussion
A variety of conclusions can be drawn from the results presented in the various tables. Tables 4 and 5 show that, independent of the dictionary used, the algorithms show substantial disagreement. The values of Cohen's kappa are similar for OpLex and SentiLex, with the results for the LIWC dictionary quite distinct from the other two. The results of the Stuart–Maxwell test are again similar for OpLex and SentiLex, but the results for LIWC are again quite distinct. This suggests that LIWC is somewhat different from the other two lexicons.
However, the analysis of marginal homogeneity serves only to compare marginal frequencies. In order to check the rate at which the same text is given the same label by different algorithms, the results from the log linear analysis are required.
The results of the analysis of the independence model, Equation 1, shown in Table 6, suggest that, independent of the dictionary used, the two algorithms agree at a rate higher than expected by chance. However, the low p-values obtained from the residual deviance are a motivation to consider more complex models. This conclusion holds independent of the dictionary used. Table 7 suggests that the diagonal elements are underestimated by the independence model.
The first model used to address the inadequacies of the independence model is that described in Equation 2. The results shown in Table 8 confirm the suggestion from Table 6 about the inadequacy of the independence model. More specifically, an estimated value of the parameter $\delta$ statistically significantly different from 0 suggests that, independent of the dictionary, the observed level of agreement between the two algorithms is different from that expected by chance. This is compatible with the positive diagonal residuals in Table 7.
Another measure of the improvement of this model over the independence model is its lower AIC value, independent of the dictionary used. Despite this improvement, the low p-value associated with the residual deviance strongly suggests that this model provides a poor approximation to the saturated model. This justifies the consideration of the quasi independence model, Equation 3.
Proceeding to the results of the quasi independence model, Equation 3, shown in Table 9, we find that the AIC scores are lower than for any other model considered earlier, and the p-value associated with the residual deviance is sufficiently large for the model to be considered a reasonable substitute for the saturated model. Since the quasi independence model is the only model considered which shows no evidence of lack of fit, we will consider only the quasi independence model from now on.
The values of the three parameters $\delta_i$ (Table 10) show different patterns for the three dictionaries. In the case of LIWC they are consistently lower than for OpLex and SentiLex; in common with Tables 4, 5 and 6, LIWC seems to be different. In the case of SentiLex it is possible to consider a simpler model in which the parameter $\delta_{Ne}$ is dropped. However, we will retain this parameter for all lexicons; doing so does not change the conclusions drawn from the results of Tables 11 and 12.
In Table 11 the estimated value of the log odds for the combination N and P is larger than for the other two combinations. Along with the confidence intervals, this can be interpreted to mean that for all three lexicons the separation between N and P is much more distinct than that between N and Ne or between Ne and P. This is consistent with the results obtained in [16] for the comparison of human annotators, which were obtained using different statistical tests. It suggests that the relative difficulty human annotators have in distinguishing N from Ne and Ne from P, compared with N from P, may be carried over to lexicon based polarity analysis. In Table 12, independent of lexicon, the confidence interval for the log odds ratio includes the value zero. This can be interpreted to mean that within a given lexicon, the tendency to agree on what is N and what is Ne is comparable with the tendency to agree on what is Ne and what is P. However, the tendency to agree on what is N and what is P is very much larger. In other words, Ne can be considered to lie midway between N and P. This is compatible with the results of [16]. As an additional check on the consistency between the results shown here and those in [16], Table 1 was analyzed using Equations 3, 4 and 5. The results obtained show the same trends as Tables 11 and 12.
As in the case of the earlier analyses, there are perceptible lexicon dependencies in Table 11. The results for OpLex and SentiLex are more similar to each other than to those for LIWC, paralleling the analysis of Tables 4 and 5. The 95% confidence intervals in Table 11 for the LIWC log odds for N and Ne and for Ne and P do not overlap with the corresponding confidence intervals for the other two lexicons. The non-overlap of the confidence intervals, together with the smaller values of the log odds for LIWC compared with the corresponding values for the other two lexicons, suggests that for the label pairs N and Ne, and Ne and P, the level of concordance between Words and Adj is statistically significantly lower in LIWC than in OpLex or SentiLex. In particular, given the lack of overlap in the confidence intervals, this effect is exceedingly unlikely to be due to sampling variation.
The dictionary dependence of lexicon based sentiment analysis has been commented on qualitatively before [13], in the context of a single algorithm used in conjunction with various lexicons. The analysis presented in this paper deals with the difference between the results of two algorithms studied across multiple lexicons and is more quantitative. Furthermore, unlike in [13], the methods presented here do not make use of known class labels. From our analysis, it appears that within a given lexicon the results of different algorithms show statistically significant differences, and furthermore that these differences themselves can be lexicon dependent. These results were obtained through the use of marginal homogeneity tests and log linear modelling, which illustrates the advantage of these methods.
To summarise, we have employed statistical techniques well known from the comparison of human evaluators to compare algorithms for lexicon based sentiment analysis. We find that, independent of the lexicon used, the concordance between different algorithms on the pair of labels N and P appears to be much better than the concordance on the pairs N and Ne or Ne and P. This is consistent with the results obtained from a comparison of human evaluators of Twitter feeds. We have also shown that the degree of concordance between different algorithms is lexicon dependent. This conclusion has been backed up by extensive statistical analysis, including numerous p-values and confidence intervals, and extends previously published work on lexicon dependence. One open question in this context is how much of this variability in the degree of concordance is due to variability in the lexicons, and how much is due to variability in the algorithms. This question requires further investigation.
Quite apart from the data addressed here, the statistical methodology proposed has many other potential applications. It is well suited to shed light on the differences and similarities in the output of different algorithms whenever the output takes the form of a set of discrete labels which are mutually exclusive and exhaustive. Furthermore, the statistical methodology discussed can also be used to analyze confusion matrices obtained from the comparison of true and predicted labels.
One limitation of the approach discussed in this paper is that tables such as Table 3 cannot have too many vanishing entries; if there are many vanishing entries, the maximum likelihood estimators may not exist. In such instances the methods discussed in [18] may be better suited to the kind of analysis discussed in this paper.
6 Acknowledgments
The results reported here were obtained as part of a project funded by the Pró-Reitoria de Pesquisa of the University of São Paulo, Call No. 668/2018, Project No. 18.1.1719.59.8.
References
 [1] A. Agresti. Modelling patterns of agreement and disagreement. Statistical Methods in Medical Research, 1:201–218, 1992.
 [2] A. Agresti. Categorical Data Analysis. Wiley, 3rd edition, 2013.
 [3] Pedro P Balage Filho, Thiago Alexandre Salgueiro Pardo, and Sandra Maria Aluisio. An evaluation of the brazilian portuguese liwc dictionary for sentiment analysis. In Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology, pages 215–219, 2013.
 [4] J.R. Bergan. Measuring observer agreement using the quasiindependence concept. Journal of Educational Measurement, 17:59–69, 1980.

 [5] Y. M. Bishop, S. E. Fienberg, and P. W. Holland. Discrete Multivariate Analysis: Theory and Practice. Springer Science & Business Media, 2007.
 [6] B. Bostanci and E. Bostanci. An evaluation of classification algorithms using McNemar's test. In J. C. Bansal et al., editors, Proceedings of Seventh International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA 2012), pages 15–26. Springer India, Hyderabad, India, 2013.
 [7] I. Brown and C. Mues. An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Systems with Applications, 39:3446–3453, 2012.
 [8] J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46, 1960.
 [9] J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
 [10] C. Freitas, E. Motta, R. L. Milidiú, et al. Que brilha… rá! Desafios na anotação de opinião em um corpus de resenhas de livros. In XI Encontro de Linguística de Corpus (ELC 2012), São Paulo, Brazil, 2012.
 [11] M. Friedman. A comparison of alternative tests of significance for the problem of m rankings. Annals of mathematical Statistics, 11:86–92, 1940.
 [12] Matthias Gamer, Jim Lemon, and Ian Fellows Puspendra Singh. irr: Various Coefficients of Interrater Reliability and Agreement, 2019. R package version 0.84.1.
 [13] Olga Kolchyna, Thársis T. P. Souza, Philip Treleaven, and Tomaso Aste. Twitter sentiment analysis: Lexicon method, machine learning method and their combination. In Gautam Mitra and Xiang Yu, editors, Handbook of Sentiment Analysis in Finance, chapter 5. 2016.
 [14] M. T. Machado, T.A.S. Pardo, and E.E.S. Ruiz. Creating a portuguese context sensitive lexicon for sentiment analysis. In A. Villavicencio, V. Moreira, A. Abad, and et. al., editors, International Conference on Computational Processing of the Portuguese Language, pages 335–344. Springer, Canela, Brazil, 2018.
 [15] A.E. Maxwell. Comparing the classification of subjects by two independent judges. The British Journal of Psychiatry, 116:651–655, 1970.
 [16] I. Mozetič, M. Grčar, and J. Smailović. Multilingual Twitter sentiment classification: The role of human annotators. PLOS ONE, 11:1–26, 2016.
 [17] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2019.
 [18] F. Rapallo. Algebraic exact inference for rater agreement models. Statistical Methods & Applications, 14:45–66, 2005.
 [19] Mário J Silva, Paula Carvalho, Carlos Costa, and Luís Sarmento. Automatic expansion of a social judgment lexicon for sentiment analysis. 2010.
 [20] Marlo Souza, Renata Viera, Débora Busetti, Rove Chishman, and Isa Mara Alves. Construction of a portuguese opinion lexicon from multiple resources. In Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology, 2011.
 [21] R.A. Stine. Sentiment analysis. Annual Review of Statistics and Its Application, 6:287–308, 2019.
 [22] A. Stuart. A test for homogeneity of the marginal distributions in a twoway classification. Biometrika, 42:412–416, 1955.
 [23] E.S. Tellez, M. Graff, R.R. Suarez, and et.al. A simple approach to multilingual polarity classification in twitter. Pattern Recognition Letters, 94:68–74, 2017.