Inferring politically-charged information from text data is a popular research topic in Natural Language Processing (NLP), with a wide range of applications to predict individuals’ political views ranlp-wesley, bertha2019, bracis-pavan and behaviour potthast2018, hp-semeval, li2021. Existing work in the field often draws a distinction between two main tasks: the issue of political bias, and political ideology in general. Following hp-semeval and others, political bias is presently defined as any extreme one-sided (or hyperpartisan) discourse that, in a political context, clearly leans towards a liberal or conservative agenda, and which may be deliberately produced as a means to convince or gather support. This contrasts with several other more nuanced forms of political expression, hereby called ideology, in which arguments tend to be more balanced and which do not explicitly promote a particular political agenda. These include expressions of political orientation (e.g., being left- or right-leaning) and political stance in general (e.g., being for or against a political party, a principle, or an individual), among others.
From a computational perspective, both bias and ideology inference from text data may be further divided into two problem definitions, hereby called text- and author-level political inference. Text-level inference is more closely related to sentiment analysis deep-sentiment, sentrule, sentsoc and stance classification st-post-sem, st-ensemble, bracis-pavan, and it is intended to determine the meaning of an input text (e.g., whether the text expresses a liberal or conservative view.) The less-studied author-level inference, by contrast, is an instance of computational author profiling ca-demo, ca-ideology, ca-image, ca-bots, ca-fineg, ca-bert, ca-celeb, that is, the task of inferring an individual’s demographics (e.g., their political leaning) based on samples of text that they have authored, and which may or may not convey politically-charged information explicitly.
In recent years, both text- and author-level inference have been implemented with the aid of representations from transformers such as BERT bert and others. Models of this kind - which may be seen as large, pre-trained language models - are able to capture deep contextual relations between words, and have been shown to significantly improve downstream task results lee2019, baly2020. Despite these benefits, however, in this paper we hypothesise that the use of pre-trained language models for political inference from text may be improved even further with the aid of additional text representations. More specifically, we ask how we may combine transformer-based text representations with both syntactic information and psycholinguistics-motivated features, and which of these so-called heterogeneous knowledge representations may have an impact on different political inference tasks.
To shed light on these issues, the present work takes the form of a series of supervised machine learning experiments addressing a number of text- and author-level tasks alike. As in much of the existing work in the field, some of these experiments will make use of English text data, but we will also address a number of tasks based on Portuguese text data, and will introduce a novel dataset for this particular language. Our main contributions are summarised as follows.
A neural architecture that combines transformer-based language models, syntactic dependencies and psycholinguistics-motivated features for political bias and ideology inference from text.
Text- and author-level formulations of the general political bias and ideology inference tasks.
Experiments involving both mainstream English and less-studied Portuguese text data.
A novel, large dataset of Twitter text data labelled with political stance information.
The remainder of this paper is structured as follows. Section 2 reviews existing work in political inference from text, and Section 3 describes the computational tasks to be addressed in this work. Section 4 presents our main approach to using heterogeneous knowledge representations. Section 5 summarises our experimental results, which are further discussed in Section 6. Finally, Section 7 presents conclusions and future work.
2 Related work
Table 1 presents an overview of recent NLP work on political bias and ideology inference from text data. All selected studies happen to be devoted to the English language. Further details are discussed below.
Related work in political bias (top) and political ideology (bottom) inference from text. For each study, we report task category as either ‘bias’ (e.g., hyperpartisan news detection) or ‘ideology’ (political ideology, stance, and alignment), label granularity (t = text-level, a = author-level), main learning features (text or network relations), learning methods (e.g., RNN = recurrent networks, U = stylometry, CNN = convolutional networks, LogReg = logistic regression, SVM = Support Vector Machines, RGC = graph networks, NPOV = neutral point of view), and text genre (D = discourse/debate, N = news, T = Twitter, W = Wikipedia.)
2.1 Political bias
Work on extremely one-sided political bias has largely focused on the issue of hyperpartisan news detection. The work in potthast2018, which is among the first prominent NLP studies in this field, analyses a corpus of extremely one-sided news and presents a stylometry-based approach to distinguish biased and neutral news, fake news and satire. Among other findings, results suggest that stylometry is of limited use in fake news detection, and that left- and right-wing news share a significant amount of stylistic similarities.
Some of the results from potthast2018 were taken as the basis for the influential SemEval-2019 shared task on hyperpartisan news detection in hp-semeval. The task comprised two formulations of the problem, each of them based on a different dataset. The by_article dataset contained 1,273 manually labelled texts, and the by_publisher dataset contained 754,000 articles labelled via distant supervision (based on the publishing source of each article.) Results reported in hp-semeval suggest that the by_publisher task was generally more challenging than the by_article task.
Among the participant systems at SemEval-2019, the work in bertha2019 reported the overall highest accuracy in the by_article task. The system makes use of a pre-trained ELMo elmo language model to encode news texts as input features to a CNN classifier. This approach will be taken as a baseline to our own experiments described in the next sections.
The work in vernon2019 makes use of a logistic regression classifier to detect hyperpartisan news with the aid of hand-crafted features (e.g., bias scores obtained from a bias lexicon, article- and sentence-level polarity, subjectivity and modality scores etc.) and Universal Sentence Encoder embeddings, among other alternatives. The system obtained the overall highest F1 score in the by_article task at SemEval-2019.
The studies in drissi2019 and lee2019 are among the first to attempt using pre-trained BERT language models for hyperpartisan news detection. In both cases, however, results remain below those obtained by several more traditional approaches in both by_article and by_publisher tasks.
The work in tintin2019 reported the overall highest accuracy in the by_publisher task using a logistic regression classifier with a bag-of-words text representation, outperforming a wide range of more complex models based on deep neural networks and others.
The work in patankar2019 takes a more application-oriented approach to hyperpartisan bias detection by presenting a real-time system that flags political bias on news articles, and then recommends similar articles from alternative sources. To this end, the system makes use of a bias lexicon and unsupervised methods for clustering articles by topic similarity.
Finally, the work in li2021, although not an original participant system at SemEval-2019, uses the SemEval by_article dataset to address the hyperpartisan news detection task as well. The work makes use of a heterogeneous knowledge representation consisting of a pre-trained BERT model enriched with social and political information and linguistically-motivated features alike, and obtains results comparable to the top-performing systems at SemEval-2019 with the aid of a multihead Bi-LSTM architecture.
2.2 Political ideology
In addition to extremely one-sided bias, natural language text may convey many other more nuanced forms of political view, which are presently labelled as ‘ideology’ for conciseness. The work in iyyer2014 is among the first of this kind, addressing the issue of political ideology (defined as left, right, or neutral leaning) detection in text. In what nowadays may be seen as a standard approach to the task, the work uses a recurrent neural network model and word embeddings built from left- and right-leaning data to detect ideology in US congressional debates. Results suggest that this approach outperforms a range of logistic regression baseline systems using bag-of-words and word embeddings representations.
Also in the US congressional debates domain, the work in bhatia2018 introduces a sentiment-oriented model to identify political ideology in text, the underlying assumption being that sentiment words may be revealing of an individual’s political leaning (e.g., conservatives may arguably express more positive sentiment towards the free-market topic etc.) To investigate this issue, potentially relevant topics were selected from a debate corpus, including issues related to health care, the US military program and others, and topic-specific sentiments were computed as a probability distribution over ordinal polarity classes ranging from strongly positive to strongly negative. Results suggest that a logistic regression classifier based on sentiment features outperforms the use of word embeddings and others.
The work in kulkarni2018 addresses the issue of political ideology detection in news text using a so-called multi-view approach. In this approach, document-level features (e.g., the news headline and contents) and a network of links exhibited in the text are regarded as complementary properties in the sense that authors may refer to links that reinforce their political views. The study is carried out using a corpus of 120k politically-related news articles in English, and results suggest that a multi-view approach based on convolutional networks outperforms a range of baseline alternatives including the use of logistic regression classifiers and hierarchical attention models, among others.
The work in baly2020 addresses the task of detecting political ideology in news texts from previously unseen media sources, which prevents models from learning the text source rather than the ideology proper. The work makes use of complementary knowledge obtained from Twitter and Wikipedia sources, and uses adversarial adaptation and triplet loss pre-training (TLP) with both LSTM and BERT models. Results suggest that combining TLP with additional Twitter information outperforms a range of alternatives, including the use of pre-trained transformer models alone, and those with access to additional Wikipedia information.
The work in stefavov2020 addresses the task of characterising the general political leaning of online media and influencers by using unsupervised learning to determine the stance of Twitter users towards a polarising topic based on their retweet behaviour, and then performing label propagation to take the resulting user stance information as training data for media political leaning detection. Results suggest that a combination of User-to-Hashtag and User-to-Mention graph embeddings with BERT models built from both article titles and contents outperforms the use of these individual strategies in isolation.
Finally, the work in feng2021 combines socially- and politically-related features to address the issue of entity stance prediction, which may be regarded as an instance of author-level inference as discussed in Section 1. Examples of entities under consideration include US presidents, parties, and states. The study builds a heterogeneous information network from Wikipedia articles, in which nodes are social entities and edges are relations between them (e.g., party affiliation, home state etc.) The network is combined with the Wikipedia text summary of each entity using a RoBERTa transformer roberta in a gated relational graph convolutional network for representation learning. Results suggest that the approach outperforms a wide range of baseline alternatives, including bag-of-words, average word embeddings, transformers and graph-based models alike.
3 Task definitions
The present work addresses two text-level political inference tasks (T1,T2) and three author-level tasks (T3,T4,T5) using data from three sources: the SemEval-2019 Hyperpartisan news corpus hp-semeval, the BRmoral essay corpus brmoral and a novel dataset, hereby called GovBR corpus, to be introduced in Section 4.1. These tasks are summarised in Table 2 and discussed individually in the next sections.
|Task||Level||Problem||Classes||Corpus||Dataset||Language|
|T1||text||hyperpartisan news||hyperpartisan, neutral||SemEval||by_articles||English|
|T2||text||political orientation||left, (centre), right||BRmoral||by_opinion||Portuguese|
|T3||author||hyperpartisan news||hyperpartisan, neutral||SemEval||by_publisher||English|
|T4||author||political orientation||left, (centre), right||BRmoral||by_author||Portuguese|
|T5||author||political stance||for, against||GovBR||-||Portuguese|
3.0.1 Text-level tasks (T1,T2)
Text-level tasks concern the inference of politically-charged information directly associated with the meanings of the input texts, in which case class labels are annotated at the individual text level. As discussed in Section 1, this is analogous to sentiment analysis, stance classification and related tasks.
Task T1 addresses the issue of text-level hyperpartisan news detection based on the SemEval by_articles dataset hp-semeval. This consists of a binary classification task intended to distinguish ‘hyperpartisan’ from ‘neutral’ information, as in the following examples.
Hyperpartisan: ‘Trump can’t get Congress to repeal Obamacare, he’s making changes that will penalize low-income people’
Neutral: ‘Colin Kaepernick told a CBS reporter that he would represent the national anthem if he was hired by an NFL team’
Task T2 addresses the issue of text-level political orientation detection based on the BRmoral corpus brmoral by_opinion dataset following both binary (‘left’ / ‘right’) and ternary (‘left’ / ‘centre’ / ‘right’) class definitions (to be further discussed in Section 4.1.2.) Examples taken from short essays related to the issue of same-sex marriage are illustrated as follows (translated from the original texts in Portuguese.)
Left-leaning: ‘Agreed, same-sex people must have their marriages valid, as they pay taxes and have the same civil obligations, so they should also have the same rights as the other.’
Centre: ‘It’s not up to the State to forbid this. However if a person asks me if I’m in favour, I’ll say no, although I respect (their choice).’
Right-leaning: ‘From the perspective of the civil institution, I do not see problems in marriage, however, the Christian doctrine makes it clear that the family consists of a man and a woman.’
3.0.2 Author-level tasks (T3,T4,T5)
Author-level tasks concern the inference of politically-charged information related to the individual who wrote the input texts rather than to the literal meaning of the text. In this case, class labels are annotated at the author (or publisher) level. As discussed in Section 1, this is analogous to NLP author profiling.
Task T3 is the author-level version of previous task T1, addressing the issue of author-level hyperpartisan news detection based on the SemEval by_publisher dataset hp-semeval. Once again, this consists of a binary classification task intended to distinguish ‘hyperpartisan’ from ‘neutral’ information, but using weakly labelled data determined by the source of information rather than individually annotated texts. Thus, for instance, all texts produced by a publisher deemed to be a hyperpartisan source are labelled as ‘hyperpartisan’ news regardless of their actual contents.
Task T4 is the author-level version of previous task T2, addressing the issue of author-level political orientation detection based on the BRmoral brmoral by_author dataset following both binary (‘left’ / ‘right’) and ternary (‘left’ / ‘centre’ / ‘right’) class definitions. To this end, all texts are weakly labelled with the self-reported political orientation of their authors regardless of the actual text contents. Thus, for instance, opinions that are annotated (at text-level) as ‘left-leaning’ for the purpose of previous Task T2 may nevertheless be assigned an (author-level) ‘right-leaning’ label if their authors happen to identify themselves as right-leaning individuals.
Finally, task T5 addresses the issue of author-level stance classification based on the GovBR corpus to be described in the next section. This consists of a binary classification task intended to distinguish Twitter users who are ‘for’ or ‘against’ the current president of Brazil. In this case, tweets are weakly labelled with for/against stance information derived from popular hashtags. Thus, for instance, all tweets accompanied by a #RespectThePresident hashtag are assumed to be favourable to the president (the actual hashtags are not included in the data.) Examples of both classes are as follows.
For: ‘THIS IS MY PRESIDENT’
Against: ‘This misgovernment is formed by indecent, immoral, ignorant, stupid and perverse people.’
4 Materials and method
The goal of the present study is to investigate the use of heterogeneous knowledge representations - based on transformer-based language models, syntactic dependencies, and psycholinguistics-motivated features - for political bias and ideology inference from text as discussed in the previous section. In what follows we describe the data for each task, the classifier models to be investigated, and the evaluation procedure.
The following sections discuss the three corpora to be taken as train and test data for our experiments, and present descriptive statistics.
4.1.1 SemEval-2019 hyperpartisan news corpus (tasks T1 and T3)
For the hyperpartisan news detection tasks T1 and T3, we will make use of a subset of the SemEval-2019 Hyperpartisan news corpus hp-semeval in the English language. The SemEval-2019 corpus consists of political news organised in two datasets called by_articles and by_publisher, both of which are annotated with ‘hyperpartisan’ or ‘neutral’ labels. Hyperpartisan news conveys extreme one-sided information of either liberal or conservative nature alike. The full corpus data - whose train subset is presently used in our experiments - originally conveys 645 news articles in the by_articles set, and 750,000 articles in the by_publisher set.
As discussed in the previous section, by_articles texts are labelled individually according to their contents, and will be taken as an input to text-level hyperpartisan news detection (task T1). by_publisher texts, by contrast, are weakly labelled according to their media source, and will be taken as the input for author-level hyperpartisan news detection (task T3). Despite using well-known shared task data, however, we note that we do not presently seek to outperform the existing SemEval benchmark, but rather to compare a number of novel computational strategies among themselves.
For the purpose of the present work, the original train portions of both datasets were randomly split into development (80%) and test (20%) sets. Table 3 presents the resulting class distribution.
|Development||332 (64.3%)||184 (35.7%)||516||1235 (49.4%)||1265 (50.6%)||2500|
|Test||75 (58.1%)||54 (41.9%)||129||492 (49.2%)||508 (50.8%)||1000|
4.1.2 BRmoral essay corpus (tasks T2 and T4)
For the political orientation detection tasks T2 and T4, we will make use of the BRmoral essay corpus brmoral in the Portuguese language. This consists of short essays about eight topics of liberal and conservative nature alike (same-sex marriage, gun possession, abortion, death penalty, drug legalisation, lowering of criminal age, racial quotas, and tax exemptions for churches) labelled with both stance scores (from ‘totally against’ to ‘totally for’ each topic) and authors’ demographics, including their self-reported political orientation from ‘extreme left’ to ‘extreme right’. The full corpus data conveys 4080 essays written by 510 crowd-sourced volunteers brmoral.
The dual labelling scheme in the BRmoral corpus (i.e., either based on individual stances or author’s own political orientation) gives rise to two dataset definitions, hereby called by_opinion and by_author for analogy with the SemEval by_articles and by_publisher datasets discussed in the previous section. Both by_opinion and by_author are labelled with ‘left’, ‘right’ and, depending on the task under consideration (see below), also with ‘centre’ information.
by_opinion takes as labels the liberal and conservative stance information available from the corpus to determine, albeit indirectly, a text’s likely political leaning. More specifically, texts expressing an opinion against so-called liberal topics (same-sex marriage, abortion, drug legalisation, and racial quotas), or those expressing opinions in favour of so-called conservative topics (death penalty, gun possession, lowering of criminal age, and tax exemptions for churches), are labelled as ‘right’, and so forth. This dataset will be taken as an input to text-level political orientation detection (task T2).
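The by_opinion labelling rule above can be sketched as a simple mapping. This is a purely illustrative sketch: the topic lists follow the description above, but the binary ‘for’/‘against’ stance encoding is a simplification of the corpus’s finer-grained stance scores, and the helper name is our own.

```python
# Illustrative sketch of the by_opinion labelling rule (not the actual corpus code).
LIBERAL_TOPICS = {"same-sex marriage", "abortion", "drug legalisation", "racial quotas"}
CONSERVATIVE_TOPICS = {"death penalty", "gun possession",
                       "lowering of criminal age", "tax exemptions for churches"}

def opinion_label(topic: str, stance: str) -> str:
    """Map a (topic, stance) pair to a text-level 'left'/'right' label."""
    if topic in LIBERAL_TOPICS:
        # opposing a liberal topic suggests a right-leaning text, and vice versa
        return "left" if stance == "for" else "right"
    if topic in CONSERVATIVE_TOPICS:
        return "right" if stance == "for" else "left"
    raise ValueError(f"unknown topic: {topic}")
```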
The by_author dataset, by contrast, takes as labels the actual authors’ political orientation information available from the corpus, in what may be seen as an instance of weak labelling not unlike the SemEval by_publisher labels discussed in the previous section. BRmoral by_author texts will be taken as an input to author-level political orientation detection (task T4).
Both datasets were randomly split into development (80%) and test (20%) sets. Table 4 presents the resulting class distribution.
|Development||1201 (36.8%)||685 (21.0%)||1378 (42.2%)||3264||1210 (37.1%)||1158 (35.5%)||896 (27.4%)||3264|
|Test||299 (36.6%)||176 (21.6%)||341 (41.8%)||816||310 (38.0%)||282 (34.5%)||224 (27.5%)||816|
4.1.3 GovBR political stance corpus (task T5)
Finally, for the political stance task T5, we created a novel language resource based on Twitter data in the Portuguese language, hereby called the GovBR corpus. GovBR comprises a collection of tweets written by users who expressed a clear stance towards the current president of Brazil. The corpus was built by selecting two disjoint sets of users - supporters and opponents of the said president - according to the use of a number of popular politically-charged hashtags (e.g., #RespectThePresident or #NotHim). The full corpus data - from which non-political tweets were filtered out as discussed below - conveys 13.5 million tweets written by 5452 unique users.
For each selected user, all their publicly available tweets (i.e., disregarding their retweets) were downloaded. Users who simultaneously promoted supportive and opposing hashtags were discarded, and so were all hashtags and all messages shorter than five words. Finally, tweets that did not convey a minimal level of political content were also discarded. To this end, we computed a TF-IDF representation of the political section of the Folha de SP newspaper (https://www.kaggle.com/marlesson/news-of-the-site-folhauol), and kept only the tweets conveying a minimum degree of similarity to the political news texts. After the removal of non-political tweets, we obtained approximately 25 tweets per user.
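The similarity-based filtering step can be sketched as follows, assuming the reference political news texts are available as plain strings. The function name and threshold are illustrative assumptions, not the actual values used to build the corpus.

```python
# Minimal sketch of the TF-IDF similarity filter for political content.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def filter_political(tweets, reference_news, threshold=0.05):
    """Keep only tweets sufficiently similar to a reference political corpus."""
    vectoriser = TfidfVectorizer()
    news_vecs = vectoriser.fit_transform(reference_news)  # fit on political news
    tweet_vecs = vectoriser.transform(tweets)             # project tweets into the same space
    # similarity of each tweet to its closest political news text
    sims = cosine_similarity(tweet_vecs, news_vecs).max(axis=1)
    return [t for t, s in zip(tweets, sims) if s >= threshold]
```

Tweets with no vocabulary overlap with the reference corpus receive zero similarity and are discarded.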
GovBR politically-related tweets will be taken as the input for author-level political stance detection (task T5). In our current work, we use a balanced subset of this data consisting of 4000 randomly selected tweets. These were randomly split into development (80%) and test (20%) sets as illustrated in Table 5.
|Development||1600 (49.9%)||1608 (50.1%)||3208|
|Test||405 (50.5%)||397 (49.5%)||802|
As a means to investigate the issue of political inference from text (tasks T1-T5 described in the previous sections), in what follows we propose combining heterogeneous knowledge representations into a convolutional neural network architecture. An overview of this architecture is presented in Section 4.2.1, and its individual components are described in Section 4.2.2.
We envisaged a convolutional neural network for political inference from text that combines three kinds of text representation: (i) pre-trained language models provided by Bidirectional Encoder Representations from Transformers (BERT) bert, hereby called bert; (ii) syntactic bigram counts computed from dependency graphs sngram, hereby called sngram, and (iii) psycholinguistics-motivated features obtained from Linguistic Inquiry and Word Count (LIWC) liwc and from the Medical Research Council (MRC) database mrc, hereby called psych. This architecture is illustrated in Figure 1 and further discussed below.
Given a set of input documents (bottom left of the figure), we use a text classifier model that takes as an input both standard text features represented as contextual embeddings bert (top left), and engineered features (bottom centre) that combine the alternative text representations based on syntactic dependencies sngram and psycholinguistics-motivated features psych. In this approach, bert
embeddings are taken as an input to five convolutional layers (Conv1-Conv5) followed by batch normalisation, a max pooling layer, and a 0.5 dropout layer, whereas engineered features are processed in parallel by a kernel of size 1, also followed by a normalisation layer and max pooling. Finally, the output of the operations Conv1-Conv5 is concatenated with the engineered feature vector followed by a 0.5 dropout, and a softmax activation function produces the class predictions.
The full model architecture, hereby called bert+sngram+psych, will be evaluated against a number of baseline systems, and also against some of its individual components as discussed in Section 4.3.
4.2.2 Text representations
In the bert+sngram+psych model, input texts are to be represented as contextual embeddings (our bert component), syntactic bigram counts sngram, and psycholinguistics-motivated features (psych). These components are described as follows.
Bidirectional Encoder Representations from Transformers (BERT) bert are now mainstream in NLP and related fields, and are the basis for our bert architectural component as well. This consists of pre-trained BERT language models fine-tuned for each of our classification tasks T1-T5. For the English language models (tasks T1 and T3), we use base-uncased BERT, and for the Portuguese language (tasks T2, T4, and T5) we use multilingual-base-uncased BERT.
The sngram component explores the use of text structural information to enrich our classifier models by computing syntactic bigram counts from dependency graphs. To this end, we first compute a syntactic dependency graph from the input text using SpaCy (https://spacy.io/usage/linguistic-features#dependency-parse), and then generate a TF-IDF bigram model as suggested in sngram. For the English language models (tasks T1 and T3), we use the en_core_web_sm pipeline, and for the Portuguese language (tasks T2, T4, and T5) we use the pt_core_news_sm pipeline. Finally, we perform univariate feature selection using F1 as a score function in order to keep only the k best features. Optimal values of k for each task were obtained through grid search on development data.
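The bigram extraction step can be sketched as follows. For self-containment, the dependency parse is represented here as (token, head-index) pairs rather than produced by spaCy, and the head_child string format is our own assumption about how sngram bigrams are encoded.

```python
# Sketch of syntactic bigram extraction from a dependency parse.
from collections import Counter

def syntactic_bigrams(parse):
    """parse: list of (token, head_index) pairs; the root points to itself.
    Returns counts of head_child syntactic bigrams."""
    return Counter(
        f"{parse[head][0]}_{token}"
        for i, (token, head) in enumerate(parse)
        if head != i  # skip the root's self-loop
    )

# 'the president spoke': 'the' -> 'president', 'president' -> 'spoke' (root)
parse = [("the", 1), ("president", 2), ("spoke", 2)]
```

The resulting bigram counts would then be fed to a TF-IDF weighting scheme and univariate feature selection as described above.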
The psych component makes use of psycholinguistics-motivated features computed with the aid of both LIWC liwc and MRC mrc lexicons. Examples of LIWC word categories include those related to attention focus (e.g., pronouns and verb tense), affective or emotional processes (positive and negative emotions, anxiety, fear etc.), social relationships (e.g., family, friends etc.) and others. Similarly, MRC categories cover lexical features such as concreteness, age of acquisition, and others. For the English language models (tasks T1 and T3), we use the 93-feature LIWC-2015 lexicon liwc2015 and the 9-feature MRC database mrc. For the Portuguese language (tasks T2, T4, and T5) we use 64-feature LIWC-BR liwc-br and the 6 MRC-like features from PsychoProps.
Both LIWC and MRC text representations consist of word category counts normalised by document size, in which words that belong to more than one category update all related counts (e.g., ‘she’ is a pronoun and also a feminine word etc.) Both representations are concatenated as a single vector of 102 features for English, or 70 features for Portuguese. As in the case of the syntactic features discussed above, we once again perform univariate feature selection with F1 as a score function to obtain the k best features for each task.
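The category counting scheme above can be sketched as follows; the tiny lexicon in the example is purely illustrative and does not reproduce the actual LIWC or MRC categories.

```python
# Sketch of LIWC/MRC-style normalised category counts.
def category_counts(tokens, lexicon):
    """Normalised category counts; a word in several categories updates all of them."""
    counts = {category: 0.0
              for categories in lexicon.values()
              for category in categories}
    for token in tokens:
        for category in lexicon.get(token, []):
            counts[category] += 1
    # normalise by document size
    return {c: n / len(tokens) for c, n in counts.items()}
```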
We conducted a series of experiments focused on tasks T1-T5 introduced in Section 3 to assess the use of the bert+sngram+psych model and some of its subcomponents, namely, bert, sngram and psych alone, and also the two BERT-based pairs bert+sngram and bert+psych. More specifically, our goal is to investigate how each of these alternatives compares to two baseline systems, namely, the Bertha von Suttner model described in bertha2019, which was the overall best-performing system at the SemEval-2019 Hyperpartisan news detection shared task hp-semeval, and the use of BERT alone as a classifier, hereby called bert.baseline.
All models were trained for 30 epochs using a development dataset partition, and then evaluated using previously unseen test data as described in Section 4.1. For BERT-based models, additional pre-processing was performed to remove all non-alphabetic characters, links and HTML tags. All input documents were limited to their first 300 tokens, and shorter documents were completed with the [PAD] token. This representation was taken as the input to a BERT model of size 768, resulting in text embeddings of size 300 x 768.
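The input length normalisation above can be sketched as follows, assuming a simple whitespace split in place of the actual BERT tokeniser.

```python
# Sketch of truncation/padding to a fixed input length of 300 tokens.
def pad_or_truncate(text, max_len=300, pad_token="[PAD]"):
    """Truncate long documents and pad short ones to exactly max_len tokens."""
    tokens = text.split()[:max_len]
    return tokens + [pad_token] * (max_len - len(tokens))
```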
sngram and psych features are concatenated as a single vector, and a z-score function is applied to obtain standardised value ranges. Table 6 summarises the actual number of features considered by each model and for each corpus.
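The concatenation and standardisation step can be sketched with numpy; in practice the column-wise statistics would be computed on development data only, and the epsilon term is our own guard against zero-variance columns.

```python
# Sketch of feature concatenation followed by column-wise z-score standardisation.
import numpy as np

def standardise(sngram_feats, psych_feats, eps=1e-8):
    X = np.hstack([sngram_feats, psych_feats])           # concatenate per document
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)  # column-wise z-score
```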
Evaluation proper was carried out (i) by measuring Accuracy (Acc), macro F1 (F1), Precision (P) and Recall (R) scores, (ii) by assessing statistically significant differences between models, and (iii) by providing model prediction explanations. To this end, statistical significance is assessed by using the McNemar test mcnemar in the case of binary classifiers, and by using the Stuart-Maxwell test stuart1955test, maxwell1970comparing for ternary classifiers. For each task, two kinds of significance tests are conducted. First, we identify those models that are statistically superior to the reference model in bertha2019. Second, we identify the groups of models that are statistically distinguishable from the others. Finally, we performed eli5 prediction explanation (https://eli5.readthedocs.io/en/latest/) to compute the word features most strongly correlated with each class and task, as discussed in Section 6.
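The McNemar test for paired binary classifiers can be sketched directly from the 2x2 disagreement counts; this continuity-corrected chi-square variant is one common formulation and is shown here only as an assumption about the testing setup, not as the exact procedure used.

```python
# Continuity-corrected McNemar test from the two discordant cell counts.
from scipy.stats import chi2

def mcnemar_p(b, c):
    """b, c: counts of test items on which exactly one of the two classifiers
    is correct. Returns the p-value of the McNemar chi-square test."""
    if b + c == 0:
        return 1.0  # models never disagree
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return chi2.sf(stat, df=1)
```

A small p-value (e.g., below 0.05) indicates that the two classifiers' error patterns differ significantly.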
This section presents results of the full model bert+sngram+psych and its subcomponents (bert, sngram, psych, bert+sngram, and bert+psych) compared against those obtained by the work in bertha2019 and by BERT alone as a classifier (bert.baseline).
Results are reported as Accuracy (Acc), macro F1 (F1), Precision (P) and Recall (R) scores divided into three groups: (1) on the top row, the baselines from bertha2019 and bert.baseline; (2) followed by the model components bert, sngram, psych, bert+sngram, and bert+psych; and (3) on the bottom row, the full model bert+sngram+psych. In all scenarios, best accuracy scores are highlighted, and also marked with * when found to be statistically superior to the baseline system in bertha2019. Finally, in addition to the main results reported for each task, models are also grouped into statistically significant clusters according to their accuracy scores.
5.1 Task T1: Text-level hyperpartisan news detection
Results for task T1 - text-level hyperpartisan news detection - using the SemEval corpus by_articles dataset are summarised in Table 7 and further discussed below.
Jiang et al.: Acc = 0.72, F1 = 0.65, P = 0.69, R = 0.61
Based on these results, we notice that the best-performing model is bert+sngram. The difference between this and the baseline in bertha2019 is statistically significant (α = 0.05, p < 0.05). The full model bert+sngram+psych, by contrast, ranks considerably lower. To further illustrate this outcome, the models were clustered into homogeneous groups (A, B, C) by statistical significance according to their accuracy scores, as illustrated in Table 8.
Jiang et al.: Acc = 0.72, group B
We notice that, in addition to the best-performing bert+sngram model, both bert and bert+psych obtain, to a lesser extent, statistically similar results within group A. The full model and the reference baseline in bertha2019, by contrast, are both members of group B.
5.2 Task T2: Text-level political orientation detection
Results for task T2 - text-level political orientation detection - using the BRmoral corpus by_opinion dataset for both binary (left, right) and ternary (left, centre, right) classification are summarised in Table 9 and further discussed below.
Jiang et al. - binary: Acc = 0.76, F1 = 0.76, P = 0.77, R = 0.76; ternary: Acc = 0.62, F1 = 0.61, P = 0.61, R = 0.62
Once again, the best-performing model for both binary and ternary classification is bert+sngram. However, differences between this and others, including the baseline in bertha2019, were not found to be statistically significant. To further illustrate this outcome, homogeneous groups related to the binary classification task are shown in Table 10, and groups related to ternary classification are shown in Table 11.
Jiang et al.: Acc = 0.76, group A
Jiang et al.: Acc = 0.62, group A
In both binary and ternary classification tasks, although bert+sngram still obtains the highest accuracy scores, its simpler bert sub-component (i.e., the combination of a fine-tuned BERT model with a CNN classifier, as discussed in Section 4) is statistically similar. The full model, bert+sngram+psych, falls into group B in both classification tasks, ranking below the reference system in bertha2019.
5.3 Task T3: Author-level hyperpartisan news detection
Results for task T3 - author-level hyperpartisan news detection - using the SemEval corpus by_publisher dataset are summarised in Table 12 and further discussed below.
Jiang et al.: Acc = 0.56, F1 = 0.62, P = 0.55, R = 0.71
Based on these results, we notice that the best-performing alternative is bert+psych. The difference between this and the baseline in bertha2019 is statistically significant (χ² = 6.715, α = 0.05, p < 0.01). Models were clustered into homogeneous groups by statistical significance according to their accuracy scores, as illustrated in Table 13.
Jiang et al.: Acc = 0.56, group B
According to Table 13, group A includes most of the alternatives based on BERT language models, and once again the difference between the full model bert+sngram+psych and the reference baseline in bertha2019, both of which are in group B, is not statistically significant.
5.4 Task T4: Author-level political orientation detection
Results for task T4 - author-level political orientation detection - using the BRmoral corpus by_author dataset for both binary (left, right) and ternary (left, centre, right) classification are summarised in Table 14 and further discussed below.
Jiang et al. - binary: Acc = 0.61, F1 = 0.61, P = 0.61, R = 0.61; ternary: Acc = 0.38, F1 = 0.37, P = 0.37, R = 0.38
The best-performing model for binary classification is bert+sngram, and for ternary classification is bert+sngram+psych. However, the differences between these and the baseline in bertha2019 were not found to be significant. Homogeneous groups for the binary task are illustrated in Table 15, and groups for the ternary task are illustrated in Table 16.
Jiang et al.: Acc = 0.61, group A
Jiang et al.: Acc = 0.38, group A
In both binary and ternary classification tasks, we notice that several models turned out to obtain statistically equivalent results. This outcome, which is similar to what has been observed in the text-level political orientation task (cf. Section 5.2) based on the same corpus, will be further discussed in Section 6.
5.5 Task T5: Author-level political stance detection
Results for task T5 - author-level political stance detection - using the GovBR corpus are summarised in Table 17 and further discussed below.
Jiang et al.: Acc = 0.58, F1 = 0.55, P = 0.62, R = 0.58
Once again, the best-performing model is bert+sngram, but the differences between this and the other models, including the baseline in bertha2019, were not found to be statistically significant. The corresponding homogeneous groups are illustrated in Table 18.
Jiang et al.: Acc = 0.58, group A
As in some of the previous experiments, although bert+sngram still obtains the highest accuracy among the alternatives, some of its simpler sub-components (bert, in this case) were found to be statistically similar. Moreover, the full model bert+sngram+psych is once again statistically similar to the reference baseline in bertha2019.
In what follows we summarise our main results (Section 6.1) and report prediction explanations for the main classification tasks.
6.1 Results summary
Table 19 shows the tasks in which each of the models under evaluation ranks among the top-performing alternatives (i.e., belonging to cluster A in the homogeneous groups illustrated in the previous section).
Jiang et al.: top-performing (group A) in 5 of the tasks
From these results, a number of observations are warranted. First, we notice a large number of statistically similar models in the author-level tasks (on the right of the table), particularly in the essay (task T4) and Twitter (task T5) domains. This was to some extent to be expected, as the information to be learned in author-level inference is less explicit in the input texts, arguably making these tasks more challenging than text-level inference in general.
Second, we notice that the full model, comprising the main CNN architecture and the bert, sngram and psych sub-components, is generally not the best choice, being often outperformed by simpler alternatives and/or by the reference baseline system in bertha2019.
Regarding the role of individual sub-components of the main architecture, we notice that using psych or sngram alone is clearly insufficient, and that even the bert component alone generally fails to deliver optimal results. This outcome suggests that the combination of fine-tuned BERT with the CNN architecture, as in the present bert model (not to be confused with the standard bert.baseline model in the second row from the top, which does not use a CNN classifier), explains much of the best results obtained across our experiments, but not all of them. In fact, it is the combination of BERT and sngrams in the CNN architecture (represented by the bert+sngram model) that generally obtains the best results among these alternatives, being the top-performing model in 5 out of 7 tasks. This outcome may be partially explained by the simultaneous use of two representations of the input text (i.e., linearly ordered tokens and count-based syntactic bigram features), but we notice that larger models using this strategy (e.g., also including psycholinguistics-motivated features) are not necessarily better.
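The count-based syntactic bigram features mentioned above can be illustrated as follows (a toy sketch assuming that dependency triples are already produced by an upstream parser; the key format and function names are our own and not the authors' implementation):

```python
from collections import Counter

def syntactic_bigram_counts(dependencies):
    """Count (head POS, dependent POS) pairs from dependency triples.

    dependencies: iterable of (head_pos, relation, dep_pos) triples.
    Returns a Counter mapping 'HEAD_DEP' bigram keys to counts.
    """
    return Counter(f"{head}_{dep}" for head, _rel, dep in dependencies)

def to_vector(counts, vocabulary):
    """Project bigram counts onto a fixed feature vocabulary."""
    return [counts.get(key, 0) for key in vocabulary]
```

Unlike the linearly ordered token sequence fed to BERT, such counts abstract away word order and surface form, which is one way to see why the two representations can be complementary.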
Finally, we notice that the baseline in bertha2019 remains highly competitive and, although seldom obtaining the highest accuracy, it is often found within the group of top-performing alternatives.
6.2 Feature importance
As a means to illustrate the word features most strongly correlated with each class and task, we performed eli5 model explanation to obtain word weights representing the change (decrease/increase) of the evaluation score when a given feature is shuffled.
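The shuffling procedure behind this kind of explanation can be sketched from scratch as follows (illustrative only; the actual analysis used the eli5 library, and the interface below is our own assumption):

```python
import random

def permutation_importance(predict, X, y, n_features, seed=0):
    """Estimate each feature's importance as the drop in accuracy
    observed when that feature's column is shuffled across samples.

    predict: callable mapping a list of feature rows to predicted labels;
    X: list of feature rows (lists); y: gold labels.
    Returns one importance score per feature (baseline acc - shuffled acc).
    """
    rng = random.Random(seed)

    def acc(rows):
        preds = predict(rows)
        return sum(p == g for p, g in zip(preds, y)) / len(y)

    baseline = acc(X)
    scores = []
    for j in range(n_features):
        col = [row[j] for row in X]
        rng.shuffle(col)  # break the feature-label association for column j
        shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
        scores.append(baseline - acc(shuffled))
    return scores
```

A feature the model does not rely on yields a near-zero score, while shuffling a decisive feature causes a large accuracy drop; ranking word features by this drop produces tables such as those discussed next.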
Table 20 presents the most important features associated with the hyperpartisan (i.e., non-neutral) class in the SemEval T1 (by_articles) and T3 (by_publisher) tasks.
Table 21 presents the most important features associated with the left/right classes in the BRmoral corpus tasks T2 and T4 (by_opinion and by_author datasets, respectively).
Finally, Table 22 presents the most important features associated with political stance in GovBR task T5.
7 Final remarks
This paper has addressed the issue of how to combine transformer-based text representations, which are mainstream in NLP and related fields, with both syntactic dependency information and psycholinguistics-motivated features for political inference from text data. In doing so, we considered both text- and author-level task definitions in both English and Portuguese, and introduced a novel dataset devoted to the latter.
As expected in experiments involving a range of tasks, datasets and languages, our present results vary considerably across evaluation settings and, although BERT remains a robust baseline for many of these tasks, in most cases it was possible to obtain significant improvements by making use of additional representations. This is not to say, however, that combining all three text representations (BERT, syntax and psycholinguistics) into a single model is the best option. In particular, it was a subset of our original CNN architecture, combining only BERT and the syntactic dependency model, that obtained the overall best results in most tasks.
The current work leaves a number of opportunities for further research. First, we notice that political bias and ideology are relatively broad terms that may actually include a wide range of distinct politically-related phenomena, and that future NLP studies may benefit from more fine-grained task definitions.
We notice also that many other pre-trained language models have been made available in recent years, including ELMo elmo, XLNet xlnet, RoBERTa roberta, and GPT-3 gpt3. Whether any of these may outperform BERT when combined with other text representations (as in the present work) remains an open research question.
Finally, the present use of text representations is only a first step towards more informed models that may ultimately combine BERT with many other text- and author-level features. In particular, the present architecture may be expanded with, for instance, sentiment or emotion-related information, or even with author demographics (e.g., gender, age, personality traits, moral foundations etc.) obtained with the aid of author profiling classifiers ca-fineg. These and many other related possibilities are also left as future work.
Samuel Caetano da Silva: Conceptualisation, Methodology, Software, Validation, Formal analysis, Investigation, Writing - original draft, reviewing and editing. Ivandré Paraboni: Conceptualisation, Writing - review and editing, supervision.
The first author received support from the Brazilian Foundation CAPES - Coordination for the Improvement of Higher Education Personnel, under grant 88882.378103/2019-01.
Conflict of interest
The authors declare no potential conflict of interests.
Samuel Caetano da Silva. Graduate student of Information Systems at the University of São Paulo.
Ivandré Paraboni. PhD in Computer Science (Univ. of Brighton, UK, 2003), and associate professor at the University of São Paulo.