A Multilingual Multi-Target Dataset for Stance Detection
We extract a large-scale stance detection dataset from comments written by candidates of elections in Switzerland. The dataset consists of German, French and Italian text, allowing for a cross-lingual evaluation of stance detection. It contains 67 000 comments on more than 150 political issues (targets). Unlike stance detection models that are tailored to specific target issues, we use the dataset to train a single model on all the issues. To make learning across targets possible, we prepend to each instance a natural question that represents the target (e.g. "Do you support X?"). Baseline results from multilingual BERT show that zero-shot cross-lingual and cross-target transfer of stance detection is moderately successful with this approach.
In recent years many datasets have been created for the task of automated stance detection, advancing natural language understanding systems for political science, opinion research and other application areas. Typically, such benchmarks Mohammad et al. (2016) are composed of short pieces of text commenting on politicians or public issues and are manually annotated with their stance towards a target entity (e.g. Climate Change, or Trump). However, they are limited in scope on multiple levels Küçük and Can (2020).
First of all, it is questionable how well current stance detection methods perform in a cross-lingual setting, as the multilingual datasets available today are relatively small, and specific to a single target Taulé et al. (2017, 2018). Furthermore, specific models tend to be developed for each single target or pair of targets Sobhani et al. (2017). Concerns have been raised that cross-target performance is often considerably lower than fully supervised performance Küçük and Can (2020).
In this paper we propose a much larger dataset that combines multilinguality and a multitude of topics and targets. x-stance comprises more than 150 questions concerning Swiss politics and more than 67k answers given in the last decade by candidates running for political office in Switzerland.
Questions are available in four languages: English, Swiss Standard German, French, and Italian. The language of a comment depends on the candidate’s region of origin.
We have extracted the data from the voting advice application Smartvote (https://smartvote.ch). On that platform, candidates respond to questions mainly in categorical form (yes / rather yes / rather no / no). They can also submit a free-text comment in order to justify, explain or differentiate their categorical answer. An example is given in Figure 1.
We transform the dataset into a stance detection task by interpreting the question as a natural-language representation of the target, and the commentary as the input to be classified.
The dataset is split into a multilingual training set and into multiple test sets to evaluate zero-shot cross-lingual and cross-target transfer. To provide a baseline, we fine-tune a multilingual Bert model Devlin et al. (2019) on x-stance. We show that the baseline accuracy is comparable to previous stance detection benchmarks while leaving ample room for improvement. In addition, multilingual Bert can generalize to a degree both cross-lingually and in a cross-target setting.
We have made the dataset and the code for reproducing the baseline model publicly available at https://github.com/ZurichNLP/xstance.
In the context of the IberEval shared tasks, two related multilingual datasets have been created Taulé et al. (2017, 2018). Both are a collection of annotated Spanish and Catalan tweets. Crucially, the tweets in both languages focus on the same issue (Catalan independence); given this fact they are the first truly multilingual stance detection datasets known to us.
With regard to the languages covered by x-stance, only monolingual datasets seem to be available. For French, a collection of tweets on French presidential candidates has been annotated with stance Lai (2019). Similarly, two datasets of Italian tweets on the occasion of the 2016 constitutional referendum have been created Lai et al. (2018); Lai (2019). For German, there is no known stance detection dataset, but an aspect-based sentiment dataset has been created for a GermEval 2017 shared task Wojatzki et al. (2017).
The SemEval-2016 task on detecting stance in tweets Mohammad et al. (2016) offers data concerning multiple targets (Atheism, Climate Change, Feminism, Hillary Clinton, and Abortion). In the supervised subtask A, participants tended to develop a target-specific model for each of those targets. In subtask B cross-target transfer to the target “Donald Trump” was tested, for which no annotated training data were provided. While this required the development of more universal models, their performance was generally much lower.
Sobhani et al. (2017) introduced a multi-target stance dataset which provides two targets per instance. For example, a model designed in this framework is supposed to simultaneously classify a tweet with regard to Clinton and with regard to Trump. While in theory the framework allows for more than two targets, it is still restricted to a finite and clearly defined set of targets. It focuses on modeling the dependencies of multiple targets within the same text sample, while our approach focuses on learning stance detection from many samples with many different targets.
In a target-specific setting, Ghosh et al. (2019) perform a systematic evaluation of stance detection approaches. They also evaluate Bert Devlin et al. (2019) and find that it consistently outperforms previous approaches.
However, they only experimented with a single-segment encoding of the input, preventing cross-target transfer of the model. Augenstein et al. (2016) propose a conditional encoding approach to encode both the target and the tweet as sequences. They use a bidirectional LSTM to condition the encoding of the tweets on the encoding of the target, and then apply a nonlinear projection on the conditionally encoded tweet. This allows them to train a model that can generalize to previously unseen targets.
Table 1 (excerpt): Number of questions and answers per topic.

| Topic | Questions | Answers |
| Infrastructure & Environment | 31 | 9590 |
| Total (training topics) | 174 | 59 915 |
| Total (held-out topics) | 20 | 7356 |
The input provided by x-stance is two-fold: (A) a natural language question concerning a political issue; (B) a natural language commentary on a specific stance towards the question.
The label to be predicted is either ‘favor’ or ‘against’. This corresponds to a standard established by Mohammad et al. (2016). However, x-stance differs from that dataset in that it lacks a ‘neither’ class; all comments refer to either a ‘favor’ or an ‘against’ position. The task posed by x-stance is thus a binary classification task.
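As an illustration, each instance can be thought of as a (question, comment, label) triple. The sketch below is hypothetical: the field names are our own and not necessarily the exact schema of the released dataset, and the comment text is invented.

```python
# A hypothetical x-stance instance; field names and the comment text are
# illustrative only, not the exact schema of the released dataset.
instance = {
    "question": "Should cannabis use be legalized?",
    "comment": "Prohibition has failed; regulation protects consumers better.",
    "label": "FAVOR",
}

def is_valid(inst: dict) -> bool:
    """Check the binary label constraint: every comment is FAVOR or AGAINST."""
    return inst["label"] in {"FAVOR", "AGAINST"}
```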
We downloaded questions and answers via the Smartvote API. The downloaded data cover 175 communal, cantonal and national elections between 2011 and 2020.
All candidates in an election who participate in Smartvote are asked the same set of questions, but depending on the locale they see translated versions of the questions. They can answer each question with either ‘yes’, ‘rather yes’, ‘rather no’, or ‘no’. They can supplement each answer with a comment of at most 500 characters.
The questions asked on Smartvote have been edited by a team of political scientists. They are intended to cover a broad range of political issues relevant at the time of the election. A detailed documentation of the design of Smartvote and the editing process of the questions is provided by Thurman and Gasser (2009).
We merged the two labels on each pole into a single label: ‘yes’ and ‘rather yes’ were combined into ‘favor’; ‘rather no’ and ‘no’ into ‘against’. This improves the consistency of the data and the comparability to previous stance detection datasets.
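The label merge described above amounts to a simple mapping; a minimal sketch (the exact label spellings are illustrative):

```python
# Collapse the four categorical Smartvote answers into two stance labels,
# as described in the text. Label spellings are illustrative.
MERGE = {
    "yes": "FAVOR",
    "rather yes": "FAVOR",
    "rather no": "AGAINST",
    "no": "AGAINST",
}

def merge_label(answer: str) -> str:
    """Map a raw categorical answer to a binary stance label."""
    return MERGE[answer.strip().lower()]
```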
We did not further preprocess the text of the comments.
As the API does not provide the language of comments, we employed a language identifier to automatically annotate this information. We used the langdetect library Shuyo (2010). For each responder we classified all the comments jointly, assuming that responders did not code-switch while answering the questionnaire.
We applied the identifier in a two-step approach. In the first run we allowed the identifier to output all 55 languages that it supports out of the box, plus Romansh, the fourth official language in Switzerland (namely the Rumantsch Grischun variety; the language profile was created using resources from the Zurich Parallel Corpus Collection Graën et al. (2019) and the Quotidiana corpus, https://github.com/ProSvizraRumantscha/corpora). We found that no Romansh comments were detected and that all unexpected outputs were misclassifications of German, French or Italian comments. We further concluded that few or no Swiss German comments are in the dataset: if there were, some of them would have manifested themselves in the form of misclassifications (e.g. as Dutch).
In the second run, drawing from these conclusions, we restricted the identifier’s output to English, French, German and Italian.
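The per-responder classification could be sketched as follows. Here `detect` stands in for a call to the langdetect library, and the majority vote over a responder's comments is one possible implementation of the joint classification; the exact joint procedure is not specified in the text.

```python
from collections import Counter
from typing import Callable, Iterable

# Second-pass restriction described in the text: only these outputs are kept.
ALLOWED = {"de", "fr", "it", "en"}

def classify_responder(comments: Iterable[str],
                       detect: Callable[[str], str]) -> str:
    """Jointly classify all comments of one responder, assuming no
    code-switching: detect each comment, keep only allowed languages,
    and return the majority vote (one possible joint strategy)."""
    votes = Counter(lang for lang in (detect(c) for c in comments)
                    if lang in ALLOWED)
    return votes.most_common(1)[0][0]
```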
We pre-filtered the questions and answers to improve the quality of the dataset. To keep the domain of the data manageable, we set a focus on national-level questions: all questions and corresponding answers pertaining to national elections were included.
In communal and cantonal elections, candidates answered both local questions and a subset of the national questions. From those elections we only considered answers to questions that had also been asked in a national election. Furthermore, these answers were only used to augment the training set, while the validation and test sets were restricted to answers from national elections.
We discarded the fewer than 20 comments classified as English. Furthermore, instances that met any of the following conditions were filtered from the dataset:
Question is not a closed question or does not address a clearly defined political issue.
No comment was submitted by the candidate or the comment is shorter than 50 characters.
Comment starts with “but” or a similar indicator that the comment is not a self-contained statement.
Comment contains a URL.
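The filter conditions above can be sketched as follows; the 50-character threshold is from the text, while the "but" and URL checks are simplified stand-ins for the actual heuristics.

```python
def keep(question_ok: bool, comment: str) -> bool:
    """Sketch of the instance filter. question_ok encodes whether the
    question is a closed question on a clearly defined political issue.
    The 'but' prefix and URL checks are simplified approximations."""
    if not question_ok:
        return False
    c = comment.strip()
    if len(c) < 50:                       # no comment or too short
        return False
    if c.lower().startswith("but"):       # not a self-contained statement
        return False
    if "http://" in c or "https://" in c:  # contains a URL
        return False
    return True
```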
In total, a fifth of the original comments were filtered out.
The questions have been organized by the Smartvote editors into categories (such as “Economy”). We further consolidated the pre-defined categories into 12 broad topics (Table 1).
The dataset is shared under a CC BY-NC 4.0 license. Copyright remains with www.smartvote.ch.
We held out the topics “Healthcare” and “Political System” from the training data and created a separate cross-topic test set that contains the questions and answers related to those topics.
Furthermore, in order to test cross-question generalization performance within previously seen topics, we manually selected 16 held-out questions that are distributed over the remaining 10 topics. We selected the held-out questions manually because we wanted to make sure that they are truly unseen and that no paraphrases of the questions are found in the training set.
We designated Italian as a test-only language, since relatively few comments have been written in Italian. From the remaining German and French data we randomly selected a percentage of respondents as validation or as test respondents.
As a result we obtained one training set, one validation set and four test sets. The sizes of the sets are listed in Table 2. We did not consider test sets that are cross-lingual and cross-target at the same time, as they would have been too small to yield significant results.
Some observations regarding the composition of x-stance can be made.
Figure 2 visualizes the proportion of ‘favor’ and ‘against’ stances for each target in the dataset. The ratio differs between questions but is relatively even across the topics. In particular, the questions belonging to the held-out topics (with a ‘favor’ ratio of 49.4%) have a similar class distribution as the questions within other topics (with a ‘favor’ ratio of 50.0%).
Not every question is unique; some questions are paraphrases describing the same political issue. For example, in the 2015 election, the candidates were asked: “Should the consumption of cannabis as well as its possession for personal use be legalised?” Four years later they were asked: “Should cannabis use be legalized?” However, we do not see any need to consolidate those duplicates because they contribute to the diversity of the training data.
We further observe that while some questions in the dataset are quite short, some questions are rather convoluted. For example, a typical long question reads:
Some 1% of direct payments to Swiss agriculture currently go to organic farming operations. Should this proportion be increased at the expense of standard farming operations as part of Switzerland’s 2014-2017 agricultural policy?
Such longer questions might be more challenging to process semantically.
The x-stance dataset has more German samples than French samples. The language ratio of about 3:1 is consistent across all training and test sets. Given the two languages it is possible to either train two monolingual models or to train a single model in a multi-source setup McDonald et al. (2011). We choose a multi-source baseline because M-Bert is known to benefit from multilingual training data both in a supervised and in a cross-lingual scenario Kondratyuk and Straka (2019).
We evaluate two types of baselines to obtain an impression of the difficulty of the task.
The first pair of baselines uses the most frequent class in the training set for prediction. Specifically, the global majority class baseline predicts the most frequent class across all training targets while the target-wise majority class baseline predicts the class that is most frequent for a given target question. The latter can only be applied to the supervised test set.
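The two majority-class baselines can be sketched as follows (function names are ours):

```python
from collections import Counter

def global_majority(train):
    """train: list of (target, label) pairs.
    Predict the most frequent label across all training targets."""
    return Counter(label for _, label in train).most_common(1)[0][0]

def targetwise_majority(train):
    """Return a per-target predictor: for each target question, the label
    that is most frequent for that target in the training set."""
    per_target = {}
    for target, label in train:
        per_target.setdefault(target, Counter())[label] += 1
    return {t: c.most_common(1)[0][0] for t, c in per_target.items()}
```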
Secondly, we fine-tune multilingual Bert (M-Bert) on the task Devlin et al. (2019) which has been pretrained jointly in 104 languages444https://github.com/google-research/bert/blob/master/multilingual.md and has established itself as a state of the art for various multilingual tasks Wu and Dredze (2019); Pires et al. (2019). Within the field of stance detection, Bert can outperform both feature-based and other neural approaches in a monolingual English setting Ghosh et al. (2019).
In the context of Bert we interpret the x-stance task as sequence pair classification inspired by natural language inference tasks Bowman et al. (2015). We follow the procedure outlined by Devlin et al. (2019) for such tasks. We designate the question as segment A and the comment as segment B. The two segments are separated with the special token [SEP], and the special token [CLS] is prepended to the sequence. The final hidden state corresponding to [CLS] is then classified by a linear layer.
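A minimal sketch of this input construction, assuming the question and comment have already been tokenized into subwords:

```python
def encode_pair(question_tokens, comment_tokens):
    """Build the Bert sequence-pair input described in the text:
    [CLS] question [SEP] comment [SEP], with segment id 0 for the
    question (segment A, incl. [CLS] and the first [SEP]) and 1 for
    the comment (segment B, incl. the final [SEP])."""
    tokens = (["[CLS]"] + question_tokens + ["[SEP]"]
              + comment_tokens + ["[SEP]"])
    segment_ids = ([0] * (len(question_tokens) + 2)
                   + [1] * (len(comment_tokens) + 1))
    return tokens, segment_ids
```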
We fine-tune the full model with a cross-entropy loss, using the AllenNLP library Gardner et al. (2018) as a basis for our implementation.
We upsampled the ‘favor’ class so that the two classes are balanced when summing over all questions and topics. A maximum sequence length of 512 subwords and a batch size of 16 were chosen for all training runs. We then performed a grid search over the following range of hyperparameters based on the validation accuracy:
Learning rate: 5e-5, 3e-5, 2e-5
Number of epochs: 3, 4
The grid search was repeated independently for every variant that we tested. Furthermore, the standard recommendations for fine-tuning Bert were used: Adam with β1 = 0.9 and β2 = 0.999; an L2 weight decay of 0.01; a learning rate warmup over the first 10% of the steps; and a linear decay of the learning rate. A dropout probability of 0.1 was set on all layers.
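The learning rate schedule (linear warmup over the first 10% of steps, followed by linear decay) can be sketched as:

```python
def learning_rate(step: int, total_steps: int, base_lr: float = 2e-5,
                  warmup_frac: float = 0.1) -> float:
    """Linear warmup to base_lr over the first warmup_frac of the steps,
    then linear decay to zero, as described in the text."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * (total_steps - step) / max(1, total_steps - warmup_steps)
```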
Table 3 (excerpt): Results for the cross-lingual setting.

| Model | de | fr | it |
| Majority class (global) | 33.1 | 34.8 | 34.4 |
| Majority class (target-wise) | 60.8 | 65.1 | 62.9 |

Table 4: Results for the cross-target setting. For each test set we separately report a German and a French score, as well as their harmonic mean.

| | Supervised | | | Cross-question | | | Cross-topic | | |
| Model | de | fr | Mean | de | fr | Mean | de | fr | Mean |
| Majority class (global) | 33.1 | 34.8 | 33.9 | 36.4 | 37.9 | 37.1 | 32.1 | 33.8 | 32.9 |
| Majority class (target-wise) | 60.8 | 65.1 | 62.9 | – | – | – | – | – | – |
| M-Bert | 76.8 | 76.6 | 76.6 | 68.5 | 68.4 | 68.4 | 68.9 | 70.9 | 69.9 |
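The harmonic mean used to aggregate the German and French scores is simply:

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two scores, e.g. the German and French accuracies."""
    return 2 * a * b / (a + b)
```

For example, the target-wise majority scores of 60.8 (de) and 65.1 (fr) yield a harmonic mean of about 62.9.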
Table 3 shows the results for the cross-lingual setting. M-Bert performs consistently better than the majority class baselines. Even the zero-shot performance in Italian, while significantly lower than the supervised scores, is much better than the target-wise majority class baseline.
Results for the cross-target setting are given in Table 4. Similar to the cross-lingual setting, M-Bert performs worse in a cross-target setting but easily surpasses the majority class baselines. Furthermore, the cross-question score of M-Bert is slightly lower than the cross-topic score.
The default setup preserves horizontal language consistency in that the language of the questions always matches the language of the comments. For example, the Italian test instances are combined with the Italian version of the questions, even though during training the model has only ever seen the German and French versions of the questions.
An alternative concept is vertical language consistency, whereby the questions are consistently presented in one language, regardless of the comment. To test whether horizontal or vertical consistency is more helpful, we train and evaluate M-Bert on a dataset variant where all questions are in their English version. We chose English as a lingua franca because it had the largest share of data during the pretraining of M-Bert.
The results are shown in Table 5. While the effect is negligible in most settings, the cross-lingual performance clearly increases when all questions are given in English.
Table 5: Scores of the M-Bert variants in the different evaluation settings.

| Variant | Supervised | Cross-lingual | Cross-question | Cross-topic |
| — with English questions | 76.1 | 71.7 | 68.5 | 69.4 |
| — with missing questions | 73.2 | 67.1 | 67.8 | 69.3 |
| — with missing comments | 64.2 | 60.5 | 51.1 | 48.6 |
| — with random questions | 56.0 | 52.5 | 47.7 | 48.5 |
| — with random comments | 50.7 | 50.7 | 48.2 | 48.7 |
| — with target embeddings | 70.1 | 66.0 | 68.4 | 69.0 |
In order to rule out that only the questions or only the comments are necessary to optimally solve the task, we conduct some additional experiments:
Only use a single segment containing the comment, removing the questions from the training and test data (missing questions).
Only use the question and remove the comment (missing comments).
In both cases the performance decreases across all evaluation settings (Table 5). The loss in performance is much higher when comments are missing, indicating that the comments contain the most important information about stance. As can be expected, the score achieved without comments is only slightly different from the target-wise majority class baseline.
But there is also a loss in performance when the questions are missing, which underlines the importance of pairing both pieces of text. The effect of missing questions is especially strong in the supervised and cross-lingual settings. To illustrate this, we provide in Table A8 some examples of comments that occur with multiple different targets in the training set. Those examples can explain why the target can be essential for disambiguating a stance detection problem. On the other hand, the effect of omitting the questions is less pronounced in the cross-target settings.
The above single-segment experiments tell us that both the comment and the question provide crucial information. But it is possible that the M-Bert model, even though trained on both segments, mainly looks at a single segment at test time. To rule this out, we probe the model with randomized data at test time:
Test the model on versions of the test sets where the comments remain in place but the questions are shuffled randomly (random questions). We make sure that the random questions come from the same test set and language as the original questions.
Keep the questions in place and randomize the comments (random comments). Again we shuffle the comments only within test set boundaries.
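The random-comments probe, shuffling comments within test-set boundaries while keeping the questions in place, can be sketched as follows (function name and seeding are ours):

```python
import random

def randomize_comments(instances, seed=0):
    """instances: list of (question, comment) pairs from ONE test set.
    Keep each question in place but pair it with a comment drawn
    (without replacement) from the same test set."""
    rng = random.Random(seed)
    comments = [c for _, c in instances]
    rng.shuffle(comments)
    return [(q, c) for (q, _), c in zip(instances, comments)]
```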
The results in Table 5 show that the performance of the model decreases in both cases, confirming that it learns to take into account both segments.
Finally we test whether the target really needs to be represented by natural language (e.g. “Do you support X?”). Namely, an alternative is to represent the target with a trainable embedding instead of a question.
In order to fit target embeddings smoothly into our architecture, we represent each target type with a different reserved symbol from the M-Bert vocabulary. Segment A is then set to this symbol instead of a natural language question.
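The mapping from targets to reserved symbols could look as follows; the [unusedN] token names follow the convention of the Bert vocabularies, but which reserved symbols were actually used is an assumption on our part.

```python
def targets_to_symbols(targets):
    """Assign each distinct target a reserved vocabulary symbol; segment A
    is then this single symbol instead of the question text. The [unusedN]
    naming is an assumption based on the Bert vocabulary convention."""
    symbols = {}
    for t in targets:
        if t not in symbols:
            symbols[t] = f"[unused{len(symbols)}]"
    return symbols
```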
The results for this experiment are listed in the bottom row of Table 5. An M-Bert model that learns target embeddings instead of encoding a question performs clearly worse in the supervised and cross-lingual settings. From this we conclude that spelled-out natural language questions provide important linguistic detail that can help in stance detection.
The baseline experiments confirm that M-Bert can achieve a reasonable accuracy on x-stance.
Table 6: Scores reported for Bert variants on other stance detection datasets.

| Dataset | Reference | Score |
| SemEval-2016 | Ghosh et al. (2019) | 75.1 |
| MPCHI | Ghosh et al. (2019) | 75.6 |
To put the supervised score into context we list scores that variants of Bert have achieved on other stance detection datasets in Table 6. It seems that the supervised part of x-stance has a similar difficulty as the SemEval-2016 Mohammad et al. (2016) or MPCHI Sen et al. (2018) datasets on which Bert has previously been evaluated.
On the other hand, in the cross-lingual and cross-target settings, the mean score drops by 6–8 percentage points compared to the supervised setting; while zero-shot transfer is possible to a degree, it can still be improved.
The additional experiments (Table 5) validate the results and show that the sequence-pair classification approach to stance detection is justified.
It is interesting to see what errors the M-Bert model makes. Table A7 presents instances where it predicts the wrong label with a high confidence. These examples indicate that many comments express their stance only on a very implicit level, and thus hint at a potential weakness of the dataset. Because on the voting advice platform the label is explicitly shown to readers in addition to the comments, the comments do not need to express the stance explicitly. Manual annotation could eliminate very implicit samples in a future version of the dataset. However, the sheer size and breadth of the dataset could not realistically be achieved with manual annotation, and, in our view, largely compensates for the implicitness of the texts.
We have presented a new dataset for political stance detection called x-stance. The dataset extends over a broad range of topics and issues regarding national Swiss politics. This diversity of topics opens up an opportunity to further study multi-target learning. Moreover, being partly Swiss Standard German, partly French and Italian, the dataset promotes a multilingual approach to stance detection.
By compiling formal commentary that politicians have written on political questions, we add a new text genre to the field of stance detection. We also propose a question–answer format that allows us to condition stance detection models on a target naturally.
Our baseline results with multilingual Bert show that the model has some capability to perform zero-shot transfer to unseen languages and to unseen targets (both within a topic and to unseen topics). However, there is some gap in performance that future work could address. We expect that the x-stance dataset could furthermore be a valuable resource for fields such as argument mining, argument search or topic classification.
This work was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727). We would like to thank Isabelle Augenstein for helpful feedback.
Augenstein, I., Rocktäschel, T., Vlachos, A., and Bontcheva, K. (2016). Stance detection with bidirectional conditional encoding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 876–885.
Gardner, M., et al. (2018). AllenNLP: a deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), Melbourne, Australia, pp. 1–6.
Sen, A., Sinha, M., Mannarswamy, S., and Roy, S. (2018). Stance classification of multi-perspective consumer health information. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, pp. 273–281.
Table A7: Instances for which M-Bert predicts the wrong label with high confidence. The last column gives the probability the model assigns to the gold label; English translations in brackets.

| Question | Comment | Gold label | P(gold) |
| Befürworten Sie eine vollständige Liberalisierung der Geschäftsöffnungszeiten? [Are you in favour of a complete liberalisation of business hours for shops?] | Ausser Sonntag. Dies sollte ein Ruhetag bleiben können. [Except Sunday. That should remain a day of rest.] | FAVOR | 0.001 |
| Soll die Schweiz innerhalb der nächsten vier Jahre EU-Beitrittsverhandlungen aufnehmen? [Should Switzerland embark on negotiations in the next four years to join the EU?] | In den nächsten vier Jahren ist dies wohl unrealistisch. [For the next four years this is probably unrealistic.] | FAVOR | 0.005 |
| Befürworten Sie einen Ausbau des Landschaftsschutzes? [Are you in favour of extending landscape protection?] | Wenn es darum geht erneuerbare Energien zu fördern, ist sogar eine Lockerung angebracht. [When it comes to promoting renewable energy, even a relaxation is appropriate.] | AGAINST | 0.006 |
| La Suisse devrait-elle engager des négociations pour un accord de libre échange avec les Etats-Unis? [Should Switzerland start negotiations with the USA on a free trade agreement?] | Il faut cependant en parallèle veiller à ce que la Suisse ne soit pas mise de côté par les Etats-Unis ! [At the same time it must be ensured that Switzerland is not sidelined by the United States!] | AGAINST | 0.010 |
Table A8: Comments that occur with multiple different targets in the training set. English translations in brackets.

| Comment | Is favorable towards target … | … but against target … |
| Ich will offene Grenzen für Waren und selbstverantwortliche mündige Bürger. Der Staat hat kein Recht, uns einzuschränken. [I want open borders for goods and responsible citizens. The state has no right to restrict us.] | Soll die Schweiz mit den USA Verhandlungen über ein Freihandelsabkommen aufnehmen? [Should Switzerland start negotiations with the USA on a free trade agreement?] | Soll die Schweiz das Schengen-Abkommen mit der EU kündigen und wieder verstärkte Personenkontrollen direkt an der Grenze einführen? [Should Switzerland terminate the Schengen Agreement with the EU and reintroduce increased identity checks directly on the border?] |
| Hier gilt der Grundsatz der Eigenverantwortung und Selbstbestimmung des Unternehmens! [The principle of personal responsibility and corporate self-regulation applies here!] | Sind Sie für eine vollständige Liberalisierung der Ladenöffnungszeiten? [Are you in favour of the complete liberalization of shop opening times?] | Würden Sie die Einführung einer Frauenquote in Verwaltungsräten börsenkotierter Unternehmen befürworten? [Would you support the introduction of a women's quota for the Boards of Directors of listed companies?] |
In order to study the automatic detection of stances on political issues, questions and candidate responses on the voting advice application smartvote.ch were downloaded. Mainly data pertaining to national-level issues were included to reduce variability.
The training set consists of questions and answers in Swiss Standard German and Swiss French (74.1% de-CH; 25.9% fr-CH). The test sets also contain questions and answers in Swiss Italian (67.1% de-CH; 24.7% fr-CH; 8.2% it-CH). The questions have also been translated into English.
Candidates for communal, cantonal or national elections in Switzerland who have filled out an online questionnaire.
Age: 18 or older – mixed.
Gender: Unknown – mixed.
Race/ethnicity: Unknown – mixed.
Native language: Unknown – mixed.
Socioeconomic status: Unknown – mixed.
Different speakers represented: 7581.
Presence of disordered speech: Unknown.
The questions were edited and translated by political scientists for a public voting advice website.
The answers were written between 2011 and 2020 by the users of the website.
Questions, answers, arguments and comments regarding political issues.