X-Stance: A Multilingual Multi-Target Dataset for Stance Detection

by Jannis Vamvas, et al.
Universität Zürich

We extract a large-scale stance detection dataset from comments written by candidates in Swiss elections. The dataset consists of German, French and Italian text, allowing for a cross-lingual evaluation of stance detection. It contains 67,000 comments on more than 150 political issues (targets). Unlike typical stance detection setups, which train a separate model per target issue, we use the dataset to train a single model on all the issues. To make learning across targets possible, we prepend to each instance a natural question that represents the target (e.g. "Do you support X?"). Baseline results from multilingual BERT show that zero-shot cross-lingual and cross-target transfer of stance detection is moderately successful with this approach.




1 Introduction

Figure 1: Example of a question and two answers in the x-stance dataset. Question #3414 (available in all languages): "Soll der Bundesrat ein Freihandelsabkommen mit den USA anstreben?" / "La Suisse devrait-elle conclure un accord de libre-échange avec les Etats-Unis?" [Should Switzerland strive for a free trade agreement with the USA?] Comment #26597 (German, label: FAVOR): "Mit unserem zweitwichtigsten Handelspartner sollten wir ein Freihandelsabkommen haben." [With our second most important trading partner we should have a free trade agreement.] Comment #21421 (French, label: AGAINST): "Les accords de libre-échange menacent la qualité des produits suisses." [The free trade agreements jeopardize the quality of the Swiss products.] The answers were submitted by electoral candidates on a voting advice website. The author of the German comment declared to be in favor of the issue and added the comment as an explanation. The French comment was written by another candidate to explain why a negative stance was taken towards the issue. While x-stance contains hundreds of answers to this question, we have picked these two comments as examples due to their brevity.

In recent years many datasets have been created for the task of automated stance detection, advancing natural language understanding systems for political science, opinion research and other application areas. Typically, such benchmarks Mohammad et al. (2016) are composed of short pieces of text commenting on politicians or public issues and are manually annotated with their stance towards a target entity (e.g. Climate Change, or Trump). However, they are limited in scope on multiple levels Küçük and Can (2020).

First of all, it is questionable how well current stance detection methods perform in a cross-lingual setting, as the multilingual datasets available today are relatively small, and specific to a single target Taulé et al. (2017, 2018). Furthermore, specific models tend to be developed for each single target or pair of targets Sobhani et al. (2017). Concerns have been raised that cross-target performance is often considerably lower than fully supervised performance Küçük and Can (2020).

In this paper we propose a much larger dataset that combines multilinguality and a multitude of topics and targets. x-stance comprises more than 150 questions concerning Swiss politics and more than 67k answers given in the last decade by candidates running for political office in Switzerland.

Questions are available in four languages: English, Swiss Standard German, French, and Italian. The language of a comment depends on the candidate’s region of origin.

We have extracted the data from the voting advice application Smartvote (https://smartvote.ch). On that platform, candidates respond to questions mainly in categorical form (yes / rather yes / rather no / no). They can also submit a free-text comment in order to justify, explain or differentiate their categorical answer. An example is given in Figure 1.

We transform the dataset into a stance detection task by interpreting the question as a natural-language representation of the target, and the commentary as the input to be classified.

The dataset is split into a multilingual training set and into multiple test sets to evaluate zero-shot cross-lingual and cross-target transfer. To provide a baseline, we fine-tune a multilingual Bert model Devlin et al. (2019) on x-stance. We show that the baseline accuracy is comparable to previous stance detection benchmarks while leaving ample room for improvement. In addition, multilingual Bert can generalize to a degree both cross-lingually and in a cross-target setting.

We have made the dataset and the code for reproducing the baseline model publicly available at https://github.com/ZurichNLP/xstance.

2 Related Work

Multilingual Stance Detection

In the context of the IberEval shared tasks, two related multilingual datasets have been created  Taulé et al. (2017, 2018). Both are a collection of annotated Spanish and Catalan tweets. Crucially, the tweets in both languages focus on the same issue (Catalan independence); given this fact they are the first truly multilingual stance detection datasets known to us.

With regard to the languages covered by x-stance, only monolingual datasets seem to be available. For French, a collection of tweets on French presidential candidates has been annotated with stance Lai (2019). Similarly, two datasets of Italian tweets on the occasion of the 2016 constitutional referendum have been created Lai et al. (2018); Lai (2019). For German, there is no known stance detection dataset, but an aspect-based sentiment dataset has been created for a GermEval 2017 shared task Wojatzki et al. (2017).

Multi-Target Stance Detection

The SemEval-2016 task on detecting stance in tweets Mohammad et al. (2016) offers data concerning multiple targets (Atheism, Climate Change, Feminism, Hillary Clinton, and Abortion). In the supervised subtask A, participants tended to develop a target-specific model for each of those targets. In subtask B cross-target transfer to the target “Donald Trump” was tested, for which no annotated training data were provided. While this required the development of more universal models, their performance was generally much lower.

Sobhani et al. (2017) introduced a multi-target stance dataset which provides two targets per instance. For example, a model designed in this framework is supposed to simultaneously classify a tweet with regard to Clinton and with regard to Trump. While in theory the framework allows for more than two targets, it is still restricted to a finite and clearly defined set of targets. It focuses on modeling the dependencies of multiple targets within the same text sample, while our approach focuses on learning stance detection from many samples with many different targets.

Representation Learning for Stance Detection

In a target-specific setting, Ghosh et al. (2019) perform a systematic evaluation of stance detection approaches. They also evaluate Bert Devlin et al. (2019) and find that it consistently outperforms previous approaches.

However, they only experimented with a single-segment encoding of the input, preventing cross-target transfer of the model. Augenstein et al. (2016) propose a conditional encoding approach to encode both the target and the tweet as sequences. They use a bidirectional LSTM to condition the encoding of the tweets on the encoding of the target, and then apply a nonlinear projection on the conditionally encoded tweet. This allows them to train a model that can generalize to previously unseen targets.

3 The x-stance Dataset

Topic                          Questions   Answers
Digitisation                        2        1,168
Economy                            23        6,899
Education                          16        7,639
Finances                           15        3,980
Foreign Policy                     16        4,393
Immigration                        19        6,270
Infrastructure & Environment       31        9,590
Security                           20        5,193
Society                            17        6,275
Welfare                            15        8,508
Total (training topics)           174       59,915
Healthcare                         11        4,711
Political System                    9        2,645
Total (held-out topics)            20        7,356
Table 1: Number of questions and answers per topic.

      Intra-target                  Cross-question              Cross-topic
      (new answers to               (new questions within       (new questions within
      known questions)              known topics)               unseen topics)
de    Train: 33,850                 Test: 3,143                 Test: 5,269
      Valid:  2,871
      Test:   3,479
fr    Train: 11,790                 Test: 1,170                 Test: 1,914
      Valid:  1,055
      Test:   1,284
it    Test:   1,173                 Test: (110)                 Test: (173)

Table 2: Number of answer instances in the training, validation and test sets. The upper left corner represents a multilingually supervised task, where training, validation and test data are from exactly the same domain. The top-to-bottom axis gives rise to a cross-lingual transfer task, where a model trained on German and French is evaluated on Italian answers to the same questions. The left-to-right axis represents a continuous shift of domain: In the middle column, the model is tested on previously unseen questions that belong to the same topics as seen during training. In the right column the model encounters unseen answers to unseen questions within an unseen topic. The two test sets in parentheses are too small for a significant evaluation.

3.1 Task Definition

The input provided by x-stance is two-fold: (A) a natural language question concerning a political issue; (B) a natural language commentary on a specific stance towards the question.

The label to be predicted is either ‘favor’ or ‘against’. This corresponds to a standard established by Mohammad et al. (2016). However, x-stance differs from that dataset in that it lacks a ‘neither’ class; all comments refer to either a ‘favor’ or an ‘against’ position. The task posed by x-stance is thus a binary classification task.
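Concretely, a single task instance can be pictured as follows. This is only a sketch: the field names are illustrative, not necessarily the exact schema of the released files.

```python
# A hypothetical x-stance instance: the question is a natural-language
# representation of the target, the comment is the text to classify.
instance = {
    "question": "Should Switzerland strive for a free trade agreement with the USA?",
    "comment": ("Mit unserem zweitwichtigsten Handelspartner "
                "sollten wir ein Freihandelsabkommen haben."),
    "language": "de",
    "label": "FAVOR",  # binary: FAVOR or AGAINST
}
```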

As an evaluation metric we report the macro-average of the F1-score for ‘favor’ and the F1-score for ‘against’, similar to Mohammad et al. (2016). We use this metric mainly to strengthen comparability with previous benchmarks.
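The metric can be computed directly from its definition; a minimal sketch without any library dependency:

```python
def f1(gold, pred, cls):
    """F1-score for a single class."""
    tp = sum(g == p == cls for g, p in zip(gold, pred))
    fp = sum(p == cls and g != cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold, pred):
    """Macro-average of the F1-scores for 'favor' and 'against'."""
    return (f1(gold, pred, "FAVOR") + f1(gold, pred, "AGAINST")) / 2

# Toy example:
gold = ["FAVOR", "FAVOR", "AGAINST", "AGAINST"]
pred = ["FAVOR", "AGAINST", "AGAINST", "AGAINST"]
```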

3.2 Data Collection


We downloaded questions and answers via the Smartvote API. The downloaded data cover 175 communal, cantonal and national elections between 2011 and 2020.

All candidates in an election who participate in Smartvote are asked the same set of questions, but depending on the locale they see translated versions of the questions. They can answer each question with either ‘yes’, ‘rather yes’, ‘rather no’, or ‘no’. They can supplement each answer with a comment of at most 500 characters.

The questions asked on Smartvote have been edited by a team of political scientists. They are intended to cover a broad range of political issues relevant at the time of the election. A detailed documentation of the design of Smartvote and the editing process of the questions is provided by Thurman and Gasser (2009).


Label Merging

We merged the two labels on each pole into a single label: ‘yes’ and ‘rather yes’ were combined into ‘favor’; ‘rather no’ and ‘no’ into ‘against’. This improves the consistency of the data and the comparability to previous stance detection datasets.

We did not further preprocess the text of the comments.
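The label merging amounts to a simple lookup, e.g.:

```python
# Merging the four categorical answers into binary stance labels.
LABEL_MERGE = {
    "yes": "FAVOR",
    "rather yes": "FAVOR",
    "rather no": "AGAINST",
    "no": "AGAINST",
}
```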

Language Identification

As the API does not provide the language of the comments, we employed a language identifier to annotate this information automatically. We used the langdetect library Shuyo (2010). For each responder we classified all comments jointly, assuming that responders did not code-switch while answering the questionnaire.

We applied the identifier in a two-step approach. In the first run we allowed the identifier to output all 55 languages that it supports out of the box, plus Romansh, the fourth official language in Switzerland (namely the Rumantsch Grischun variety; the language profile was created using resources from the Zurich Parallel Corpus Collection Graën et al. (2019) and the Quotidiana corpus, https://github.com/ProSvizraRumantscha/corpora). We found that no Romansh comments were detected and that all unexpected outputs were misclassifications of German, French or Italian comments. We further concluded that few or no Swiss German comments are in the dataset: if there were any, some of them would have manifested themselves in the form of misclassifications (e.g. as Dutch).

In the second run, drawing from these conclusions, we restricted the identifier’s output to English, French, German and Italian.
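The joint, per-responder classification can be sketched as follows. The aggregation logic is the point here; a real run would plug in langdetect's detection function, whereas `toy_detect` below is only a stand-in:

```python
from collections import Counter

def detect_responder_language(comments, detect, allowed=("de", "fr", "it", "en")):
    """Classify all of a responder's comments jointly: take a majority
    vote over the per-comment predictions, restricted to the allowed
    languages (no code-switching is assumed within a questionnaire)."""
    votes = Counter(lang for c in comments if (lang := detect(c)) in allowed)
    return votes.most_common(1)[0][0] if votes else None

# Toy stand-in for langdetect's detection function; the real library
# classifies based on character n-gram profiles.
def toy_detect(text):
    return "fr" if "é" in text or "è" in text else "de"
```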


Filtering

We pre-filtered the questions and answers to improve the quality of the dataset. To keep the domain of the data manageable, we set a focus on national-level questions: all questions and corresponding answers pertaining to national elections were included.

In the context of communal and cantonal elections, candidates have answered both local questions and a subset of the national questions. Of those elections, we only considered answers to the questions that also had been asked in a national election. Furthermore, they were only used to augment the training set while the validation and test sets were restricted to answers from national elections.

We discarded the fewer than 20 comments classified as English. Furthermore, instances that met any of the following conditions were filtered from the dataset:

  • Question is not a closed question or does not address a clearly defined political issue.

  • No comment was submitted by the candidate or the comment is shorter than 50 characters.

  • Comment starts with “but” or a similar indicator that the comment is not a self-contained statement.

  • Comment contains a URL.

In total, a fifth of the original comments were filtered out.
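The comment-level conditions above can be expressed as a simple predicate. This is an illustrative sketch: the connective list is hypothetical, and the question-level criteria were applied editorially and are not reproduced here.

```python
import re

def keep_comment(comment: str) -> bool:
    """Comment-level filter: drop short, non-self-contained,
    or URL-bearing comments."""
    if len(comment) < 50:
        return False
    if re.search(r"https?://|www\.", comment):
        return False
    # "but" in German / French / Italian indicates the comment
    # is not a self-contained statement.
    if re.match(r"(?i)\s*(aber|mais|ma)\b", comment):
        return False
    return True
```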


Topics

The questions have been organized by the Smartvote editors into categories (such as “Economy”). We further consolidated the pre-defined categories into 12 broad topics (Table 1).


Licensing

The dataset is shared under a CC BY-NC 4.0 license. Copyright remains with www.smartvote.ch.

Given the sensitive nature of the data, we increase the anonymity of the data by hashing the respondents’ IDs. No personal attributes of the respondents, such as their party affiliation, are included in the dataset. We provide a data statement Bender and Friedman (2018) in Appendix B.

3.3 Data Split

Figure 2: Proportion of ‘favor’ labels per question, grouped by topic. While the proportion of favorable answers varies from question to question, it is balanced overall.

We held out the topics “Healthcare” and “Political System” from the training data and created a separate cross-topic test set that contains the questions and answers related to those topics.

Furthermore, in order to test cross-question generalization performance within previously seen topics, we manually selected 16 held-out questions that are distributed over the remaining 10 topics. We selected the held-out questions manually because we wanted to make sure that they are truly unseen and that no paraphrases of the questions are found in the training set.

We designated Italian as a test-only language, since relatively few comments have been written in Italian. From the remaining German and French data we randomly selected a percentage of respondents as validation or as test respondents.

As a result we received one training set, one validation set and four test sets. The sizes of the sets are listed in Table 2. We did not consider test sets that are cross-lingual and cross-target at the same time, as they would have been too small to yield significant results.

3.4 Analysis

Some observations regarding the composition of x-stance can be made.

Class Distribution

Figure 2 visualizes the proportion of ‘favor’ and ‘against’ stances for each target in the dataset. The ratio differs between questions but is relatively equally distributed across the topics. In particular, the questions belonging to the held-out topics (with a ‘favor’ ratio of 49.4%) have a similar class distribution as the questions within other topics (with a ‘favor’ ratio of 50.0%).

Linguistic Properties

Not every question is unique; some questions are paraphrases describing the same political issue. For example, in the 2015 election, the candidates were asked: “Should the consumption of cannabis as well as its possession for personal use be legalised?” Four years later they were asked: “Should cannabis use be legalized?” However, we do not see any need to consolidate those duplicates because they contribute to the diversity of the training data.

We further observe that while some questions in the dataset are quite short, some questions are rather convoluted. For example, a typical long question reads:

Some 1% of direct payments to Swiss agriculture currently go to organic farming operations. Should this proportion be increased at the expense of standard farming operations as part of Switzerland’s 2014-2017 agricultural policy?

Such longer questions might be more challenging to process semantically.


Languages

The x-stance dataset has more German samples than French samples. The language ratio of about 3:1 is consistent across all training and test sets. Given the two languages it is possible to either train two monolingual models or to train a single model in a multi-source setup McDonald et al. (2011). We choose a multi-source baseline because M-Bert is known to benefit from multilingual training data both in a supervised and in a cross-lingual scenario Kondratyuk and Straka (2019).

4 Baseline Experiments

We evaluate two types of baselines to obtain an impression of the difficulty of the task.

4.1 Majority Class Baselines

The first pair of baselines uses the most frequent class in the training set for prediction. Specifically, the global majority class baseline predicts the most frequent class across all training targets while the target-wise majority class baseline predicts the class that is most frequent for a given target question. The latter can only be applied to the supervised test set.
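Both baselines reduce to frequency counting, e.g. (field names are illustrative):

```python
from collections import Counter

def global_majority(train_labels):
    """Most frequent label across all training targets."""
    return Counter(train_labels).most_common(1)[0][0]

def targetwise_majority(train_instances):
    """Most frequent label per target question; only applicable to the
    supervised test set, where every test question was seen in training."""
    per_target = {}
    for inst in train_instances:
        per_target.setdefault(inst["question"], Counter())[inst["label"]] += 1
    return {q: c.most_common(1)[0][0] for q, c in per_target.items()}

# Toy training set:
train = [
    {"question": "Q1", "label": "FAVOR"},
    {"question": "Q1", "label": "FAVOR"},
    {"question": "Q1", "label": "AGAINST"},
    {"question": "Q2", "label": "AGAINST"},
    {"question": "Q2", "label": "AGAINST"},
]
```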

4.2 Multilingual BERT Baseline

Secondly, we fine-tune multilingual Bert (M-Bert) on the task Devlin et al. (2019). M-Bert has been pretrained jointly on 104 languages (https://github.com/google-research/bert/blob/master/multilingual.md) and has established itself as a state of the art for various multilingual tasks Wu and Dredze (2019); Pires et al. (2019). Within the field of stance detection, Bert can outperform both feature-based and other neural approaches in a monolingual English setting Ghosh et al. (2019).


In the context of Bert we interpret the x-stance task as sequence pair classification inspired by natural language inference tasks Bowman et al. (2015). We follow the procedure outlined by Devlin et al. (2019) for such tasks. We designate the question as segment A and the comment as segment B. The two segments are separated with the special token [SEP], and the special token [CLS] is prepended to the sequence. The final hidden state corresponding to [CLS] is then classified by a linear layer.

We fine-tune the full model with a cross-entropy loss, using the AllenNLP library Gardner et al. (2018) as a basis for our implementation.
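The input construction can be sketched independently of any framework; in practice a WordPiece tokenizer produces the subword tokens, and 512 is the maximum sequence length used in our runs:

```python
def encode_pair(question_tokens, comment_tokens, max_len=512):
    """[CLS] question [SEP] comment [SEP], with segment ids marking
    segment A (the question, 0) and segment B (the comment, 1)."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + comment_tokens + ["[SEP]"]
    segments = [0] * (len(question_tokens) + 2) + [1] * (len(comment_tokens) + 1)
    return tokens[:max_len], segments[:max_len]

tokens, segments = encode_pair(["do", "you", "support", "x", "?"], ["yes", "."])
```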


Training Details

We upsampled the ‘favor’ class so that the two classes are balanced when summing over all questions and topics. A maximum sequence length of 512 subwords and a batch size of 16 were chosen for all training runs. We then performed a grid search over the following range of hyperparameters, based on validation accuracy:

  • Learning rate: 5e-5, 3e-5, 2e-5

  • Number of epochs: 3, 4

The grid search was repeated independently for every variant that we tested. Furthermore, the standard recommendations for fine-tuning Bert were used: Adam with β1 = 0.9 and β2 = 0.999; an L2 weight decay of 0.01; a learning rate warmup over the first 10% of the steps; and a linear decay of the learning rate. A dropout probability of 0.1 was set on all layers.
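The search itself is a small loop over the six configurations; `train_and_validate` is a hypothetical stand-in for a full fine-tuning run that returns validation accuracy:

```python
import itertools

LEARNING_RATES = [5e-5, 3e-5, 2e-5]
NUM_EPOCHS = [3, 4]

def grid_search(train_and_validate):
    """Return the (learning rate, epochs) pair with the best validation
    accuracy according to `train_and_validate`."""
    return max(
        itertools.product(LEARNING_RATES, NUM_EPOCHS),
        key=lambda cfg: train_and_validate(*cfg),
    )
```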


Results

                               de     fr     it
Majority class (global)       33.1   34.8   34.4
Majority class (target-wise)  60.8   65.1   62.9
M-Bert                        76.8   76.6   70.2
Table 3: Baseline scores in the cross-lingual setting. No Italian samples were seen during training, making this a case of zero-shot cross-lingual transfer. The scores are reported as the macro-average of the F1-scores for ‘favor’ and for ‘against’.
                               Intra-target          Cross-question        Cross-topic
                               de    fr    Mean      de    fr    Mean      de    fr    Mean
Majority class (global)       33.1  34.8  33.9      36.4  37.9  37.1      32.1  33.8  32.9
Majority class (target-wise)  60.8  65.1  62.9       -     -     -         -     -     -
M-Bert                        76.8  76.6  76.6      68.5  68.4  68.4      68.9  70.9  69.9
Table 4: Baseline scores in the cross-target setting. For each test set we separately report a German and a French score, as well as their harmonic mean.

Table 3 shows the results for the cross-lingual setting. M-Bert performs consistently better than the majority class baselines. Even the zero-shot performance in Italian, while significantly lower than the supervised scores, is much better than the target-wise majority class baseline.

Results for the cross-target setting are given in Table 4. Similar to the cross-lingual setting, M-Bert performs worse in a cross-target setting but easily surpasses the majority class baselines. Furthermore, the cross-question score of M-Bert is slightly lower than the cross-topic score.

4.3 How Important is Consistent Language?

The default setup preserves horizontal language consistency in that the language of the questions always matches the language of the comments. For example, the Italian test instances are combined with the Italian version of the questions, even though during training the model has only ever seen the German and French versions of the questions.

An alternative concept is vertical language consistency, whereby the questions are consistently presented in one language, regardless of the comment. To test whether horizontal or vertical consistency is more helpful, we train and evaluate M-Bert on a dataset variant where all questions are in their English version. We chose English as a lingua franca because it had the largest share of data during the pretraining of M-Bert.

The results are shown in Table 5. While the effect is negligible in most settings, the cross-lingual performance clearly increases when all questions are given in English.

                            Supervised  Cross-Lingual  Cross-Question  Cross-Topic
M-Bert                         76.6         70.2            68.4           69.9
— with English questions       76.1         71.7            68.5           69.4
— with missing questions       73.2         67.1            67.8           69.3
— with missing comments        64.2         60.5            51.1           48.6
— with random questions        56.0         52.5            47.7           48.5
— with random comments         50.7         50.7            48.2           48.7
— with target embeddings       70.1         66.0            68.4           69.0
Table 5: Results for additional experiments. The cross-lingual score is the F1-score on the Italian test set. For the supervised, cross-question and cross-topic settings we report the harmonic mean of the German and French scores.

4.4 How Important are the Segments?

In order to rule out that only the questions or only the comments are necessary to optimally solve the task, we conduct some additional experiments:

  • Only use a single segment containing the comment, removing the questions from the training and test data (missing questions).

  • Only use the question and remove the comment (missing comments).

In both cases the performance decreases across all evaluation settings (Table 5). The loss in performance is much higher when comments are missing, indicating that the comments contain the most important information about stance. As can be expected, the score achieved without comments is only slightly different from the target-wise majority class baseline.

But there is also a loss in performance when the questions are missing, which underlines the importance of pairing both pieces of text. The effect of missing questions is especially strong in the supervised and cross-lingual settings. To illustrate this, we provide in Table A8 some examples of comments that occur with multiple different targets in the training set. Those examples can explain why the target can be essential for disambiguating a stance detection problem. On the other hand, the effect of omitting the questions is less pronounced in the cross-target settings.

The above single-segment experiments tell us that both the comment and the question provide crucial information. But it is possible that the M-Bert model, even though trained on both segments, mainly looks at a single segment at test time. To rule this out, we probe the model with randomized data at test time:

  • Test the model on versions of the test sets where the comments remain in place but the questions are shuffled randomly (random questions). We make sure that the random questions come from the same test set and language as the original questions.

  • Keep the questions in place and randomize the comments (random comments). Again we shuffle the comments only within test set boundaries.

The results in Table 5 show that the performance of the model decreases in both cases, confirming that it learns to take into account both segments.
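The "random questions" probe can be sketched as follows (field names are illustrative): comments, labels and language boundaries stay fixed while the questions are shuffled within each test set and language.

```python
import random

def shuffle_questions(test_set, seed=0):
    """Pair each comment with a randomly drawn question from the same
    test set and language; everything else stays in place."""
    rng = random.Random(seed)
    probe = []
    for lang in sorted({inst["language"] for inst in test_set}):
        subset = [inst for inst in test_set if inst["language"] == lang]
        questions = [inst["question"] for inst in subset]
        rng.shuffle(questions)
        probe += [dict(inst, question=q) for inst, q in zip(subset, questions)]
    return probe

# Toy test set with two languages:
test_set = [
    {"question": f"q{i}", "comment": f"c{i}",
     "language": "de" if i < 3 else "fr", "label": "FAVOR"}
    for i in range(5)
]
probe = shuffle_questions(test_set)
```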

4.5 How Important are Spelled-Out Targets?

Finally we test whether the target really needs to be represented by natural language (e.g. “Do you support X?”). Namely, an alternative is to represent the target with a trainable embedding instead of a question.

In order to fit target embeddings smoothly into our architecture, we represent each target type with a different reserved symbol from the M-Bert vocabulary. Segment A is then set to this symbol instead of a natural language question.

The results for this experiment are listed in the bottom row of Table 5. An M-Bert model that learns target embeddings instead of encoding a question performs clearly worse in the supervised and cross-lingual settings. From this we conclude that spelled-out natural language questions provide important linguistic detail that can help in stance detection.

5 Discussion

The baseline experiments confirm that M-Bert can achieve a reasonable accuracy on x-stance.

Dataset Evaluation Score
SemEval-2016 Ghosh et al. (2019) 75.1
MPCHI Ghosh et al. (2019) 75.6
x-stance this paper 76.6
Table 6: Performance of Bert-like models on different supervised stance detection benchmarks.

To put the supervised score into context we list scores that variants of Bert have achieved on other stance detection datasets in Table 6. It seems that the supervised part of x-stance has a similar difficulty as the SemEval-2016 Mohammad et al. (2016) or MPCHI Sen et al. (2018) datasets on which Bert has previously been evaluated.

On the other hand, in the cross-lingual and cross-target settings, the mean score drops by 6–8 percentage points compared to the supervised setting; while zero-shot transfer is possible to a degree, it can still be improved.

The additional experiments (Table 5) validate the results and show that the sequence-pair classification approach to stance detection is justified.

It is interesting to see what errors the M-Bert model makes. Table A7 presents instances where it predicts the wrong label with a high confidence. These examples indicate that many comments express their stance only on a very implicit level, and thus hint at a potential weakness of the dataset. Because on the voting advice platform the label is explicitly shown to readers in addition to the comments, the comments do not need to express the stance explicitly. Manual annotation could eliminate very implicit samples in a future version of the dataset. However, the sheer size and breadth of the dataset could not realistically be achieved with manual annotation, and, in our view, largely compensates for the implicitness of the texts.

6 Conclusion

We have presented a new dataset for political stance detection called x-stance. The dataset extends over a broad range of topics and issues regarding national Swiss politics. This diversity of topics opens up an opportunity to further study multi-target learning. Moreover, being partly Swiss Standard German, partly French and Italian, the dataset promotes a multilingual approach to stance detection.

By compiling formal commentary that politicians have written on political questions, we add a new text genre to the field of stance detection. We also propose a question–answer format that allows us to condition stance detection models on a target naturally.

Our baseline results with multilingual Bert show that the model has some capability to perform zero-shot transfer to unseen languages and to unseen targets (both within a topic and to unseen topics). However, there is some gap in performance that future work could address. We expect that the x-stance dataset could furthermore be a valuable resource for fields such as argument mining, argument search or topic classification.


Acknowledgments

This work was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727). We would like to thank Isabelle Augenstein for helpful feedback.


  • I. Augenstein, T. Rocktäschel, A. Vlachos, and K. Bontcheva (2016) Stance detection with bidirectional conditional encoding. In

    Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

    Austin, Texas, pp. 876–885. External Links: Link, Document Cited by: §2.
  • E. M. Bender and B. Friedman (2018) Data statements for natural language processing: toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics 6, pp. 587–604. External Links: Document Cited by: §3.2.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 632–642. External Links: Link, Document Cited by: §4.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §1, §2, §4.2, §4.2.
  • M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. Zettlemoyer (2018) AllenNLP: a deep semantic natural language processing platform. In

    Proceedings of Workshop for NLP Open Source Software (NLP-OSS)

    Melbourne, Australia, pp. 1–6. External Links: Link, Document Cited by: §4.2.
  • S. Ghosh, P. Singhania, S. Singh, K. Rudra, and S. Ghosh (2019) Stance detection in web and social media: a comparative study. In International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 75–87. Cited by: §2, §4.2, Table 6.
  • J. Graën, T. Kew, A. Shaitarova, and M. Volk (2019) Modelling large parallel corpora: the zurich parallel corpus collection. In Proceedings of the 7th Workshop on Challenges in the Management of Large Corpora (CMLC), P. Bański, A. Barbaresi, H. Biber, E. Breiteneder, S. Clematide, M. Kupietz, H. Lüngen, and C. Iliadi (Eds.), pp. 1–8. External Links: Link, Document Cited by: footnote 3.
  • D. Kondratyuk and M. Straka (2019) 75 languages, 1 model: parsing universal dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2779–2795. External Links: Link, Document Cited by: §3.4.
  • D. Küçük and F. Can (2020) Stance detection: a survey. ACM Comput. Surv. 53 (1). External Links: ISSN 0360-0300, Link, Document Cited by: §1, §1.
  • M. Lai, V. Patti, G. Ruffo, and P. Rosso (2018) Stance evolution and Twitter interactions in an Italian political debate. In International Conference on Applications of Natural Language to Information Systems, pp. 15–27. Cited by: §2.
  • M. Lai (2019) On language and structure in polarized communities. Ph.D. Thesis, Universitat Politècnica de València. Cited by: §2.
  • R. McDonald, S. Petrov, and K. Hall (2011) Multi-source transfer of delexicalized dependency parsers. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK, pp. 62–72. External Links: Link Cited by: §3.4.
  • S. Mohammad, S. Kiritchenko, P. Sobhani, X. Zhu, and C. Cherry (2016) A dataset for detecting stance in tweets. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, pp. 3945–3952. External Links: Link Cited by: §1, §3.1, §5.
  • S. Mohammad, S. Kiritchenko, P. Sobhani, X. Zhu, and C. Cherry (2016) SemEval-2016 task 6: detecting stance in tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, California, pp. 31–41. External Links: Link, Document Cited by: §2, §3.1.
  • T. Pires, E. Schlinger, and D. Garrette (2019) How multilingual is multilingual BERT?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4996–5001. External Links: Link, Document Cited by: §4.2.
  • A. Sen, M. Sinha, S. Mannarswamy, and S. Roy (2018) Stance classification of multi-perspective consumer health information. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, pp. 273–281. Cited by: §5.
  • N. Shuyo (2010) Language detection library for Java. External Links: Link Cited by: §3.2.
  • P. Sobhani, D. Inkpen, and X. Zhu (2017) A dataset for multi-target stance detection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 551–557. External Links: Link Cited by: §1, §2.
  • M. Taulé, M. A. Martí, F. Rangel, P. Rosso, C. Bosco, and V. Patti (2017) Overview of the task on stance and gender detection in tweets on Catalan independence at IberEval 2017. In 2nd Workshop on Evaluation of Human Language Technologies for Iberian Languages, IberEval 2017, Vol. 1881, pp. 157–177. Cited by: §1, §2.
  • M. Taulé, F. Rangel, M. A. Martí, and P. Rosso (2018) Overview of the task on multimodal stance detection in tweets on Catalan #1Oct referendum. In 3rd Workshop on Evaluation of Human Language Technologies for Iberian Languages, IberEval 2018, Vol. 2150, pp. 149–166. Cited by: §1, §2.
  • J. Thurman and U. Gasser (2009) Three case studies from Switzerland: smartvote. Berkman Center Research Publications. External Links: Link Cited by: §3.2.
  • M. Wojatzki, E. Ruppert, S. Holschneider, T. Zesch, and C. Biemann (2017) GermEval 2017: Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany, pp. 1–12. Cited by: §2.
  • S. Wu and M. Dredze (2019) Beto, bentz, becas: the surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 833–844. External Links: Link, Document Cited by: §4.2.

Appendix A Examples

Question: Befürworten Sie eine vollständige Liberalisierung der Geschäftsöffnungszeiten?
[Are you in favour of a complete liberalisation of business hours for shops?]
Comment: Ausser Sonntag. Dies sollte ein Ruhetag bleiben können.
[Except Sunday. That should remain a day of rest.]
Gold label: FAVOR (predicted probability: 0.001)

Question: Soll die Schweiz innerhalb der nächsten vier Jahre EU-Beitrittsverhandlungen aufnehmen?
[Should Switzerland embark on negotiations in the next four years to join the EU?]
Comment: In den nächsten vier Jahren ist dies wohl unrealistisch.
[For the next four years this is probably unrealistic.]
Gold label: FAVOR (predicted probability: 0.005)

Question: Befürworten Sie einen Ausbau des Landschaftsschutzes?
[Are you in favour of extending landscape protection?]
Comment: Wenn es darum geht erneuerbare Energien zu fördern, ist sogar eine Lockerung angebracht.
[When it comes to promoting renewable energy, even a relaxation is appropriate.]
Gold label: AGAINST (predicted probability: 0.006)

Question: La Suisse devrait-elle engager des négociations pour un accord de libre échange avec les Etats-Unis?
[Should Switzerland start negotiations with the USA on a free trade agreement?]
Comment: Il faut cependant en parallèle veiller à ce que la Suisse ne soit pas mise de côté par les Etats-Unis !
[At the same time it must be ensured that Switzerland is not sidelined by the United States!]
Gold label: AGAINST (predicted probability: 0.010)

Table A7: Classification errors where the predicted probability of the correct label is especially low. The examples are taken from the validation set.
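The probabilities in Table A7 are the model's predicted probabilities for the gold label. As an illustrative sketch (not the paper's code), such worst-case errors can be surfaced by ranking validation examples in ascending order of the gold label's softmax probability; the comments, logits, and label indices below are hypothetical:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

LABELS = ["FAVOR", "AGAINST"]

# Hypothetical (comment, logits, gold label index) triples for a validation set.
examples = [
    ("Ausser Sonntag. Dies sollte ein Ruhetag bleiben können.", [-3.4, 3.5], 0),
    ("In den nächsten vier Jahren ist dies wohl unrealistisch.", [-2.6, 2.7], 0),
]

# Rank ascending by the predicted probability of the gold label:
# the head of the list contains the model's most confident errors.
ranked = sorted(
    (softmax(logits)[gold], LABELS[gold], text)
    for text, logits, gold in examples
)
for prob, gold_label, text in ranked:
    print(f"{prob:.3f}  {gold_label}  {text}")
```

The ranking criterion mirrors how the table was constructed: a low probability for the gold label means the model assigned high confidence to the wrong stance.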

Comment: Ich will offene Grenzen für Waren und selbstverantwortliche mündige Bürger. Der Staat hat kein Recht, uns einzuschränken.
[I want open borders for goods and responsible citizens. The state has no right to restrict us.]
Favorable towards: Soll die Schweiz mit den USA Verhandlungen über ein Freihandelsabkommen aufnehmen?
[Should Switzerland start negotiations with the USA on a free trade agreement?]
Against: Soll die Schweiz das Schengen-Abkommen mit der EU kündigen und wieder verstärkte Personenkontrollen direkt an der Grenze einführen?
[Should Switzerland terminate the Schengen Agreement with the EU and reintroduce increased identity checks directly on the border?]

Comment: Hier gilt der Grundsatz der Eigenverantwortung und Selbstbestimmung des Unternehmens!
[The principle of personal responsibility and corporate self-regulation applies here!]
Favorable towards: Sind Sie für eine vollständige Liberalisierung der Ladenöffnungszeiten?
[Are you in favour of the complete liberalization of shop opening times?]
Against: Würden Sie die Einführung einer Frauenquote in Verwaltungsräten börsenkotierter Unternehmen befürworten?
[Would you support the introduction of a women's quota for the boards of directors of listed companies?]

Table A8: Two comments that each imply a positive stance towards one target issue but a negative stance towards another. Such cases occur in the dataset because some respondents copy-pasted comments across questions. These examples are taken from the training set.

Appendix B Data Statement

Curation rationale

To study the automatic detection of stances on political issues, questions and candidate responses were downloaded from the voting advice application smartvote.ch. To reduce variability, mainly data pertaining to national-level issues were included.

Language variety

The training set consists of questions and answers in Swiss Standard German and Swiss French (74.1% de-CH; 25.9% fr-CH). The test sets also contain questions and answers in Swiss Italian (67.1% de-CH; 24.7% fr-CH; 8.2% it-CH). The questions have also been translated into English.
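The percentages above can be reproduced from per-instance language tags (in the dataset these were assigned with a language-detection library; Shuyo, 2010). A minimal sketch with hypothetical counts:

```python
from collections import Counter

# Hypothetical per-instance language tags for a training split; the real
# dataset's tags come from automatic language detection (Shuyo, 2010).
train_langs = ["de-CH"] * 741 + ["fr-CH"] * 259

counts = Counter(train_langs)
total = sum(counts.values())

# Share of each language variety, rounded to one decimal place.
shares = {lang: round(100 * n / total, 1) for lang, n in counts.items()}
for lang, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{lang}: {pct}%")
# → de-CH: 74.1%
#   fr-CH: 25.9%
```

The same computation over the test sets would additionally yield an it-CH share, since Swiss Italian appears only there.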

Speaker demographic (answers)

  • Candidates for communal, cantonal or national elections in Switzerland who have filled out an online questionnaire.

  • Age: 18 or older – mixed.

  • Gender: Unknown – mixed.

  • Race/ethnicity: Unknown – mixed.

  • Native language: Unknown – mixed.

  • Socioeconomic status: Unknown – mixed.

  • Different speakers represented: 7581.

  • Presence of disordered speech: Unknown.

Speech situation

  • The questions were edited and translated by political scientists for a public voting advice website.

  • The answers were written between 2011 and 2020 by the users of the website.

Text characteristics

Questions, answers, arguments and comments regarding political issues.