ApplicaAI at SemEval-2020 Task 11: On RoBERTa-CRF, Span CLS and Whether Self-Training Helps Them

05/16/2020, by Dawid Jurkiewicz et al.

This paper presents the winning system for the propaganda Technique Classification (TC) task and the second-placed system for the propaganda Span Identification (SI) task. The purpose of TC task was to identify an applied propaganda technique given propaganda text fragment. The goal of SI task was to find specific text fragments which contain at least one propaganda technique. Both of the developed solutions used semi-supervised learning technique of self-training. Interestingly, although CRF is barely used with transformer-based language models, the SI task was approached with RoBERTa-CRF architecture. An ensemble of RoBERTa-based models was proposed for the TC task, with one of them making use of Span CLS layers we introduce in the present paper. In addition to describing the submitted systems, an impact of architectural decisions and training schemes is investigated along with remarks regarding training models of the same or better quality with lower computational budget. Finally, the results of error analysis are presented.




Ablation Studies

Since different random initializations or data orderings can result in considerably different scores (see e.g. junczys-dowmunt-etal-2018-approaching or the recent analysis of dodge2020finetuning), models with different random seeds were trained for the purposes of the ablation studies. In the case of the SI task, results were evaluated on the original development set, whereas in the case of TC, where fewer data points are available, we decided to use cross-validation instead.

Span Identification

Figure: Performance of RoBERTa with and without CRF as a function of the percentage of the train set available. Values above 100% indicate that self-training was performed. Mean FLC-F1 and standard deviation across 5 runs for each percentage.

Table: Best scores on the dev set achieved with the RoBERTa large model on the SI task (columns: CRF, Self-train, FLC-F1 with std and max). Mean, standard deviation and maximum across 10 runs with different random seeds. Numbers in brackets indicate how many self-training iterations were used.
Table: Impact of hypothetically lowering the batch size during self-training or enlarging the batch size during initial training, as well as of enabling or disabling both hidden and attention dropouts (columns: Batch, Dropouts, Self-train, CRF, FLC-F1). Change between means across 10 runs with different random seeds.

Models with different random seeds were trained for 60K steps, with an evaluation performed every 2K steps. This is equivalent to approximately 30 epochs and per-epoch validation in a scenario without data generated during the self-training procedure. The table of best dev-set scores summarizes the best results achieved across 10 runs for each configuration. CRF has a noticeable positive impact on the FLC-F1 scores achieved without self-training in the setting we consider: the presence of a CRF layer correlates positively with the score, and the difference is significant according to the Kruskal–Wallis test. Unless stated otherwise, all further statistical statements within this section were confirmed with statistically significant positive Spearman rank correlations and Kruskal–Wallis test results; differences in variance were confirmed using Bartlett's test, and a fixed significance level was assumed throughout. The statistically significant influence of CRF disappears when self-training is investigated. After the first self-training iteration, whether or not CRF was used, a considerable increase in the median score can be observed; self-trained models with and without the CRF layer, however, are indistinguishable. The improvement offered by further self-training iterations is less evident, but it is statistically significant: in particular, they slightly improve mean scores and decrease variance (see the best-scores table). As for the latter, CRF-extended models generally exhibit higher variance in the scores achieved across runs. The hyperparameter table analyzes the importance of particular hyperparameters: whereas a smaller batch size and dropout are beneficial for the initial training without noisy data, they impact the self-training phase negatively. Unsurprisingly, the largest negative impact is observed when dropout is disabled during training on the small amount of manually annotated data. The figure above illustrates the scores achieved by models trained for the same number of steps on subsets or supersets of the manually annotated data. The CRF layer has a positive impact regardless of the percentage of the train set available. Once again, a large variance in the scores of CRF-equipped models can be observed, although it is substantially reduced by increasing the batch size. Interestingly, the figure suggests that the proportion of automatically annotated data we used might be suboptimal, since it was equivalent to around 3000% in the chart's convention. One may hypothesize that better scores would be achieved by a model trained with a different gold-to-silver proportion.

Technique Classification

A 6-fold cross-validation was conducted; the results are presented in the cross-validation table below. Folds were created by mixing the training and development datasets, then shuffling them and splitting into even folds. Parameters were set according to the hyperparameter tables, whereas the experiments were carried out as follows. Each approach from the cross-validation table was separately evaluated on each fold using the micro-averaged F1 metric. Then, for each approach, the average score and standard deviation were obtained using the 6 scores from the folds.
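The fold construction just described (mix train and dev, shuffle, split evenly) can be sketched as follows; the seed value and the round-robin slicing are our illustrative choices, not necessarily those used in the experiments.

```python
import random

def make_folds(train, dev, k=6, seed=13):
    """Mix training and development data points, shuffle them, and split
    into k nearly even folds (sizes differ by at most one)."""
    pool = list(train) + list(dev)
    random.Random(seed).shuffle(pool)
    # Round-robin slicing keeps fold sizes as balanced as possible.
    return [pool[i::k] for i in range(k)]

# Toy example: 500 "train" items and 100 "dev" items split into 6 folds.
folds = make_folds(range(500), range(500, 600), k=6)
```

Each approach is then trained on five folds and scored on the held-out one, cycling through all six.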

Table: Average 6-fold cross-validation scores on the TC task with the micro-averaged F1 metric (columns: #, Re-weight, Span CLS, Self-train, Micro-F1 with std; rows for approaches (1)–(8)).
Table: Average scores achieved with ensembles of the individual models described in the table above, e.g. (1)(6), (1)(2)(3)(5), (1)(5)(8), (2)(4)(7), (1)(4)(7), (1)(4)(7)(8), (1)(2)(4)(5)(7). Micro-averaged F1 metric.

Moreover, all 247 possible ensembles (the number of subsets of an 8-element set with cardinality greater than one) were evaluated in the same fashion as in the cross-validation experiments. The ensemble table shows the performance achieved by selected combinations when simple averaging of the probabilities returned by individual models was used as the final prediction. Due to the large number of results available, it is beneficial to conduct a statistical analysis in order to formulate remarks regarding the general trends observed. Each component model of the ensemble was treated as a categorical variable with respect to the ensemble score. The Spearman rank correlation between the presence of an ensemble component and the achieved score shows that adding a model to the ensemble correlates with a significant increase in score, except for model (6) (see the correlation table). Boxplots from the figure below lead to the same conclusions; the Kruskal–Wallis test and the Boruta algorithm [kursa2010boruta], which we used in addition, support these findings too. Re-weighting seems to be beneficial only when ensembled with other models. An interesting finding is that Span CLS offers a small but consistent increase in performance, both in the individual models and when used in ensembles. Bear in mind that we outperformed the second-placed team by a narrow margin, so an improvement of a point or half a point is not negligible. What is most conspicuous, however, is that the self-training-based solutions seem to be actually detrimental in the case of the TC task. This damaging effect can potentially be attributed to the fact that the data automatically generated there accumulate errors from both span identification and span classification. Another possible explanation is that far fewer data points are available for the span classification task than for span identification attempted as a sequence labeling task. The latter would be somewhat consistent with what was found in the field of Neural Machine Translation, where the use of the back-translation technique in a low-resource setting was determined to be harmful [Edunov2018UnderstandingBA]. On the other hand, self-training has a positive, statistically significant impact on the score when used in ensembles (see the boxplot figure and the correlation table). This is not surprising, as the beneficial impact of combining individual estimates has been observed in many disciplines and has been known since the times of Laplace.
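The enumeration of all 247 ensembles and the simple probability averaging described above can be sketched as follows; the two toy class distributions are invented for illustration.

```python
from itertools import combinations

MODELS = list(range(1, 9))  # the eight individual models (1)-(8)

# Every subset with more than one element: sum of C(8, r) for r = 2..8 gives 247.
ensembles = [c for r in range(2, len(MODELS) + 1)
             for c in combinations(MODELS, r)]

def average_probs(prob_vectors):
    """Simple averaging of the class probabilities returned by component
    models; the final prediction is the argmax of the averaged vector."""
    n = len(prob_vectors)
    return [sum(column) / n for column in zip(*prob_vectors)]

# Two hypothetical 3-class distributions from two component models:
avg = average_probs([[0.6, 0.3, 0.1],
                     [0.2, 0.6, 0.2]])
pred = max(range(len(avg)), key=avg.__getitem__)
```

Averaging probability vectors (rather than hard votes) keeps per-class confidence information, which is what makes the statistical analysis of component contributions possible.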


Figure: Impact that adding a given model to the ensemble has on the mean scores from different folds. Comparison of results with and without it present in the tested combination.
Table: Spearman's rho between the presence of an ensemble component (models (1)–(8) from the cross-validation table) and the score achieved by the ensemble; results not significant at the assumed significance level are marked.

Error analysis

In addition to providing an overview of problematic classes, the question of which shallow features influence the score and worsen the results was addressed. This problem was analyzed in a no-box manner, as proposed by gralinski-etal-2019-geval. The main idea is to create two dataset subsets for each feature considered (one for data points with the feature present and one for data points without it), rank the subsets by per-item scores, and use the Mann–Whitney U rank test to determine whether there is a non-accidental difference between the subsets. A low p-value indicates that the feature reduces the evaluation score of the model.
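A minimal pure-Python version of this check might look as follows, using the normal approximation to the U distribution; the per-item scores are invented for illustration.

```python
import math

def mann_whitney_less(with_feature, without):
    """One-sided Mann–Whitney U test that per-item scores WITH the feature
    tend to be LOWER than scores without it (normal approximation)."""
    n1, n2 = len(with_feature), len(without)
    # U counts pairs where the with-feature score is below the other one
    # (ties count as one half).
    u = sum(1.0 if x < y else 0.5 if x == y else 0.0
            for x in with_feature for y in without)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    # P(U >= u) under the null hypothesis, via the normal CDF.
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical per-item scores for spans with and without some shallow feature:
p_value = mann_whitney_less([0.20, 0.15, 0.25, 0.10, 0.30],
                            [0.70, 0.85, 0.75, 0.90, 0.80])
```

A small p-value here would flag the feature as one whose presence co-occurs with worse per-item scores, which is exactly how the worsening features in the tables below were selected.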

Span Identification

Since the FLC-F1 metric used in the SI task gives non-zero scores for partial matches, it is interesting to analyze the proportion of fully missed versus partially identified spans. The table below investigates this question, broken down by the propaganda technique used.

Table: Proportion of partially and fully identified spans (SI task) depending on the propaganda technique used (Authority, Fear, Bandwagon, B&W, Simplification, Doubt, Minimization, Flag-Waving, Loaded, Labeling, Repetition, Slogans, Clichés, Strawman; rows: identified subsequence, fully identified %, not identified, number of instances; overall values: 43, 23, 33, 1063). All the experiments were conducted on the original development set.

Our system was unable to identify one third of the expected spans, whereas a majority of those identified were partial matches. The spans easiest to identify in text represented the Flag-Waving, Appeal to fear/prejudice and Slogans techniques, whereas Bandwagon, Doubt and the group of {Whataboutism, Strawman, Red Herring} turned out to be the hardest. The highest proportion of fully identified spans was achieved for Flag-Waving, Repetition and Loaded Language. Unfortunately, it is not possible to investigate precision in this manner without training separate models for each label or estimating one-to-one alignments between output and expected spans. Further investigation of problematic cases in the paradigm of no-box debugging with the GEval tool [gralinski-etal-2019-geval] revealed the most worsening features, that is, features whose presence impacts the span identification evaluation metrics negatively (see the SI features table). It seems that our system tends to return ranges without adjacent punctuation. This is the case in sentences such as The new CIA Director Haspel, who ‘tortured some folks,’ probably can’t travel to the EU, where only the quoted text was returned, whereas the annotation assumes it should be returned together with the apostrophes and the comma. This remark can be used to slightly improve the overall results with simple post-processing. The returned and conjunction feature refers to cases where and connects two propaganda spans: the system frequently returns them as a single span, contrary to what is expected in the gold standard.
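The simple post-processing suggested above — growing a predicted span to absorb directly adjacent quotes, apostrophes and commas — could be sketched as follows; the exact punctuation set is our illustrative choice.

```python
# Punctuation a predicted span is allowed to absorb at its edges (illustrative).
EDGE_PUNCT = set("'’‘\",.")

def expand_span(text, start, end):
    """Grow the half-open range [start, end) to include punctuation
    characters directly adjacent to the predicted span."""
    while start > 0 and text[start - 1] in EDGE_PUNCT:
        start -= 1
    while end < len(text) and text[end] in EDGE_PUNCT:
        end += 1
    return start, end

sentence = ("The new CIA Director Haspel, who ‘tortured some folks,’ "
            "probably can’t travel to the EU")
inner = "tortured some folks"
s = sentence.index(inner)
s2, e2 = expand_span(sentence, s, s + len(inner))
```

On the example from the error analysis, the bare quoted text gets extended to include the surrounding apostrophes and the trailing comma, matching the gold annotation.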

Technique Classification

Figure: Confusion matrix of the submitted system predictions normalized over the number of true labels. Rows represent the true labels and columns the predicted ones (TC).
Table: Selected shallow features one may hypothesize impact evaluation scores negatively (SI), with their counts and p-values: question expected, dot, quotation, exclamation, and output.
Table: Selected shallow features one may hypothesize impact evaluation scores negatively (TC), with their counts and p-values: comma inside, CIA before, according to after, quotation before.

The figure above presents the normalized confusion matrix of the submitted system's predictions. Interestingly, there are a few pairs that were commonly confused. Loaded Language and Black-and-white Fallacy were frequently misclassified as Appeal to fear/prejudice. Similarly, Causal Oversimplification was often predicted as Doubt, and Clichés as Loaded Language. The most worsening features are presented in the TC features table. One of the frequent predictors of low accuracy is a comma character present within the span to be classified; this can probably be attributed to the fact that its presence is a good indicator of the span's linguistic complexity. Another determinant of inefficiency turned out to be negation: around half of the sentences containing the word not were misclassified by the system. The suggested features of a quotation mark before the span and the digram according to after the span are related to reported or indirect speech. The worsening effect of the other features is not as easy to explain as in the cases mentioned above. Moreover, it seems there is no obvious way of improving the final results with these findings, and a more detailed analysis might be required.
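The row normalization used for the confusion matrix (dividing each row of counts by its true-label total, so each row becomes a distribution over predicted labels) can be sketched as:

```python
def normalize_rows(cm):
    """Normalize each confusion-matrix row over the number of true labels:
    every row of raw counts becomes a distribution over predicted labels."""
    normalized = []
    for row in cm:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized

# Toy 3-class counts: rows are true labels, columns are predicted labels.
cm_norm = normalize_rows([[8, 1, 1],
                          [2, 6, 2],
                          [0, 5, 5]])
```

Normalizing by row makes classes of very different sizes comparable, which is what lets per-class confusion patterns such as Clichés being predicted as Loaded Language stand out.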

Discussion and Summary

The winning system for the propaganda Technique Classification (TC) task and the second-placed system for the propaganda Span Identification (SI) task have been described. Both of the developed solutions used the semi-supervised learning technique of self-training. Although CRF is rarely used with Transformer-based language models, the SI task was approached with a RoBERTa-CRF architecture. An ensemble of RoBERTa-based models has been proposed for the TC task, with one of them making use of the Span CLS layers we introduce in the present paper. The analyses conducted afterwards can be applied in a rather straightforward manner to further improve the scores for both the SI and TC tasks, because some of the decisions we made given a lack of, or uncertain, information turned out during the post-hoc inquiry to be sub-optimal. These include the proportion of data from self-training in the SI task, as well as the possibility of providing a better ensemble in the case of TC. The ablation studies conducted, however, have some limitations. The same subset of OpenWebText was used in experiments conducted within one self-training iteration; this means the random seed did not impact which sentences were used during the first, second and third self-training phases, and in each we were manipulating only the data order. Moreover, the analysis we reported was limited to a few hyperparameter combinations, and no extensive hyperparameter space search was performed. Finally, only one rather simple method of cost-sensitive re-weighting was tested, and there is a good chance it was sub-optimal. It would be interesting to investigate other schemes, such as the one proposed by Cui2019ClassBalancedLB. The error analysis revealed propaganda techniques commonly confused in the TC task, as well as techniques we were unable to detect effectively within the SI input articles.
In addition to providing an overview of problematic classes, the question of which shallow features influence the score and worsen the results was addressed. A few of these were identified, and our remarks can be used to slightly improve the results on the SI task with simple post-processing. This is not the case for the TC task, where we were unable to propose a way to improve the final results with our findings. An interesting future research direction seems to be the application of the CRF layer and Span CLS to Transformer-based language models in other tasks, outside the propaganda detection problem. These may include Named Entity Recognition in the case of RoBERTa-CRF, and aspect-based sentiment analysis, which can be viewed through the lens of span classification with the Span CLS we proposed.


Outro. The developed systems were used to identify and classify spans in the present paper in order to detect fragments one may suspect of representing one or more propaganda techniques. Unfortunately for the entertainment value of this work, no such fragments were identified by our SI model.