BUT-FIT at SemEval-2019 Task 7: Determining the Rumour Stance with Pre-Trained Deep Bidirectional Transformers

02/25/2019 · Martin Fajcik et al., Brno University of Technology

This paper describes our system submitted to SemEval 2019 Task 7: RumourEval 2019: Determining Rumour Veracity and Support for Rumours, Subtask A (Gorrell et al., 2019). The challenge focused on classifying whether posts from Twitter and Reddit support, deny, query, or comment on a hidden rumour, the truthfulness of which is the topic of an underlying discussion thread. We formulate the problem as stance classification, determining the rumour stance of a post with respect to the previous thread post and the source thread post. The recent BERT architecture was employed to build an end-to-end system which reached an F1 score of 61.67% in the competition, without any hand-crafted features, only 0.2% behind the winner.


1 Introduction

Fighting false rumours on the internet is a tedious task. Sometimes, even understanding what a rumour is actually about may prove challenging, and only then can one judge its veracity with appropriate evidence. The works of Ferreira and Vlachos (2016); Enayet and El-Beltagy (2017) focused on predicting rumour veracity in thread discussions. These works indicated that the veracity is correlated with the stances of the discussion participants towards the rumour. Following this assumption, the participants of SubTask A in the SemEval competition Task 7 were asked to classify whether the stance of each post in a given Twitter or Reddit thread supports, denies, queries or comments on a hidden rumour. Potential applications of such a function are wide, ranging from the analysis of popular events (political discussions, the Academy Awards, etc.) to quickly disproving fake news during disasters.

Stance classification (SC) in its traditional form is concerned with determining the attitude of a source text towards a target text Mohammad et al. (2016) and it has been studied thoroughly for discussion threads Walker et al. (2012); Hasan and Ng (2013); Chuang and Hsieh (2015). However, the objective of SemEval 2019 Task 7 is to determine the stance towards a hidden rumour which is not explicitly given (it can often be inferred from the source post of the discussion – the root of the tree-shaped discussion thread – as demonstrated in Figure 1). The competitors were asked to classify the stance of the source post itself too.

Figure 1: An example of a discussion's source post denying the actual rumour, which is present in the source post – annotated in red italics.

The approach followed in our work builds on recent advances in language representation models. We fine-tune the pre-trained end-to-end Bidirectional Encoder Representations from Transformers (BERT) model Devlin et al. (2018), while using the discussion’s source post, target’s previous post and the target post itself as inputs to determine the rumour stance of the target post. Our implementation is available online.111www.github.com/MFajcik/RumourEval2019

2 Related Work

Previous SemEval competitions: In recent years, there were two SemEval competitions targeting stance classification. The first one focused on the setting in which the actual rumour was provided Mohammad et al. (2016). The organizers of SemEval-2016 Task 6 prepared a benchmarking system based on an SVM using hand-made features and word embeddings from their previous system for sentiment analysis Mohammad et al. (2013), outperforming all the challenge participants.

The second was the previous RumourEval competition, won by a system based on word vectors, hand-crafted features2 and an LSTM Hochreiter and Schmidhuber (1997) summarizing information over the discussion's branches Kochkina et al. (2017). Other submissions were based on similar hand-crafted features Singh et al. (2017); Wang et al. (2017); Enayet and El-Beltagy (2017), on features derived from sets of words determining language cues such as Belief or Denial Bahuleyan and Vechtomova (2017), on post-processing via rule-based heuristics after a feature-based classification Srivastava et al. (2017), on Convolutional Neural Networks (CNNs) with rules Lozano et al. (2017), or on end-to-end CNNs that jointly learnt word embeddings Chen et al. (2017).

2The features included: a flag indicating whether a tweet is the source tweet of a conversation, the length of the tweet, an indicator of the presence of URLs and images, punctuation, the cosine distance to the source tweet and to all other tweets in the conversation, the count of negation and swear words, and an average of the word vectors corresponding to the tweet.

End-to-End approaches: Augenstein et al. (2016) encode the target text with a bidirectional LSTM (BiLSTM) conditioned on the source text, and empirically show that the conditioning on the source text matters. Du et al. (2017) propose target-augmented embeddings – embeddings concatenated with an average of the source text embeddings – and use these to compute an attention over the target embeddings previously transformed via a BiLSTM. Mohtarami et al. (2018) propose an architecture that encodes the source and the target text via an LSTM and a CNN separately, then uses a memory network together with a similarity matrix to capture the similarity between the source and the target text, and infers a fixed-size vector suitable for the stance prediction.

3 Dataset

split S D Q C Total
train 925 378 395 3519 5217
in % 18 7 8 67
dev 102 82 120 1181 1485
in % 7 6 8 80
test 157 101 93 1476 1827
in % 9 6 5 81
Table 1: Histogram and distribution of examples over the classes in the train/dev/test dataset splits. The individual examples belong to 327/38/81 train/dev/test tree-structured discussions.

The provided dataset was collected from Twitter and Reddit tree-shaped discussions. The stance labels were obtained via crowdsourcing. The Twitter discussions are based on recent popular topics – the Sydney siege, the Germanwings crash, etc. – and 9 topics in total are covered in the training data. The Twitter part of the test data contains different topics. The Reddit discussions cover various topics and are, in most cases, not related to each other. We provide a deeper insight into the dataset in Appendix A.1.

4 BUT-FIT’s System Description

4.1 Preprocessing

We replace URLs and mentions with special tokens using the tweet-preprocessor library333https://github.com/s/preprocessor. We use spaCy444https://spacy.io/ to split each post into sentences and add a special token to terminate each sentence. Then we use the tokenizer that comes with the Hugging Face PyTorch re-implementation of BERT555https://github.com/huggingface/pytorch-pretrained-BERT. The tokenizer lowercases the input and applies the WordPiece encoding Wu et al. (2016) to split input words into the most frequent n-grams present in the pre-training corpus, effectively representing text at the sub-word level while keeping a vocabulary of only 30,000 tokens.
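The following is a minimal sketch of this preprocessing pipeline, assuming the tweet-preprocessor, spaCy and pytorch-pretrained-BERT packages named above; the sentence-terminating token "[EOS]" is a placeholder, since the exact special token is not specified here.

```python
# A minimal sketch of the preprocessing pipeline (not the exact submission code).
# Assumes the tweet-preprocessor, spaCy and pytorch-pretrained-BERT packages.
import preprocessor as twp  # tweet-preprocessor (github.com/s/preprocessor)
import spacy
from pytorch_pretrained_bert import BertTokenizer

twp.set_options(twp.OPT.URL, twp.OPT.MENTION)  # replace URLs and mentions only
nlp = spacy.load("en_core_web_sm")
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")  # lowercasing WordPiece

EOS = "[EOS]"  # placeholder; the actual sentence-terminating token is an assumption


def preprocess_post(text: str) -> list:
    """Clean one post and return its WordPiece tokens."""
    cleaned = twp.tokenize(text)  # URLs -> $URL$, mentions -> $MENTION$
    sentences = [s.text.strip() for s in nlp(cleaned).sents]
    terminated = f" {EOS} ".join(sentences) + f" {EOS}"
    return tokenizer.tokenize(terminated)  # sub-word units from a 30k vocabulary
```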

4.2 Model

Figure 2: An architecture of BUT-FIT's system. The text segment containing the source and previous post is annotated with green color; the segment that contains the target post is annotated in blue italics. The input representation is obtained by summing the token, segment and positional input embedding matrices, with L being the input length and d the input dimensionality. The input is passed N times through the transformer encoder. Finally, the [CLS]-token level output is fed via two dense layers yielding the class predictions.

Following the recent trend in transfer learning from language models (LM), we employ the pre-trained BERT model. The model is first trained on the concatenation of BooksCorpus (800M words) Zhu et al. (2015) and English Wikipedia (2,500M words) using a multi-task objective consisting of LM and machine comprehension (MC) sub-objectives. The LM objective aims at predicting the identities of 15% randomly masked tokens present at the input666The explanation of token masking is simplified and we refer readers to the details in the original paper Devlin et al. (2018). Given two sentences from the corpus, the MC objective is to classify whether the second sentence follows the first sentence in the corpus; the second sentence is replaced with a random one in half of the cases. During the pre-training, the input consists of two documents, each represented as a sequence of tokens, divided by the special [SEP] token and preceded by the [CLS] token used by the MC objective, i.e. [CLS] document A [SEP] document B [SEP]. The input tokens are represented via jointly learned token embeddings, segment embeddings capturing whether a word belongs to document A or document B, and positional embeddings, since self-attention is a position-invariant operation (see Devlin et al. (2018) for details).
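As an illustration of this input representation, the sketch below sums jointly learned token, segment and positional embedding matrices; it is a simplified stand-in for the actual BERT implementation, with dimensions chosen to match the large model.

```python
# Illustrative sketch of BERT's input representation (a stand-in, not the
# library internals): the input embedding is the sum of token, segment and
# positional embeddings.
import torch
import torch.nn as nn


class BertInputEmbeddings(nn.Module):
    def __init__(self, vocab_size=30000, max_len=512, d_model=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)  # token embeddings
        self.seg = nn.Embedding(2, d_model)           # segment (document A/B) embeddings
        self.pos = nn.Embedding(max_len, d_model)     # positional embeddings

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: [batch, seq_len] integer tensors
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)
```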

In our solution, we follow the assumption that the stance of a discussion's post depends only on the post itself, the source thread post and the previous thread post. Since the original input is composed of two documents, we experimented with various ways of encoding the input (Section 6), ending up with just a concatenation of the source and previous post as the first document (left empty in case of the source post being the target post) and the target post as the second document. The discriminative fine-tuning of BERT is done by taking the [CLS]-token level output and passing it via two dense layers yielding the posterior probabilities, as depicted in Figure 2. A weighted cross-entropy loss is used to ensure a flat prior over the classes.
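A minimal sketch of this fine-tuning setup follows, assuming the pytorch-pretrained-BERT API referenced in Footnote 5; here BERT's pooler (dense + tanh over the [CLS] output) plus a final linear layer form the two dense layers, and the class weights are only illustrative (roughly inverse class frequencies from Table 1), not the values used in the submission.

```python
# A minimal sketch of the fine-tuned classifier (assumptions noted above).
import torch
import torch.nn as nn
from pytorch_pretrained_bert import BertModel


class RumourStanceClassifier(nn.Module):
    def __init__(self, n_classes=4, hidden=1024):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-large-uncased")
        self.out = nn.Linear(hidden, n_classes)  # second dense layer -> class scores

    def forward(self, input_ids, segment_ids, attention_mask):
        # pooled = dense/tanh transformation of the [CLS]-token output
        _, pooled = self.bert(input_ids, segment_ids, attention_mask,
                              output_all_encoded_layers=False)
        return self.out(pooled)  # logits; softmax is applied inside the loss


# Weighted cross-entropy to counter the skewed class distribution; the weights
# below are illustrative (roughly inverse class frequencies from Table 1).
class_weights = torch.tensor([1.0, 2.6, 2.3, 0.3])  # S, D, Q, C
loss_fn = nn.CrossEntropyLoss(weight=class_weights)
```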

4.3 Ensembling

Before the submission we trained 100 models, which differed only in the learning rate. We experimented with 4 different system fusions in order to increase the F1 measure and compensate for overfitting (a sketch of the TOP-N and OPT-F1 procedures follows the descriptions below):

TOP-N fusion chose 1 model randomly to add into the ensemble, then randomly shuffled the rest and tried adding them into the ensemble one at a time, while iteratively calculating the ensemble's F1 by averaging the output probabilities to approximate Bayesian model averaging. If adding a model into the ensemble increased the F1, the model was permanently included in the ensemble. The process was repeated until no further model improving the ensemble's F1 was found. This resulted in a set of the 17 best models.

EXC-N fusion put all models into the ensemble and then iteratively dropped one model at a time such that dropping it resulted in the largest increase of the ensemble's F1, stopping when dropping any of the ensemble's models did not increase the F1. Using this approach, we ended up using 94 models.

A variant of TOP-N fusion is analogous, but we average the pre-softmax scores instead of the output class probabilities.

OPT-F1 fusion aims at learning weights, summing up to 1, for a weighted average of the output probabilities from the models selected via the procedure used in TOP-N fusion. The weights are estimated using the modified Powell's method from SciPy to maximize the F1 score on the dev data.
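The sketch below illustrates the TOP-N selection and the OPT-F1 weight estimation under assumed interfaces: each model is represented by its dev-set output probabilities, and scikit-learn's f1_score stands in for the official metric.

```python
# Sketch of the TOP-N greedy fusion and the OPT-F1 weight estimation.
import random
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import f1_score


def ensemble_f1(prob_list, y_true):
    """Macro F1 of an ensemble that averages output probability distributions."""
    avg = np.mean(prob_list, axis=0)  # [n_examples, n_classes]
    return f1_score(y_true, avg.argmax(axis=-1), average="macro")


def top_n_fusion(dev_probs, y_true, seed=0):
    """Greedily keep a model only if it improves the ensemble's dev F1."""
    rng = random.Random(seed)
    candidates = list(range(len(dev_probs)))
    rng.shuffle(candidates)
    chosen = [candidates.pop()]  # start from one randomly chosen model
    best = ensemble_f1([dev_probs[i] for i in chosen], y_true)
    improved = True
    while improved:  # repeat until no model improves the ensemble's F1
        improved = False
        rng.shuffle(candidates)
        for i in list(candidates):
            f1 = ensemble_f1([dev_probs[j] for j in chosen + [i]], y_true)
            if f1 > best:
                chosen.append(i)
                candidates.remove(i)
                best, improved = f1, True
    return chosen, best


def opt_f1_weights(dev_probs, y_true, chosen):
    """OPT-F1: weights (summing to 1) for a weighted average of probabilities."""
    sel = np.stack([dev_probs[i] for i in chosen])  # [n_models, n_examples, n_classes]

    def neg_f1(w):
        w = np.abs(w) / np.abs(w).sum()  # normalize the weights to sum to 1
        avg = np.tensordot(w, sel, axes=1)
        return -f1_score(y_true, avg.argmax(-1), average="macro")

    res = minimize(neg_f1, x0=np.ones(len(chosen)) / len(chosen), method="Powell")
    return np.abs(res.x) / np.abs(res.x).sum()
```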

Model # Acc macro-F1 macro-F1 F1_S F1_D F1_Q F1_C
Branch-LSTM 453K - 49.30 43.80 55.00 7.10 91.30
FeaturesNN 205K 40.29 40.12 17.69 80.43
BiLSTM+SelfAtt 28M 42.21 45.20 17.75 81.92
BERT 109M 43.49 59.88 18.42 90.36
BERT 335M 42.37 55.17 24.44 90.15
BERT 335M 43.02 56.93 26.53 90.51
BERT 335M 44.29 57.07 35.02 90.41
BERT EXC-N - 48.89 62.80 37.50 91.94
BERT TOP-N - 48.25 62.86 39.74 91.83
BERT OPT-F1 - 48.03 62.26 42.77 92.01
BERT TOP-N - 61.67 49.11 64.45 41.29 91.84
Table 2: Our achieved results. Results for a single model were obtained by training at least 10 models; we report the mean and standard deviation for these. # denotes the number of parameters. The columns F1_S through F1_C contain the individual F1 scores for the problem classes. All ensemble models are optimized for the F1 score on the dev data. BiLSTM+SelfAtt contains 4.2M parameters without the pre-trained BERT embeddings. Two of the BERT rows are ablations with an empty source post or an empty target post, respectively. Note that the accuracy is biased by the skewed training class prior shown in Table 1. Our SemEval submissions are marked in the table. The winning BLCU-nlp system achieved a 61.87 F1 score on the test data. More results are available at http://tinyurl.com/y3m5mskd.

5 Experimental Setup

We implemented our models in PyTorch, using the Hugging Face re-implementation (Footnote 5) in the "bert-large-uncased" setting, pre-trained with 24 transformer layers, a hidden unit size of 1024, 16 attention heads and 335M parameters. When building an ensemble, we picked the learning rates from a fixed interval. Each epoch, we iterate over the dataset in an ordered manner, starting with the shortest sequences, as we found this to be helpful. We truncate sequences at a maximum length of 200 tokens with a heuristic: first we truncate one of the two input segments to a fixed length; if that is not enough, we truncate the other segment to the same size. We kept the batch size fixed and keep the other hyperparameters the same as in the BERT paper. We use the same Adam optimizer with L2 weight decay and no warmup. We trained the model on a GeForce RTX 2080 Ti GPU.
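Below is a sketch of the length-truncation heuristic; which segment is capped first and the per-segment limit (seg_limit) are assumptions, since the exact values are not preserved above – only the 200-token maximum is taken from the text.

```python
# Sketch of the length-truncation heuristic (seg_limit and the truncation order
# are assumptions, not the exact submission settings).
def truncate_pair(tokens_a, tokens_b, max_len=200, seg_limit=100):
    """Truncate the two token lists so that the joint BERT input fits max_len
    (the budget accounts for the [CLS] token and two [SEP] tokens)."""
    budget = max_len - 3
    if len(tokens_a) + len(tokens_b) <= budget:
        return tokens_a, tokens_b
    tokens_a = tokens_a[:seg_limit]            # first cap the source+previous segment
    if len(tokens_a) + len(tokens_b) > budget:
        tokens_b = tokens_b[:seg_limit]        # if still too long, cap the target post
    return tokens_a, tokens_b
```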

6 Results and Analysis

We compare our solution with three baselines. The first is the branch-LSTM baseline provided by the task organizers777http://tinyurl.com/y4p5ygn7 – inspired by the winning system of RumourEval 2017. The second baseline (FeaturesNN) is our re-implementation of the first baseline in PyTorch without the LSTM – posts are classified via a 2-layer network (ReLU/Softmax) using only the features named in Footnote 2. In the third case (BiLSTM+SelfAtt), we used the same input representation as our submitted model, but replaced BERT with a 1-layer BiLSTM followed by self-attention and a softmax layer, as proposed by Lin et al. (2017), except that the orthogonality constraint is not used, as we did not find it helpful.
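For illustration, here is a rough sketch of a few of the Footnote 2 features consumed by the FeaturesNN baseline; the exact feature definitions are assumptions, not the organizers' code.

```python
# Rough sketch of a handful of the hand-crafted features from Footnote 2
# (illustrative assumptions, not the official baseline implementation).
import re
import numpy as np


def handcrafted_features(post_text, is_source, post_vec, source_vec):
    """post_vec / source_vec: averaged word vectors of the post / source post."""
    cos = float(np.dot(post_vec, source_vec) /
                (np.linalg.norm(post_vec) * np.linalg.norm(source_vec) + 1e-8))
    return np.array([
        float(is_source),                                 # source-post flag
        len(post_text.split()),                           # post length in tokens
        float(bool(re.search(r"https?://", post_text))),  # URL present
        post_text.count("?") + post_text.count("!"),      # punctuation counts
        cos,                                              # cosine similarity to source
    ], dtype=np.float32)
```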

The results are shown in Table 2. Our BERT models encountered a high variance of results during training. We assume the cause might be the problem difficulty, the small training set and the model complexity. To counteract this, we decided to discard all models with less than a 55 F1 score on the dev data, and we averaged the output class probability distributions when ensembling. Our initial experiments used sequences up to length 512, but we found no difference when truncating them down to 200.

What features weren't helpful: We tried adding a number of features to the pooled output (after the dense/tanh layer), including positive, neutral and negative sentiment and all the features used by the FeaturesNN baseline. We also tried adding jointly learned POS, NER and dependency tag embeddings, as well as a third segment embedding888We tried adding the learned representations to the input the same way the segment/positional embeddings are added. or an explicit token to separate the source and previous post in BERT's input, without observing any improvement.

7 Conclusion

Our approach achieves a 61.67 macro F1 score, improving over the baseline by 12.37%, while using only the discussion's source post, the previous post and the target post to classify the target post's stance towards the rumour. In our case study, we noticed that a few examples are not answerable by a human using only these information sources. Therefore, in the future we would like to extend our system with a relevance scoring component, scoring all the discussion's posts and picking the most relevant ones to preserve the context needed for understanding.

Acknowledgments

This work was supported by [Acknowledgments will be filled upon acceptance.]

References

Appendix A Supplemental Material

a.1 Dataset Insights

For each discussion from Twitter and Reddit, the dataset contains its whole tree structure and metadata, which differ between the two sites (e.g. upvotes on Reddit). When analyzing the data, we also uncovered a few anomalies: 12 data points do not contain any text – according to the organizers, they were deleted by users at the time of the download and were left in place so as not to break the conversational structure; the query stance of a few examples taken from the subreddit DebunkThis999https://www.reddit.com/r/DebunkThis/ is strictly dependent on domain knowledge; and the strict class of some examples is ambiguous – they should probably be labelled with multiple classes.

a.1.1 Domain knowledge dependency

Examples from the subreddit DebunkThis all share the same format "Debunk this: [statement]", e.g. "Debunk this: Nicotine isn't really bad for you, and it's the other substances that make tobacco so harmful.". All these examples are labelled as queries.

a.1.2 Class ambiguity

For instance, consider the source/previous post "This is crazy! #CapeTown #capestorm #weatherforecast https://t.co/3bcKOKrCJB" and the target post "@RyGuySA Oh my gosh! Is that not a tornado?! Cause wow, It almost looks like one!". The target post is officially labelled in the test data as a comment, but we believe it might be a query as well.

a.2 Additional Introspection

The following Figures 3, 4, 5 and 6 contain selected insights into the attention matrices from the multi-head attention defined in (1), where Q, K and V are matrices containing the query, key and value vectors and d_k is the key/query dimension. The insights are selected from the heads at the first layer of the transformer encoder.

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$ (1)
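For reference, a minimal way to compute a single-head attention matrix as in Eq. (1), e.g. to render heatmaps like those in Figures 3-6 (illustrative code, not the introspection script used for the paper):

```python
# Compute one attention matrix as in Eq. (1) for visualization purposes.
import torch


def attention_matrix(Q, K, d_k):
    """softmax(QK^T / sqrt(d_k)); rows are query positions, columns are key positions."""
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    return torch.softmax(scores, dim=-1)


# usage: Q, K of shape [seq_len, d_k], taken from one head of the first encoder layer
Q, K = torch.randn(12, 64), torch.randn(12, 64)
A = attention_matrix(Q, K, d_k=64)  # [12, 12] attention weights to visualize
```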
Figure 3: Intra-segment attention – the attention is made only between the subword units from the same segment.
Figure 4: Attention matrix capturing the subword similarity.
Figure 5: ’Soft’ local context aggregation.
Figure 6: ’Hard’ local context aggregation – the signal is mostly sent further to another transformer encoder layer.