
On the Importance of Delexicalization for Fact Verification

In this work we aim to understand and estimate the importance that a neural network assigns to various aspects of the data while learning and making predictions. Here we focus on the recognizing textual entailment (RTE) task and its application to fact verification. In this context, the contributions of this work are as follows. We investigate the attention weights a state-of-the-art RTE method assigns to input tokens in the RTE component of fact verification systems, and confirm that most of the weight is assigned to tokens with noun POS tags (e.g., NN, NNP) or their phrases. To verify that these lexicalized models transfer poorly, we implement a domain transfer experiment where an RTE component is trained on the FEVER data and tested on the Fake News Challenge (FNC) dataset. As expected, even though this method achieves high accuracy when evaluated in the same domain, the performance in the target domain is poor, marginally above chance. To mitigate this dependence on lexicalized information, we experiment with several strategies for masking out names by replacing them with their semantic category, coupled with a unique identifier to mark that the same or new entities are referenced between claim and evidence. The results show that, while the performance on the FEVER dataset remains on par with that of the model trained on lexicalized data, it improves significantly when tested on the FNC dataset. Thus our experiments demonstrate that our strategy is successful in mitigating the dependency on lexical information.


1 Motivation and Contribution

Neural networks (NNs) play a key role in most natural language processing systems with state-of-the-art performance (Devlin et al., 2018; Sun et al., 2018; Bohnet et al., 2018), especially in the domains of recognizing textual entailment (Kim et al., 2018), fake news detection (Baird et al., 2017), and fact verification (Nie et al., 2018).

However, we suspect that these models depend heavily on lexical information that may transfer poorly between different domains. For example, in early experiments in the fact verification space, we observed that 91% of the statements containing the phrase "American Author" belonged to a single class label. Such information could be meaningful in the literature news domain, but it transfers poorly to other domains such as science or entertainment.

In this work we aim to understand and estimate the importance that a neural network assigns to various aspects of the data while learning and making predictions. Here we focus on the recognizing textual entailment (RTE) task (Dagan et al., 2013) and its application to fact verification (Thorne et al., 2018). RTE is the task of determining whether one piece of text can be plausibly inferred from another. In the Fact Extraction and Verification (FEVER) shared task (Thorne et al., 2018), the RTE module was used to determine whether a given set of evidence sentences, when compared with the provided claim, should be classified as supports, refutes, or not enough information. In this context, the contributions of this work are:

(1) We investigate the attention weights that a state-of-the-art RTE method (Parikh et al., 2016) assigns to input tokens in the RTE component of fact verification systems, and confirm that most of the weight is assigned to tokens with noun POS tags (e.g., NN, NNP) or their phrases, which corroborates the "American Author" observation stated above.

(2) To verify that these lexicalized models transfer poorly, we implement a domain transfer experiment in which an RTE component is trained on the FEVER data and tested on the Fake News Challenge (FNC) dataset (Pomerleau and Rao, 2017). As expected, even though this method achieves high accuracy when evaluated in the same domain, its performance in the target domain is poor, only marginally above chance.

(3) To mitigate this dependence on lexicalized information, we experiment with several strategies for masking out names by replacing them with their semantic category, coupled with a unique identifier that marks whether the same or a new entity is referenced between claim and evidence. The results show that, while performance on the FEVER dataset remains on par with that of the model trained on lexicalized data, it improves significantly when tested on the FNC dataset. Our experiments thus demonstrate that this strategy successfully mitigates the dependency on lexical information.

2 Experimental Setup

2.1 Datasets

For our analysis, we utilize two RTE datasets. The FEVER dataset (Thorne et al., 2018) was used for training and for evaluating in-domain performance. It consists of around 145,000 training instances, each of which has a claim and a set of evidence sentences retrieved from Wikipedia using a baseline information retrieval module. The claim-evidence pairs in the gold FEVER dataset were assigned labels from three classes: supports, refutes, and not enough info. Even though the partition of the FEVER dataset that was used in the final shared task competition was released publicly, its gold labels were not. Hence we used the publicly released development portion (19,999 instances) as our test partition, and created a development partition by randomly dividing the training partition into 80% for training and 20% for development.

The FNC dataset (Pomerleau and Rao, 2017) was used for evaluating cross-domain transfer. It contains four classes (agree, disagree, discuss, and unrelated) and has publicly available training (40,904 data points), development (9,086 data points), and test (25,413 data points) partitions. To make the two datasets comparable, we converted the FEVER dataset from three to four classes. Data points labeled supports in FEVER were relabeled agree, and refutes as disagree. The not enough info class was divided into two classes: in the first (discuss), the evidence was retrieved using k-nearest neighbors (Thorne et al., 2018); in the second (unrelated), the evidence was retrieved randomly.
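A minimal sketch of this three-to-four-class conversion, assuming each FEVER instance is a dict with hypothetical "label" and "evidence_source" fields (the field names are ours, not the official FEVER schema):

```python
# Sketch: map FEVER's 3-class labels onto FNC's 4-class scheme.
# Assumes each instance is a dict with a "label" field and, for
# NOT ENOUGH INFO instances, an "evidence_source" field recording whether
# the evidence came from k-NN retrieval or random sampling. These field
# names are illustrative, not the dataset's own.

def fever_to_fnc_label(instance):
    label = instance["label"]
    if label == "SUPPORTS":
        return "agree"
    if label == "REFUTES":
        return "disagree"
    # NOT ENOUGH INFO is split by how its evidence was retrieved.
    if instance.get("evidence_source") == "nearest_neighbors":
        return "discuss"
    return "unrelated"


if __name__ == "__main__":
    examples = [
        {"label": "SUPPORTS"},
        {"label": "NOT ENOUGH INFO", "evidence_source": "random"},
    ]
    print([fever_to_fnc_label(x) for x in examples])  # ['agree', 'unrelated']
```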

2.2 Approach

For all of our experiments we use the Decomposable Attention (DA) model (Parikh et al., 2016), which achieved state-of-the-art performance on the FEVER task. In particular, we use the AllenNLP implementation of DA (https://github.com/allenai/allennlp), which was provided by the FEVER task organizers.
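For reference, a minimal sketch of querying a trained DA model for a claim-evidence pair through AllenNLP's generic Predictor interface; the archive path is a placeholder for a model trained on FEVER (not a released artifact), and the exact output keys may vary with the AllenNLP version:

```python
# Sketch: score one claim-evidence pair with a trained Decomposable Attention
# model via AllenNLP. "model.tar.gz" is a hypothetical local archive produced
# by training the FEVER RTE component; newer AllenNLP versions may also
# require the separate allennlp-models package to be installed.
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path("model.tar.gz")  # placeholder archive path

result = predictor.predict_json({
    "premise": "Evidence sentence retrieved from Wikipedia.",
    "hypothesis": "The claim to be verified.",
})

# For DA-style entailment models the output typically includes a distribution
# over labels, e.g. result["label_probs"].
print(result.get("label_probs"))
```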

Figure 1: Distribution of the POS tags to which the model assigned the most importance while misclassifying out-of-domain data.
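Distributions like the one in Figure 1 can be produced by aggregating attention mass per POS tag. A rough sketch under our own assumptions (spaCy as the tagger, and a per-token attention vector already extracted from the model; the paper itself relied on AllenNLP's visualization tooling):

```python
# Sketch: aggregate attention mass by POS tag for one evidence sentence.
# Assumes `attention` is a per-token weight vector already pulled out of the
# DA model and aligned with spaCy's tokenization, and that the
# en_core_web_sm model is installed. spaCy stands in for the tagging
# pipeline used in the paper.
from collections import defaultdict

import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")

def attention_by_pos(text, attention):
    """Sum the attention weight falling on each PTB POS tag and normalize."""
    doc = nlp(text)
    assert len(doc) == len(attention), "attention must align with the tokens"
    mass = defaultdict(float)
    for token, weight in zip(doc, attention):
        mass[token.tag_] += float(weight)
    total = sum(mass.values()) or 1.0
    return {tag: w / total for tag, w in sorted(mass.items(), key=lambda kv: -kv[1])}

text = "Colin Kaepernick became a starting quarterback"
attention = np.array([0.30, 0.35, 0.05, 0.02, 0.08, 0.20])
print(attention_by_pos(text, attention))  # NNP-type tags dominate the mass
```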

2.3 Masking Techniques

While visualizing the attention weights (Gardner et al., 2018) used in the decomposable attention model, we discovered that the model placed very high significance on named entities. This was particularly true when looking at errors made by the model in the cross-domain setting, as shown in Figure 1. To mitigate this issue, we experimented with several techniques that use named entity recognition (NER) to mask named entities.


Deletion: Lexical items that are tagged as named entities (Manning et al., 2014) are deleted.

Basic NER: Lexical items that are tagged as named entities are replaced with their corresponding NER tags (e.g., location, person).

Custom NER: Built on top of Basic NER masking, we additionally mark lexical overlap between the claim and evidence sentences with custom suffixes. The first instance of a given entity in the claim is tagged with "c1", where "c" denotes that it was found in the claim sentence (e.g., person-c1). Wherever this same entity is found later, in the claim or in the evidence, it is replaced with this unique tag. If an entity is found only in the evidence, it is denoted with an "e" tag (e.g., location-e3). We create pseudo pre-trained embeddings for these new Custom NER tokens by adding a small amount of random Gaussian noise (mean 0, variance 0.1) to the pre-trained embeddings (Pennington et al., 2014) of the root word corresponding to the category (e.g., "person"). Thus the embeddings of all the sub-tags, while being unique, are also only slightly perturbed from that of the root word. A sketch of this masking and embedding procedure is given after this list.
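A condensed sketch of the Custom NER masking and the pseudo-embedding construction, under our own assumptions (spaCy stands in for the Stanford CoreNLP NER used in the paper, entities are matched by exact string, and "glove" is a plain dict from words to vectors):

```python
# Sketch: Custom NER masking plus pseudo pre-trained embeddings.
# spaCy replaces the CoreNLP tagger used in the paper; `glove` is assumed to
# be a dict mapping words (e.g., "person") to numpy vectors.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")

def custom_ner_mask(claim, evidence):
    """Replace entities with category tags suffixed by c<i>/e<i>, reusing the
    same tag whenever the same entity string reappears in claim or evidence."""
    seen = {}                      # entity string -> assigned tag
    counters = {"c": 0, "e": 0}

    def mask(text, side):
        doc = nlp(text)
        out, last = [], 0
        for ent in doc.ents:
            out.append(text[last:ent.start_char])
            key = ent.text.lower()
            if key not in seen:
                counters[side] += 1
                seen[key] = f"{ent.label_.lower()}-{side}{counters[side]}"
            out.append(seen[key])
            last = ent.end_char
        out.append(text[last:])
        return "".join(out)

    # Mask the claim first so shared entities keep their "c" tags.
    return mask(claim, "c"), mask(evidence, "e")

def pseudo_embedding(tag, glove, dim=300, std=0.1 ** 0.5):
    """Perturb the root category's vector (e.g., 'person') with Gaussian noise
    (mean 0, variance 0.1) to get a unique vector for each sub-tag."""
    root = tag.split("-")[0]                         # "person-c1" -> "person"
    base = glove.get(root, np.zeros(dim))
    seed = int.from_bytes(tag.encode(), "little") % (2 ** 32)
    rng = np.random.default_rng(seed)                # deterministic per tag
    return base + rng.normal(0.0, std, size=base.shape)

claim = "Nikolaj Coster-Waldau worked with the Fox Broadcasting Company."
evidence = "He became known to a wider audience for his work with Fox."
print(custom_ner_mask(claim, evidence))
```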

Configuration    In-domain test    Out-of-domain test    Out-of-domain dev
No masking       79.21%            40.45%                34.44%
Deletion         70.53%            29.91%                27.33%
Basic NER        71.45%            39.59%                37.85%
Custom NER       78.45%            52.60%                51.69%
Table 1: Accuracy of the various masking techniques, both in-domain (FEVER) and out-of-domain (FNC).

3 Results and Observations

Table 1 shows the overall accuracies of the various masking techniques when the trained model was tested in the in-domain and out-of-domain settings. The key observation is that, while the addition of masking reduced accuracy slightly in the in-domain setting (i.e., from 79.21% to 78.45%), it significantly improved the performance of the model in the out-of-domain setting (i.e., from 40.45% to 52.60%). This demonstrates the utility of masking named entities for increasing cross-domain performance in RTE tasks, and we hypothesize that the technique can be extended (e.g., perhaps through superset masking) for further gains.
