While social media makes communication convenient and eases interactions between people, it also facilitates the dissemination of misinformation and disinformation [41, 10]. Misinformation refers to false or incorrect information, while disinformation is purposefully manipulated news. Wikipedia has been used as a source for large-scale corpora of real claims and evidence documents, and has thus been adopted for fact-checking and verification against fake and false information, as in WikiFactCheck-English, FEVER, MultiFC, WikiCitations, and Wiki-Check. Wikipedia can also be utilized to construct knowledge graphs that enhance applications such as recommender systems, question answering, and dialogue systems. Although Wikipedia is widely exploited, dealing with noisy and low-quality Wikipedia articles remains critical. Therefore, it is crucial for editors and fact-checkers to have approaches for identifying Wikipedia articles that contain incorrect information.
To improve the quality of Wikipedia articles, this work investigates the detection of self-contradiction articles in Wikipedia. An article is regarded as self-contradictory if it contains multiple claims or ideas that are inherently in disagreement. That is, if an article contains at least two statements that contradict one another, the article contradicts itself. In Wikipedia, a “Self-contradictory” template (https://en.wikipedia.org/wiki/Template:Self-contradictory) exists to annotate such articles. Editors use the template to manually flag an article as self-contradictory, which results in a historical collection of self-contradiction articles (https://en.wikipedia.org/wiki/Category:Self-contradictory_articles). We accordingly aim to create a dataset for detecting self-contradiction articles in Wikipedia. We further propose the first model, Pairwise Contradiction Neural Network (PCNN), for the detection task.
|Wikipedia Article 1: Tyler Acord|
|…Tyler Acord (born September 12, 1990), better known by his stage|
|name Lophiile and formerly known as Scout, is an American record|
|producer, DJ, multi-instrumentalist, and songwriter born in Lakewood,|
|Washington. …Tyler Acord was born in Renton, Washington on|
|September 12, 1990. …|
|Wikipedia Article 2: Pink Chanel suit of Jacqueline Bouvier Kennedy|
|…There was long a question among fashion historians and experts|
|whether the suit was a genuine Chanel or a quality copy purchased|
|from New York’s Chez Ninon, a popular dress shop that imported|
|European label designs. …A number of sources claimed it was more|
|than likely a copy of a Chanel pink bouclé wool suit trimmed with a|
|navy blue collar …|
Table I presents two examples of self-contradictory articles in Wikipedia. In the first example, Tyler Acord, the two sentences clearly contradict one another because they give different birthplaces (Lakewood versus Renton, Washington). The second example, Pink Chanel suit of Jacqueline Bouvier Kennedy, concerns the authenticity of the suit: one sentence says that the suit could be either a genuine Chanel or a copy, while the other asserts that it was most likely a copy.
Detecting self-contradiction articles differs from conventional contradiction detection in text data. The input of typical contradiction detection is a pair of sentences or claims, and the goal is to classify whether they contradict each other. Previous work has explored contradictions in domains such as scientific reviews [3, 6], short-text posts on social media [24, 26], “rumorous claims”, and commercial item reviews [27, 43]. In the self-contradiction detection proposed in this work, we are given an article containing a number of sentences, and the task is to simultaneously classify the article as self-contradictory or not and identify which pairs of sentences contradict one another. The key difference is that detecting self-contradiction requires a model to understand the semantics and topics of the input article, in addition to performing pairwise comparisons between sentences, so that sentences whose meanings contradict other sentences or the whole article can be highlighted. An existing study found that a few near-duplicate sentences in Wikipedia can contradict each other; however, it simply clusters sentences by lexical Jaccard similarity and does not address the problem of finding contradictory sentences with different phrasings.
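To make the limitation of lexical clustering concrete, the following minimal sketch computes word-level Jaccard similarity for two sentence pairs. The tokenizer and the second, rephrased sentence are illustrative assumptions, not taken from the cited study; the first pair is simplified from the Tyler Acord example.

```python
import re

def tokens(sentence):
    """Lower-cased word set of a sentence (a deliberately naive tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))

def jaccard(a, b):
    """Lexical Jaccard similarity: |A intersect B| / |A union B| over word sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

claim = "Tyler Acord was born in Renton, Washington."

# Near-duplicate phrasing: high lexical overlap, so clustering would pair them.
near_duplicate = jaccard(claim, "Tyler Acord was born in Lakewood, Washington.")

# Hypothetical rephrasing of the same conflicting fact: no shared words at all.
rephrased = jaccard(claim, "The producer is a native of Lakewood.")
```

Both candidate sentences conflict with the claim about the birthplace, yet only the near-duplicate one has a high lexical score, which is why a semantic pairwise model is needed for differently phrased contradictions.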
In this paper, we create a new dataset, WikiContradiction, which contains both self-contradiction and non-contradiction articles from Wikipedia. We develop the first model, Pairwise Contradiction Neural Network (PCNN), for the detection task. The main idea of PCNN is three-fold. First, we fine-tune the Sentence-BERT model to generate representations of the sentences in a Wikipedia article. Second, we pre-train a pairwise contradiction learning network to generate the contradiction probability of each pair of sentences. Third, we select the top-k sentence pairs with the highest contradiction probabilities, and use their embeddings to produce the binary classification outcome.
|Work||Data||Type||Model Input||Feature Extraction||Classifier||Explainability|
|||Scientific Claims||Pair||Sentence Pair||Linguistic||SVM|||
|||Item Reviews||Pair||Sentence Pair||Aspects & Sentiments||Decision Tree|||
|||Event Tweets||Pair||Sentence Pair||POS Tags & Dependency Parsing||MaxEnt|||
|||Product Tweets||Pair||Sentence Pair||Sentiments||CBOW & ESIM|||
|||Rumorous Claims||Pair||Sentence Pair||Text Similarity||Random Forest|||
|||Video Snippets||Pair||Sentence Pair||Word Embeddings||CNN|||
|||Video Snippets||Pair||Sentence Pair||Word Embeddings||GRU|||
|This work||Wikipedia Articles||Self||Article||Pre-Training & Fine-Tuning||MLP||✓|
Experiments conducted on our WikiContradiction dataset deliver three main findings. First, PCNN clearly outperforms typical document classification models. Second, pre-training the pairwise contradiction learning network contributes the most to the detection performance. Third, the case studies show that PCNN can identify the most contradictory pairs of sentences within an article.
Below we list the contributions of this work.
We create a novel dataset, WikiContradiction, for detecting self-contradiction Wikipedia articles (data and code can be accessed at https://github.com/Wiki-Contradictory/Wiki-Self-Contradictory/). To the best of our knowledge, it is the first dataset for the self-contradiction detection task on Wikipedia.
We define the task of detecting self-contradiction articles and highlighting contradictory sentence pairs, and solve it by developing a novel model, Pairwise Contradiction Neural Network (PCNN). We propose to pre-train PCNN on two benchmarks, SNLI and MNLI, and fine-tune it on our WikiContradiction dataset.
Experimental results show that PCNN not only achieves promising performance in both imbalanced and balanced settings, but also highlights the most contradictory sentence pairs in an article.
This paper is organized as follows. We review related work in Section II and describe the WikiContradiction dataset in Section III. Section IV gives the technical details of the proposed PCNN model. We describe the evaluation settings and experimental results in Section V. Finally, we discuss the implications and limitations in Section VI, and conclude this work in Section VII.
II Related Work
The typical contradiction detection task aims to classify whether one sentence or claim contradicts another. The input instance is a pair of sentences or claims. A variety of supervised learning-based models have been developed. Alamri and Stevenson extract linguistic features and use a support vector machine to classify potentially contradictory scientific claims. Ismail et al. take advantage of review polarity to train supervised classifiers for predicting the contradiction intensity of item reviews. Lendvai et al. rely on part-of-speech tagging and dependency parsing, together with the MaxEnt classifier, to classify contradictions in social media posts. Li et al. incorporate sentiment analysis into contradiction detection on Amazon’s customer reviews. Lendvai and Reichel use a set of textual similarity features, including vocabulary overlap, local alignment, and corpus statistics, and apply a random forest to classify whether a pair of rumorous claims is contradictory. Li et al. learn contradiction-specific word embeddings and use a convolutional neural network to recognize contradiction relations between a pair of sentences. Besides, Tan et al. propose a dual attention-based gated recurrent unit that learns aspect-level sentiments for recognizing conflicting opinions. All of these approaches compare two sentences or claims from various aspects and with side information. Nevertheless, they cannot be directly applied to the self-contradiction detection of an article, which involves reasoning over all sentence pairs.
In Table II we compare our work with existing studies on contradiction detection. Six aspects are compared: the data used for contradiction detection (“Data”), the contradiction type (“Type”), model input, feature extraction, classifier, and explainability. As the table shows, to the best of our knowledge we are the first to deal with self-contradiction detection within an article, while all past work focuses on pairwise contradiction detection for a given sentence pair. In addition, for feature extraction we do not rely on hand-crafted engineering, but use pre-training and fine-tuning strategies to learn contradiction-related features. Furthermore, our model is the only one that provides explainability by highlighting which parts of the model input contradict each other.
Recognizing Textual Entailment. The task of recognizing textual entailment aims to predict whether a given textual premise entails or implies a given hypothesis, i.e., a binary classification task. One of the essential relations between sentences is contradiction. Early approaches rely on sentence structure together with linguistic resources, such as WordNet and VerbNet, to bring semantics into the estimation of entailment by alignment and transformation. The supervised Context-Enriched Neural Network (CENN) utilizes multiple embeddings from different contexts to better represent text pairs, along with an attention mechanism that combines them for predicting the entailment relation. Silva et al. further explore structured knowledge graphs to better classify and explain the entailment relation. Recently, with advances in Natural Language Inference (NLI), one category of textual entailment is to classify a pair of text pieces as entailment, neutral, or contradiction. A number of methods are now based on deep learning [11, 17], especially those resorting to attention mechanisms [34, 37, 47]. Well-known transformer-based models, such as BERT, RoBERTa, and XLNet, further lead to state-of-the-art performance.
The detection of self-contradiction articles can be considered a document classification task because it performs binary classification based on the whole textual content of a document. Conventional approaches extract features such as bag-of-words, TF-IDF, n-grams, word embeddings, and Doc2Vec [2, 22]. Deep learning techniques can generate better document representations, and thus significantly improve the performance of document classification. Typical methods include TextCNN, CharCNN, hierarchical attention networks (HAN), and RegLSTM [18, 1]. Recently, pre-trained language representation models [14, 29] have been adopted for document classification with promising performance.
III WikiContradiction Dataset
Wikipedia editors use the “Self-contradictory” template to tag articles that contain contradictory information. “Templates are pages that are embedded into other pages to allow for the repetition of information” (https://en.wikipedia.org/wiki/Wikipedia:Templates). Templates can be used, among other things, to signal problems with an article’s content, allowing readers and other editors to understand issues with an article or a specific piece of content. One well-known example is the “citation needed” tag. We can assume that templates used to signal problematic content have good precision, because adding a template requires expertise on Wikipedia, meaning that the editor adding it is likely to have good knowledge of how Wikipedia works. In that sense, we can consider the template a high-quality manual annotation. This methodology has already been used to generate high-quality datasets that indicate other content reliability issues in Wikipedia. However, for the same reason (the need for expert editors), recall could be low, because we cannot assume that every article has been reviewed by an expert editor. In our dataset, 87% of the users who added a template had over 1,000 edits on Wikipedia at the time they added the self-contradictory template. Using templates is not trivial and requires knowledge of MediaWiki functionalities. Moreover, templates such as “self-contradictory” are not very popular (they appear in less than 1% of English Wikipedia articles), meaning that users who add them have deep knowledge of Wikipedia’s conventions. Therefore, we consider Wikipedia editors with more than 1,000 edits to be high-quality annotators.
To build a balanced dataset with examples of self-contradiction and non-self-contradiction (normal) articles, we look at all versions (a.k.a. revisions) of all articles in English Wikipedia, and select the articles that had the self-contradictory template added in some old version. Next, we search for newer versions in which the template has been removed, meaning that editors have reviewed the new version of the article and removed the template to indicate that the self-contradiction problem has been resolved. This methodology allows us to build a balanced dataset in which both categories have been annotated by experts.
We took all articles in English Wikipedia until March 2020 and, by running a simple string-matching process, detected all revisions that contain a self-contradictory template. Next, we scanned all newer versions of those articles until finding a version that does not contain the template. In total, we find articles with the template, where had a version that resolved the self-contradiction. We discarded articles that do not resolve the contradiction. Eventually, we have positive (self-contradiction) articles and negative (non-self-contradiction) articles. When splitting the data into training and testing sets, we ensure that the self-contradiction and non-self-contradiction versions of an article both appear in the same set (either training or testing) to avoid label leakage.
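The construction procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the released pipeline: the marker string `TEMPLATE_MARKER`, the choice of the first tagged revision as the positive example, and the helper names `pair_revisions` and `split_by_article` are all hypothetical.

```python
import random

# Assumption: the template is found by a case-insensitive substring match.
# The exact pattern (and any template aliases) used in practice is not given.
TEMPLATE_MARKER = "{{self-contradictory"

def has_template(wikitext):
    return TEMPLATE_MARKER in wikitext.lower()

def pair_revisions(revisions):
    """revisions: wikitext strings of one article, ordered oldest to newest.

    Returns (positive, negative), where positive is the first tagged revision
    (an assumption) and negative is the first later revision without the
    template. Returns None if the article was never tagged or never resolved.
    """
    positive = None
    for text in revisions:
        if positive is None:
            if has_template(text):
                positive = text
        elif not has_template(text):
            return positive, text
    return None

def split_by_article(pairs, train_ratio=0.8, seed=0):
    """pairs: dict mapping title -> (positive, negative).

    The split is made over titles, so both versions of an article always
    fall into the same set and no label can leak across the split.
    """
    titles = sorted(pairs)
    random.Random(seed).shuffle(titles)
    cut = int(len(titles) * train_ratio)

    def expand(selected):
        rows = []
        for title in selected:
            positive, negative = pairs[title]
            rows.append((title, positive, 1))
            rows.append((title, negative, 0))
        return rows

    return expand(titles[:cut]), expand(titles[cut:])
```

Splitting over titles rather than over individual revisions is what keeps the two nearly identical versions of an article from ending up on opposite sides of the split.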
IV The Proposed Model: PCNN
Let be a Wikipedia article, consisting of sentences . We treat the Wikipedia article self-contradiction detection task as a binary classification problem. Specifically, each article can be true (i.e., self-contradiction, ) or false (i.e., non-contradiction, ). For a self-contradiction Wikipedia article , there exist at least two sentences and whose semantic meanings or referred facts contradict one another. In addition to the detection task, we also aim to learn a ranking list from all pairs of sentences in , according to the prediction probabilities, where denotes the -th most explainable pair of sentences that contradict each other.
Problem Definition: Wikipedia Self-Contradiction Article Detection. Given an article in Wikipedia, our task is to learn a self-contradiction detection function , such that it maximizes the classification probability with explainable pairs of sentences ranked highest in list .
PCNN Model Architecture. The architecture of the PCNN model is presented in Figure 2 and consists of four components. The first is Sentence Representation Generation. Given a Wikipedia article, we generate a representation vector for each sentence through a pre-trained Sentence-BERT model, whose network is fine-tuned on our data. The second is Pre-Training the Pairwise Contradiction Learning (PCL) network on two natural language inference benchmarks, SNLI and MNLI. PCL aims to generate a contradiction probability that depicts how strongly two sentences contradict each other. The PCL pre-training serves to initialize the PCL model parameters so that they capture comparative semantics between sentences. The third is Fine-Tuning the PCL Network, which is trained end-to-end together with the Sentence-BERT component and the last component. We fine-tune the PCL network and produce the contradiction probability of every sentence pair in our data. The last component is Article Representation & Classification. We select the most “suspicious” sentence pairs based on their contradiction probabilities, use a self-attention layer to encode their correlation, and generate the final binary classification outcome.
IV-A Sentence Representation
Since self-contradiction detection involves inspecting the sentences within an article against each other, we can consider the task similar to sentence-pair regression tasks such as semantic textual similarity, as mentioned in BERT and RoBERTa. Sentence-BERT improves the representation capability for sentences through siamese and triplet networks, in which BERT networks are fine-tuned with shared weights. To make sentences semantically comparable within an article, we fine-tune the pre-trained Sentence-BERT model (https://www.sbert.net/) to generate sentence embeddings, denoted as for each sentence . The derived sentence embeddings are used as the initial vectors of our pairwise sentence contradiction learning.
IV-B Pairwise Contradiction Learning
In detecting self-contradiction articles, we compare the semantics of sentences and generate a contradiction probability depicting the degree to which two sentences contradict each other. Intuitively, we should enumerate all pairs of sentences within an article and compare them for semantic contradiction. Instead, we learn to produce the contradiction probability of each sentence pair within a paragraph. The reason is two-fold. First, existing studies have pointed out that sentences within a paragraph of a Wikipedia article are semantically coherent and topically consistent [16, 7, 28]. This observation encourages us to first examine sentences in a paragraph in a pairwise manner. We verify in the experiments whether paragraph-level pairwise sentence comparison is better than the article-level version. Second, popular articles in Wikipedia can contain hundreds of sentences, so enumerating all pairs of sentences can incur a high computational cost.
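The cost argument can be illustrated with a small sketch that counts candidate pairs under the two strategies. The article below (10 paragraphs of 10 sentences each) is a hypothetical example, and the function names are ours.

```python
from itertools import combinations

def article_level_pairs(paragraphs):
    """All unordered sentence pairs across the whole article."""
    sentences = [s for paragraph in paragraphs for s in paragraph]
    return list(combinations(sentences, 2))

def paragraph_level_pairs(paragraphs):
    """Unordered sentence pairs restricted to sentences of the same paragraph."""
    return [pair for paragraph in paragraphs for pair in combinations(paragraph, 2)]

# Hypothetical article: 10 paragraphs with 10 sentences each (100 sentences).
article = [["p%ds%d" % (i, j) for j in range(10)] for i in range(10)]

n_article = len(article_level_pairs(article))      # C(100, 2) pairs
n_paragraph = len(paragraph_level_pairs(article))  # 10 * C(10, 2) pairs
```

For this example the article-level enumeration yields 4950 pairs while the paragraph-level one yields 450, and the paragraph-level pairs can additionally be scored independently per paragraph.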
Given the embeddings of two sentences, and , we pre-train a pairwise contradiction learning (PCL) model that generates pairwise sentence embedding and a probability . We first utilize a learnable weight matrix to generate intermediate vectors, given by: and . Then we concatenate the vectors and with the element-wise difference , and multiply it with a trainable weight , along with the softmax function, to generate the contradiction probability between sentences and , denoted as , given by:
where is the softmax function, denotes the concatenation operation, and . The cross-entropy loss is employed for optimization. Here the pre-training is performed using the sentence pairs with the “contradiction” label (i.e., as a binary classification) in both the Stanford Natural Language Inference dataset (SNLI, http://nlp.stanford.edu/projects/snli/) and the Multi-Genre NLI dataset (MNLI, https://cims.nyu.edu/~sbowman/multinli/). Finally, we fine-tune the pre-trained model to generate the pairwise sentence embedding and the contradiction probability using our compiled WikiContradiction data for the detection of self-contradiction articles.
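Reading the description above literally, the PCL computation can be written as follows. The symbols $\mathbf{e}_i$ (sentence embedding), $\mathbf{h}_i$ (intermediate vector), $\mathbf{v}_{ij}$ (pairwise sentence embedding), $\mathbf{p}_{ij}$ (output distribution), $\mathbf{W}_h$ and $\mathbf{W}_p$ are labels chosen here for illustration, taking $\mathbf{v}_{ij}$ as the pairwise embedding is itself an assumption, and the dimensions of the weight matrices are left unspecified.

```latex
\begin{aligned}
\mathbf{h}_i &= \mathbf{W}_h \mathbf{e}_i, \qquad \mathbf{h}_j = \mathbf{W}_h \mathbf{e}_j, \\
\mathbf{v}_{ij} &= \mathbf{h}_i \oplus \mathbf{h}_j \oplus (\mathbf{h}_i - \mathbf{h}_j), \\
\mathbf{p}_{ij} &= \sigma\!\left(\mathbf{W}_p \mathbf{v}_{ij}\right),
\end{aligned}
```

Here $\sigma$ is the softmax function and $\oplus$ is concatenation, and the contradiction probability is the entry of $\mathbf{p}_{ij}$ that corresponds to the “contradiction” class. This mirrors the classification objective of Sentence-BERT, which uses the absolute element-wise difference $|\mathbf{h}_i - \mathbf{h}_j|$; whether the absolute value is taken here is not stated in the text.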
IV-C Article Representation & Classification
We aim to generate the representation of the given article, and accordingly produce its self-contradiction probability. Since the number of sentences involved in self-contradiction tends to be limited in an article, we consider only the most contradictory sentence pairs to determine whether an article is self-contradictory. Sentence pairs with higher probability are used to learn the article representation. In other words, we select the embedding vectors of the sentence pairs with the top-k probabilities.
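The selection step can be sketched with the standard library as follows. The tuple layout, the probabilities and the two-dimensional embeddings are made-up values for illustration only.

```python
import heapq

def select_top_k(scored_pairs, k):
    """Keep the k sentence pairs with the highest contradiction probability.

    scored_pairs: list of (probability, pair_embedding, (i, j)) tuples, where
    (i, j) are the positions of the two sentences in the article.
    The result is ordered from most to least contradictory.
    """
    return heapq.nlargest(k, scored_pairs, key=lambda item: item[0])

# Hypothetical PCL outputs for four sentence pairs.
scored = [
    (0.12, [0.1, 0.3], (0, 1)),
    (0.91, [0.7, 0.2], (0, 2)),
    (0.55, [0.4, 0.4], (1, 2)),
    (0.78, [0.9, 0.1], (2, 3)),
]
top = select_top_k(scored, k=2)
top_indices = [indices for _, _, indices in top]
```

The retained embeddings feed the article representation, while the retained index pairs are what can be shown to an editor as the explanation.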
Sentence pairs in an article can be correlated with one another. The semantics (e.g., coherent or not in their meanings) of contradiction pairs can differ from that of non-contradiction sentence pairs. Therefore, we need to model how different sentence pairs correlate with each other and how they contribute to self-contradiction detection in article representation learning. We exploit a self-attention layer  to achieve this goal. Specifically, we can generate the article embedding based on:
where is the attention weight vector (estimating the contribution of each sentence pair) obtained from a softmax function, which is applied to the dot product of two different transformations on , and is the learnable parameter.
To produce the classification probability, we feed the article embedding into a one-hidden-layer feed-forward network, together with a sigmoid function, to generate the probability of classifying as self-contradiction. The cross-entropy loss is used for model optimization. In detail, we use a batch-size of, Adam optimizer with a learning rate , and a linear learning rate warm-up over 10% of the training data.
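A linear warm-up of this kind can be sketched as below. The base learning rate `2e-5` and the step counts are hypothetical placeholders, and keeping the rate constant after the warm-up phase is an assumption, since the behaviour after warm-up is not specified above.

```python
def warmup_lr(step, total_steps, base_lr, warmup_fraction=0.1):
    """Learning rate at a 0-indexed training step.

    The rate grows linearly from base_lr / warmup_steps to base_lr during the
    first warmup_fraction of training, then stays at base_lr (assumption).
    """
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    return base_lr * min(1.0, (step + 1) / warmup_steps)

# Hypothetical run of 100 optimizer steps with a placeholder base rate.
schedule = [warmup_lr(s, total_steps=100, base_lr=2e-5) for s in range(100)]
```

Starting from a small rate is a common precaution when fine-tuning pre-trained encoders, because large early updates can disturb the pre-trained weights.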
V-A Evaluation Settings
Although the compiled dataset of self-contradiction and non-self-contradiction Wikipedia articles is balanced, self-contradiction articles are relatively rare in the real world. Hence, we divide the experiments into two settings, balanced and imbalanced. For the balanced setting, we randomly sample sets with an equal number of articles for each class. For the imbalanced settings, we vary the ratio of positive (self-contradiction) to negative (non-self-contradiction) articles, i.e., 10%:90%, 30%:70%, and 50%:50%, with TR=80%, and randomly generate sets for each imbalance ratio. We use Precision (Pre), Recall (Rec), F1, and Accuracy (Acc) as the evaluation metrics. The default training-test split is 80%:20%, and we report the average results. We also vary the training percentage (TR) among 20%, 40%, 60%, and 80%.
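The sampling of an imbalanced set and the four metrics can be sketched as follows. The helper names, the set size and the toy labels are illustrative assumptions; the positive class is the self-contradiction class.

```python
import random

def sample_with_ratio(positives, negatives, pos_ratio, total, seed=0):
    """Draw a set of `total` articles in which a fraction pos_ratio is positive."""
    rng = random.Random(seed)
    n_pos = round(total * pos_ratio)
    return rng.sample(positives, n_pos), rng.sample(negatives, total - n_pos)

def binary_metrics(y_true, y_pred):
    """Precision, Recall, F1 and Accuracy with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"pre": precision, "rec": recall, "f1": f1, "acc": (tp + tn) / len(pairs)}

# Hypothetical pools of article ids and a 10%:90% set of 200 articles.
pos_sample, neg_sample = sample_with_ratio(
    list(range(100)), list(range(100, 1000)), pos_ratio=0.1, total=200
)

# Toy predictions: 2 true positives, 1 false negative, 1 false positive.
scores = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
```

Under strong imbalance, Accuracy alone is uninformative because predicting every article as negative already scores highly, which is why Precision, Recall and F1 are reported alongside it.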
We compare the proposed PCNN with four baselines: (a) Random, which decides whether an article is self-contradictory at random; (b) LSTM, which sequentially feeds the FastText Wiki pre-trained word embeddings (https://fasttext.cc/docs/en/pretrained-vectors.html) into an LSTM and uses the last hidden layer to produce the binary classification; (c) HAN, a well-known deep model with hierarchical attention networks for document classification; and (d) BERT, which fine-tunes the BERT model on the training data. The hyperparameters of the competing methods follow the settings reported in the respective studies; their word embedding dimension is, and all of their intermediate embedding dimensions are .
All experiments are conducted with PyTorch on GPU machines (Nvidia GeForce GTX 1080 Ti). The default hyperparameter settings of PCNN are listed here. The dimensionalities of both the sentence embedding and the pairwise sentence embedding are. The dimensionality of the article embedding is . The selection of the top-k most contradictory sentence pairs is set with .
V-B Experimental Results
Main Results. The performance comparison between PCNN and the competing methods under the balanced setting is reported in Table III, in which the performance scores are derived based on all testing sentences. PCNN consistently outperforms the other methods across different training percentages and evaluation metrics, which demonstrates its promising capability for detecting self-contradiction articles. The reason is two-fold. First, PCNN models the pairwise semantics between sentences and can thus check whether a conflict exists, while the document classification models HAN and LSTM simply learn the meaning of the whole article. Second, fine-tuning the pre-trained language model BERT has some effect, but is not as good as PCNN. This could be because BERT treats and learns all sentences and their pairs equally, whereas contradictions tend to appear in only a few sentence pairs that the model needs to emphasize. Although PCNN achieves better results by tackling these issues, the scores of PCNN with 80% training data are still not higher than . This indicates that detecting self-contradiction articles is indeed challenging.
Ranking Results. Using the prediction probabilities reported by a model, we can conduct a ranking evaluation in addition to the overall performance shown in Table III. Specifically, we generate the scores of Precision@ and Recall@ with based on articles with the highest prediction probabilities. The results are displayed in Table IV. We can observe that PCNN again leads to the highest Precision and Recall scores. Although the Recall scores are quite low, the superiority of PCNN is still maintained. In other words, PCNN can effectively find self-contradiction articles by placing them at the top of the ranking.
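The ranking metrics can be sketched as follows, writing the cutoff as `k`. The scores and labels below are toy values, not results from Table IV.

```python
def precision_recall_at_k(scores, labels, k):
    """Precision@k and Recall@k for articles ranked by predicted probability.

    scores: predicted self-contradiction probabilities, one per article.
    labels: ground-truth labels (1 = self-contradiction, 0 = otherwise).
    """
    ranked = sorted(zip(scores, labels), key=lambda item: item[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    total_positive = sum(labels)
    precision = hits / k
    recall = hits / total_positive if total_positive else 0.0
    return precision, recall

# Toy example: 8 articles, 4 of which are truly self-contradictory.
toy_scores = [0.95, 0.80, 0.75, 0.60, 0.40, 0.30, 0.20, 0.10]
toy_labels = [1, 0, 1, 1, 0, 1, 0, 0]
p_at_3, r_at_3 = precision_recall_at_k(toy_scores, toy_labels, k=3)
```

Recall@k is bounded above by k divided by the number of positive articles, so low Recall values are expected whenever the cutoff is small relative to the number of self-contradiction articles in the test set.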
Ablation Study. We conduct an ablation study to examine whether every component in PCNN does take effect. By removing each component from the full PCNN model (Full Model), we report the performance in terms of four metrics. There are five components to be investigated in this ablation study, as listed below.
Full model without the self-attention layer (w/o SA): utilizing simple concatenation to fuse all selected sentence pairs’ embeddings, rather than the self-attention layer.
Full model without pairwise contradiction learning (w/o PCL): replacing pairwise contradiction learning with simply concatenating the embeddings of two sentences, i.e., no contradiction comparison between sentences.
Full model without Sentence-BERT (w/o SBERT): using LSTM to replace Sentence-BERT in learning sentence representations.
Full model without the selection of top-k contradiction sentence pairs (w/o TopPair): employing all sentence pairs for self-attention and prediction, without filtering out the less contradictory ones.
Full model without considering each paragraph for PCL (w/o Paragraph): utilizing all sentence pairs in the article for PCL, instead of applying paragraph-level PCL.
The performance comparison of the ablation study is presented in Figure 3. The results bring several findings. First, every component in the proposed PCNN, except for the paragraph-level PCL, contributes to the performance. Nevertheless, dividing an article into paragraphs for PCL can still lower the computational cost, i.e., improve the learning time efficiency, since PCL can be performed in parallel over paragraphs. Second, among all components, PCL brings the most significant performance improvement. This result verifies that the semantic and contradiction comparison between sentences can effectively capture conflicts. Third, the Sentence-BERT component and the selection of top-k sentence pairs are also useful, which shows the importance of sentence representation learning and of eliminating non-conflicting sentence pairs. We believe that future advances in sentence representation learning and a finer-grained determination of top-k can further improve the performance. Last, the self-attention layer has a relatively minor contribution, which suggests that the semantic modeling between “sentences” is more influential than that between “sentence pairs.” In summary, the results of the ablation study provide guidance on where future extensions can improve the detection of self-contradiction articles.
Top-k Sentence Pairs. PCNN requires selecting the top-k contradiction sentence pairs to classify whether an article is self-contradictory. We examine how the number of sentence pairs affects the performance. By varying numbers in PCNN, , the results on Precision, Recall, and F1 are shown in Figure 4. Based on the resultant scores, PCNN performs better when is around to . This implies that a moderate selection of contradiction sentence pairs benefits the performance. A very limited number (e.g., or ) cannot collect all the evidence of contradiction between sentences. Selecting too many sentence pairs (e.g., or ) can include non-contradiction sentence pairs, and thus damage the performance. We think a moderate number is reasonable since a sentence is unlikely to contradict many other sentences in an article.
|GT=1 and Pred=1|
|Article 1||Title: Athanasius II of Constantinople|
|s1||He supposedly served from 1450 to 1453|
|s2||Athanasius II of Constantinople In office 1451 – 1453|
|Article 2||Title: The Silent Scream (1979 film)|
|s1||The film was released theatrically by American Cinema Releasing in limited theaters in November 23, 1979 in Victor, Texas, and in January 30, 1980 in Bismarck, North Dakota.|
|s2||The film is released 1980/1/18|
|GT=1 and Pred=0|
|Article 1||Title: Embassy of Cambodia in Washington, D.C.|
|s1||His Excellency Sounry Chum is the current Cambodian Ambassador to the United States, and was appointed to the role in 2018.|
|s2||Royal Embassy of Cambodia in Washington, D.C is the diplomatic mission of the Kingdom.|
|Article 2||Title: Arthur Desmond|
|s1||As with most aspects of Arthur Desmond’s life, his birth statistics are problematic, and Arthur Desmond spent his adult life concealing his origins as well as his identity.|
|s2||Whatever his real origins, the first concrete evidence of Arthur Desmond’s life comes when he stood for parliament in Hawke’s Bay.|
|GT=0 and Pred=1|
|Article 1||Title: Calcio Fiorentino|
|s1||They try to pin and force into submission as many players possible. Once there are enough incapacitated players, the other teammates come and swoop up the ball and head to the goal.|
|s2||It is also prohibited for more than one player to attack an opponent. Any violation leads to being expelled from the game.|
|Article 2||Title: House of Terror (1960 film)|
|s1||Casimiro (Tin Tan), the night watchman at a wax museum of horrors, has been napping more frequently on the job because his boss, Professor Sebastian (Yerye Beirute).|
|s2||As he struggles to awareness, the clouds outside part, the full moon shines on his face through a window, and the resurrected corpse transforms into a werewolf.|
Imbalance Analysis. We further investigate whether the proposed PCNN is robust to the imbalance between self-contradiction (P) and non-self-contradiction (N) articles. Fixing the training percentage at 80%, we vary the imbalance ratio and report the performance comparison in Table V. The results bring a few findings. First, although a more imbalanced setting (e.g., P:N=10%:90%) causes a performance drop across all methods, PCNN still consistently outperforms the competing models. This shows the superiority of PCNN when facing class imbalance. Second, although the Precision scores are quite low in the imbalanced setting of P:N=10%:90%, the corresponding Recall scores remain similar to those in P:N=30%:70% and P:N=50%:50%. The results indicate that, with respect to detecting all self-contradiction articles, PCNN remains effective in the imbalanced setting, where the number of self-contradiction articles is small.
V-C Case Study
To validate whether PCNN can highlight the most contradictory sentence pairs in an article, we conduct a case study. We consider three scenarios: one in which both the ground truth (GT) and the PCNN prediction (Pred) are self-contradictory, and two in which the ground truth and the PCNN prediction disagree. We report two articles for each scenario, together with the most contradictory sentence pair highlighted by PCNN, in Table VI. For the scenario “GT=1 and Pred=1”, PCNN correctly identifies the pair of sentences that contradict one another. For the scenario “GT=1 and Pred=0”, PCNN predicts Article 1 as non-self-contradiction since it considers “Cambodian Ambassador to the United States” in s1 to be semantically consistent with “Embassy of Cambodia in Washington, D.C” in s2. In Article 2, PCNN likewise judges that “birth statistics are problematic” in s1 does not contradict “Whatever his real origins” in s2. For the scenario “GT=0 and Pred=1”, in Article 1, PCNN is confused by “submission as many players possible” in s1 and “prohibited for more than one player” in s2, i.e., PCNN considers them contradictory. In addition, PCNN considers Article 2 self-contradictory because “napping more frequently” appears to contradict “struggles to awareness”. In summary, although PCNN can make incorrect classifications, the sentence pairs it highlights reasonably explain its predictions. That is, PCNN can identify inconsistent parts between sentences even when the article is not self-contradictory.
We discuss several issues regarding the proposed self-contradiction detection model, PCNN. Possible extensions of PCNN can be summarized in the following two points.
Pre-Training PCL. The current PCNN utilizes datasets on textual entailment, i.e., SNLI and MNLI, to pre-train the pairwise contradiction learning (PCL) module. We think that pre-training PCL on more datasets with a contradiction label can cover more diverse aspects of contradiction, and thus benefit the fine-tuning for self-contradiction article detection. Datasets with contradictory sentence pairs employed by the relevant studies, as listed in Table II, are candidates for extending the coverage of predictable self-contradiction topics.
Article-level Contradiction. PCNN is currently designed to model whether any sentence pair is contradictory, i.e., sentence-level contradiction. In fact, PCNN can serve as a framework for detecting article-level contradiction. Specifically, given a corpus of articles, PCNN can be extended with moderate changes to detect whether two articles contradict each other, by changing “sentence pairs” to “article pairs” in the encoders of PCNN.
There are two limitations in the current PCNN model. One is the subjectivity of self-contradiction, and the other is the inability to deal with long documents.
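For the pre-training extension discussed above, one possible way to pool several contradiction-labelled corpora is sketched below. It assumes records in the SNLI/MNLI JSONL layout (`sentence1`, `sentence2`, `gold_label`) and a binary contradiction versus non-contradiction target; whether PCL is pre-trained with a binary or a three-way objective is not stated here, so the binary mapping is an assumption, as are the function names.

```python
VALID_LABELS = {"entailment", "neutral", "contradiction"}


def to_binary_contradiction(examples):
    """Map three-way NLI records to (sentence1, sentence2, is_contradiction)."""
    triples = []
    for ex in examples:
        label = ex["gold_label"]
        if label not in VALID_LABELS:
            # Skip records without a usable label (e.g. "-" in SNLI,
            # which marks pairs with no annotator consensus).
            continue
        triples.append((ex["sentence1"], ex["sentence2"],
                        int(label == "contradiction")))
    return triples


def merge_corpora(*corpora):
    """Pool several corpora, keeping the first occurrence of each sentence pair."""
    merged, seen = [], set()
    for corpus in corpora:
        for s1, s2, y in to_binary_contradiction(corpus):
            if (s1, s2) not in seen:
                seen.add((s1, s2))
                merged.append((s1, s2, y))
    return merged


if __name__ == "__main__":
    # Toy records for illustration only; they are not drawn from any dataset.
    corpus_a = [
        {"sentence1": "A man sleeps.", "sentence2": "A man is awake.",
         "gold_label": "contradiction"},
        {"sentence1": "A dog runs.", "sentence2": "An animal moves.",
         "gold_label": "entailment"},
        {"sentence1": "A child plays.", "sentence2": "It is sunny.",
         "gold_label": "-"},
    ]
    corpus_b = [
        {"sentence1": "A man sleeps.", "sentence2": "A man is awake.",
         "gold_label": "contradiction"},
        {"sentence1": "The shop is open.", "sentence2": "The shop is closed.",
         "gold_label": "contradiction"},
    ]
    for triple in merge_corpora(corpus_a, corpus_b):
        print(triple)
```

Datasets from Table II that use a different schema would first need to be converted to this record layout.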
Subjectivity. Determining whether an article contains contradictory elements is subjective [19, 46]. Although experienced Wikipedia editors apply the “Self-contradictory” template to articles, it is unknown how the editors’ subjectivity affects the resulting self-contradiction labels. Because PCNN relies heavily on these labels to fine-tune the PCL network, the degree of annotation subjectivity can, to some extent, influence how PCNN models contradiction and hence its detection performance. However, PCNN does not currently address the subjectivity issue. We believe that mitigating the subjectivity of annotated self-contradictions in future extensions of PCNN will help improve performance.
Long Documents. In the pairwise contradiction learning of PCNN, we only model pairs of sentences within the same paragraph. In long documents, however, contradictions can arise between sentences in different paragraphs, so PCNN can fail to detect self-contradiction in long articles. To deal with this issue, one may resort to a long-document transformer, such as Longformer, to process long sequences of sentences and paragraphs in the encoding parts of PCNN.
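The long-document limitation can be made concrete by counting candidate sentence pairs. The sketch below contrasts pairs restricted to a single paragraph with all pairs in the article; the paragraph sizes and function names are illustrative assumptions.

```python
from itertools import combinations


def within_paragraph_pairs(paragraphs):
    """Index pairs ((p, i), (p, j)) of sentences in the same paragraph."""
    pairs = []
    for p, sentences in enumerate(paragraphs):
        for i, j in combinations(range(len(sentences)), 2):
            pairs.append(((p, i), (p, j)))
    return pairs


def all_sentence_pairs(paragraphs):
    """Index pairs over every sentence in the article, across paragraphs too."""
    flat = [(p, i)
            for p, sentences in enumerate(paragraphs)
            for i in range(len(sentences))]
    return list(combinations(flat, 2))


if __name__ == "__main__":
    # A toy article with paragraphs of 3, 2, and 4 sentences.
    article = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3", "c4"]]
    within = within_paragraph_pairs(article)
    everything = all_sentence_pairs(article)
    print("within-paragraph pairs:", len(within))
    print("all pairs:", len(everything))
    print("cross-paragraph pairs not examined:", len(everything) - len(within))
```

For this toy article, 10 of the 36 possible pairs fall within a paragraph, so 26 cross-paragraph pairs are never examined. The total number of pairs grows quadratically with the number of sentences, which is one reason an encoder designed for long inputs may be preferable to exhaustive pairing.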
PCNN can be directly utilized for two applications that concern contradiction or inconsistency between texts. The first is to detect incongruence between a news headline and its body text. This task can be performed by combining the news headline and body text into a single article, which can be fed directly into PCNN for incongruence detection. The second is to detect whether an article contradicts other articles in Wikipedia. We can collect the Wikipedia articles that contradict other articles through the “Contradicts others” template (https://en.wikipedia.org/wiki/Template:Contradicts_others) and its category page (https://en.wikipedia.org/wiki/Category:Articles_contradicting_other_articles). By merging each pair of articles into one long article and treating it as the input of PCNN, we can classify whether it contains a contradiction.
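The merging step used by both applications can be sketched as follows. Each sentence is tagged with the index of its source so that, if desired, candidate pairs can be restricted to those spanning two sources. This restriction is an assumption added for illustration and is not described as part of PCNN; the function names are hypothetical.

```python
from itertools import combinations


def merge_as_article(*sources):
    """Concatenate sources (each a list of sentences) into one article,
    tagging every sentence with the index of the source it came from."""
    return [(src, sentence)
            for src, sentences in enumerate(sources)
            for sentence in sentences]


def cross_source_pairs(tagged_sentences):
    """Sentence pairs whose two members come from different sources."""
    return [(a[1], b[1])
            for a, b in combinations(tagged_sentences, 2)
            if a[0] != b[0]]


if __name__ == "__main__":
    # Headline versus body text: the headline is a one-sentence source.
    headline = ["City council approves new park"]
    body = ["The council met on Monday.",
            "Members voted against the park proposal.",
            "A second vote is planned."]
    merged = merge_as_article(headline, body)
    for s1, s2 in cross_source_pairs(merged):
        print(repr(s1), "<->", repr(s2))
```

The same two functions apply to a pair of Wikipedia articles by passing each article's sentence list as a source.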
In this paper, we define the task of finding self-contradiction articles in Wikipedia and develop a solution for it. We create the first dataset for this task, WikiContradiction. We also present the first model, Pairwise Contradiction Neural Network (PCNN). The most essential component of PCNN is pairwise contradiction learning, which is pre-trained on the SNLI and MNLI datasets and fine-tuned on our dataset. The empirical results exhibit promising performance of PCNN, and the case study shows the explainability of the model. We believe this work can serve as a pioneering study on self-contradiction article detection. The compiled WikiContradiction dataset can be a training resource for improving the quality of Wikipedia articles, and can further contribute to fact-checking and claim verification. The experimental results also point out where future work can improve, including sentence representation learning, pairwise contradiction reasoning, and finer-grained selection of sentence pairs.
This work is supported by the Ministry of Science and Technology (MOST) of Taiwan under grants 110-2221-E-006-136-MY3, 110-2221-E-006-001, and 110-2634-F-002-051.
-  (2019) Rethinking complex neural network architectures for document classification. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4046–4051. Cited by: §II.
-  (2012) A survey of text classification algorithms. In Mining Text Data, C. C. Aggarwal and C. Zhai (Eds.), pp. 163–222. Cited by: §II.
-  (2015) Automatic identification of potentially contradictory claims to support systematic reviews. In Proceedings of the 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), BIBM ’15, pp. 930–937. Cited by: TABLE II, §I, §II.
-  (2020) Learning to retrieve reasoning paths over wikipedia graph for question answering. In International Conference on Learning Representations, ICLR ’20. Cited by: §I.
-  (2019) MultiFC: a real-world multi-domain dataset for evidence-based fact checking of claims. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4685–4697. Cited by: §I.
-  (2018) Predicting contradiction intensity: low, strong or very strong?. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, pp. 1125–1128. Cited by: TABLE II, §I, §II.
-  (2016) WikiWrite: generating wikipedia articles automatically. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI’16, pp. 2740–2746. Cited by: §IV-B.
-  (2020) Longformer: the long-document transformer. arXiv:2004.05150. Cited by: 2nd item.
-  (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632–642. Cited by: §IV-B.
-  (2020) Detecting fake news in social media: an asia-pacific perspective. Commun. ACM 63 (4), pp. 68–71. Cited by: §I.
-  (2017) Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1657–1668. Cited by: §II.
-  (2013) Recognizing textual entailment: models and applications. Synthesis Lectures on Human Language Technologies 6 (4), pp. 1–220. Cited by: §II.
-  (2008) Finding contradictions in text. In Proceedings of ACL-08: HLT, pp. 1039–1047. Cited by: §I.
-  (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §II, §II, §IV-A, §V-A.
-  (2019) Wizard of wikipedia: knowledge-powered conversational agents. In International Conference on Learning Representations, ICLR ’19. Cited by: §I.
-  (2009) Wikipedia-based semantic interpretation for natural language processing. J. Artif. Int. Res. 34 (1), pp. 443–498. Cited by: §IV-B.
-  (2018) Natural language inference over interaction space. In International Conference on Learning Representations, Cited by: §II.
-  (1997) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. Cited by: §II, §V-A.
-  (2010) Subjective discriminability of invisibility: a framework for distinguishing perceptual and attentional failures of awareness. Consciousness and Cognition 19 (4), pp. 1045–1057. Cited by: 1st item.
-  (2014) Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751. Cited by: §II.
-  (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §IV-C.
-  (2019) Text classification algorithms: a survey. Information 10 (4). Cited by: §II.
-  (2014) Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 32, pp. 1188–1196. Cited by: §II.
-  (2016) Monolingual social media datasets for detecting contradiction and entailment. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp. 4602–4605. Cited by: TABLE II, §I, §II.
-  (2016) Contradiction detection for rumorous claims. In Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics (ExProM), pp. 31–40. Cited by: TABLE II, §I, §II.
-  (2018) A computational approach to finding contradictions in user opinionated text. In Proceedings of the 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’18, pp. 351–356. Cited by: TABLE II, §I, §II.
-  (2017) Contradiction detection with contradiction-specific word embedding. Algorithms 10 (2). Cited by: TABLE II, §I, §II.
-  (2018) Generating wikipedia by summarizing long sequences. In International Conference on Learning Representations (ICLR), Cited by: §IV-B.
-  (2019) RoBERTa: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §II, §II, §IV-A.
-  (2013) Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, pp. 3111–3119. Cited by: §II.
-  (1995-11) WordNet: a lexical database for english. Commun. ACM 38 (11), pp. 39–41. Cited by: §II.
-  (2021-04) Deep learning–based text classification: a comprehensive review. ACM Comput. Surv. 54 (3). Cited by: §II.
-  (2020) Layered graph embedding for entity recommendation using wikipedia in the yahoo! knowledge graph. In Companion Proceedings of the Web Conference 2020, WWW ’20, pp. 811–818. Cited by: §I.
-  (2016) A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2249–2255. Cited by: §II.
-  (2019) Citation needed: a taxonomy and algorithmic assessment of wikipedia’s verifiability. In The World Wide Web Conference, WWW ’19, pp. 1567–1578. Cited by: §I, §III.
-  (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992. Cited by: §IV-A.
-  (2016) Reasoning about entailment with neural attention. In International Conference on Learning Representations (ICLR), Cited by: §II.
-  (2019) Online disinformation and the role of wikipedia. arXiv:1910.12596. Cited by: §I.
-  (2020) Automated fact-checking of claims from Wikipedia. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 6874–6882. Cited by: §I.
-  VerbNet: a broad-coverage, comprehensive verb lexicon. Ph.D. Thesis, University of Pennsylvania. Cited by: §II.
-  (2017) Fake news detection on social media: a data mining perspective. SIGKDD Explor. Newsl. 19 (1), pp. 22–36. Cited by: §I.
-  (2018) Recognizing and justifying text entailment through distributional navigation on definition graphs.. In AAAI, pp. 4913–4920. Cited by: §II.
-  (2019) Recognizing conflict opinions in aspect-level sentiment classification with dual attention networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3426–3431. Cited by: TABLE II, §I, §II.
-  (2018) FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 809–819. Cited by: §I.
-  (2021) WikiCheck: an end-to-end open source automatic fact-checking api based on wikipedia. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 4155–4164. Cited by: §I.
-  (2012) Survey on mining subjective data on the web. Data Mining and Knowledge Discovery 24, pp. 478–514. Cited by: 1st item.
-  (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30, pp. 5998–6008. Cited by: §II, §IV-C.
-  (2008) An divide-and-conquer strategy for recognizing textual entailment. In Proc. of the Text Analysis Conference, Cited by: §II.
-  (2007) Recognizing textual entailment using a subsequence kernel method. In AAAI International Conference on Artificial Intelligence, AAAI, pp. 937–942. Cited by: §II.
-  (2015) Identifying duplicate and contradictory information in wikipedia. In Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’15, pp. 57–60. Cited by: §I.
-  (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. Cited by: §IV-B.
-  (2021) Wiki-reliability: a large scale dataset for content reliability on wikipedia. To appear in the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’21). Cited by: §III.
-  (2019) XLNet: generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, Vol. 32, pp. 5753–5763. Cited by: §II.
-  (2016) Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489. Cited by: §II, §V-A.
-  (2021) Learning to detect incongruence in news headline and body text via a graph neural network. IEEE Access 9, pp. 36195–36206. Cited by: §VI.
-  (2017) A transformation-driven approach for recognizing textual entailment. Natural Language Engineering 23 (4), pp. 507–534. Cited by: §II.
-  (2017) A context-enriched neural network method for recognizing lexical entailment. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §II.
-  (2015) Character-level convolutional networks for text classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pp. 649–657. Cited by: §II.
-  (2020) A survey of fake news: fundamental theories, detection methods, and opportunities. ACM Comput. Surv. 53 (5). Cited by: §I.