
Exploring Decomposition for Table-based Fact Verification

Fact verification based on structured data is challenging as it requires models to understand both natural language and symbolic operations performed over tables. Although pre-trained language models have demonstrated a strong capability in verifying simple statements, they struggle with complex statements that involve multiple operations. In this paper, we improve fact verification by decomposing complex statements into simpler subproblems. Leveraging the programs synthesized by a weakly supervised semantic parser, we propose a program-guided approach to constructing a pseudo dataset for decomposition model training. The subproblems, together with their predicted answers, serve as the intermediate evidence to enhance our fact verification model. Experiments show that our proposed approach achieves the new state-of-the-art performance, an 82.7% accuracy, on the TabFact benchmark.


1 Introduction

Fact verification aims to validate whether a statement is entailed or refuted by given evidence. It has become crucial to many applications such as detecting fake news and rumors Rashkin et al. (2017); Thorne et al. (2018); Goodrich et al. (2019); Vaibhav et al. (2019); Kryscinski et al. (2020). While existing research mainly focuses on verification based on unstructured text Hanselowski et al. (2018); Yoneda et al. (2018); Liu et al. (2020); Nie et al. (2019), a recent trend is to explore structured data as evidence, which is ubiquitous in daily life.


Figure 1: Overview of the proposed approach, including an example of an executable program parsed from the statement.

Verification over structured data presents research challenges of fundamental interest, as it involves both informal inference based on language understanding and symbolic operations such as mathematical operations (e.g., count and max). While all statements share the same set of operations, complex statements, which involve multiple operations, are more challenging than simple ones. Pre-trained models such as BERT Devlin et al. (2019) have shown strong performance on verifying simple statements but still struggle with complex ones: a performance gap exists between the simple and complex tracks Chen et al. (2020).

In this paper, we propose to decompose complex statements into simpler subproblems to improve table-based fact verification, as shown in the simplified example in Figure 1. To avoid manually annotating gold decompositions, we design a program-guided pipeline that collects pseudo decompositions for training generation models, distinguishing four major decomposition types and designing templates accordingly. The programs we use are parsed from statements by a weakly supervised parser trained with signals from the final verification labels; Figure 1 shows a statement-program example. We adapt table-based natural language understanding systems to solve the decomposed subproblems. After obtaining the answers to the subproblems, we combine them in a pairwise manner as intermediate evidence to support the final prediction.

We perform experiments on the recently proposed TabFact benchmark Chen et al. (2020) and achieve a new state-of-the-art performance of 82.7% accuracy. Further studies are conducted to provide details on how the proposed models work.

2 Method

2.1 Task Formulation and Notations

Given an evidence table T and a statement S, we aim to predict whether T entails or refutes S, denoted by the label y. For each statement S, the executable program derived from a semantic parser is denoted as z; an example program is given in Figure 1. Each program consists of multiple symbolic operations, and each operation contains an operator (e.g., max) and arguments (e.g., all_rows and attendance). A complex statement S can be decomposed into subproblems d_1, ..., d_n, with answers a_1, ..., a_n. Using the combined problem-answer pairs E = {e_1, ..., e_n}, where e_i = (d_i, a_i), as intermediate evidence, our model maximizes the objective p(y | S, T, E).
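
To make the notation concrete, here is a minimal sketch of how a single instance could be represented in code; the field names are illustrative and are not taken from our released implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class VerificationInstance:
    """One table-based fact verification example (illustrative field names).

    Mirrors the notation of Section 2.1: a table T, a statement S, a parsed
    program z, subproblems d_1..d_n with answers a_1..a_n, and the label y.
    """
    table: List[List[str]]          # T: rows of cell strings, header row first
    statement: str                  # S: the statement to verify
    program: str                    # z: executable program from the semantic parser
    subproblems: List[str] = field(default_factory=list)  # d_1..d_n
    answers: List[str] = field(default_factory=list)      # a_1..a_n
    label: int = 0                  # y: 1 = entailed, 0 = refuted

    def evidence(self) -> List[Tuple[str, str]]:
        """Pairwise intermediate evidence e_i = (d_i, a_i)."""
        return list(zip(self.subproblems, self.answers))
```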

2.2 Statement Decomposition

Constructing a high-quality dataset is key to training the decomposition model. Since semantic parsers can map statements into executable programs that not only capture the semantics but also reveal the compositional structures of the statements, we propose a program-guided pipeline to construct a pseudo decomposition dataset.

2.2.1 Constructing Pseudo Decompositions

Program Acquisition.

Following Chen et al. (2020), we use the latent program algorithm (LPA) to parse each statement into a set of candidate programs. To select the most semantically consistent program among all candidates and mitigate the impact of spurious programs, we follow Yang et al. (2020) and optimize the program selection model with a margin loss, which is detailed in Appendix A.1.

By further removing programs that are label-inconsistent or that cannot be split into two isolated sub-programs from the root operator, we obtain the remaining (statement, table, program) triples as the source for data construction. (These triples do not involve any tables or statements in the dev/test sets of the dataset used in this paper.)

Decomposition Templates.

Programs are formal, unambiguous meaning representations of the corresponding statements. Designed to support automated inference, a program encodes the central features of the statement and reveals its compositional structure. Our statement decomposition is based on the structure of the program. Specifically, we first extract the program skeleton by omitting the arguments in the selected program z, and then group the triples by skeleton to identify four major decomposition types: conjunction, comparative, superlative, and uniqueness. (The conjunction type overlaps with the other three types in cases where the sub-statements connected by conjunctions can be further decomposed.)
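
As an illustration of skeleton extraction, the sketch below assumes LPA-style programs written as nested operator strings (e.g., eq { hop { argmax { all_rows ; attendance } ; venue } ; firhill }); the operator vocabulary and the type-assignment rules are simplified assumptions rather than the full LPA grammar.

```python
import re

# Toy operator vocabulary; the real LPA grammar contains many more operators.
OPERATORS = {"eq", "greater", "less", "count", "max", "min", "argmax", "argmin",
             "hop", "and", "only", "filter_eq"}

def program_skeleton(program: str) -> str:
    """Drop arguments and keep only the nested operator structure, e.g.
    "eq { hop { argmax { all_rows ; attendance } ; venue } ; firhill }"
    -> "eq { hop { argmax } }"."""
    tokens = re.findall(r"[\w_]+|[{};]", program)
    kept = [t for t in tokens if t in OPERATORS or t in "{}"]
    skeleton = " ".join(kept)
    prev = None
    while prev != skeleton:                      # collapse braces left empty
        prev = skeleton
        skeleton = skeleton.replace("{ }", "").replace("  ", " ").strip()
    return skeleton

def decomposition_type(skeleton: str) -> str:
    """Very rough grouping of skeletons into decomposition types (illustrative only)."""
    root = skeleton.split()[0] if skeleton else ""
    if root == "and":
        return "conjunction"
    if root in {"greater", "less"}:
        return "comparative"
    if "argmax" in skeleton or "argmin" in skeleton or root in {"max", "min"}:
        return "superlative"
    if "only" in skeleton:
        return "uniqueness"
    return "atomic"
```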

We design simple templates for each decomposition type that contain instructions on how to decompose the statement; this manual process takes only a few hours. In this way, we can construct pseudo decompositions, including sub-statements and sub-questions, by filling the slots in the templates according to the original statements or program arguments. Templates and decomposition examples can be found in Figure 2. Each sample in our constructed pseudo dataset is denoted as a triple (S, t, D), where t indicates one of the four decomposition types and D is a sequence of pseudo decompositions.


Figure 2: Decomposition templates.
Data Augmentation.

With the (S, t, D) triples, we perform data augmentation. Since some entity mentions in S and D can be linked to cells in the table T, we can randomly replace the linked entities in S and D with different values from the same column of T. For example, in Figure 1, we can replace the linked entity “firhill” with another randomly selected entity “cappielow”. Another augmentation strategy is inverting superlatives and comparatives: for examples belonging to the superlative and comparative types, we replace the original superlative or comparative word in the statements with its antonym, such as higher to lower and longest to shortest. In this way, we generate another 3k pseudo statement-decomposition pairs. In total, the final decomposition dataset used for training the generation model includes 9,696 samples. More statistics are available in Appendix A.2.
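
The two augmentation strategies can be sketched as follows; the helper functions and the antonym list are simplified illustrations.

```python
import random
from typing import List, Tuple

# Illustrative antonym pairs for inverting superlatives and comparatives.
ANTONYMS = {"higher": "lower", "lower": "higher",
            "longest": "shortest", "shortest": "longest"}

def replace_linked_entity(statement: str, decomps: List[str],
                          column_values: List[str], entity: str) -> Tuple[str, List[str]]:
    """Swap an entity linked to a table cell (e.g. "firhill") with another
    value randomly drawn from the same column (e.g. "cappielow")."""
    candidates = [v for v in column_values if v != entity]
    if not candidates:
        return statement, decomps
    new_entity = random.choice(candidates)
    return (statement.replace(entity, new_entity),
            [d.replace(entity, new_entity) for d in decomps])

def invert_comparison(statement: str, decomps: List[str]) -> Tuple[str, List[str]]:
    """Flip a superlative/comparative word to its antonym (e.g. higher -> lower)
    in the statement and its pseudo decompositions."""
    flip = lambda text: " ".join(ANTONYMS.get(w, w) for w in text.split())
    return flip(statement), [flip(d) for d in decomps]
```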

2.2.2 Learning to Decompose

Decomposition Type Detection.

Given a statement S, we train a five-way classifier based on BERT to identify whether the statement is decomposable and, if so, which decomposition type it belongs to. In addition to the four types mentioned in the previous section, we add an atomic category by including additional non-decomposable samples. Only statements not assigned the atomic label are used for decomposition.
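
A minimal sketch of the type detector, assuming a standard BERT sequence classifier with five labels; the generic bert-base-uncased checkpoint below would first need fine-tuning on our pseudo dataset.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

# Five-way decomposition-type classification (label order is illustrative).
TYPES = ["conjunction", "comparative", "superlative", "uniqueness", "atomic"]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(TYPES))

def predict_type(statement: str) -> str:
    """Return the predicted decomposition type for a statement;
    statements predicted as "atomic" are left undecomposed."""
    inputs = tokenizer(statement, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return TYPES[int(logits.argmax(dim=-1))]
```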

Decomposition Model.

We fine-tune GPT-2 Radford et al. (2019) on the pseudo dataset for decomposition generation. Specifically, given an (S, t, D) triple, we train the model by maximizing the likelihood p(D | S, t). We provide the model with the gold decomposition type during training and the predicted type during testing. Only informative and well-formed decompositions are involved in the subsequent process to enhance downstream verification. If some sub-statements need further decomposition, they can be resent through our pipeline. (In most cases, there is no need to perform iterative decomposition, and we leave finer-grained decomposition for future research.)
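
A minimal sketch of the generation setup; the control-token format (type tag and separator) is an assumption used only for illustration and is not necessarily the serialization in our implementation.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")

def build_example(statement: str, decomp_type: str, decomps: list) -> str:
    # Assumed serialization: <type> statement [SEP] d_1 ; d_2 <eos>
    return f"<{decomp_type}> {statement} [SEP] " + " ; ".join(decomps) + tokenizer.eos_token

def lm_loss(text: str) -> torch.Tensor:
    """Causal LM loss whose minimization maximizes the likelihood of the decomposition."""
    enc = tokenizer(text, return_tensors="pt")
    return model(**enc, labels=enc["input_ids"]).loss

def generate_decomposition(statement: str, decomp_type: str) -> str:
    """Condition on the (predicted) type and the statement, then decode."""
    prompt = f"<{decomp_type}> {statement} [SEP]"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=64, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
```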

2.3 Solving Subproblems

We adapt TAPAS Eisenschlos et al. (2020), a state-of-the-art model for table-based fact verification and question answering, to solve the decomposed subproblems. Verifying sub-statements is formulated as binary classification with a TAPAS model fine-tuned on the TabFact dataset Chen et al. (2020). To answer each sub-question, we use TAPAS fine-tuned on the WikiTableQuestions dataset Pasupat and Liang (2015). We combine the subproblems and their answers in a pairwise manner to obtain the intermediate evidence E; an example is shown in Figure 1.
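
A sketch of how the two TAPAS models can be invoked with the Hugging Face transformers library; the public google/tapas-base-finetuned-tabfact and google/tapas-base-finetuned-wtq checkpoints are used here as stand-ins for the fine-tuned models described above.

```python
import pandas as pd
import torch
from transformers import TapasForSequenceClassification, TapasTokenizer, pipeline

# Stand-in checkpoints; table cells must be strings for the TAPAS tokenizer.
verifier_tok = TapasTokenizer.from_pretrained("google/tapas-base-finetuned-tabfact")
verifier = TapasForSequenceClassification.from_pretrained(
    "google/tapas-base-finetuned-tabfact")
qa = pipeline("table-question-answering", model="google/tapas-base-finetuned-wtq")

def verify_substatement(table: pd.DataFrame, sub_statement: str) -> bool:
    """Binary verification of a decomposed sub-statement against the table."""
    inputs = verifier_tok(table=table.astype(str), queries=sub_statement,
                          return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = verifier(**inputs).logits
    return bool(logits.argmax(dim=-1).item())   # 1 = entailed, 0 = refuted

def answer_subquestion(table: pd.DataFrame, sub_question: str) -> str:
    """Answer a decomposed sub-question with the QA variant of TAPAS."""
    return qa(table=table.astype(str), query=sub_question)["answer"]
```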

2.4 Recombining Intermediate Evidence

Downstream tasks can utilize the intermediate evidence in various ways. In this paper, we train a model to fuse the evidence together with the statement and table for table-based fact verification. (For non-decomposable statements, we use “no evidence” as the placeholder.) Specifically, we jointly encode S and T with TAPAS to obtain a joint statement-table representation. We encode the multiple evidence sentences with another TAPAS, following the document-level encoder proposed in Liu and Lapata (2019): a [CLS] token is inserted at the beginning of every sentence, and the corresponding [CLS] embedding in the final layer represents that evidence sentence.

We employ a gated attention model to obtain an aggregated evidence representation and predict the final label, where the gating and output layers use trainable parameters, the sigmoid function provides the gate, and the statement-table representation is concatenated with the aggregated evidence before classification.
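
Because the corresponding equations were lost in extraction, the following PyTorch sketch shows one plausible gated-attention fusion consistent with the description above (a sigmoid gate computed from the concatenation of the statement-table representation with each evidence embedding); the exact formulation in our model may differ.

```python
import torch
import torch.nn as nn

class GatedEvidenceFusion(nn.Module):
    """Gate each evidence embedding against the statement-table representation,
    aggregate, and classify (a sketch; layer sizes and pooling are assumptions)."""

    def __init__(self, hidden: int, num_labels: int = 2):
        super().__init__()
        self.gate = nn.Linear(2 * hidden, hidden)        # gate from [h_ST ; e_i]
        self.classifier = nn.Linear(2 * hidden, num_labels)

    def forward(self, h_st: torch.Tensor, evidence: torch.Tensor) -> torch.Tensor:
        # h_st: (batch, hidden) joint statement-table representation
        # evidence: (batch, n_evidence, hidden) [CLS] embeddings of evidence sentences
        n = evidence.size(1)
        expanded = h_st.unsqueeze(1).expand(-1, n, -1)
        gates = torch.sigmoid(self.gate(torch.cat([expanded, evidence], dim=-1)))
        aggregated = (gates * evidence).mean(dim=1)      # aggregated evidence vector
        return self.classifier(torch.cat([h_st, aggregated], dim=-1))
```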

3 Experiments

Setup.

We conduct our experiments on TabFact Chen et al. (2020), a large-scale table-based fact verification benchmark. The test set contains a simple and a complex subset split by difficulty, and a small test set is further annotated with human performance. Following previous work, we use accuracy as the evaluation metric. Details of the data are listed in Appendix A.3.

Implementation Details.

When fine-tuning the GPT-2 model to generate decompositions, we run the model with a batch size of 5 for 30 epochs using the Adam optimizer Kingma and Ba (2015) with a learning rate of 2e-6. We optimize the model for final verification prediction using the Adam optimizer with a learning rate of 2e-5 and a batch size of 16; it usually takes 11 to 14 epochs to converge. Our code is available at https://github.com/arielsho/Decomposition-Table-Reasoning.

Model Val Test Simple Complex Small
Human - - - - 92.1
LPA 57.7 58.2 68.5 53.2 61.5
Table-BERT 66.1 65.1 79.1 58.2 68.1
LogicalFactChecker 71.8 71.7 85.4 65.1 74.3
HeterTFV 72.5 72.3 85.9 65.7 74.2
SAT 73.3 73.2 85.5 67.2 -
ProgVGAT 74.9 74.4 88.3 67.6 76.2
TAPAS-Base 79.1 79.1 91.4 73.1 81.2
TAPAS-Large 81.5 81.2 93.0 75.5 84.1
Ours-Base 80.8 80.7 91.9 75.1 82.5
Ours-Large 82.7 82.7 93.6 77.4 84.7
Table 1: The accuracy (%) of models on TabFact.
Type TAPAS-Base Ours-Base
Conj. (15%) 79.9 82.6
Sup. (13%) 81.3 82.4
Comp. (13%) 69.1 72.1
Uniq. (6%) 70.4 74.4
Atomic (53%) 81.7 82.5
Table 2: Decompositions improve test-set performance across the four decomposition types.
Method BLEU-4 on Dev Human Val
Our Decomp. 56.75 68%
w/o data aug 48.42 56%
w/o type info 54.74 63%
Table 3: Evaluation of decomposition quality.
Method Train Val Test Simple Complex
Our Decomp. 41.6 46.3 46.7 20.2 59.5
w/o data aug 35.2 39.1 39.4 16.3 50.7
Table 4: Percentage of valid decompositions on all splits of TabFact.
Main Results.

We compare our model with different baselines on TabFact, including LPA Chen et al. (2020), Table-BERT Chen et al. (2020), LogicalFactChecker Zhong et al. (2020), HeterTFV Shi et al. (2020), SAT Zhang et al. (2020), ProgVGAT Yang et al. (2020), and TAPAS Eisenschlos et al. (2020). Details of the compared systems can be found in Appendix A.4.

Table 1 presents the test accuracy of our Base and Large models, which are built upon TAPAS-Base and TAPAS-Large, respectively. Results show that our model consistently outperforms the TAPAS baseline (80.7% vs. 79.1% for the base model and 82.7% vs. 81.2% for the large model). We also conduct significance tests for both the base and large models (the proposed model vs. TAPAS) with a one-tailed t-test; the p-value is 4.7e-6 for the base model and 3.2e-7 for the large model. Table 2 shows that our decomposition model decomposes roughly 47% of the TabFact test cases and that our model outperforms TAPAS on all types of decomposed statements.

Evaluation of Decompositions.

We use both an automated metric and human validation to evaluate decomposition quality. For the automated metric, we randomly sample 1,000 training cases from the pseudo decomposition dataset as a held-out validation set and use BLEU-4 Papineni et al. (2002) to measure generation quality. We also sample 100 decomposable cases from the TabFact test set and ask three crowd workers to judge whether the model produces plausible decompositions. The ablation results in Table 3 indicate that data augmentation and the use of type information improve decomposition quality, and that the BLEU-4 score on the pseudo decomposition dataset reflects the human judgements well.
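
As a minimal sketch of the automated evaluation, corpus-level BLEU-4 can be computed with sacrebleu, one common implementation; the exact BLEU tooling used is not prescribed here.

```python
from typing import List
import sacrebleu

def corpus_bleu4(generated: List[str], references: List[str]) -> float:
    """Corpus-level BLEU-4 between generated decompositions and the
    held-out pseudo decompositions (one reference per hypothesis)."""
    return sacrebleu.corpus_bleu(generated, [references]).score

# Toy usage (strings are illustrative, not from the dataset):
# corpus_bleu4(["the highest attendance is 10000 at firhill"],
#              ["the highest attendance is 10000 at firhill"])  # -> 100.0
```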

Since we remove defective decompositions to reduce noise in the verification task, the number of decomposed cases used by our final verification model varies with decomposition quality. We provide the percentages of valid decompositions on all data splits of TabFact in Table 4. The results show that our decompositions do not completely align with the simple/complex split provided in TabFact, and that data augmentation increases the number of valid decompositions by around 7%. On the downstream verification task, the lower-quality decompositions produced without data augmentation yield a performance drop compared to our proposed decomposition model.

4 Related Work

Existing work on fact verification is mainly based on evidence from unstructured text Thorne et al. (2018); Hanselowski et al. (2018); Yoneda et al. (2018); Thorne et al. (2019); Nie et al. (2019); Liu et al. (2020). Our work focuses on fact verification based on structured tables Chen et al. (2020). Unlike previous work Chen et al. (2020); Zhong et al. (2020); Shi et al. (2020); Zhang et al. (2020); Yang et al. (2020); Eisenschlos et al. (2020), we propose a framework that verifies statements via decomposition.

Sentence decomposition takes the form of Split-and-Rephrase, proposed by Narayan et al. (2017), which splits a complex sentence into a sequence of shorter sentences while preserving the original meaning Aharoni and Goldberg (2018); Botha et al. (2018); Guo et al. (2020). In question answering, question decomposition has been applied to help answer multi-hop questions Iyyer et al. (2016); Talmor and Berant (2018); Min et al. (2019); Wolfson et al. (2020); Perez et al. (2020). Our work focuses on decomposing statements for table-based fact verification with pseudo supervision from programs.

5 Conclusion

In this paper, we propose a framework to better verify complex statements via decomposition. Without annotating gold decompositions, we propose a program-guided approach to creating pseudo decompositions, on which we fine-tune GPT-2 for decomposition generation. By solving the decomposed subproblems, we integrate useful intermediate evidence for final verification and improve the state-of-the-art performance to 82.7% accuracy on TabFact.

Acknowledgements

We thank the anonymous reviewers for their insightful comments. We also thank Yufei Feng for his helpful comments and suggestions on the paper writing.

References

  • R. Aharoni and Y. Goldberg (2018) Split and rephrase: better evaluation and stronger baselines. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, pp. 719–724. External Links: Document, Link Cited by: §4.
  • J. A. Botha, M. Faruqui, J. Alex, J. Baldridge, and D. Das (2018) Learning to split and rephrase from Wikipedia edit history. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 732–737. External Links: Document, Link Cited by: §4.
  • W. Chen, H. Wang, J. Chen, Y. Zhang, H. Wang, S. Li, X. Zhou, and W. Y. Wang (2020) TabFact: A large-scale dataset for table-based fact verification. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: 1st item, 2nd item, §A.3, §1, §1, §2.2.1, §2.3, §3, §3, §4.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Document, Link Cited by: §A.1, §1.
  • J. Eisenschlos, S. Krichene, and T. Müller (2020) Understanding tables with intermediate pre-training. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 281–296. External Links: Document, Link Cited by: 7th item, §2.3, §3, §4.
  • B. Goodrich, V. Rao, P. J. Liu, and M. Saleh (2019) Assessing the factual accuracy of generated text. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, pp. 166–175. External Links: Document, Link Cited by: §1.
  • Y. Guo, T. Ge, and F. Wei (2020) Fact-aware sentence split and rephrase with permutation invariant training. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 7855–7862. External Links: Link Cited by: §4.
  • A. Hanselowski, H. Zhang, Z. Li, D. Sorokin, B. Schiller, C. Schulz, and I. Gurevych (2018) UKP-athene: multi-sentence textual entailment for claim verification. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), Brussels, Belgium, pp. 103–108. External Links: Document, Link Cited by: §1, §4.
  • J. Herzig, P. K. Nowak, T. Müller, F. Piccinno, and J. Eisenschlos (2020) TaPas: weakly supervised table parsing via pre-training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 4320–4333. External Links: Document, Link Cited by: 7th item.
  • M. Iyyer, W. Yih, and M. Chang (2016) Answering complicated question intents expressed in decomposed question sequences. arXiv preprint arXiv:1611.01242. Cited by: §4.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Link Cited by: §3.
  • W. Kryscinski, B. McCann, C. Xiong, and R. Socher (2020) Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 9332–9346. External Links: Document, Link Cited by: §1.
  • Y. Liu and M. Lapata (2019) Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3730–3740. External Links: Document, Link Cited by: §2.4.
  • Z. Liu, C. Xiong, M. Sun, and Z. Liu (2020) Fine-grained fact verification with kernel graph attention network. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7342–7351. External Links: Document, Link Cited by: §1, §4.
  • S. Min, V. Zhong, L. Zettlemoyer, and H. Hajishirzi (2019) Multi-hop reading comprehension through question decomposition and rescoring. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6097–6109. External Links: Document, Link Cited by: §4.
  • S. Narayan, C. Gardent, S. B. Cohen, and A. Shimorina (2017) Split and rephrase. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 606–616. External Links: Document, Link Cited by: §4.
  • Y. Nie, H. Chen, and M. Bansal (2019) Combining fact extraction and verification with neural semantic matching networks. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp. 6859–6866. External Links: Document, Link Cited by: §1, §4.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311–318. External Links: Document, Link Cited by: §3.
  • P. Pasupat and P. Liang (2015) Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, pp. 1470–1480. External Links: Document, Link Cited by: §2.3.
  • E. Perez, P. Lewis, W. Yih, K. Cho, and D. Kiela (2020) Unsupervised question decomposition for question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 8864–8880. External Links: Document, Link Cited by: §4.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9. Cited by: §2.2.2.
  • H. Rashkin, E. Choi, J. Y. Jang, S. Volkova, and Y. Choi (2017) Truth of varying shades: analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2931–2937. External Links: Document, Link Cited by: §1.
  • Q. Shi, Y. Zhang, Q. Yin, and T. Liu (2020) Learn to combine linguistic and symbolic information for table-based fact verification. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), pp. 5335–5346. External Links: Document, Link Cited by: 4th item, §3, §4.
  • A. Talmor and J. Berant (2018) The web as a knowledge-base for answering complex questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 641–651. External Links: Document, Link Cited by: §4.
  • J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal (2018) FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 809–819. External Links: Document, Link Cited by: §1, §4.
  • J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, and A. Mittal (2019) The FEVER2.0 shared task. In Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), Hong Kong, China, pp. 1–6. External Links: Document, Link Cited by: §4.
  • V. Vaibhav, R. Mandyam, and E. Hovy (2019) Do sentence interactions matter? leveraging sentence level representations for fake news classification. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), Hong Kong, pp. 134–139. External Links: Document, Link Cited by: §1.
  • T. Wolfson, M. Geva, A. Gupta, M. Gardner, Y. Goldberg, D. Deutch, and J. Berant (2020) Break it down: a question understanding benchmark. Transactions of the Association for Computational Linguistics 8, pp. 183–198. External Links: Document, Link Cited by: §4.
  • X. Yang, F. Nie, Y. Feng, Q. Liu, Z. Chen, and X. Zhu (2020) Program enhanced fact verification with verbalization and graph attention network. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 7810–7825. External Links: Document, Link Cited by: 6th item, §A.1, §2.2.1, §3, §4.
  • T. Yoneda, J. Mitchell, J. Welbl, P. Stenetorp, and S. Riedel (2018) UCL machine reading group: four factor framework for fact finding (HexaF). In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), Brussels, Belgium, pp. 97–102. External Links: Document, Link Cited by: §1, §4.
  • H. Zhang, Y. Wang, S. Wang, X. Cao, F. Zhang, and Z. Wang (2020) Table fact verification with structure-aware transformer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 1624–1629. External Links: Document, Link Cited by: 5th item, §3, §4.
  • W. Zhong, D. Tang, Z. Feng, N. Duan, M. Zhou, M. Gong, L. Shou, D. Jiang, J. Wang, and J. Yin (2020) LogicalFactChecker: leveraging logical operations for fact checking with graph module network. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 6053–6065. External Links: Document, Link Cited by: 3rd item, §3, §4.

Appendix A Appendix

A.1 Program Selection

We fine-tune BERT Devlin et al. (2019) to model p(z | S, T), the probability of a program z being semantically consistent with the statement S. Since gold programs are not available, we use the final verification labels as weak supervision. To mitigate the impact of spurious programs, i.e., programs that execute to the correct answer through incorrect operation combinations, we follow Yang et al. (2020) and optimize the model with a margin loss:

L = max(0, p(z- | S, T) - p(z+ | S, T) + η)

where z- and z+ denote the label-inconsistent and label-consistent programs with the highest probability, respectively, and η is the parameter controlling the margin. The margin loss encourages selecting the program that is most semantically relevant to the statement while maintaining a margin between the positive (label-consistent) and negative (label-inconsistent) programs.
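
A minimal PyTorch sketch of this margin loss under the notation above; the margin value shown is illustrative.

```python
import torch

def program_margin_loss(p_pos: torch.Tensor, p_neg: torch.Tensor,
                        margin: float = 0.15) -> torch.Tensor:
    """Hinge-style margin loss for program selection.

    p_pos: probability of the highest-scoring label-consistent program z+
    p_neg: probability of the highest-scoring label-inconsistent program z-
    margin: the margin hyperparameter (0.15 is an illustrative value)
    """
    return torch.clamp(p_neg - p_pos + margin, min=0.0)
```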

A.2 Statistics of the Pseudo Dataset

We have 9,696 pseudo statement-decomposition pairs in total; the number of samples belonging to each of the four decomposition types is given in Table 5. To train the decomposition type detection model, we add an additional atomic category with 1,739 statements.

Decomp. Type # of samples
Conjunctive 1,798
Superlative 2,452
Comparative 4,528
Uniqueness 918
Table 5: Statistics of pseudo decomposition dataset.

A.3 Statistics of the TabFact Dataset

Table 6 lists the statistics of TabFact Chen et al. (2020), the large-scale table-based fact verification benchmark on which we evaluate our method. The test set is further split into a simple set and a complex set, which include 4,171 and 8,608 sentences, respectively. A small test set with 1,998 samples is provided for human performance evaluation.

Split #Sentences #Tables Avg. Rows Avg. Cols
Train 92,283 13,182 14.1 5.5
Val 12,792 1,696 14.0 5.4
Test 12,779 1,695 14.2 5.4
Table 6: Statistics of TabFact.

A.4 Compared Systems

  • LPA Chen et al. (2020) derives a program for each statement by ranking the synthesized program candidates and takes the program execution results as predictions.

  • Table-BERT Chen et al. (2020) takes a linearized table and a statement as the input of BERT for fact verification.

  • LogicalFactChecker Zhong et al. (2020) utilizes the structures of programs to prune irrelevant information in tables and modularize symbolic operations with module networks.

  • HeterTFV Shi et al. (2020) is a graph-based reasoning approach to combining linguistic information and symbolic information.

  • SAT Zhang et al. (2020) is a structure-aware Transformer that encodes structured tables by injecting the structural information into the mask of the self-attention layer.

  • ProgVGAT Yang et al. (2020) leverages the symbolic operation information to enhance verification with a verbalization technique and a graph-based network.

  • TAPAS Herzig et al. (2020); Eisenschlos et al. (2020) is the previous state-of-the-art model on TabFact; it extends BERT’s architecture to encode tables and is jointly pre-trained on text and tables.