Understanding Unnatural Questions Improves Reasoning over Text

by Xiao-Yu Guo, et al.
Monash University

Complex question answering (CQA) over raw text is a challenging task. A prominent approach to this task is based on the programmer-interpreter framework, where the programmer maps the question into a sequence of reasoning actions, which is then executed on the raw text by the interpreter. Learning an effective CQA model requires large amounts of human-annotated data, consisting of the ground-truth sequence of reasoning actions, which is time-consuming and expensive to collect at scale. In this paper, we address the challenge of learning a high-quality programmer (parser) by projecting natural human-generated questions into unnatural machine-generated questions, which are more convenient to parse. We first generate synthetic (question, action sequence) pairs with a data generator, and train a semantic parser that associates synthetic questions with their corresponding action sequences. To capture the diversity of natural questions, we learn a projection model that maps natural questions to their most similar unnatural questions, for which the parser works well. Without any natural training data, our projection model provides high-quality action sequences for the CQA task. Experimental results show that the QA model trained exclusively with synthetic data generated by our method outperforms its state-of-the-art counterpart trained on human-labeled data.



1 Introduction

The complex question answering (CQA) task, which requires multi-step, discrete actions to be executed over text to obtain answers, is a challenging task. On the recently released DROP benchmark [4], the state-of-the-art method Neural Module Networks (NMNs) [5] learns to interpret each question as a sequence of neural modules, or discrete actions, and execute them to yield the answer. The training of CQA models such as NMNs requires a large amount of (question, action sequence) pairs, which is expensive to acquire and augment. Therefore, the label scarcity problem remains a challenge to the CQA problem.

Motivated by this, we propose a projection model to alleviate the label scarcity challenge by generating synthetic training data. The projection model can automatically label large amounts of unlabelled questions with action sequences, so that a CQA model can be trained without natural supervised data. Our method is inspired by the recent “simulation-to-real” transfer approach [8, 6]. As the name suggests, internal knowledge in synthetic data is first learned in the “simulation” phase, and then transferred “to-real” circumstances with natural-language utterances using the projection model. In the “simulation” phase, we design an n-gram based generator to produce synthetic training data, i.e. (question, action sequence) pairs, which constitute the synthetic dataset. We then train a semantic parser to learn the internal knowledge in this dataset. In the “to-real” phase, we train a projection model that projects each unlabelled natural-language question to a synthetic question, and we obtain the corresponding action sequence by interpreting the synthetic question with the trained semantic parser. In this way, the internal knowledge is implicitly transferred from the “simulation” phase to the “to-real” phase. With action sequences obtained for the natural questions, an interpreter is employed to execute these action sequences and generate answers.

Experimental results on the challenging DROP dataset demonstrate the effectiveness and practicability of our projection model. Based on the synthetic dataset produced by the data generator, our projection model helps the NMNs model achieve a 78.3 F1 score on the DROP development dataset. The result indicates the promise of “simulation-to-real” as a development paradigm for NLP problems.

Our contributions are as follows:

  • We leverage a projection model to connect synthetic questions with natural-language questions and alleviate the label scarcity problem for the CQA task.

  • Our projection model helps generate answers for the CQA task. With high-quality action sequences, the NMNs model achieves higher F1 and Exact Match scores.

Figure 1: An overview of our proposed model. (1) The synthetic dataset is generated by a data generator. (2) A specific model is trained on the synthetic dataset. (3) A projection model is employed to project natural-language questions to synthetic questions. Note that (1) and (2) belong to the “simulation” phase, while (3) belongs to the “to-real” phase.

2 Background

The complex question answering (CQA) task aims to generate answers for compositional questions that require a thorough understanding of questions and contents such as paragraphs. The recent DROP dataset [4] fits this setting exactly and requires discrete reasoning over the contents of paragraphs, which are extracted from Wikipedia. All questions and corresponding answers are generated and annotated by human workers. The questions in DROP are complex and compositional in nature, and are thus especially challenging for existing QA models such as BiDAF [7].

Gupta et al. [5] address this challenge with the fully differentiable Neural Module Networks (NMNs) [2, 1] and achieve state-of-the-art performance on DROP. NMNs follow the programmer-interpreter structure and consist of two components: a parser (programmer) and an interpreter. The parser is an encoder-decoder model with attention and learns to interpret a natural-language question into an executable sequence of modules (i.e. actions; hereafter we use the terms “module” and “action” interchangeably). The interpreter then takes the sequence along with the paragraph as inputs, and predicts the answer after executing the actions one by one.

These modules, such as , and , are defined to perform independent reasoning over text, numbers and dates. Besides, actions take arguments from questions. For example, the question “How many yards was the first field goal” is interpreted as the first action field goal, where “field goal” is the argument of the “” action. More details can be found in Appendix A.1.

For realizing the “simulation-to-real” projection model, we define a series of concepts as follows. Formally, our goal is to obtain a semantic parser, i.e. a general model $f: \mathcal{Q}_n \to \mathcal{A}$ that maps natural questions $\mathcal{Q}_n$ to action sequences $\mathcal{A}$. We approximate the general model by another semantic parser, the specific model $f_s: \mathcal{Q}_s \to \mathcal{A}$, where $\mathcal{Q}_s$ is the set of synthetic questions. Note that the model $f_s$ is trained on a synthetic dataset $\mathcal{D}_s = \{(q_s, a)\}$, where $(q_s, a)$ represents a pair of (synthetic question, action sequence). In order to obtain the action sequence for a natural question $q_n \in \mathcal{Q}_n$, we propose a projection model $p: \mathcal{Q}_n \to \mathcal{Q}_s$, with which we find a synthetic question for each natural question. Therefore, the action sequence for a natural question can be computed using the specific model: $a = f_s(p(q_n))$. The specific model $f_s$ can be trained using a supervised machine learning method on synthetic data only. Therefore, it remains only to find the projection model $p$.
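The composition described above (project a natural question onto a synthetic one, then parse the synthetic one) can be sketched in a few lines. This is only an illustration under stated assumptions: the lookup-table parser, the word-overlap projection, the function names and the action labels are all hypothetical stand-ins, not the trained models or the actual NMNs modules.

```python
from typing import Callable, Dict, List

ActionSequence = List[str]

# Hypothetical synthetic dataset: the action labels are illustrative
# placeholders, not the actual NMNs module inventory.
SYNTHETIC_DATASET: Dict[str, ActionSequence] = {
    "how many touchdowns were scored?": ["count", "find"],
    "what happened first, the crisis or the french revolution?": [
        "compare_dates", "find", "find",
    ],
}


def specific_model(synthetic_question: str) -> ActionSequence:
    """Stand-in for the parser trained on synthetic data (here: a lookup)."""
    return SYNTHETIC_DATASET[synthetic_question]


def toy_projection(natural_question: str) -> str:
    """Stand-in projection: pick the synthetic question sharing the most words."""
    words = set(natural_question.lower().rstrip("?").split())
    return max(
        SYNTHETIC_DATASET,
        key=lambda s: len(words & set(s.rstrip("?").split())),
    )


def compose(
    projection: Callable[[str], str],
    parser: Callable[[str], ActionSequence],
) -> Callable[[str], ActionSequence]:
    """Approximate the general parser by parsing the projected question."""
    def general_model(natural_question: str) -> ActionSequence:
        return parser(projection(natural_question))
    return general_model


general_model = compose(toy_projection, specific_model)
print(general_model("How many touchdowns did Elam score?"))
```

Keeping the projection and the parser as separate callables mirrors the split in the text: the parser only ever sees synthetic questions, so it can be trained without natural data.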


3 Model

The overall structure of our model is shown in Figure 1. We first introduce an n-gram based data generator in Section 3.1. In Section 3.2, a cosine-similarity projection model and a classifier-based projection model are proposed. Experimental results demonstrate that the classifier-based projection model achieves better question answering performance.

3.1 Data Generator

From the existing (question, action sequence) pairs in the training set of the NMNs [5], we first summarize a list of first n-grams of questions to provide sufficient coverage of the available action sequences. Note that the “n” in n-gram is a tunable parameter and there can be multiple action sequences for a single n-gram. Some examples are listed in Table 1.

First n-grams | Synthetic Questions | Action Sequences
how many touchdowns were scored | How many touchdowns were scored? |
 | How many touchdowns were scored by Elam in the first quarter? |
what happened first | What happened first, the crisis or the French Revolution? |

Table 1: Examples of first n-grams, synthetic questions and action sequences.
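The first step, collecting first n-grams together with the action sequences they co-occur with, can be sketched as follows. This is a minimal sketch under assumptions: the tokenization rule, the function names and the action labels are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

ActionSequence = Tuple[str, ...]


def first_ngram(question: str, n: int) -> str:
    """Return the first n lowercased word tokens of a question."""
    tokens = re.findall(r"[a-z0-9']+", question.lower())
    return " ".join(tokens[:n])


def collect_ngram_actions(
    pairs: Iterable[Tuple[str, ActionSequence]], n: int
) -> Dict[str, Set[ActionSequence]]:
    """Group the action sequences observed for each first n-gram."""
    table: Dict[str, Set[ActionSequence]] = defaultdict(set)
    for question, actions in pairs:
        table[first_ngram(question, n)].add(actions)
    return dict(table)


# Hypothetical annotated pairs; the action labels are placeholders.
pairs = [
    ("How many touchdowns were scored by Elam?", ("count", "find")),
    ("How many touchdowns were scored in the first quarter?",
     ("count", "filter", "find")),
    ("What happened first, the crisis or the French Revolution?",
     ("compare_dates", "find", "find")),
]

ngram_table = collect_ngram_actions(pairs, n=3)
for ngram, sequences in ngram_table.items():
    print(ngram, "->", len(sequences), "action sequence(s)")
```

Storing a set per n-gram reflects the remark that a single n-gram can correspond to several action sequences.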

Secondly, given an n-gram and an action sequence, we generate a question with blanks. For example, for the n-gram “what happened first” and an action sequence that compares two event dates, we generate a synthetic question with two blanks: “what happened first, blank1 or blank2?”.

Finally, the blanks in each generated question are replaced with names, events, noun phrases, constrained words, etc. from the corresponding paragraph. We extract these various types of entities from natural-language questions and paragraphs using spaCy (https://spacy.io/).
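The blank-filling step can be illustrated with a small sketch. It assumes the fillers have already been extracted from the paragraph (the text above uses spaCy for this; here they are hard-coded), and the function name and the blank naming pattern are hypothetical.

```python
import re
from itertools import permutations
from typing import List


def fill_template(template: str, fillers: List[str]) -> List[str]:
    """Fill every blankN slot with each ordered choice of distinct fillers."""
    blanks = re.findall(r"blank\d+", template)
    questions = []
    for combo in permutations(fillers, len(blanks)):
        mapping = dict(zip(blanks, combo))
        # Substitute all blanks in one pass so fillers are never re-scanned.
        questions.append(
            re.sub(r"blank\d+", lambda m: mapping[m.group(0)], template)
        )
    return questions


# Fillers stand in for entities extracted from a paragraph (assumption).
template = "What happened first, blank1 or blank2?"
fillers = ["the crisis", "the French Revolution"]

for question in fill_template(template, fillers):
    print(question)
```

Enumerating ordered choices is one possible design; a real generator would likely also restrict fillers by type, for example only events for a date comparison.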

3.2 Projection Model

A straightforward way to project natural-language questions onto synthetic questions is the cosine similarity of question embeddings. We leverage contextualized representations of question words and define the question embedding as the average of all word embeddings in the question. Note that we employ the bert-base-uncased model [3] and define the projection model by:

$$p(q_n) = \arg\max_{q_s \in \mathcal{Q}_s} \cos\big(\mathbf{e}(q_n), \mathbf{e}(q_s)\big) \qquad (1)$$

where $\cos(\cdot, \cdot)$ represents the cosine similarity between two vectors, $\mathbf{e}(\cdot)$ is the question embedding defined above, $q_n$ and $q_s$ represent a natural and a synthetic question respectively, and $\mathcal{Q}_s$ is the set of synthetic questions. However, Equation 1 requires a large amount of computation, as the entire set of synthetic questions needs to be compared for the projection of each natural-language question.
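A minimal sketch of the cosine-similarity projection follows. The two-dimensional vectors are toy values standing in for contextualized word embeddings from bert-base-uncased, and the function names are hypothetical; only the mean pooling and the nearest-neighbour search by cosine similarity follow the description above.

```python
import math
from typing import Dict, List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Question embedding: the average of its word embeddings."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def project(natural_embedding: Sequence[float],
            synthetic_embeddings: Dict[str, Sequence[float]]) -> str:
    """Return the synthetic question most similar to the natural one."""
    return max(
        synthetic_embeddings,
        key=lambda q: cosine(natural_embedding, synthetic_embeddings[q]),
    )


# Toy embeddings (assumption): real ones would come from a BERT encoder.
synthetic_embeddings = {
    "How many touchdowns were scored?": [1.0, 0.0],
    "What happened first, the crisis or the French Revolution?": [0.0, 1.0],
}
natural_embedding = mean_pool([[1.0, 0.0], [0.8, 0.2]])
print(project(natural_embedding, synthetic_embeddings))
```

The `max` over all synthetic questions makes the cost linear in the size of the synthetic set for every natural question, which is the overhead the text points out.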

To reduce the time complexity and also improve the performance of the projection model, we enumerate possible action sequences (without arguments) as class labels and treat the projection as a classification problem. With contextualized representations, we define a new projection model as a classifier over these action-sequence classes.


Note that with this classifier-based projection model, we no longer need to employ the specific model to further find an action sequence for an input question. Instead, the projection model interprets natural questions as action sequences directly.
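The classification view can be sketched as below. The exact classifier is not given in the text, so this assumes a single linear layer with a softmax over a pooled question embedding; the weights, the embedding values and the class labels are hypothetical, and in practice the weights would be learned rather than set by hand.

```python
import math
from typing import List, Sequence, Tuple

ActionClass = Tuple[str, ...]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embedding: Sequence[float],
             weights: Sequence[Sequence[float]],
             biases: Sequence[float],
             labels: Sequence[ActionClass]) -> Tuple[ActionClass, float]:
    """Map a question embedding to its most probable action-sequence class."""
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]


# Class labels: argument-free action sequences (placeholder names).
labels = [("count", "find"), ("compare_dates", "find", "find")]
weights = [[2.0, -1.0], [-1.0, 2.0]]  # hand-set for illustration only
biases = [0.0, 0.0]

label, prob = classify([0.9, 0.1], weights, biases, labels)
print(label, round(prob, 3))
```

Unlike the nearest-neighbour search, the cost per question depends on the number of classes rather than on the number of synthetic questions.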

4 Experiments

In this section, we will evaluate our model from two aspects. On the one hand, the performance of the two models proposed in Section 3.2 will be evaluated. On the other hand, we employ the classifier projection model to generate action sequences and provide NMNs [5] with action sequences.

4.1 Dataset and Models

The experiments are performed on the subset of the DROP dataset used in NMNs [5], containing approx. 19,500 (question, answer) pairs for training, 441 for validation and 1,723 for testing. 2,420 questions in the training set and all questions in the validation set have been manually annotated with ground-truth action sequences. These annotations are used to train the NMNs model that we denote as “original” below. We evaluate two variants of our method based on the training data used to train our classifier-based projection model: (1) “synthetic”, where the projection model is trained on the 2,420 synthetic questions generated by the data generator in Section 3.1, and (2) “natural”, where the projection model is trained on the 2,420 natural questions in the original dataset. Once trained, the projection model is applied to the remaining questions in the training set without action sequences to provide additional training data for NMNs.

4.2 Results

We first evaluate the two projection models from Section 3.2. The cosine projection model, applied with the synthetic dataset, achieves an accuracy of 83.2% on the validation set. We also train two distinct classifiers, one using only the synthetic dataset and one using only the natural dataset, which achieve accuracies of 93.2% and 96.1% respectively. These results show that the synthetic dataset produced by the data generator is of high quality, and that the projection model trained on it is comparable to the one trained on the natural dataset.

To further evaluate the projection model, we provide NMNs with classifier-based projection results to evaluate the performance on the downstream CQA task on the DROP dataset [4]. Concretely, we evaluate our “simulation-to-real” approach in two settings: “synthetic” and “natural”, where two classifiers (Section 3.2) are trained on the synthetic dataset and the natural dataset respectively. We compare our methods with the original NMNs model that is trained on the NMNs dataset. We denote this baseline method “original”.

Methods original (baseline) synthetic natural
F1 77.4 78.3 79.1
EM 74.0 74.9 75.9
Table 2: F1 and EM scores for NMNs trained on different datasets.

In Table 2, we report F1 and Exact Match (EM) scores for the CQA task. Compared with the original model, the “synthetic” NMNs achieves higher F1 and EM scores using action sequences generated by our projection model. Moreover, we find that the quality of the synthetic data can be further improved, as the projection model trained on the natural dataset achieves better performance. Our projection model helps in understanding complex questions for two main reasons. On the one hand, it provides the parser with more supervised (question, action sequence) pairs. On the other hand, the action sequences produced by the projection model are more accurate than the sequences generated by the parser. See Appendix A.2 for details.
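For readers unfamiliar with the two metrics, the sketch below shows a simplified token-overlap F1 and Exact Match for a single prediction. It is an assumption-laden illustration: the normalization rules and function names are hypothetical, and the official DROP evaluation may apply additional handling that this sketch omits.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and English articles, then tokenize."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold = "formation of the United Nations"
print(exact_match("the formation of the United Nations", gold))
print(f1_score("United Nations", gold))
```

Dataset-level scores such as those in Table 2 are obtained by averaging per-question values, which is why F1 can exceed EM: partially correct answers earn partial F1 credit but no EM credit.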

5 Conclusion

In this paper, we propose a projection model that employs only synthetic data to produce supervision for real-world data. Experimental results show that our approach can be comparable to supervised learning on the natural dataset in performance. In addition, with the projection results, we employ the NMNs model to solve the complex question answering problem and generate answers for the DROP dataset. Higher F1 and Exact Match scores demonstrate that our projection model can help improve the performance of the downstream CQA task and provide a useful reference for related work.


  • [1] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein (2016) Learning to compose neural networks for question answering. In Proceedings of the NAACL HLT 2016, pp. 1545–1554.
  • [2] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein (2016) Neural module networks. In Proceedings of the CVPR 2016, pp. 39–48.
  • [3] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL HLT 2019, pp. 4171–4186.
  • [4] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner (2019) DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the NAACL HLT 2019, pp. 2368–2378.
  • [5] N. Gupta, K. Lin, D. Roth, S. Singh, and M. Gardner (2020) Neural module networks for reasoning over text. In Proceedings of the ICLR 2020.
  • [6] A. Marzoev, S. Madden, M. F. Kaashoek, M. J. Cafarella, and J. Andreas (2020) Unnatural language processing: bridging the gap between synthetic and natural language data. arXiv:2004.13645.
  • [7] M. J. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi (2017) Bidirectional attention flow for machine comprehension. In Proceedings of the ICLR 2017.
  • [8] E. Tzeng, C. Devin, J. Hoffman, C. Finn, X. Peng, S. Levine, K. Saenko, and T. Darrell (2015) Towards adapting deep visuomotor representations from simulated to real environments. arXiv:1511.07111.

Appendix A

A.1 NMNs model overview

Gupta et al. [5] propose the Neural Module Networks (NMNs) model to solve the complex question answering problem. Consisting of a parser and an interpreter, NMNs have an interpretable structure, as shown in Figure 2. Note that the parser and the interpreter are jointly learned with auxiliary supervision during training.

Figure 2: NMNs model architecture.

As Figure 2 shows, NMNs takes the question and the paragraph as inputs. The encoder-decoder based parser first interprets the question into an executable action sequence. The action-based interpreter then executes this sequence against the corresponding paragraph to produce the final answer. By computing attention matrices, all actions independently perform reasoning over raw text or over the outputs of other actions. For example, the touchdown action finds all mentions of “touchdown pass” in the paragraph and assigns larger attention weights to all words with a yellow background. Then the first quarter action keeps the “touchdown pass” mentions that belong to the first quarter by producing an attention mask over the output of the preceding action. In Table 3, we list some examples of questions, answers and the corresponding action sequences.

Questions | Action Sequences | Answers
How many touchdowns did the Giants score in the fourth quarter? |  | 2
Who kicked the most field goals? |  | Rackers
Who threw the longest touchdown pass of the first quarter? |  | Aaron Rodgers
How many yards was the longest touchdown reception? |  | 14 yards
Which happened earlier, the formation of the United Nations or the dissolution of the Soviet Union? |  | formation of the United Nations
How many years after the formation of the United Nations was the Universal Declaration of Human Rights adopted? |  | 3 years
Table 3: Examples for questions, answers and action sequences.

A.2 Examples from distinct models

Table 4 lists (question, action sequence) pairs generated by different projection models and compares their impact on the final answer. All questions are selected from the DROP validation dataset, and action sequences are generated either by the parser (“original”) or by our projection models. As can be seen, our projection methods are able to generate action sequences that lead to correct answers. In contrast, the action sequences from the baseline are sometimes erroneous, as shown for the first question, where the baseline sequence does not lead to the correct answer.

Question | Method | Action Sequences | F1
How many years did it take for the Allies to take five towns from the Dutch? | original |  | 0.0
 | synthetic |  | 1.0
 | natural |  | 1.0
Which happened first, the second Kandyan War, or Sri Lankan independence? | original |  | 0.0
 | synthetic |  | 0.0
 | natural |  | 0.55
How many years was between the oil crisis and the energy crisis? | original |  | 0.0
 | synthetic |  | 0.0
 | natural |  | 1.0
Table 4: Different (question, action sequence) pairs obtained when training on different datasets.