Two-Stage Synthesis Networks for Transfer Learning in Machine Comprehension

06/29/2017 · by David Golub, et al. · Microsoft · IEEE · Stanford University

We develop a technique for transfer learning in machine comprehension (MC) using a novel two-stage synthesis network (SynNet). Given a high-performing MC model in one domain, our technique aims to answer questions about documents in another domain, where we use no labeled data of question-answer pairs. Using the proposed SynNet with a model pretrained on the SQuAD dataset, we achieve an F1 measure of 44.3 with a single model and 46.6 with an ensemble on the challenging NewsQA dataset, approaching performance of in-domain models (F1 measure of 50.0) without use of provided annotations.




1 Introduction

Machine comprehension (MC), the ability to answer questions over a provided context paragraph, is a key task in natural language processing. The rise of high-quality, large-scale human-annotated datasets for this task (Rajpurkar et al., 2016; Trischler et al., 2016) has allowed for the training of data-intensive but expressive models such as deep neural networks (Wang et al., 2016; Xiong et al., 2016; Seo et al., 2016). Moreover, these datasets have the attractive quality that the answer is a short snippet of text within the paragraph, which narrows the search space of possible answer spans.

However, many of these models rely on large amounts of human-labeled data for training, and data collection is a time-consuming and expensive task. Moreover, directly applying a MC model trained on one domain to answer questions over paragraphs from another domain may result in degraded performance.

While understudied, the ability to transfer a MC model to multiple domains is of great practical importance. For instance, the ability to quickly use a MC model trained on Wikipedia to bootstrap a question-answering system over customer support manuals or news articles, where there is no labeled data, can unlock a great number of practical applications.

In this paper, we address this problem in MC through a two-stage synthesis network (SynNet). The SynNet generates synthetic question-answer pairs over paragraphs in a new domain that are then used in place of human-generated annotations to finetune a MC model trained on the original domain.

The idea of generating synthetic data to augment insufficient training data has been explored before. For example, for the target task of translation, Sennrich et al. (2016) present a method to generate synthetic translations given real sentences to refine an existing machine translation system.

However, unlike machine translation, for tasks like MC, we need to synthesize both the question and the answer given the context paragraph. Moreover, while the question is a syntactically fluent natural language sentence, the answer is mostly a salient semantic concept in the paragraph, e.g., a named entity, an action, or a number, which is often a single word or short phrase.[2] Since the answer has a very different linguistic structure compared to the question, it may be more appropriate to view answers and questions as two different types of data. Hence, the synthesis of a (question, answer) tuple is needed.

[2] This assumption holds for MC datasets such as SQuAD and NewsQA, but there are exceptions in certain subdomains of MSMARCO.

In our approach, we decompose the process of generating question-answer pairs into two steps: answer generation conditioned on the paragraph, and question generation conditioned on the paragraph and the answer. We generate the answer first because answers are usually key semantic concepts, while the question can be viewed as a full sentence composed to inquire about the concept.

Using the proposed SynNet, we are able to outperform a strong baseline of directly applying a high-performing MC model trained on another domain. For example, when we apply our algorithm using a pretrained model on the Stanford Question-Answering Dataset (SQuAD) (Rajpurkar et al., 2016), which consists of Wikipedia articles, to answer questions on the NewsQA dataset (Trischler et al., 2016), which consists of CNN/Daily Mail articles, we improve the performance of the single-model SQuAD baseline from 39.0% to 44.3% F1, and boost results further with an ensemble to 46.6% F1, approaching results of previously published work of Trischler et al. (2016) (50.0% F1), without use of labeled data in the new domain. Moreover, an error analysis reveals that we achieve higher accuracy over the baseline on all common question types.

2 Related Work

2.1 Question Answering

Question answering is an active area in natural language processing with ongoing research in many directions (Berant et al., 2013; Hill et al., 2015; Golub and He, 2016; Chen et al., 2016; Hermann et al., 2015). Machine comprehension, a form of extractive question answering where the answer is a snippet or multiple snippets of text within a context paragraph, has recently attracted a lot of attention in the community. The rise of large-scale human-annotated datasets with over 100,000 realistic question-answer pairs, such as SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2016), and MSMARCO (Nguyen et al., 2016), has led to a large number of successful deep learning models (Lee et al., 2016; Seo et al., 2016; Xiong et al., 2016; Dhingra et al., 2016; Wang and Jiang, 2016).

2.2 Semi-Supervised Learning

Semi-supervised learning has a long history (cf. Chapelle et al. (2009) for an overview) and has been applied to many tasks in natural language processing, such as dependency parsing (Koo et al., 2008), sentiment analysis (Yang et al., 2015), machine translation (Sennrich et al., 2016), and semantic parsing (Berant and Liang; Wang et al.; Jia and Liang, 2016). Recent work generated synthetic annotations on unsupervised data to boost the performance of both reading comprehension and visual question answering models (Yang et al., 2017; Ren et al., 2015), but on domains with some form of annotated data. There has also been work on generating high-quality questions (Yuan et al., 2017; Serban et al., 2016; Labutov et al.), but not on how to best use them to train a model. In contrast, we use the two-stage SynNet to generate data tuples to directly boost performance of a model on a domain with no annotations.

2.3 Transfer Learning

Transfer learning (Pan and Yang, 2010) has been successfully applied to numerous domains in machine learning, such as machine translation (Zoph et al., 2016), computer vision (Sharif Razavian et al., 2014), and speech recognition (Doulaty et al., 2015). Specifically, object recognition models trained on the large-scale ImageNet challenge (Russakovsky et al., 2015) have proven to be excellent feature extractors for diverse tasks such as image captioning (e.g., Lu et al. (2016); Fang et al. (2015); Karpathy and Fei-Fei (2015)) and visual question answering (e.g., Zhou et al. (2015); Xu and Saenko (2016); Fukui et al. (2016); Yang et al. (2016)), among others. In a similar fashion, we use a model pretrained on the SQuAD dataset as a generic feature extractor to bootstrap a QA system on NewsQA.

3 The Transfer Learning Task for MC

We formalize the task of machine comprehension below. Our MC model takes as input a tokenized question q = q_1, ..., q_m and a context paragraph p = p_1, ..., p_n, where q_i and p_j are words, and learns a function f(q, p) → (a_start, a_end), where a_start and a_end are pointer indices into paragraph p, i.e., the answer is the span a = p_{a_start}, ..., p_{a_end}.

Given a collection of labeled (paragraph, question, answer) triples (p, q, a) from a particular domain S, e.g., Wikipedia articles, we can learn a MC model that is able to answer questions in that domain.

However, when applying the model trained in one domain to answer questions in another, the performance may degrade. On the other hand, labeling data to train a model in the new domain is expensive and time-consuming.

In this paper, we propose the task of transferring a MC system that is trained in a source domain S to answer questions over another target domain T. In the target domain T, we are given an unlabeled set of paragraphs. During test time, we are given an unseen set of paragraphs in the target domain, over which we would like to answer questions.

4 The Model

4.1 Two-Stage SynNet

To bootstrap our model we use a SynNet (Figure 1), which consists of answer synthesis and question synthesis modules, to generate data on the paragraphs of the target domain T. Our SynNet learns the conditional probability of generating answer a and question q given paragraph p, P(q, a | p). We decompose this joint probability distribution into a product of conditional probability distributions, P(q, a | p) = P(a | p) P(q | p, a), where we first generate the answer a, followed by generating the question q conditioned on the answer and paragraph.

4.1.1 Answer Synthesis Module

In our answer synthesis module we train a simple IOB tagger to predict whether each word in the paragraph is part of an answer or not.

More formally, given the words in a paragraph p = p_1, ..., p_n, our IOB tagging model learns the conditional probability of labels y_1, ..., y_n, where y_i is an IOB answer tag if word p_i is marked as part of an answer by the annotator in our train set, and y_i = NONE otherwise.

We use a bi-directional Long Short-Term Memory network (Bi-LSTM) (Hochreiter and Schmidhuber, 1997) for tagging. Specifically, we project each word p_i into a continuous vector space via pretrained GloVe embeddings (Pennington et al., 2014). We then run a Bi-LSTM over the word embeddings to produce context-dependent word representations h_1, ..., h_n, which we feed into two fully connected layers followed by a softmax to produce the tag likelihoods for each word.

We select all consecutive spans p_i, ..., p_j for which the tagger produces labels other than NONE as our candidate answer chunks, which we feed into our question synthesis module for question generation.
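Concretely, extracting the candidate chunks from the tagger output reduces to collecting maximal runs of non-NONE labels. The sketch below is illustrative only; the label names and function signature are our assumptions, not the authors' code:

```python
def extract_answer_spans(tags):
    """Collect consecutive spans tagged as part of an answer.

    `tags` is a list of per-word labels from the tagger, e.g.
    ["NONE", "B", "I", "NONE", "B", "NONE"]; any label other than
    "NONE" marks a word inside a candidate answer chunk.
    Returns (start, end) word-index pairs, inclusive.
    """
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag != "NONE" and start is None:
            start = i                      # open a new span
        elif tag == "NONE" and start is not None:
            spans.append((start, i - 1))   # close the current span
            start = None
    if start is not None:                  # span runs to the paragraph end
        spans.append((start, len(tags) - 1))
    return spans
```

Note this simplified version merges adjacent chunks (a B tag immediately following an I tag); a full IOB decoder would split them.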

4.1.2 Question Synthesis Module

Our question synthesis module learns the conditional probability of generating question q = q_1, ..., q_m given answer a and paragraph p, P(q | p, a). We decompose the joint probability of generating all the question words into generating the question one word at a time, i.e., P(q | p, a) = ∏_t P(q_t | q_1, ..., q_{t-1}, p, a).

The model is similar to an encoder-decoder network with attention (Bahdanau et al., 2014), which computes the conditional probability P(q_t | q_1, ..., q_{t-1}, p, a). We run a Bi-LSTM over the paragraph to produce context-dependent word representations h_1, ..., h_n. To model where the answer is in the paragraph, similar to Yang et al. (2017), we insert answer information by appending a zero/one feature to the paragraph word embeddings. Then, at each time step t, a decoder network attends to both h_1, ..., h_n and the previously generated question token q_{t-1} to produce a hidden representation d_t. Since paragraphs may often have named entities and rare words not present during training, we incorporate a copy mechanism into our models (Gu et al., 2016).

We use an architecture motivated by latent predictor networks (Ling et al., 2016) to force the model to learn when to copy vs. directly predict the word, without direct supervision of which action to choose. Specifically, at every time step t, two latent predictors generate the probability of producing word q_t: a pointer network (Vinyals et al., 2015), which can copy a word from the context paragraph, and a vocabulary predictor, which directly generates a probability distribution over words from a predefined vocabulary. The likelihood of choosing predictor z_t at time step t is proportional to P(z_t | d_t), and the likelihood of predicting question token q_t is given by P(q_t | d_t) = Σ_{z_t ∈ {V, C}} P(z_t | d_t) P(q_t | z_t, d_t), where V represents the vocabulary predictor, C represents the copy predictor, and P(q_t | z_t, d_t) is the likelihood of the word given by that predictor.[3] For training, since no direct supervision is given as to which predictor to choose, we minimize the cross-entropy loss of producing the correct question tokens by marginalizing out the latent variables using a variant of the forward-backward algorithm (see Ling et al. (2016) for full details).

[3] Since we only have two predictors, P(z_t = C | d_t) = 1 - P(z_t = V | d_t).

During inference, to generate a question q, we use greedy decoding in the following manner. At time step t, we select the most likely predictor (V or C), followed by the most likely word given that predictor. We feed the predicted word back into the decoder as input at the next time step until we predict the end symbol, END, after which we stop decoding.
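A single greedy decoding step can be sketched as follows: pick the likelier latent predictor, then the likeliest word under that predictor. All names, signatures, and the plain-list probability representation here are illustrative assumptions, not the authors' implementation:

```python
def greedy_step(p_choose_vocab, vocab_probs, copy_probs, paragraph, vocab):
    """One greedy decoding step with two latent predictors.

    p_choose_vocab: P(z_t = V); P(z_t = C) = 1 - p_choose_vocab.
    vocab_probs: distribution over entries of `vocab` (vocabulary predictor).
    copy_probs: distribution over positions in `paragraph` (pointer network).
    Returns the word emitted at this time step.
    """
    if p_choose_vocab >= 0.5:
        # Vocabulary predictor wins: emit the likeliest vocabulary word.
        idx = max(range(len(vocab_probs)), key=vocab_probs.__getitem__)
        return vocab[idx]
    # Copy predictor wins: copy the likeliest paragraph word.
    idx = max(range(len(copy_probs)), key=copy_probs.__getitem__)
    return paragraph[idx]
```

In the full decoder the emitted word would be fed back as input to the next step, and decoding stops once the END symbol is produced.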

Snippet of context paragraph | Generated question vs. human question

…At this point, some of these used-luxe models have been around so long that they almost qualify as vintage throwback editions. Recently, Consumer Report magazine issued its list of best and worst used cars, and divvied them up by price range…
Generated: What magazine made best used cars in the USAF?
Human: Who released a list of best and worst used cars?

…A high court in northern India on Friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case dubbed “the house of horrors.” Moninder Singh Pandher was sentenced to death by a lower court in February. The teen was one of 19 victims – children and…
Generated: How many victims were in India?
Human: What was the amount of children murdered?

Joe Pantoliano has met with the Obama and McCain camps to promote mental health and recovery. Pantoliano, founder and president of the eight-month-old advocacy organization No Kidding, Me Too, released a teaser of his new film about various forms of mental illness…
Generated: Which two groups did Joe Pantoliano meet with?
Human: Who did he meet with to discuss the issue?

…Former boxing champion Vernon Forrest, 38, was shot and killed in southwest Atlanta, Georgia, on July 25. A grand jury indicted the three suspects – Charman Sinkfield, 30; Demario Ware, 20; and Jquante…
Generated: Where was the first person to be shot?
Human: Where was Forrest killed?

Table 1: Randomly sampled paragraphs and corresponding synthetic vs. human questions from the NewsQA train set. Human-selected answers from the train set were used as input.

4.2 Machine Comprehension Model

Our machine comprehension model learns the conditional likelihood P(a | p, q) of predicting answer pointers a = (a_start, a_end) given paragraph p and question q. In our experiments we use the open-source Bi-directional Attention Flow (BiDAF) network (Seo et al., 2016), since it is one of the best-performing models on the SQuAD dataset, although we note that our algorithm for data synthesis can be used with any MC model.

4.3 Algorithm Overview

Having given an overview of our SynNet and a brief overview of the MC model, we now describe our training procedure, which is illustrated in Algorithm 1.

Input: (p, q, a) triplets from source domain S; MC model pretrained on S, M_S; paragraphs from target domain T
Output: MC model on target domain, M_T
1 Train SynNet to maximize P(q, a | p) on source domain S;
2 Generate samples (p, q', a') on text in target domain T;
3 Use (p, q', a') to finetune MC model M_S on domain T. For every batch sampled from T, sample k batches from S;
Algorithm 1: Training Algorithm
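The batch-mixing in step 3 of Algorithm 1 can be sketched as a simple generator. The k:1 interleaving and the recycling of source batches are our assumptions about one reasonable implementation, not the authors' code:

```python
from itertools import cycle

def mixed_batches(source_batches, target_batches, k=4):
    """Yield finetuning batches: before each synthetic batch from the
    target domain, interleave k batches of real source-domain data
    (the paper's data-regularization; k = 4 in the experiments).

    Source batches are recycled with `cycle`, since finetuning usually
    takes more steps than there are source batches. Illustrative sketch.
    """
    src = cycle(source_batches)
    for tgt in target_batches:
        for _ in range(k):
            yield ("source", next(src))   # real, human-annotated data
        yield ("target", tgt)             # noisy synthetic data
```

A training loop would consume this generator and apply one SGD update per yielded batch.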

4.4 Training

Our approach for transfer learning consists of several training steps. First, given a series of labeled examples (p, q, a) from source domain S, paragraphs from target domain T, and pretrained MC model M_S, we train the SynNet to maximize the likelihood of the question-answer pairs in S.

Second, we fix our SynNet and use it to sample question-answer pairs on the paragraphs in domain T. Several examples of generated questions can be found in Table 1.

We then transfer the MC model originally learned on the source domain to the target domain using SGD on the synthetic data. However, since the synthetic data is usually noisy, we alternately train the MC model with mini-batches from S and T, which we call data-regularization. For every k mini-batches from S, we sample 1 batch of synthetic data from T, where k is a hyper-parameter, which we set to 4. Letting the model encounter many examples from the source domain serves to regularize the distribution of the synthetic data in the target domain with real data from S. We checkpoint the finetuned model at a fixed interval of mini-batches in our experiments, and save a copy of the model at each checkpoint.

At test time, to generate an answer, we feed paragraph p and question q through our finetuned MC model to get the answer-pointer likelihoods P(a_start = i | p, q) and P(a_end = i | p, q) for all positions i. We then use dynamic programming (Seo et al., 2016) to find the optimal answer span (a_start, a_end). To improve the stability of using our model for inference, we average the predicted answer likelihoods from model copies at different checkpoints, which we call checkpoint-averaging (cpavg), prior to running the dynamic programming algorithm.
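The span search admits a linear-time formulation: sweep the end position while tracking the best start likelihood seen so far, so that start ≤ end is enforced. A minimal sketch, with checkpoint-averaging as an optional argument (names and the independence assumption P(span) = P(start)·P(end) are ours):

```python
def best_answer_span(p_start, p_end, avg_over=None):
    """Pick the answer span (i, j), i <= j, maximizing
    P(start = i) * P(end = j), in one linear sweep.

    If `avg_over` is given, it is a list of (p_start, p_end) pairs from
    checkpointed model copies; their likelihoods are averaged first
    (checkpoint-averaging). Probabilities are plain Python lists.
    """
    if avg_over:
        n = len(avg_over[0][0])
        p_start = [sum(ps[i] for ps, _ in avg_over) / len(avg_over) for i in range(n)]
        p_end = [sum(pe[i] for _, pe in avg_over) / len(avg_over) for i in range(n)]
    best, best_start, arg_start, span = 0.0, 0.0, 0, (0, 0)
    for j in range(len(p_start)):
        if p_start[j] > best_start:          # best start position up to j
            best_start, arg_start = p_start[j], j
        if best_start * p_end[j] > best:     # try ending the span at j
            best, span = best_start * p_end[j], (arg_start, j)
    return span
```

This is equivalent to the quadratic search over all valid (i, j) pairs, but avoids materializing the full score table.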

5 Experimental Setup

We summarize the datasets we use in our experiments, parameters for our model architectures, and training details.

The SQuAD dataset consists of approximately 100,000 question-answer pairs on Wikipedia articles, 87,600 of which are used for training, 10,570 for development, and an unknown number in a hidden test set. The NewsQA dataset consists of 92,549 train, 5,166 development, and 5,165 test questions on CNN/Daily Mail news articles. Both the domain (news vs. Wikipedia) and the question types differ between the two datasets. For example, an analysis of a randomly sampled 1,000 questions from both NewsQA and SQuAD (Trischler et al., 2016) reveals that approximately 74.1% of questions in SQuAD require word matching or paraphrasing to retrieve the answer, as opposed to 59.7% in NewsQA. As our test metrics, we report two numbers, exact match (EM) and F1 score.
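The EM and F1 metrics follow the standard SQuAD-style evaluation; a simplified sketch is below (the official script additionally takes the maximum over multiple gold references):

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, drop articles and punctuation (SQuAD-style)."""
    text = re.sub(r"\b(a|an|the)\b", " ", text.lower())
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    """Token-level F1 between predicted and gold answer spans."""
    pred_toks, gold_toks = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```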

We train a BIDAF model on the SQuAD train dataset and use a two-stage SynNet to finetune it on the NewsQA train dataset.

We initialize word embeddings for the BIDAF model, answer synthesis module, and question synthesis module with 300-dimensional GloVe vectors (Pennington et al., 2014) trained on the 840-billion-token Common Crawl corpus. We set all embeddings of unknown word tokens to zero.

For both the answer synthesis and question synthesis modules, we use a vocabulary of size 110,179. We use LSTMs with hidden states of size 150 for the answer module and of size 100 for the question module, since the answer module is less memory-intensive than the question module.

We train both the answer and question modules with Adam (Kingma and Ba, 2014) and a learning rate of 1e-2. We train the BIDAF model with the default hyperparameters provided in the open-source repository. To decide when to stop training the question synthesis module, after each epoch we monitor both the loss and the quality of questions generated on the SQuAD development set. To stop training of the answer synthesis module, we similarly monitor predictions on the SQuAD development set.

Method               System                                                  EM    F1
Transfer Learning    SQuAD baseline (B)                                      24.9  39.0
                     + Q_syn + A_syn (single model on NewsQA)                26.6  40.9
                     + Q_syn + A_NER (single model on NewsQA)                29.0  43.1
                     + Q_syn + A_syn (single model on NewsQA, cpavg)         30.6  44.3
                     + Q_syn + A_syn + A_NER (4-model ensemble, cpavg)       32.8  46.6
                     + Q_syn + A_syn + A_NER + B (4-model ensemble, cpavg)   33.0  46.6
Supervised Learning  BARB on NewsQA (Trischler et al., 2016)                 34.9  50.0
                     Match-LSTM on NewsQA (Trischler et al., 2016)           34.1  48.2
                     BIDAF on NewsQA                                         37.1  52.3
                     BIDAF on SQuAD finetuned on NewsQA                      37.3  52.2
Table 2: Main Results. Exact match (EM) and span F1 scores on the NewsQA test set of a BIDAF model finetuned with our SynNet. B refers to a baseline BIDAF model trained on SQuAD; A_syn and Q_syn refer to using answers and questions, respectively, generated from our SynNet to finetune the model on NewsQA; A_NER refers to using answers extracted from a standard NER system to generate questions; cpavg refers to using checkpoint-averaging; and + B refers to including the baseline SQuAD model in the ensemble.


System                  EM    F1
NewsQA BIDAF baseline   46.3  60.8
+ SynNet                47.9  61.5

Table 3: NewsQA to SQuAD. Exact match (EM) and span F1 results on the SQuAD development set of a NewsQA BIDAF model baseline vs. one finetuned on SQuAD using the data generated by a 2-stage SynNet.

To train the question synthesis module, we only use the questions provided in the SQuAD train set. However, to train the answer synthesis module, we further augment the human-annotated labels of each paragraph with tags from a simple NER system, because the labels of answers provided in the train set are underspecified, i.e., many words in the paragraph that could be potential answers are not labeled. Therefore, we assume any named entity could also be a potential answer to some question, in addition to the answers explicitly labeled by annotators.

To generate question-answer pairs on the NewsQA train set using the SynNet, we first run every paragraph through our answer synthesis module. We then randomly sample up to 30 candidate answers extracted by our module, which we feed into the question synthesis module. This results in 250,000 synthetic question-answer pairs that we can use to finetune our MC model.


A)            EM    F1
k=0           27.2  40.5
k=2           29.8  43.9
k=4           30.4  44.3

B)            EM    F1
2s + A_NER    22.8  36.1
all + A_NER   27.2  40.5
2s + A_h      31.3  45.2
all + A_h     32.5  46.8

Table 4: Ablation Studies. Exact match (EM) and span F1 results on the NewsQA test set of a BIDAF model finetuned with a 2-stage SynNet. In study A, we vary k, the number of mini-batches from SQuAD for every batch in NewsQA. In study B, we set k = 0, and vary the answer type and how much of the paragraph we use for question synthesis. 2s refers to using the two sentences around the answer span, while all refers to using the entire paragraph. A_NER refers to using an NER system and A_h refers to using the human-annotated answers to generate questions.

6 Experimental Results

We report the main results on the NewsQA test set (Table 2), report brief results on SQuAD (Table 3), conduct ablation studies (Table 4), and conduct an error analysis.

6.1 Results

We compare to the best previously published work, which trains BARB (Trischler et al., 2016) and Match-LSTM (Wang and Jiang, 2016) architectures, and to a BIDAF model we train on NewsQA. Directly applying a BIDAF model trained on SQuAD to predict on NewsQA leads to poor performance, with an F1 measure of 39.0%, 13.2% lower than one trained on labeled NewsQA data. Using the 2-stage SynNet already leads to a slight boost in performance (F1 measure of 40.9%), which implies that having exposure to the new domain via question-answer pairs provides important signal for the model during training. With checkpoint-averaging, we see an additional improvement of 3.4% (F1 measure of 44.3%). When we ensemble a BIDAF model trained on questions and answers from the SynNet with three BIDAF models trained on questions from the SynNet and answers from a generic NER system, we gain an additional 2.3% performance boost. Finally, when we add the original BIDAF model trained on SQuAD to the ensemble, we boost the EM further by 0.2%. Our final system achieves an F1 measure of 46.6%, approaching previously published results of 50.0%. The results demonstrate that using the proposed architecture and training procedure, we can transfer a MC model from one domain to another, without use of annotated data.

We also evaluate the SynNet on the NewsQA-to-SQuAD direction. We directly apply the best setting from the other direction and report the result in Table 3. The SynNet improves over the baseline by 1.6% in EM and 0.7% in F1. Limited by space, we leave out ablation studies in this direction.

6.2 Ablation Studies

To better understand how various components in our training procedure and model impact overall performance we conduct several ablation studies, as summarized in Table 4.

6.2.1 Answer Synthesis

We experiment with using the answer chunks given in the train set, A_h, to generate synthetic questions, versus those from an NER system, A_NER. Results in Table 4(B) show that using human-annotated answers to generate questions leads to a significant performance boost over using answers from an answer generation module. This supports the hypothesis that the answers humans choose to generate questions for provide important linguistic cues for finetuning the machine comprehension model.

6.2.2 Question Synthesis

To see how copying impacts performance, we explore using the entire paragraph to generate the question vs. only the two sentences before and one sentence after the answer span and report results in Table 4(B). On the NewsQA train set, synthetic questions that use 2 sentences contain an average of 3.0 context words within 10 words to the left and right of the answer chunk, those that use the entire context have 2.1 context words, and human generated questions only have 1.7 words. Training with generated questions that have a large amount of overlap with words close to the answer span (i.e., those that use 2-sentences vs. entire context for generation) leads to models that perform worse, especially with synthetic answer spans and no data regularization (35.6% F1 vs. 34.3% F1). One possible reason is that, according to analysis in Trischler et al. (2016), significantly more questions in the NewsQA dataset require paraphrase, inference, and synthesis as opposed to word-matching.

6.2.3 Model Finetuning

To see how the quantity of synthetic questions encountered during training impacts performance, we use k mini-batches from SQuAD for every synthetic mini-batch from NewsQA to finetune our model, and average the predictions of 4 checkpointed models during testing. As we see from the results, letting the model encounter data from human annotations, although from another domain, serves as a key form of data-regularization, yielding consistent improvement as k increases. We hypothesize this is because the data distribution of machine-generated questions differs from that of human-annotated ones; our batching scheme provides a simple way to prevent over-fitting to this distribution.

6.3 Error Analysis

In this section we provide a qualitative analysis of some of our components to help guide further research in this task.

6.3.1 Answer Synthesis

We randomly sample and present a paragraph with answers extracted by our answer synthesis module (Table 5). Although the module appears to have high precision, i.e., it picks up entities such as the “Atlantic Paranormal Society”, it misses clear entities such as “David Schrader”, which suggests that training a system with full NER/POS tags as labels would yield better results; this also explains why augmenting the synthetic data generated by the SynNet with such tags leads to improved performance.

They are ghost hunters, or, as they prefer to be called, paranormal investigators. “Ghost-Hunters”, which airs a special live show at 7 p.m. Halloween night, is helping lift the stigma once attached to paranormal investigators. The show has become so popular that the group featured in each episode – Atlantic Paranormal Society – has spawned imitators across United States and affiliates in countries. TAPS, as the “Hunters” group is informally known, even has its own “Reality Radio” show, magazine, lecture tours, T-shirts – and groupies. “Hunters” has made creepy cool, says David Schrader, a paranormal investigator and co-host of “Radio”, a radio show that investigates paranormal activity.
Table 5: Sample predictions from our answer synthesis module.

6.3.2 Question Synthesis

We randomly sample synthetic questions generated by our module and present our results in Table 6. Due to the copy mechanism, our module has the tendency to directly use many words from the paragraph, especially common entities, such as “Oklahoma” in the example. Thus, one way to generate higher-quality questions may be to introduce a cost function that promotes diversity during decoding, especially within a single paragraph. In turn, this would expose the RC model to a larger variety of training examples in the new domain, which can lead to better performance.

What is Oklahoma’s unemployment rate until Oklahoma City ?
What was the manager of the Oklahoma City agency ?
How many companies are in Oklahoma City ?
How many workers may Oklahoma have as fair hold ?
Who said the bureau has already hired civilians to choose
What was the average hour manager of Oklahoma City ?
How much would Oklahoma have a year to be held
What year did Oklahoma ’s census build job industry ?
Table 6: Predictions from the question synthesis module on a subset of a paragraph.
Figure 2: NewsQA accuracy of baseline BIDAF model trained on SQuAD (light green), vs. model finetuned with our method (red) vs. one trained from scratch on NewsQA (dark grey).

6.3.3 Machine Comprehension Model

We examine the performance over various question types of a BIDAF model finetuned on NewsQA with our method vs. one trained from scratch on NewsQA vs. one trained on SQuAD (Figure 2). Finetuning with the SynNet improves performance over all question types, with the largest performance boost on location and person-identification questions. Similarly, models trained on synthetic questions tend to approach in-domain performance on numeric and person-identification questions, but still struggle with questions that require higher-order reasoning, e.g., those starting with “what was” or “what did”. Designing a question generator that explicitly requires such reasoning may be one way to further bridge the gap in performance.

7 Conclusion

We introduce a two-stage SynNet for the task of transfer learning for machine comprehension, a task which is both challenging and of practical importance. With our network and a simple training algorithm in which we generate synthetic question-answer pairs on the target domain, we are able to generalize a MC model from one domain to another with no annotated data. We present strong results on the NewsQA test set, with a single model improving the F1 of a baseline BIDAF model by 5.3% and an ensemble by 7.6%. Through ablation studies and error analysis, we provide insights into our methodology on the SynNet and MC models that can help guide further research in this task.


Acknowledgments

We would like to thank Yejin Choi and Luke Zettlemoyer for helpful discussions concerning this work.