EQuANt (Enhanced Question Answer Network)

06/24/2019 ∙ by François-Xavier Aubet, et al. ∙ UCL

Machine Reading Comprehension (MRC) is an important topic in the domain of automated question answering and in natural language processing more generally. Since the release of the SQuAD 1.1 and SQuAD 2 datasets, progress in the field has been particularly significant, with current state-of-the-art models now exhibiting near-human performance at both answering well-posed questions and detecting questions which are unanswerable given a corresponding context. In this work, we present Enhanced Question Answer Network (EQuANt), an MRC model which extends the successful QANet architecture of Yu et al. to cope with unanswerable questions. By training and evaluating EQuANt on SQuAD 2, we show that it is indeed possible to extend QANet to the unanswerable domain. We achieve results which are close to 2 times better than our chosen baseline obtained by evaluating a lightweight version of the original QANet architecture on SQuAD 2. In addition, we report that the performance of EQuANt on SQuAD 1.1 after being trained on SQuAD 2 exceeds that of our lightweight QANet architecture trained and evaluated on SQuAD 1.1, demonstrating the utility of multi-task learning in the MRC context.


1 Introduction

Machine Reading Comprehension (MRC) entails engineering an agent to answer a query about a given context. The complexity of the task comes from the need for the agent to understand both the question and the context. Progress has been largely driven by datasets that have addressed increasingly difficult intermediate tasks. In particular, the SQuAD 1.1 dataset Rajpurkar et al. (2016) was released in 2016, providing an extensive set of paragraphs, questions and answers. As models rivalled human performance on that dataset, SQuAD 2 was released with an additional 50,000 adversarially written unanswerable questions.

Motivated by the general question of how an MRC agent can be adapted when its original MRC task assumptions are relaxed, we work on the specific research problem of relaxing the answerability assumption on the MRC task, and we evaluate our work using the SQuAD 2 dataset.

QANet Yu et al. (2018) is a feedforward architecture using only convolutions and attention mechanisms for MRC. It is devoid of recurrence, which is a typical ingredient in previous MRC models, and despite its simplicity it achieved state-of-the-art performance on SQuAD 1.1. Observing the absence of a mechanism in QANet to allow for unanswerability, and noting that to the best of our knowledge there has so far been no effort to incorporate one, we decided to base our work on this architecture. Our contribution is two-fold:

Firstly, we present EQuANt, which extends the original QANet architecture to include an answerability module. Working within the time and resource constraints of this project, we achieved a 63.5 F1 score on SQuAD 2, almost double the accuracy of our baseline QANet method. For the sake of reproducibility, we make available an open-source implementation of our model at https://github.com/Francois-Aubet/EQuANt.

Secondly, we show that by training EQuANt to accomplish two distinct tasks simultaneously, namely answerability prediction and answer extraction, we improve the model’s performance on SQuAD 1.1 from that of QANet, verifying that a multitask learning approach can improve an MRC model’s performance.

We begin in section 2 by presenting the background necessary to motivate and understand our contribution. In section 3, we give an overview of related work and how it complements and differs from our work. In sections 4, 5 and 6 we illustrate the design of our model and present and discuss our experimental results. Finally, in section 7, we summarise our work and propose potential future work which would extend our contribution.

2 Background

The problem of question answering can be formulated specifically in the open domain setting in the following way:

Given a question, or query, sequence $Q = (q_1, \dots, q_m)$ and a context paragraph sequence $C = (c_1, \dots, c_n)$, assume the answer to the question is a unique connected subsequence of $C$, then identify that subsequence. That is, identify $i$ and $j$, $1 \le i \le j \le n$, such that the span $(c_i, \dots, c_j)$ is the answer to the query $Q$. (1)

A recent and significant dataset responsible for much of the development of models in tackling the above-formulated problem is the Stanford Question Answering Dataset (SQuAD), and more specifically its two versions, SQuAD 1.0 and SQuAD 1.1 Rajpurkar et al. (2016). SQuAD consists of over 100,000 crowdsourced comprehension questions and answers based on Wikipedia articles. Importantly, this dataset is large enough to support complex deep learning models and contains a mixture of long- and short-phrase answers which are directly implied by the associated passage. Since its introduction, SQuAD has inspired healthy competition among researchers to hold the state-of-the-art position on its leaderboard.
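As a concrete illustration, the sketch below walks through one record of the publicly released SQuAD JSON files; the file name is illustrative, but the field names follow the standard SQuAD layout, including the "is_impossible" flag added in SQuAD 2.

```python
import json

# Illustrative path; the dev set is distributed as a single JSON file.
with open("dev-v2.0.json") as f:
    squad = json.load(f)

article = squad["data"][0]            # one Wikipedia article
paragraph = article["paragraphs"][0]  # a context paragraph with its questions
context = paragraph["context"]

for qa in paragraph["qas"]:
    question = qa["question"]
    # SQuAD 2 marks adversarial questions with "is_impossible";
    # answerable questions carry character-level answer spans.
    if qa.get("is_impossible", False):
        print(question, "-> unanswerable")
    else:
        answer = qa["answers"][0]
        print(question, "->", answer["text"], "at char", answer["answer_start"])
```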

The success of an MRC model hinges on its ability to represent both the structures of the questions and contexts, and the relationship between the questions and the contexts. The two most prominent methods in the literature for representing the structure of such sequential data are attention and recurrence, so it is not surprising that the best performing models on the SQuAD 1.1 leaderboard are attention-based models, e.g. BERT Devlin et al. (2018), and RNN-based models, e.g. R-Net Wang et al. (2017). One prominent attention-based candidate on the leaderboard is QANet Yu et al. (2018), upon which our work is built. We will now provide a brief introduction to QANet and motivate our decision to work with this model.

QANet consists of five functional blocks: a context processing block, a question processing block, a context-query block, a start-probability block and an end-probability block. See figure 2 for a high-level representation of the model. Within the context, question and context-query blocks, an embedding encoder of the form shown in figure 1 is used repeatedly. This is very similar to the transformer encoder block introduced in Vaswani et al. (2017), but possesses an additional convolutional layer after the positional encoding and before the layer norm and self-attention layers. These additional separable convolutional layers enable the model to capture the local structure of the input data. Having passed through the context-query block, the data is then passed into the two probability blocks, which are both standard feed-forward layers with softmax, to calculate the probability of each word being a start- or end-word. For a detailed description of each portion of the model, the reader is referred to the original paper Yu et al. (2018); further discussion of the components most relevant to our architecture design and experiments can be found in section 4.1.
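The following is a condensed tf.keras sketch of such an encoder block, not the authors' implementation: layer sizes, kernel size and the placement of residual connections are illustrative, and positional encoding is assumed to have been added before the block.

```python
import tensorflow as tf

class EncoderBlock(tf.keras.layers.Layer):
    """Sketch of a QANet-style encoder block: depthwise-separable
    convolutions, then self-attention, then a feedforward layer, each
    preceded by layer normalisation and wrapped in a residual
    connection. All sizes are illustrative."""

    def __init__(self, hidden=96, num_convs=2, kernel=7, heads=1):
        super().__init__()
        self.convs = [tf.keras.layers.SeparableConv1D(
            hidden, kernel, padding="same", activation="relu")
            for _ in range(num_convs)]
        self.attn = tf.keras.layers.MultiHeadAttention(num_heads=heads,
                                                       key_dim=hidden)
        self.ffn = tf.keras.layers.Dense(hidden, activation="relu")
        self.norms = [tf.keras.layers.LayerNormalization()
                      for _ in range(num_convs + 2)]

    def call(self, x):
        # x: (batch, seq_len, hidden); positional encoding is assumed
        # to have been applied before the block.
        norms = iter(self.norms)
        for conv in self.convs:
            x = x + conv(next(norms)(x))
        y = next(norms)(x)
        x = x + self.attn(y, y)
        x = x + self.ffn(next(norms)(x))
        return x
```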

The original QANet authors Yu et al. (2018) achieved a result of 73.6 Exact Match (EM) and 82.7 F1 score on the SQuAD 1.1 dataset, placing it among the then state-of-the-art models. This is quite remarkable given its apparent conceptual and practical simplicity. Armed with separable convolutions and an absence of recurrence, it achieves these results whilst having a faster training and inference time than all of the RNN-based models preceding it Yu et al. (2018). We are thus motivated to investigate the properties of a simple, efficient and accurate model in the hope of gaining a fundamental understanding of question-answer modelling.

As methods on the top of the SQuAD 1.1 leaderboard started to outperform human EM and F1 scores, a more challenging task was called for, leading to the entrance of SQuAD 2.0 Rajpurkar et al. (2018). In addition to SQuAD 1.1, SQuAD 2.0 added over 50,000 unanswerable questions written adversarially to look similar to answerable ones.

Under this new setting, we reformulate the question answering problem as the following:

For $Q$ and $C$ as given in (1), relax the assumption that an answer exists, but if it does, then assume it is a unique connected subsequence of $C$. First identify the value of the indicator variable $a$, such that if the answer exists then $a = 1$, otherwise $a = 0$. Furthermore, if the context contains the answer, then identify $i$ and $j$, $1 \le i \le j \le n$, such that the span $(c_i, \dots, c_j)$ is the answer to the query $Q$. (2)

Inspecting the QANet architecture, it is not hard to see that the model cannot give the desired prediction for unanswerable questions, as the value of $a$ is implicitly assumed to be $1$ and the length of the predicted span is at least $1$. This motivates our extension of QANet to accommodate unanswerable questions and identifies SQuAD 2.0 as an appropriate benchmark dataset for our work.

Figure 1: The mechanism of one QANet encoder block.

3 Related Work

3.1 Open domain question answering

Recurrence has traditionally been a key component in many natural language tasks, including QA, and since the entrance of attention mechanisms into the QA domain, models have also found success by using attention to guide recurrence, such as the BiDAF and SAN models Seo et al. (2016); Liu et al. (2017). A key drawback of traditional recurrence-based architectures is the long training time, which stems from the cost of modelling relations between words that are far apart: the number of sequential operations connecting two words grows linearly with their separation. Replacing recurrence with pure attention reduces this to a constant, providing faster algorithms Vaswani et al. (2017).

SAN, alongside other models, e.g. Hill et al. (2015); Dhingra et al. (2016), used multi-step reasoning, implemented using recurrent layers, to predict the answer spans. The purpose of using multiple reasoning states is to extract higher-order logical relations in contexts. For example, the model may first learn what a pronoun is referring to before extracting the answer span based upon that reference. In contrast, QANet uses only multi-headed attention and convolution mechanisms, which encapsulate the intra-context relations and are in addition superior at modelling long-range relations. Moreover, the large recurrence component of these models creates a burden on training speed, whereas QANet's attention and separable-convolution approach saves an order of magnitude in training complexity, by the result stated in the first paragraph of this section.

The transformer architecture proposed in Vaswani et al. (2017) and its pre-trained counterpart, BERT Devlin et al. (2018), have become the common factor in all leading QA models. Unlike QANet, which is specifically designed for QA, BERT is an all-in-one model, capable of aiding many natural language tasks. Thus it is not surprising that BERT is a much larger model than QANet, containing 110 million parameters in its base version, compared with the fewer than one million parameters present in QANet during our computational experiments. As a result, BERT has significantly greater inference time than QANet. Furthermore, while its multi-faceted abilities are impressive, BERT is less capable of illustrating how a model interacts with the QA problem specifically. QANet, however, as a simple feed-forward, QA-specific model with less training time, allows more insight into the model's reaction to the problem, hence providing researchers with more intuition for model enhancement.

3.2 Unanswerability extension

One body of work on unanswerability extensions relies on incorporating a no-answer score into the model, and this is the main inspiration for our work. Levy et al. extended BiDAF Levy et al. (2017); Seo et al. (2016) to deal with unanswerable questions by effectively setting a threshold, whereby the model outputs no answer if its highest confidence in any answer is less than this threshold. Our work uses an approach similar to Levy's to verify that QANet generally assigns lower probabilities in a dataset with unanswerable questions, but our final model adopts a more explicit approach.

In Liu et al. (2018), the original SAN authors extended SAN to accommodate unanswerable questions. Their work added an extra feed-forward module to discriminate between answerable and unanswerable questions and trained an objective which jointly accounts for answer correctness and answerability. We take inspiration from this extended SAN, but the summary statistic fed into the answerability module in our model is obtained from a fundamentally different procedure which is completely devoid of recurrence. We also favour the approach of minimising a joint objective over different tasks in response to recent successes of multi-task approaches in NLP, which suggest that a learner generalises better when it learns to accomplish more than one task.

Read+Verify and U-Net Hu et al. (2018); Sun et al. (2018) both use an additional answer verifier which compares the answer sentence with the question to find local entailment that supports the answer. This local entailment finding is able to improve answer accuracy because it is sensitive to specific types of unanswerability, such as an impossible condition. Due to time constraints, we leave the exploration of verification modules for future work.

4 Methods

4.1 Light QANet implementation

Given that our model builds directly upon QANet, a natural first step was to work with a computational implementation of this base model. We chose to use the open-source TensorFlow implementation of QANet hosted at https://github.com/NLPLearn/QANet for our QANet experiments and as a base for our extension. A particularly attractive aspect of this implementation is that it allows straightforward customisation of the hyperparameters involved in the QANet model. In order to allow for a larger number of design iterations and to account for limited computational resources, we chose to utilise this customisability to make our trained QANet model lightweight, and we refer to this scaled-down version of the original QANet as "light QANet". Although this choice is likely to mean that our results, quoted in section 6, could be improved by increasing the complexity of the architecture, we stress that our aim is not to surpass state-of-the-art performance on SQuAD 2, but instead to show that it is possible to successfully extend QANet to the unanswerable domain.

Figure 2: EQuANt architecture: combination of QANet and unanswerability extension module.

As in the original paper, our character embeddings are trainable and are initialised by truncating GloVe vectors Pennington et al. (2014). However, in the interest of model size, we choose to retain only a subset of the components of each 300-dimensional GloVe vector rather than the full vector used in the original paper. We then used the character embedding convolution of QANet to map these truncated vectors to a lower-dimensional representation, making the output of our context and query input embedding layers smaller than the 500 dimensions used by the original authors and resulting in a significant reduction in the number of parameters in our model. Having utilised the input embedding to represent each word in the context and query as a single vector, these vectors then flow into the embedding encoder blocks (of the form shown in figure 1), where they are transformed using a series of convolutions, feedforward layers and self-attention. In these encoder blocks and throughout the rest of the network, we choose our hidden layer size to be 96, as opposed to the original QANet's hidden layer size of 128. Furthermore, although the typical transformer architecture relies on multi-headed self-attention, with the original QANet using 8 heads in all layers, this introduces additional computational overhead. As a result, we minimise this by using only a single head; however, it is straightforward to change the number of heads in our implementation and this may yield fruitful results. All other architecture and hyperparameter choices match those described in the original paper Yu et al. (2018).
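For reference, the scaled-down choices described above can be summarised in a small configuration sketch; the key names are our own shorthand rather than the flags exposed by the NLPLearn implementation, and values not stated in the text are omitted.

```python
# Shorthand summary of the "light QANet" settings described above; the
# names do not correspond to the NLPLearn implementation's actual flags.
LIGHT_QANET = {
    "hidden_size": 96,           # original QANet uses 128
    "attention_heads": 1,        # original QANet uses 8 in every layer
    "glove_embedding": "truncated 300-d vectors",
    "char_embeddings_trainable": True,
    "other_hyperparameters": "as in Yu et al. (2018)",
}
```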

As well as using this process to gain an understanding of the inner workings of QANet, we utilised light QANet to provide a principled initialisation for the training of our extended architecture. In Yu et al. (2018), the authors describe how they used an augmented dataset generated using neural machine translation and how this significantly improved their results. As having access to this dataset would likely result in improved outcomes for our model, we initiated contact with the QANet authors; however, access to the augmented dataset was not granted. As a result, we trained light QANet on SQuAD 1.1 for 32,000 iterations, providing the results shown in table 4(a), and saved the corresponding weights, allowing them to be restored and used as principled initialisations when performing subsequent model training.

4.2 Problem Analysis

In order to gain a better understanding of our research problem, both conceptually and practically, and to assess our intuition to extend QANet with an extra answerability module, we specifically ask ourselves two sub-questions: 1) To what extent does QANet distinguish between answerable and unanswerable questions? 2) What should be chosen as the input to our answerability module?

We first ran our light QANet on answerable and unanswerable questions extracted from the SQuAD 2 dataset and analysed the results.

Figure 3: Attempts to extend QANet to EQuANt.

Our results showed that QANet assigns generally lower probability to all possible “answers” to unanswerable questions. More precisely, given context-query pairs from answerable and unanswerable questions, we study the maximum start-word and end-word probabilities assigned by QANet to all words in the context, and we find that unanswerable questions on average receive lower start- and end-word probabilities on all words in the corresponding context. This shows that the original QANet already captures information about unanswerability, validating the possibility of answerability detection by appending an additional functional module to the basic QANet structure.
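A minimal sketch of this analysis, assuming `start_probs` and `end_probs` are the per-word probability vectors produced by light QANet for a single context (names and shapes are our assumptions):

```python
import numpy as np

def max_span_probs(start_probs, end_probs):
    """Largest start- and end-word probabilities assigned to any word in
    one context (both arguments are 1-d arrays over context positions)."""
    return np.max(start_probs), np.max(end_probs)

def mean_max_probs(examples):
    """examples: iterable of (start_probs, end_probs) pairs for one group
    of questions. Returns the group's mean maximum start/end probability."""
    maxima = np.array([max_span_probs(s, e) for s, e in examples])
    return maxima.mean(axis=0)

# The observation above corresponds to
#   mean_max_probs(answerable_examples) > mean_max_probs(unanswerable_examples)
# holding for both the start- and end-word probabilities.
```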

Upon inspection of the intermediate outputs of the QANet architecture, we found that QANet respects the variable length of input queries and contexts, resulting in all intermediate outputs of the architecture having variable size. Whilst this is compatible with QANet’s original aim of assigning probabilities to every word in the context, it is not immediately compatible with our extension, the purpose of which is to assign an answerability score to the context as a whole. It is thus necessary to design our extension to handle variable input size. In section 4.3, we outline three attempted solutions.

4.3 Enhanced Question Answer Network (EQuANt)

We now provide details on our exact architecture design, which we name Enhanced Question Answer Network (EQuANt). EQuANt is based on light QANet, with an answerability extension module as motivated in section 4.2. We investigated three extension designs, with the final one achieving promising results, in particular almost doubling light QANet’s accuracy on SQuAD 2 in both EM and F1 measures.

A component of the QANet architecture which is particularly relevant to the design of our architectures and our analysis of the inner workings of the model is the context-query attention layer (see figure 2). This layer takes in the encoded context and query and combines them into a single tensor. A core aspect of this layer is the similarity matrix $S$, which has size $n \times m$, where $n$ is the context length and $m$ is the query length. The $(i, j)$ element of this matrix represents the similarity of context word $i$ and query word $j$, calculated using the trilinear function described in Seo et al. (2016). This matrix is important for two reasons. Firstly, visual inspection of its components allows interpretation of the quality of the context and question encodings. Secondly, if the model is to be successful, $S$ must contain the information required to determine answerability or lack thereof, and it represents the first point in the network where a single tensor must contain this information, making it a natural focal point for our architecture designs.
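For completeness, a plain NumPy sketch of the trilinear similarity of Seo et al. (2016), which scores a context word encoding c and a query word encoding q as w · [c; q; c ∘ q] with a learned weight vector w (shapes and names are illustrative):

```python
import numpy as np

def trilinear_similarity(C, Q, w):
    """C: (n, d) encoded context words, Q: (m, d) encoded query words,
    w: (3d,) learned weight vector. Returns S of shape (n, m), where
    S[i, j] = w . [c_i ; q_j ; c_i * q_j]."""
    n, d = C.shape
    w_c, w_q, w_cq = w[:d], w[d:2 * d], w[2 * d:]
    # Each term of the concatenation decomposes, so S can be built
    # without materialising the full (n, m, 3d) feature tensor.
    S = (C @ w_c)[:, None] + (Q @ w_q)[None, :] + (C * w_cq) @ Q.T
    return S
```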

These architecture designs are outlined in the remainder of this section and visualised in figure 3. Of the three designs that were implemented, the third (EQuANt 3) exhibited the best performance (see table 4(a)) and was therefore chosen to be our final model. Discussions regarding the performance of each architecture can be found in sections 5 and 6.

4.3.1 EQuANt 1

EQuANt 1 takes the context-query attention weights from the context-query attention layer, which are of size (length of context) × (length of question), and applies two convolutional layers followed by global mean pooling and a feedforward layer. The variable-size dimensions inherited from the variable context and question lengths are reduced to 1 during global mean pooling. The final feed-forward layer then transforms the channel dimension obtained from the convolution layers to size 1, giving us a scalar which we use as the score. This model performed poorly on the SQuAD 2 dev set, likely due to the information loss when convolving the context-query attention matrix. More discussion is provided in section 5.
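A tf.keras sketch of this head follows; kernel sizes, channel counts and activations are our assumptions, and the input is the context-query attention map with a trailing channel axis.

```python
import tensorflow as tf

def equant1_head(channels=16):
    """EQuANt 1 sketch: two convolutions over the (context x question)
    attention map, global mean pooling over both variable-length axes,
    then a feedforward layer producing a single answerability score."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(channels, 3, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(channels, 3, padding="same", activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),  # collapses both variable dims
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
```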

4.3.2 EQuANt 2

The EQuANt 2 extension stems from the output of the first stacked encoder layer, making each of its inputs of dimension (length of context) × (number of hidden nodes). We apply two encoder transformations as in figure 1 and then a feedforward network which transforms the size of the hidden layer (96) to 1, followed by padding the context-length dimension to a constant length. Then we apply two layers of 1-d convolution, before a final feedforward layer to map to a scalar which we take as the score. This model also performed poorly. Note that padding significantly decreases the proportion of non-zero elements along the context-length dimension in many data points, essentially causing interesting information to be compressed, potentially explaining the lack of success of this model.

4.3.3 EQuANt 3 (Final design)

After learning from the failure cases of EQuANt 1 and 2, we aim to design a model which extracts more useful information from the context-query attention matrix, whilst avoiding diluting it with zeros. To extract higher-level information from the context-query attention matrix, we use two more encoder blocks as in figure 1, after which we down-sample the context-query dimension using global mean pooling. Our exact design is as follows: the answerability module again starts from the output of the first stacked encoder layer and two encoder transformations are applied. The output from this then undergoes three feedforward layers, which transform the hidden dimension from 96 to 48, 48 to 24 and 24 to 1. Finally, a global mean pooling layer takes the variable-length dimension inherited from the context length and reduces it to size 1, giving us the answerability score. EQuANt 3 performed respectably on the SQuAD 2 dev set, achieving 70.26% accuracy on answerability prediction.
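A tf.keras sketch of the EQuANt 3 head is shown below; the two preceding encoder blocks are omitted, and the ReLU activations and the placement of the sigmoid after pooling are our assumptions.

```python
import tensorflow as tf

def equant3_head(hidden=96):
    """EQuANt 3 sketch: dense layers shrink the hidden dimension
    96 -> 48 -> 24 -> 1, then global mean pooling collapses the
    variable-length context axis, giving one answerability score."""
    inputs = tf.keras.Input(shape=(None, hidden))        # (batch, context_len, 96)
    x = tf.keras.layers.Dense(48, activation="relu")(inputs)
    x = tf.keras.layers.Dense(24, activation="relu")(x)
    x = tf.keras.layers.Dense(1)(x)                      # (batch, context_len, 1)
    pooled = tf.keras.layers.GlobalAveragePooling1D()(x)  # (batch, 1)
    score = tf.keras.layers.Activation("sigmoid")(pooled)
    return tf.keras.Model(inputs, score)
```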

Figure 4: Attention maps. Top: Unanswerable question. Middle: Answerable question. Bottom: Shuffled question.

4.3.4 Loss function

Let $\theta$ denote the vector of parameters, $\hat{a}$ the predicted answerability probability, $a$ the ground-truth answerability, $\hat{p}_s$ the predicted start-word probability of the true answer and $\hat{p}_e$ the predicted end-word probability of the true answer. Then our loss function can be formulated as

$L(\theta) = L_a(\hat{a}, a) + a \left[ L_s(\hat{p}_s) + L_e(\hat{p}_e) \right],$

where $L_k$ with $k \in \{a, s, e\}$ denotes the cross-entropy loss associated with answerability, start-word and end-word predictions respectively.
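A sketch of this objective, assuming the span terms are counted only for answerable examples (the exact weighting used in training may differ); `p_start_true` and `p_end_true` denote the probabilities the model assigns to the ground-truth start and end words.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def equant_loss(a_true, a_pred, p_start_true, p_end_true, eps=1e-8):
    """a_true: (batch,) ground-truth answerability in {0., 1.};
    a_pred: (batch,) predicted answerability probabilities;
    p_start_true / p_end_true: (batch,) probabilities assigned to the
    true start / end words (any positive dummy value when a_true == 0)."""
    answerability_loss = bce(a_true, a_pred)
    span_loss = -a_true * (tf.math.log(p_start_true + eps)
                           + tf.math.log(p_end_true + eps))
    return answerability_loss + tf.reduce_mean(span_loss)
```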

In our experiments, we performed stochastic gradient descent using the Adam optimiser with a batch size of 32 and a learning rate of 0.001.

5 Experiments

As mentioned in section 4.3, the context-query similarity matrix $S$ offers insight into the quality of the model's encodings and should contain the information required to infer answerability or lack thereof. In order to gain the most insight into potential model behaviour, we investigated the form of $S$ within our light QANet for three different types of context-query pairs. The first two types are the standard answerable and adversarially designed unanswerable varieties taken directly from SQuAD 2. The final type is referred to as shuffled, for which we pair a given context with a question from a different article, meaning that the question is almost certainly unanswerable and unrelated to the context paragraph.

For visualisation purposes, we focused on short contexts, and an example of $S$ for a specific short context and each of the three types of question is shown in figure 4. These results are interesting for two main reasons. Firstly, they show that the learnt encodings are meaningful. For example, the word “when” in the adversarial unanswerable question attends to the date-related part of the context, and the word “population” in the shuffled question attends to words in the context associated with geographical regions. Furthermore, these results perhaps offer insights into why the initial convolution approach was unsuccessful. In particular, it seems that answerable and adversarially unanswerable questions both lead to similarity matrices in which, for each query word, the similarity is peaked on a few context words, making it hard for convolutions to successfully identify unanswerability. However, as expected, the matrices for shuffled questions appear more diffuse and random due to the largely unrelated meanings of the context and query, further emphasising the subtlety in distinguishing answerable and adversarially unanswerable questions.
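A heat-map of $S$, of the kind shown in figure 4, can be produced with a short matplotlib sketch such as the following (function and argument names are ours):

```python
import matplotlib.pyplot as plt

def plot_similarity(S, context_tokens, query_tokens, title):
    """Heat map of the context-query similarity matrix S (n x m), with
    context words on the vertical axis and query words on the horizontal."""
    fig, ax = plt.subplots()
    ax.imshow(S, aspect="auto", cmap="viridis")
    ax.set_xticks(range(len(query_tokens)))
    ax.set_xticklabels(query_tokens, rotation=90)
    ax.set_yticks(range(len(context_tokens)))
    ax.set_yticklabels(context_tokens)
    ax.set_title(title)
    fig.tight_layout()
    return fig
```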

Name No. of Params Training Iterations EM F1 Accuracy
Light QANet 788,673 32,000 31.390 37.432 49.928
Light QANet 788,673 62,000 32.903 38.412 49.928
EQuANt 1 996,196 40,000 32.881 38.356 49.914
EQuANt 2 1,001,520 40,000 33.512 38.894 49.914
EQuANt 3 927,970 62,000 56.843 60.980 69.114
EQuANt 3 927,970 78,000 58.140 62.360 70.26
(a) SQuAD 2 dev set results.
Name No. of Params Training Iterations EM F1
Light QANet 788,673 32,000 62.270 74.058
Light QANet 788,673 62,000 63.623 75.841
EQuANt 3 927,970 62,000 69.29 78.80
(b) SQuAD 1.1 dev set results.

6 Results & Discussion

As mentioned in section 4.1, our first step was to train our light QANet on SQuAD 1.1 for 32,000 iterations in order to generate a suitable initialisation, which was used for the subsequent training of all other models. Evaluation of this trained model on SQuAD 1.1 yields the results shown in the first row of table 4(b). The quoted number of training iterations for the other models in tables 4(b) and 4(a) therefore includes these pre-training iterations. In order to observe how our lightweight model trained on SQuAD 1.1 without data augmentation compares to the original QANet without data augmentation, we trained for a further 30,000 iterations on SQuAD 1.1, yielding the results shown in the second row of table 4(b). These EM/F1 scores are 9.98/6.853 lower than the corresponding results for the full QANet architecture Yu et al. (2018), implying that our choice to employ a lightweight architecture has a noticeable impact on performance.

We evaluated these trained light QANet models on the SQuAD 2 dev set, implicitly treating all questions as answerable. This led to the results shown in the first and second rows of table 4(a), which act as baselines to compare our EQuANt results against. The accuracy column in table 4(a) contains the proportion of questions correctly identified as being answerable or unanswerable.

Having investigated the performance of light QANet on SQuAD 1.1 and 2, we moved on to train each of the EQuANt architectures described in section 4.3 on SQuAD 2. As can be seen in table 4(a), EQuANt 1 and 2 did not perform well on SQuAD 2. In fact, their performance is identical. This is explained by both models learning to output a constant answerability probability of 0.69, independent of the query-context pair considered. Note that this probability matches the proportion of SQuAD 2 training examples which are answerable, meaning that these models were unable to extract the features necessary for accurately predicting answerability and defaulted to the most basic frequentist approach of predicting the mean.

However, as shown in the final row of table 4(a), EQuANt 3 is capable of both answerability prediction and question answering, significantly exceeding baseline performance on SQuAD 2.

As laid out in a blog post by Sebastian Ruder, multi-task learning has recently been successfully applied to numerous NLP tasks. We therefore decided to measure the performance of EQuANt 3, trained on the two tasks of question answering and answerability prediction, at question answering alone by evaluating its EM and F1 scores on SQuAD 1.1, providing EQuANt 3 with the ground-truth answerability of true for each SQuAD 1.1 question. As shown in the final row of table 4(b), EQuANt 3 outperforms light QANet by 5.667/2.959 on EM/F1 scores, suggesting that it indeed benefits from this multi-task approach.

7 Conclusion

In this work, we have presented EQuANt, an MRC model which extends QANet to cope with unanswerable questions. In sections 2 and 3, we motivated our work and placed it in the wider context of MRC and unanswerability. Following this, in section 4, we presented our lightweight QANet implementation and laid out in detail the three EQuANt architectures that were trained and whose performance was evaluated. In section 5, we investigated the context-query attention maps within our lightweight QANet, allowing us to verify the quality of our learnt encodings and gain insight into why our initial architecture, EQuANt 1, did not predict answerability effectively. Finally, in section 6, we presented our results and discussed how the observed performance of EQuANt 3 on SQuAD 1.1 suggests that multi-task learning is a valuable approach in the context of MRC.

References