Question-Aware Sentence Gating Networks for Question and Answering

by   Minjeong Kim, et al.
Korea University

Machine comprehension question answering, which finds an answer to a question given a passage, involves high-level reasoning: understanding and tracking the relevant content across semantic units such as words, phrases, and sentences in a document. This paper proposes novel question-aware sentence gating networks that directly incorporate sentence-level information into word-level encoding processes. To this end, our model first learns question-aware sentence representations and then dynamically combines them with word-level representations, resulting in semantically meaningful word representations for QA tasks. Experimental results demonstrate that our approach consistently improves accuracy over existing baseline approaches on various QA datasets and is widely applicable to other neural network-based QA models.


1 Introduction

The machine comprehension question answering (QA) task has gained significant popularity in the research community. Given a passage and a question, the machine finds the correct answer by understanding the passage using contextual information from various levels of semantics. In the example shown in Figure 1, the question ‘What other role do many pharmacists play?’ requires the model to properly grasp the meaning of the first sentence (the first role of the pharmacists), so as to understand that the role presented in the second sentence corresponds to the ‘other role’ asked about in the question.

Figure 1: An example from the SQuAD dataset.

To tackle machine comprehension, many neural network-based models have recently been proposed Seo et al. (2017); Liu et al. (2017); Pan et al. (2017); Huang et al. (2017); Hu et al. (2017). Most of them try to solve this problem by extracting the matching information between the word-level encoding vectors in the passage and those in the question. For example, Xiong et al. (2017) introduce a co-attention encoder that captures the word-level question-passage matching information, and Wang et al. (2017) additionally incorporate self-matching information among words within the passage.

However, most of the existing approaches still focus on improving word-level information and do not directly utilize the higher-level semantics contained in, say, sentences and paragraphs. Motivated by this, we propose novel question-aware sentence gating networks that directly inject sentence-level information into individual word encoding vectors.

Our proposed approach works as follows: (1) Sentence encoding: we first generate initial encoding vectors of the individual sentences in the passage and of the question. (2) Sentence matching: we further transform the passage sentence vectors into question-aware vectors by considering their degree of matching with the question. (3) Sentence gating: for each word, our gating module determines how much sentence information to add to it, and the new word-level encoding vector is computed accordingly as a convex combination of its original word vector and its corresponding sentence vector.

Finally, the contributions of our paper are as follows:

  • We propose novel question-aware sentence gating networks that inject sentence-level semantics into individual word encoding vectors in QA tasks.

  • We extend several existing state-of-the-art models by integrating our sentence gating networks.

  • We show the effectiveness of our proposed approach by conducting quantitative comparisons using well-known QA datasets and models as well as qualitative analysis of our model.

Figure 2: Overview of our proposed question-aware sentence gating networks. The inputs of the networks are the word tokens of a given sentence in the passage and the question tokens. The networks yield the gated word representations for that sentence.
Figure 3: A detailed illustration on the sentence encoding and the sentence matching modules. Note that the produced output is passed to the sentence gating module.

2 Problem Setting

Taking a passage and a question as input, a QA model jointly understands the two and finds the corresponding answer in the passage. The passage is represented as a sequence of sentences, and each sentence is further decomposed into a sequence of words. Alternatively, the passage is represented as a single sequence of words spanning all of its sentences. The (single-sentence) question is similarly represented as a sequence of words. We assume that the answer always takes the form of either a single entity or a phrase that appears in the passage.

3 Question-Aware Sentence Gate

In this section, we describe our proposed question-aware sentence gating networks that directly incorporate the sentence-level information into the encoding module of individual words that belong to the corresponding sentence. The overview of our proposed method is summarized in Figure 2.

3.1 Sentence Encoding Module

Given the word sequence of a passage and that of a question, the first layer of our model is an embedding layer Mikolov et al. (2013); Pennington et al. (2014), which represents individual words as pre-trained word embedding vectors. Our model then applies bi-directional Gated Recurrent Units (BiGRU) to the embedding vectors of the passage and to those of the question, respectively, producing a word-level encoding vector for every word in the passage and in the question. The size of these encoding vectors is determined by the hidden dimension.

Afterwards, an average pooling layer is applied to these word-level encoding vectors per sentence in the passage, i.e., the representation of each sentence is the mean of the encoding vectors of the words that belong to it.

As will be seen in our experimental results (Table 4), we choose the average pooling approach for sentence encoding since it outperforms the other approaches while using a relatively small number of parameters.
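The pooling step described above can be sketched as follows. This is a minimal illustration in plain Python, assuming the word-level encoding vectors are already available as lists of floats and that every sentence contains at least one word; the function name and the toy values are hypothetical and not taken from the original implementation.

```python
def sentence_average_pool(word_encodings, sentence_lengths):
    """Average word-level encoding vectors within each sentence.

    word_encodings: list of equal-length vectors, one per passage word.
    sentence_lengths: number of words per sentence (each > 0); must sum
        to len(word_encodings).
    Returns one averaged vector per sentence.
    """
    if sum(sentence_lengths) != len(word_encodings):
        raise ValueError("sentence lengths must sum to the number of words")
    pooled, start = [], 0
    for length in sentence_lengths:
        words = word_encodings[start:start + length]
        # Column-wise mean over the words of this sentence.
        pooled.append([sum(column) / length for column in zip(*words)])
        start += length
    return pooled


# Toy passage: 5 words with 2-dimensional encodings, sentences of 2 and 3 words.
H = [[1.0, 0.0], [3.0, 2.0], [0.0, 0.0], [3.0, 3.0], [6.0, 0.0]]
S = sentence_average_pool(H, [2, 3])
```

Average pooling adds no trainable parameters, which is in line with the remark above about the small parameter count.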

3.2 Sentence Matching Module

The generated passage and question sentence representations are then fed into our sentence-matching module to incorporate the question information into the sentence vectors in the passage.

Inspired by generic natural language inference approaches Liu et al. (2016); Conneau et al. (2017), we compute the degree of matching between each passage sentence representation and the question sentence representation in several ways. Specifically, we use three different matching schemes Mou et al. (2015): (1) the concatenation of the two representation vectors, (2) their element-wise (Hadamard) product, and (3) their absolute element-wise difference. The results of the three matching methods are then concatenated (Eq. 1), and the resulting vector serves as the question-aware sentence representation. It captures the sentence-level relation between the passage and the question, and it is then fed to the sentence gating module to pass the sentence-level matching information on to the word-level representation.
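The three matching schemes and their concatenation can be illustrated with a short sketch. It assumes that the passage sentence vector and the question sentence vector have the same dimension, so the output has four times that dimension; the function name and example values are hypothetical.

```python
def match_sentences(passage_sentence, question_sentence):
    """Build a question-aware sentence vector from two same-sized vectors."""
    if len(passage_sentence) != len(question_sentence):
        raise ValueError("both sentence vectors must have the same dimension")
    pairs = list(zip(passage_sentence, question_sentence))
    # Scheme (1): concatenation of the two representation vectors.
    concatenation = list(passage_sentence) + list(question_sentence)
    # Scheme (2): element-wise (Hadamard) product.
    product = [p * q for p, q in pairs]
    # Scheme (3): absolute element-wise difference.
    abs_difference = [abs(p - q) for p, q in pairs]
    # The three results are concatenated into one vector.
    return concatenation + product + abs_difference


matched = match_sentences([1.0, -2.0], [3.0, 0.5])
```

The product and the absolute difference are symmetric in their two arguments, while the concatenation preserves which vector came from the passage and which from the question.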

3.3 Sentence Gating Module

We propose the sentence gating networks that determine how much the sentence-level matching information should be incorporated into each word representation towards more semantically meaningful word representations for QA tasks.

To this end, we first generate another set of word-level representations by applying an additional BiGRU module to the word embedding vectors of the entire passage. Note that these representations can be re-indexed by sentence: a word at a given position in the passage is equivalently identified by the sentence it belongs to and its position within that sentence.

Then, our sentence gating module takes as input a word representation and its corresponding sentence representation (defined in Eq. 1) and generates a new word representation (fused with the sentence information) together with its dimension-wise gating vector. The computation involves trainable weight and bias parameters and an element-wise sigmoid activation function.

As our final output, the gated word representation is computed as a convex combination of the original word representation and the new, sentence-fused word representation, weighted dimension-wise by the gating vector through the Hadamard product (Eq. 2). Each element of the gating vector determines the amount of information in the corresponding dimension of the fused representation to be added to the previous-layer word representation.
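A minimal sketch of the gating step is given below, under stated assumptions. The fused representation and the gate are each computed here by an affine map over the concatenated word and sentence vectors; the `tanh` nonlinearity on the fused representation is an assumption made for illustration, while the sigmoid gate and the convex combination follow the description above. All names (`affine`, `sentence_gate`, `W_f`, `b_f`, `W_g`, `b_g`) are hypothetical.

```python
import math


def affine(weights, bias, vector):
    """Return weights @ vector + bias for a list-of-rows weight matrix."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def sentence_gate(word_vec, sentence_vec, W_f, b_f, W_g, b_g):
    """Fuse a word vector with its sentence vector through a vector gate."""
    joint = list(word_vec) + list(sentence_vec)
    # Candidate word representation fused with sentence information.
    # ASSUMPTION: tanh is a placeholder nonlinearity chosen for this sketch.
    fused = [math.tanh(z) for z in affine(W_f, b_f, joint)]
    # One gate value in (0, 1) per dimension via an element-wise sigmoid.
    gate = [1.0 / (1.0 + math.exp(-z)) for z in affine(W_g, b_g, joint)]
    # Convex combination of the original and the fused representation.
    gated = [(1.0 - g) * h + g * f for g, h, f in zip(gate, word_vec, fused)]
    return gated, gate


word = [0.5, -1.0]                       # 2-dimensional word representation
sentence = [1.0, 0.0, 2.0]               # 3-dimensional sentence representation
W_f = [[0.1] * 5, [-0.2] * 5]            # toy weights for the fused vector
W_g = [[0.0] * 5, [0.0] * 5]             # zero weights give a gate of 0.5
gated, gate = sentence_gate(word, sentence, W_f, [0.0, 0.0], W_g, [0.0, 0.0])
```

Because every gate element lies strictly between 0 and 1, each output dimension stays between the original word value and the fused value for that dimension.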

The proposed algorithm is widely applicable to the passage and question encoding processes in various existing QA models: the output of the word-level encoding module is simply replaced with our sentence-gated word representations, as will be discussed in Section 4.

Data     # Train    # Validation   # Test
SQuAD    87,599     10,570         -
WDW      127,786    10,000         10,000
WDW-R    185,978    10,000         10,000
Table 1: Data statistics.

4 QA Model with Sentence Gate

This section discusses how we extend existing QA models using our proposed sentence gating networks. Specifically, we select two QA models: Match-LSTM with Answer Pointer Wang and Jiang (2017) and Gated-Attention Reader Dhingra et al. (2017). All the implemented code will be released upon acceptance.

4.1 Match-LSTM with Answer Pointer

Model Description

To extend Match-LSTM with Answer Pointer (Match-LSTM), we first generate gated word representations (Eq. 2) and encode question word representations using our approach. Next, a bi-directional match-LSTM Wang and Jiang (2016) layer computes attention weights over all the encoded question word vectors for each encoded passage word, which produces hidden word representations that reflect each word's relevance to the question. From these vectors, an answer pointer layer selects the final answer span.

In contrast, the original model uses a BiGRU to encode the passage and the question words, and the outputs are then fed directly to the match-LSTM layer. In Section 5, we compare our extended model with the original model.


We evaluate the extended and the baseline models on the Stanford Question Answering Dataset (SQuAD) Rajpurkar et al. (2016). SQuAD contains more than 100,000 questions on 536 Wikipedia articles. The answer to each question is represented as a span in the passage.

Peripheral Details

We tokenize all the passages and questions using the spaCy library. We use pre-trained GloVe word embeddings Pennington et al. (2014) for both passages and questions, and we do not fine-tune the embedding vectors during training. Out-of-vocabulary words are assigned zero vectors, and we do not use character-level embeddings.

We set the batch size to 40 and the hidden vector dimension to 150 for all layers. We use a single BiGRU layer to encode the word tokens and apply dropout Srivastava et al. (2014) with a rate of 0.2 between adjacent layers. We use the AdaMax optimizer Kingma and Ba (2014). The initial learning rate is set to 0.002 and is multiplied by 0.9 every epoch.
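As a small worked example of the learning rate schedule stated above, the following sketch computes the rate used in each epoch. It assumes that epochs are counted from 0 and that the decay is applied once at the end of every epoch; the function name is hypothetical.

```python
def learning_rate(epoch, initial=0.002, decay=0.9):
    """Learning rate used during the given epoch (epochs counted from 0)."""
    # ASSUMPTION: the first epoch uses the initial rate without decay.
    return initial * decay ** epoch


# Rates for the first four epochs: roughly 0.002, 0.0018, 0.00162, 0.001458.
schedule = [learning_rate(epoch) for epoch in range(4)]
```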

4.2 Gated-Attention Reader

Model Description

Gated-Attention Reader (GA Reader) is based on a multi-hop architecture and an attention mechanism. We apply our question-aware sentence gating networks across the hops in this model. Specifically, in each hop, our algorithm first generates the gated word vectors and the encoded question word vectors. They are fed to the Gated-Attention module, where each word vector in the passage incorporates the matching information from all the question words. In the last hop, an attention-sum module Kadlec et al. (2016) selects the answer based on a probability distribution over the given candidates.
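The attention-sum selection mentioned above can be sketched in simplified form: the score of a candidate is the sum of the attention weights at the passage positions where it occurs, renormalized over the candidates. The sketch assumes single-token candidates and precomputed attention weights; the function name and the toy example are hypothetical.

```python
def attention_sum(passage_tokens, attention, candidates):
    """Score each candidate by summing attention over its occurrences."""
    if len(passage_tokens) != len(attention):
        raise ValueError("need one attention weight per passage token")
    scores = {candidate: 0.0 for candidate in candidates}
    for token, weight in zip(passage_tokens, attention):
        if token in scores:
            scores[token] += weight
    total = sum(scores.values())
    if total > 0:
        # Renormalize so the scores form a distribution over candidates.
        scores = {c: s / total for c, s in scores.items()}
    return scores


tokens = ["Obama", "met", "Merkel", "and", "Obama", "left"]
weights = [0.30, 0.05, 0.25, 0.05, 0.30, 0.05]
scores = attention_sum(tokens, weights, ["Obama", "Merkel", "Putin"])
answer = max(scores, key=scores.get)
```

Summing over occurrences lets a candidate that is mentioned several times accumulate evidence from each mention.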

Similar to Match-LSTM, the original model uses a BiGRU to encode the words in the passage and those in the question, and the gated-attention module then takes these outputs as inputs. This model is also compared with its extension using our proposed module in Section 5.

Model                    EM       F1
Wang and Jiang (2017)
(Our implementation)
+ Sentence gate          0.6814   0.7736
Table 2: Performance comparisons of the question-aware sentence gate over the baseline Match-LSTM on the SQuAD dataset.

Model                             WDW dev   WDW test   WDW-R dev   WDW-R test
GA Reader Dhingra et al. (2017)   0.716     0.712      0.726       0.726
GA Reader (Our implementation)    0.717     0.710      0.722       0.712
GA Reader + Sentence gate         0.722     0.712      0.733       0.727
Table 3: Performance comparisons of our question-aware sentence gate over the baseline GA Reader on the WDW and WDW-R datasets.

We conduct the experimental comparison between the extended GA Reader and the original GA model on a cloze-style dataset called Who Did What (WDW) Onishi et al. (2016). It consists of questions with an omitted entity, corresponding to a person name, and the machine is asked to choose the right answer from a set of choices. WDW has two variants, a Strict (WDW) and a Relaxed (WDW-R) version, and we report experimental results on these two datasets.

Match-LSTM EM F1
Max pooling
Inner attention
Average pooling 0.6814 0.7736
Table 4: Performance comparisons on different sentence encoding methods.
Match-LSTM EM F1
Scalar gate
Vector gate
Concatenation + Matching
Scalar gate + Matching
Vector gate + Matching 0.6814 0.7736
Table 5: Performance comparisons over sentence matching and gating methods. Matching refers to the sentence matching module illustrated in section 3.2.
Figure 4: Performance comparisons depending on various sentence lengths in SQuAD for (a) and (b) and WDW datasets for (c).

Peripheral Details

For word embedding, we use the 100-dimensional pre-trained GloVe embedding vectors Pennington et al. (2014), which are fixed during training. Out-of-vocabulary tokens at test time are assigned distinct random vectors. We train the model with a batch size of 50 and a hidden dimension of 128. We use a single BiGRU layer and adopt a dropout rate of 0.3. The number of hops is set to 3. As in the original model, we use trainable character embedding vectors of 50 dimensions. The optimizer and learning rate are the same as in Match-LSTM.

5 Experiment Results

We analyze our algorithm under two main categories, quantitative and qualitative evaluations. We denote our question-aware sentence gating networks as Sentence gate for all the tables and figures presented in this section.

5.1 Quantitative Evaluations

In this section, we compare the baseline methods with the extended models proposed in Section 4 and also analyze our model structure. Lastly, we present performance comparisons across various sentence lengths of the passage.

5.1.1 Model Performance Comparison


The performance results on Match-LSTM are reported in Table 2. We use two metrics to evaluate the model performance: Exact Match (EM) and F1 score. EM measures the percentage of questions for which the model prediction exactly matches the ground truth, and F1 measures the harmonic mean of the precision and recall between a model prediction and the ground truth answer span at character level. Our model outperforms the baseline model by 1.53% in EM.
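The two metrics can be sketched for a single prediction as follows. The F1 function below accepts any sequence of units, so the caller decides whether answers are compared as characters (by passing strings) or as word tokens (by passing token lists); no text normalization such as lowercasing is applied. The function names and the example answers are hypothetical.

```python
from collections import Counter


def exact_match(prediction, ground_truth):
    """Return 1.0 if the prediction equals the ground truth, else 0.0."""
    return float(prediction == ground_truth)


def f1_score(pred_units, gold_units):
    """Harmonic mean of precision and recall over overlapping units."""
    # Multiset intersection counts each shared unit at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(pred_units) & Counter(gold_units)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_units)
    recall = overlap / len(gold_units)
    return 2 * precision * recall / (precision + recall)


em = exact_match("the second role", "second role")
f1 = f1_score("the second role".split(), "second role".split())
```

In this example the prediction contains one extra token, so EM is 0 while F1 still credits the overlapping tokens (precision 2/3, recall 1). Dataset-level scores are obtained by averaging the per-question values.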


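Table 3 shows that the question-aware sentence gate boosts performance on cloze-style datasets. Our model improves accuracy over our implementation of the baseline on all four benchmarks, and it matches or exceeds the reported baseline scores, tying only on the WDW test set (0.712).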

Figure 5: Word-level visualization of sentence gate values in sample passages.
Figure 6: Distributions of sentence gate values with respect to different hops

5.1.2 Model Structure Analysis

All the experiments in this section are performed on the SQuAD dataset with Match-LSTM as the baseline.

Sentence Encoding Module

We first compare our average pooling method against three other sentence encoding methods: 1) BiGRU-last: concatenation of the last hidden states of a BiGRU, 2) Max pooling: taking the maximum value over each dimension of the hidden states, 3) Inner attention: applying an attention mechanism over the hidden states Liu et al. (2016); Conneau et al. (2017). Table 4 shows that our approach, i.e., average pooling, outperforms the other methods.

Sentence Matching and Gating Modules

We conduct experiments on the effect of the sentence matching and sentence gating modules described in Sections 3.2 and 3.3. We compare methods for combining word and sentence representations, with or without the sentence matching module (referred to as Matching in Table 5). Three combining methods are compared: 1) Concatenation: concatenating the sentence and word representations, 2) Scalar gate: applying the same gating weight across all dimensions, 3) Vector gate: our proposed method. For the variants without the sentence matching module, the output of the average pooling is used as the sentence vector.
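The difference between the three combining methods can be illustrated with a small sketch. It assumes that the gate values have already been computed and that the word vector and the sentence-derived vector have the same dimension for the two gated variants; all function names and values are hypothetical.

```python
def concatenate(word_vec, sentence_vec):
    """Concatenation: keep both vectors side by side."""
    return list(word_vec) + list(sentence_vec)


def scalar_gate(word_vec, sentence_vec, g):
    """Scalar gate: one gating weight g in [0, 1] shared by all dimensions."""
    return [(1.0 - g) * w + g * s for w, s in zip(word_vec, sentence_vec)]


def vector_gate(word_vec, sentence_vec, gate):
    """Vector gate: a separate gating weight for each dimension."""
    return [(1.0 - g) * w + g * s
            for g, w, s in zip(gate, word_vec, sentence_vec)]


word = [1.0, 1.0]
sentence_info = [0.0, 3.0]
combined_concat = concatenate(word, sentence_info)
combined_scalar = scalar_gate(word, sentence_info, 0.5)
combined_vector = vector_gate(word, sentence_info, [0.0, 1.0])
```

In the toy example, the vector gate keeps the first dimension of the word vector untouched and takes the second dimension entirely from the sentence information, a selection that a single scalar weight cannot express.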

Table 5 shows that, for combining word and sentence representations, both the scalar gate and the vector gate outperform the concatenation method by a large margin. Of the two gating mechanisms, the vector gate combined with the sentence matching module achieves the highest EM and F1 scores. We can also see that the sentence matching module improves the EM and F1 scores on the SQuAD dataset, except when it is used in conjunction with concatenation. This indicates that, given a proper way of combining word and sentence representations, the matching information between a passage sentence and the question leads to better model performance.

5.1.3 Performance Comparison on Various Sentence Lengths

We observe that our model shows larger performance gains over the baselines on longer sentences in both datasets. To verify this, we split the passages into five groups based on the average sentence length of each passage and then measure the performance on each group. In Figure 4, (a) and (b) show that our model outperforms the baseline as the average sentence length increases in the SQuAD dataset. Likewise, our model records a higher accuracy on WDW than the baseline, as shown in (c). This suggests that our model handles passages consisting of long sentences better than word match-based QA models.
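The grouping procedure can be sketched as below. The text does not state how the five groups are formed, so this sketch assumes equally sized groups obtained by sorting passages by their average sentence length; the function name and the toy data are hypothetical.

```python
def split_by_avg_sentence_length(passages, num_groups=5):
    """Group passage indices by average sentence length.

    passages: list of passages, each a non-empty list of sentences,
        where a sentence is a list of tokens.
    Returns num_groups lists of passage indices, from shortest to longest.
    """
    def avg_len(passage):
        return sum(len(sentence) for sentence in passage) / len(passage)

    # ASSUMPTION: groups are (near-)equal in size after sorting by length.
    order = sorted(range(len(passages)), key=lambda i: avg_len(passages[i]))
    groups = [[] for _ in range(num_groups)]
    for rank, index in enumerate(order):
        groups[rank * num_groups // len(order)].append(index)
    return groups


# Ten toy passages, each with a single sentence of the given length.
lengths = [5, 1, 9, 3, 7, 2, 10, 4, 8, 6]
passages = [[["w"] * n] for n in lengths]
groups = split_by_avg_sentence_length(passages)
```

The evaluation metric is then computed separately over the questions whose passages fall into each group.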

SQuAD (Match-LSTM)
  highest: Tesla, 2011, Luther, 2016, BSkyB, Kublai, 2015, Goldenson, Chloroplasts, 2012, Genghis, Newton, TFEU, San, ctenophores, Esiason
  lowest: There, and, modern, were, being, was, very, against, as, largely, highly, considered, traditional, political, extremely, particularly, rich, large

WDW-R (GA Reader)
  highest: Barack, Robert, Nicolas, George, Hugo, Timothy, James, Eric, Alex, Janet, Chudinov, Jacqueline, Shaoxuan
  lowest: the, a, of, best, top, final, straight, economic, to, been, last, largest, lost, guaranteed, negotiating, first, more, be, all

Table 6: Word tokens with the highest and the lowest gate values.

5.2 Qualitative Evaluations

We also perform in-depth qualitative analyses of the sentence gate values to examine the behavior of our proposed module.

5.2.1 Gating Value by Word Tokens

Each dimension of the sentence gate value determines the amount of sentence-level information added to the word-level representations. A higher sentence gate value thus indicates that the corresponding word requires more sentence-level question-passage matching information.

To analyze such behaviors with respect to different words, we sample random passages from SQuAD and WDW-R and plot the sentence gate value of each word token averaged across dimensions. We highlight the words with a relatively high gate value (more visualization samples are reported in the supplementary material). Figures 5(a) and (b) show the results for SQuAD and WDW-R, respectively, and each line in (b) corresponds to one of the three hops in the GA Reader.

Note that word tokens with relatively high gate values are related to the question. For instance, in Figure 5(a), words related to the question such as 1817, April, WSE or 2009 are given high gate values. Likewise, in Figure 5(b), because the question asks for a person's name, the names of persons are assigned high gate values across different hops. On the other hand, we observe that some stop words, e.g., was, and, and by, and adverbs, e.g., previously, generally take low gate values on both datasets. To further support this observation, we list the word tokens with the highest and the lowest gate values in Table 6. Note that since the WDW-R dataset always requires the model to find a correct person name, high gate values are assigned to person names. These results support our initial hypothesis that the gate values effectively expose the words to the relevant sentence-level matching information, which is needed to answer the given questions.

5.2.2 Gate Value by Hops

Once each word token learns sentence-level semantics, it is natural for the model to narrow down the focus on word-level in order to find the exact answer tokens.

To visualize this intuition, we draw a boxplot of the gate value distributions over the three hops obtained by GA Reader on the WDW-R dataset. To obtain a single representative gate value for a passage, we average the values over dimensions and words. Figure 6 shows that the average gate value tends to be smaller in later hops. This observation is consistent with our assumption that the model focuses more on word-level information as it approaches the answer prediction stage.

6 Related Works

In recent years, various tasks on machine comprehension have been attracting considerable attention from the research community. Among the studied tasks are Automatic Summarization Cheng and Lapata (2016); Nallapati et al. (2017); Cao et al. (2016), Machine Translation Bahdanau et al. (2014); Kalchbrenner et al. (2016), Named Entity Recognition Lample et al. (2016); Agerri and Rigau (2016), and Question Answering Wang et al. (2017); Xiong et al. (2017).

Several studies have proposed end-to-end neural network models for QA. Most of the state-of-the-art models in QA utilize the attention mechanism Bahdanau et al. (2014) on top of an RNN for matching words from a passage to the question or for self-matching to better encode the passage Wang et al. (2017). For instance, Dynamic Coattention Network Xiong et al. (2017) employs a co-attentive encoder that iteratively explores the relationship between the question and the given context. Bi-Directional Attention Flow networks Seo et al. (2017) also utilize the attention mechanism in two different directions to capture the intricate dynamics of questions and contexts.

While most state-of-the-art models in QA are based on word-level attention mechanisms, sentence-level matching methods are popular in Natural Language Inference (NLI) tasks. For capturing sentence-level semantics, TBCNN Mou et al. (2015) introduces a sentence matching module architecture that is widely used in other studies of the entailment task Liu et al. (2016); Choi et al. (2017); Chen et al. (2017). In these studies, the sentence matching module is shown to be effective in extracting relations between sentences. However, this approach is not widely adopted in the QA area, and to the best of our knowledge only a few works have adopted this method Samothrakis et al. (2017).

Across the models, Long Short-term Memory Hochreiter and Schmidhuber (1997) networks or Gated Recurrent Units Cho et al. (2014) are widely used as building blocks for efficient text encoding. The gating mechanism present in both algorithms allows neural network models to calculate and select the right amount of necessary information out of two or more distinct modules. Another application is gating between character-level and word-level embedding vectors for improved representation of hidden states Yang et al. (2016); Miyamoto and Cho (2016). It has been demonstrated that, with a fine-grained gating mechanism, the model benefits from the strengths of both character-level and word-level representations.

7 Conclusion

In this paper, we propose novel question-aware sentence gating networks. The sentence gating, as our experiments revealed, dynamically allows words to take in different amounts of contextual information, leading to more semantically meaningful word representations. Our experiments on SQuAD and on the cloze-style WDW datasets demonstrate that the proposed sentence gate improves performance over the baseline models and that its gate values behave in a meaningful way. Future work involves extending our algorithm to other tasks in machine comprehension.