Machine comprehension question answering (QA) has gained significant popularity in the research community. Given a passage and a question, the machine finds the correct answer by understanding the passage using contextual information from various levels of semantics. In the example shown in Figure 1, the question ‘What other role do many pharmacists play?’ requires the model to properly capture the meaning of the first sentence (the first role of the pharmacists), so as to understand that the role presented in the second sentence corresponds to the ‘other role’ that the question asks for.
Most existing models try to solve this problem by extracting the matching information between the word-level encoding vectors of the passage and those of the question. For example, Xiong et al. (2017) introduce a co-attention encoder that captures the word-level question-passage matching information, and Wang et al. (2017) additionally incorporate self-matching information among words within the passage.
However, most of the existing approaches still focus on improving word-level information and do not directly utilize the higher-level semantics contained in, say, sentences and paragraphs. Motivated by this, we propose novel question-aware sentence gating networks that directly impose sentence-level information on individual word encoding vectors.
Our proposed approach works as follows: (1) Sentence encoding: we first generate initial encoding vectors of the individual sentences in the passage and of the question. (2) Sentence matching: we then transform them into question-aware vectors by considering their degree of matching with the question. (3) Sentence gating: for each word, our gating module determines how much sentence information to add to it, and the new word-level encoding vector is computed accordingly as a convex combination of its original word vector and its corresponding sentence vector.
In summary, the contributions of this paper are as follows:
- We propose novel question-aware sentence gating networks that impose sentence-level semantics on individual word encoding vectors in QA tasks.
- We extend several existing state-of-the-art models by integrating our sentence gating networks.
- We show the effectiveness of our proposed approach through quantitative comparisons using well-known QA datasets and models, as well as a qualitative analysis of our model.
2 Problem Setting
Taking a passage and a question as input, a QA model jointly understands the two and finds the corresponding answer in the passage. The passage is represented as a sequence of sentences, $P_s = \{s_i\}_{i=1}^{M}$, where $M$ is the number of sentences in the passage. The $i$-th sentence is further decomposed into a sequence of words, $s_i = \{w_j^i\}_{j=1}^{N_i}$, where $N_i$ denotes the number of words in sentence $s_i$. Alternatively, the passage is represented simply as a single sequence of words across multiple sentences, $P_w = \{w_t^p\}_{t=1}^{L_P}$, where $L_P = \sum_{i=1}^{M} N_i$. The (single-sentence) question is similarly denoted as $Q = \{w_t^q\}_{t=1}^{L_Q}$, where $L_Q$ denotes the number of words in $Q$. We assume that the answer always has the form of either a single entity or a phrase existing in the passage.
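The two views of the passage described above (a list of sentences versus one flat word sequence) can be related by a simple index map. The following is a minimal sketch; the function name and the toy passage are illustrative assumptions, and indices are 0-based here for convenience.

```python
def flatten_passage(sentences):
    """Flatten tokenized sentences into one word sequence.

    Returns the flat word list and, for each flat position t, the pair
    (i, j) meaning "j-th word of the i-th sentence" (0-based).
    """
    words, index = [], []
    for i, sentence in enumerate(sentences):
        for j, word in enumerate(sentence):
            words.append(word)
            index.append((i, j))
    return words, index


# Toy passage (hypothetical example, not taken from any dataset).
passage = [
    ["Pharmacists", "dispense", "drugs", "."],
    ["Many", "also", "advise", "patients", "."],
]
words, index = flatten_passage(passage)
```

The length of `words` equals the sum of the sentence lengths, and `index[t]` recovers the sentence that word `t` belongs to, which is the lookup needed whenever a word has to be paired with its own sentence.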
3 Question-Aware Sentence Gate
In this section, we describe our proposed question-aware sentence gating networks that directly incorporate the sentence-level information into the encoding module of individual words that belong to the corresponding sentence. The overview of our proposed method is summarized in Figure 2.
3.1 Sentence Encoding Module
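Given the word sequence of a passage, $P_w$, and that of a question, $Q$, the first layer of our model is an embedding layer Mikolov et al. (2013); Pennington et al. (2014), which represents individual words as pre-trained word embedding vectors $e_t^p$ and $e_t^q$. Our model then applies bi-directional Gated Recurrent Units (BiGRU) to each set of embedding vectors in the passage and the question, respectively, as
$$h_t^p = \mathrm{BiGRU}_P\left(h_{t-1}^p, e_t^p\right), \qquad h_t^q = \mathrm{BiGRU}_Q\left(h_{t-1}^q, e_t^q\right),$$
where $h_t^p$ and $h_t^q$ are the word-level encoding vectors of the $t$-th word in the passage and the question, respectively, and the size of each vector is determined by the hidden dimension $d$.
Afterwards, an average pooling layer is applied to the set of these word-level encoding vectors per sentence in the passage, i.e.,
$$\bar{h}_i^p = \frac{1}{N_i} \sum_{j=1}^{N_i} h_j^i,$$
where $N_i$ is the number of words in the $i$-th sentence and $h_j^i$ represents the encoding vector of the $j$-th word of the $i$-th sentence. The question sentence representation $\bar{h}^q$ is obtained in the same way by averaging the question word vectors $h_t^q$.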
As will be seen in our experimental results (Table 4), we choose the average pooling approach for sentence encoding since it outperforms other approaches while using a relatively small number of parameters.
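As a rough illustration of the pooling step, the sketch below averages word-level vectors within each sentence using plain Python lists. The function name and the toy vectors are assumptions for illustration; in the model the inputs would be BiGRU outputs, and every sentence is assumed to contain at least one word.

```python
def average_pool(word_vectors, sentence_lengths):
    """Average word-level vectors within each sentence.

    word_vectors: list of L vectors (lists of floats), in passage order.
    sentence_lengths: number of words per sentence; must sum to L.
    """
    if sum(sentence_lengths) != len(word_vectors):
        raise ValueError("sentence lengths must cover every word exactly once")
    pooled, start = [], 0
    for n in sentence_lengths:
        chunk = word_vectors[start:start + n]
        # Column-wise mean over the words of this sentence.
        pooled.append([sum(col) / n for col in zip(*chunk)])
        start += n
    return pooled


h = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]    # three word vectors, dimension 2
sentence_vectors = average_pool(h, [2, 1])  # words 1-2 form the first sentence
```

Average pooling adds no trainable parameters, which is in line with the remark above about its small parameter count.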
3.2 Sentence Matching Module
The generated passage and question sentence representations are then fed into our sentence-matching module to incorporate the question information into the sentence vectors in the passage.
Inspired by generic natural language inference approaches Liu et al. (2016); Conneau et al. (2017), we compute the degree of matching between each passage sentence representation $\bar{h}_i^p$ for $i = 1, \dots, M$ (with $M$ the number of passage sentences) and the question sentence representation $\bar{h}^q$ in a diverse manner. Specifically, we adopt three different matching schemes Mou et al. (2015): (1) the concatenation of the two representation vectors $[\bar{h}_i^p; \bar{h}^q]$, (2) the element-wise product $\bar{h}_i^p \odot \bar{h}^q$, where $\odot$ represents the Hadamard product, and (3) the absolute element-wise difference $|\bar{h}_i^p - \bar{h}^q|$. The results of the three matching methods are then concatenated as
$$\tilde{s}_i = \left[\bar{h}_i^p;\ \bar{h}^q;\ \bar{h}_i^p \odot \bar{h}^q;\ \left|\bar{h}_i^p - \bar{h}^q\right|\right], \tag{1}$$
which works as the question-aware sentence representation $\tilde{s}_i$. It captures the sentence-level relation between the passage and the question and is then fed to the sentence gating module so that the sentence-level matching information flows into the word-level representations.
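The three matching schemes and their concatenation can be sketched in a few lines. The function name is an illustrative assumption; the output simply stacks the two input vectors, their element-wise product, and their absolute element-wise difference, so its length is four times the input dimension.

```python
def match(sentence_vec, question_vec):
    """Build [s; q; s * q; |s - q|] for one passage sentence."""
    if len(sentence_vec) != len(question_vec):
        raise ValueError("both vectors must have the same dimension")
    product = [s * q for s, q in zip(sentence_vec, question_vec)]
    abs_diff = [abs(s - q) for s, q in zip(sentence_vec, question_vec)]
    return list(sentence_vec) + list(question_vec) + product + abs_diff


matched = match([1.0, -2.0], [3.0, 0.5])
# matched == [1.0, -2.0, 3.0, 0.5, 3.0, -1.0, 2.0, 2.5]
```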
3.3 Sentence Gating Module
We propose the sentence gating networks that determine how much the sentence-level matching information should be incorporated into each word representation towards more semantically meaningful word representations for QA tasks.
To this end, we first generate another set of word-level representations by applying an additional BiGRU module to the entire sequence of passage word embedding vectors $e_t^p$, i.e.,
$$v_t = \mathrm{BiGRU}\left(v_{t-1}, e_t^p\right).$$
Note that the re-indexed word-level representation $v_j^i$ corresponds to $v_t$, where the $t$-th word in the passage is the $j$-th word in the $i$-th sentence.
Then, our sentence gating module takes as input the word representation $v_j^i$ and its corresponding question-aware sentence representation $\tilde{s}_i$ (defined in Eq. 1) and generates a new word representation $f_j^i$ (fused with the sentence information) and its dimension-wise gating vector $g_j^i$. Both are computed with trainable weight matrices and bias vectors, and the gating vector is passed through an element-wise sigmoid activation function $\sigma$, so every element of $g_j^i$ lies in $(0, 1)$.
As our final output, the gated word representation $\tilde{v}_j^i$ is computed as a convex combination of $v_j^i$ and $f_j^i$, i.e.,
$$\tilde{v}_j^i = \left(1 - g_j^i\right) \odot v_j^i + g_j^i \odot f_j^i, \tag{2}$$
where $\odot$ represents the Hadamard product. Each element of $g_j^i$ determines the amount of information in the corresponding dimension of $f_j^i$ to be added to the previous-layer word representation $v_j^i$.
The proposed algorithm is widely applicable to the passage and question encoding processes of various existing QA models: the output of the word-level encoding module is simply replaced with our sentence-gated word representations, as discussed in Section 4.
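A minimal sketch of the gating step for a single word is given below. Only the sigmoid gate and the convex combination are taken from the description above; feeding both branches the concatenation of the word and sentence vectors, and using tanh for the fused branch, are assumptions made for illustration, as are all function and parameter names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, x):
    """Compute W x + b for a row-major weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sentence_gate(word_vec, sent_vec, W_f, b_f, W_g, b_g):
    """Return the gated word vector and the gate itself.

    Assumption: both branches read [word_vec; sent_vec], and the fused
    branch uses tanh. The sigmoid gate and the convex combination follow
    the textual description.
    """
    x = list(word_vec) + list(sent_vec)
    fused = [math.tanh(a) for a in affine(W_f, b_f, x)]
    gate = [sigmoid(a) for a in affine(W_g, b_g, x)]
    gated = [(1.0 - g) * v + g * f
             for g, v, f in zip(gate, word_vec, fused)]
    return gated, gate


# With all-zero parameters the gate is 0.5 everywhere and the fused vector
# is zero, so the output is half of the original word vector.
zeros = [[0.0] * 5 for _ in range(2)]
gated, gate = sentence_gate([2.0, -4.0], [1.0, 1.0, 1.0],
                            zeros, [0.0, 0.0], zeros, [0.0, 0.0])
```

Because the gate has one entry per dimension, the output keeps the dimension of the word vector, which is what allows the gated vectors to stand in for the encoder output of an existing model.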
|Data||# Train||# Validation||# Test|
4 QA Model with Sentence Gate
This section discusses how we extend existing QA models using our proposed sentence gating networks. Specifically, we select two QA models: Match-LSTM with Answer Pointer WANG and JIANG (2017) and Gated-Attention Reader Dhingra et al. (2017). All the implemented code will be released upon acceptance.
4.1 Match-LSTM with Answer Pointer
To extend Match-LSTM with Answer Pointer (Match-LSTM), we first generate gated word representations (Eq. 2) and encode question word representations using our approach. Next, a bi-directional match-LSTM Wang and Jiang (2016) layer computes attention weights over all the encoded question word vectors for each encoded passage word, which produces hidden word representations that reflect their relevance to the question. From these vectors, an answer pointer layer selects the final answer span.
In contrast, the original model uses a BiGRU to encode the passage and the question words, and its outputs are fed directly to the match-LSTM layer. We compare our extended model with the original model in Section 5.
We evaluate the extended and the baseline models on the Stanford Question Answering Dataset (SQuAD) Rajpurkar et al. (2016). SQuAD contains more than 100,000 questions on 536 Wikipedia articles. The answer to each question is represented as a span in the passage.
We tokenize all the passages and questions using the spaCy library (https://spacy.io/). We use 300-dimensional pre-trained GloVe word embeddings Pennington et al. (2014) (‘glove.840B.300d.zip’, downloaded from https://nlp.stanford.edu/projects/glove/) for both passages and questions, and we do not fine-tune the embedding vectors during training. Out-of-vocabulary words are assigned zero vectors, and we do not use character-level embeddings.
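The out-of-vocabulary handling described above amounts to a lookup with a zero-vector fallback. The sketch below uses a tiny hypothetical table in place of the real GloVe file; the function name and the three-dimensional vectors are assumptions for illustration only.

```python
def embed(tokens, table, dim):
    """Look up fixed pre-trained vectors; unknown tokens get a zero vector."""
    zero = [0.0] * dim
    return [list(table.get(tok, zero)) for tok in tokens]


# Tiny stand-in table; the real GloVe vectors would be 300-dimensional.
toy_table = {"pharmacists": [0.1, 0.2, 0.3], "advise": [0.4, 0.5, 0.6]}
vectors = embed(["pharmacists", "xyzzy"], toy_table, dim=3)
```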
We set the batch size to 40 and the hidden vector dimension to 150 for all layers. We use a single BiGRU layer for encoding word tokens and apply dropout Srivastava et al. (2014) with a rate of 0.2 between adjacent layers. We use ADAMAX Kingma and Ba (2014) with the coefficients of and . The initial learning rate is set to 0.002 and multiplied by 0.9 every epoch.
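The learning-rate schedule is a plain exponential decay. As a small worked sketch, assuming the decay is applied once after each completed epoch (so the first epoch runs at the initial rate):

```python
def learning_rate(epoch, initial=0.002, decay=0.9):
    """Learning rate used during the given 0-based epoch."""
    return initial * decay ** epoch


schedule = [learning_rate(e) for e in range(3)]  # roughly 0.002, 0.0018, 0.00162
```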
4.2 Gated-Attention Reader
Gated-Attention Reader (GA Reader) is based on a multi-hop architecture and an attention mechanism. We impose our question-aware sentence gating networks across the hops in this model. Specifically, in the $k$-th hop out of $K$ hops, our algorithm first generates gated word vectors and encoded question word vectors. They are fed to the Gated-Attention module, where each word vector in the passage incorporates the matching information from all the question words. In the last hop, an attention-sum module Kadlec et al. (2016) selects the answer based on a probability distribution over the given candidates.
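The attention-sum step can be sketched as follows: attention over passage positions is normalized with a softmax, and each candidate is scored by summing the attention mass of all positions where it occurs. This is a simplified illustration of the idea in Kadlec et al. (2016); the function name, the toy scores, and the single-token candidates are assumptions.

```python
import math


def attention_sum(scores, passage_tokens, candidates):
    """Pick a candidate by summing softmax attention over its occurrences.

    scores: one unnormalized relevance score per passage token.
    """
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # numerically stable softmax
    total = sum(exps)
    probs = {c: 0.0 for c in candidates}
    for tok, e in zip(passage_tokens, exps):
        if tok in probs:
            probs[tok] += e / total
    best = max(probs, key=probs.get)
    return best, probs


tokens = ["Alex", "met", "Janet", "and", "Alex"]
best, probs = attention_sum([1.0, 0.0, 1.5, 0.0, 1.0], tokens, ["Alex", "Janet"])
```

In this toy case the single highest score belongs to "Janet", but "Alex" wins because its two mentions are aggregated, which is the behavior that distinguishes attention-sum from picking the best single position.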
Similar to Match-LSTM, this model originally uses BiGRU to encode words in the passage and those in the question, and then the gated-attention module takes these outputs as inputs. This model will also be compared with its extension with our proposed module in Section 5.
| Model | WDW dev | WDW test | WDW-R dev | WDW-R test |
|---|---|---|---|---|
| GA Reader Dhingra et al. (2017) | 0.716 | 0.712 | 0.726 | 0.726 |
| GA Reader (Our implementation) | 0.717 | 0.710 | 0.722 | 0.712 |
We conduct the experimental comparison between the extended GA Reader and the original GA model on a cloze-style dataset called Who Did What (WDW) Onishi et al. (2016). It consists of questions with an omitted entity, corresponding to a person name, and the machine is asked to choose the right answer from a set of choices. WDW has two variants, a Strict (WDW) and a Relaxed (WDW-R) version, and we report experimental results on these two datasets.
| Method | EM | F1 |
|---|---|---|
| Concatenation + Matching | | |
| Scalar gate + Matching | | |
| Vector gate + Matching | 0.6814 | 0.7736 |
For word embedding, we use the 100-dimensional pre-trained GloVe embedding vectors Pennington et al. (2014), which are fixed during training. Out-of-vocabulary tokens at test time are assigned distinct random vectors. We train the model with a batch size of 50 and a hidden dimension of 128. We use a single BiGRU layer and adopt a dropout rate of 0.3. The number of hops is set to 3. As in the original model, we also use trainable character embedding vectors of 50 dimensions. The optimizer and learning rate are the same as for Match-LSTM.
5 Experiment Results
We analyze our algorithm under two main categories, quantitative and qualitative evaluations. We denote our question-aware sentence gating networks as Sentence gate for all the tables and figures presented in this section.
5.1 Quantitative Evaluations
In this section, we compare the baseline methods and their extended models we proposed in Section 4 and also analyze our model structure. Lastly, performance comparisons on various sentence lengths of the passage are presented.
5.1.1 Model Performance Comparison
The performance results on Match-LSTM are reported in Table 2. We use two metrics to evaluate the model performance: Exact Match (EM) and F1 score. EM measures the percentage of questions for which the model prediction exactly matches the ground truth, and F1 measures the harmonic mean of the precision and recall between a model prediction and the ground-truth answer span at the character level. Our model outperforms the baseline model by 1.53% in EM.
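For concreteness, the two metrics can be sketched for a single prediction as below. This is a simplified illustration rather than the official evaluation script: answer normalization is omitted, and the `units` parameter (an assumption of this sketch) chooses whether overlap is counted over characters or over whitespace tokens.

```python
from collections import Counter


def exact_match(prediction, truth):
    """1.0 if the predicted string equals the ground truth, else 0.0."""
    return float(prediction == truth)


def f1_score(prediction, truth, units=list):
    """Harmonic mean of precision and recall over shared units.

    `units` turns a string into the items being compared: `list` gives
    characters, `str.split` gives whitespace tokens.
    """
    pred, gold = units(prediction), units(truth)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


token_f1 = f1_score("the red car", "red car", units=str.split)  # P = 2/3, R = 1
char_f1 = f1_score("ab", "abc")                                 # P = 1, R = 2/3
```

Dataset-level scores are then the averages of these per-question values.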
Table 3 shows that the question-aware sentence gate boosts performance on cloze-style datasets. Our model improves the accuracies on all the benchmarks over the reported baseline scores.
5.1.2 Model Structure Analysis
All the experiments in this section are performed on the SQuAD dataset with Match-LSTM as the baseline.
Sentence Encoding Module
We first compare our average pooling method against three other sentence encoding methods: 1) BiGRU-last: concatenation of the last hidden states of a BiGRU, 2) Max pooling: taking the maximum value over each dimension of the hidden states, 3) Inner attention: applying an attention mechanism over the hidden states Liu et al. (2016); Conneau et al. (2017). Table 4 shows that our approach, i.e., average pooling, outperforms the other methods.
Sentence Matching and Gating Modules
We conduct experiments on the effect of the sentence matching and sentence gating modules described in Sections 3.2 and 3.3. We compare the methods for combining word and sentence representations with or without the sentence matching module (referred to as Matching in Table 5). As for combining word and sentence vectors, three methods are compared: 1) Concatenation: concatenating sentence and word representations, 2) Scalar gate: applying the same gating weight across dimensions, 3) Vector gate: our proposed method. For the methods without the sentence matching module, the output of the average pooling is used as the sentence vector.
Table 5 shows that, when it comes to combining word and sentence representations, both the scalar gate and the vector gate outperform the concatenation method by a large margin. Of the two gating mechanisms, the vector gate achieves the highest EM and F1 scores when used together with the sentence matching module. We can also see that the sentence matching module improves the EM and F1 scores on the SQuAD dataset, except when it is used in conjunction with concatenation. This indicates that, given a proper way of combining word and sentence representations, the matching information between passage sentences and the question leads to better model performance.
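The difference between the three combining methods is mostly one of shape and granularity, as the following sketch illustrates. It assumes that the scalar gate uses the same convex combination as the vector gate but with a single shared weight; function names and toy values are illustrative.

```python
def concat_combine(word, sent):
    """Concatenation: output dimension grows to len(word) + len(sent)."""
    return list(word) + list(sent)


def scalar_gate_combine(word, fused, g):
    """Scalar gate: one weight g in [0, 1] shared by every dimension."""
    return [(1.0 - g) * w + g * f for w, f in zip(word, fused)]


def vector_gate_combine(word, fused, gate):
    """Vector gate: a separate weight per dimension."""
    return [(1.0 - g) * w + g * f for g, w, f in zip(gate, word, fused)]


word, fused = [1.0, 1.0], [0.0, 3.0]
by_concat = concat_combine(word, fused)                   # length 4
by_scalar = scalar_gate_combine(word, fused, 0.5)         # [0.5, 2.0]
by_vector = vector_gate_combine(word, fused, [0.0, 1.0])  # [1.0, 3.0]
```

The vector gate can keep one dimension untouched while fully replacing another, which a single scalar weight cannot express.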
5.1.3 Performance Comparison on Various Sentence Lengths
We observe that our model shows larger performance gains over the baselines for longer sentences on both datasets. To verify this, we first split the passages into five groups based on the average sentence length of each passage and then measure the performance on each group. In Figure 4, (a) and (b) show that our model achieves larger improvements as the average sentence length increases on the SQuAD dataset. Likewise, our model records a higher accuracy on WDW than the baseline, as shown in (c). This implies that our model handles passages consisting of long sentences better than word match-based QA models.
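The grouping step can be sketched as below. How the five groups were formed is not stated, so this sketch assumes equal-size groups after sorting by average sentence length; the function names and the toy passages are illustrative.

```python
def average_sentence_length(sentences):
    """Mean number of words per sentence in one tokenized passage."""
    return sum(len(s) for s in sentences) / len(sentences)


def split_into_groups(passages, n_groups=5):
    """Sort passages by average sentence length and cut into equal-size groups.

    Assumption: equal-size (quantile) groups; the grouping rule is not
    specified in the text.
    """
    ranked = sorted(passages, key=average_sentence_length)
    size, extra = divmod(len(ranked), n_groups)
    groups, start = [], 0
    for g in range(n_groups):
        end = start + size + (1 if g < extra else 0)
        groups.append(ranked[start:end])
        start = end
    return groups


# Seven toy passages, each with a single sentence of n words.
passages = [[["w"] * n] for n in (5, 1, 3, 2, 4, 7, 6)]
groups = split_into_groups(passages)
```

A metric such as EM or accuracy would then be averaged within each group to obtain one point per group.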
| Dataset (Model) | Highest gate values | Lowest gate values |
|---|---|---|
| SQuAD (Match-LSTM) | Tesla, 2011, Luther, 2016, There, BSkyB, Kublai, 2015, Goldenson, Chloroplasts, 2012, Genghis, Newton, TFEU, San, ctenophores, Esiason | and, modern, were, being, was, very, against, as, largely, highly, considered, traditional, political, extremely, particularly, rich, large |
| WDW-R (GA Reader) | Barack, Robert, Nicolas, George, Hugo, Timothy, James, Eric, Alex, Janet, Chudinov, Jacqueline, Shaoxuan | the, a, of, best, top, final, straight, economic, to, been, last, largest, lost, guaranteed, negotiating, first, more, be, all |
5.2 Qualitative Evaluations
We also perform in-depth qualitative analyses of the sentence gate value to examine the behavior of our proposed module.
5.2.1 Gating Value by Word Tokens
Each dimension of the sentence gate value determines the amount of sentence-level information added to the word-level representations. A higher sentence gate value thus indicates that the corresponding word requires more sentence-level question-passage matching information.
To analyze such behaviors with respect to different words, we sample random passages from SQuAD and WDW-R and plot the sentence gate value of each word token averaged across dimensions. We highlight the words with a relatively high gate value. (More visualization samples are reported in the supplementary material.) Figures 5(a) and (b) show the results for SQuAD and WDW-R, respectively, and each line in (b) corresponds to one of the three hops in the GA Reader.
Note that word tokens with relatively high gate values are related to the question. For instance, in Figure 5(a), words related to the question such as 1817, April, WSE, or 2009 are given high gate values. Likewise, in Figure 5(b), because the question asks for a person's name, the names of persons are assigned high gate values across different hops. On the other hand, we observe that some stop words, e.g., was, and, and by, and adverbs, e.g., previously, generally take low gate values on both datasets. To further support this observation, we list the word tokens with the highest and the lowest gate values in Table 6. Note that since the WDW-R dataset always requires the model to find a correct person name, high gate values are assigned to person names. These results support our initial hypothesis that the gate values effectively expose the words to the relevant sentence-level matching information, which is needed to solve the given questions.
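A ranking of tokens by gate value, of the kind summarized in Table 6, could be produced roughly as follows. The per-token value is the gate averaged across dimensions, as described above; additionally averaging over repeated occurrences of the same token is an assumption of this sketch, as are the function name and the toy values.

```python
from collections import defaultdict


def rank_tokens_by_gate(tokens, gates):
    """Average each token's gate over dimensions, then over occurrences."""
    sums, counts = defaultdict(float), defaultdict(int)
    for tok, gate in zip(tokens, gates):
        sums[tok] += sum(gate) / len(gate)  # mean across dimensions
        counts[tok] += 1
    means = {tok: sums[tok] / counts[tok] for tok in sums}
    ranked = sorted(means, key=means.get, reverse=True)
    return ranked, means


tokens = ["Tesla", "was", "born", "was"]
gates = [[0.9, 0.7], [0.2, 0.2], [0.5, 0.3], [0.1, 0.3]]
ranked, means = rank_tokens_by_gate(tokens, gates)
```

The head of `ranked` gives the tokens with the highest average gate and the tail gives those with the lowest.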
5.2.2 Gating Value by Hops
Once each word token learns sentence-level semantics, it is natural for the model to narrow its focus down to the word level in order to find the exact answer tokens.
To visualize this intuition, we draw a boxplot of the gate value distributions over the three hops obtained by the GA Reader on the WDW-R dataset. To obtain a single representative gate value for a passage, we average the values over dimensions and words. Figure 6 shows that the average gate value tends to be smaller in later hops. This observation is consistent with our assumption that the model focuses more on word-level information as it approaches the answer prediction stage.
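The per-passage summary used for the boxplot reduces each hop's gate values to one number by averaging over dimensions and then over words. A minimal sketch, with an assumed nested-list layout and toy values:

```python
def passage_gate_by_hop(gates_per_hop):
    """gates_per_hop[k][t] is the gate vector of word t at hop k."""
    summary = []
    for hop in gates_per_hop:
        per_word = [sum(g) / len(g) for g in hop]      # mean over dimensions
        summary.append(sum(per_word) / len(per_word))  # mean over words
    return summary


toy_gates = [
    [[0.8, 0.6], [0.6, 0.4]],  # hop 1
    [[0.5, 0.3], [0.3, 0.1]],  # hop 2
    [[0.2, 0.2], [0.1, 0.1]],  # hop 3
]
per_hop = passage_gate_by_hop(toy_gates)  # roughly 0.6, 0.3, 0.15
```

Collecting one such triple per passage gives the three distributions that a boxplot over hops would display.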
6 Related Works
In recent years, various tasks in machine comprehension have attracted considerable attention from the research community. Among the studied tasks are Automatic Summarization Cheng and Lapata (2016); Nallapati et al. (2017); Cao et al. (2016), Machine Translation Bahdanau et al. (2014); Kalchbrenner et al. (2016), Named Entity Recognition Lample et al. (2016); Agerri and Rigau (2016), and Question Answering Wang et al. (2017); Xiong et al. (2017).
Several studies have proposed end-to-end neural network models for QA. Most of the state-of-the-art models in QA utilize the attention mechanism Bahdanau et al. (2014) on top of an RNN for matching words from a passage to the question, or self-matching for better encoding of the passage Wang et al. (2017). For instance, Dynamic Coattention Network Xiong et al. (2017) employs a co-attentive encoder that iteratively explores the relationship between the question and the given context. Bi-Directional Attention Flow networks Seo et al. (2017) also utilize the attention mechanism in two different directions to capture the intricate dynamics of questions and contexts.
While most state-of-the-art models in QA are based on word-level attention mechanisms, sentence-level matching methods are popular in Natural Language Inference (NLI) tasks. For capturing sentence-level semantics, TBCNN Mou et al. (2015) introduces the sentence matching module architecture, which is widely used in other studies of the entailment task Liu et al. (2016); Choi et al. (2017); Chen et al. (2017). In these studies, the sentence matching module is shown to be effective in extracting relations between sentences. However, this approach has not been widely adopted in the QA area, and to the best of our knowledge only a few works have adopted it Samothrakis et al. (2017).
Across the models, Long Short-Term Memory Hochreiter and Schmidhuber (1997) networks or Gated Recurrent Units Cho et al. (2014) are widely used as building blocks for efficient text encoding. The gating mechanism present in both algorithms allows neural network models to calculate and select the right amount of necessary information from two or more distinct modules. Another application is gating between character-level and word-level embedding vectors for improved representation of hidden states Yang et al. (2016); Miyamoto and Cho (2016). It has been demonstrated that, with a fine-grained gating mechanism, the model benefits from the strengths of both character-level and word-level representations.
7 Conclusion
In this paper, we propose novel question-aware sentence gating networks. As our experiments reveal, the sentence gate efficiently and dynamically allows words to take in different amounts of contextual information, leading to more semantically meaningful word representations. Our experiments on two QA datasets, SQuAD and the cloze-style WDW, demonstrate that the proposed sentence gate improves performance over the baseline models and that its gate values behave in a meaningful way. Future work involves extending our algorithm to other tasks in machine comprehension.
- Agerri and Rigau (2016) Rodrigo Agerri and German Rigau. 2016. Robust multilingual named entity recognition with shallow semi-supervised features. Artificial Intelligence, 238:63–82.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Cao et al. (2016) Ziqiang Cao, Wenjie Li, Sujian Li, Furu Wei, and Yanran Li. 2016. Attsum: Joint learning of focusing and summarization with neural attention. arXiv preprint arXiv:1604.00125.
- Chen et al. (2017) Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Recurrent neural network-based sentence encoder with gated attention for natural language inference. arXiv preprint arXiv:1708.01353.
- Cheng and Lapata (2016) Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252.
- Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- Choi et al. (2017) Jihun Choi, Kang Min Yoo, and Sang-goo Lee. 2017. Unsupervised learning of task-specific tree structures with tree-lstms. arXiv preprint arXiv:1707.02786.
- Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.
- Dhingra et al. (2017) Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. 2017. Gated-attention readers for text comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1832–1846.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
- Hu et al. (2017) Minghao Hu, Yuxing Peng, and Xipeng Qiu. 2017. Reinforced mnemonic reader for machine comprehension. CoRR, abs/1705.02798.
- Huang et al. (2017) Hsin-Yuan Huang, Chenguang Zhu, Yelong Shen, and Weizhu Chen. 2017. Fusionnet: Fusing via fully-aware attention with application to machine comprehension. arXiv preprint arXiv:1711.07341.
- Kadlec et al. (2016) Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. 2016. Text understanding with the attention sum reader network. arXiv preprint arXiv:1603.01547.
- Kalchbrenner et al. (2016) Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.
- Liu et al. (2017) Rui Liu, Wei Wei, Weiguang Mao, and Maria Chikina. 2017. Phase conductor on multi-layered attentions for machine comprehension. arXiv preprint arXiv:1710.10504.
- Liu et al. (2016) Yang Liu, Chengjie Sun, Lei Lin, and Xiaolong Wang. 2016. Learning natural language inference using bidirectional lstm model and inner-attention. arXiv preprint arXiv:1605.09090.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
- Miyamoto and Cho (2016) Yasumasa Miyamoto and Kyunghyun Cho. 2016. Gated word-character recurrent language model. arXiv preprint arXiv:1606.01700.
- Mou et al. (2015) Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2015. Natural language inference by tree-based convolution and heuristic matching. arXiv preprint arXiv:1512.08422.
- Nallapati et al. (2017) Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. Summarunner: A recurrent neural network based sequence model for extractive summarization of documents. In AAAI, pages 3075–3081.
- Onishi et al. (2016) Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. 2016. Who did what: A large-scale person-centered cloze dataset. arXiv preprint arXiv:1608.05457.
- Pan et al. (2017) Boyuan Pan, Hao Li, Zhou Zhao, Bin Cao, Deng Cai, and Xiaofei He. 2017. Memen: Multi-layer embedding with memory networks for machine comprehension. arXiv preprint arXiv:1707.09098.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
- Samothrakis et al. (2017) Spyridon Samothrakis, Tom Vodopivec, Michael Fairbank, and Maria Fasli. 2017. Convolutional-match networks for question answering. International Joint Conferences on Artificial Intelligence.
- Seo et al. (2017) Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In Proceedings of the International Conference on Learning Representations (ICLR).
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
- Wang and Jiang (2016) Shuohang Wang and Jing Jiang. 2016. Machine comprehension using match-lstm and answer pointer. arXiv preprint arXiv:1608.07905.
- WANG and JIANG (2017) Shuohang WANG and Jing JIANG. 2017. Machine comprehension using match-lstm and answer pointer.
- Wang et al. (2017) Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 189–198.
- Xiong et al. (2017) Caiming Xiong, Victor Zhong, and Richard Socher. 2017. Dynamic coattention networks for question answering. In Proceedings of the International Conference on Learning Representations (ICLR).
- Yang et al. (2016) Zhilin Yang, Bhuwan Dhingra, Ye Yuan, Junjie Hu, William W Cohen, and Ruslan Salakhutdinov. 2016. Words or characters? fine-grained gating for reading comprehension. arXiv preprint arXiv:1611.01724.