A Question aware Open Information Extraction (Question aware Open IE) system takes a question and a passage as inputs and extracts from the passage a semi-structured answer, in tuple format, that answers the question. Question aware Open IE is both an Open IE task and a question answering task. From the Open IE view, an Open IE system extracts all possible tuples. For example, in Table 1, the Open IE system extracts four tuples from the passage, independent of any question, whereas a question aware Open IE system extracts only the one tuple that answers the question. From the question answering view, the answer of a search engine is a passage; the answer for Machine Reading Comprehension tasks like SQuAD, TriviaQA and NewsQA is a span from the passage; the answer for MS MARCO is a generated sentence. In contrast, the answer for question aware Open IE is a semi-structured tuple that is shorter than the passage and longer than a span. Each part of the tuple has a semantic role, which makes the answer easier for downstream tasks to consume.
|Question||how many albums has the doors sold|
|Passage||although the doors’ active career ended in 1973 , their popularity has persisted. according to the riaa, they have sold over 100 million records worldwide, making them one of the best-selling bands of all time.|
|Open IE Result||(the doors active career; ended; in 1973) (their popularity; has persisted; although the doors active career ended in 1973) (they; making; them one of the best-selling bands of all time) (they; have sold; 100 million records worldwide)|
|Answer (Question aware Open IE)||(they; have sold; 100 million records worldwide)|
Current solutions for question aware Open IE follow two approaches: the extractive method and the generative method. The extractive method first uses an Open IE model to extract all possible answer tuples from the passage as candidates, independent of the question. It then ranks the candidates with a model that matches each candidate against the question. Because this is a two-step method whose extraction step is independent of the question, the coverage of the extraction step is crucial for the final performance. Since the first step is extraction, most of the words in the result come from the passage.
The generative method concatenates the question and passage as input, and then either generates the answer tuple as one concatenated sequence or generates each field one by one. It uses the question and passage at the same stage and does not rely on an extraction model, so it has better interaction between question and passage. But because it removes the extraction step, it no longer exploits the fact that most of the answer words come from the passage.
To make better use of passage information in generation, we propose a two-stage decoder model for this task, consisting of a tagging decoder and a correction decoder, which allows more interaction between question and passage. In the first stage, the tagging decoder tags the passage words that may be useful for answer generation. The output of this step forms a coarse answer. In the second stage, the correction decoder generates a new answer word by word. The correction decoder can reorder words and add new ones to produce a fluent answer. We train the two decoders jointly.
We evaluate our model on the WebAssertions dataset. Our model achieves a BLEU score of 59.32, which is better than previous generative methods.
|Question||where is smallville filmed|
|Passage||smallville was primarily filmed in and around vancouver , british columbia , with local businesses and buildings substituting for smallville locations .|
|Answer||smallville; was filmed; in british columbia; with local businesses|
|Tagging Result||smallville, was primarily filmed, vancouver, british columbia, with local businesses|
|Correction Result||smallville, was filmed, in vancouver, british columbia with local businesses|
2 Related Work
Open Information Extraction (Open IE) [1, 2] aims to extract all (subject, predicate, arguments) tuples from a sentence. To solve this challenge, TextRunner and WOE use a self-supervised approach. Many later methods use a rule-based approach, such as ReVerb, OLLIE, KrakeN, ClausIE and PropS. Open IE4 (https://github.com/dair-iitd/OpenIE-standalone) extracts tuples from Semantic Role Labeling structures. Stanford Open Information Extraction uses natural logic inference to extract shorter arguments. Recently, Stanovsky et al. have proposed a supervised method for Open IE by formulating it as a sequence labeling task. Compared to Open IE, our task has an additional question, so our tagging decoder needs to contain an interaction layer between question and passage. Our tagging decoder is similar to that of Stanovsky et al., since both produce the same kind of output and are trained by supervised learning. However, we have an additional correction decoder that improves answer quality and can handle an answer field that is not a span.
Current Machine Reading Comprehension (MRC) datasets like SQuAD, TriviaQA and NewsQA focus on selecting a span from the passage as the answer. Most MRC models [17, 18, 11] generate answers by predicting the start and end points of a span. The MS MARCO dataset requires generating a sequence that is not a span of the passage. Tan et al. solve it by first selecting a span from the passage and then generating an answer based on the question, passage and selected span. Similar to Tan et al., we also use the idea of coarse-to-fine generation, but the answer in our task is neither a span nor a sentence. Our answer is structured, and each field has a semantic role. The arguments have variable length. A field does not have to be a span, although most of its words come from the passage. For this reason, we use a sequence labeling method to tag each word in the passage instead of predicting the start and end points of a span. The two stages of our model can be trained jointly.
For Question aware Open IE, Yan et al. propose two methods, an extractive method and a generative method. The extractive method first extracts all answer tuples from the passage and then ranks them with a matching model between answer candidates and the question. The generative model takes the concatenation of question and passage as input, first generates a representation of each answer field, and then generates the words of each field based on its representation.
3 Our Approach
In this section, we formulate the Question aware Open IE problem and briefly introduce our model. Then we separately introduce each part of our model including the encoder, tagging decoder, and correction decoder.
3.1 Problem Formulation
Question aware Open IE is the task of taking a question $q = (q_1, \dots, q_n)$ containing $n$ words and a passage $p = (p_1, \dots, p_m)$ containing $m$ words, and outputting a semi-structured answer $a$ that answers the question based on the passage. The answer consists of a subject, a predicate, and one or more arguments. We represent the answer as $a = (\mathit{sub}, \mathit{pred}, \mathit{arg}_1, \dots, \mathit{arg}_k)$, $k \ge 1$. Each answer field is a natural language word sequence.
3.2 Model Overview
Our model consists of three parts: an encoder, a tagging decoder and a correction decoder. We show our model in Figure 1. We use two encoders with the same structure to encode the question and the passage separately. The tagging decoder then lets the encoded question and passage interact, and tags each passage word with its semantic role in the answer. The tagging decoder tags all passage words at the same time. The correction decoder then generates an answer based on the tagging result, producing answer words one by one. Our intuition is to use the tagging decoder to highlight words in the passage and the correction decoder to generate a fluent answer based on the tagging results. We show not only that our approach is effective on this dataset, but also that the correction decoder can recover answer words that were not tagged.
We use the example in Table 2 to illustrate our idea. The argument "in british columbia" is not a span of the passage, because "in" is far from "british columbia", but most of the words in the answer are from the passage. The first stage of our model tags keywords in the passage. In this case, our tagging decoder tags all location words, such as "vancouver" and "british columbia", as arguments, but the tagging result misses "in". The second stage generates a fluent answer based on the tagging result. In this case, our correction decoder adds "in" to the tagging result. We can also see that the model is able to remove the adverb "primarily". Based on our case study, the correction decoder is good at word ordering and at small post-edits guided by its language model.
3.3 Encoder
The encoder of our model contains a question encoder and a passage encoder, which are used to encode the question and the passage separately. These two encoders have the same structure but different weights. The encoder is composed of two basic building blocks, the Multi-Head Attention Block and the Feed-Forward Block. We introduce these two building blocks and then how to build an encoder from them.
3.3.1 Multi-Head Attention Block (MHBlock)
The core layer of the Multi-Head Attention Block is the Multi-Head Attention Layer. The input of the Multi-Head Attention Layer consists of a query ($Q$), a key ($K$) and a value ($V$), all of which are matrices: $Q \in \mathbb{R}^{n_q \times d_k}$, $K \in \mathbb{R}^{n_k \times d_k}$, $V \in \mathbb{R}^{n_k \times d_v}$. The output of the Multi-Head Attention Layer is also a matrix, $O \in \mathbb{R}^{n_q \times d_v}$. We represent this layer as a function $O = \mathrm{MultiHead}(Q, K, V)$.
Intuitively, this layer is a soft dictionary lookup in vector space, where every unit of operation is a vector. A dictionary in computer science is a set of key-value pairs; a lookup finds the key that equals the query and returns the corresponding value. In Multi-Head Attention, there are $n_k$ key-value pairs; each key is a vector of dimension $d_k$ and each value is a vector of dimension $d_v$. Each of the $n_q$ queries has a corresponding output. For each query, we calculate an attention score against each key and use the scores as weights to compute a weighted sum of the values. The weighted sum is the output. More details are in Vaswani et al. (2017).
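The soft dictionary lookup described above can be sketched in a few lines of plain Python. This is a minimal illustration for a single query and a single attention head, using the scaled dot-product score of Vaswani et al. (2017); it leaves out the learned per-head projections of the full Multi-Head Attention Layer, and the names `softmax` and `soft_lookup` are hypothetical helpers, not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values):
    """Return the attention-weighted sum of `values` for one query vector."""
    d_k = len(query)
    # Score the query against every key (scaled dot product).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    # Blend the value vectors using the attention weights.
    d_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]

# Two key-value pairs; the query points almost entirely at the first key.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0, 0.0], [0.0, 20.0, 0.0]]
out = soft_lookup([8.0, 0.0], keys, values)
```

Because the query is far closer to the first key, the output is dominated by the first value vector, which is the "soft" counterpart of an exact dictionary hit.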
The Multi-Head Attention Block has the same input as the Multi-Head Attention Layer, but it requires the query and the output to have the same dimension so that a residual connection can be applied. The inputs go through a Multi-Head Attention Layer wrapped in a residual connection. The result is then passed through a layer normalization layer to get the final output.
3.3.2 Feed-Forward Network Block (FFNBlock)
The core layer of the Feed-Forward Block is the Feed-Forward Network, which applies a two-layer projection to each row of the input matrix.
The Feed-Forward Block has the same input and output as the Feed-Forward Network. We add the input and the output of the Feed-Forward Network, then pass the sum through a layer normalization layer to get the final output.
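Both blocks share the same wrapper: add the sub-layer's output to its input, then apply layer normalization. The sketch below shows only that wrapper in plain Python. It assumes a layer normalization without learned gain and bias, and `toy_sublayer` is a hypothetical stand-in for the real Multi-Head Attention Layer or Feed-Forward Network, not the paper's implementation.

```python
import math

def layer_norm(row, eps=1e-6):
    """Normalize one row to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]

def residual_block(rows, sublayer):
    """Apply `sublayer`, add its input back, then layer-normalize each row."""
    outputs = sublayer(rows)
    return [layer_norm([x + y for x, y in zip(r, o)])
            for r, o in zip(rows, outputs)]

def toy_sublayer(rows):
    """Stand-in sub-layer: ReLU followed by a fixed scaling of each entry."""
    return [[0.5 * max(0.0, x) for x in row] for row in rows]

block_out = residual_block([[1.0, -2.0, 3.0]], toy_sublayer)
```

The residual sum requires the sub-layer to return rows of the same width as its input, which is why the blocks constrain their input and output dimensions.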
3.3.3 Encoder Structure
An encoder maps a sequence of words into a sequence of hidden states. The question encoder and the passage encoder have the same structure. The input of the encoder is the word embedding of each word plus the position embedding. We use sine and cosine position embeddings. The encoder is composed of a stack of $N_e$ identical layers. The outputs of the question encoder and the passage encoder are $H^Q$ and $H^P$ respectively. For the question encoder:
$$H_0 = E(q) + PE$$
$$\tilde{H}_i = \mathrm{MHBlock}(H_{i-1}, H_{i-1}, H_{i-1})$$
$$H_i = \mathrm{FFNBlock}(\tilde{H}_i), \quad i = 1, \dots, N_e$$
$$H^Q = H_{N_e}$$
$E(\cdot)$ is an embedding lookup function; it takes a word id and outputs the corresponding word embedding vector. $PE$ is the position embedding. $\tilde{H}_i$ is an intermediate result. The passage encoder has the same structure, so we do not formulate it again.
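As an illustration of the encoder input, the following sketch builds a sine and cosine position embedding and adds it to word vectors. It assumes the standard formulation of Vaswani et al. (2017), with base 10000 and sine on even dimensions and cosine on odd ones; the function names are hypothetical and the word vectors are placeholders rather than learned embeddings.

```python
import math

def position_embedding(num_positions, dim):
    """Sinusoidal position embedding: sine on even indices, cosine on odd ones."""
    table = []
    for pos in range(num_positions):
        row = []
        for j in range(dim):
            # Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / dim).
            angle = pos / (10000 ** ((j - j % 2) / dim))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def encoder_input(word_vectors):
    """Add the position embedding to each word vector."""
    pe = position_embedding(len(word_vectors), len(word_vectors[0]))
    return [[w + p for w, p in zip(vec, row)]
            for vec, row in zip(word_vectors, pe)]

pe = position_embedding(3, 4)
```

Since the embedding depends only on the position and the dimension, the same table can serve the question encoder, the passage encoder and the correction decoder.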
3.4 Tagging Decoder
The tagging decoder generates a tag probability distribution for each word in the passage, given the question encoding $H^Q$ and the passage encoding $H^P$. In this subsection, we introduce the tag format, Semantic BIO Tags, and the structure of the tagging decoder. Formally, for each passage word $p_i$, the tagging decoder outputs a distribution $P(t_i \mid q, p)$, where $t_i$ is the tag of the $i$-th passage word. We denote the result for the whole passage as $T$. In our model, we keep $T$ as a continuous probability distribution so that the loss can be back-propagated. If we want to assign each word an explicit tag, we output the tag with maximum probability.
3.4.1 Semantic BIO Tags
We use semantic BIO tags, like Stanovsky et al. (2018), to tag passage words. Each tag combines two parts: a semantic tag for the semantic role in the answer and a BIO tag for the position within a field. The semantic role is subject (S), predicate (P) or argument (A). Since there can be more than one argument, arguments are also distinguished by position, with Ai denoting the i-th argument. The BIO tag is Begin (B), Inside (I) or Outside (O). For each continuous subsequence that belongs to the same semantic role, we tag the first word as B and the remaining words as I. After tagging all continuous subsequences, we tag all other words as O. Then we add the semantic role to the BIO tags. If the semantic role is predicate, for example, the tags become P-B and P-I. We show an example in Table 2. The predicate has two words, "was filmed", but they are not consecutive in the passage. For each sub-span, we tag the first word as P-B, so both "was" and "filmed" are tagged P-B. There may therefore be more than one word tagged P-B although there is only one predicate in the answer. For the same reason, there are two A1-B tags in the example, for "in" and "british".
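The tagging rule can be made concrete with a short sketch. It assumes each passage word has already been assigned a semantic role (or `None` for words outside the answer) and derives the semantic BIO tags from runs of equal roles; the function name and the role assignment for the Table 2 example are illustrative assumptions based on the answer shown there.

```python
def semantic_bio_tags(roles):
    """Map per-word roles ('S', 'P', 'A1', ... or None) to semantic BIO tags."""
    tags = []
    previous = None
    for role in roles:
        if role is None:
            tags.append("O")            # word is not part of the answer
        elif role == previous:
            tags.append(role + "-I")    # continues the current sub-span
        else:
            tags.append(role + "-B")    # starts a new sub-span
        previous = role
    return tags

passage = ("smallville was primarily filmed in and around vancouver , "
           "british columbia , with local businesses and buildings "
           "substituting for smallville locations .").split()
roles = [None] * len(passage)
roles[0] = "S"                             # smallville
roles[1] = roles[3] = "P"                  # was ... filmed
roles[4] = roles[9] = roles[10] = "A1"     # in ... british columbia
roles[12] = roles[13] = roles[14] = "A2"   # with local businesses
tags = semantic_bio_tags(roles)
```

In this example "was" and "filmed" both receive P-B because "primarily" separates them, and "in" and "british" both receive A1-B for the same reason.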
3.4.2 Tagging Decoder ground truth
We need to create the ground truth for tagging decoder training ourselves because the answer in the dataset is in tuple format. Intuitively, when an answer tuple was created, some words were selected from the passage and copied to the answer. Ideally, we want to tag exactly these words and let our model generate the answer based on them. Formally, we need to select continuous subsequences of the passage based on the answer and tag them by the rules above. Each subsequence must belong to a single semantic role, but each semantic role may correspond to several subsequences. The key challenge is that a word may occur multiple times in the passage. We propose a rule-based solution for this problem. Intuitively, for adjacent words in the answer, we prefer to match adjacent words in the passage, and for each answer field, we prefer to keep all matches as close together as possible. In detail, we match the fields in the order arguments, subject, predicate, because the arguments are longest and the predicate is shortest. For each field, we first try to match all of its bigrams in the passage, keeping multiple occurrences if they exist. We then match as many as possible of the single words that are not covered by a matched bigram. In this step, we minimize the distance between the leftmost and the rightmost matched word. In the Open IE task, the predicate is the shortest field and is often a unigram, so we match it last and prefer a predicate that lies between the subject and the arguments.
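A simplified sketch of this matching procedure for a single field is given below. It is a greedy approximation under stated assumptions, not the authors' exact rules: bigrams are matched first, then each uncovered word is matched to the occurrence that keeps the overall span smallest, with ties broken by the leftmost occurrence. The preference for a predicate between subject and arguments is not implemented, and `match_field` and its `taken` parameter are hypothetical names.

```python
def match_field(passage, field, taken=frozenset()):
    """Return sorted passage positions matched to the words of one answer field."""
    matched, covered = set(), set()
    # Step 1: match every bigram of the field against adjacent passage words,
    # keeping all occurrences.
    for j in range(len(field) - 1):
        for i in range(len(passage) - 1):
            free = i not in taken and i + 1 not in taken
            if free and passage[i:i + 2] == field[j:j + 2]:
                matched.update((i, i + 1))
                covered.update((j, j + 1))
    # Step 2: match leftover single words, keeping the matched span small.
    for j, word in enumerate(field):
        if j in covered:
            continue
        candidates = [i for i, w in enumerate(passage)
                      if w == word and i not in taken and i not in matched]
        if not candidates:
            continue  # the word does not occur in the passage; nothing to tag
        best = min(candidates,
                   key=lambda i: max(matched | {i}) - min(matched | {i}))
        matched.add(best)
    return sorted(matched)

passage = ("smallville was primarily filmed in and around vancouver , "
           "british columbia , with local businesses and buildings "
           "substituting for smallville locations .").split()
# Fields are matched in the order arguments, subject, predicate.
arg = match_field(passage, "in british columbia".split())
subj = match_field(passage, ["smallville"], taken=set(arg))
pred = match_field(passage, "was filmed".split(), taken=set(arg) | set(subj))
```

Positions already claimed by earlier fields are passed through `taken`, so a later field cannot reuse a word that an earlier field has matched.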
3.4.3 Tagging Decoder Structure
Like the question encoder, the tagging decoder needs to encode the passage. The difference is that it also needs to interact with the question. We achieve this by adding an additional attention layer from passage to question. The tagging decoder is composed of a stack of $N_t$ identical layers. Each layer consists of three sub-layers: a self-attention layer, a passage-to-question encoding layer and a feed-forward layer. The self-attention layer is a Multi-Head Attention Block used to encode the passage. Its query, key and value are identical and are the output of the previous layer. The passage-to-question layer is a Multi-Head Attention Block used to interact with the question. Its query is the output of the preceding self-attention layer; its key and value are identical and are the output of the question encoder $H^Q$. We also tried an interaction layer like BiDAF (Seo et al. 2016), but it gave no improvement over our model. Formally:
$$\tilde{D}_i = \mathrm{MHBlock}(D_{i-1}, D_{i-1}, D_{i-1})$$
$$\hat{D}_i = \mathrm{MHBlock}(\tilde{D}_i, H^Q, H^Q)$$
$$D_i = \mathrm{FFNBlock}(\hat{D}_i)$$
$D_i$ is the output of the $i$-th layer, with $D_0 = H^P$ the passage encoding; $\tilde{D}_i$ and $\hat{D}_i$ are two intermediate results in the same layer. $H^T = D_{N_t}$ is the final output of these layers.
Then we apply a linear projection and a softmax to each row of $H^T$ to calculate the tag probability distribution of each passage word $p_i$:
$$P(t_i \mid q, p) = \mathrm{softmax}(H^T_i W_t)$$
$H^T_i$ is the $i$-th row of $H^T$. $W_t$ is a linear projection matrix.
We use semantic BIO tags like Stanovsky et al. to tag passage words. Since there are multiple arguments, arguments are also distinguished by position, as Ai for the i-th argument. We need to create the ground truth for tagging decoder training ourselves because the answer in the dataset is in tuple format. We therefore tag the answer words that appear in the passage, keeping the tagged words as close together as possible.
3.5 Correction Decoder
The correction decoder takes the hidden state and the tagging result of the tagging decoder as input, and generates a new answer. It generates answer words one by one, as in machine translation.
We concatenate the answer tuple into one string as the output of the correction decoder. Formally, we concatenate the tuple into a sequence of $l$ words formatted as "sub <split> pred <split> arg_1 <split> … <split> arg_k". The "<split>" is an additional format word that separates the fields and thereby identifies their semantic roles in the structured answer representation. Dropping the "<split>" words gives the string version of the answer. We can choose the structured version or the string version as our output; the only difference between them is whether the <split> tag is present.
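The linearization and its inverse can be sketched as follows. The functions `linearize` and `parse` are hypothetical helpers, and the exact whitespace around "<split>" is an assumption; the field order (subject, predicate, then arguments) follows the description in the paper.

```python
SPLIT = "<split>"

def linearize(subject, predicate, arguments):
    """Join the answer fields into one string separated by the split word."""
    return f" {SPLIT} ".join([subject, predicate, *arguments])

def parse(sequence):
    """Recover (subject, predicate, arguments) from a linearized answer."""
    phrases = [phrase.strip() for phrase in sequence.split(SPLIT)]
    subject = phrases[0]
    predicate = phrases[1] if len(phrases) > 1 else ""
    return subject, predicate, phrases[2:]

text = linearize("smallville", "was filmed",
                 ["in british columbia", "with local businesses"])
```

Parsing a generated sequence this way treats the first phrase as the subject, the second as the predicate and all remaining phrases as arguments, so a malformed output with too few split words simply yields empty fields.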
The structure of the correction decoder is very similar to that of the tagging decoder. It is also composed of a stack of identical layers. Each layer consists of the same three sub-layers, except that the first sub-layer is a masked self-attention layer. The input is likewise the sum of the answer word embedding and the position embedding. Unlike in the tagging decoder, the first sub-layer, the masked self-attention layer, contains an additional memory mask. This mask only allows the hidden vector at position i to attend to hidden vectors before position i, because the correction decoder generates step by step: only the hidden vectors before position i are available when we generate the i-th word. The answer-to-passage encoding layer is also a Multi-Head Attention Block. Its query is the output of the masked self-attention layer; its key and value are identical and are the concatenation of two parts, the tagging decoder hidden state $H^T$ and the tagging result $T$.
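A memory mask of this kind can be sketched as a lower-triangular boolean matrix. The version below assumes the common convention in which the decoder input is shifted by a start-of-sentence word, so position i may attend to positions up to and including i; the function names are hypothetical.

```python
def memory_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]

def apply_mask(scores, mask, blocked=float("-inf")):
    """Replace disallowed attention scores so a softmax gives them zero weight."""
    return [[s if ok else blocked for s, ok in zip(row, mask_row)]
            for row, mask_row in zip(scores, mask)]

mask = memory_mask(3)
masked = apply_mask([[0.1, 0.2, 0.3]] * 3, mask)
```

With the mask applied before the softmax, all answer positions can be processed in one pass during training while each position still sees only earlier words.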
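In training, we can decode all answer words in parallel using the masked attention trick. The only structural difference between the tagging decoder and the correction decoder is that a word in the correction decoder can only attend to previous words, because in decoding we must generate answer words one by one. Suppose we have generated the first $j-1$ words $y_1, \dots, y_{j-1}$:
$$G_0 = E(\langle s \rangle, y_1, \dots, y_{j-1}) + PE$$
$$\tilde{G}_i = \mathrm{MaskedMHBlock}(G_{i-1}, G_{i-1}, G_{i-1})$$
$$\hat{G}_i = \mathrm{MHBlock}(\tilde{G}_i, [H^T; T], [H^T; T])$$
$$G_i = \mathrm{FFNBlock}(\hat{G}_i)$$
$\langle s \rangle$ is a special word at the beginning of a sentence. $G_i$ is the output of the $i$-th layer; $\tilde{G}_i$ and $\hat{G}_i$ are two intermediate results in the same layer. $H^G$ denotes the output of the last layer.
Then we generate the $j$-th word by:
$$P(y_j \mid y_{<j}, q, p) = \mathrm{softmax}(H^G_j W_g)$$
$H^G_j$ is the $j$-th row of $H^G$. $W_g$ is a linear projection matrix.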
In training, we create the ground truth for the tagging decoder and the correction decoder as described above. The two decoders have separate losses, but we train them jointly.
The loss of the tagging decoder is the negative log likelihood of the ground truth tags:
$$L_{tag} = -\sum_{n=1}^{N} \sum_{i=1}^{m} \log P(t_i = \hat{t}_i \mid q, p)$$
$\hat{t}_i$ is the ground truth tag of the $i$-th word in the passage. $N$ is the number of samples.
For generation, the loss is the negative conditional log probability of the ground truth answer:
$$L_{gen} = -\sum_{n=1}^{N} \sum_{i=1}^{l} \log P(y_i = \hat{y}_i \mid \hat{y}_{<i}, q, p)$$
$\hat{y}_i$ is the $i$-th word of the answer. $N$ is the number of samples.
The loss of our model is the weighted sum of the tagging decoder loss and the correction decoder loss.
$\lambda$ is the weight in this sum and is tuned on the validation set.
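The joint objective can be illustrated with a small sketch for one training sample. The paper's text does not make clear which of the two losses carries the weight, so attaching it to the tagging loss here is purely an assumption, as are the function names and the toy distributions.

```python
import math

def negative_log_likelihood(distributions, gold):
    """Sum of -log p(gold item) over a sequence of probability distributions."""
    return -sum(math.log(dist[g]) for dist, g in zip(distributions, gold))

def joint_loss(tag_dists, gold_tags, word_dists, gold_words, weight):
    """Weighted sum of both losses; which term carries `weight` is an assumption."""
    tagging = negative_log_likelihood(tag_dists, gold_tags)
    correction = negative_log_likelihood(word_dists, gold_words)
    return weight * tagging + correction

# Toy distributions for two passage words and one answer word.
tag_dists = [{"S-B": 0.5, "O": 0.5}, {"S-B": 0.25, "O": 0.75}]
word_dists = [{"smallville": 0.5, "<split>": 0.5}]
loss = joint_loss(tag_dists, ["S-B", "O"], word_dists, ["smallville"], weight=3.0)
```

Because both terms are computed from the same forward pass, one backward pass through the summed loss updates the shared encoders and both decoders together.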
|Model||Answer (BLEU-4)||Subject (BLEU-1)||Predicate (BLEU-1)||Arguments (BLEU-1)|
|Seq2Seq + Attention||31.85||-||-||-|
|Tagging + Correction||59.32||63.40||67.50||61.01|
|w/o semantic tag (Only BIO tag)||58.78||62.61||66.36||59.72|
4.1 Dataset and Evaluation
We use the WebAssertions dataset (Yan et al. 2018) to evaluate our model. WebAssertions is a Question aware Open IE dataset. To construct it, Yan et al. (2018) collected queries from search engine logs as questions, then used a search engine to retrieve and filter related passages that can directly answer the question. They then extracted answer tuples from the passages with the Open IE model ClausIE. Labelers judged whether each answer tuple has complete meaning and can answer the question. The answer tuples with a positive label are the final answers. About 40% of answers contain a field that is not a span of the passage; for example, the answer sometimes deletes words from the passage. Some words in the correction ground truth do not appear in the passage, and thus cannot appear in the tagging result. The dataset contains 358,427 (question, passage, answer) triples. We randomly split WebAssertions into training, validation and test sets with an 8:1:1 ratio. We use the validation set to tune the model and report results on the test set.
We evaluate the quality of the entire answer and of each semantic role. For the entire answer, we concatenate the answer tuple into a string and separate the roles with the special word "<split>". Since there is only one subject and one predicate in an answer, we can evaluate them directly. Because there may be more than one argument, we concatenate the arguments into one string with "<split>", just like the entire answer.
4.2 Implementation Details
4.2.1 Data Processing
We need to post-process the output to get the subject, predicate and arguments separately. For tagging results, we collect all the words with the same semantic tag to produce the corresponding answer field. The selected words are concatenated according to their order in the passage. If no word is tagged with a given semantic role, the result for that role is an empty string. For the generated answer, we split it into a list of phrases at the special split word. The first phrase is the subject, the second phrase is the predicate, and all other phrases are arguments. We use byte-pair encoding (BPE) to handle the out-of-vocabulary problem in the correction decoder. BPE splits each word into several subwords, and we control the number of distinct subwords in the corpus. When creating the ground truth semantic BIO tags, we match continuous subsequences at the word level, then map the semantic tags to the subword level and assign BIO tags at the subword level. After the model outputs the tagging result, we collect the subwords belonging to the same semantic role and undo BPE. We ignore the possibility of incomplete words and let the model learn to avoid them. For the generated answer, we first split the output into a list of phrases and then undo BPE on each phrase.
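The post-processing of tagging results can be sketched as below. It works at the word level and leaves out the BPE step, since the paper does not specify the subword format; `fields_from_tags` and `field` are hypothetical helper names, and the tags in the example are illustrative.

```python
def fields_from_tags(words, tags):
    """Group passage words by semantic role, preserving passage order."""
    fields = {}
    for word, tag in zip(words, tags):
        if tag == "O":
            continue
        role = tag.split("-")[0]  # "A1-B" -> "A1"
        fields.setdefault(role, []).append(word)
    return {role: " ".join(parts) for role, parts in fields.items()}

def field(fields, role):
    """Return the text of one role, or an empty string if nothing was tagged."""
    return fields.get(role, "")

words = "smallville was primarily filmed in vancouver".split()
tags = ["S-B", "P-B", "P-I", "P-I", "O", "A1-B"]
result = fields_from_tags(words, tags)
```

Here the predicate comes out as "was primarily filmed", which mirrors the tagging result in Table 2 and shows why a correction step is still useful after tagging.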
We tune our hyper-parameters on the validation set. The hidden size of our model is 512. We use a shared vocabulary for question, passage and answer. The BPE vocabulary size is 37000. We share the embedding weights among the question encoder, passage encoder, correction decoder, and pre-softmax linear transformation of the correction decoder. We use 8 heads for Multi-Head Attention. The number of layers in the question and passage encoders, $N_e$, is 2, and the number of layers in the tagging decoder, $N_t$, is 4. The correction decoder layer number is 4. The correction decoder layer number is 6. The loss weight is set to 3. We use the ADAM optimizer to update model parameters and set the learning rate to 0.001.
Our proposed model is called Tagging + Correction. We compare it with three baselines.
Seq2Seq + Attention. This model formulates the task as a sequence-to-sequence problem. It concatenates the question and passage into one string as input, and concatenates the tuple into one string as output. It inserts the special tag "<EOQ>" between question and passage, and the special tag "," between the fields of the tuple. The model uses a bidirectional GRU as encoder and a GRU as decoder, with an attention mechanism.
Seq2Ast. This sequence-to-assertion model (Seq2Ast) has the same input processing and encoder as Seq2Seq + Attention. The difference is that it uses a hierarchical decoder, which first generates a representation for each field with a tuple-level decoder and then generates the words of each field with a word-level decoder.
Tagging. We remove the correction decoder and train only the tagging decoder.
Yan et al. also propose an extractive method, but it is not comparable with generative methods. This method first extracts all possible answer tuples from the passage and then uses a ranking model to select the best answer as output. The dataset itself was constructed by extracting tuples, and their extraction model uses the same extractor, so the right answer is always in the ranking list. The key challenge for extractive methods is how to design the matching model, and they are evaluated with ranking metrics such as MAP and MRR. If we evaluate the extractive method with BLEU, it reaches 72.27. This is higher than our result, but that is to be expected because the method leverages the way the dataset was constructed. Our method does not rely on any Open IE model, so it cannot take advantage of a dataset constructed in this way.
4.4 Experiment Result
The results are in Table 3. The results for the entire answer and for each semantic role show the same trend. We observe the following: (i) the Tagging + Correction model achieves the best results, which demonstrates the effectiveness of our model; (ii) the Tagging + Correction model is better than the Seq2Seq model, which means that tagging the keywords first improves generation quality; we think this is because the tagging decoder provides a guide for the correction decoder; (iii) the Tagging + Correction model is also better than the tagging-only model, which means the second-step correction is necessary.
In the subject, predicate and arguments columns, we find that the results for the predicate are better than for the subject, and the results for the subject are better than for the arguments. This may be because of the different properties of the semantic roles. The subject is often a noun phrase. The predicate is a verb and has an average length of 1.4. The arguments are modifying phrases, which are the longest and most complicated. Intuitively, the properties of a single word are enough to determine whether it is a predicate, and the properties of two adjacent words are enough to determine the boundary of a noun phrase. But we may need more sophisticated sentence information, such as a syntax tree, to extract arguments. We leave this as future work to improve our model.
We also remove the question, so that the task becomes an Open IE problem. We denote this setting as w/o question. The BLEU of the entire answer decreases by 2.6 compared with Tagging + Correction, and the results for all semantic roles decrease too. This shows that Question aware Open IE cannot be solved as a plain Open IE task.
We also try removing the semantic role from the tags and keeping only the BIO tags. The w/o semantic tag results in Table 3 show that the BLEU of the entire answer decreases by 0.54, and the BLEU-1 of each semantic role decreases by about 0.8 to 1.3. This shows that the correction decoder benefits from the semantic tags.
4.5 Case Study
We also conduct a case study to analyze our results. We randomly sample 50 examples from the test set and predict with the Tagging + Correction model. The results are summarized in Table 4.
|correct / exactly match||30%|
|correct / comparable||10%|
|correct / better||18%|
|correct / incomplete label||18%|
|wrong / wrong focus||12%|
|wrong / grammar problem||6%|
|wrong / lost key words||6%|
We find that about 76% of cases are correct. Comparable means the model output is comparable with the ground truth and it is hard to tell which one is better. About 18% of cases are better than the ground truth, because the generated answer is shorter and clearer than the ground truth answer, especially in the arguments. Another 18% of cases are correct but differ from the ground truth because of an incomplete label, which means there is more than one answer in the passage for the question. Based on these results, we see that Question aware Open IE is hard to evaluate because of the open definition of the information extraction problem: there may be more than one answer in the passage, and each answer may have multiple paraphrases. A better dataset could help to solve the "better" and "incomplete label" problems.
For the wrong outputs, 12% of cases are due to a wrong focus, which means the answer is not related to the question. 6% of cases are due to a grammar problem, which means the answer is not fluent; the language model of the correction decoder is still not good enough. 6% of cases are due to lost key words. In the future, we could try to improve the interaction between question and passage to address the wrong focus and lost key words problems. We could also try transfer learning to improve the language model.
In this paper, we introduce a two-stage decoder model to solve the question aware Open IE task. Because most of the answer words come from the passage, we use a tagging decoder to tag the key words in the passage first, and then generate a refined answer with a correction decoder based on the output of the tagging decoder. The experiments on WebAssertions show that our method outperforms other pure generation models and tagging models. Our model does not rely on any Open IE tools, which gives it good generalization ability. In the future, we will try more methods to improve our results, such as incorporating syntax information or using other interaction methods. We will also consider creating a better dataset to accelerate research in this area.
-  Banko, Michele and Cafarella, Michael J and Soderland, Stephen and Broadhead, Matthew and Etzioni, Oren. Open information extraction from the web. Vol. 7. pages 2670–2676. IJCAI, 2007.
-  Etzioni, Oren and Banko, Michele and Soderland, Stephen and Weld, Daniel S. Open information extraction from the web. Vol. 51. pages 68–74. ACM, 2008.
-  Fader, Anthony and Soderland, Stephen and Etzioni, Oren. Identifying relations for open information extraction. In Proceedings of the conference on empirical methods in natural language processing, pages 1535–1545. Association for Computational Linguistics, 2011.
-  Schmitz, Michael and Bart, Robert and Soderland, Stephen and Etzioni, Oren and others. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 523–534. Association for Computational Linguistics, 2012.
-  Akbik, Alan and Löser, Alexander. Kraken: N-ary facts in open information extraction. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 52–56. Association for Computational Linguistics, 2012.
-  Del Corro, Luciano and Gemulla, Rainer. Clausie: clause-based open information extraction. In Proceedings of the 22nd international conference on World Wide Web, pages 355–366. ACM, 2013.
-  Stanovsky, Gabriel and Michael, Julian and Zettlemoyer, Luke and Dagan, Ido. Supervised open information extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Vol. 1. pages 885–895. 2018.
-  Wu, Fei and Weld, Daniel S. Open information extraction using Wikipedia. In Proceedings of the 48th annual meeting of the association for computational linguistics, pages 118–127. Association for Computational Linguistics, 2010.
-  Stanovsky, Gabriel and Ficler, Jessica and Dagan, Ido and Goldberg, Yoav. Getting more out of syntax with props. In arXiv preprint arXiv:1603.01648, 2016.
-  Angeli, Gabor and Premkumar, Melvin Jose Johnson and Manning, Christopher D. Leveraging linguistic structure for open domain information extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1. pages 344–354. 2015.
-  Yu, Adams Wei and Dohan, David and Luong, Minh-Thang and Zhao, Rui and Chen, Kai and Norouzi, Mohammad and Le, Quoc V. QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. In arXiv preprint arXiv:1804.09541, 2018.
-  Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, Łukasz and Polosukhin, Illia. Attention is all you need. pages 5998–6008. Advances in Neural Information Processing Systems, 2017.
-  Yan, Zhao and Tang, Duyu and Duan, Nan and Liu, Shujie and Wang, Wendi and Jiang, Daxin and Zhou, Ming and Li, Zhoujun. Assertion-based QA with Question-Aware Open Information Extraction. Association for the Advancement of Artificial Intelligence, 2018.
-  Papineni, Kishore and Roukos, Salim and Ward, Todd and Zhu, Wei-Jing. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.
-  Gage, Philip. A New Algorithm for Data Compression. C Users Journal, 1994.
-  Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. In arXiv preprint arXiv:1412.6980, 2014.
-  Wang, Shuohang and Jiang, Jing. Machine comprehension using match-lstm and answer pointer. In arXiv preprint arXiv:1608.07905, 2016.
-  Wang, Wenhui and Yang, Nan and Wei, Furu and Chang, Baobao and Zhou, Ming. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. pages 189–198. 2017.
-  Ba, Jimmy Lei and Kiros, Jamie Ryan and Hinton, Geoffrey E. Layer normalization. In arXiv preprint arXiv:1607.06450, 2016.
-  Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392. 2016.
-  Joshi, Mandar and Choi, Eunsol and Weld, Daniel S and Zettlemoyer, Luke. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In arXiv preprint arXiv:1705.03551, 2017.
-  Trischler, Adam and Wang, Tong and Yuan, Xingdi and Harris, Justin and Sordoni, Alessandro and Bachman, Philip and Suleman, Kaheer. Newsqa: A machine comprehension dataset. In arXiv preprint arXiv:1611.09830, 2016.
-  Seo, Minjoon and Kembhavi, Aniruddha and Farhadi, Ali and Hajishirzi, Hannaneh. Bidirectional attention flow for machine comprehension. In arXiv preprint arXiv:1611.01603, 2016.
-  Nguyen, Tri and Rosenberg, Mir and Song, Xia and Gao, Jianfeng and Tiwary, Saurabh and Majumder, Rangan and Deng, Li. MS MARCO: A human generated machine reading comprehension dataset. In arXiv preprint arXiv:1611.09268, 2016.
-  Tan, Chuanqi and Wei, Furu and Yang, Nan and Du, Bowen and Lv, Weifeng and Zhou, Ming. S-net: From answer extraction to answer generation for machine reading comprehension. In arXiv preprint arXiv:1706.04815, 2017.