Selection-based question answering is the task of selecting a segment of text, or interchangeably a context, from a provided set of contexts that best answers a posed question. We define a context as a single document section, a group of contiguous sentences, or a single sentence. Selection-based question answering is subdivided into answer sentence selection and answer triggering. Answer sentence selection is the task of ranking the sentences that answer a question above the irrelevant sentences, where the provided set of candidate sentences contains at least one sentence that answers the question. Answer triggering is the task of selecting any number of sentences that answer a question from a set of candidate sentences that may or may not contain such sentences. Several corpora have been created for these tasks [1, 2, 3], allowing researchers to build effective question answering systems [4, 5, 6] with the aim of improving reading comprehension through understanding and reasoning of natural language. However, most of these datasets are constrained in the number of examples and the scope of topics. We attempt to mitigate these limitations to allow for a more thorough reading comprehension evaluation of open-domain question answering systems.
This paper presents a new corpus with annotated question answering examples on various topics drawn from Wikipedia. An effective annotation scheme is proposed to create a large corpus that is both challenging and realistic. Each question is additionally annotated with its topic, type, and a paraphrase, which enables comprehensive analyses of system performance on the answer sentence selection and answer triggering tasks. Two recent state-of-the-art systems based on convolutional and recurrent neural networks are implemented to analyze this corpus and to provide strong baseline measures for future work. In addition, our systems are evaluated on another dataset, WikiQA, for a fair comparison to previous work. Our analysis suggests extensive ways of evaluating selection-based question answering, providing meaningful benchmarks for question answering systems. All our work will be made publicly available on GitHub.
II Related Work
The TREC QA competition datasets (http://trec.nist.gov/data/qa.html) have been a popular choice for evaluating answer sentence selection. Wang et al. [1] combined the TREC 8-12 datasets for training and divided the TREC-13 dataset into development and evaluation sets. This dataset, known as QASent, has been the standard benchmark for answer sentence selection, although it is rather small (277 questions with manually picked answer contexts). Yang et al. [2] introduced a larger dataset, WikiQA, consisting of questions collected from the user logs of the Bing search engine. Our corpus is similar to WikiQA but covers more diverse topics, consists of a larger number of questions (about 6 times larger for answer sentence selection and 2.5 times larger for answer triggering), and makes use of more contexts by extracting them from entire articles instead of from only the abstracts. Feng et al. [3] distributed another dataset, InsuranceQA, including questions in the insurance domain. WikiQA introduced the task of answer triggering and has been the only answer triggering dataset; our corpus provides a new, automatically generated answer triggering dataset.
Due to the increasing complexity of question answering, deep learning has become a popular approach to its difficult problems.
Yu et al. [7] proposed a convolutional neural network with a single convolution layer, average pooling, and logistic regression at the end for factoid question answering. Since then, more convolutional neural network based frameworks have been proposed as solutions for question answering [8, 9, 10, 11, 12]. Our convolutional neural network model is inspired by the previous work utilizing the tree-edit distance and the tree kernel [13, 14, 15], although we introduce a different way of performing subtree matching that facilitates word embeddings. Our recurrent neural network models with attention are based on established state-of-the-art systems for answer sentence selection [16, 17].
Our annotation scheme provides a framework for any researcher to create a large, diverse, pragmatic, and challenging dataset for answer sentence selection and answer triggering, while maintaining a low cost using crowdsourcing.
III-A Data Collection
A total of 486 articles are uniformly sampled from the following 10 topics of the English Wikipedia, dumped in August 2014:
Arts, Country, Food, Historical Events,
Movies, Music, Science, Sports, Travel, TV.
These are the most prevalent topics categorized by DBpedia (http://dbpedia.org). The original data is preprocessed into smaller chunks. First, each article is divided into sections using the section boundaries provided in the original dump (https://dumps.wikimedia.org/enwiki). Each section is then segmented into sentences by the open-source toolkit NLP4J (https://github.com/emorynlp/nlp4j). In our corpus, documents refer to individual sections of the Wikipedia articles.
| Total # of articles  | 486       |
| Total # of sections  | 8,481     |
| Total # of sentences | 113,709   |
| Total # of tokens    | 2,810,228 |
III-B Annotation Scheme
Four annotation tasks are conducted in sequence on Amazon Mechanical Turk for answer sentence selection (Tasks 1-4), and a single task is conducted for answer triggering using only Elasticsearch (Task 5; see Figure 1 for the overview).
| Topic: TV, Article: Criminal Minds, Section: Critical reception |
| 1. The premiere episode was met with mixed reviews, receiving a score of 42 out of 100 on aggregate review site Metacritic, indicating “mixed or average” reviews. |
| 2. Dorothy Rabinowitz said, in her review for the Wall Street Journal, that “From the evidence of the first few episodes, Criminal Minds may be a hit, and deservedly”… |
| 3. The New York Times was less than positive, saying “The problem with Criminal Minds is its many confusing maladies, applied to too many characters” and felt that “as a result, the cast seems like a spilled trunk of broken toys, with which the audience - and perhaps the creators - may quickly become bored.” |
| 4. The Chicago Tribune reviewer, Sid Smith, felt that the show “May well be worth a look” though he too criticized “the confusing plots and characters”. |
| Task 1     | How was the premiere reviewed? |
| Task 2     | Who felt that Criminal Minds had confusing characters? |
| Task 3.1   | How were the initial reviews? |
| Task 3.2   | Who was confused by characters on Criminal Minds? |
| Task 4.3.1 | How were the initial reviews in Criminal Minds? |
| Task   | Single | Multiple | All   | Overlap (%) |       |       | Avg. time | Cost  |
|--------|--------|----------|-------|-------------|-------|-------|-----------|-------|
| Task 1 | 1,824  | 154      | 1,978 | 44.99       | 23.65 | 28.88 | 71 sec.   | $0.10 |
| Task 2 | 1,828  | 148      | 1,976 | 44.64       | 23.20 | 28.62 | 64 sec.   | $0.10 |
| Task 3 | 3,637  | 313      | 3,950 | 38.03       | 19.99 | 24.41 | 41 sec.   | $0.08 |
| Task 4 | 682    | 55       | 737   | 31.09       | 19.41 | 21.88 | 54 sec.   | $0.08 |
Task 1: Approximately two thousand sections are randomly selected from the 486 articles in Section III-A. All the selected sections consist of 3 to 25 sentences; we found that annotators had difficulty annotating longer sections accurately and in a timely manner. For each section, annotators are instructed to generate a question that can be answered by one or more sentences in the provided section, and to select the corresponding sentence or sentences that answer the question. The annotators are provided with the instructions, the topic, the article title, the section title, and the list of numbered sentences in the section (Table II).
Task 2: Annotators are asked to create another set of 2K questions from the same selected sections, excluding the sentences selected as answers in Task 1. The goal of Task 2 is to generate questions that can be answered by sentences different from those used to answer questions generated in Task 1. The annotators are provided with the same information as in Task 1, except that the sentences used as answer contexts in Task 1 are crossed out (line 1 in Table II); annotators are instructed not to use these sentences to generate new questions.
Task 3: Although our instructions encourage the annotators to create questions in their own words, annotators tend to generate questions with some lexical overlap with the corresponding contexts. The intention of this task is to mitigate the annotators' tendency to generate questions with vocabulary and phrasing similar to the answer contexts, a necessary step in creating a corpus that evaluates reading comprehension rather than the ability to model word co-occurrences. The annotators are provided with the previously generated questions and answer contexts, and are instructed to paraphrase the questions using different terms.
Task 4: Most questions generated by Tasks 1-3 are of high quality; that is, they can be answered by a human given the corresponding contexts. However, some questions are ambiguous and difficult for humans to answer correctly, often because they incorrectly assume that the related sections are provided along with the questions. For instance, it is impossible to answer the question from Task 3.1 in Table II unless the related section is provided with it. These ambiguous questions are sent back to the annotators for revision.
Elasticsearch (www.elastic.co/products/elasticsearch), a Lucene-based open-source search engine, is used to find potentially ambiguous questions. First, an inverted index of the 8,481 sections is built, where each section is considered a document. Each question is then queried against this search engine. If the answer context is not included within the top 5 sections of the search result, the question is marked 'suspicious', although it may not actually be ambiguous. Among the 7,904 questions generated by Tasks 1-3, 1,338 are found to be suspicious. These questions are sent back to the annotators and rephrased if deemed necessary.
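The suspicious-question check can be sketched without a running Elasticsearch cluster. The toy ranker below uses a simple TF-IDF sum as a stand-in for Lucene's scoring; the function names and section data are illustrative, not part of the actual pipeline:

```python
import math
from collections import Counter

def build_index(sections):
    """Tokenize sections and precompute document frequencies."""
    docs = [s.lower().split() for s in sections]
    df = Counter()
    for d in docs:
        df.update(set(d))
    return docs, df

def rank_sections(question, docs, df):
    """Rank every section against the question with a simple TF-IDF sum."""
    n = len(docs)
    q_terms = question.lower().split()
    scores = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        s = sum(tf[t] * math.log(1 + n / df[t]) for t in q_terms if t in tf)
        scores.append((s, i))
    return [i for s, i in sorted(scores, reverse=True)]

def is_suspicious(question, answer_section_id, docs, df, k=5):
    """A question is 'suspicious' if its answer section is not in the top-k hits."""
    return answer_section_id not in rank_sections(question, docs, df)[:k]
```

In the actual pipeline, `rank_sections` corresponds to a top-5 Elasticsearch query over the 8,481 indexed sections.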
Task 5: Using the previously generated answer sentence selection data, the answer triggering corpus is generated automatically, again using Elasticsearch. To generate answer contexts for answer triggering, all 14M sections from the entire English Wikipedia are indexed, and each question from Tasks 1-4 is queried. Every sentence in the top 5 highest-scoring sections from Elasticsearch is collected as a candidate; these candidates may or may not include the answer context that resolves the question.
III-C Corpus Analysis
The entire annotation took about 130 hours; each mturk job took on average approximately 1 minute and cost $0.08-0.10 (Table III). A total of 7,904 questions were generated from Tasks 1-4, where 92.2% of them found their answers in single sentences. Task 3 was clearly effective in reducing the percentage of overlapping words between question and answer pairs (about 4%; f in Table III); the questions from Task 3 can also be used to develop paraphrasing models. Multiple pilot studies on different task designs were conducted to analyze quality and cost; Tasks 1-4 proved the most effective in these pilot studies. Following Ho et al. [18], we paid incentives to those who submitted outstanding work, which improved the overall quality of our annotation.
Our corpus can be compared to WikiQA, which was created with the intent of providing a challenging dataset for selection-based question answering. Questions in WikiQA were collected from the user logs of the Bing search engine and associated with specific sections of Wikipedia, namely the first sections, known as the abstracts. We aim to provide a similar yet more exhaustive dataset by broadening the scope to all sections. A notable difference was found between the two corpora in overlapping words (about 11% difference), which was expected due to the artificial question generation in our scheme. Although questions taken from search queries are more natural, real search queries are inaccessible to most researchers. The annotation scheme proposed here can prove useful for researchers needing to create a corpus for selection-based QA.
Our answer triggering dataset contains 5 times more answer candidates per question than WikiQA because WikiQA includes only sections clicked on by users. Manual selection is eliminated from our framework, making our corpus more practical. In WikiQA, 40.76% of the questions have corresponding answer contexts for answer triggering, as compared to 39.25% in ours.
Two models using convolutional neural networks are developed: one is our replication of the best model of Severyn and Moschitti [6], and the other is an improved model using subtree matching (Section IV-A). Two more models using recurrent neural networks are developed: one is our replication of the attentive pooling model of dos Santos et al. [17], and the other is a simpler model using one-way attention (Section IV-B). These are inspired by the latest state-of-the-art approaches, providing sensible evaluations.
IV-A Convolutional Neural Networks
Our CNN model is motivated by Severyn and Moschitti [6]. First, a convolutional layer with the hyperbolic tangent activation function is applied to the 'image' of the text. The image consists of rows standing for consecutive words in two sentences, the question (Q) and the answer candidate (A), where the words are represented by their embeddings [19]. For our experiments, we use an image of 80 rows (40 each for the question and the answer); if the question or the answer is longer than 40 tokens, the rest is cut from the input. Next, max pooling is applied (we also experimented with average pooling, which led to a marginally lower accuracy), and the sentence vectors for Q and A are generated. Unlike Severyn and Moschitti, who performed the dot product between these two vectors, we add another hidden layer to learn their weights. Finally, the sigmoid activation function is applied and the entire network is trained using the binary cross-entropy loss.
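The fixed-size input described above can be sketched as follows; `to_image` is a hypothetical helper that truncates or zero-pads the question and answer word vectors to 40 rows each:

```python
MAX_LEN = 40  # tokens kept per sentence, as described in the paper

def to_image(question_vecs, answer_vecs, dim):
    """Stack question and answer word vectors into a fixed 80-row 'image'.

    Sentences longer than MAX_LEN are truncated; shorter ones are
    zero-padded, so every input has the same shape (80 rows of `dim` floats).
    """
    def fit(vecs):
        rows = [list(v) for v in vecs[:MAX_LEN]]
        rows += [[0.0] * dim for _ in range(MAX_LEN - len(rows))]
        return rows
    return fit(question_vecs) + fit(answer_vecs)
```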
Next, we use a logistic regression model, where the CNN score from the output layer serves as one of the features. The other features are the number of overlapping words between Q and A, the same count normalized by the IDF, and the question length. While the logistic regression could be merged directly into our CNN model, it has been shown empirically that constructing this last phase as a separate model is more effective.
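A minimal sketch of this feature assembly, assuming a simple whitespace tokenizer and a precomputed IDF table; the function name and feature order are illustrative, not the authors' exact implementation:

```python
def overlap_features(question, answer, idf, cnn_score):
    """Assemble a logistic-regression feature vector:
    [CNN score, raw overlap count, IDF-weighted overlap, question length]."""
    q, a = question.lower().split(), answer.lower().split()
    common = set(q) & set(a)                       # words occurring in both sentences
    weighted = sum(idf.get(w, 0.0) for w in common)  # IDF-normalized overlap
    return [cnn_score, len(common), weighted, len(q)]
```

The resulting vector can be fed to any off-the-shelf logistic regression implementation.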
For the answer sentence selection task, the predictions for each question are treated as a ranking, and the MAP and MRR scores are calculated (Section V-B). For the answer triggering task (Section V-C), a threshold is applied to the logistic regression score for each question: the candidate with the highest score is considered the answer if its score is above the threshold found during development; otherwise, the model assumes that no answer context exists in the document for that question. Figure 2 shows the overview of our CNN and LR model.
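The ranking evaluation can be sketched directly from the definitions of MAP and MRR; each question contributes a list of candidate labels in ranked order (1 = answer sentence):

```python
def mean_reciprocal_rank(rankings):
    """rankings: one label list per question, best-scored candidate first."""
    total = 0.0
    for labels in rankings:
        for rank, y in enumerate(labels, 1):
            if y == 1:                 # reciprocal rank of the first answer
                total += 1.0 / rank
                break
    return total / len(rankings)

def mean_average_precision(rankings):
    total = 0.0
    for labels in rankings:
        hits, precisions = 0, []
        for rank, y in enumerate(labels, 1):
            if y == 1:                 # precision at each answer position
                hits += 1
                precisions.append(hits / rank)
        total += sum(precisions) / max(len(precisions), 1)
    return total / len(rankings)
```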
We propose a subtree matching mechanism for measuring the contextual similarity between two sentences. All sentences are automatically parsed by the NLP4J dependency parser [20]. First, a set W of words co-occurring in Q and A is created. For each w in W, w's parents, siblings, and children are extracted from the dependency trees of Q and A. When word-forms are used as the comparator, match(w_q, w_a) returns 1 if w_q and w_a have the same form, and 0 otherwise. When word embeddings are used as the comparator, match returns the cosine similarity between the embeddings of w_q and w_a. The aggregation function Σ takes a list of scores and returns either their sum, avg, or max. Finally, the triplet of aggregated parent, sibling, and child scores is used as additional features for the logistic regression model. Algorithm 1 presents the entire process in detail. Although our subtree matching mechanism adds only 3 more features, our experiments show significant performance gains for both answer sentence selection and answer triggering, strengthening our hypothesis that deeper contextual similarity is required to solve question answering problems more effectively.
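A sketch of the three subtree-matching features, assuming each co-occurring word maps to its (parents, siblings, children) lists from the two dependency trees; the names `subtree_features`, `exact`, and `avg` are illustrative:

```python
def exact(a, b):
    """Word-form comparator: 1 if the two words have the same form."""
    return 1.0 if a == b else 0.0

def avg(scores):
    """One possible aggregation (the others being sum and max)."""
    return sum(scores) / len(scores)

def subtree_features(q_ctx, a_ctx, match, agg):
    """Compute the (parent, sibling, child) subtree-matching features.

    q_ctx / a_ctx: word -> (parents, siblings, children) lists taken from
    the dependency trees of Q and A, keyed by co-occurring words.
    """
    feats = []
    for slot in range(3):  # 0 = parents, 1 = siblings, 2 = children
        scores = []
        for w in q_ctx:
            if w not in a_ctx:
                continue
            for wq in q_ctx[w][slot]:
                for wa in a_ctx[w][slot]:
                    scores.append(match(wq, wa))
        feats.append(agg(scores) if scores else 0.0)
    return feats  # 3 extra features for the logistic regression
```

Swapping `exact` for an embedding-cosine comparator yields the word-embedding variant described above.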
IV-B Recurrent Neural Networks
Our RNN model is based on the bidirectional Long Short-Term Memory (LSTM) with attentive pooling introduced by dos Santos et al. [17], except that our network uses a gated recurrent unit (GRU) [21] instead of LSTM. From our preliminary experiments, we found that the GRU converged faster than the LSTM while achieving similar performance on these tasks. Let Q be the question, A the answer candidate, and e(w) the embedding of a word w. Embeddings are encoded by a single bidirectional GRU that consists of a forward and a backward GRU, each with h hidden units. Given a word w_t, the BiGRU outputs the concatenation of the hidden states of the forward and backward GRUs: BiGRU(w_t) = [h_t^fw ; h_t^bw].
Let d = 2h represent the dimensionality of the output of the BiGRU. Then, the sentence embedding matrices M_Q ∈ R^{d×n} and M_A ∈ R^{d×m} are generated by applying the BiGRU to the n words of Q and the m words of A, respectively.
Attentive Pooling (AP) is a framework-independent two-way attention mechanism that jointly learns a similarity measure between Q and A over the hidden states of the BiGRU. The AP matrix G has a bilinear form followed by a hyperbolic tangent non-linearity, where U ∈ R^{d×d}: G = tanh(M_Q^T U M_A).
The importance vectors g_Q and g_A are generated from the column-wise and row-wise max pooling over G, respectively: [g_Q]_i = max_j G_{i,j} and [g_A]_j = max_i G_{i,j}.
The normalized attention vectors σ_Q and σ_A are created by applying the softmax activation function to g_Q and g_A: σ_Q = softmax(g_Q), σ_A = softmax(g_A).
The final representations r_Q and r_A for Q and A are created as the products of the sentence embedding matrices and their corresponding attention vectors: r_Q = M_Q σ_Q and r_A = M_A σ_A. The score for each (Q, A) pair is then computed as the cosine similarity between r_Q and r_A.
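The attentive pooling computation can be sketched in a few lines of numpy; the dimensions and the bilinear form follow the equations above, while the variable names are illustrative:

```python
import numpy as np

def attentive_pooling(M_q, M_a, U):
    """Attentive pooling over BiGRU states.

    M_q: (d, n) question states, M_a: (d, m) answer states, U: (d, d).
    Returns the cosine score between the attended representations.
    """
    G = np.tanh(M_q.T @ U @ M_a)            # (n, m) soft alignment matrix
    g_q = G.max(axis=1)                     # max over answer positions -> question importance
    g_a = G.max(axis=0)                     # max over question positions -> answer importance
    s_q = np.exp(g_q) / np.exp(g_q).sum()   # softmax attention weights
    s_a = np.exp(g_a) / np.exp(g_a).sum()
    r_q = M_q @ s_q                         # (d,) attended question vector
    r_a = M_a @ s_a                         # (d,) attended answer vector
    return float(r_q @ r_a / (np.linalg.norm(r_q) * np.linalg.norm(r_a)))
```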
Our one-way attention model is a simplified version of the attentive pooling model above and is most similar to the global attention model introduced by Luong et al. [22]. We did not use the one-way attention of Tan et al. [16] to avoid deviating from the attention mechanism significantly. The pooling over G is replaced with the last hidden state of the BiGRU over A, which yields the importance vector g_Q directly. Again, we create the normalized attention vector σ_Q by applying the softmax activation function. The final representations are r_Q = M_Q σ_Q and r_A, the last hidden state of A.
V-A SelQA: Selection-based QA Corpus
Table IV shows the distributions of our corpus, called SelQA. Our corpus is split into training (70%), development (10%), and evaluation (20%) sets. The answer triggering data (AT) is significantly larger than the answer sentence selection data (ASS), due to the extra sections added by Task 5 (Section III-B).
V-B Answer Sentence Selection
Table V shows the results of our approaches and previous ones on WikiQA. Two metrics are used for the evaluation of this task: mean average precision (MAP) and mean reciprocal rank (MRR). CNN-base is our replication of the best model of Severyn and Moschitti [6]. CNN-word and CNN-emb are the CNN models using the subtree matching in Section IV-A, where the comparator of match is either the word form or the word embedding, respectively, and Σ = avg. The subtree matching models consistently outperform the baseline model. Note that among the three aggregations of Σ (avg, sum, and max), avg outperformed the others in our experiments for answer sentence selection, although no significant differences were found. RNN-ow and RNN-ap are the RNN models using the one-way attention and the attentive pooling in Section IV-B, respectively. Note that RNN-ap converged much faster than RNN-ow at the same learning rate and a fixed number of parameters in our experiments, implying that two-way attention assists with optimization.
| Model              | Dev MAP | Dev MRR | Eval MAP | Eval MRR |
|--------------------|---------|---------|----------|----------|
| CNN: avg + word    | 70.75   | 71.46   | 67.40    | 69.30    |
| CNN: avg + emb     | 69.22   | 70.18   | 68.78    | 70.82    |
| Yang et al. [2]    | -       | -       | 65.20    | 66.52    |
| Santos et al. [17] | -       | -       | 68.86    | 69.57    |
| Miao et al. [23]   | -       | -       | 68.86    | 70.69    |
| Yin et al. [24]    | -       | -       | 69.21    | 71.08    |
| Wang et al. [25]   | -       | -       | 70.58    | 72.26    |
It is interesting to see how CNN-word and RNN-ow outperform CNN-emb and RNN-ap, respectively, on the development set but not on the evaluation set. This result may be explained by the larger percentage of overlapping words in the development set, enabling the simpler models to perform more effectively.
| Model           | Dev MAP | Dev MRR | Eval MAP | Eval MRR |
|-----------------|---------|---------|----------|----------|
| CNN: avg + word | 85.04   | 86.17   | 84.00    | 84.94    |
| CNN: avg + emb  | 85.70   | 86.67   | 84.66    | 85.68    |
Table VI shows the results achieved by our models on SelQA. CNN-emb outperforms the other CNN models, indicating the power of subtree matching coupled with word embeddings. RNN-ap outperforms RNN-ow, indicating the importance of attention over the questions. Unlike the results on WikiQA in Table V, CNN-emb and RNN-ap show the best performance on both the development and evaluation sets, implying the robustness of these models on our corpus.
Table VII shows the MRR scores of our models on SelQA with respect to different topics. All models show strength on topics such as 'Country' and 'Historical Events', which is comprehensible since questions in these topics tend to be deterministic. On the other hand, most models show weakness on topics such as 'TV', 'Arts', or 'Music'. This may be due to the fact that few overlapping words are found between question and answer pairs in these documents, which also contain many segments caused by bullet points.
Table VIII shows comparisons between questions from Tasks 1 and 2 (original) and Task 3 (paraphrase) in Section III-B. As expected, a noticeable performance drop is found for the paraphrased questions, which share far fewer overlapping words with the answer contexts than the original questions.
Table IX shows the MRR scores with respect to question types. The CNN models show strength on the 'who' type, whereas the RNN models show strength on the 'when' type. Each model varies in where it shows weakness, which we will explore in the future. Finally, Figure 4 shows the performance difference with respect to question and section lengths. All but one of the models tend to perform better as questions become longer. This makes sense since longer questions are usually more informative. On the other hand, models generally perform worse as sections become longer, which also makes sense because the models have to select the answer contexts from larger pools.
V-C Answer Triggering
Due to the nature of answer triggering, the metrics used for evaluating answer sentence selection do not apply here, because those metrics assume that models are always provided with contexts including the answers. Broadly speaking, answer sentence selection is a ranking problem, while answer triggering is a binary classification task with additional constraints. Thus, the question-level F1-score proposed by Yang et al. [2] is used as the evaluation metric for this task.
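The question-level F1 can be sketched as follows, assuming per-question predictions where `None` means the model triggers no answer (the data layout is illustrative):

```python
def triggering_f1(predictions, gold):
    """Question-level precision/recall/F1 for answer triggering.

    predictions: question id -> predicted answer sentence id, or None
                 when the model triggers no answer.
    gold:        question id -> set of correct sentence ids (empty set
                 when the question has no answer among the candidates).
    """
    tp = sum(1 for q, p in predictions.items() if p is not None and p in gold[q])
    predicted = sum(1 for p in predictions.values() if p is not None)
    answerable = sum(1 for g in gold.values() if g)
    prec = tp / predicted if predicted else 0.0
    rec = tp / answerable if answerable else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```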
Table X shows the answer triggering results on WikiQA. Note that RNN-ow, using one-way attention, was dropped from these experiments because it did not show comparable performance for this task. Interestingly, the CNN model with Σ = max outperformed the other aggregations for answer triggering, although avg was found to be the most effective for answer sentence selection. The CNN subtree matching models consistently gave over 2% improvements over the baseline model.
| Model           | Dev P | Dev R | Dev F1 | Eval P | Eval R | Eval F1 |
|-----------------|-------|-------|--------|--------|--------|---------|
| CNN: max + word | 44.53 | 45.24 | 44.88  | 29.77  | 42.39  | 34.97   |
| CNN: max + emb  | 43.07 | 46.83 | 44.87  | 29.77  | 42.39  | 34.97   |
| CNN: max + emb+ | 44.44 | 44.44 | 44.44  | 29.43  | 48.56  | 36.65   |
| Yang et al. [2] | -     | -     | -      | 27.96  | 37.86  | 32.17   |
In addition, CNN-emb was also evaluated with retrained word embeddings (emb+), which performed slightly worse on the development set but gave another 1.68% improvement on the evaluation set (retraining word embeddings was not found to be useful for answer sentence selection). RNN-ap showed a result very similar to previous work, which was surprising since it performed much better for answer sentence selection. This may be due to a lack of hyper-parameter optimization, which we leave as future work.
| Model           | Dev P | Dev R | Dev F1 | Eval P | Eval R | Eval F1 |
|-----------------|-------|-------|--------|--------|--------|---------|
| CNN: max + word | 48.15 | 47.99 | 48.07  | 52.22  | 47.30  | 49.64   |
| CNN: max + emb  | 49.32 | 48.99 | 49.16  | 53.69  | 48.38  | 50.89   |
| CNN: max + emb+ | 47.16 | 47.32 | 47.24  | 52.14  | 47.14  | 49.51   |
Table XI shows the answer triggering results on SelQA. Unlike the results on WikiQA (Table X), CNN-emb outperforms CNN-emb+ on our corpus. On the other hand, RNN-ap shows a similar relative score to the CNN models as it does on WikiQA. CNN using subtree matching gives over a 5% improvement over the baseline model, which is significant.
Table XII shows the accuracies on SelQA with respect to different topics. The accuracy is measured on the subset of questions that contain at least one answer among the candidates; the top-ranked sentence is taken and checked for the correct answer. As in answer sentence selection, the CNN models still show strength on topics such as 'Country' and 'Historical Events', but the trend is not as clear for the other models. The worst performing topics are 'TV', 'Music', and 'Arts'. Such a noticeable difference might be caused by the unusual semantic sentence constructions in these texts: sections in these categories often contain listings, bullet-pointed text, etc., which the models struggle to handle properly. Correctly understanding and answering questions from such contexts will be a challenge for future systems. Also, interestingly, the standard deviation across topics is much smaller for RNN-ap (3.9%) than for the CNN models (10-12%), although RNN-ap's overall performance is lower.
Table XIV shows the accuracies on SelQA with respect to question types. Interestingly, each model shows strength on different types, which may suggest the possibility of an ensemble model. Finally, Figure 5 shows the performance difference with respect to question and section lengths for the answer triggering task. All the models tend to perform better as questions become longer; as in the answer sentence selection task, this is understandable since longer questions are more informative. Interestingly, accuracy also increases as sections become longer. We hypothesize that this behavior is caused by the fact that it is easier for the models to decide whether the context of a section matches the context of the question when there is more information (more sentences) in the section. This phenomenon is specific to answer triggering, where the model must not only choose the sentence with the answer but first decide whether the context matches at all.
In this paper we present a new benchmark for two major question answering tasks: answer sentence selection and answer triggering. Several systems using neural networks are developed for the analysis of our corpus. Our analysis shows different aspects of current QA approaches that are beneficial for further enhancement.
Research devoted to relatively small datasets has revealed useful characteristics of the question answering tasks; however, techniques that yield improvements on smaller datasets are often significantly diminished on larger ones. Current hardware trends and the availability of larger datasets make large-scale question answering more accessible.
We plan to continue our work on providing large scale corpora for open-domain question answering. Also, we intend to continue working towards providing context-aware frameworks for question answering.
We gratefully acknowledge the support from Infosys Ltd. Any contents in this material are those of the authors and do not necessarily reflect the views of Infosys Ltd.
-  M. Wang, N. A. Smith, and T. Mitamura, “What is the Jeopardy Model? A Quasi-Synchronous Grammar for QA,” in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, ser. EMNLP-CoNLL’07, 2007, pp. 22–32.
-  Y. Yang, W.-t. Yih, and C. Meek, “WIKIQA: A Challenge Dataset for Open-Domain Question Answering,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, ser. EMNLP’15, 2015, pp. 2013–2018.
-  M. Feng, B. Xiang, M. R. Glass, L. Wang, and B. Zhou, “Applying Deep Learning to Answer Selection: A Study and An Open Task,” in IEEE Workshop on Automatic Speech Recognition and Understanding, 2015, pp. 813–820.
-  L. Yu, K. M. Hermann, P. Blunsom, and S. Pulman, “Deep Learning for Answer Sentence Selection,” in Proceedings of the NIPS Deep Learning Workshop, 2014.
-  D. Wang and E. Nyberg, “A Long Short-Term Memory Model for Answer Sentence Selection in Question Answering,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, ser. ACL’15, 2015, pp. 707–712.
-  A. Severyn and A. Moschitti, “Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks,” in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’15, 2015, pp. 373–382.
-  L. Yu, K. M. Hermann, P. Blunsom, and S. Pulman, “Deep learning for answer sentence selection,” arXiv preprint arXiv:1412.1632, 2014.
-  M. Iyyer, J. Boyd-Graber, L. Claudino, R. Socher, and H. Daumé III, “A neural network for factoid question answering over paragraphs,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 633–644.
-  L. Dong, F. Wei, M. Zhou, and K. Xu, “Question answering over freebase with multi-column convolutional neural networks,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, vol. 1, 2015, pp. 260–269.
-  W. Yin, H. Schütze, B. Xiang, and B. Zhou, “Abcnn: Attention-based convolutional neural network for modeling sentence pairs,” arXiv preprint arXiv:1512.05193, 2015.
-  W.-t. Yih, X. He, and C. Meek, “Semantic parsing for single-relation question answering,” in Proceedings of ACL, 2014.
-  P. Blunsom, E. Grefenstette, N. Kalchbrenner et al., “A Convolutional Neural Network for Modelling Sentences,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014.
-  M. Heilman and N. A. Smith, “Tree Edit Models for Recognizing Textual Entailments, Paraphrases, and Answers to Questions,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, ser. HLT’10, 2010, pp. 1011–1019.
-  M. Wang and C. Manning, “Probabilistic Tree-Edit Models with Structured Latent Variables for Textual Entailment and Question Answering,” in Proceedings of the 23rd International Conference on Computational Linguistics, ser. COLING’10, 2010, pp. 1164–1172.
-  A. Severyn and A. Moschitti, “Automatic Feature Engineering for Answer Selection and Extraction,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, ser. EMNLP’13, 2013, pp. 458–467.
-  M. Tan, B. Xiang, and B. Zhou, “LSTM-based Deep Learning Models for Non-factoid Answer Selection,” arXiv, vol. arXiv:1511.04108, 2015.
-  C. N. d. Santos, M. Tan, B. Xiang, and B. Zhou, “Attentive pooling networks,” CoRR, vol. abs/1602.03609, 2016. [Online]. Available: http://arxiv.org/abs/1602.03609
-  C.-J. Ho, A. Slivkins, S. Suri, and J. W. Vaughan, “Incentivize High Quality Crowdwork,” in Proceedings of the 24th World Wide Web Conference, ser. WWW’15, 2015.
-  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed Representations of Words and Phrases and their Compositionality,” in Proceedings of Advances in Neural Information Processing Systems 26, ser. NIPS’13, 2013, pp. 3111–3119.
-  J. D. Choi and A. McCallum, “Transition-based Dependency Parsing with Selectional Branching,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ser. ACL’13, 2013, pp. 1052–1062.
-  K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, ser. EMNLP’14, 2014, pp. 1724–1734.
-  T. Luong, H. Pham, and C. D. Manning, “Effective Approaches to Attention-based Neural Machine Translation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, ser. EMNLP’15, 2015, pp. 1412–1421.
-  Y. Miao, L. Yu, and P. Blunsom, “Neural Variational Inference for Text Processing,” arXiv, vol. arXiv:1511.06038, 2015.
-  W. Yin, H. Schütze, B. Xiang, and B. Zhou, “ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs,” arXiv, vol. arXiv:1512.05193, 2015.
-  Z. Wang, H. Mi, and A. Ittycheriah, “Sentence Similarity Learning by Lexical Decomposition and Composition,” arXiv, vol. arXiv:1602.07019, 2016.