Conclusion-Supplement Answer Generation for Non-Factoid Questions

by   Makoto Nakatsuji, et al.

This paper tackles the goal of conclusion-supplement answer generation for non-factoid questions, which is a critical issue in the field of Natural Language Processing (NLP) and Artificial Intelligence (AI), as users often require supplementary information before accepting a conclusion. The current encoder-decoder framework, however, has difficulty generating such answers, since it may become confused when it tries to learn several different long answers to the same non-factoid question. Our solution, called an ensemble network, goes beyond single short sentences and fuses logically connected conclusion statements and supplementary statements. It extracts the context from the conclusion decoder's output sequence and uses it to create supplementary decoder states on the basis of an attention mechanism. It also assesses the closeness of the question encoder's output sequence and the separate outputs of the conclusion and supplement decoders as well as their combination. As a result, it generates answers that match the questions and have natural-sounding supplementary sequences in line with the context expressed by the conclusion sequence. Evaluations conducted on datasets including "Love Advice" and "Arts Humanities" categories indicate that our model outputs much more accurate results than the tested baseline models do.



There are no comments yet.


page 1

page 2

page 3

page 4


Answer Generation through Unified Memories over Multiple Passages

Machine reading comprehension methods that generate answers by referring...

Can AI Generate Love Advice?: Toward Neural Answer Generation for Non-Factoid Questions

Deep learning methods that extract answers for non-factoid questions fro...

Automated Question Answer medical model based on Deep Learning Technology

Artificial intelligence can now provide more solutions for different pro...

Tag and Correct: Question aware Open Information Extraction with Two-stage Decoding

Question Aware Open Information Extraction (Question aware Open IE) take...

VANiLLa : Verbalized Answers in Natural Language at Large Scale

In the last years, there have been significant developments in the area ...

How Good is Artificial Intelligence at Automatically Answering Consumer Questions Related to Alzheimer's Disease?

Alzheimer's Disease (AD) is the most common type of dementia, comprising...

Context-Dependent Semantic Parsing over Temporally Structured Data

We describe a new semantic parsing setting that allows users to query th...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.


Question Answering (QA) modules play particularly important roles in recent dialog-based Natural Language Understanding (NLU) systems, such as Apple’s Siri and Amazon’s Echo. Users chat with AI systems in natural language to get the answers they are seeking. QA systems can deal with two types of question: factoid and non-factoid ones. The former sort asks, for instance, for the name of a thing or person such as “What/Who is ?”. The latter sort includes more diverse questions that cannot be answered by a short fact. For instance, users may ask for advice on how to make a long-distance relationship work well or for opinions on public issues. Significant progress has been made in answering factoid questions [25, 31]; however, answering non-factoid questions remains a challenge for QA modules.

Long short term memory (LSTM) sequence-to-sequence models [20, 24, 1] try to generate short replies to the short utterances often seen in chat systems. Evaluations have indicated that these models have the possibility of supporting simple forms of general knowledge QA, e.g. “Is the sky blue or black?”, since they learn commonly occurring sentences in the training corpus. Recent machine reading comprehension (MRC) methods [11, 15] try to return a single short answer to a question by extracting answer spans from the provided passages. Unfortunately, they may generate unsatisfying answers to regular non-factoid questions because they can easily become confused when learning several different long answers to the same non-factoid question, as pointed out by [5, 26].

This paper tackles a new problem: conclusion-supplement answer generation for non-factoid questions. Here, the conclusion consists of sentences that directly answer the question, while the supplement consists of information supporting the conclusion, e.g., reasons or examples. Such conclusion-supplement answers are important for helping questioners decide their actions, especially in NLU. As described in [3], users prefer a supporting supplement before accepting an instruction (i.e., a conclusion). Good debates also include claims (i.e., conclusions) about a topic and supplements to support them that will allow users to reach decisions [16]. The following example helps to explain how conclusion-supplement answers are useful to users: “Does separation by a long distance ruin love?” Current methods tend to answer this question with short and generic replies, such as, “Distance cannot ruin true love”. The questioner, however, is not likely to be satisfied with such a trite answer and will want to know how the conclusion was reached. If a supplemental statement like “separations certainly test your love” is presented with the conclusion, the questioner is more likely to accept the answer and use it to reach a decision. Furthermore, there may be multiple answers to a non-factoid question. For example, the following answer is also a potential answer to the question: “distance ruins most relationships. You should keep in contact with him”. The current methods, however, have difficulty generating such conclusion-supplement answers because they can become easily confused when they try to learn several different and long answers to a non-factoid question.

To address the above problem, we propose a novel architecture, called the ensemble network. It is an extension of existing encoder-decoder models, and it generates two types of decoder output sequence, conclusion and supplement. It uses two viewpoints for selecting the conclusion statements and supplementary statements. (Viewpoint 1) The context present in the conclusion decoder’s output is linked to supplementary-decoder output states on the basis of an attention mechanism. Thus, the context of the conclusion sequence directly impacts the decoder states of the supplement sequences. This, as a result, generates natural-sounding supplementary sequences. (Viewpoint 2) The closeness of the question sequence and conclusion (or supplement) sequence as well as the closeness of the question sequence with the combination of conclusion and supplement sequences is considered. By assessing the closeness at the sentence level and sentence-combination level in addition to at the word level, it can generate answers that include good supplementary sentences following the context of the conclusion. This avoids having to learn several different conclusion-supplement answers assigned to a single non-factoid question and generating answers whose conclusions and supplements are logically inconsistent with each other.

Community-based QA (CQA) websites tend to provide answers composed of conclusion and supplementary statements; from our investigation, 77% of non-factoid answers (love advice) in the Oshiete-goo ( dataset consist of these two statement types. The same is true for 82% of the answers in the Yahoo non-factoid dataset111 related to the fields of social science, society & culture and arts & humanities. We used the above-mentioned CQA datasets in our evaluations, since they provide diverse answers given by many responders. The results showed that our method outperforms existing ones at generating correct and natural answers. We also conducted an love advice service222 in Oshiete goo to evaluate the usefulness of our ensemble network.

Related work

The encoder-decoder framework learns how to transform one representation into another. Contextual LSTM (CLSTM) incorporates contextual features (e.g., topics) into the encoder-decoder framework [4, 17]. It can be used to make the context of the question a part of the answer generation process. HieRarchical Encoder Decoder (HRED) [17]

extends the hierarchical recurrent encoder-decoder neural network into the dialogue domain; each question can be encoded into a dense context vector, which is used to recurrently decode the tokens in the answer sentences. Such sequential generation of next statement tokens, however, weakens the original meaning of the first statement (question). Recently, several models based on the Transformer

[23], such as for passage ranking [12, 9] and answer selection [18], have been proposed to evaluate question-answering systems. There are, however, few Transformer-based methods that generate non-factoid answers.

Figure 1: Neural conclusion-supplement answer generation model.

Recent neural answer selection methods for non-factoid questions [2, 14, 22] learn question and answer representations and then match them using certain similarity metrics. They use open datasets stored at CQA sites like Yahoo! Answers since they include many diverse answers given by many responders and thus are good sources of non-factoid QA training data. The above methods, however, can only select and extract answer sentences, they do not generate them.

Recent machine reading comprehension methods try to answer a question with exact text spans taken from provided passages [30, 15, 27, 6]. Several studies on the MS-MARCO dataset [21, 11, 26] define the task as using multiple passages to answer a question where the words in the answer are not necessarily present in the passages. Their models, however, require passages other than QA pairs for both training and testing. Thus, they cannot be applied to CQA datasets that do not have such passages. Furthermore, most of the questions in their datasets only have a single answer. Thus, we think their purpose is different from ours; generating answers for non-factoid questions that tend to demand diverse answers.

There are several complex QA tasks such as those present in the TREC complex interactive QA tasks333 jimmylin/ciqa/ or DUC444 complex QA tasks. Our method can be applied to those non-factoid datasets if an access fee is paid.


This section describes our conclusion-supplement answer generation model in detail. An overview of its architecture is shown in Figure 1.

Given an input question sequence , the proposal outputs a conclusion sequence , and supplement sequence . The goal is to learn a function mapping from to and . Here, denotes a one-of- embedding of the -th word in an input sequence of length . () denotes a one-of- embedding of the -th word in an input sequence of length ().


The encoder converts the input into a question embedding, , and hidden states, .

Since the question includes several pieces of background information on the question, e.g. on the users’ situation, as well as the question itself, it can be very long and composed of many sentences. For this reason, we use the BiLSTM encoder, which encodes the question in both directions, to better capture the overall meaning of the question. It processes both directions of the input, and , sequentially. At time step , the encoder updates the hidden state by:

where is an LSTM unit, and and are hidden states output by the forward-direction LSTM and backward-direction LSTM, respectively.

We also want to reflect sentence-type information such as conclusion type or supplement type in sequence-to-sequence learning to better understand the conclusion or supplement sequences. We achieve this by adding a sentence type vector for conclusion or for supplement to the input gate, forget gate output gate, and cell memory state in the LSTM model. This is equivalent to processing a composite input [, ] or [, ] in the LSTM cell that concatenates the word embedding and sentence-type embedding vectors. We use this modified LSTM in the above BiLSTM model as:

When encoding the question to decode the supplement sequence, is input instead of in the above equation.

The BiLSTM encoder then applies a max-pooling layer to all hidden vectors to extract the most salient signal for each word. As a result, it generates a fixed-sized distributed vector representation for the conclusion,

, and another for the supplement, . and are different since the encoder is biased by the corresponding sentence-type vector, or .

As depicted in Figure 1, the BiLSTM encoder processes each word with a sentence-type vector (i.e. or ) and the max-pooling layer to produce the question embedding or . These embeddings are used as context vectors in the decoder network for the conclusion and supplement.


The decoder is composed of a conclusion decoder and supplement decoder. Here, let be the hidden state of the -th LSTM unit in the conclusion decoder. Similar to the encoder, the decoder also decodes a composite input [, ] in an LSTM cell that concatenates the conclusion word embedding and sentence-type embedding vectors. It is formulated as follows:

where denotes the conclusion decoder LSTM,

the probability of word

given by a softmax layer,

the -th conclusion decoded token, and the word embedding of . The supplement decoder’s hidden state is computed in the same way with ; however, it is updated in the ensemble network described in the next subsection.

As depicted in Figure 1, the LSTM decoder processes tokens according to question embedding or , which yields a bias corresponding to the sentence-type vector, or . The output states are then input to the ensemble network.

Ensemble network

The conventional encoder-decoder framework often generates short and simple sentences that fail to adequately answer non-factoid questions. Even if we force it to generate longer answers, the decoder output sequences become incoherent when read from the beginning to the end.

The ensemble network solves the above problem by (1) passing the context from the conclusion decoder’s output sequence to the supplementary decoder hidden states via an attention mechanism, and (2) considering the closeness of the encoder’s input sequence to the decoders’ output sequences as well as the closeness of the encoder’s input sequence to the combination of decoded output sequences.

(1) To control the context, we assess all the information output by the conclusion decoder and compute the conclusion vector, . is a sentence-level representation that is more compact, abstractive, and global than the original decoder output sequence. To get it, we apply BiLSTM to the conclusion decoder’s output states ; i.e., , where word representation matrix holds the word representations in its columns. At time step , the BiLSTM encoder updates the hidden state by:

where and are the hidden states output by the forward LSTM and backward LSTM in the conclusion encoder, respectively. It applies a max-pooling layer to all hidden vectors to extract the most salient signal for each word to compute the embedding for conclusion . Next, it computes the context vector at the -th step by using the -th output hidden state of the supplement decoder, , weight matrices, and

, and a sigmoid function,


This computation lets our ensemble network extract a conclusion-sentence level context. The resulting supplement sequences follow the context of the conclusion sequence. Finally, is computed as:


can be , , or , which represent three gates (e.g., input , forget , and output ). denotes a cell memory vector. and denote attention parameters.

(2) To control the closeness at the sentence level and sentence-combination level, it assesses all the information output by the supplement decoder and computes the supplement vector, , in the same way as it computes . That is, it applies BiLSTM to the supplement decoder’s output states ; i.e., , where the word representations are found in the columns of . Next, it applies a max-pooling layer to all hidden vectors in order to compute the embeddings for supplement . Finally, to generate the conclusion-supplement answers, it assesses the closeness of the embeddings for the question to those for the answer sentences ( or ) and their combination and

. The loss function for the above metrics is described in the next subsection.

As depicted in Figure 1, the ensemble network computes the conclusion embedding , the attention parameter weights from to the decoder output supplement states (dotted lines represent attention operations), and the supplement embedding . Then, and are input to the loss function together with the question embedding .

Loss function of ensemble network

Our model uses a new loss function rather than generative supervision, which aims to maximize the conditional probability of generating the sequential output . This is because we think that assessing the closeness of the question and an answer sequence as well as the closeness of the question to two answer sequences is useful for generating natural-sounding answers.

The loss function is for optimizing the closeness of the question and conclusion and that of the question and supplement as well as for optimizing the closeness of the question with the combination of the conclusion and supplement. The training loss is expressed as the following hinge loss, where is the output decoder vector for the ground-truth answer, is that for an incorrect answer randomly chosen from the entire answer space, is a constant margin, and is set equal to :

The key idea is that checks whether or not the conclusion, supplement, and their combination have been well predicted. In so doing, can optimize not only the prediction of the conclusion or supplement but also the prediction of the combination of conclusion and supplement.

The model is illustrated in the upper part of Figure 1; is input to compute the closeness and sequence combination losses.


The training loss is used to check and the cross-entropy loss in the encoder-decoder model. In the following equation, the conclusion and supplement sequences are merged into one sequence of length , where .


is a parameter to control the weighting of the two losses. We use adaptive stochastic gradient descent (AdaGrad) to train the model in an end-to-end manner. The loss of a training batch is averaged over all instances in the batch.

Figure 1 illustrates the loss for the ensemble network and the cross-entropy loss.


Compared methods

We compared the performance of our method with those of (1) Seq2seq

, a seq2seq attention model proposed by

[1]; (2) CLSTM, i.e., the CLSTM model [4]; (3) Trans, the Transformer [23], which has proven effective for common NLP tasks. In these three methods, conclusion sequences and supplement sequences are decoded separately and then joined to generate answers. They give more accurate results than methods in which the conclusion sequences and supplement sequences are decoded sequentially. We also compared (4) HRED, a hierarchical recurrent encoder-decoder model [17] in which conclusion sequences and supplement sequences are decoded sequentially to learn the context from conclusion to supplement; (5) NAGMWA, i.e., our neural answer generation model without an attention mechanism. This means that NAGMWA does not pass in Eq. (1) to the decoder, and conclusion decoder and supplement decoder are connected only via the loss function . In the tables and figures that follow, NAGM means our full model.


Our evaluations used the following two CQA datasets:


The Oshiete-goo dataset includes questions stored in the “love advice” category of the Japanese QA site, Oshiete-goo. It has 771,956 answers to 189,511 questions. We fine-tuned the model using a corpus containing about 10,032 question-conclusion-supplement (q-c-s) triples. We used 2,824 questions from the Oshiete-goo dataset. On average, the answers to these questions consisted of about 3.5 conclusions and supplements selected by human experts. The questions, conclusions, and supplements had average lengths of 482, 41, and 46 characters, respectively. There were 9,779 word tokens in the questions and 6,317 tokens in answers; the overlap was 4,096.


We also used the Yahoo nfL6 dataset, the largest publicly available English non-factoid CQA dataset. It has 499,078 answers to 87,361 questions. We fine-tuned the model by using questions in the “social science”, “society & culture”, and “arts & humanities” categories, since they require diverse answers. This yielded 114,955 answers to 13,579 questions. We removed answers that included some stop words, e.g. slang words, or those that only refer to some URLs or descriptions in literature, since such answers often become noise when an answer is generated. Human experts annotated 10,299 conclusion-supplement sentences pairs in the answers.

In addition, we used a neural answer-sentence classifier to classify the sentences into conclusion or supplement classes. It first classified the sentences into supplements if they started with phrases such as “this is because” or “therefore”. Then, it applied a BiLSTM with max-pooling to the remaining unclassified sentences,

, and generated embeddings for the un-annotated sentences, . After that, it used a logistic sigmoid function to return the probabilities of mappings to two discrete classes: conclusion and supplement. This mapping was learned by minimizing the classification errors using the above 10,299 labeled sentences. As a result, we automatically acquired 70,000 question-conclusion-supplement triples from the entire answers. There were 11,768 questions and 70,000 answers. Thus, about 6 conclusions and supplements on average were assigned to a single question. The questions, conclusions, and supplements had average lengths of 46, 87, and 71 characters, respectively. We checked the performance of the classifier; human experts checked whether the annotation results were correct or not. They judged that it was about 81% accurate (it classified 56,762 of 70,000 sentences into correct classes). There were 15,690 word tokens in questions and 124,099 tokens in answers; the overlap was 11,353.

Oshiete-goo nfL6
0 1 2 0 1 2
ROUGE-L 0.251 0.299 0.211 0.330 0.402 0.295
BLEU-4 0.098 0.158 0.074 0.062 0.181 0.023
Table 1: Results when changing .
Oshiete-goo nfL6
NAGM w/o ste NAGM w/o ste
ROUGE-L 0.299 0.235 0.402 0.349
BLEU-4 0.158 0.090 0.181 0.067
Table 2: Results when using sentence-type embeddings.
ROUGE-L 0.238 0.260 0.278 0.210 0.291 0.299
BLEU-4 0.092 0.121 0.087 0.042 0.147 0.158
Table 3: ROUGE-L/BLEU-4 for Oshiete-goo.
ROUGE-L 0.291 0.374 0.338 0.180 0.383 0.402
BLEU-4 0.081 0.141 0.122 0.055 0.157 0.181
Table 4: ROUGE-L/BLEU-4 for nfL6.
(1) (2) (3) (4) (1) (2) (3) (4)
21 18 27 34 47 32 11 10
Table 6: Human evaluation (nfL6).
(1) (2) (3) (4) (1) (2) (3) (4)
30 3 27 40 50 23 16 11
Table 5: Human evaluation (Oshiete-goo).


We conducted three evaluations using the Oshiete-goo dataset; we selected three different sets of 500 human-annotated test pairs from the full dataset. In each set, we trained the model by using training pairs and input questions in test pairs to the model. We repeated the experiments three times by randomly shuffling the train/test sets.

For the evaluations using the nfL6 dataset, we prepared three different sets of 500 human-annotated test q-c-s triples from the full dataset. We used 10,299 human-annotated triples to train the neural sentence-type classifier. Then, we applied the classifier to the unlabeled answer sentences. Finally, we evaluated the answer generation performance by using three sets of machine-annotated 69,500 triples and 500 human-annotated test triples.

After training, we input the questions in the test triples to the model to generate answers for both datasets. We compared the generated answers with the correct answers. The results described below are average values of the results of three evaluations.

The softmax computation was slow since there were so many word tokens in both datasets. Many studies [29, 28, 24] restricted the word vocabulary to one based on frequency. This, however, narrows the diversity of the generated answers. Since diverse answers are necessary to properly reply to non-factoid questions, we used bigram tokens instead of word tokens to speed up the computation without restricting the vocabulary. Accordingly, we put 4,087 bigram tokens in the Oshiete-goo dataset and 11,629 tokens in the nfL6 dataset.

To measure performance, we used human judgment as well as two popular metrics [20, 28, 1]

for measuring the fluency of computer-generated text: ROUGE-L

[8] and BLEU-4 [13]. ROUGE-L is used for measuring the performance for evaluating non-factoid QAs [19], however, we also think human judgement is important in this task.

Parameter setup

For both datasets, we tried different parameter values and set the size of the bigram token embedding to 500, the size of LSTM output vectors for the BiLSTMs to , and number of topics in the CLSTM model to 15. We tried different margins, , in the hinge loss function and settled on . The iteration count was set to .

We varied in Eq. (2) from 0 to 2.0 and checked the impact of by changing . Table 1 shows the results. When is zero, the results are almost as poor as those of the seq2seq model. On the other hand, while raising the value of places greater emphasis on our ensemble network, it also degrades the grammaticality of the generated results. We set to 1.0 after determining that it yielded the best performance. This result clearly indicates that our ensemble network contributes to the accuracy of the generated answers.

A comparison of our full method NAGM with the one without the sentence-type embedding (we call this method w/o ste) that trains separate decoders for two types of sentences is shown in Table 2. The result indicated that the existence of the sentence type vector, or , contributes the accuracy of the results since it distinguishes between sentence types.



The results for Oshiete-goo are shown in Table 3 and those for nfL6 are shown in Table 4. They show that CLSTM is better than Seq2seq. This is because it incorporates contextual features, i.e. topics, and thus can generate answers that track the question’s context. Trans is also better than Seq2seq, since it uses attention from the question to the conclusion or supplement more effectively than Seq2seq. HRED failed to attain a reasonable level of performance. These results indicate that sequential generation has difficulty generating subsequent statements that follow the original meaning of the first statement (question).

NAGMWA is much better than the other methods except NAGM, since it generates answers whose conclusions and supplements as well as their combinations closely match the questions. Thus, conclusions and supplements in the answers are consistent with each other and avoid confusion made by several different conclusion-supplement answers assigned to a single non-factoid questions. Finally, NAGM is consistently superior to the conventional attentive encoder-decoders regardless of the metric. Its ROUGE-L and BLEU-4 scores are much higher than those of CLSTM. Thus, NAGM generates more fluent sentences by assessing the context from conclusion to supplement sentences in addition to the closeness of the question and sentences as well as that of the question and sentence combinations.

Table 7: Example answers generated by CLSTM and NAGM. #1 is for Oshiete-goo and #2 for nfL6.

Human evaluation

Following evaluations made by crowdsourced evaluators [7], we conducted human evaluations to judge the outputs of CLSTM and those of NAGM. Different from [7], we hired human experts who had experience in Oshiete-goo QA community service. Thus, they were familiar with the sorts of answers provided by and to the QA community.

The experts asked questions, which were not included in our training datasets, to the AI system and rated the answers; one answer per question. The experts rated the answers as follows: (1) the content of the answer matched the question, and the grammar was okay; (2) the content was suitable, but the grammar was poor; (3) the content was not suitable, but the grammar was okay; (4) both the content and grammar were poor. Note that our evaluation followed the DUC-style strategy555 Here, we mean “grammar” to cover grammaticality, non-redundancy, and referential clarity in the DUC strategy, whereas we mean the “content matched the questions” to refer to “focus” and “structure and coherence” in the DUC strategy. The evaluators were given more than a week to carefully evaluate the generated answers, so we consider that their judgments are reliable. Each expert evaluated 50 questions. We combined the scores of the experts by summing them. They did not know the identity of the system in the evaluation and reached their decisions independently.

Table 6 and Table 6 present the results. The numbers are percentages. Table 7 presents examples of questions and answers. For Oshiete-goo results, the original Japanese and translated English are presented. The questions are very long and include long background descriptions before the questions themselves.

These results indicate that the experts were much more satisfied with the outputs of NAGM than those of CLSTM. This is because, as can be seen in Table 7, NAGM generated longer and better question-related sentences than CLSTM did. NAGM generated grammatically good answers whose conclusion and supplement statements are well matched with the question and the supplement statement naturally follows the conclusion statement.

Generating answers missing from the corpus

The encoder-decoder network tends to re-generate answers in the training corpus. On the other hand, NAGM can generate answers not present in the corpus by virtue of its ensemble network that considers contexts and sentence combinations.

Table 7 lists some examples. For example, answer #1 generated by NAGM is not in the training corpus. We think it was generated from the parts in italics in the following three sentences that are in the corpus: (1) “I think that it is better not to do anything from your side. If there is no reaction from him, it is better not to do anything even if there is opportunity to meet him next.” (2) “I think it may be good for you to approach your lover. Why don’t you think positively about it without thinking too pessimistically?” (3) “Why don’t you tell your lover that you usually do not say what you are thinking. I think that it is important to communicate the feelings to your lover; how you like or care about him/her especially when you are quarreling with each other.

The generation of new answers is important for non-factoid answer systems, since they must cope with slight differences in question contexts from those in the corpus.

Online evaluation in “Love Advice” service

Our ensemble network is currently being used in the love advice service of Oshiete goo [10]. The service uses only the ensemble network to ensure that the service offers high-quality output free from grammar errors. We input the sequences in our evaluation corpus instead of the decoder output sequences into the ensemble network. Our ensemble network then learned the optimum combination of answer sequences as well as the closeness of the question and those sequences. As a result, it can construct an answer that corresponds to the situation underlying the question. In particular, 5,702 answers created by the AI, whose name is Oshi-el (Oshi-el means teaching angel), using our ensemble network in reply to 33,062 questions entered from September 6th, 2016 to November 17th, 2019, were judged by users of the service as good answers. Oshi-el output good answers at about twice the rate of the average human responder in Oshiete-goo who answered more than 100 questions in the love advice category. Thus, we think this is a good result.

Furthermore, to evaluate the effectiveness of the supplemental information, we prepared 100 answers that only contained conclusion sentences during the same period of time. As a result, users rated the answers that contained both conclusion and supplement sentences as good 1.6 times more often than those that contained only conclusion sentences. This shows that our method successfully incorporated supplemental information in answering non-factoid questions.


We tackled the problem of conclusion-supplement answer generation for non-factoid questions, an important task in NLP. We presented an architecture, ensemble network, that uses an attention mechanism to reflect the context of the conclusion decoder’s output sequence on the supplement decoder’s output sequence. The ensemble network also assesses the closeness of the encoder input sequence to the output of each decoder and the combined output sequences of both decoders. Evaluations showed that our architecture was consistently superior to conventional encoder-decoders in this task. The ensemble network is now being used in the “Love Advice,” service as mentioned in the Evaluation section.

Furthermore, our method, NAGM, can be generalized to generate much longer descriptions other than conclusion-supplement answers. For example, it is being used to generate Tanka, which is a genre of classical Japanese poetry that consists of five lines of words666, in the following way. The first line is input by a human user to NAGM as a question, and NAGM generates second line (like a conclusion) and third line (like a supplement). The third line is again input to NAGM as a question, and NAGM generates the fourth line (like a conclusion) and fifth line (like a supplement).


  • [1] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473. External Links: Link Cited by: Introduction, Compared methods, Methodology.
  • [2] C. dos Santos, L. Barbosa, D. Bogdanova, and B. Zadrozny (2015-07) Learning hybrid representations to retrieve semantically equivalent questions. In Proc. ACL-IJCNLP’15, pp. 694–699. Cited by: Related work.
  • [3] R. Ennis (1991) Critical thinking: a streamlined conception. In Teaching philosophy, pp. 5–25. Cited by: Introduction.
  • [4] S. Ghosh, O. Vinyals, B. Strope, S. Roy, T. Dean, and L. Heck (2016) Contextual LSTM (CLSTM) models for large scale NLP tasks. CoRR abs/1602.06291. Cited by: Related work, Compared methods.
  • [5] R. Jia and P. Liang (2017) Adversarial examples for evaluating reading comprehension systems. In Proc. EMNLP’17, pp. 2021–2031. Cited by: Introduction.
  • [6] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. CoRR abs/1705.03551. Cited by: Related work.
  • [7] J. Li, W. Monroe, A. Ritter, D. Jurafsky, M. Galley, and J. Gao (2016)

    Deep reinforcement learning for dialogue generation

    In Proc. EMNLP’16, pp. 1192–1202. Cited by: Human evaluation.
  • [8] C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out: In: Proc. ACL-04 Workshop, pp. 74–81. Cited by: Methodology.
  • [9] X. Liu, K. Duh, and J. Gao (2018) Stochastic answer networks for natural language inference. CoRR abs/1804.07888. Cited by: Related work.
  • [10] M. Nakatsuji (2018) Can ai generate love advice? neural conclusion-supplement answer construction for non-factoid questions. In GPU Technology Conference 2018 – San Jose, CA, Cited by: Online evaluation in “Love Advice” service.
  • [11] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng (2016) MS MARCO: A human generated machine reading comprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with NIPS 2016, Cited by: Introduction, Related work.
  • [12] R. Nogueira, W. Yang, J. Lin, and K. Cho (2019) Document expansion by query prediction. CoRR abs/1904.08375. Cited by: Related work.
  • [13] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proc. ACL’02, pp. 311–318. Cited by: Methodology.
  • [14] X. Qiu and X. Huang (2015)

    Convolutional neural tensor network architecture for community-based question answering

    In Proc. IJCAI’15, pp. 1305–1311. Cited by: Related work.
  • [15] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100, 000+ questions for machine comprehension of text. CoRR abs/1606.05250. Cited by: Introduction, Related work.
  • [16] R. Rinott, L. Dankin, C. A. Perez, M. M. Khapra, E. Aharoni, and N. Slonim (2015) Show me your evidence - an automatic method for context dependent evidence detection. In Proc. EMNLP’15, pp. 440–450. Cited by: Introduction.
  • [17] I. V. Serban, A. Sordoni, Y. Bengio, A. C. Courville, and J. Pineau (2016) Building end-to-end dialogue systems using generative hierarchical neural network models. In Proc. AAAI’16, pp. 3776–3784. Cited by: Related work, Compared methods.
  • [18] T. Shao, Y. Guo, Z. Hao, and H. Chen (2019-02) Transformer-based neural network for answer selection in question answering. IEEE Access PP, pp. 1–1. Cited by: Related work.
  • [19] H. Song, Z. Ren, S. Liang, P. Li, J. Ma, and M. de Rijke (2017) Summarizing answers in non-factoid community question-answering. In Proc. WSDM ’17, pp. 405–414. Cited by: Methodology.
  • [20] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Proc. NIPS’14, pp. 3104–3112. Cited by: Introduction, Methodology.
  • [21] C. Tan, F. Wei, N. Yang, W. Lv, and M. Zhou (2017) S-net: from answer extraction to answer generation for machine reading comprehension. CoRR abs/1706.04815. Cited by: Related work.
  • [22] M. Tan, C. N. dos Santos, B. Xiang, and B. Zhou (2016) Improved representation learning for question answer matching. In Proc. ACL’16, pp. 464–473. Cited by: Related work.
  • [23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. External Links: Link Cited by: Related work, Compared methods.
  • [24] O. Vinyals and Q. V. Le (2015) A neural conversational model. CoRR abs/1506.05869. Cited by: Introduction, Methodology.
  • [25] M. Wang, N. A. Smith, and T. Mitamura (2007) What is the jeopardy model? a quasi-synchronous grammar for qa. In Proc. EMNLP-CoNLL’07, pp. 22–32. Cited by: Introduction.
  • [26] Y. Wang, K. Liu, J. Liu, W. He, Y. Lyu, H. Wu, S. Li, and H. Wang (2018) Multi-passage machine reading comprehension with cross-passage answer verification. In Proc. ACL’18, pp. 1918–1927. Cited by: Introduction, Related work.
  • [27] Y. Yang, W. Yih, and C. Meek (2015) WikiQA: A challenge dataset for open-domain question answering. In Proc. EMNLP’15, pp. 2013–2018. Cited by: Related work.
  • [28] Z. Yang, Y. Yuan, Y. Wu, W. W. Cohen, and R. Salakhutdinov (2016) Review networks for caption generation. In Proc. NIPS’16, pp. 2361–2369. Cited by: Methodology, Methodology.
  • [29] J. Yin, X. Jiang, Z. Lu, L. Shang, H. Li, and X. Li (2016) Neural generative question answering. In Proc. IJCAI’16, pp. 2972–2978. Cited by: Methodology.
  • [30] A. W. Yu, D. Dohan, M. Luong, R. Zhao, K. Chen, M. Norouzi, and Q. V. Le (2018) QANet: combining local convolution with global self-attention for reading comprehension. In Proc. ICLR’18, Cited by: Related work.
  • [31] L. Yu, K. M. Hermann, P. Blunsom, and S. Pulman (2014) Deep learning for answer sentence selection. CoRR abs/1412.1632. Cited by: Introduction.