Can AI Generate Love Advice?: Toward Neural Answer Generation for Non-Factoid Questions

12/06/2019
by   Makoto Nakatsuji, et al.

Deep learning methods that extract answers for non-factoid questions from QA sites are seen as critical since they can assist users in reaching their next decisions through conversations with AI systems. The current methods, however, have the following two problems: (1) They cannot understand the ambiguous use of words in questions, since word usage can strongly depend on the context. As a result, their answer selection accuracy is not good enough. (2) They can only select from among the answers held by QA sites and cannot generate new ones. Thus, they cannot answer questions that differ somewhat from those stored in QA sites. Our solution, the Neural Answer Construction Model, tackles these problems as follows: (1) It incorporates the biases of semantics behind questions (e.g., titles and categories assigned to questions) into word embeddings while also computing the embeddings by using QA documents stored across semantics. As a result, it can extract answers that suit the contexts of words used in the question as well as the common usage of words across semantics. This improves the accuracy of answer selection. (2) It uses biLSTMs to compute the embeddings of questions as well as those of the sentences often used to form answers. It then simultaneously learns the optimum combination of those sentences as well as the closeness between the question and those sentences. As a result, our model can construct an answer that corresponds to the situation underlying the question; it fills the gap between answer selection and generation and is the first model to move beyond the current simple answer selection paradigm for non-factoid QA. Evaluations using datasets created from love advice stored in the Japanese QA site Oshiete goo indicate that our model achieves 20% higher accuracy than the strong baselines. Our model is practical and has already been applied to the love advice service in Oshiete goo.


1 Introduction

Recently, dialog-based natural language understanding systems such as Apple’s Siri, IBM’s Watson, Amazon’s Echo, and Wolfram Alpha have spread through the market. In those systems, Question Answering (QA) modules are particularly important since people want to know many things in their daily lives. Technically, there are two types of questions in QA systems: factoid questions and non-factoid ones. The former ask, for instance, for the name of a person or a location, as in “What/Who is ...?”. The latter are more diverse questions that cannot be answered by a short fact. They range from advice on making long-distance relationships work well to requests for opinions on public issues. Significant progress has been made in answering factoid questions (Wang et al. (2007); Yu et al. (2014)); however, retrieving answers for non-factoid questions from the Web remains a critical challenge in improving QA modules. QA community sites such as Yahoo! Answers and Quora can be sources of training data for non-factoid questions, where the goal is to automatically select the best of the stored candidate answers.

Recent deep learning methods have been applied to this non-factoid answer selection task using datasets stored in QA sites, resulting in state-of-the-art performance (Yu et al. (2014); Tan et al. (2015); Qiu and Huang (2015); Feng et al. (2015); Wang and Nyberg (2015); Tan et al. (2016)). They usually compute the closeness between questions and answers from the individual embeddings obtained with a convolutional model. For example, Tan et al. (2016) build the embeddings of questions and those of answers based on bidirectional long short-term memory (biLSTM) models and measure their closeness by cosine similarity. They also utilize an efficient attention mechanism to generate the answer representation following the question context. Their results show that their model achieves much more accurate results than the strong baseline (Feng et al. (2015)). The current methods, however, have the following two problems when applied to real applications:

(1) They cannot understand the ambiguous use of words in questions, as words are used in quite different ways depending on the context in which they appear (e.g., the word “relationship” in a question submitted to the “Love advice” category is used quite differently from the same word in a question submitted to the “Business advice” category). As a result, words that are important for a specific context are likely to be disregarded in the subsequent answer selection process, and answer selection accuracy becomes too weak for real applications.

(2) They can only select from among the answers stored in the QA systems and cannot generate new ones. Thus, they cannot answer questions that are somewhat different from those stored in the QA systems, even though it is important to cope with such differences when answering non-factoid questions (e.g., questions in the “Love advice” category often differ by situation and user even when they share the same topics). Furthermore, the answers selected from QA datasets often contain a large amount of unrelated information. Some other studies have tried to create short answers to the short questions often seen in chat systems (Vinyals and Le (2015); Serban et al. (2015)). Our target, non-factoid questions in QA systems, are, however, much longer and more complicated than those in chat systems. As described in their papers, the above methods unfortunately create unsatisfying answers to such non-factoid questions.

Figure 1: Main ideas: (a) word embeddings with semantics and (b) a neural answer construction.

To solve the above problems, this paper proposes a neural answer construction model; it fills the gap between answer selection and generation and is the first model to move beyond the current simple answer selection model for non-factoid QAs. It extends the above-mentioned biLSTM model since it is language independent and free from feature engineering, linguistic tools, or external resources. Our model is based on the following two ideas:

(1) Before learning answer creation, it incorporates semantic biases behind questions (e.g., titles or categories assigned to questions) into word vectors while computing the vectors by using QA documents stored across semantics. This process emphasizes the words that are important for a certain context. As a result, it can select the answers that suit the contexts of words used in the questions as well as the common usage of words seen across semantics. This improves the accuracy of answer selection. For example, in Fig. 1-(a), there are two questions in the categories “Family” and “Love advice”. Words marked with rectangles are category specific (i.e., “son” and “homework” are specifically observed in “Family” while “distance”, “relationship”, and “lovers” are found in “Love advice”). Our method can emphasize those words. As a result, answers that include the topics “son” and “homework”, or the topics “distance”, “relationship”, and “lovers”, will be scored highly for the above questions in the subsequent answer selection task.

(2) The QA module designer first defines the abstract scenario of the answer to be created: the types of sentences that should compose the answer and their occurrence order in the answer (e.g., typical answers in “Love advice” are composed in the order of the sentence types “sympathy”, “conclusion”, “supplement for conclusion”, and “encouragement”). The sentence candidates can be extracted from whole answers by applying sentence extraction methods or sentence type classifiers (Schmidt et al. (2014); Zhang et al. (2008); Nishikawa et al. (2010); Chen et al. (2010)). The model then simultaneously learns the closeness between questions and sentences that may form answers as well as the combinational optimization of those sentences. Our method also uses an attention mechanism to generate sentence representations according to the prior sentence; this extracts important topics in the sentence and tracks those topics in subsequent sentences. As a result, it can construct answers that have natural sentence flow whose topics correspond to the questions. Fig. 1-(b) explains the proposed neural network by using examples. Here, the QA module designer first defines the abstract scenario for the answer as the order of “conclusion” and “supplement”. Thus, there are three types of inputs: “question”, “conclusion”, and “supplement”. The model next runs biLSTMs over those inputs separately; it learns the order of word vectors such that “relationships” often appears next to “distance”. It then computes the embedding for the question, that for the conclusion, and that for the supplement by max-pooling over the hidden vectors output by the biLSTMs. Finally, it computes the closeness between question and conclusion, that between question and supplement, and the combinational optimization between conclusion and supplement with the attention mechanism, simultaneously (dotted lines in Fig. 1-(b) represent attention from conclusion to supplement).

We evaluated our method using datasets stored in the Japanese QA site Oshiete goo (http://oshiete.goo.ne.jp). In particular, our evaluations focus on questions stored in the “Love advice” category since they are representative non-factoid questions: the questions are often complicated and most are very long. The results show that our method outperforms the previous methods, including that of Tan et al. (2016); our method accurately constructs answers by naturally combining key sentences that are highly close to the question.

2 Related work

Previous works on answer selection normally require feature engineering, linguistic tools, or external resources. Recent deep learning methods are attractive since they demonstrate superior performance compared to traditional machine learning methods without the above-mentioned tiresome procedures. For example, Wang and Nyberg (2015) and Hu et al. (2014) construct a joint feature vector on both question and answer and then convert the task into a classification or ranking problem. Feng et al. (2015), Yu et al. (2014), dos Santos et al. (2015), and Qiu and Huang (2015) learn the question and answer representations and then match them by certain similarity metrics. Recently, Tan et al. (2016) took the latter approach and achieved more accurate results than the current strong baselines (Feng et al. (2015); Bendersky et al. (2011)). They, however, can only select answers and not generate them. Other than the above, recent neural text generation methods (Serban et al. (2015); Vinyals and Le (2015)) can also intrinsically be used for answer generation. Their evaluations showed that they could generate very short answers to factoid questions, but not the longer and more complicated answers demanded by non-factoid questions. Our Neural Answer Construction Model fills the gap between answer selection and generation for non-factoid QAs. It simultaneously learns the closeness between questions and sentences that may form answers as well as the combinational optimization of those sentences. Since the sentences themselves in the answer are short, they can be generated by neural conversation models like that of Vinyals and Le (2015).

As for word embeddings with semantics, some previous methods use the semantics behind words by relying on semantic lexicons such as WordNet and Freebase (Xu et al. (2014); Bollegala et al. (2016); Faruqui et al. (2015); Johansson and Nieto Piña (2015)). They, however, do not use the semantics behind the question/answer documents, e.g., document categories. Thus, they cannot accurately capture the contexts in which words appear in the QA documents. They also require external semantic resources other than QA datasets.

3 Preliminary

Here, we explain QA-LSTM (Tan et al. (2015)), the basic discriminative framework for answer selection based on LSTM, since we base our ideas on its framework.

We first explain the LSTM and introduce the terminology used in this paper. Given an input sequence $X = \{x_1, x_2, \ldots, x_n\}$, where $x_t$ is the $t$-th word vector, the $t$-th hidden vector $h_t$ is updated as:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{C}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
C_t &= i_t \odot \tilde{C}_t + f_t \odot C_{t-1} \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$

There are three gates (input $i_t$, forget $f_t$, and output $o_t$) and a cell memory vector $C_t$. $\sigma$ is the sigmoid function. $W$, $U$, and $b$ are the network parameters to be learned. Single-direction LSTMs are weak in that they fail to make use of the contextual information from future tokens. BiLSTMs use both the previous and future context by processing the sequence in two directions, and generate two sequences of output vectors. The output for each token is the concatenation of the two vectors from both directions, i.e. $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$.
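For concreteness, here is a minimal sketch of one LSTM step in NumPy; the stacked parameter layout (one matrix holding all four gates) and the gate ordering are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step following the gate equations above.

    x_t: (E,) word vector; h_prev, c_prev: (H,) previous hidden/cell state.
    W: (4H, E), U: (4H, H), b: (4H,) hold the stacked gate parameters
    in the (assumed) order input, forget, output, candidate cell.
    """
    z = W @ x_t + U @ h_prev + b
    H = h_prev.shape[0]
    i_t = sigmoid(z[0*H:1*H])        # input gate
    f_t = sigmoid(z[1*H:2*H])        # forget gate
    o_t = sigmoid(z[2*H:3*H])        # output gate
    c_tilde = np.tanh(z[3*H:4*H])    # candidate cell memory
    c_t = i_t * c_tilde + f_t * c_prev
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```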

In the QA-LSTM framework, given an input pair (q, a), where q is a question and a is a candidate answer, it first retrieves the word embeddings (WEs) of both q and a. Next, it separately applies a biLSTM over the two sequences of WEs. Then, it generates a fixed-sized distributed vector representation o_q for q (or o_a for a) by computing max pooling over all the output vectors and then concatenating the resulting vectors from both directions of the biLSTM. Finally, it uses cosine similarity cos(o_q, o_a) to score the input pair.

It then defines the training objective as the hinge loss:

$$
L = \max\{0,\ M - \cos(o_q, o_{a^+}) + \cos(o_q, o_{a^-})\}
$$

where $o_{a^+}$ is the output vector for a ground-truth answer, $o_{a^-}$ is that for an incorrect answer randomly chosen from the entire answer space, and $M$ is a margin. It treats any question with more than one ground truth as multiple training examples. Finally, batch normalization is performed on the representations before computing cosine similarity (Ioffe and Szegedy (2015)).
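As a rough illustration of this framework, the sketch below encodes a question and a candidate answer with a shared biLSTM, max-pools the outputs, and applies the hinge loss. It uses PyTorch for illustration rather than the Chainer implementation mentioned later in the paper, and all layer sizes, names, and the margin value are assumptions; the batch normalization step is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QALSTMScorer(nn.Module):
    """Sketch of QA-LSTM: shared biLSTM encoder + max pooling + cosine score."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=141):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)

    def encode(self, token_ids):
        # token_ids: (batch, seq_len) -> (batch, 2 * hidden_dim)
        hidden, _ = self.bilstm(self.embed(token_ids))
        return hidden.max(dim=1).values          # max pooling over time

    def forward(self, q_ids, a_ids):
        o_q, o_a = self.encode(q_ids), self.encode(a_ids)
        return F.cosine_similarity(o_q, o_a, dim=1)

def hinge_loss(scorer, q_ids, pos_ids, neg_ids, margin=0.2):
    """Hinge loss over a ground-truth answer and a randomly sampled negative answer."""
    return torch.clamp(margin
                       - scorer(q_ids, pos_ids)
                       + scorer(q_ids, neg_ids), min=0).mean()
```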

4 Method

We first explain our word embeddings with semantics.

4.1 Word embeddings with document semantics

This process is inspired by paragraph2vec (Le and Mikolov (2014)), an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text, such as sentences, paragraphs, and documents.

First, we explain the paragraph2vec model. It averages the paragraph vector with several word vectors from a paragraph and predicts the following word in the given context. It trains both word vectors and paragraph vectors by stochastic gradient descent and backpropagation (Rumelhart et al. (1988)). While paragraph vectors are unique among paragraphs, the word vectors are shared.

Next, we introduce our method, which incorporates the semantics behind QA documents into word embeddings (WEs) in the training phase. The idea is simple (see Fig. 2): it averages the vector of the category token and the vectors of the title tokens, which are assigned to the QA documents, with several of the word vectors present in those documents. It then predicts the following word in the given context. Here, title tokens are defined as nouns extracted from the titles assigned to the questions. Multiple title tokens can be extracted from a title, while one category token is assigned to a question. Those tokens are shared among datasets in the same category. The method trains the category vector and title vectors as well as the word vectors in QA documents as in the paragraph2vec model. Those additional vectors act as semantic biases for learning WEs. They are useful in emphasizing words that follow the contexts of particular categories or titles. This improves the accuracy of the answer selection described later, as explained in the Introduction.

Figure 2: Learning word vectors biased with semantics.

For example, in Fig. 2, the method can incorporate semantic biases from the category “Love advice” into the words (e.g., “Will”, “distance”, “relationship”, “ruin”, “love”, and so on) in a question in “Love advice”. Thus, it can properly apply the biases from the category “Love advice” to words (e.g., “distance” and “relationship”) that specifically appear in “Love advice”. On the other hand, words that appear in several categories (e.g., “will”) are biased by several categories and thus will not be emphasized.
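A minimal sketch of this training step follows, written in PyTorch as a CBOW-style objective; the window handling, layer names, vocabulary size, and the exact way category and title vectors are averaged with the context word vectors are assumptions made for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SemanticCBOW(nn.Module):
    """Predict the next word from averaged context-word, category, and title vectors."""
    def __init__(self, vocab_size, n_semantic_tokens, emb_dim=300):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        # one shared table for category tokens and title tokens (semantic biases)
        self.sem_emb = nn.Embedding(n_semantic_tokens, emb_dim)
        self.out = nn.Linear(emb_dim, vocab_size)

    def forward(self, context_ids, semantic_ids):
        # context_ids: (batch, window), semantic_ids: (batch, n_tokens)
        vecs = torch.cat([self.word_emb(context_ids),
                          self.sem_emb(semantic_ids)], dim=1)
        avg = vecs.mean(dim=1)          # average context words + category/title vectors
        return self.out(avg)            # logits over the following word

# hypothetical sizes: 50,000 words; 16 category tokens + 6,250 title tokens
model = SemanticCBOW(vocab_size=50000, n_semantic_tokens=6266)
loss_fn = nn.CrossEntropyLoss()
context = torch.randint(0, 50000, (32, 4))    # context word ids
semantic = torch.randint(0, 6266, (32, 3))    # [category, title tokens...]
target = torch.randint(0, 50000, (32,))       # the following word
loss = loss_fn(model(context, semantic), target)
loss.backward()
```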

4.2 Neural Answer Construction Model

Here, we explain our model. We first explain our approach and then the algorithm.

Approach

It takes the following three approaches:

  • Design the abstract scenario for the answer: The answer is constructed according to the order of the sentence types defined by the designer. For example, there are sentence types such as a sentence that expresses sympathy with the question, a sentence that states a conclusion to the question, a sentence that supplements the conclusion, and a sentence that offers encouragement to the questioner. This is inspired by the automated web service composition framework (Rao and Su (2005)), in which the requester builds an abstract process before the web service composition planning starts. In our setting, the process is the scenario of the answer and the service is a sentence in the scenario. Thus, our method can construct an answer by binding concrete sentences to fit the scenario.

    For example, the scenario for love advice can be designed as follows: it begins with a sympathy sentence (e.g., “You are struggling too.”), next it states a conclusion sentence (e.g., “I think you should make a declaration of love to her as soon as possible.”), then it supplements the conclusion with a supplemental sentence (e.g., “If you are too late, she may fall in love with someone else.”), and finally it ends with an encouragement sentence (e.g., “Good luck!”).

  • Joint neural network to learn sentence selection and combination: Our model computes the combinational optimization among sentences that may form the answer as well as the closeness between question and sentences within a single neural network. This improves answer sentence selection; our model can avoid cases in which the combination of sentences is not good enough even though the closeness scores between the question and each sentence are high. It also makes parameter tuning simpler than a model that uses separate networks for sentence selection and for sentence combination. The image of this neural network is depicted in Fig. 1-(b). Here, it learns the closeness between “Will distance relationship ruin love?” and “Distance cannot ruin true love.”, the closeness between “Will distance relationship ruin love?” and “Distance certainly tests your love.”, and the combination of “Distance cannot ruin true love.” and “Distance certainly tests your love.”.

  • Attention mechanism to improve the combination of sentences: Our method extracts important topics in the conclusion sentence and emphasizes those topics in the supplemental sentence in the training phase; this is inspired by Tan et al. (2016), who utilize an attention mechanism to generate the answer representation following the question context. As a result, it can combine conclusions with supplements following the contexts written in the conclusion sentences. This makes the story in the created answers very natural. In Fig. 1-(b), our attention mechanism extracts important topics (e.g., the topic represented by “distance”) in the conclusion sentence “Distance cannot ruin true love.” and emphasizes those topics when computing the representation of the supplement sentence “Distance certainly tests your love.”.

Procedure

0:  Pairs of question, conclusion, and supplement, {(q, c, s)}.
0:  Parameters set by the algorithm.
1:  for n = 1, ..., N do
2:     for each pair (q, c, s) do
3:         Compute o_q^c and o_c by biLSTMs and max pooling.
4:         Compute o_q^s by a biLSTM and max pooling.
5:         for each t-th hidden vector h_s(t) for the supplement do
6:            Compute h̃_s(t) by Eq. (1).
7:         end for
8:         Compute o_s by max pooling over h̃_s(t).
9:         Compute the loss L by Eq. (2).
10:     end for
11:  end for
Algorithm 1 A neural answer construction model

The core part of the answer is usually the conclusion sentence and its supplemental sentence. Thus, for simplicity, we here explain the procedure of our model in selecting and combining these two types of sentences. As the reader can imagine, it can easily be extended to four sentence types. In fact, our love advice service by AI in Oshiete goo was implemented for four types of sentences: sympathy, conclusion, supplement, and encouragement (see the Evaluation section). The model is illustrated in Fig. 1-(b), in which the input triple is (q, c, s), where q is the question, c is a candidate conclusion sentence, and s is a candidate supplemental sentence. The word embeddings (WEs) for words in q, c, and s are extracted in the way described in the previous subsection. The procedure of our model is as follows (see also Algorithm 1):

(1) It iterates the following procedures (2) to (7) N times (line 1 in the algorithm).

(2) It picks up each pair (q, c, s) in the dataset (line 2 in the algorithm).

In the following steps (3) and (4), the same biLSTM is applied to both q and c to compute the closeness between q and c. Similarly, the same biLSTM is applied to both q and s. However, the biLSTM used to compute the closeness between q and c differs from that between q and s, since conclusions and supplements have different characteristics.

(3) It separately applies a biLSTM over the two sequences of WEs for q and c, and computes max pooling over the t-th hidden vectors for the question and those for the conclusion. As a result, it acquires the question embedding o_q^c and the conclusion embedding o_c (line 3 in the algorithm).

(4) It also separately applies a biLSTM over the two sequences of WEs for q and s, and computes max pooling over the t-th hidden vectors for the question to acquire the question embedding o_q^s (line 4 in the algorithm). o_q^s is different from o_q^c since our method does not share the sub-network used for computing the closeness between q and c with that between q and s, as described above.

(5) It applies the attention mechanism from conclusion to supplement. Specifically, given the output vector of the biLSTM on the supplemental side at time step t, h_s(t), and the conclusion embedding, o_c, the updated vector h̃_s(t) for each supplement token is formulated as below (line 6 in the algorithm):

$$
\begin{aligned}
m_{s,c}(t) &= \tanh\!\big(W_{sm} h_s(t) + W_{cm} o_c\big) \\
s_{s,c}(t) &\propto \exp\!\big(w_{ms}^{\top} m_{s,c}(t)\big) \\
\tilde{h}_s(t) &= h_s(t)\, s_{s,c}(t)
\end{aligned}
\tag{1}
$$

W_sm, W_cm, and w_ms are attention parameters. Conceptually, the attention mechanism gives more weight to words that include important topics in the conclusion sentence.
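The following PyTorch-style sketch shows one way to realize this conclusion-to-supplement attention; the parameter names mirror Eq. (1), but the softmax normalization over time steps, the tensor shapes, and the class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConclusionToSupplementAttention(nn.Module):
    """Re-weight supplement biLSTM outputs by their relevance to the conclusion embedding."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_sm = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W_cm = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.w_ms = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, h_s, o_c):
        # h_s: (batch, T, hidden)  supplement biLSTM output vectors
        # o_c: (batch, hidden)     conclusion embedding from max pooling
        m = torch.tanh(self.W_sm(h_s) + self.W_cm(o_c).unsqueeze(1))  # (batch, T, hidden)
        scores = torch.softmax(self.w_ms(m), dim=1)                   # (batch, T, 1)
        h_tilde = h_s * scores                   # re-weighted supplement outputs
        o_s = h_tilde.max(dim=1).values          # supplemental embedding (step 6)
        return o_s
```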

(6) It computes max pooling over h̃_s(t) and acquires the supplemental embedding o_s (line 8 in the algorithm).

(7) It computes the closeness between question and conclusion and that between question and supplement, as well as the combinational optimization between conclusion and supplement. The training objective is given as (line 9 in the algorithm):

$$
\begin{aligned}
L ={}& \max\{0,\ M - \cos(o_q, o_{c^+} \oplus o_{s^+}) + \cos(o_q, o_{c^+} \oplus o_{s^-})\} \\
 {}+{}& \max\{0,\ M - \cos(o_q, o_{c^+} \oplus o_{s^+}) + \cos(o_q, o_{c^-} \oplus o_{s^+})\} \\
 {}+{}& \max\{0,\ \alpha M - \cos(o_q, o_{c^+} \oplus o_{s^+}) + \cos(o_q, o_{c^-} \oplus o_{s^-})\} \\
 {}+{}& \max\{0,\ M - \cos(o_q, o_{c^+} \oplus o_{s^-}) + \cos(o_q, o_{c^-} \oplus o_{s^-})\} \\
 {}+{}& \max\{0,\ M - \cos(o_q, o_{c^-} \oplus o_{s^+}) + \cos(o_q, o_{c^-} \oplus o_{s^-})\}
\end{aligned}
\tag{2}
$$

where ⊕ denotes the concatenation of two vectors, o_q is the question embedding given by the concatenation of o_q^c and o_q^s, o_{c+} (or o_{s+}) is the output vector for a ground-truth conclusion (or supplement), and o_{c−} (or o_{s−}) is that for an incorrect one randomly chosen from the entire answer space. In the above equation, the first (or second) term gives the loss incurred when both the question-conclusion pair (q-c) and the question-supplement pair (q-s) are correct while q-c (or q-s) is correct but q-s (or q-c) is incorrect. The third term gives the loss incurred when both q-c and q-s are correct while both q-c and q-s are incorrect. The fourth (or fifth) term gives the loss incurred when q-c (or q-s) is correct but q-s (or q-c) is incorrect while both q-c and q-s are incorrect. M is a constant margin and α is a parameter controlling the margin; thus, the resulting margin for the third term is larger than those for the other terms. In this way, by considering the cases in which either the conclusion or the supplement is incorrect, this equation optimizes the combination of conclusion and supplement. In addition, it takes the closeness between the question and the conclusion (or supplement) into consideration through cosine similarity.
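A sketch of this objective in PyTorch follows; it assumes the question, conclusion, and supplement embeddings have already been computed (with the question embedding matching the dimensionality of the concatenated conclusion-supplement pair), and the margin values are placeholders rather than the tuned values from the paper.

```python
import torch
import torch.nn.functional as F

def answer_construction_loss(o_q, o_c_pos, o_s_pos, o_c_neg, o_s_neg,
                             margin=0.2, alpha=2.0):
    """Five-term hinge loss over correct/incorrect conclusion-supplement combinations (cf. Eq. 2)."""
    def score(o_c, o_s):
        # closeness between the question and a concatenated conclusion-supplement pair
        return F.cosine_similarity(o_q, torch.cat([o_c, o_s], dim=1), dim=1)

    pos = score(o_c_pos, o_s_pos)      # both correct
    cn  = score(o_c_neg, o_s_pos)      # conclusion incorrect
    sn  = score(o_c_pos, o_s_neg)      # supplement incorrect
    nn_ = score(o_c_neg, o_s_neg)      # both incorrect
    zero = torch.zeros_like(pos)
    loss = (torch.max(zero, margin - pos + sn)
            + torch.max(zero, margin - pos + cn)
            + torch.max(zero, alpha * margin - pos + nn_)   # larger margin for the third term
            + torch.max(zero, margin - sn + nn_)
            + torch.max(zero, margin - cn + nn_))
    return loss.mean()
```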

The parameter sets for question-conclusion matching, those for question-supplement matching, and the attention parameters for conclusion-supplement attention are trained during the iterations. After the model is trained, our method uses cos(o_q, o_c ⊕ o_s) to score an input (q, c, s) triple and constructs an answer that has a conclusion and its supplement.

QA-LSTM Attentive-LSTM Semantic-LSTM Construction Our method
0.8472 0.8196 0.8499 0.8816 0.8846
0.8649 0.844566 0.8734 0.8884 0.8909
0.8653 0.8418 0.8712 0.8827 0.8845
0.8603 0.8358 0.8658 0.8618 0.8647
Table 1: Comparison of AP for answer selection.
QA-LSTM Attentive-LSTM Semantic-LSTM Construction Our method
0.3262 0.3235 0.3664 0.3813 0.3901
0.3753 0.3694 0.4078 0.5278 0.5308
0.3813 0.3758 0.4133 0.5196 0.5271
0.3827 0.3777 0.4151 0.4838 0.4763
Table 2: Comparison of AP for answer construction.
Method (1) (2) (3) (4)
QA-LSTM 57 66 41 36
Our method 116 51 13 20
Table 3: Comparison of human evaluation results.
Questions Answers created by QA-LSTM Answers created by Our method

I’m afraid to confess my love to her, what should I do?
You should wait until you feel excited. If you interact with her indifferently, it will be difficult to develop any relation with her. It is better to concentrate on how to confess your love to her. I understand you are struggling since you love her very much.
A guy I like says to me “I like you at home” kiddingly. It may be the ordinary gentleness. Some hope? You don’t have to test his love immediately. Unless he likes you, he would not have gone to see a movie with you. Yes, there is some hope. You can understand his reaction more easily if your understanding of each other is deeper.
I seldom meet an interesting person. I worry about how to become close to him. Should I approach him positively? Try to select your words correctly. Unless you confess your love to him, it is difficult to convey your emotion to him. You should confess your love to him. Unless you confess your love to him, it is difficult to convey your emotion to him.
Table 4: Examples of answers created by QA-LSTM and those by Our method.

5 Evaluation

We used our method to select or construct answers to the questions stored in “Love advice” category.

5.1 Dataset

We evaluated our method using a dataset stored in the Japanese online QA service Oshiete goo. First, the word embeddings were built by using 189,511 questions and their 771,956 answers stored in 16 categories, including “Love Advice”, “Traveling”, and “Health Care”; 6,250 title tokens were extracted from the titles. Then, we evaluated the answer selection and construction tasks by using a corpus containing about 5,000 question-conclusion-supplement sentences. Conclusion and supplement sentences were extracted from the answers by human experts. Readers could instead use sentence extraction methods (Schmidt et al. (2014); Zhang et al. (2008); Nishikawa et al. (2010); Chen et al. (2010)) or neural conversation models like that of Vinyals and Le (2015) to semi-automatically extract/generate those sentences.

5.2 Compared methods

We compared the accuracy of the following five methods:

  • QA-LSTM: proposed by Tan et al. (2015).

  • Attentive LSTM: introduces an attention mechanism from question to answer and has been evaluated as the current best answer selection method (Tan et al. (2016)).

  • Semantic LSTM: performs answer selection by using our word embeddings biased with semantics.

  • Construction: performs our proposed answer construction without the attention mechanism.

  • Our method: performs our answer construction with the attention mechanism from conclusion to supplement.

5.3 Methodology and parameter setup

We randomly divided the dataset into two halves, a training dataset and a test dataset, and conducted two-fold cross-validation. The results shown later are the average values.

Both for answer selection and construction, we used Average Precision (AP) over the top-K ranked answers in the results, because we consider the most highly ranked answers to be the most important for users. If the number of ranked items is K, the number of correct answers among the top-j ranked items is c_j, and the number of all correct answers (paired with the question) is R, AP is defined as follows:

$$
\mathrm{AP} = \frac{1}{R} \sum_{j=1}^{K} \frac{c_j}{j}\, \mathrm{rel}(j),
$$

where rel(j) is 1 if the j-th ranked item is a correct answer and 0 otherwise.
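For reference, here is a small Python helper that computes this AP@K from a ranked list of relevance flags; it is a straightforward sketch of the standard formula above, not code from the paper.

```python
def average_precision(relevance, num_relevant, k=None):
    """AP@K: sum of precision-at-j over ranks j with a correct answer, divided by R.

    relevance: list of 0/1 flags for the ranked items, best-ranked first.
    num_relevant: total number of correct answers paired with the question (R).
    """
    if k is not None:
        relevance = relevance[:k]
    hits, precision_sum = 0, 0.0
    for j, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / j      # c_j / j at a correct rank
    return precision_sum / num_relevant if num_relevant else 0.0

# e.g. a top-3 ranking where only the second item is correct, one relevant answer overall
print(average_precision([0, 1, 0], num_relevant=1, k=3))  # 0.5
```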

For answer construction, we checked whether each method could recreate the original answers. As the reader can easily understand, this is a much more difficult task than answer selection, and thus the AP values will be smaller than those for answer selection.

We tried word vectors and QA vectors of different sizes, and finally set the word vector size to and the LSTM output vectors for biLSTMs to . We also tried different margins in the hinge loss function, and fixed the margin, , to and to . The iteration count was set to . For our method, the embeddings for questions, those for conclusions, and those for supplements were pretrained by Semantic LSTM before answer construction since this enhances the overall accuracy.

We did not use the attention mechanism from question to answer for Semantic LSTM, Construction, or Our method. This is because, as we present in the results subsection, questions are much longer than answer sentences, and thus the attention mechanism from question to answer became noise for sentence selection.

5.4 Results

We now present the results of the evaluations.

Answer Selection

We first compare the accuracy of the methods for answer selection. The results are shown in Table 1. QA-LSTM and Attentive LSTM are worse than Semantic-LSTM. This indicates that Semantic-LSTM can incorporate semantic information (titles/categories) into word embeddings; it can emphasize words according to the context in which they appear, and thus the matching accuracy between the question vector and the conclusion (supplement) vector was improved. Attentive LSTM is worse than QA-LSTM, as described above. Construction and Our method are better than Semantic-LSTM. This is because they can avoid combinations of sentences that are not good enough even though the closeness scores between questions and sentences are high. This implies that, if the combination is not good, the selection of answer sentences also tends to be erroneous. Finally, Our method, which provides sophisticated selection/combination strategies, yielded higher accuracy than the other methods. It achieved 4.4% higher accuracy than QA-LSTM (QA-LSTM scored 0.8472 while Our method scored 0.8846).

Answer Construction

We then compared the accuracy of the methods for answer construction. For the answer construction task in particular, the top-1 result is most important since many QA applications show only the top-1 answer. The results are shown in Table 2. There is no answer construction mechanism in QA-LSTM, Attentive-LSTM, or Semantic-LSTM, so we simply merged the conclusion and the supplement that each had the highest similarity with the question under each method. QA-LSTM and Attentive LSTM are much worse than Semantic-LSTM. This is because the sentences output by Semantic-LSTM are selected by utilizing the words that are emphasized for the “Love advice” context (i.e., category and titles). Construction is better than Semantic-LSTM since it simultaneously learns the optimum combination of sentences as well as the closeness between the question and sentences. Finally, Our method is better than Construction. This is because it employs the attention mechanism to link conclusion and supplement sentences, and thus its combinations of sentences are more natural than those of Construction. Our method achieved 20% higher accuracy than QA-LSTM (QA-LSTM scored 0.3262 while Our method scored 0.3901).

The computation time for our method was less than two hours. All experiments were performed on NVIDIA TITAN X/Tesla M40 GPUs, and all methods were implemented in Python with the Chainer framework. Thus, our method is well suited to real applications. In fact, it is already being used in the love advice service of Oshiete goo (http://oshiete.goo.ne.jp/ai).

Human evaluation

The outputs of QA-LSTM and Our method were judged by two human experts. The experts entered questions, which were not included in our evaluation datasets, into the AI system and rated the created answers on the following scale: (1) the conclusion and supplement sentences as well as their combination were good; (2) the sentences were good in isolation but their combination was not good; (3) one of the selections (conclusion or supplement) was good but the combination was not good; and (4) neither the sentences nor their combination was good. The answers were judged as good if they satisfied the following two points: (A) the content of the answer sentences corresponds to the question, and (B) the story between conclusion and supplement is natural.

The results are shown in Table 3. Table 4 presents examples of the questions and constructed answers (they were originally in Japanese and have been translated into English for readability; the questions are summarized since the original ones were very long). Readers can also see Japanese answers at the service URL presented above. These results indicate that the experts were much more satisfied with the outputs of Our method than with those of QA-LSTM; 58% of the answers created by Our method were classified as (1). This is because, as can be seen in Table 4, Our method can naturally combine sentences as well as select sentences that match the question. It coped well with questions that were somewhat different from those stored in the evaluation dataset.

In fact, when the public used our love advice service (Karáth (2017); Nakatsuji (2018, 2016)), it was surprising to find that 455 answers created by the AI, named oshi-el (which uses Our method), were judged as Good answers by users, out of the 1,492 questions entered from September 6th to November 5th (the service started on September 6th, 2016). The rate of obtaining Good answers with oshi-el is twice that of the average human user in Oshiete goo when we focus on users who answered more than 100 questions in the love advice category. Thus, we consider this a good result.

6 Conclusion

This is the first study to create answers for non-factoid questions. Our method incorporates the biases of semantics behind questions into word embeddings to improve the accuracy of answer selection. It then simultaneously learns the optimum combination of answer sentences as well as the closeness between questions and sentences. Our evaluation shows that our method achieves 20% higher accuracy in answer construction than a method based on the current best answer selection approach. Our model presents an important direction for future studies on answer generation. Since the sentences themselves in the answer are short, they can be generated by neural conversation models, as we have done recently (Nakatsuji and Okui (2020)); this means that our model can be extended to generate complete answers once the abstract scenario is made.

References

  • M. Bendersky, D. Metzler, and W. B. Croft (2011) Parameterized concept weighting in verbose queries. In Proc. SIGIR’11, pp. 605–614. Cited by: §2.
  • D. Bollegala, M. Alsuhaibani, T. Maehara, and K. Kawarabayashi (2016) Joint word representation learning using a corpus and a semantic lexicon. In Proc. AAAI’16, pp. 2690–2696. Cited by: §2.
  • B. Chen, L. Zhu, D. Kifer, and D. Lee (2010) What is an opinion about? exploring political standpoints using opinion scoring model. In Proc. AAAI’10, pp. 1007–1012. Cited by: §1, §5.1.
  • C. dos Santos, L. Barbosa, D. Bogdanova, and B. Zadrozny (2015) Learning hybrid representations to retrieve semantically equivalent questions. In Proc. ACL-IJCNLP’15, pp. 694–699. Cited by: §2.
  • M. Faruqui, J. Dodge, S. K. Jauhar, C. Dyer, E. Hovy, and N. A. Smith (2015) Retrofitting word vectors to semantic lexicons. In Proc. NAACL HLT’15, pp. 1606–1615. Cited by: §2.
  • M. Feng, B. Xiang, M. R. Glass, L. Wang, and B. Zhou (2015) Applying deep learning to answer selection: A study and an open task. CoRR abs/1508.01585. Cited by: §1, §2.
  • B. Hu, Z. Lu, H. Li, and Q. Chen (2014) Convolutional neural network architectures for matching natural language sentences. In Proc. NIPS’14, pp. 2042–2050. Cited by: §2.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proc. ICML’15, Vol. 37, pp. 448–456. Cited by: §3.
  • R. Johansson and L. Nieto Piña (2015) Embedding a semantic network in a word space. In Proc. NAACL HLT’15, pp. 1428–1433. Cited by: §2.
  • K. Karáth (2017) AI agony aunt gives love advice online. New Scientist 233, pp. 12. External Links: Document Cited by: §5.4.
  • Q. V. Le and T. Mikolov (2014) Distributed representations of sentences and documents. In Proc. ICML’14, pp. 1188–1196. Cited by: §4.1.
  • M. Nakatsuji and S. Okui (2020) Conclusion-supplement answer generation for non-factoid questions. In Proc. AAAI’20, Cited by: §6.
  • M. Nakatsuji (2016) Dear oshieru: ai tips on love, nhk world. https://www3.nhk.or.jp/nhkworld/en/news/editors/7/dearoshieruaitipsonlove/index.html. Cited by: §5.4.
  • M. Nakatsuji (2018) Can ai generate love advice? neural conclusion-supplement answer construction for non-factoid questions. on-demand.gputechconf.com/gtc/2018/video/S8301/. In GPU Technology Conference 2018 – San Jose, CA, Cited by: §5.4.
  • H. Nishikawa, T. Hasegawa, Y. Matsuo, and G. Kikui (2010) Opinion summarization with integer linear programming formulation for sentence extraction and ordering. In Proc. COLING’10, pp. 910–918. Cited by: §1, §5.1.
  • X. Qiu and X. Huang (2015) Convolutional neural tensor network architecture for community-based question answering. In Proc. IJCAI’15, pp. 1305–1311. Cited by: §1, §2.
  • J. Rao and X. Su (2005) A survey of automated web service composition methods. In Proc. SWSWPC’05, pp. 43–54. Cited by: 1st item.
  • D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1988) Neurocomputing: foundations of research. J. A. Anderson and E. Rosenfeld (Eds.), pp. 696–699. Cited by: §4.1.
  • S. Schmidt, S. Schnitzer, and C. Rensing (2014) Domain-independent sentence type classification: examining the scenarios of scientific abstracts and scrum protocols. In Proc. i-KNOW ’14, pp. 5:1–5:8. Cited by: §1, §5.1.
  • I. V. Serban, A. Sordoni, Y. Bengio, A. C. Courville, and J. Pineau (2015) Hierarchical neural network generative models for movie dialogues. CoRR abs/1507.04808. Cited by: §1, §2.
  • M. Tan, C. N. dos Santos, B. Xiang, and B. Zhou (2016) Improved representation learning for question answer matching. In Proc. ACL’16, pp. 464–473. Cited by: §1, §1, §2, 3rd item, 2nd item.
  • M. Tan, B. Xiang, and B. Zhou (2015) LSTM-based deep learning models for non-factoid answer selection. CoRR abs/1511.04108. Cited by: §1, §3, 1st item.
  • O. Vinyals and Q. V. Le (2015) A neural conversational model. CoRR abs/1506.05869. Cited by: §1, §2, §5.1.
  • D. Wang and E. Nyberg (2015) A long short-term memory model for answer sentence selection in question answering. In Proc. ACL-IJCNLP’15, pp. 707–712. Cited by: §1, §2.
  • M. Wang, N. A. Smith, and T. Mitamura (2007) What is the jeopardy model? a quasi-synchronous grammar for qa. In Proc. EMNLP-CoNLL’07, pp. 22–32. Cited by: §1.
  • C. Xu, Y. Bai, J. Bian, B. Gao, G. Wang, X. Liu, and T. Liu (2014) RC-NET: A general framework for incorporating knowledge into word representations. In Proc. CIKM’14, pp. 1219–1228. Cited by: §2.
  • L. Yu, K. M. Hermann, P. Blunsom, and S. Pulman (2014) Deep learning for answer sentence selection. CoRR abs/1412.1632. Cited by: §1, §1, §2.
  • J. Zhang, C. Zong, and S. Li (2008) Sentence type based reordering model for statistical machine translation. In Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1, pp. 1089–1096. Cited by: §1, §5.1.