Most search queries are motivated by some underlying question (Kotov and Zhai, 2010). Today’s users are accustomed to expressing the questions they have in mind using keyword queries (Zhao et al., 2011). Keyword queries, however, can be notoriously ambiguous and may be interpreted in multiple ways. For example, given the keyword query “10th president India,” the question perhaps most users would want to ask is “Who was the 10th President of India?”. Nevertheless, some users may be interested in a particular aspect of the query topic, like “In which year did the 10th President of India leave office?” or “What do people say about the 10th President of India?”. By determining the underlying question, we can obtain a more accurate representation of the user’s information need. This, in turn, can lead to improved retrieval performance and a better overall search experience. We envisage a search interface that allows users to refine their queries with automatically generated natural language questions; see Fig. 1. We note that similar functionality is already offered, for certain queries, in major Web search engines (see Fig. 9). Those services, however, are limited to suggesting existing questions to which answers are known to exist. Importantly, we are not aiming to retrieve existing questions from community question-answering archives (Xue et al., 2008; Gao et al., 2013). Our goal is to automatically generate a natural language question that most likely represents the user’s underlying information need. This is seen as a feedback mechanism that can more naturally engage users in explicitly clarifying their information needs. How those natural language questions are actually utilized in a retrieval system (e.g., via query expansion (Kotov and Zhai, 2010)) is beyond the scope of this study.
In this paper, we address the keyword-to-question (K2Q) task: generating a natural language question from a keyword query. K2Q has attracted considerable attention recently, see, e.g., (Kotov and Zhai, 2010; Dror et al., 2013; Zhao et al., 2011; Zheng et al., 2011). Most existing works employ a template-based approach, where common question patterns are extracted from existing keyword-question pairs. These template-based methods are inherently limited in their ability to generalize to previously unseen queries. Instead, we propose to address the K2Q task using state-of-the-art neural machine translation (sequence-to-sequence) approaches. One challenge we face is that training such neural models requires massive amounts of training data (i.e., hand-labeled keyword-question pairs). While such training data could be mined from query and click logs, there are two main issues. First, such click data is not always available (e.g., in a cold start scenario). Second, it is limited to keyword-question pairs that have received sufficiently many clicks; long-tail queries or newly posted questions will not have that. The above considerations give rise to the main research objective of the present work: How can we generate synthetic data for training a neural machine translation approach for the K2Q task?
We present a novel approach for generating synthetic training data from a seed set of hand-labeled keyword-question pairs, and subsequently use this data for learning neural machine translation models to solve the K2Q task (Sect. 2).
We introduce several generative models for producing synthetic keyword queries from natural language questions (Sect. 3.1).
We develop two filtering mechanisms, which are essential for ensuring that the synthetic training data we feed into the neural network is of high quality (Sect. 3.2).
We evaluate our synthetic data generation approach on the end-to-end K2Q task using both automatic and manual evaluation (Sect. 7).
The overall goal in this paper is to tackle the keyword-to-question (K2Q) problem using neural networks. That is, the task is to translate a keyword query (referred to as keyword for short) to a natural language question (question for short). To be able to use neural networks for this task, massive amounts of training data are needed. The main idea of our paper is to use a small seed set of hand-labeled training data to generate large amounts of synthetic training data. Specifically, the seed training data, D_seed, consists of keyword-question pairs, (k, q). This, along with a large question corpus, Q, is utilized to generate synthetic training data, D_syn, which also consists of keyword-question pairs. The neural machine translation models will then be trained using D_syn. The overview of our framework is shown in Fig. 2. It entails three main steps, which we shall detail below.
First, we train a keyword query generation model (KQGM) using keyword-question pairs from the seed training data. We aim to simulate real users’ querying behavior: given a natural language question, generate a keyword query that a user would likely issue when seeking an answer to that question. We explore various generative models; these have only a few free parameters, which can be easily learned from the seed training data D_seed.
Second, we utilize a large question corpus Q, collected from community question answering forums, and employ the keyword query generation model to generate (a large set of) simulated keyword-question pairs. These will constitute our synthetic training data D_syn. However, since not all the automatically generated keyword-question pairs are of high quality, we employ a keyword query filter (KQF) and a training data filter (TDF). These filters are pivotal elements in our approach; we shall detail them in Sect. 3.2.
Finally, we train a neural machine translation (NMT) model for the K2Q task by feeding it with the synthetic training data D_syn. We consider three neural networks: basic encoder-decoder NMT (Sutskever et al., 2014), NMT with attention mechanism (Bahdanau et al., 2014), and NMT with copying mechanism (Gu et al., 2016). We shall detail these networks in Sect. 4.
3. Synthetic Data Generation
This section details our synthetic training data generation method, which is the most important contribution of this paper. The process takes as input (i) a small seed training data set, consisting of hand-labeled keyword-question pairs, and (ii) a large set of natural language questions. The output is a large set of automatically generated keyword-question pairs, with high enough quality to train robust neural models. Our approach consists of two main components: a keyword query generation model (Sect. 3.1) and filtering mechanisms (Sect. 3.2).
3.1. Keyword Query Generation Model
Prior work has seen successful attempts at generating synthetic queries for web and microblog known-item search, both for evaluation and for training purposes (Azzopardi and de Rijke, 2006; Azzopardi et al., 2007; Berendsen et al., 2013). The overall idea is to construct a generative model that can produce a query, similar to a real query that a user would issue, for finding a particular item. We take the algorithm proposed by Azzopardi et al. (2007) as our starting point (§3.1.1) and extend it at several points to fit our problem setting: (i) we impose a number of restrictions as well as introduce new elements to the generative process (§3.1.2), (ii) we propose a paraphrase-based variation that considers multiple ways of formulating the same question (§3.1.3), and (iii) we add phrase support, so as not to break up meaningful word sequences (§3.1.4).
In known-item search it is assumed that the user wants to find a particular item (document, question, tweet, etc.) that she has seen before in the corpus. Therefore, the user constructs a keyword query by recalling terms that would help her identify this item. In automatic query construction this user behavior is simulated using generative models.
Formally, let us assume that the user seeks to find (recall) the natural language question q. The query length l is selected with probability P(l). Then, a keyword query is constructed by sampling l terms from P(t | θ_q), which is the term model of q. The prior probability distribution P(l) can be easily estimated by considering query lengths in a representative sample (e.g., a query log). The quality of the synthetic queries crucially depends on the distribution P(t | θ_q), as it determines which terms will be sampled. Azzopardi et al. (2007) define P(t | θ_q) using the standard language modeling approach:

P(t | θ_q) = (1 − λ) P(t | q) + λ P(t | C) .    (1)

Accordingly, term generation is a mixture between sampling a term from the given item with probability 1 − λ, and from the corpus with probability λ, where the influence of the collection model is controlled by the smoothing parameter λ. The latter likelihood is calculated using:

P(t | C) = c(t) / Σ_{t′ ∈ V} c(t′) ,    (2)

where c(t) denotes the collection term frequency of term t, and V is the vocabulary of terms in the corpus.
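As an illustration, the smoothed term model can be sketched in a few lines of Python; the toy corpus and the smoothing value are assumptions made only for this example:

```python
from collections import Counter

def collection_model(corpus):
    """P(t|C): maximum-likelihood estimate over the whole corpus, cf. Eq. (2)."""
    counts = Counter(t for question in corpus for t in question.split())
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def term_prob(term, question, p_collection, lam=0.2):
    """Mixture of the item model and the collection model (smoothing weight lam)."""
    tokens = question.split()
    p_item = tokens.count(term) / len(tokens)
    return (1 - lam) * p_item + lam * p_collection.get(term, 0.0)

corpus = ["who wrote winnie the pooh",
          "what happens inside a refracting telescope"]
p_c = collection_model(corpus)
p = term_prob("pooh", corpus[0], p_c, lam=0.2)
```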
To simulate different types of user querying behavior, three plausible term selection strategies have been proposed to estimate P(t | q): (i) popular selection, (ii) discriminative selection, and (iii) their combination (Azzopardi and de Rijke, 2006; Azzopardi et al., 2007).

(i) Popular: Assuming that more frequent terms are more likely to be used as query terms, P(t | q) is calculated by Eq. (3), where c(t, q) is the number of occurrences of t in q:

P(t | q) = c(t, q) / Σ_{t′ ∈ q} c(t′, q) .    (3)

(ii) Discriminative: Assuming that the user may select query terms that can better discriminate the item she is looking for from other items in the corpus, P(t | q) is calculated using Eq. (4), where δ(t, q) is a binary indicator function that is 1 if t occurs in q and 0 otherwise, and P(t | C) is the same as before, cf. Eq. (2):

P(t | q) = (δ(t, q) / P(t | C)) / Σ_{t′ ∈ q} (δ(t′, q) / P(t′ | C)) .    (4)

(iii) Combination: Combining the popular and discriminative strategies into a single model, P(t | q) is calculated by Eq. (5), where df(t) is the document (here: question) frequency of term t and N is the total number of items in the corpus:

P(t | q) = (c(t, q) · log(N / df(t))) / Σ_{t′ ∈ q} (c(t′, q) · log(N / df(t′))) .    (5)
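The three term selection strategies can be sketched as follows; this is a minimal illustration with toy inputs, and normalization details may differ from the original implementation:

```python
import math
from collections import Counter

def popular(term, tokens):
    """Popular selection: maximum-likelihood estimate on the item, cf. Eq. (3)."""
    return tokens.count(term) / len(tokens)

def discriminative(term, tokens, p_collection):
    """Discriminative selection: terms rare in the collection get more mass, cf. Eq. (4)."""
    weights = {t: 1.0 / p_collection[t] for t in set(tokens)}
    return weights.get(term, 0.0) / sum(weights.values())

def combination(term, tokens, df, n_items):
    """Combination: TF-IDF-style weighting of the item's terms, cf. Eq. (5)."""
    weights = {t: tokens.count(t) * math.log(n_items / df[t]) for t in set(tokens)}
    total = sum(weights.values())
    return weights.get(term, 0.0) / total if total > 0 else 0.0

tokens = "who wrote winnie the pooh".split()
df = Counter({"who": 2, "wrote": 2, "winnie": 1, "the": 1, "pooh": 1})
p_pop = popular("the", tokens)
p_disc = discriminative("pooh", tokens, {t: 0.1 for t in tokens})
p_comb_pooh = combination("pooh", tokens, df, n_items=2)
p_comb_who = combination("who", tokens, df, n_items=2)
```

Note how the combination strategy assigns zero probability to "who", which occurs in every item of the toy corpus and therefore has no discriminative power.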
3.1.2. Our Keyword Generation Algorithm
Note that the original algorithm in (Azzopardi et al., 2007) has been developed for known-item (document) search. We need to modify and extend it at several points to be able to use it for the K2Q task we are addressing.
For known-item search, an item is selected randomly from the corpus, and then a keyword query is generated from that item. The process is repeated as many times as the number of queries to be created. In our problem scenario the items are natural language questions, each of which needs to be paired with a keyword query. That is, we do not sample items, but we generate a query for each item in the corpus. This is the first modification we make to the algorithm (line 3 in Algorithm 1).
The second change concerns the length of keyword queries. In (Azzopardi et al., 2007), the length of the query is drawn from a Poisson distribution, with the mean set according to the average length in a set of human-generated queries. For us, the length of the keyword query also depends on the corresponding natural language question. Given a question q with length |q|, it is reasonable to assume that users will always prefer to issue a keyword query that is shorter than the question itself. Thus, we include this additional constraint and sample a query length l with probability P(l), where l < |q| (line 5 in Algorithm 1).
Third, keyword queries typically do not contain question words, such as “how,” “what,” “where,” “who,” “why,” “when,” etc. Thus, we do not sample question words in our generation process.
Fourth, our algorithm does not only sample terms but also samples phrases for generating synthetic queries. Thus, we avoid breaking up word sequences that function together as a meaningful unit. This means that the sampled unit could be either a single term or a phrase in the generative process (line 7 in Algorithm 1). We describe our phrase detection mechanism in §3.1.4.
Fifth, according to our statistics on a sample of queries (the Yahoo! L16 Webscope Dataset, which contains many real keyword queries from users of Yahoo! Answers), only 3.9% of all keyword queries include the same term more than once, suggesting that queries with repeated terms are atypical. Thus, we find it reasonable to avoid sampling the same term more than once in our keyword query generation process (line 9 in Algorithm 1).
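Putting these modifications together, the generation loop of Algorithm 1 can be sketched roughly as follows; the length prior, the term distribution, and the stop list of question words are simplified placeholders rather than the exact implementation:

```python
import random

QUESTION_WORDS = {"how", "what", "where", "who", "why", "when", "which", "whom"}

def generate_query(question, length_prior, term_prob, rng=random):
    """Generate one synthetic keyword query for a natural language question."""
    # Do not sample question words (third modification).
    tokens = [t for t in question.split() if t not in QUESTION_WORDS]
    # The query must be shorter than the question (second modification).
    max_len = min(len(tokens), len(question.split()) - 1)
    lengths = [l for l in length_prior if l <= max_len]
    if not lengths:
        return tokens  # question too short; fall back to all content terms
    length = rng.choices(lengths, weights=[length_prior[l] for l in lengths])[0]
    query, candidates = [], list(dict.fromkeys(tokens))
    while len(query) < length and candidates:
        weights = [term_prob(t, question) for t in candidates]
        if sum(weights) == 0:
            break
        term = rng.choices(candidates, weights=weights)[0]
        query.append(term)
        candidates.remove(term)  # never sample the same term twice (fifth modification)
    return query

length_prior = {3: 0.5, 4: 0.3, 5: 0.2}
question = "what happens inside a refracting telescope"
query = generate_query(question, length_prior, lambda t, _q: 1.0)
```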
3.1.3. Paraphrase-Based Querying Model
Users may use different words to express the same meaning. This should be taken into consideration in the keyword query generation process. Imagine the following case: a particular user has seen the question “Who is the author of the pooh?” in a community question answering forum (e.g., Yahoo! Answers or Quora); then, after several days, she tries to recall the search terms to find an answer to this question. If she still remembers the exact words from the question, she may issue “the pooh author” as a query. Otherwise, she may recollect a paraphrase of the question, like “Who is winnie the pooh’s creator?” and, based on that, formulate the keyword query “winnie the pooh creator.” Furthermore, different users may recall different paraphrases during their querying process. Thus, it is natural to sample terms from paraphrases of the same question when generating keyword queries. We realize this idea by defining the term generation model as a three component mixture:
P(t | θ_q) = λ P(t | q) + γ P(t | R_q) + (1 − λ − γ) P(t | C) ,    (6)

where R_q is a set of paraphrases of question q and P(t | R_q) defines the likelihood of selecting term t from the paraphrases. All paraphrases in R_q are concatenated together into a single large document; then P(t | R_q) may be calculated by one of the three strategies we described in the previous section. The model in Eq. (6) has two parameters, λ and γ. As λ tends to one, it assumes that the user definitely remembers the terms of the original question. As γ tends to one, it assumes that the user does not recall the terms from the original question but knows how to paraphrase it. As both λ and γ tend to zero, it means that the user knows that the question exists but does not remember any terms from the original question nor from any of its paraphrases.
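The three-component mixture of Eq. (6) can be sketched as follows, using simple maximum-likelihood estimates for the component models (the mixture weights and inputs are toy assumptions):

```python
def paraphrase_term_prob(term, question, paraphrases, p_collection,
                         lam=0.6, gamma=0.3):
    """Three-component mixture of Eq. (6); requires lam + gamma <= 1."""
    q_tokens = question.split()
    # All paraphrases are concatenated into one pseudo-document.
    r_tokens = [t for p in paraphrases for t in p.split()]
    p_q = q_tokens.count(term) / len(q_tokens)
    p_r = r_tokens.count(term) / len(r_tokens) if r_tokens else 0.0
    p_c = p_collection.get(term, 0.0)
    return lam * p_q + gamma * p_r + (1 - lam - gamma) * p_c

question = "who is the author of the pooh"
paraphrases = ["who is winnie the pooh's creator"]
p_creator = paraphrase_term_prob("creator", question, paraphrases, p_collection={})
p_author = paraphrase_term_prob("author", question, paraphrases, p_collection={})
```

The term "creator" receives non-zero probability even though it does not occur in the original question, because it occurs in a paraphrase.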
3.1.4. Phrase Detection
We sample not only terms but also phrases, in order to avoid breaking up continuous word sequences that constitute meaningful units. Specifically, we follow the method proposed by Mikolov et al. (2013) for detecting phrases. Words that belong to the same phrase are grouped together into a new term. For example, the question “how fast is a 2004 honda crf 230” is converted to “how fast is a 2004 honda_crf_230” after phrase detection. This way, KQGM is able to directly sample honda_crf_230, instead of sampling three independent terms.
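The phrase scoring rule of Mikolov et al. (2013) joins adjacent word pairs whose score exceeds a threshold; a minimal single-pass sketch follows, where the discount and threshold values are illustrative, not the tuned settings:

```python
from collections import Counter

def detect_phrases(sentences, delta=1, threshold=0.0001):
    """Join bigrams scoring (count(a,b) - delta) / (count(a) * count(b))
    above the threshold, following Mikolov et al. (2013)."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = s.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    phrases = {(a, b) for (a, b), c in bigrams.items()
               if (c - delta) / (unigrams[a] * unigrams[b]) > threshold}
    out = []
    for s in sentences:
        tokens, merged, i = s.split(), [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
                merged.append(tokens[i] + "_" + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        out.append(" ".join(merged))
    return out

questions = ["how fast is a 2004 honda crf",
             "honda crf top speed",
             "what oil for honda crf"]
out = detect_phrases(questions)
```

Only the frequently co-occurring pair "honda crf" is merged; one-off bigrams score zero after the discount. (Mikolov et al. run several passes to build longer phrases such as honda_crf_230; a single pass suffices for illustration.)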
3.2. Filtering Mechanisms
To ensure that high-quality synthetic data is generated for training neural translation models, we propose two filtering mechanisms. One operates on the level of individual questions and selects the best keyword query, from a pool of candidate queries generated for a given question (§3.2.1). The other filter is applied over the entire set of synthetic query-question pairs and filters out low-quality instances (§3.2.2).
3.2.1. Keyword Query Filter
Given the probabilistic nature of query length selection (line 5 in Algorithm 1) and term selection (line 7 in Algorithm 1), the keyword query generation model may produce very different keyword queries for the same question. These queries may vary considerably in quality, from appropriate to inadequate. For example, given the question “what happens inside a refracting telescope,” the query generation model can give rise to a good keyword query, “happens inside refracting telescope,” or to a rather bad one, “inside colors type,” using the very same parameters.
The idea is to remedy this behavior by generating, for each question, a set of candidate keyword queries (i.e., running the model multiple times), and then selecting the single most suitable query. We propose to achieve this using a so-called keyword query filter (KQF), shown in Fig. 3. The intuition behind this ranking-based filtering approach is that the better the generated keyword query is, the more effectively it can retrieve the original question from the question corpus. (It is worth pointing out that our algorithm always generates a keyword query that is shorter than the corresponding question, i.e., it is never identical to the question.)
We start with generating a set of candidate keyword queries for a given question q using KQGM. Then, we issue each candidate keyword query k against an index containing all questions in our corpus, and retrieve the top-n highest scoring questions, L_k. Specifically, we employ the Sequential Dependence Model (SDM) retrieval method (Metzler and Croft, 2005). Finally, we select the best candidate keyword query for the input question q according to its reciprocal rank:

score(k) = 1 / rank(q, L_k) ,    (7)

where rank(q, L_k) is the rank of q in the ranked list L_k.
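KQF can be sketched as follows; the `search` function here is a toy stand-in for the SDM-based question index:

```python
def best_candidate(question, candidates, search, top_n=10):
    """Pick the candidate query that retrieves the original question at the
    highest rank (reciprocal-rank scoring)."""
    def score(query):
        ranking = search(query)[:top_n]
        return 1.0 / (ranking.index(question) + 1) if question in ranking else 0.0
    return max(candidates, key=score)

corpus = ["what happens inside a refracting telescope",
          "what colors are inside a rainbow"]

def overlap_search(query):
    # Toy retrieval: rank questions by term overlap with the query.
    terms = set(query.split())
    return sorted(corpus, key=lambda q: -len(terms & set(q.split())))

best = best_candidate(corpus[0],
                      ["happens inside refracting telescope", "inside colors type"],
                      overlap_search)
```

The good candidate retrieves its source question at rank 1 and is therefore selected over the bad one, which ranks the wrong question first.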
3.2.2. Training Data Filter
Even after applying the keyword query filter, there may still exist low-quality training instances in D_syn, which would misdirect the learning process. Therefore, we propose a training data filter (TDF) to filter out low quality instances. TDF, shown in Fig. 4, takes a set of synthetic query-question pairs as input, and returns a subset that contains the top-m pairs with the highest quality score. We use retrieval precision as a quality indicator, which expresses to what extent k is a proper keyword query for question q:

prec(k, q) = |L_k ∩ ({q} ∪ R_q)| / |L_k| ,    (8)

where L_k denotes the set of questions retrieved by the keyword query k using the SDM retrieval method (Metzler and Croft, 2005), and R_q denotes the set of paraphrase questions for q; the numerator thus counts the relevant questions retrieved. In short, TDF ranks all generated query-question pairs according to prec(k, q), then selects the top-m highest scoring ones to form the filtered subset D_syn.
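TDF can be sketched in the same spirit; again, `search` is a stand-in for SDM retrieval, and the inputs are toy placeholders:

```python
def precision(query, question, paraphrases, search, top_n=10):
    """Fraction of retrieved questions that are the original question or one
    of its paraphrases (retrieval precision)."""
    retrieved = search(query)[:top_n]
    if not retrieved:
        return 0.0
    relevant = {question} | set(paraphrases)
    return sum(1 for q in retrieved if q in relevant) / len(retrieved)

def filter_pairs(pairs, paraphrase_of, search, m):
    """Keep the m synthetic (query, question) pairs with the highest precision."""
    scored = sorted(pairs,
                    key=lambda p: precision(p[0], p[1], paraphrase_of[p[1]], search),
                    reverse=True)
    return scored[:m]

pairs = [("good query", "q1"), ("bad query", "q2")]
paraphrase_of = {"q1": [], "q2": []}

def toy_search(query):
    return {"good query": ["q1", "q3"], "bad query": ["q3", "q4"]}[query]

kept = filter_pairs(pairs, paraphrase_of, toy_search, m=1)
```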
4. Neural Machine Translation
Neural machine translation (NMT) aims to directly model the conditional probability of translating a source sequence x to a target sequence y. The K2Q task thus lends itself naturally to an NMT formulation, by taking the keyword query as the source sequence x and the natural language question as the target sequence y. In the rest of this section, we detail the three NMT networks we use in our experiments.
4.1. Encoder-Decoder NMT
The classical architecture of NMT is the encoder-decoder recurrent neural network (RNN) (Sutskever et al., 2014), which consists of two components:

(i) Encoder, an RNN that computes a context vector representation c for the input sequence, x = (x_1, …, x_T), by iterating the following equation:

h_t = f(x_t, h_{t−1}) ,

where x_t is a one-hot representation of the t-th word in the input sequence, and h_t is the hidden state vector of the encoder RNN at time t. The activation function f can be a Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) or Gated Recurrent Unit (GRU) (Chung et al., 2014). The context vector is defined by c = φ(h_1, …, h_T), which is an operation on all hidden states. In this paper, φ indicates an operation choosing the last hidden state h_T.

(ii) Decoder, another RNN to decompress the context vector and output the target sequence, y = (y_1, …, y_{T′}), through a conditional language model:

s_t = f(y_{t−1}, s_{t−1}, c) ,    P(y_t | y_{<t}, x) = g(y_{t−1}, s_t, c) ,    (12)

where y_t is a one-hot representation of the t-th word in the output sequence; s_t denotes the hidden state vector of the decoder RNN at time t; f can be the same as the encoder activation function, or a different non-linear activation function; g is a softmax classifier. Given a set of keyword-question pairs, the encoder and decoder are jointly trained to maximize the conditional log-likelihood.
4.2. Attention Mechanism
The attention mechanism was first studied for visual tasks, and has recently also been successfully applied in natural language processing (Bahdanau et al., 2014; Yin et al., 2016). The basic idea behind it is that humans pay attention to specific parts, rather than the whole input, when performing visual and linguistic tasks. Attentional NMT (Bahdanau et al., 2014) uses a dynamically changing context vector c_t instead of a fixed context vector c during the decoding process. The dynamically changing context vector is computed as a weighted sum of the source hidden states according to:

c_t = Σ_{i=1}^{T} α_{ti} h_i ,    α_{ti} = exp(a(s_{t−1}, h_i)) / Σ_{j=1}^{T} exp(a(s_{t−1}, h_j)) ,

where a is an attention function that scores the corresponding attentional strength. Usually, a is parameterized with a feedforward neural network. Further, h_i denotes the hidden state of the encoder at time i, and α_{ti} denotes the attentional strength with which the target word y_t is related to the source word x_i.
4.3. Copying Mechanism
The copying mechanism was first proposed by Gu et al. (2016) for handling out-of-vocabulary words, by selecting appropriate words from the input text. We employ the copying mechanism to assign higher probability to words that appear in the input text. This way we naturally capture the fact that questions tend to keep important words from the keyword query. By incorporating the copying mechanism into NMT, the probability of generating word y_t in the output sequence becomes:

P(y_t | s_t, y_{t−1}, c_t) = P_g(y_t | s_t, y_{t−1}, c_t) + P_c(y_t | s_t, y_{t−1}, c_t) .

The first part is the probability of generating the term from the vocabulary (cf. Eq. (12)). The second component is the probability of copying it from the source sequence:

P_c(y_t | s_t, y_{t−1}, c_t) = (1/Z) Σ_{j: x_j = y_t} exp(σ(h_j W_c) s_t) ,

where x denotes all words in the source sequence, σ is a non-linear function, W_c is a learned parameter matrix, and Z is the normalization term shared with the generative component. We refer to (Gu et al., 2016) for further details.
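The generate/copy combination can be illustrated numerically; the scores below are toy stand-ins for the learned generation and copy scoring networks, but the shared-softmax combination follows Gu et al. (2016):

```python
import math

def output_distribution(gen_scores, copy_scores, source_ids, vocab_size):
    """Combine generation scores (over the vocabulary) with copy scores (over
    source positions) in one shared softmax, as in Gu et al. (2016)."""
    logits = list(gen_scores) + list(copy_scores)
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    p = [e / z for e in exps[:vocab_size]]
    # Copy probability mass accrues to the vocabulary entry of each source word.
    for pos, word_id in enumerate(source_ids):
        p[word_id] += exps[vocab_size + pos] / z
    return p

# Toy example: vocabulary of 5 words; the source contains words 1 and 3.
p = output_distribution([0.0] * 5, [1.0, 1.0], source_ids=[1, 3], vocab_size=5)
```

Words that occur in the source sequence (here, ids 1 and 3) end up with higher output probability than words that can only be generated from the vocabulary.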
5. Dataset
Our approach needs a small set of hand-labeled keyword-question pairs and a large set of questions. We obtain these two datasets from WikiAnswers (http://knowitall.cs.washington.edu/oqa/data/wikianswers/). WikiAnswers includes millions of questions asked by humans. Users have also identified groups of questions that are paraphrases of each other. These groups are considered paraphrase clusters (Fader et al., 2014).
Since we only care about natural language questions in this work, we employ the heuristics proposed by Dror et al. (2013) to filter out non-natural language questions. Specifically, we keep only questions that start with “WH words” or auxiliary verbs. Additionally, we restrict ourselves to questions consisting of 5-12 terms (the most frequent question lengths), based on question length distribution statistics of WikiAnswers, see Fig. 5(a). We end up with 3,168,878 paraphrase clusters, with 26.05 questions per cluster on average. In the remainder of the paper, when we write WikiAnswers, we refer to this preprocessed subset of the collection.
Small Set of Keyword-Question Pairs (D_seed)
In order to get the small set of hand-labeled keyword-question pairs, we randomly pick 200 clusters from the 3,168,878 paraphrase clusters. From each of those paraphrase clusters, we sample five questions randomly. We employ five human annotators, who each receive only one question from each of the 200 paraphrase clusters. The annotators then manually create keyword queries from their questions. We thus obtain 200 paraphrase clusters, each with five question paraphrases and corresponding keyword queries (where each paraphrase is labeled by a different annotator), for a total of 1,000 hand-labeled pairs.
Large Set of Questions (Q)
To get the large set of questions, we randomly sample a single question from each of the remaining paraphrase clusters. This amounts to 3,168,678 questions. The hand-labeled questions do not appear in this set.
6. Experimental Setup
6.1. Keyword Query Generation Model
The following settings are used in our experiments:
Query length: The prior probability P(l) of query length is calculated based on the small set of (hand-labeled) keyword-question pairs. According to statistics on user keyword queries from the Yahoo! L16 Webscope Dataset, most keyword queries contain between 3 and 7 terms, see Fig. 5(b). Thus, we only sample queries with length 3 ≤ l ≤ 7.
Collection Language Model: The collection language model probability P(t | C) is computed based on the WikiAnswers dataset. For the paraphrase-based model, we need to know the paraphrases R_q for a given question q. In our dataset, this is readily available. We note that there also exist methods to detect paraphrases automatically (Bogdanova et al., 2015; Jiang et al., 2017).
6.2. Filtering Mechanisms
For our filters, we use the following settings:
Keyword query filter (§3.2.1): We generate a pool of candidate keyword queries for each question q in the large set of questions Q using KQGM. The best of these is selected by KQF to be paired with q.
Training data filter (§3.2.2): For a keyword-question pair (k, q), we retrieve the top-n questions using k and obtain the paraphrases R_q from the paraphrase cluster of q.
6.3. Neural Networks
We implement the following three networks:
EDNet: Basic encoder-decoder NMT network.
AttNet: EDNet with attention mechanism.
CopyNet: AttNet plus the copying mechanism.
For all three networks, we choose the top 44K most frequent words in WikiAnswers as our vocabulary. We set the embedding dimension to 100, and initialize the word embeddings randomly with a uniform distribution in [−0.1, 0.1]. We set the number of layers of both encoder and decoder RNNs to 1. Further, we use a bidirectional GRU (Bahdanau et al., 2014) unit with size 200 for the encoder RNN, and a GRU unit with size 400 for the decoder RNN. All networks are optimized using Adam (Kingma and Ba, 2014), with gradient clipping and dropout applied.
6.4. Preliminary Study
Our synthetic data generation heavily depends on the generative model for creating keyword queries. Thus, we perform a preliminary study, using the small set of keyword-question pairs, , to analyze the performance of various KQGM configurations. Informed by this analysis, we can decide which of the three term selection strategies to use for KQGM in our main experiments.
6.4.1. Evaluation Metrics.
We use automatic metrics from text summarization, specifically, the widely used ROUGE-L metric (Lin, 2004). ROUGE-L not only awards credit to in-sequence unigram matches, but also captures word order in a natural way. Thus, it can effectively measure the degree of match between the synthetic and ground truth keyword queries. Recall that in our dataset, we have a set of paraphrases for each question. We wish to consider those paraphrases as well in our evaluation. Formally, let k̂ denote the generated keyword query corresponding to question q; R_q denotes the paraphrase cluster of q; K_q is the set of ground truth keyword queries corresponding to R_q. For scoring k̂, we consider the set of ground truth keywords in two different ways: (i) by computing the average ROUGE-L between k̂ and each ground truth keyword query (Eq. (17)), and (ii) by considering only the best (highest scoring) ground truth keyword query (Eq. (18)):

AvgRougeL(k̂) = (1/|K_q|) Σ_{k ∈ K_q} ROUGE-L(k̂, k) ,    (17)

MaxRougeL(k̂) = max_{k ∈ K_q} ROUGE-L(k̂, k) .    (18)
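The two scoring variants can be sketched with an LCS-based ROUGE-L F-score; this simplified version omits the β-weighting of the official ROUGE implementation:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    """Simplified ROUGE-L F-score over whitespace tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

def avg_rouge_l(candidate, references):
    """Average match against all ground truth queries, cf. Eq. (17)."""
    return sum(rouge_l(candidate, r) for r in references) / len(references)

def max_rouge_l(candidate, references):
    """Best match against any ground truth query, cf. Eq. (18)."""
    return max(rouge_l(candidate, r) for r in references)

refs = ["10th president india", "india 10th president"]
avg = avg_rouge_l("10th president india", refs)
best = max_rouge_l("10th president india", refs)
```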
We employ five-fold cross-validation for evaluation. To eliminate the effects of the randomness involved in the process, we repeat the evaluation 100 times, and report the means and standard deviations.
Table 1 shows the evaluation results for all KQGM configurations. Comparing the three term selection strategies (§3.1.1), we find that the Combination strategy always attains the best performance. With the same term selection strategy and KQGM, phrase detection brings noticeable improvements in both AvgRougeL and MaxRougeL (+5.28% and +4.22%, respectively). Comparing the paraphrase-based model with the baseline model, the former brings +10.66% improvements on average for AvgRougeL and +7.16% on average for MaxRougeL. The paraphrase-based model with phrase detection achieves the best overall performance, with 0.2521 AvgRougeL and 0.3843 MaxRougeL, which is superior to the best baseline configuration.
Table 1. Evaluation of KQGM configurations; AvgRougeL and MaxRougeL, with standard deviations in parentheses.
|Model||AvgRougeL||MaxRougeL|
|Baseline model|
|Popular||0.1956 (0.0934)||0.3197 (0.1266)|
|Discriminative||0.1877 (0.1049)||0.2999 (0.1421)|
|Combination||0.2240 (0.0953)||0.3522 (0.1331)|
|Baseline model + phrase detection|
|Popular||0.2069 (0.1008)||0.3354 (0.1342)|
|Discriminative||0.2062 (0.1106)||0.3243 (0.1465)|
|Combination||0.2373 (0.1019)||0.3708 (0.1399)|
|Paraphrase-based model|
|Popular||0.2125 (0.0930)||0.3390 (0.1250)|
|Discriminative||0.2266 (0.1017)||0.3458 (0.1367)|
|Combination||0.2435 (0.0956)||0.3734 (0.1330)|
|Paraphrase-based model + phrase detection|
|Popular||0.2182 (0.1001)||0.3476 (0.1322)|
|Discriminative||0.2355 (0.1020)||0.3513 (0.1361)|
|Combination||0.2521 (0.1009)||0.3843 (0.1374)|
We test what influence the free parameters have on the performance of KQGMs. For the baseline model, we find that both AvgRougeL and MaxRougeL decrease as λ increases, see Fig. 6. For the paraphrase-based model, we find that both AvgRougeL and MaxRougeL increase with higher λ and γ values, see Fig. 6. This is not unexpected, since users prefer to use terms from the given question for the keyword query.
6.4.4. Observed Errors.
Based on manual inspection of synthetic keyword-question pairs, we find that the most prominent flaw in our synthetic data is extraneous terms in the keyword queries generated by KQGM. For example, given the question “what is usage of erw pipe,” our KQGM generates the keyword query “erw pipe usage made meant,” where “made meant” are unnecessary terms.
6.5. Implemented Systems
6.5.1. Baseline systems.
We implement the SDM retrieval model (Metzler and Croft, 2005) and the state-of-the-art template-based method (TBM) (Dror et al., 2013) as baselines. The template-based K2Q method requires millions of hand-labeled keyword-question pairs from a query log, which we do not have access to. Thus, we use our simulated keyword-question pairs instead of hand-labeled keyword-question pairs and compute term similarity using word2vec vectors, instead of TF-IDF weighted context vectors. For the baseline systems, we retrieve the best matching question for each keyword query.
6.5.2. Neural systems.
We train a neural network model with synthetic data, then feed the keyword query into the trained neural network model, to generate the most probable question. Specifically, we use the best KQGM configuration (paraphrase-based model with combination selection strategy and phrase detection), along with the keyword query filter to generate synthetic data (a total of 3,168,678 keyword-question pairs). Then, we use the training data filter to rank all keyword-question pairs.
7. Experimental Results
This section reports our evaluation results for the K2Q task. First, in Sect. 7.1, we measure the quality of the generated questions using machine translation metrics. Then, in Sect. 7.2, we employ human judges to assess a sample of questions along two dimensions: relevance and grammar.
7.1. Automatic Evaluation
We use D_seed, which comprises 1,000 hand-labeled keyword-question pairs, for the automatic evaluation of our K2Q methods. Note that these keyword-question pairs have not been used for training the neural K2Q models; therefore, it is appropriate to use D_seed as a test dataset. We report on widely-used machine translation metrics: BLEU (Papineni et al., 2002) and different variants of ROUGE (Lin, 2004).
Table 2 presents the evaluation result for the baseline systems and for the three neural networks. Clearly, all NMT approaches perform better than the SDM baseline. As expected, the template-based method performs better than SDM, but it is still far behind CopyNet, which is the best neural method. Compared with the basic encoder-decoder NMT network, we find that the attention mechanism brings in noticeable improvements in ROUGE-L (+13.99%), ROUGE-1 (+9.78%), ROUGE-2 (+16.76%) and in BLEU (+20.59%) scores. Because of the extraneous terms issue (cf. §6.4.4) in our synthetic data, the attention mechanism plays a very important role in skipping those terms (by assigning small weights to extraneous terms in the decoding process). Additionally, the copying mechanism brings further minor improvements in ROUGE-L (+3.44%), ROUGE-1 (+5.67%), ROUGE-2 (+5.18%) and BLEU (+1.25%).
We seek to gain a better understanding of how the different elements of our synthetic data generation approach contribute to end-to-end performance on the K2Q task. For that reason, we train the best performing neural model (CopyNet) using different configurations for generating synthetic training data. We add components one by one, to see how they affect performance. Additionally, we vary the amount of training data used between 0.5M and 3M pairs. The results are shown in Fig. 7.
Baseline: Baseline KQGM with the Combination term selection strategy (§3.1.1).
Par: Paraphrase-based KQGM with the Combination term selection strategy (§3.1.3).
Par+Ph: Phrase detection added on top (§3.1.4).
Par+Ph+KQF: Keyword query filter added on top (§3.2.1).
Par+Ph+KQF+TDF: Training data filter employed on top (§3.2.2).
The first three methods do not involve the keyword query filter. In those cases, we generate 20 candidate keyword queries for a given question and randomly select one of those. Only the last method uses TDF, which is a mechanism to select the top-m highest quality training instances (keyword-question pairs) into D_syn. For the other methods, we randomly select instances from the entire synthetic training data set to form D_syn. We run methods that involve randomization three times and report the means.
From Fig. 7, we make the following observations. First, we find the results similar to those of the KQGM evaluation in Table 1. Among the three KQGMs, the Par+Ph model performs best. The paraphrase-based KQGM brings noticeable improvements over the baseline KQGM in both ROUGE-L (+6.37% on average) and BLEU (+11.4% on average), while adding phrase detection on top of that only brings minor improvements in ROUGE-L (+0.13% on average) and BLEU (+0.71% on average).
Second, comparing the results of Par+Ph and Par+Ph+KQF, we find that the keyword query filter brings noticeable improvements in both ROUGE-L (+13.4% on average) and BLEU (+16.3% on average). Notice that by adding the keyword query filter, the performance of neural models improves with the size of the training data. Thus, the keyword query filter is an essential element in our synthetic data generation approach.
Third, we find that Par+Ph+KQF+TDF almost always performs better than Par+Ph+KQF, demonstrating that our training data filter is able to estimate the quality of the generated keyword-question pairs and feed high-quality training instances into the neural networks. One noticeable exception (for both BLEU and ROUGE-L) is the leftmost data point (0.5M training pairs), where the performance of Par+Ph+KQF+TDF falls well below that of Par+Ph+KQF. A further analysis reveals that this is caused by an “insufficient vocabulary” issue. This is illustrated in Fig. 8, where we plot the fraction of the total vocabulary (i.e., unique words in the full synthetic data set) present in the training subset. We observe that with only 0.5M training instances, the Par+Ph+KQF+TDF model has built up only 74% of the vocabulary, as opposed to 94% for the Par+Ph+KQF model. Our training data filter, based on a retrieval method, performs well with frequent terms but fails on rare terms. It appears that the TDF quality score estimator overvalues common terms and undervalues rare terms when selecting the subset of instances for training.
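The vocabulary-coverage statistic plotted in Fig. 8 can be computed with a simple sketch like the following (whitespace tokenization is an assumption made for illustration):

```python
def vocab_coverage(full_pairs, subset_pairs):
    """Fraction of the unique words in the full synthetic data set
    that also appear in the selected training subset.

    Both arguments are lists of (keyword, question) string pairs.
    """
    def vocab(pairs):
        words = set()
        for keyword, question in pairs:
            words.update(keyword.split())
            words.update(question.split())
        return words

    full, sub = vocab(full_pairs), vocab(subset_pairs)
    return len(full & sub) / len(full)
```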
Finally, as expected with TDF, using the highest-quality training instances first greatly benefits performance; see the Par+Ph+KQF+TDF model in the 0.5M–1.5M range. In contrast, the last half million training instances yield little to no improvement. These results suggest that creating more high-quality keyword-question pairs could bring further improvements for neural K2Q models.
7.2. Manual Evaluation
We also perform a manual evaluation using a sample of 87 real user keyword queries with low query clarity from the Yahoo! Webscope L16 Dataset. (The dataset assigns each query a clarity score between a “vague” and a “clear” extreme; we only sample queries whose clarity falls below a threshold.) All these queries originate from the query log of Yahoo! Answers. For each keyword query, we generate five questions, each with a different method: the SDM and TBM baselines, and the three neural networks.
Three human raters were asked to assess each question along two dimensions: (i) Relevance, which indicates whether the question is relevant to the keyword content-wise (ignoring grammar mistakes), and (ii) Grammar, which reflects grammatical correctness. Table 3 shows our rating scheme. Raters were further asked to choose the best generated question from among the five alternatives. The number of wins was then aggregated for each of the five methods. If multiple methods generated the same question, the point was awarded to each of them.
Relevance
  2  The question is meaningful and matches the given keyword
  1  The question matches the given keyword, more or less
  0  The question either does not make sense or does not match the given keyword

Grammar
  2  No grammar errors in the question; it can be understood completely
  1  A few grammar errors in the question, but it can be understood
  0  Too many grammar errors in the question; it cannot be understood

Inter-rater agreement (Cohen’s kappa): 0.499 (Relevance), 0.498 (Grammar), 0.637 (best-question choice)
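The tie-aware win counting described above can be sketched as follows (the data layout is hypothetical; a rater’s choice is matched against each method’s output verbatim, so every method that produced the chosen question earns the point):

```python
from collections import Counter


def count_wins(judgments):
    """Aggregate 'best question' votes over all queries.

    judgments: list of (methods, best_question) pairs, where methods maps a
    method name to the question it generated, and best_question is the text
    the rater picked as best for that query.
    """
    wins = Counter()
    for methods, best_question in judgments:
        for method, question in methods.items():
            if question == best_question:
                wins[method] += 1
    return wins
```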
Table 4 shows the results of the human evaluation; the reported scores are means. As expected, the SDM method scores highest on grammar, since it retrieves existing questions from the corpus. However, it achieves a very low score on relevance, since it can only retrieve questions that have been asked before (i.e., exist in the corpus). As in the automatic evaluation, the attention mechanism brings substantial improvements over the simple Encoder-Decoder model (both in terms of relevance and grammar). As anticipated, the copying mechanism leads to large improvements in terms of relevance (+40.3%); at the same time, the grammar score of CopyNet is only marginally lower than that of AttNet.
7.2.2. Case study
Table 5 provides some examples of generated questions. Clearly, SDM returns grammatically correct but often irrelevant questions. CopyNet captures the meaning of the keyword query and generates somewhat monotonous but very relevant questions. The other two neural networks seem to capture the query intent only partially, and drift off in directions that are somewhat related to the topic of the query yet irrelevant, e.g., “What are cute boobs?” and “What is the most popular in england?”
Keyword 1: “cute yaoi animes”
  [S] Do girls watch yaoi anime?
  [T] Is it cute when yaoi are animes?
  [E] Are there a good animes are cute?
  [A] What are cute boobs?
  [C] What are cute yaoi animes?

Keyword 2: “average price movie ticket 1987”
  [S] What is the average ticket price for a super bowl ticket?
  [T] What is the average price of movie ticket 1987?
  [E] What is the average price for a 1987 ticket in 1987?
  [A] What is the average price for a movie ticket?
  [C] What is the average price of the movie ticket in 1987?

Keyword 3: “popular jbs england”
  [S] How big are jbs feet?
  [T] Who are popular sovereignty and jbs england related?
  [E] What is the most popular in england?
  [A] Who is popular in england?
  [C] How popular is jbs in england?
7.2.3. Comparison to Google
The “People also ask” service, provided by Google and illustrated in Fig. 9, is somewhat similar to our keyword-to-question task. Therefore, we considered including it in our evaluation as a baseline. When running our test queries through Google, we noticed that the “People also ask” panel is triggered for only 34% of our queries. As we pointed out in the introduction, for each query we wish to generate a natural language question that most likely represents the user’s underlying information need. Google’s service appears to address a fundamentally different task: suggesting questions related to the query that are asked by sufficiently many people and to which answers are known to exist. For these reasons, we do not compare our methods against this service.
8. Related work
In this section we review related research from two areas: keyword-to-question and synthetic data generation.
8.1. Keyword to Question
Relevant work on K2Q systems includes (Kotov and Zhai, 2010; Zheng et al., 2011; Zhao et al., 2011; Dror et al., 2013). All these systems follow a template-based approach, and are evaluated in terms of relevance, diversity, and grammatical correctness. While some differences exist among these systems, they all consist of three main steps. First, they extract question templates from millions of keyword-question pairs by substituting keyword terms in questions with slots, and store the resulting keyword-template pairs in a database. Second, given a new keyword query, they retrieve similar keyword queries from the database, collect the templates associated with those queries, and instantiate the templates with the new query to generate candidate questions. Finally, a parameterized ranking model calculates the probability of each candidate question being generated by the query, and ranks all candidates. Instead of template-based methods, we propose to address the K2Q task using state-of-the-art neural machine translation approaches.
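A toy illustration of the template extraction and instantiation steps might look as follows (heavily simplified relative to the cited systems, which match queries by similarity and score many candidate templates; slot numbering and whitespace tokenization are assumptions):

```python
def extract_template(keyword, question):
    """Replace keyword terms in the question with numbered slots,
    yielding a reusable question template."""
    terms = keyword.lower().split()
    words = question.lower().rstrip("?").split()
    slotted = [f"<{terms.index(w)}>" if w in terms else w for w in words]
    return " ".join(slotted) + " ?"


def instantiate(template, keyword):
    """Fill a template's slots with the terms of a new keyword query."""
    terms = keyword.lower().split()
    words = [terms[int(w[1:-1])] if w.startswith("<") and w.endswith(">") else w
             for w in template.split()]
    return " ".join(words[:-1]) + "?"
```

For example, the pair (“10th president india”, “Who was the 10th President of India?”) yields the template “who was the &lt;0&gt; &lt;1&gt; of &lt;2&gt; ?”, which can then be instantiated with a new query such as “5th king norway”.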
8.2. Synthetic Data Generation
The idea of automatically generating synthetic data (pseudo test collections) for information retrieval (IR) has attracted some attention in past years (Azzopardi et al., 2007; Berendsen et al., 2013). To the best of our knowledge, utilizing simulated queries for evaluating IR was first suggested by Azzopardi and de Rijke (2006), who also proposed an algorithm for generating simulated queries (Azzopardi and de Rijke, 2006; Azzopardi et al., 2007). Their experimental results show that it is possible to generate simulated queries for web search with performance comparable to that of manual queries. The idea of generating pseudo test collections has also been utilized for training supervised (learning-to-rank) retrieval models for web search (Asadi et al., 2011) and for ad-hoc search on domain-specific, semi-structured documents (Berendsen et al., 2012, 2013). Automatically generated synthetic training data for deep learning has likewise accomplished a great deal in computer vision (Handa et al., 2015; Zhang et al., 2015; Gan et al., 2015; Ros et al., 2016). Even though synthetic data is imperfect, these efforts show the feasibility of training robust and effective neural network models with noisy but very large-scale data. In this paper, we have proposed a keyword query generation model and developed various filtering mechanisms in order to create synthetic training data for neural K2Q models.
9. Conclusion
In this work, we have studied the problem of translating keyword queries to natural language questions using neural approaches. To the best of our knowledge, this is the first application of neural machine translation methods to the keyword-to-question (K2Q) task. Perhaps the most innovative aspect of this work is the combination of keyword query generation models with various filtering mechanisms to create massive amounts of synthetic data for training neural models. Our empirical evaluation has demonstrated the effectiveness of our synthetic data generation approach for the K2Q task.
In this paper, we have generated only a single question for each keyword query, and evaluated it with respect to relevance and grammatical correctness. The same neural models, however, may also be used to generate a diverse list of questions for a given keyword query, with the help of techniques like beam search (Freitag and Al-Onaizan, 2017). For example, given the keyword query “Bible verse about education,” our neural models generated a range of diverse and meaningful questions, including:
“What is the fugitive slave verse about education?”
“What is the christ verse about education?”
“What is the sacred verse about education?”
“What does Bible verse say about education?”
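Beam search itself is straightforward to sketch: keeping several partial hypotheses per decoding step is what allows a decoder to surface multiple plausible questions instead of a single one. The following generic decoder is a simplified illustration (the `next_probs` interface is an assumption for exposition, not our model’s API):

```python
import math


def beam_search(start, next_probs, beam_size, max_len, end="</s>"):
    """Generic beam search decoder sketch.

    next_probs(prefix) returns a dict of token -> probability for the next
    position. At each step, the beam_size highest-scoring extensions are
    kept; hypotheses that emit the end token are moved to the result list.
    Returns finished hypotheses as (token list, log-probability) pairs,
    best first.
    """
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, p in next_probs(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        candidates.sort(key=lambda b: b[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == end else beams).append((seq, score))
        if not beams:
            break
    return sorted(finished, key=lambda b: b[1], reverse=True)
```

With a beam size of k, up to k complete questions are returned, ordered by log-probability, which directly supports producing a ranked list of diverse questions per keyword query.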
In the future, we are interested in generating a diverse set of questions (i.e., the bottom part in Fig. 1) and comparing these with existing template-based methods with respect to diversity.
In summary, our methods have shown great potential and promise for creating synthetic training data that can be used to train robust neural models; future applications of this idea extend beyond the keyword-to-question task.
- Agarwal et al. ([n. d.]) Manish Agarwal, Rakshit Shah, and Prashanth Mannem. [n. d.]. Automatic Question Generation Using Discourse Cues. In Proc. of IUNLPBEA ’11.
- Asadi et al. (2011) Nima Asadi, Donald Metzler, Tamer Elsayed, and Jimmy Lin. 2011. Pseudo Test Collections for Learning Web Search Ranking Functions. In Proc. of SIGIR ’11. 1073–1082.
- Azzopardi and de Rijke (2006) Leif Azzopardi and Maarten de Rijke. 2006. Automatic Construction of Known-item Finding Test Beds. In Proc. of SIGIR ’06. 603–604.
- Azzopardi et al. (2007) Leif Azzopardi, Maarten de Rijke, and Krisztian Balog. 2007. Building Simulated Queries for Known-item Topics: An Analysis Using Six European Languages. In Proc. of SIGIR ’07. 455–462.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. http://arxiv.org/abs/1409.0473.
- Berendsen et al. (2012) Richard Berendsen, Manos Tsagkias, Maarten de Rijke, and Edgar Meij. 2012. Generating Pseudo Test Collections for Learning to Rank Scientific Articles. In Proc. of CLEF ’12. 42–53.
- Berendsen et al. (2013) Richard Berendsen, Manos Tsagkias, Wouter Weerkamp, and Maarten de Rijke. 2013. Pseudo Test Collections for Training and Tuning Microblog Rankers. In Proc. of SIGIR ’13. 53–62.
- Bogdanova et al. (2015) Dasha Bogdanova, Cícero Nogueira dos Santos, Luciano Barbosa, and Bianca Zadrozny. 2015. Detecting Semantically Equivalent Questions in Online User Forums. In Proc. of CoNLL ’15. 123–131.
- Chung et al. (2014) Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. http://arxiv.org/abs/1412.3555.
- Cohen (1960) Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and psychological measurement 20, 1 (1960), 37–46.
- Dror et al. (2013) Gideon Dror, Yoelle Maarek, Avihai Mejer, and Idan Szpektor. 2013. From Query to Question in One Click: Suggesting Synthetic Questions to Searchers. In Proc. of WWW ’13. 391–402.
- Fader et al. (2014) Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2014. Open Question Answering over Curated and Extracted Knowledge Bases. In Proc. of KDD ’14. 1156–1165.
- Freitag and Al-Onaizan (2017) Markus Freitag and Yaser Al-Onaizan. 2017. Beam Search Strategies for Neural Machine Translation. http://arxiv.org/abs/1702.01806.
- Gan et al. (2015) Zhe Gan, Ricardo Henao, David E. Carlson, and Lawrence Carin. 2015. Learning Deep Sigmoid Belief Networks with Data Augmentation. In Proc. of AISTATS ’15. 268–276.
- Gao et al. (2013) Yunjun Gao, Lu Chen, Rui Li, and Gang Chen. 2013. Mapping Queries to Questions: Towards Understanding Users’ Information Needs. In Proc. of SIGIR ’13. 977–980.
- Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating Copying Mechanism in Sequence-to-Sequence Learning. http://arxiv.org/abs/1603.06393.
- Handa et al. (2015) Ankur Handa, Viorica Patraucean, Vijay Badrinarayanan, Simon Stent, and Roberto Cipolla. 2015. SceneNet: Understanding Real World Indoor Scenes With Synthetic Data. http://arxiv.org/abs/1511.07041.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9 (1997), 1735–1780.
- Itti et al. (1998) Laurent Itti, Christof Koch, and Ernst Niebur. 1998. A Model of Saliency-Based Visual Attention for Rapid Scene Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20, 11 (Nov. 1998), 1254–1259.
- Jiang et al. (2017) Lili Jiang, Shuo Chang, and Nikhil Dandekar. 2017. Semantic Question Matching with Deep Learning. https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning.
- Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. http://arxiv.org/abs/1412.6980.
- Kotov and Zhai (2010) Alexander Kotov and ChengXiang Zhai. 2010. Towards Natural Question Guided Search. In Proc. of WWW ’10. 541–550.
- Lin (2004) Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Proc. of the ACL ’04 Workshop. 74–81.
- Metzler and Croft (2005) Donald Metzler and W. Bruce Croft. 2005. A Markov Random Field Model for Term Dependencies. In Proc. of SIGIR ’05. 472–479.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and Their Compositionality. In Proc. of NIPS ’13. 3111–3119.
- Mnih et al. (2014) Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent Models of Visual Attention. In Proc. of NIPS ’14. 2204–2212.
- Paletta et al. (2005) Lucas Paletta, Gerald Fritz, and Christin Seifert. 2005. Q-learning of Sequential Attention for Visual Object Recognition from Informative Local Descriptors. In Proc. of ICML ’05. 649–656.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proc. of ACL ’02. 311–318.
- Ros et al. (2016) German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M. Lopez. 2016. The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. In Proc. of CVPR ’16.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to Sequence Learning with Neural Networks. http://arxiv.org/abs/1409.3215.
- Xue et al. (2008) Xiaobing Xue, Jiwoon Jeon, and W. Bruce Croft. 2008. Retrieval Models for Question and Answer Archives. In Proc. of SIGIR ’08. 475–482.
- Yin et al. (2016) Wenpeng Yin, Sebastian Ebert, and Hinrich Schütze. 2016. Attention-Based Convolutional Neural Network for Machine Comprehension. http://arxiv.org/abs/1602.04341.
- Zhang et al. (2015) Xi Zhang, Yanwei Fu, Andi Zang, Leonid Sigal, and Gady Agam. 2015. Learning Classifiers from Synthetic Data Using a Multichannel Autoencoder. http://arxiv.org/abs/1503.03163.
- Zhao et al. (2011) Shiqi Zhao, Haifeng Wang, Chao Li, Ting Liu, and Yi Guan. 2011. Automatically Generating Questions from Queries for Community-based Question Answering. In Proc. of IJCNLP ’11. 929–937.
- Zheng et al. (2011) Zhicheng Zheng, Xiance Si, Edward Y Chang, and Xiaoyan Zhu. 2011. K2Q: Generating Natural Language Questions from Keywords with User Refinements. In Proc. of IJCNLP ’11. 947–955.
- Zhou et al. (2017) Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan, Hangbo Bao, and Ming Zhou. 2017. Neural Question Generation from Text: A Preliminary Study. https://arxiv.org/abs/1704.01792.