Generating Synthetic Data for Neural Keyword-to-Question Models

07/14/2018 ∙ by Heng Ding, et al. ∙ Wuhan University ∙ University of Stavanger

Search typically relies on keyword queries, but these are often semantically ambiguous. We propose to overcome this by offering users natural language questions, based on their keyword queries, to disambiguate their intent. This keyword-to-question task may be addressed using neural machine translation techniques. Neural translation models, however, require massive amounts of training data (keyword-question pairs), which is unavailable for this task. The main idea of this paper is to generate large amounts of synthetic training data from a small seed set of hand-labeled keyword-question pairs. Since natural language questions are available in large quantities, we develop models to automatically generate the corresponding keyword queries. Further, we introduce various filtering mechanisms to ensure that synthetic training data is of high quality. We demonstrate the feasibility of our approach using both automatic and manual evaluation. This is an extended version of the article published with the same title in the Proceedings of ICTIR'18.




1. Introduction

Most search queries are motivated by some underlying question (Kotov and Zhai, 2010). Today’s users are accustomed to expressing the questions they have in mind using keyword queries (Zhao et al., 2011). Keyword queries, however, can be notoriously ambiguous and may be interpreted in multiple ways. For example, given the keyword query “10th president India,” the question perhaps most users would want to ask is “Who was the 10th President of India?”. Nevertheless, some users may be interested in a particular aspect of the query topic, like “In which year did the 10th President of India leave office?” or “What do people say about the 10th President of India?”. By determining the underlying question, we can obtain a more accurate representation of the user’s information need. This, in turn, can lead to improved retrieval performance and a better overall search experience. We envisage a search interface that allows users to refine their queries with automatically generated natural language questions; see Fig. 1. We note that similar functionality is already offered, for certain queries, in major Web search engines (see Fig. 9). Those services, however, are limited to suggesting existing questions to which answers are known to exist. Importantly, we are not aiming to retrieve existing questions from community question-answering archives (Xue et al., 2008; Gao et al., 2013). Our goal is to automatically generate a natural language question that most likely represents the user’s underlying information need. This is seen as a feedback mechanism that can more naturally engage users into explicitly clarifying their information needs. How those natural language questions are actually utilized in a retrieval system (e.g., via query expansion (Kotov and Zhai, 2010)) is beyond the scope of this study.

Figure 1. Translating a keyword query to natural language question(s). Our focus is on the shaded area: generating the most common question for a keyword query. The bottom part, generating diverse questions, is left for future work.

In this paper, we address the keyword-to-question (K2Q) task: generating a natural language question from a keyword query. K2Q has generated considerable attention recently, see, e.g., (Kotov and Zhai, 2010; Dror et al., 2013; Zhao et al., 2011; Zheng et al., 2011). Most existing works employ a template-based approach, where common question patterns are extracted from existing keyword-question pairs. These template-based methods are inherently limited in their ability to generalize to previously unseen queries. Instead, we propose to address the K2Q task using state-of-the-art neural machine translation (sequence-to-sequence) approaches. One challenge we face is that training such neural models requires massive amounts of training data (i.e., hand-labeled keyword-question pairs). While such training data could be mined from query and click logs, there are two main issues. First, such click data is not always available (e.g., in a cold start scenario). Second, it is limited to keyword-question pairs that have received sufficiently many clicks; long-tail queries or newly posted questions will not have that. The above considerations give rise to the main research objective of the present work: How can we generate synthetic data for training a neural machine translation approach for the K2Q task?

The idea of generating synthetic data for training deep neural networks has already been applied successfully to several computer vision tasks (Handa et al., 2015; Zhang et al., 2015; Gan et al., 2015; Ros et al., 2016). In information retrieval, prior work has studied the creation of pseudo test collections, i.e., automatically generating query-document pairs, for training and evaluating retrieval algorithms (Azzopardi and de Rijke, 2006; Azzopardi et al., 2007; Berendsen et al., 2013). Inspired by those studies, we propose an approach that automatically generates large amounts of simulated keyword-question pairs from a small set of hand-labeled keyword-question pairs, and then learns a neural keyword-to-question model from this synthetic training data. The main technical contributions of this work are the following:

  1. We present a novel approach for generating synthetic training data from a seed set of hand-labeled keyword-question pairs, and subsequently use this data for learning neural machine translation models to solve the K2Q task (Sect. 2).

  2. We introduce several generative models for producing synthetic keyword queries from natural language questions (Sect. 3.1).

  3. We develop two filtering mechanisms, which are essential for ensuring that the synthetic training data we feed into the neural network is of high quality (Sect. 3.2).

  4. We evaluate our synthetic data generation approach on the end-to-end K2Q task using both automatic and manual evaluation (Sect. 7).

2. Overview

The overall goal of this paper is to tackle the keyword-to-question (K2Q) problem using neural networks; that is, the task is to translate a keyword query (keyword for short) into a natural language question (question for short). Using neural networks for this task requires massive amounts of training data. The main idea of our paper is to use a small seed set of hand-labeled training data to generate large amounts of synthetic training data. Specifically, the seed training data, $T$, consists of keyword-question pairs, $(k, q)$. This, along with a large question corpus, $Q$, is utilized to generate synthetic training data, $\hat{T}$, which also consists of keyword-question pairs, $(\hat{k}, q)$. The neural machine translation models are then trained on $\hat{T}$. An overview of our framework is shown in Fig. 2. It entails three main steps, which we detail below.

First, we train a keyword query generation model (KQGM), $G_\lambda$, using keyword-question pairs from the seed training data. We aim to simulate real users’ querying behavior: given a natural language question, generate a keyword query that a user would likely issue when seeking an answer to that question. We explore various generative models; these have only a few free parameters, which can be easily learned from the seed training data $T$.

Second, we utilize a large question corpus $Q$, collected from community question answering forums, and employ the keyword query generation model to generate (a large set of) simulated keyword-question pairs. These will constitute our synthetic training data $\hat{T}$. However, since not all the automatically generated keyword-question pairs are of high quality, we employ a keyword query filter (KQF) and a training data filter (TDF). These filters are pivotal elements in our approach; we detail them in Sect. 3.2.

Finally, we train a neural machine translation (NMT) model for the K2Q task by feeding it with the synthetic training data $\hat{T}$. We consider three neural networks: basic encoder-decoder NMT (Sutskever et al., 2014), NMT with attention mechanism (Bahdanau et al., 2014), and NMT with copying mechanism (Gu et al., 2016). We detail these networks in Sect. 4.

Figure 2. The overview of our approach. A small set of hand-labeled training data ($T$) and a large scale question corpus ($Q$) are input to the keyword query generation model (KQGM, parameterized by $\lambda$). The output is passed through a keyword query filter (KQF) and a training data filter (TDF), resulting in a synthetic training data set ($\hat{T}$). The synthetic training data is used for learning the parameters ($\Theta$) of a neural machine translation model (NMT).

3. Synthetic Data Generation

This section details our synthetic training data generation method, which is the most important contribution of this paper. The process takes as input (i) a small seed training data set, consisting of hand-labeled keyword-question pairs, and (ii) a large set of natural language questions. The output is a large set of automatically generated keyword-question pairs, of high enough quality to train robust neural models. Our approach consists of two main components: a keyword query generation model (Sect. 3.1) and filtering mechanisms (Sect. 3.2).

3.1. Keyword Query Generation Model

Prior work has seen successful attempts at generating synthetic queries for web and microblog known-item search, both for evaluation and for training purposes (Azzopardi and de Rijke, 2006; Azzopardi et al., 2007; Berendsen et al., 2013). The overall idea is to construct a generative model that can produce a query, similar to a real query that a user would issue, for finding a particular item. We take the algorithm proposed by Azzopardi et al. (2007) as our starting point (§3.1.1) and extend it at several points to fit our problem setting: (i) we impose a number of restrictions as well as introduce new elements to the generative process (§3.1.2), (ii) we propose a paraphrase-based variation that considers multiple ways of formulating the same question (§3.1.3), and (iii) we add phrase support, so as not to break up meaningful word sequences (§3.1.4).

3.1.1. Baseline

In known-item search it is assumed that the user wants to find a particular item (document, question, tweet, etc.) that she has seen before in the corpus. Therefore, the user constructs a keyword query by recalling terms that would help her identify this item. In automatic query construction this user behavior is simulated using generative models.

Formally, let us assume that the user seeks to find (recall) the natural language question $q$. The query length $l$ is selected with probability $p(l)$. Then, a keyword query is constructed by sampling $l$ terms from $p(t|\theta_q)$, which is the model of $q$. The prior probability distribution $p(l)$ can be easily estimated by considering query lengths in a representative sample (e.g., a query log). The quality of the synthetic queries crucially depends on the distribution $p(t|\theta_q)$, as it determines which terms will be sampled. Azzopardi et al. (2007) define $p(t|\theta_q)$ using the standard language modeling approach:

$$p(t|\theta_q) = (1-\lambda)\, p(t|q) + \lambda\, p(t) \tag{1}$$

Accordingly, term generation is a mixture between sampling a term from the given item with probability $1-\lambda$, and from the corpus with probability $\lambda$, where the influence of the collection model is controlled by the smoothing parameter $\lambda$. The latter likelihood is calculated using:

$$p(t) = \frac{n(t)}{\sum_{t' \in V} n(t')} \tag{2}$$

where $n(t)$ denotes the collection term frequency of term $t$, and $V$ is the vocabulary of terms in the corpus.

To simulate different types of user querying behavior, three plausible term selection strategies have been proposed to estimate $p(t|q)$: (i) popular selection, (ii) discriminative selection, and (iii) their combination (Azzopardi and de Rijke, 2006; Azzopardi et al., 2007).

(i) Popular: Assuming that more frequent terms are more likely to be used as query terms, $p(t|q)$ is calculated by Eq. (3), where $n(t,q)$ is the number of occurrences of $t$ in $q$.

$$p(t|q) = \frac{n(t,q)}{\sum_{t' \in q} n(t',q)} \tag{3}$$

(ii) Discriminative: Assuming that the user may select query terms that can better discriminate the item she is looking for from other items in the corpus, $p(t|q)$ is calculated using Eq. (4), where $b(t,q)$ is a binary indicator function that is 1 if $t$ occurs in $q$ and 0 otherwise. $p(t)$ is the same as before, cf. Eq. (2).

$$p(t|q) = \frac{b(t,q)}{p(t) \cdot \sum_{t' \in q} \frac{b(t',q)}{p(t')}} \tag{4}$$

(iii) Combination: Combining the popular and discriminative strategies into a single model, $p(t|q)$ is calculated by Eq. (5), where $df(t)$ is the document (here: question) frequency of term $t$ and $N$ is the total number of items in the corpus.

$$p(t|q) = \frac{n(t,q) \cdot \log\frac{N}{df(t)}}{\sum_{t' \in q} n(t',q) \cdot \log\frac{N}{df(t')}} \tag{5}$$
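The three term selection strategies and the smoothing with the collection model can be illustrated with a short Python sketch. This is a minimal illustration under our own assumptions, not the implementation used in the experiments: questions are represented as lists of tokens, the function names (`collection_model`, `popular`, `discriminative`, `combination`, `smooth`) are hypothetical, and the question is assumed to be part of the corpus so that every term has a non-zero collection and document frequency.

```python
import math
from collections import Counter


def collection_model(corpus):
    """p(t): relative collection frequency of every term in the corpus."""
    counts = Counter(t for item in corpus for t in item)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def popular(q):
    """Popular selection: probability proportional to the term count in q."""
    counts = Counter(q)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def discriminative(q, p_coll):
    """Discriminative selection: rarer terms in the collection get more mass."""
    weights = {t: 1.0 / p_coll[t] for t in set(q)}
    z = sum(weights.values())
    return {t: w / z for t, w in weights.items()}


def combination(q, corpus):
    """Combination: term count in q weighted by log(N / df(t))."""
    n_items = len(corpus)
    df = Counter(t for item in corpus for t in set(item))
    weights = {t: c * math.log(n_items / df[t]) for t, c in Counter(q).items()}
    z = sum(weights.values())
    if z == 0:  # every term occurs in every item; fall back to uniform
        return {t: 1.0 / len(weights) for t in weights}
    return {t: w / z for t, w in weights.items()}


def smooth(p_item, p_coll, lam):
    """Mix the item model with the collection model using weight lam."""
    return {t: (1 - lam) * p_item.get(t, 0.0) + lam * p for t, p in p_coll.items()}
```

With a toy corpus of three tokenized questions, the discriminative strategy assigns more probability to a rare term such as "10th" than to "the", while the combination strategy assigns zero weight to a term that occurs in every question.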


3.1.2. Our Keyword Generation Algorithm

Note that the original algorithm in (Azzopardi et al., 2007) has been developed for known-item (document) search. We need to modify and extend it at several points to be able to use it for the K2Q task we are addressing.

For known-item search, an item is selected randomly from the corpus, and then a keyword query is generated from that item. The process is repeated as many times as the number of queries to be created. In our problem scenario the items are natural language questions, where each of them needs to be paired with a keyword query. That is, we do not sample items, but we generate a query for each item in the corpus. This is the first modification we make to the algorithm (line 3 in Algorithm 1).

The second change concerns the length of keyword queries. In (Azzopardi et al., 2007), the length of the query is drawn from a Poisson distribution, with the mean set according to the average length in a set of human-generated queries. For us, the length of the keyword query also depends on the corresponding natural language question. Given a question with length $|q|$, it is reasonable to assume that users will always prefer to issue a keyword query that is shorter than $|q|$. Thus, we include this additional constraint and sample a query length $l$ with $p(l)$, where $l < |q|$ (line 5 in Algorithm 1).

Third, keyword queries typically do not contain question words, such as “how,” “what,” “where,” “who,” “why,” “when,” etc. Thus, we do not sample question words in our generation process.

Fourth, our algorithm samples not only terms but also phrases when generating synthetic queries. Thus, we avoid breaking up word sequences that function together as a meaningful unit. This means that $t_i$ can be either a term or a phrase in the generative process (line 7 in Algorithm 1). We describe our phrase detection mechanism in §3.1.4.

Fifth, according to our statistics on a sample of queries (from the Yahoo! L16 Webscope Dataset, which contains many real keyword queries from users of Yahoo Answers), only 3.9% of all keyword queries include the same term more than once, suggesting that queries with repeated terms are atypical. Thus, we find it reasonable to avoid sampling the same term more than once in our keyword query generation process (line 9 in Algorithm 1).

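Data: $Q$, a set of known questions
Result: $\hat{T}$, a set of synthetic keyword-question pairs
1  begin
2      $\hat{T} \leftarrow \emptyset$;
3      for $q \in Q$ do
4          $k \leftarrow$ empty keyword query;
5          $l \leftarrow$ sample a query length from $p(l)$, subject to $l < |q|$;
6          for $i$ in $[1, l]$ do
7              $t_i \leftarrow$ sample a term or phrase from $p(t|\theta_q)$;
8              append $t_i$ to $k$;
9              exclude $t_i$ from further sampling;
10         end for
11         $\hat{T} \leftarrow \hat{T} \cup \{(k, q)\}$;
12     end for
13 end
Algorithm 1 Synthetic keyword-question generation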
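The generation loop of Algorithm 1 can be sketched in Python as follows. This is a hedged illustration rather than the implementation used in the paper: the function name `generate_pairs`, the whitespace tokenization, the example length prior, and the use of raw term counts as sampling weights (i.e., the popular strategy without smoothing) are all assumptions made for brevity. The comments map each step to the algorithm lines referenced in the text.

```python
import random
from collections import Counter

# Question words are never sampled (third modification in the text).
QUESTION_WORDS = {"how", "what", "where", "who", "why", "when"}


def generate_pairs(questions, length_prior, seed=0):
    """Generate one synthetic (keyword, question) pair per question."""
    rng = random.Random(seed)
    pairs = []
    for q in questions:                                   # line 3: every item
        tokens = q.lower().split()
        # Sampling weights over distinct non-question-word terms.
        weights = Counter(t for t in tokens if t not in QUESTION_WORDS)
        # Keep only lengths shorter than the question that can be filled
        # with distinct terms.
        valid = {l: p for l, p in length_prior.items()
                 if l < len(tokens) and l <= len(weights)}
        if not valid:
            continue
        lengths = list(valid)
        l = rng.choices(lengths, weights=[valid[x] for x in lengths])[0]  # line 5
        keyword = []
        for _ in range(l):
            terms = list(weights)
            t = rng.choices(terms, weights=[weights[x] for x in terms])[0]  # line 7
            keyword.append(t)
            del weights[t]                                # line 9: no repeats
        pairs.append((" ".join(keyword), q))
    return pairs


# Hypothetical length prior over 3 to 5 terms, for illustration only.
example_prior = {3: 0.5, 4: 0.3, 5: 0.2}
example_pairs = generate_pairs(
    ["who was the 10th president of india"], example_prior, seed=7)
```

Because both the length and the terms are sampled, repeated runs with different seeds yield different keyword queries for the same question, which is what motivates the filtering step in Sect. 3.2.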

3.1.3. Paraphrase-Based Querying Model

Users may use different words to express the same meaning. This should be taken into consideration in the keyword query generation process. Imagine the following case: a particular user has seen the question “Who is the author of the pooh?” in a community question answering forum (e.g., Yahoo! Answers or Quora). After several days, she tries to recall the search terms to find an answer to this question. If she still remembers the exact words from the question, she may issue “the pooh author” as a query. Otherwise, she may recollect a paraphrase of the question, like “Who is winnie the pooh’s creator?” and, based on that, formulate the keyword query “winnie the pooh creator.” Furthermore, different users may recall different paraphrases during their querying process. Thus, it is natural to sample terms from paraphrases of the same question when generating keyword queries. We realize this idea by defining the term generation model as a three-component mixture:

$$p(t|\theta_q) = \lambda_1\, p(t|q) + \lambda_2\, p(t|P_q) + (1-\lambda_1-\lambda_2)\, p(t) \tag{6}$$

where $P_q$ is a set of paraphrases of question $q$ and $p(t|P_q)$ defines the likelihood of selecting term $t$ from the paraphrases. All paraphrases in $P_q$ are concatenated together into a single large document; $p(t|P_q)$ may then be calculated by one of the three strategies described in the previous section. The model in Eq. (6) has two parameters, $\lambda_1$ and $\lambda_2$. As $\lambda_1$ tends to one, it assumes that the user definitely remembers the terms of the original question. As $\lambda_2$ tends to one, it assumes that the user does not recall the terms from the original question but knows how to paraphrase it. As both $\lambda_1$ and $\lambda_2$ tend to zero, it means that the user knows that the question exists but does not remember any terms from the original question nor from any of its paraphrases.
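To make the roles of the two mixture weights concrete, the following Python sketch computes the three-component mixture for the “pooh” example. It is an illustration under stated assumptions: the function names are hypothetical, relative term frequency (the popular strategy) is used for both the question and the paraphrase component, and the collection model is a toy uniform distribution.

```python
from collections import Counter


def relative_freq(tokens):
    """Relative term frequencies of a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def paraphrase_mixture(question, paraphrases, p_coll, lam1, lam2):
    """Mix the question, paraphrase, and collection models.

    lam1 weights the original question, lam2 the concatenated
    paraphrases, and the remaining mass goes to the collection model.
    """
    if lam1 < 0 or lam2 < 0 or lam1 + lam2 > 1:
        raise ValueError("lam1 and lam2 must be non-negative and sum to at most 1")
    p_q = relative_freq(question)
    # All paraphrases are concatenated into a single document.
    p_para = relative_freq([t for p in paraphrases for t in p])
    vocab = set(p_coll) | set(p_q) | set(p_para)
    rest = 1 - lam1 - lam2
    return {t: lam1 * p_q.get(t, 0.0)
               + lam2 * p_para.get(t, 0.0)
               + rest * p_coll.get(t, 0.0)
            for t in vocab}


question = "who is the author of the pooh".split()
paraphrases = ["who is winnie the pooh creator".split()]
vocab = set(question) | set(paraphrases[0])
p_coll = {t: 1.0 / len(vocab) for t in vocab}  # toy collection model
p = paraphrase_mixture(question, paraphrases, p_coll, lam1=0.6, lam2=0.3)
```

With a positive paraphrase weight, terms such as “winnie” and “creator” become available for sampling even though they do not occur in the original question; setting the question weight to one removes them again.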

3.1.4. Phrase Detection

We sample not only terms but also phrases, in order to avoid breaking up continuous word sequences that constitute meaningful units. Specifically, we follow the method proposed by Mikolov et al. (2013) for detecting phrases. Words that belong to the same phrase are grouped together into a new term. For example, the question “how fast is a 2004 honda crf 230” is converted to “how fast is a 2004 honda_crf_230” after phrase detection. This way, KQGM is able to directly sample honda_crf_230, instead of sampling three independent terms.
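The bigram scoring of Mikolov et al. (2013) can be sketched as follows: a word pair is merged when (count(a b) − δ) / (count(a) × count(b)) exceeds a threshold, where δ discounts very rare pairs. The sketch below is a simplified single pass with a greedy left-to-right merge; the function name `merge_phrases` and the values of `delta` and `threshold` are illustrative assumptions, not the settings used in our experiments.

```python
from collections import Counter


def merge_phrases(sentences, delta=1.0, threshold=0.1):
    """Join adjacent word pairs whose co-occurrence score exceeds threshold."""
    unigrams = Counter(t for s in sentences for t in s)
    bigrams = Counter(b for s in sentences for b in zip(s, s[1:]))

    def score(a, b):
        # Discounted bigram count, normalized by the unigram counts.
        return (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])

    merged = []
    for s in sentences:
        out, i = [], 0
        while i < len(s):
            if i + 1 < len(s) and score(s[i], s[i + 1]) > threshold:
                out.append(s[i] + "_" + s[i + 1])  # treat the pair as one term
                i += 2
            else:
                out.append(s[i])
                i += 1
        merged.append(out)
    return merged


sentences = [s.split() for s in [
    "how fast is a honda crf",
    "honda crf top speed",
    "where is the honda crf made",
    "what is a good bike",
    "is it a fast bike",
    "a bike is fast",
]]
merged = merge_phrases(sentences)
```

Running further passes over the merged output allows longer phrases (e.g., three-word units such as honda_crf_230) to form from already merged pairs.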

3.2. Filtering Mechanisms

To ensure that high-quality synthetic data is generated for training neural translation models, we propose two filtering mechanisms. One operates on the level of individual questions and selects the best keyword query, from a pool of candidate queries generated for a given question (§3.2.1). The other filter is applied over the entire set of synthetic query-question pairs and filters out low-quality instances (§3.2.2).

3.2.1. Keyword Query Filter

Figure 3. The architecture of our keyword query filter (KQF). For a given question $q$, $K_q$ holds the candidate keywords generated by KQGM, out of which a single best keyword $k^*$ is selected. $I$ is the index containing all questions in our corpus. For each generated keyword $k \in K_q$, $R_k$ is the set of top-$n$ relevant questions retrieved from $I$.

Given the probabilistic nature of query length selection (line 5 in Algorithm 1) and term selection (line 7 in Algorithm 1), the keyword query generation model may produce very different keyword queries for the same question. These keywords may vary a lot in terms of quality, from appropriate to inadequate. For example, given the question “what happens inside a refracting telescope,” the query generation model can give rise to a good keyword query, “happens inside refracting telescope,” or to a rather bad one, “inside colors type,” using the very same parameters.

The idea is to remedy this behavior by generating, for each question, a set of candidate keyword queries (i.e., running the model multiple times), and then selecting the single most suitable query. We propose to achieve this using a so called keyword query filter (KQF), shown in Fig. 3. The intuition behind this ranking-based filtering approach is that the better the generated keyword query is, the more effectively it can retrieve the original question from the question corpus. (It is worth pointing out that our algorithm will always generate a keyword query that is shorter than the corresponding question, i.e., it is never the same as the question.)

We start by generating a set of candidate keywords $K_q$ for a given question $q$ using KQGM. Then, we issue each candidate keyword query $k \in K_q$ against an index containing all questions in our corpus, and retrieve the top-$n$ highest scoring questions, $R_k$. Specifically, we employ the Sequential Dependence Model (SDM) retrieval method (Metzler and Croft, 2005). Finally, we select the best candidate keyword $k^*$ for the input question $q$ according to its reciprocal rank:

$$k^* = \arg\max_{k \in K_q} \frac{1}{rank(q, R_k)} \tag{7}$$

where $rank(q, R_k)$ is the rank of $q$ in the ranked list $R_k$.
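The selection step of KQF can be sketched in Python as below. Note the loudly stated simplification: instead of SDM, the hypothetical `retrieve` function ranks questions by the number of terms they share with the keyword query, which is only a stand-in to keep the example self-contained. A candidate that fails to retrieve the question at all receives a reciprocal rank of zero.

```python
def retrieve(keyword, corpus, n=10):
    """Rank questions by term overlap with the keyword (stand-in for SDM)."""
    k_terms = set(keyword.split())
    scored = [(len(k_terms & set(q.split())), q) for q in corpus]
    scored = [(s, q) for s, q in scored if s > 0]
    scored.sort(key=lambda x: -x[0])  # stable: ties keep corpus order
    return [q for _, q in scored[:n]]


def reciprocal_rank(question, ranked):
    """1 / rank of the question in the ranked list, or 0 if it is absent."""
    return 1.0 / (ranked.index(question) + 1) if question in ranked else 0.0


def select_keyword(question, candidates, corpus, n=10):
    """Pick the candidate keyword that ranks the original question highest."""
    return max(candidates,
               key=lambda k: reciprocal_rank(question, retrieve(k, corpus, n)))


corpus = [
    "what happens inside a refracting telescope",
    "what colors are inside a rainbow",
    "what type of telescope is best",
]
candidates = ["inside colors type", "happens inside refracting telescope"]
best = select_keyword(corpus[0], candidates, corpus)
```

In this toy setting, the poor candidate “inside colors type” ranks a different question first, so the filter prefers “happens inside refracting telescope”.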

3.2.2. Training Data Filter

Figure 4. The architecture of our training data filter (TDF). $\hat{T}$ denotes the entire set of keyword-question pairs generated by KQGM with KQF. For a pair $(k, q)$, $P_q$ is the set of all paraphrases of $q$ and $R_k$ is the set of relevant questions retrieved from the question corpus in response to $k$. $\hat{T}_m$ denotes the top-$m$ pairs with the highest quality score $s(k, q)$.

Even after applying the keyword query filter, there may still exist low-quality training instances in $\hat{T}$, which would misdirect the learning process. Therefore, we propose a training data filter (TDF) to filter out low-quality instances. TDF, shown in Fig. 4, takes a set of synthetic query-question pairs as input, and returns a subset that contains the top-$m$ pairs with the highest quality score. We use retrieval precision as a quality indicator, which expresses to what extent $k$ is a proper keyword for question $q$:

$$s(k, q) = \frac{|R_k \cap P_q|}{|R_k|} \tag{8}$$

where $R_k$ denotes the set of relevant questions retrieved by the keyword query $k$ using the SDM retrieval method (Metzler and Croft, 2005), and $P_q$ denotes the set of paraphrase questions for $q$. In short, TDF ranks all generated query-question pairs according to $s(k, q)$, then selects the top-$m$ highest scoring ones to form the filtered subset $\hat{T}_m$.
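The precision-based filter can be sketched as follows. This is a minimal illustration: the retrieval results and paraphrase clusters are supplied as hypothetical hand-written dictionaries rather than produced by SDM and WikiAnswers, and the names `precision` and `training_data_filter` are our own.

```python
def precision(retrieved, paraphrases):
    """Fraction of retrieved questions that are paraphrases of the question."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(paraphrases)) / len(retrieved)


def training_data_filter(pairs, retrieve, paraphrases_of, m):
    """Keep the m (keyword, question) pairs with the highest precision."""
    scored = [(precision(retrieve(k), paraphrases_of[q]), k, q)
              for k, q in pairs]
    scored.sort(key=lambda x: -x[0])  # stable sort, highest score first
    return [(k, q) for _, k, q in scored[:m]]


# Hypothetical toy data for illustration.
q1 = "what is usage of erw pipe"
q2 = "what happens inside a refracting telescope"
pairs = [("inside colors type", q2), ("erw pipe usage", q1)]
retrieved = {
    "erw pipe usage": [q1, "what is an erw pipe used for",
                       "how is steel pipe made"],
    "inside colors type": ["what colors are inside a rainbow",
                           "what type of telescope is best"],
}
paraphrases_of = {
    q1: {q1, "what is an erw pipe used for"},
    q2: {q2, "how does a refracting telescope work"},
}
kept = training_data_filter(pairs, lambda k: retrieved[k], paraphrases_of, m=1)
```

Here the pair with keyword “erw pipe usage” retrieves two paraphrases among three results and is kept, whereas “inside colors type” retrieves none and is discarded.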

4. Neural Machine Translation

Neural machine translation (NMT) aims to directly model the conditional probability $p(Y|X)$ of translating a source sequence $X = (x_1, \dots, x_{|X|})$ to a target sequence $Y = (y_1, \dots, y_{|Y|})$. NMT thus lends itself naturally to our K2Q task: we take the keyword query as the source sequence $X$ and the natural language question as the target sequence $Y$. In the rest of this section, we detail the three NMT networks we use in our experiments.

4.1. Encoder-Decoder NMT

The classical architecture of NMT is the encoder-decoder recurrent neural network (RNN) (Sutskever et al., 2014), which consists of two components:

(i) Encoder, an RNN that computes a context vector representation $c$ for the input sequence, $X$, by iterating the following equations:

$$h_t = f(x_t, h_{t-1}) \tag{9}$$

$$c = \phi(\{h_1, \dots, h_{|X|}\}) \tag{10}$$

where $x_t$ is a one-hot representation of the $t$th word in the input sequence, and $h_t$ is the hidden state vector of the encoder RNN at time $t$. The activation function $f$ may be as simple as a sigmoid function or more complex, e.g., a long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) or gated recurrent (GRU) (Chung et al., 2014) unit. The context vector $c$ is defined by $\phi$, which is an operation on all hidden states. In this paper, $\phi$ indicates an operation choosing the last hidden state $h_{|X|}$.

(ii) Decoder, another RNN that decompresses the context vector $c$ and outputs the target sequence, $Y$, through a conditional language model:

$$s_t = f'(y_{t-1}, s_{t-1}, c) \tag{11}$$

$$p(y_t | y_{<t}, X) = g(y_{t-1}, s_t, c) \tag{12}$$

where $y_t$ is a one-hot representation of the $t$th word in the output sequence; $s_t$ denotes the hidden state vector of the decoder RNN at time $t$; $f'$ can be the same as the encoder activation function, $f$, or a different non-linear activation function; $g$ is a softmax classifier. Given a set of keyword-question pairs, the encoder and decoder are jointly trained to maximize the conditional log-likelihood.

4.2. Attention Mechanism

The attention mechanism in neural networks has a long history in computer vision (Itti et al., 1998; Paletta et al., 2005; Mnih et al., 2014), and has recently also been applied successfully in natural language processing (Bahdanau et al., 2014; Yin et al., 2016). The basic idea behind it is that humans pay attention to specific parts, rather than the whole input, when performing visual and linguistic tasks. The attentional NMT (Bahdanau et al., 2014) uses a dynamically changing context vector instead of a fixed context vector during the decoding process. The dynamically changing context vector $c_t$ is computed as a weighted sum of the source hidden states according to:

$$c_t = \sum_{\tau=1}^{|X|} \alpha_{t\tau}\, h_\tau \tag{13}$$

and

$$\alpha_{t\tau} = \frac{\exp(\eta(s_{t-1}, h_\tau))}{\sum_{\tau'=1}^{|X|} \exp(\eta(s_{t-1}, h_{\tau'}))} \tag{14}$$

where $\eta$ is an attention function that scores the corresponding attentional strength. Usually, $\eta$ is parameterized with a feedforward neural network. Further, $h_\tau$ denotes the hidden state of the encoder at time $\tau$, and $\alpha_{t\tau}$ denotes the attentional strength with which the target word $y_t$ is related to a source word $x_\tau$.
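The computation of the dynamic context vector can be illustrated with a small pure-Python sketch. It makes one explicit simplification: the attention function is a plain dot product between the previous decoder state and each encoder state, whereas the networks described here parameterize it with a feedforward network. The names `softmax`, `dot`, and `attention_context` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def dot(u, v):
    """Dot product, used here as a stand-in attention function."""
    return sum(a * b for a, b in zip(u, v))


def attention_context(prev_state, encoder_states, score=dot):
    """Return the attention-weighted sum of encoder states and the weights."""
    weights = softmax([score(prev_state, h) for h in encoder_states])
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return context, weights


# Toy example: three source positions with two-dimensional hidden states.
context, weights = attention_context(
    [2.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

In the toy example, the first and third source positions score equally and receive the same weight, while the second position, which is orthogonal to the decoder state, receives less.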

4.3. Copying Mechanism

The copying mechanism was first proposed by Gu et al. (2016) for handling out-of-vocabulary words, by selecting appropriate words from the input text. We employ the copying mechanism to assign higher probability to words that appear in the input text. This way we naturally capture the fact that questions tend to keep important words from the keyword query. By incorporating the copying mechanism into NMT, the probability of generating word $y_t$ in the output sequence becomes:

$$p(y_t | s_t, y_{t-1}, c_t, X) = p_g(y_t | s_t, y_{t-1}, c_t) + p_c(y_t | s_t, y_{t-1}, c_t, X) \tag{15}$$

The first part is the probability of generating the term from the vocabulary (cf. Eq. (12)). The second component is the probability of copying it from the source sequence:

$$p_c(y_t | s_t, y_{t-1}, c_t, X) = \frac{1}{Z} \sum_{j:\, x_j = y_t} \exp\big(\sigma(h_j^\top W_c)\, s_t\big), \quad y_t \in X \tag{16}$$

where $X$ denotes all words in the source sequence, $\sigma$ is a non-linear function, $W_c$ is a learned parameter matrix, and $Z$ is the normalization term shared by the two components. We refer to (Gu et al., 2016) for further details.

5. Data

Our approach needs a small set of hand-labeled keyword-question pairs and a large set of questions. We obtain these two datasets from WikiAnswers. WikiAnswers includes millions of questions asked by humans. Users have also identified groups of questions that are paraphrases of each other. These groups are considered paraphrase clusters (Fader et al., 2014).


Since we only care about natural language questions in this work, we employ the heuristics proposed by Dror et al. (2013) to filter out non-natural language questions. Specifically, we keep only questions that start with “WH words” or auxiliary verbs. Additionally, we restrict ourselves to questions consisting of 5-12 terms (the most frequent question lengths), based on question length distribution statistics of WikiAnswers, see Fig. 5(a). We end up with 3,168,878 paraphrase clusters, with 26.05 questions per cluster on average. In the remainder of the paper, when we write WikiAnswers, we refer to this preprocessed subset of the collection.

Figure 5. Question and query length distributions: (a) question length (WikiAnswers); (b) keyword query length (Yahoo! Webscope L16 Dataset). The X axes indicate length, the Y axes indicate the proportion.

Small Set of Keyword-Question Pairs ($T$)

In order to get the small set of hand-labeled keyword-question pairs, we randomly pick 200 clusters from the 3,168,878 paraphrase clusters. From each of those paraphrase clusters, we sample five questions randomly. We employ five human annotators, who each receive only one question from each of the 200 paraphrase clusters. The annotators then manually create keyword queries from their questions. We then have 200 paraphrase clusters, each with five question paraphrases and their corresponding keyword queries (where each paraphrase is labeled by a different annotator), for a total of 1,000 hand-labeled pairs.

Large Set of Questions ($Q$)

To get the large set of questions, we randomly sample a single question from each of the remaining paraphrase clusters. This amounts to 3,168,678 questions. The hand-labeled questions do not appear in this set.

6. Experimental Setup

This section details various settings of three main components used in our approach, i.e., KQGM (Sect. 6.1), filtering mechanisms (Sect. 6.2), and NMT (Sect. 6.3).

6.1. Keyword Query Generation Model

The following settings are used in our experiments:


  • Query length: The prior probability of query length, $p(l)$, is calculated based on the small set of (hand-labeled) keyword-question pairs. According to statistics on user keyword queries from the Yahoo! L16 Webscope Dataset, most keyword queries contain between 3 and 7 terms, see Fig. 5(b). Thus, we only sample queries with length $l \in [3, 7]$.

  • Collection Language Model: The collection language model probability $p(t)$ is computed based on the WikiAnswers dataset. For the paraphrase-based model, we need to know the paraphrases $P_q$ for a given question $q$. In our dataset, these are readily available. We note that there also exist methods to detect paraphrases automatically (Bogdanova et al., 2015; Jiang et al., 2017).

  • Parameter Tuning: For the baseline model (§3.1.1), there is only one free parameter, $\lambda$. The paraphrase-based model (§3.1.3) involves two parameters, $\lambda_1$ and $\lambda_2$. We set the parameter values by performing an extensive grid search.

6.2. Filtering Mechanisms

For our filters, we use the following settings:


  • Keyword query filter (§3.2.1): We generate a set of candidate keywords $K_q$ for each question $q$ in the large set of questions $Q$ using KQGM. The best of these is selected by KQF to be paired with $q$.

  • Training data filter (§3.2.2): For a keyword-question pair $(k, q)$, we retrieve the top-$n$ questions using $k$ as the query and obtain the paraphrases $P_q$ from the paraphrase cluster of $q$.

6.3. Neural Networks

We implement the following three networks:


  • EDNet: Basic encoder-decoder NMT network.

  • AttNet: EDNet with attention mechanism.

  • CopyNet: AttNet plus copying mechanisms.

For all three networks, we choose the top 44K most frequent words in WikiAnswers as our vocabulary. We set the embedding dimension to 100, and initialize the word embeddings randomly with a uniform distribution in [-0.1, 0.1]. We set the number of layers of both encoder and decoder RNNs to 1. Further, we use a bidirectional GRU (Bahdanau et al., 2014) unit with size 200 for encoder RNNs, and a GRU unit with size 400 for decoder RNNs. All networks are optimized using Adam (Kingma and Ba, 2014), together with gradient clipping and dropout.

6.4. Preliminary Study

Our synthetic data generation heavily depends on the generative model for creating keyword queries. Thus, we perform a preliminary study, using the small set of keyword-question pairs, $T$, to analyze the performance of various KQGM configurations. Informed by this analysis, we can decide which of the three term selection strategies to use for KQGM in our main experiments.

6.4.1. Evaluation Metrics.

We use automatic metrics from text summarization, specifically, the widely used ROUGE-L metric (Lin, 2004). ROUGE-L not only awards credit to in-sequence unigram matches, but also captures word order in a natural way. Thus, it can effectively measure the degree of match between the synthetic and ground truth keyword queries. Recall that in our dataset, we have a set of paraphrases for each question. We wish to consider those paraphrases as well in our evaluation. Formally, let $\hat{k}$ denote the generated keyword query corresponding to question $q$; $P_q$ denotes the paraphrase cluster of $q$; $K_{P_q}$ is the set of ground truth keywords corresponding to $P_q$. For scoring $\hat{k}$, we consider the set of ground truth keywords in two different ways: (i) by computing the average ROUGE-L between $\hat{k}$ and each ground truth keyword (Eq. (17)), and (ii) by considering only the best (highest scoring) ground truth keyword query (Eq. (18)).

$$AvgRougeL(\hat{k}) = \frac{1}{|K_{P_q}|} \sum_{k \in K_{P_q}} RougeL(\hat{k}, k) \tag{17}$$

$$MaxRougeL(\hat{k}) = \max_{k \in K_{P_q}} RougeL(\hat{k}, k) \tag{18}$$
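The two aggregate scores can be computed with a short Python sketch. It assumes the common F-measure form of ROUGE-L based on the longest common subsequence (LCS) with equal weighting of precision and recall (beta = 1); whether the experiments use exactly this variant is an assumption, and the second ground truth keyword in the example is hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """LCS-based F-measure between two whitespace-tokenized strings."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


def avg_rouge_l(generated, truths):
    """Average ROUGE-L against all ground truth keywords."""
    return sum(rouge_l(generated, k) for k in truths) / len(truths)


def max_rouge_l(generated, truths):
    """ROUGE-L against the best matching ground truth keyword."""
    return max(rouge_l(generated, k) for k in truths)


generated = "erw pipe usage made meant"
truths = ["erw pipe usage", "use of erw pipe"]
avg_score = avg_rouge_l(generated, truths)
max_score = max_rouge_l(generated, truths)
```

The maximum variant rewards a generated query that closely matches any one annotator's keyword, while the average variant favors queries that agree with the annotators as a group.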


We employ five-fold cross-validation for evaluation. To eliminate the effects of the randomness involved in the generation process, we repeat the evaluation 100 times and report the means and standard deviations.

6.4.2. Summary.

Table 1 shows the evaluation results for all KQGM configurations. Comparing the three term selection strategies (§3.1.1), we find that the Combination strategy always attains the best performance. With the same term selection strategy and KQGM, phrase detection brings noticeable improvements in both AvgRougeL and MaxRougeL (+5.28% and +4.22% on average, respectively). Comparing the paraphrase-based model with the baseline model, the former brings +10.66% improvements on average for AvgRougeL and +7.16% on average for MaxRougeL. The paraphrase-based model with phrase detection achieves the best overall performance, with 0.2521 AvgRougeL and 0.3843 MaxRougeL, which is superior to the best baseline configuration.

Configuration AvgRougeL MaxRougeL
Baseline model
Popular 0.1956 (0.0934) 0.3197 (0.1266)
Discrimination 0.1877 (0.1049) 0.2999 (0.1421)
Combination 0.2240 (0.0953) 0.3522 (0.1331)
Baseline model + phrase detection
Popular 0.2069 (0.1008) 0.3354 (0.1342)
Discrimination 0.2062 (0.1106) 0.3243 (0.1465)
Combination 0.2373 (0.1019) 0.3708 (0.1399)
Paraphrase-based model
Popular 0.2125 (0.0930) 0.3390 (0.1250)
Discrimination 0.2266 (0.1017) 0.3458 (0.1367)
Combination 0.2435 (0.0956) 0.3734 (0.1330)
Paraphrase-based model + phrase detection
Popular 0.2182 (0.1001) 0.3476 (0.1322)
Discrimination 0.2355 (0.1020) 0.3513 (0.1361)
Combination 0.2521 (0.1009) 0.3843 (0.1374)
Table 1. Evaluation of various KQGM configurations. All numbers are obtained using 5-fold cross-validation. In parentheses are the standard deviations.
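
The relative improvements quoted in the summary can be checked against the means in Table 1. The sketch below assumes one particular reading, namely that each percentage is the mean of the per-configuration relative gains over all matching pairs of rows; this reading is inferred, not stated in the text, but it reproduces the reported figures to within rounding. The dictionary keys and function name are hypothetical.

```python
# (AvgRougeL, MaxRougeL) means copied from Table 1, keyed by (model, strategy).
TABLE_1 = {
    ("baseline", "Popular"): (0.1956, 0.3197),
    ("baseline", "Discrimination"): (0.1877, 0.2999),
    ("baseline", "Combination"): (0.2240, 0.3522),
    ("baseline+ph", "Popular"): (0.2069, 0.3354),
    ("baseline+ph", "Discrimination"): (0.2062, 0.3243),
    ("baseline+ph", "Combination"): (0.2373, 0.3708),
    ("paraphrase", "Popular"): (0.2125, 0.3390),
    ("paraphrase", "Discrimination"): (0.2266, 0.3458),
    ("paraphrase", "Combination"): (0.2435, 0.3734),
    ("paraphrase+ph", "Popular"): (0.2182, 0.3476),
    ("paraphrase+ph", "Discrimination"): (0.2355, 0.3513),
    ("paraphrase+ph", "Combination"): (0.2521, 0.3843),
}
STRATEGIES = ("Popular", "Discrimination", "Combination")
AVG, MAX = 0, 1  # column indices into the tuples above


def mean_relative_gain(model_pairs, metric):
    """Mean of per-configuration relative gains, in percent (assumed reading)."""
    gains = []
    for before, after in model_pairs:
        for strategy in STRATEGIES:
            old = TABLE_1[(before, strategy)][metric]
            new = TABLE_1[(after, strategy)][metric]
            gains.append(100.0 * (new - old) / old)
    return sum(gains) / len(gains)


# Effect of adding phrase detection, with model and strategy held fixed.
PHRASE_PAIRS = [("baseline", "baseline+ph"), ("paraphrase", "paraphrase+ph")]
# Effect of switching to the paraphrase-based model, with the rest held fixed.
PARAPHRASE_PAIRS = [("baseline", "paraphrase"), ("baseline+ph", "paraphrase+ph")]

phrase_avg = mean_relative_gain(PHRASE_PAIRS, AVG)
phrase_max = mean_relative_gain(PHRASE_PAIRS, MAX)
para_avg = mean_relative_gain(PARAPHRASE_PAIRS, AVG)
para_max = mean_relative_gain(PARAPHRASE_PAIRS, MAX)
```

Under this reading, each figure averages six relative gains (two model pairs times three term selection strategies).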

6.4.3. Parameters.

We test what influence the free parameters have on the performance of KQGMs. For the baseline model, we find that both AvgRougeL and MaxRougeL decrease as its free parameter increases; see Fig. 6. For the paraphrase-based model, we find that both AvgRougeL and MaxRougeL increase with higher values of its two free parameters; see Fig. 6. This is not unexpected, since users prefer to use terms from the given question for the keyword query.

Figure 6. Influence of the three free parameters on KQGM performance.

6.4.4. Observed Errors.

Based on manual inspection of synthetic keyword-question pairs, we find that the most prominent flaws in our synthetic data are extraneous terms in the KQGM-made keywords. For example, given the question “what is usage of erw pipe,” our KQGM generates a keyword query “erw pipe usage made meant,” where “made meant” are unnecessary terms.

6.5. Implemented Systems

6.5.1. Baseline systems.

We implement the SDM retrieval model (Metzler and Croft, 2005) and the state-of-the-art template-based method (TBM) (Dror et al., 2013) as baselines. The template-based K2Q method requires millions of hand-labeled keyword-question pairs from a query log, which we do not have access to. Thus, we use our simulated keyword-question pairs instead of hand-labeled keyword-question pairs and compute term similarity using word2vec vectors, instead of TF-IDF weighted context vectors. For the baseline systems, we retrieve the best matching question for each keyword query.

6.5.2. Neural systems.

We train a neural network model with synthetic data, then feed the keyword query into the trained neural network model, to generate the most probable question. Specifically, we use the best KQGM configuration (paraphrase-based model with combination selection strategy and phrase detection), along with the keyword query filter to generate synthetic data (a total of 3,168,678 keyword-question pairs). Then, we use the training data filter to rank all keyword-question pairs.

7. Experimental Results

This section reports our evaluation results for the K2Q task. First, in Sect. 7.1, we measure the quality of the generated questions using machine translation metrics. Then, in Sect. 7.2, we employ human judges to assess a sample of questions along two dimensions: relevance and grammar.

7.1. Automatic Evaluation

For the automatic evaluation of our K2Q methods, we use a dataset comprising 1000 hand-labeled keyword-question pairs. Note that these keyword-question pairs have not been used for the training of neural K2Q models. Therefore, it is appropriate to use them as a test dataset. We report on widely-used machine translation metrics: BLEU (Papineni et al., 2002) and different variants of ROUGE (Lin, 2004).

7.1.1. Results.

Table 2 presents the evaluation results for the baseline systems and for the three neural networks. Clearly, all NMT approaches perform better than the SDM baseline. As expected, the template-based method performs better than SDM, but it is still far behind CopyNet, which is the best neural method. Compared with the basic encoder-decoder NMT network, we find that the attention mechanism brings noticeable improvements in ROUGE-L (+13.99%), ROUGE-1 (+9.78%), ROUGE-2 (+16.76%) and BLEU (+20.59%) scores. Because of the extraneous terms issue (cf. §6.4.4) in our synthetic data, the attention mechanism plays a very important role in skipping those terms (by assigning small weights to extraneous terms in the decoding process). Additionally, the copying mechanism brings further minor improvements in ROUGE-L (+3.44%), ROUGE-1 (+5.67%), ROUGE-2 (+5.18%) and BLEU (+1.25%).

Method ROUGE-L ROUGE-1 ROUGE-2 BLEU
SDM 0.3650 0.4123 0.1940 0.2780
TBM 0.4357 0.5134 0.2056 0.2858
EDNet 0.4338 0.5236 0.2464 0.3045
AttNet 0.4945 0.5748 0.2877 0.3672
CopyNet 0.5115 0.6074 0.3026 0.3718
Table 2. Automatic evaluation results of baseline systems and three neural networks.

7.1.2. Analysis.

We seek to gain a better understanding of how the different elements of our synthetic data generation approach contribute to end-to-end performance on the K2Q task. For that reason, we train the best performing neural model (CopyNet) using different configurations for generating synthetic training data. We add components one by one, to see how they affect performance. Additionally, we vary the amount of training data used between 0.5M and 3M pairs. The results are shown in Fig. 7 for the following configurations:

  • Baseline: Baseline KQGM with the Combination term selection strategy (§3.1.1).

  • Par: Paraphrase-based KQGM with the Combination term selection strategy (§3.1.3).

  • Par+Ph: Phrase detection added on top (§3.1.4).

  • Par+Ph+KQF: Keyword query filter added on top (§3.2.1).

  • Par+Ph+KQF+TDF: Training data filter employed on top (§3.2.2).

The first three methods do not involve the keyword query filter. In those cases, we generate 20 candidate keyword queries for a given question and randomly select one of those. Only the last method uses TDF, which is a mechanism to select the highest quality training instances (keyword-question pairs), in ranked order, into the training set. For the other methods, we randomly select instances from the entire synthetic training data set to form the training set. We run methods that involve randomization three times and report the means.
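
The difference between the two ways of forming the training set can be sketched as follows. This is an illustration only: the quality scores are assumed to be given (how TDF computes them is described in §3.2.2), and the function names and toy data are hypothetical.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (keyword query, question)


def select_top_k(scored_pairs: List[Tuple[Pair, float]], k: int) -> List[Pair]:
    """TDF-style selection: keep the k pairs with the highest quality score."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in ranked[:k]]


def select_random(pairs: List[Pair], k: int, seed: int) -> List[Pair]:
    """Selection without TDF: sample k pairs uniformly, without replacement."""
    return random.Random(seed).sample(pairs, k)


# Toy data with made-up quality scores.
scored = [
    (("erw pipe usage made meant", "what is usage of erw pipe"), 0.2),
    (("erw pipe usage", "what is usage of erw pipe"), 0.9),
    (("pipe usage", "what is usage of erw pipe"), 0.5),
]
top_two = select_top_k(scored, 2)
random_two = select_random([pair for pair, _ in scored], 2, seed=0)
```

Repeating the random selection with different seeds and averaging corresponds to the repeated runs mentioned above for the methods that involve randomization.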

From Fig. 7, we make the following observations. First, the results are similar to those of the KQGM evaluation in Table 1. Among the three KQGMs, the Par+Ph model performs best. The paraphrase-based KQGM brings noticeable improvements over the baseline KQGM in both ROUGE-L (+6.37% on average) and BLEU (+11.4% on average), while adding phrase detection on top of that only brings minor improvements in ROUGE-L (+0.13% on average) and BLEU (+0.71% on average).

Second, comparing the results of Par+Ph and Par+Ph+KQF, we find that the keyword query filter brings noticeable improvements in both ROUGE-L (+13.4% on average) and BLEU (+16.3% on average). Notice that by adding the keyword query filter, the performance of neural models improves with the size of the training data. Thus, the keyword query filter is an essential element in our synthetic data generation approach.

Figure 7. The influence of different components of our synthetic data generation approach on the end-to-end K2Q task. The x-axis represents the amount of training data; the y-axis indicates the BLEU/ROUGE-L score.
Figure 8. Fraction of the total vocabulary (y-axis) captured within the subset of training instances selected by TDF (x-axis), i.e., the number of unique words present in the selected subset relative to the number in the entire synthetic training set.

Third, we find that Par+Ph+KQF+TDF almost always performs better than Par+Ph+KQF, demonstrating that our training data filter is able to estimate the quality of the generated keyword-question pairs and feed high-quality training instances into the neural networks. One noticeable exception (for both BLEU and ROUGE-L) is the leftmost data point (the smallest training set, 0.5M pairs), where the performance of Par+Ph+KQF+TDF is much below that of Par+Ph+KQF. A further analysis reveals that this is caused by an “insufficient vocabulary” issue. This is illustrated in Fig. 8, where we plot the fraction of the total vocabulary (i.e., unique words in the entire synthetic training set) present in the training subset. We can observe that at this smallest training set size, the Par+Ph+KQF+TDF model has built up only 74% of the vocabulary, as opposed to 94% by the Par+Ph+KQF model. Our training data filter, based on a retrieval method, performs well with frequent terms, but fails on rare terms. It appears that the TDF quality score estimator overvalues common terms and undervalues rare terms when selecting the subset of instances for training.
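
The vocabulary fraction plotted in Fig. 8 can be computed along the following lines. This is a sketch under assumptions: whitespace tokenization, and a vocabulary that counts the words of both the keyword query and the question in each pair (the text does not specify either). The function names and toy pairs are hypothetical.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]  # (keyword query, question)


def vocabulary(pairs: Iterable[Pair]) -> Set[str]:
    """Unique whitespace tokens over both sides of every pair (an assumption)."""
    words: Set[str] = set()
    for keyword_query, question in pairs:
        words.update(keyword_query.split())
        words.update(question.split())
    return words


def vocabulary_coverage(subset: Iterable[Pair], full: Iterable[Pair]) -> float:
    """Fraction of the full vocabulary that also occurs in the subset."""
    full_vocab = vocabulary(full)
    return len(vocabulary(subset) & full_vocab) / len(full_vocab)


# Toy example: the subset misses the rarer words of the second pair.
full_set = [
    ("erw pipe usage", "what is usage of erw pipe"),
    ("popular jbs england", "how popular is jbs in england"),
    ("pipe usage", "what is pipe usage"),
]
subset = [full_set[0], full_set[2]]
coverage = vocabulary_coverage(subset, full_set)
```

A selection method that favors pairs made of frequent terms yields a low coverage value for small subsets, which is the effect described above.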

Finally, as expected from TDF, it greatly benefits performance to use the high-quality training instances first; see the Par+Ph+KQF+TDF model for the 0.5M-1.5M range. In contrast, the last half million training instances yield little to no improvements. These results suggest that creating more high-quality keyword-question pairs might bring predictable improvements for neural K2Q models.

7.2. Manual Evaluation

We also perform a manual evaluation using a sample of 87 real user keyword queries with low query clarity from the Yahoo! Webscope L16 Dataset. (Query clarity is a bounded score whose two extremes indicate “clear” and “vague,” respectively; we only sample queries whose clarity falls below a fixed threshold.) All these queries originate from the query log of Yahoo Answers. For each keyword query, we generate 5 questions, each with a different method: the SDM and TBM baselines and the three neural networks.

7.2.1. Assessment

Three human raters were asked to assess each question along two dimensions: (i) Relevance, which indicates whether the question is relevant to the keyword query content-wise (ignoring grammar mistakes), and (ii) Grammar, which reflects grammatical correctness. Table 3 shows our rating scheme. Raters were further asked to choose the best generated question from among the five alternatives. The number of wins was then aggregated for each of the five methods. If multiple methods generated the same question, the point was awarded to each of them.
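
The aggregation of wins, including the rule for identical questions, can be sketched as follows. The fractional win counts reported in Table 4 are consistent with averaging over the three raters, which this sketch assumes; the data layout, function name, and toy questions are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def count_wins(generated: List[Dict[str, str]],
               choices: List[List[str]]) -> Dict[str, float]:
    """Average number of wins per method.

    generated[q][method] is the question that a method produced for query q;
    choices[r][q] is the question that rater r picked as best for query q.
    Every method that produced the chosen question receives the point.
    """
    totals: Dict[str, float] = defaultdict(float)
    for rater_choices in choices:
        for per_query, best in zip(generated, rater_choices):
            for method, question in per_query.items():
                if question == best:
                    totals[method] += 1.0
    # Average over raters (assumption based on the fractional counts).
    return {method: totals[method] / len(choices) for method in generated[0]}


# Toy example with two methods, two queries, and three raters.
generated = [
    {"AttNet": "Who is popular in england?",
     "CopyNet": "How popular is jbs in england?"},
    {"AttNet": "What is the average price for a movie ticket?",
     "CopyNet": "What is the average price for a movie ticket?"},
]
choices = [
    ["How popular is jbs in england?", "What is the average price for a movie ticket?"],
    ["How popular is jbs in england?", "What is the average price for a movie ticket?"],
    ["Who is popular in england?", "What is the average price for a movie ticket?"],
]
wins = count_wins(generated, choices)
```

Because ties award a point to every tied method, the win counts of all methods can sum to more than the number of queries.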

R Rating scheme
2 The question is meaningful and matches the given keyword
1 The question matches the given keyword, more or less
0 The question either does not make sense or does not match the given keyword
G Rating scheme
2 No grammar errors in the question; it can be understood completely
1 Few grammar errors in the question, but it can be understood
0 Too many grammar errors in the question; it cannot be understood
Table 3. Human rating scheme used in our manual evaluation for relevance (R) and grammar (G).
Methods Relevance Grammar Wins
SDM 0.352 1.643 7.333
TBM 1.065 0.590 14.333
EDNet 0.569 0.682 6.666
AttNet 1.114 1.046 31.666
CopyNet 1.563 0.998 36.000
Cohen’s kappa score 0.499 0.498 0.637
Table 4. Manual K2Q evaluation results. The inter-rater agreement is measured using Cohen’s kappa score (Cohen, 1960). Highest scores are in boldface.
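
For reference, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance. The sketch below implements the standard two-rater formula. How the three raters were combined into the single value reported per column is not stated; averaging over all rater pairs, shown in the second function, is one common choice and is only an assumption here. The example ratings are made up.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa for two raters labeling the same items.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(ratings: List[Sequence[int]]) -> float:
    """Average kappa over all pairs of raters (assumed aggregation)."""
    pairs = list(combinations(ratings, 2))
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)


# Made-up relevance ratings (0, 1, or 2) for six questions.
rater_1 = [2, 2, 1, 0, 1, 2]
rater_2 = [2, 1, 1, 0, 1, 2]
kappa = cohens_kappa(rater_1, rater_2)
```

In this toy example the raters agree on five of six items, while chance agreement is 13/36, giving a kappa of 17/23.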

Table 4 shows the results of the human assessment; the reported scores are means. As expected, the SDM method scores highest on grammar, since it retrieves existing questions from the corpus. However, it achieves a very low score on relevance, since it can only retrieve questions that have been asked before (i.e., exist in the corpus). As in the automatic evaluation results, the attention mechanism brings substantial improvements over the simple Encoder-Decoder model (both in terms of relevance and grammar). As anticipated, the copying mechanism leads to large improvements in terms of relevance (+40.3%); at the same time, the grammar score of CopyNet is only marginally lower than that of AttNet.

7.2.2. Case study

Table 5 provides some examples of generated questions. Clearly, SDM returns grammatically correct, but often irrelevant questions. CopyNet has the ability to capture the meaning of the keyword query, and generates somewhat monotonous but very relevant questions. The other two neural networks seem to capture the query intent only partially, and drift off in directions that are somewhat related to the topic of the query, yet irrelevant, e.g., “What are cute boobs?” and “What is the most popular in england?”

Keyword 1 cute yaoi animes
[S] Do girls watch yaoi anime?
[T] Is it cute when yaoi are animes?
[E] Are there a good animes are cute?
[A] What are cute boobs?
[C] What are cute yaoi animes?
Keyword 2 average price movie ticket 1987
[S] What is the average ticket price for a super bowl ticket?
[T] What is the average price of movie ticket 1987?
[E] What is the average price for a 1987 ticket in 1987?
[A] What is the average price for a movie ticket?
[C] What is the average price of the movie ticket in 1987?
Keyword 3 popular jbs england
[S] How big are jbs feet?
[T] Who are popular sovereignty and jbs england related?
[E] What is the most popular in england?
[A] Who is popular in england?
[C] How popular is jbs in england?
Table 5. Examples of generated questions from our K2Q system. The methods used to generate the questions are [S] SDM, [T] TBM, [E] EDNet, [A] AttNet, and [C] CopyNet.

7.2.3. Comparison to Google

The “People also ask” service, provided by Google and illustrated in Fig. 9, is somewhat similar to our keyword-to-question task. Therefore, we considered including it in our evaluation for baseline comparison. When running our test queries through Google, we noticed that the “People also ask” panel is triggered for only 34% of our queries. As we already pointed out in the introduction, for each query, we wish to generate a natural language question that most likely represents the user’s underlying information need. It appears that Google’s service is addressing a fundamentally different task, namely, suggesting questions related to the query that are asked by sufficiently many people and to which answers are known to exist. For these reasons, we do not compare our methods against this service.

Figure 9. Screenshot of Google’s “People also ask” service (captured on January 15, 2018) for the query “average price movie ticket 1987.”

8. Related work

In this section we review related research from two areas: keyword-to-question and synthetic data generation.

8.1. Keyword to Question

Relevant work on K2Q systems includes (Kotov and Zhai, 2010; Zheng et al., 2011; Zhao et al., 2011; Dror et al., 2013). All these systems follow a template-based approach, and are evaluated in terms of relevance, diversity, and grammatical correctness. While some differences exist among these systems, all consist of three main steps. First, they extract question templates from millions of keyword-question pairs by substituting keyword terms in questions with slots, and store the resulting keyword-template pairs in a database. Second, given a new keyword query, they search the database for similar keyword queries, collect the templates related to those similar queries, and instantiate those templates with the new query to generate candidate questions. Finally, a parameterized ranking model is used to calculate the probability of each candidate question being generated by the query, and to rank all candidate questions. Instead of template-based methods, we propose to address the K2Q task using state-of-the-art neural machine translation approaches.
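
The first two steps of the template-based approach can be illustrated with a toy sketch. This is not the implementation of any of the cited systems: it assumes whitespace tokenization and exact term matching, and it omits both the search for similar keyword queries and the final ranking model. The slot notation, function names, and the second keyword query are hypothetical.

```python
SLOT = "<slot{}>"


def extract_template(keywords: str, question: str) -> str:
    """Replace each keyword term that occurs in the question with a numbered slot."""
    slots = {term: SLOT.format(i) for i, term in enumerate(keywords.split())}
    return " ".join(slots.get(token, token) for token in question.split())


def instantiate(template: str, keywords: str) -> str:
    """Fill the numbered slots of a template with the terms of a new keyword query."""
    result = template
    for i, term in enumerate(keywords.split()):
        result = result.replace(SLOT.format(i), term)
    return result


# Step 1: derive a template from a known keyword-question pair.
template = extract_template("erw pipe usage", "what is usage of erw pipe")

# Step 2: reuse the template for a new keyword query of the same shape.
candidate = instantiate(template, "copper wire price")
```

The sketch also hints at why such systems need very large collections of keyword-question pairs: a template is only reusable for new queries whose terms play the same roles in the same positions.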

Other question generation tasks were also addressed in the literature, including converting assertions identified in text (sentences, paragraphs) into question form (Agarwal et al., [n. d.]; Zhou et al., 2017). In contrast, our task aims to expand keyword terms into a natural language question.

8.2. Synthetic Data Generation

The idea of automatically generating synthetic data (pseudo test collections) for information retrieval (IR) has attracted some attention in past years (Azzopardi et al., 2007; Berendsen et al., 2013). To the best of our knowledge, utilizing simulated queries for evaluating IR was first suggested by Azzopardi and de Rijke (2006), who also proposed an algorithm for generating simulated queries (Azzopardi and de Rijke, 2006; Azzopardi et al., 2007). Their experimental results show that it is possible to generate simulated queries for web search with performance comparable to that of manual queries. Besides, the idea of generating pseudo test collections was also utilized for the training of supervised (learning-to-rank) retrieval models for web search (Asadi et al., 2011) and for ad-hoc search on domain-specific, semi-structured documents (Berendsen et al., 2012, 2013). It should be pointed out that automatically generated synthetic training data for deep learning has accomplished a great deal in computer vision (Handa et al., 2015; Zhang et al., 2015; Gan et al., 2015; Ros et al., 2016). Even though synthetic data is imperfect, these efforts show the feasibility of training robust and effective neural network models with noisy, but very large-scale data. In this paper, we have proposed a keyword query generation model and developed various filtering mechanisms, in order to create synthetic training data for training neural K2Q models.

9. Conclusions

In this work, we have studied the problem of translating keyword queries to natural language questions using neural approaches. To the best of our knowledge, this is the first application of neural machine translation methods to the keyword-to-question (K2Q) task. Perhaps the most innovative aspect of this work is the combination of keyword query generation models with various filtering mechanisms to create massive amounts of synthetic data for training neural models. Our empirical evaluation has demonstrated the effectiveness of our synthetic data generation approach for the K2Q task.

In this paper, we have generated only a single question for each keyword query, and evaluated it with respect to relevance and grammatical correctness. The same neural models, however, may also be used to generate a diverse list of questions for a given keyword query, with the help of techniques like beam search (Freitag and Al-Onaizan, 2017). For example, given the keyword query “Bible verse about education,” our neural models generated a range of diverse and meaningful questions, including:

  • What is the fugitive slave verse about education?

  • What is the christ verse about education?

  • What is the sacred verse about education?

  • What does Bible verse say about education?
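
Beam search, mentioned above as a way to obtain several candidate questions instead of one, can be sketched generically as follows. This is a plain textbook variant over a toy next-token model, not the decoder used to produce the examples above; the function names, the toy probabilities, and the end-of-sequence marker are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence_ = Tuple[str, ...]


def beam_search(next_log_probs: Callable[[Sequence_], Dict[str, float]],
                beam_width: int, max_len: int,
                eos: str = "</s>") -> List[Tuple[Sequence_, float]]:
    """Return up to beam_width finished sequences with their log-probabilities."""
    beams: List[Tuple[Sequence_, float]] = [((), 0.0)]
    finished: List[Tuple[Sequence_, float]] = []
    for _ in range(max_len):
        # Expand every open hypothesis by every possible next token.
        candidates = []
        for prefix, score in beams:
            for token, log_p in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + log_p))
        candidates.sort(key=lambda item: item[1], reverse=True)
        # Keep the best beam_width expansions; set aside the finished ones.
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.sort(key=lambda item: item[1], reverse=True)
    return finished[:beam_width]


# Toy model: the next-token distribution depends only on the last token.
TOY_MODEL = {
    None: {"what": math.log(0.6), "which": math.log(0.4)},
    "what": {"does": math.log(0.7), "is": math.log(0.3)},
    "which": {"verse": 0.0},
    "does": {"</s>": 0.0},
    "is": {"</s>": 0.0},
    "verse": {"</s>": 0.0},
}


def toy_next(prefix: Sequence_) -> Dict[str, float]:
    return TOY_MODEL[prefix[-1] if prefix else None]


results = beam_search(toy_next, beam_width=2, max_len=4)
```

With a beam width greater than one, the search returns several complete hypotheses, which is what makes it usable for producing a list of alternative questions for a single keyword query.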

In the future, we are interested in generating a diverse set of questions (i.e., the bottom part in Fig. 1) and comparing these with existing template-based methods with respect to diversity.

In summary, our methods have shown great potential and promise for creating synthetic training data that can be used to train robust neural models; future applications of this idea extend beyond the keyword-to-question task.