Search strategies for generating a response from a neural dialogue model have received relatively little attention compared to improving network architectures and learning algorithms in recent years. In this paper, we consider a standard neural dialogue model based on recurrent networks with an attention mechanism, and focus on evaluating the impact of the search strategy. We compare four search strategies: greedy search, beam search, iterative beam search and iterative beam search followed by selection scoring. We evaluate these strategies using human evaluation of full conversations and compare them using automatic metrics including log-probabilities, scores and diversity metrics. We observe a significant gap between greedy search and the proposed iterative beam search augmented with selection scoring, demonstrating the importance of the search algorithm in neural dialogue generation.
There are three high-level steps to building a neural autoregressive sequence model for dialog modelling, of the kind inspired by the successful work of Vinyals and Le (2015). First, decide on a specific network architecture which will consume both previous utterances as well as any extra information such as speaker identifiers. Second, select a suitable learning strategy. Finally, decide on your search algorithm, as neural autoregressive sequence models do not admit a tractable, exact approach for generating the most likely response.
Recent research in neural dialogue modelling has often focused on the first two aspects. A number of variants of sequence-to-sequence models (Sutskever et al., 2014; Cho et al., 2014; Kalchbrenner and Blunsom, 2013) have been proposed for dialogue modelling in recent years, including hierarchical models (Serban et al., 2016) and transformers (Mazaré et al., 2018; Yang et al., 2018). These advances in network architectures have often been accompanied by advanced learning algorithms. Serban et al. (2017) introduce latent variables to their earlier hierarchical model and train it to maximize the variational lower bound, similar to Zhao et al. (2017), who propose to build a neural dialogue model as a conditional variational autoencoder. Xu et al. (2017) and Li et al. (2017b) train a neural dialogue model as conditional generative adversarial networks (Mirza and Osindero, 2014). These two learning algorithms, variational lower-bound maximization and adversarial learning, have been combined into a single model by Shen et al. (2018), which has been followed by Gu et al. (2018).
Despite abundant endeavors on modelling and learning, search has received only a little attention. Most of the work on search has focused on training an additional neural network that provides a supplementary score to guide either greedy or beam search; we refer to this as the selection strategy. Li et al. (2015) propose a maximum mutual information criterion for decoding using a reverse model. This has been extended by Li et al. (2017a), where an extra neural network is trained to predict an arbitrary reward given a partial hypothesis and used during decoding. Similarly, Zemlyanskiy and Sha (2018) train a neural network that predicts the other participant’s personality given a partial conversation and use its predictability as an auxiliary score for re-ranking a set of candidate responses. None of these approaches studies how the choice of the underlying search algorithm, rather than its scoring function, affects the quality of the neural dialogue model.
In this paper, we investigate the effects of varying search and selection strategies on the quality of generated dialogue utterances. We start with a straightforward modeling approach using an attention-based sequence-to-sequence model (Bahdanau et al., 2014) trained on the recently-released PersonaChat dataset (Zhang et al., 2018). We evaluate three search algorithms: greedy search, beam search and iterative beam search, the last of which is designed by us based on earlier works by Batra et al. (2012). These algorithms are qualitatively different from each other in the size of subspace over which they search for the best response. Furthermore, we investigate the effect of learning an additional scoring function to select a response from those returned by the search function, observing additional improvements by thus separating search and selection. All of these alternatives are compared using human evaluation of multi-turn conversations.
We observed high variance in the human evaluation distribution due to the bias in individual workers and propose an algorithm to reduce it using Bayesian inference, which aims to approximate the posterior distribution of the scores using a latent worker bias variable.
Our experiments reveal that a significant improvement can be achieved by simply choosing a better search strategy, with the best strategy being the combination of the proposed iterative beam search and a sequence selection function. Human annotators favoured conversations with the same model when the best search strategy was used, and the diversity of generated responses, measured in terms of the numbers of distinct bi-/trigrams within each conversation, was higher. These observations strongly suggest the importance of search in neural dialogue modelling, and that any comparison of neural dialogue models must be done after selecting the best search strategy for each model.
We share the trained model, code and human evaluation transcripts with readers for any further analysis (https://beamdream.github.io/).
Since Vinyals and Le (2015), a neural autoregressive sequence model based on sequence-to-sequence models (Sutskever et al., 2014; Cho et al., 2014) has become one of the most widely studied approaches to dialogue modelling (see, e.g., Serban et al., 2016, 2017; Zhao et al., 2017; Xu et al., 2017; Li et al., 2016, 2017a, 2017b; Zemlyanskiy and Sha, 2018; Zhang et al., 2018; Miller et al., 2017; Shen et al., 2018; Gu et al., 2018). In this approach, a neural sequence model is used to model a conditional distribution over responses given a context which consists of previous utterances by both itself and a partner in the conversation as well as any other information such as features of the speaker.
A neural autoregressive sequence model learns the conditional distribution over all possible responses given the context $U$, and the conditional probability of a response $Y = (y_1, \ldots, y_T)$ is factorized into a product of next-token probabilities:
$$p(Y \mid U) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, U). \tag{1}$$
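As a rough illustration of this factorization, the sketch below sums next-token log-probabilities over a response. The callable `toy_model` and its probabilities are made up for illustration; in a real system this role is played by a neural network conditioned on the context.

```python
import math

def sequence_log_prob(next_token_probs, response):
    """Sum the log-probability of each token given the tokens before it.

    `next_token_probs(prefix)` is a hypothetical callable returning a dict
    from candidate next tokens to probabilities; the context is assumed to
    be baked into the callable.
    """
    total = 0.0
    for t, token in enumerate(response):
        probs = next_token_probs(tuple(response[:t]))
        total += math.log(probs[token])
    return total

# Toy next-token distribution (illustrative values only).
def toy_model(prefix):
    if len(prefix) == 0:
        return {"hi": 0.5, "hello": 0.5}
    return {"there": 0.25, "<eos>": 0.75}

# log p("hi there <eos>") = log 0.5 + log 0.25 + log 0.75
log_p = sequence_log_prob(toy_model, ["hi", "there", "<eos>"])
```

Working in log space turns the product of next-token probabilities into a sum, which is also the quantity the search algorithms discussed later try to maximize.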
Each conditional distribution on the r.h.s. above is then modelled by a neural network, and popular choices include recurrent neural networks (Mikolov et al., 2010; Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2014), convolutional networks (Dauphin et al., 2016; Gehring et al., 2017) and self-attention (Sukhbaatar et al., 2015; Vaswani et al., 2017). Our goal is to assess the impact of the search algorithm, so we fix the model to a standard neural autoregressive sequence model: see the appendix for a full list of hyperparameter values.
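Each example in a training set $\mathcal{D}$ consists of auxiliary information or context $U$ (such as a persona profile or external knowledge context) and a sequence of utterances, each of which is marked with a speaker tag, i.e.,
$$C = \left(U, \left((Y_1, s_1), \ldots, (Y_M, s_M)\right)\right) \in \mathcal{D}, \tag{2}$$
where $Y_m$ is the utterance from the $m$-th turn by a speaker $s_m \in \{a, b\}$. The conditional log-probability assigned to this example given by a neural sequence model is then written as
$$\log p(C) = \sum_{s \in \{a, b\}} \sum_{m : s_m = s} \log p\left(Y_m \mid Y^{s}_{<m}, Y^{\bar{s}}_{<m}, U\right), \tag{3}$$
where $\bar{s} = b$ if $s = a$ and otherwise $\bar{s} = a$, and $Y^{s}_{<m}$ denotes the utterances of speaker $s$ before the $m$-th turn. Each term inside the summation above is mapped to the autoregressive distribution in Eq. (1) by considering $(Y^{s}_{<m}, Y^{\bar{s}}_{<m}, U)$ as $U$.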
In this paper, we generate a response $Y = (y_1, \ldots, y_T)$ to the current state of the conversation $U$ (but do not attempt to plan ahead to future exchanges), maximizing
$$\log p(Y \mid U) = \sum_{t=1}^{T} \log p(y_t \mid y_{<t}, U).$$
Unfortunately, it is intractable to solve this problem due to the exponentially-growing space of all possible responses w.r.t. the maximum length $T$. It is thus necessary to resort to approximate search algorithms. We describe here the two most widely used search algorithms for neural autoregressive sequence models.
Greedy search has been the search algorithm of choice among the recent papers on neural dialogue modelling (Gu et al., 2018; Zhao et al., 2017; Xu et al., 2017; Weston et al., 2018; Zhang et al., 2018).
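This algorithm moves from left to right selecting one token at a time, simply choosing the most likely token at the current time step:
$$\hat{y}_t = \arg\max_{v \in V} \log p(y_t = v \mid \hat{y}_{<t}, U),$$
where $V$ is the vocabulary.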
Greedy search has been found significantly sub-optimal within the field of machine translation (see, e.g., Table 1 in Chen et al., 2018), where similar neural sequence models are frequently used.
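The sub-optimality of greedy search can be seen on a small example. The sketch below is a minimal illustration with a made-up next-token table (`TOY`); none of the names or probabilities come from the model used in this paper.

```python
import math

EOS = "<eos>"

# Toy next-token probabilities keyed by prefix (illustrative values only).
TOY = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {EOS: 1.0},
    ("a", "x"): {EOS: 1.0},
    ("a", "y"): {EOS: 1.0},
}

def toy_log_probs(prefix):
    return {tok: math.log(p) for tok, p in TOY[prefix].items()}

def greedy_search(next_token_log_probs, max_len=20):
    """Pick the single most likely next token at every step."""
    response = []
    for _ in range(max_len):
        log_probs = next_token_log_probs(tuple(response))
        token = max(log_probs, key=log_probs.get)
        response.append(token)
        if token == EOS:
            break
    return response

greedy_response = greedy_search(toy_log_probs)
```

Here greedy search commits to `a` because it is locally most likely and ends with a response of probability 0.6 × 0.55 = 0.33, although the response `b <eos>` has probability 0.4.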
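Instead of maintaining a single hypothesis at a time, as in greedy search above, beam search maintains $K$ hypotheses:
$$\mathcal{H}^{t} = \left\{ (y^1_1, \ldots, y^1_t), \ldots, (y^K_1, \ldots, y^K_t) \right\}. \tag{4}$$
Each hypothesis $h^i = (y^i_1, \ldots, y^i_t)$ is expanded with all possible next tokens $v$ to form candidate hypotheses, each of which is in the form of
$$\tilde{h}^i_v = h^i \,\|\, (v) = (y^i_1, \ldots, y^i_t, v), \tag{5}$$
where $v \in V$. Each candidate is associated with its score:
$$s(\tilde{h}^i_v) = \log p(h^i \mid U) + \log p(v \mid h^i, U). \tag{6}$$
The new hypothesis set of $K$ hypotheses is then constructed as
$$\mathcal{H}^{t+1} = \underset{i, v}{\arg\operatorname{top-}K}\; s(\tilde{h}^i_v).$$
When all the hypotheses in the new hypothesis set have terminated, i.e., $y^i_{t+1} = \langle\text{eos}\rangle$ for all $i$, beam search terminates, and the hypothesis with the highest score (6) is returned.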
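The expand, score and prune loop described above can be sketched as follows. This is a minimal illustration on a made-up next-token table, not the implementation used in the experiments; in particular, carrying finished hypotheses over unchanged is an assumption of this sketch.

```python
import math

EOS = "<eos>"

# Toy next-token probabilities keyed by prefix (illustrative values only).
TOY = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {EOS: 1.0},
    ("a", "x"): {EOS: 1.0},
    ("a", "y"): {EOS: 1.0},
}

def toy_log_probs(prefix):
    return {tok: math.log(p) for tok, p in TOY[prefix].items()}

def beam_search(next_token_log_probs, beam_size, max_len=20):
    beam = [((), 0.0)]  # list of (hypothesis, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for hyp, score in beam:
            if hyp and hyp[-1] == EOS:
                candidates.append((hyp, score))  # finished: carried over as is
                continue
            # Expand the hypothesis with every possible next token.
            for tok, lp in next_token_log_probs(hyp).items():
                candidates.append((hyp + (tok,), score + lp))
        # Keep only the beam_size highest-scoring candidates.
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(h[-1] == EOS for h, _ in beam):
            break
    return max(beam, key=lambda c: c[1])

best_hyp, best_score = beam_search(toy_log_probs, beam_size=2)
```

With a beam of two, the hypothesis starting with `b` survives the first step and is eventually returned, which greedy search (a beam of one) would miss on this toy table.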
One can increase the size of the subspace over which beam search searches for a response by simply increasing the size of the beam $K$. While beam search is currently the method of choice in many applications, it is known to suffer from the problem that most of the hypotheses discovered are near each other in the response space (Li et al., 2015, 2016). For tasks like dialogue, which are much more open-ended than e.g. machine translation, this is particularly troublesome as many high quality responses may be missing in the beam.
Although this has not been reported in a formal publication in the context of neural dialogue modelling to our knowledge, OpenNMT-py (Klein et al., 2017) implements so-called $n$-gram blocking. In $n$-gram blocking, a hypothesis in a beam is discarded if there is an $n$-gram that appears more than once within it. This feature is especially useful in dialogue modelling, since it is unlikely for any $n$-gram to repeat within a single utterance.
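The blocking test itself is simple to state in code. The helper below is a hypothetical stand-alone sketch of the check described above, not the OpenNMT-py implementation.

```python
def has_repeated_ngram(tokens, n):
    """Return True if any n-gram occurs more than once in `tokens`."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False

hypothesis = "i like dogs and i like cats".split()
blocked_with_bigrams = has_repeated_ngram(hypothesis, 2)   # "i like" repeats
blocked_with_trigrams = has_repeated_ngram(hypothesis, 3)  # no trigram repeats
```

A beam search using this check would discard (or assign negative infinity to) any hypothesis for which the function returns `True`.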
We now propose an improved search strategy, which consists of a more diverse search exposing more high quality responses by considering an iterative beam search, followed by selection from that set with a learnt scoring function.
Search strategies that rely on the conditional log-probability of the sequence for selection tend to prefer syntactically well-formed responses due to the word-based maximum likelihood training. We thus introduce a parameterized sequence selection scoring function which aims to select the best candidate among the given set of possible hypotheses.
The search space over which beam search has operated can be characterized by the union of all partial hypothesis sets $\mathcal{H}^t$ in Eq. (4):
$$\mathcal{H}_0 = \bigcup_{t=1}^{T} \mathcal{H}^t_0,$$
where we use the subscript $0$ to indicate that beam search has been done without any other constraint. Re-running beam search with an increased beam width $K$ would result in a search space that overlaps significantly with $\mathcal{H}_0$, and would not give us much of a benefit with respect to the increase in computation.
Instead, we keep the beam size $K$ constant but run multiple iterations of beam search while ensuring that any previously explored space
$$\bar{\mathcal{H}}_{<l} = \bigcup_{l'=0}^{l-1} \mathcal{H}_{l'}$$
is not included in a subsequent iteration of beam search. This is done by setting the score of each candidate hypothesis $s(\tilde{h})$ in Eq. (6) to negative infinity when this candidate is included in $\bar{\mathcal{H}}_{<l}$. We relax this inclusion criterion by using a non-binary similarity metric, and say that the candidate $\tilde{h}$ is included in $\bar{\mathcal{H}}_{<l}$ if
$$\min_{h \in \bar{\mathcal{H}}_{<l}} \Delta(\tilde{h}, h) < \epsilon, \tag{7}$$
where $\Delta$ is a string similarity measure, such as Hamming distance as used in this work, and $\epsilon$ is a similarity threshold.
This procedure ensures that a new partial hypothesis set of beam search in the $l$-th iteration does not overlap at all with any part of the search space explored earlier during the preceding iterations $0, \ldots, l-1$ of beam search. By running this iteration multiple times, we end up with a set of top hypotheses, one from each iteration of beam search, from which the best one is selected according to, for instance, the log-probability assigned by the model.
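The procedure can be sketched in its naive, sequential form as follows. This is an illustrative sketch on a made-up next-token table, not the released implementation: the function names, the choice to treat hypotheses of different lengths as infinitely far apart, and the default threshold of 1 (which excludes only exact matches) are all assumptions of this example.

```python
import math

EOS = "<eos>"

# Toy next-token probabilities keyed by prefix (illustrative values only).
TOY = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {EOS: 1.0},
    ("a", "x"): {EOS: 1.0},
    ("a", "y"): {EOS: 1.0},
}

def toy_log_probs(prefix):
    return {tok: math.log(p) for tok, p in TOY[prefix].items()}

def hamming(a, b):
    # Assumption: hypotheses of different lengths are infinitely far apart.
    if len(a) != len(b):
        return math.inf
    return sum(x != y for x, y in zip(a, b))

def constrained_beam_search(log_probs_fn, explored, beam_size, threshold, max_len=20):
    beam, visited = [((), 0.0)], set()
    for _ in range(max_len):
        candidates = []
        for hyp, score in beam:
            if hyp and hyp[-1] == EOS:
                candidates.append((hyp, score))  # keep finished hypotheses
                continue
            for tok, lp in log_probs_fn(hyp).items():
                cand = hyp + (tok,)
                # Dropping the candidate has the same effect as a score of -inf.
                if any(hamming(cand, h) < threshold for h in explored):
                    continue
                candidates.append((cand, score + lp))
        if not candidates:
            break
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        visited.update(h for h, _ in beam)
        if all(h[-1] == EOS for h, _ in beam):
            break
    return beam, visited

def iterative_beam_search(log_probs_fn, num_iterations, beam_size, threshold=1):
    explored, tops = set(), []
    for _ in range(num_iterations):
        beam, visited = constrained_beam_search(log_probs_fn, explored, beam_size, threshold)
        finished = [c for c in beam if c[0] and c[0][-1] == EOS]
        if finished:
            tops.append(max(finished, key=lambda c: c[1]))
        explored |= visited  # later iterations must avoid this space
    return tops

tops = iterative_beam_search(toy_log_probs, num_iterations=3, beam_size=1)
best = max(tops, key=lambda c: c[1])
```

On this toy table, the first iteration returns `a x <eos>`; the second iteration is forced away from the prefix `a` and finds the more probable `b <eos>`; the third iteration finds nothing new because every first token has already been explored.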
A major issue with iterative beam search in its naive form is that it requires running beam search multiple times, when even a single run of beam search can be prohibitively slow in an interactive environment, such as in dialogue generation.
We address this computational issue by performing these many iterations of beam search in parallel. At each time step in the search, we create sets of candidate hypotheses for all iterations in parallel, and go through these candidate sets in sequence from the first iteration down to the last iteration, while eliminating those candidates that satisfy the criterion in Eq. (7). We justify this parallelized approach by defining the similarity measure $\Delta$ to be always larger than the threshold $\epsilon$ when the previous hypothesis $h$ is longer than the candidate $\tilde{h}$ in Eq. (7).
When training a sequence-to-sequence model by maximizing the log-likelihood in Eq. (3), each and every token in the response side is treated equally. This encourages the model to focus more on frequent tokens than less frequent ones, which consequently makes the model more syntax-oriented than semantics-oriented, as discussed earlier by Collobert et al. (2011). Although this behaviour is desirable when we use the model to generate a well-formed response, it is not necessarily desirable for selecting a semantically meaningful response given the context.
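We thus propose to augment the underlying sequence-to-sequence model with a selection scoring function $f(Y)$, where $Y$ is a response and $f$ computes its score given the final decoder state of the underlying model. We train this scoring function using a pairwise ranking loss and a set of negative examples, following Collobert et al. (2011):
$$\mathcal{L}_{\text{rank}} = \sum_{Y^- \in \mathcal{N}} \max\left(0,\; \gamma - f(Y^+) + f(Y^-)\right), \tag{8}$$
where $\gamma$ is the margin, $\mathcal{N}$ is a set of negative responses and $Y^+$ is a ground-truth response, given the context $U$. We choose negative responses from the training set uniformly at random.
The ranking loss is combined with the negative log-likelihood $\mathcal{L}_{\text{gen}}$ derived from Eq. (3) into a single weighted objective:
$$\mathcal{L} = \lambda_{\text{gen}} \mathcal{L}_{\text{gen}} + \lambda_{\text{rank}} \mathcal{L}_{\text{rank}}. \tag{9}$$
We start with $\lambda_{\text{rank}} = 0$ until the model converges and then fine-tune it with $\lambda_{\text{rank}}$ set to $1.0$ and $\lambda_{\text{gen}}$ set to $0.1$ (the values listed in the hyperparameter table in the appendix), which is equivalent to pretraining the model with maximum likelihood only first and finetuning it with both near the end of learning.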
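To make the training signal concrete, the sketch below computes a margin-based pairwise ranking loss and a weighted combination with a generation loss. It assumes a hinge loss summed over negatives and a weighted-sum objective; the margin of 1.0 and the fine-tuning weights (1.0 for ranking, 0.1 for generation) follow the hyperparameter table in the appendix, while the pretraining generation weight of 1.0 and all score values are made up for illustration.

```python
def pairwise_ranking_loss(positive_score, negative_scores, margin=1.0):
    """Hinge loss: the ground-truth response should outscore each negative by `margin`."""
    return sum(max(0.0, margin - positive_score + neg) for neg in negative_scores)

def combined_loss(generation_loss, ranking_loss, generation_weight, rank_weight):
    """Weighted sum of the two losses (an assumed form of the joint objective)."""
    return generation_weight * generation_loss + rank_weight * ranking_loss

# Scores for one ground-truth response and three random negatives (made up).
rank = pairwise_ranking_loss(2.0, [0.5, 1.5, 3.0])  # 0 + 0.5 + 2.0

# Pretraining: ranking term switched off (generation weight assumed to be 1.0).
pretrain = combined_loss(4.0, rank, generation_weight=1.0, rank_weight=0.0)

# Fine-tuning: both terms active.
finetune = combined_loss(4.0, rank, generation_weight=0.1, rank_weight=1.0)
```

Only negatives that score within the margin of the ground-truth response contribute to the loss, so well-separated negatives are ignored.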
Broadly, there are two ways to evaluate a neural dialogue model. The first approach is to use a set of (often human generated) reference responses and compare a generated response against them (Serban et al., 2015; Liu et al., 2016). There are two sub-approaches: (1) measure the perplexity of reference responses using the neural dialogue model, and (2) compute a string match-based metric of a generated response against reference responses. Neither of these approaches, however, captures the effectiveness of a neural sequence model in conducting a full conversation, as during this evaluation the model responses are computed given a full reference context from the dataset, i.e. it does not see its own responses in the dialogue history, but gold responses instead. This constraint is necessary as a reference response is valid only when placed within a given context, and any deviation in the context from the collected context easily, if not always, invalidates it as a reference response.
We thus take the second approach, where a neural dialogue model has a full conversation with a human partner (or annotator) (Zhang et al., 2018; Zemlyanskiy and Sha, 2018; Weston et al., 2018). Unlike the first approach, it requires active human interaction, as a conversation almost always deviates from a previously collected conversation even with the same auxiliary information ($U$ in Eq. (2)). This evaluation strategy reflects both how well a neural dialogue model generates a response given a correct context as well as how well it adapts to a dynamic conversation; the latter was not measured by the first strategy. In the rest of this section, we describe our approach to human evaluation of a full conversation and propose Bayesian calibration to address the annotator bias.
We ask a human annotator to have a conversation with a randomly selected bot (characterized by its choice of search algorithm) for at least five turns.
At the end of the conversation, we ask the annotator three sets of questions (detailed descriptions of the questions are provided in the appendix):
Overall score (1 to 5)
Marking of each good utterance-pair (binary)
Marking of each bad utterance-pair (binary)
The first overall score allows us to draw a conclusion on which algorithm makes a better conversation overall. The latter two are collected in addition to the overall score to investigate the relationship between the overall impression and the quality of each utterance-pair.
Although human evaluation is desirable, raw scores collected by human annotators are difficult to use directly due to annotator bias. Some annotators are more generous while others are quite harsh, leading the naive average score to have very high variance; for example, as recently reported in (Zhang et al., 2018; Zemlyanskiy and Sha, 2018). It is necessary to calibrate raw scores so as to remove these annotator biases, and we propose to use Bayesian inference here as a framework for removing such biases. We describe two instances of this framework.
We treat both the unobserved score of each model we are comparing, in our case each search algorithm, and the unobserved bias of each annotator as latent variables. The score $M_i$ of the $i$-th model follows the following distribution:
$$\mu_i \sim \mathcal{U}(1, 5), \qquad M_i \sim \mathcal{N}(\mu_i, 1^2),$$
where $\mathcal{U}$ and $\mathcal{N}$ are uniform and normal distributions. It states that a priori each model is likely to be uniformly good or bad. The annotator bias $B_j$ then follows
$$B_j \sim \mathcal{N}(0, 1^2),$$
where we are stating that each annotator does not have any bias a priori.
Given the model score $M_i$ and annotator bias $B_j$, the conditional distribution over an observed score $S_{ij}$ given by the $j$-th annotator to the $i$-th model is then:
$$S_{ij} \sim \mathcal{N}(M_i + B_j, 1^2).$$
Of course, due to the nature of human evaluation, only a few of the $S_{ij}$’s are observed.
The goal of inference in this case is to infer the posterior mean and the variance:
$$\mathbb{E}\left[M_i \mid \mathcal{S}\right], \qquad \mathbb{V}\left[M_i \mid \mathcal{S}\right], \tag{10}$$
where $\mathcal{S} = \{S_{ij}\}$ is a set of observed scores.
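The effect of separating model scores from annotator biases can be illustrated without a full sampler. The sketch below is a simplified stand-in, not the posterior inference described above: it assumes an additive model (observed score = model score + annotator bias + noise) with unit-variance Gaussian noise, a zero-mean unit-variance Gaussian prior on each bias and a flat prior on the model scores, and it computes a point estimate by coordinate descent rather than a posterior mean and variance. The function name and all scores are made up.

```python
def calibrate(scores, num_sweeps=200):
    """Point-estimate model scores and annotator biases from sparse ratings.

    `scores` maps (model, annotator) pairs to observed scores.
    """
    models = sorted({m for m, _ in scores})
    annotators = sorted({a for _, a in scores})
    model_score = {m: 0.0 for m in models}
    bias = {a: 0.0 for a in annotators}
    for _ in range(num_sweeps):
        for m in models:
            resid = [s - bias[a] for (mm, a), s in scores.items() if mm == m]
            model_score[m] = sum(resid) / len(resid)
        for a in annotators:
            resid = [s - model_score[m] for (m, aa), s in scores.items() if aa == a]
            # The "+ 1" shrinks each bias toward zero (zero-mean, unit-variance prior).
            bias[a] = sum(resid) / (len(resid) + 1)
    return model_score, bias

# A generous, a harsh and a neutral annotator; not everyone rates every model.
observed = {
    ("greedy", "generous"): 3.5,
    ("beam", "generous"): 4.0,
    ("beam", "harsh"): 2.0,
    ("greedy", "neutral"): 2.5,
    ("beam", "neutral"): 3.0,
}

naive = {
    m: sum(s for (mm, _), s in observed.items() if mm == m)
    / sum(1 for (mm, _) in observed if mm == m)
    for m in ("greedy", "beam")
}
calibrated, biases = calibrate(observed)
```

In this toy data the naive averages of the two models are identical, because the harsh annotator only rated `beam`. Once a per-annotator bias is estimated, `beam` is ranked above `greedy`, which matches the preference of both annotators who rated both models.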
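When an annotator labels pairs of utterances from the conversation with a binary score (such as whether that pair was a “good” exchange), we need to further take into account the turn bias $T_k$:
$$T_k \sim \mathcal{N}(0, 1^2).$$
Also, as we will use a Bernoulli distribution for each observed score, we modify the priors of the model scores $M_i$ and annotator biases $B_j$:
$$M_i \sim \mathcal{N}(0, 1^2), \qquad B_j \sim \mathcal{N}(0, 1^2).$$
The distribution of an observed utterance-pair score is then
$$S_{ijk} \sim \mathcal{B}\left(\operatorname{sigmoid}(M_i + B_j + T_k)\right),$$
where $\mathcal{B}$ is a Bernoulli distribution.
The goal of inference is to compute
$$\mathbb{E}\left[\operatorname{sigmoid}(M_i) \mid \mathcal{S}\right], \qquad \mathbb{V}\left[\operatorname{sigmoid}(M_i) \mid \mathcal{S}\right], \tag{11}$$
where $\mathcal{S}$ is the set of observed utterance-pair scores $S_{ijk}$. These quantities estimate the average number of positively labelled utterance-pairs given the $i$-th model and the uncertainty in this estimate, respectively.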
We exclusively use PersonaChat, released recently by Zhang et al. (2018) and the main dataset for the Conversational Intelligence Challenge 2 (ConvAI2; http://convai.io/), to train a neural dialogue model. The dataset contains dialogues between pairs of speakers, each randomly assigned a persona from a set of 1155, each consisting of 4-5 lines of description about the part they should play, e.g. “I have two dogs” or “I like taking trips to Mexico”. The training set consists of 9,907 such dialogues where the partners play their roles, and the validation set of 1,000 dialogues. The ConvAI2 test set has not been released. Each dialogue is tokenized into words, resulting in a vocabulary of 19,262 unique tokens. We refer the reader to Zhang et al. (2018) for more details.
We closely follow Bahdanau et al. (2014) in building an attention-based neural autoregressive sequence model. The encoder has two bidirectional layers of 512 long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) units per direction and layer, and the decoder has two layers of 512 LSTM units each. We use global general attention as described by Luong et al. (2015). We use the same word embedding matrix on both the encoder and decoder, which is initialized from 300-dimensional pretrained GloVe vectors (Pennington et al., 2014). We allow word embedding weights to be updated during training.
The sequence selection scoring function used in our proposed search strategy is a multi-layer perceptron with four layers of 512 units each. It outputs a scalar score at the end, and takes as input the LSTM cell from the final step in the decoder.
We use Adam (Kingma and Ba, 2014) with the initial learning rate set to 0.001. We apply dropout (Srivastava et al., 2014) between the LSTM layers to prevent overfitting. We train the neural dialogue model until it early-stops on the validation set (when the validation loss from Eq. (3) does not improve for twelve epochs, we early-stop), then fine-tune it together with the scoring function. During finetuning, we set the ranking and generation weights in Eq. (9) to 1.0 and 0.1, respectively.
We show in Table 1 the quality of the trained model before and after finetuning. We use the metrics used by the ConvAI2 competition, which are perplexity (PPL) and hits@1. Compared to the leaderboard (http://convai.io/#leaderboard), the finetuned model is reasonable and, we believe, serves well as an underlying system for investigating the effect of search algorithms. Furthermore, this agrees well with the intuition behind introducing the sequence selection scoring function and training it with the pair-wise ranking loss; that is, the model can focus better on semantics rather than on syntax while selecting the best candidate. For all search experiments, we use the finetuned model.
We test four search strategies; greedy and beam search algorithms from Sec. 2.2, iterative beam search (iter-beam) from Sec. 3.1, and iterative beam search combined with final sequence selection (iter-beam+scorer) from Sec. 3.2.
Beam search maintains five hypotheses throughout search. Beam search performs final hypothesis score adjustment using a length penalty as described by Wu et al. (2016). In iterative beam search, beam search runs 15 times with beam size 5 each, generating 15 top-hypotheses. iter-beam selects the best response among these using the conditional log-probability assigned by the model, while iter-beam+scorer uses the score from the learnt selection scoring function. We use the $n$-gram blocking from Sec. 2.2 for every variant of beam search (beam, iter-beam, iter-beam+scorer), as this improved results for all methods.
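For reference, the length penalty of Wu et al. (2016) divides a hypothesis's log-probability by $((5 + |Y|) / 6)^{\alpha}$. The sketch below illustrates that formula only; the value of $\alpha$ used in our experiments is not stated here, so it is left as a required argument, and the coverage penalty that Wu et al. also describe is omitted.

```python
def length_penalty(length, alpha):
    """lp(Y) = ((5 + |Y|) / (5 + 1)) ** alpha, following Wu et al. (2016)."""
    return ((5.0 + length) / 6.0) ** alpha

def adjusted_score(log_prob, length, alpha):
    """Length-normalized score used to rank finished hypotheses."""
    return log_prob / length_penalty(length, alpha)

# With alpha = 1.0 a 7-token hypothesis has its log-probability halved.
short = adjusted_score(-3.5, 1, alpha=1.0)  # penalty 1.0, so the score is unchanged
long = adjusted_score(-6.0, 7, alpha=1.0)   # penalty 2.0
```

Because log-probabilities are negative, dividing by a penalty larger than one raises the score of longer hypotheses, counteracting the tendency of beam search to prefer short responses.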
We use ParlAI (Miller et al., 2017), which provides seamless integration with Amazon Mechanical Turk (MTurk) for human evaluation. A human annotator is paired with a model with a specific search strategy, and both are randomly assigned personas out of a set of 1155, and are asked to make a conversation of at least either five or six turns (randomly decided). We allow each annotator to participate in at most six conversations per search strategy and collect approximately 50 conversations per search strategy. (Some conversations were dropped due to technical errors during human evaluation, resulting in a total of 52, 52, 50 and 48 conversations for greedy, beam, iter-beam and iter-beam+scorer, respectively.) Each conversation is given three scores by the annotator, as described in Sec. 4.1.
In order to remove annotator bias, or inter-annotator variability, we use Bayesian calibration from Sec. 4.2. We take 50 warm-up steps and collect 150 samples using the NUTS sampler for inferring the posterior mean and variance of the overall score in Eq. (10), while we use 30 warm-up steps and 50 samples for inferring the mean and variance of the average number of positively (or negatively) labelled utterance-pairs in Eq. (11).
In addition to human evaluation, we also compute a variety of automatic metrics to quantitatively characterize each search algorithm and its impact. First, we report the log-probability of a generated response assigned by the model, which is a direct indicator of the quality of a search algorithm. Second, we compute the average number of unique $n$-grams generated per conversation normalized by the number of generated tokens in the conversation, called distinct-$n$ from Li et al. (2015), with $n = 2$ and $n = 3$. This metric measures the diversity of generated responses, which is considered to correlate well with how engaging a neural dialogue system is (or conversely, anticorrelated with its rate of producing boring, meaningless responses like “I don’t know”).
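The diversity metric can be sketched in a few lines. The helper below is an illustrative implementation for a single conversation; whitespace tokenization and counting $n$-grams only within each response (not across response boundaries) are assumptions of this sketch.

```python
def distinct_n(responses, n):
    """Unique n-grams in a conversation divided by the number of generated tokens."""
    ngrams = set()
    total_tokens = 0
    for response in responses:
        tokens = response.split()
        total_tokens += len(tokens)
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
    return len(ngrams) / total_tokens if total_tokens else 0.0

conversation = ["i do not know", "i do not care"]
distinct_2 = distinct_n(conversation, 2)  # 4 unique bigrams / 8 tokens
distinct_3 = distinct_n(conversation, 3)  # 3 unique trigrams / 8 tokens
```

Averaging this value over all conversations for a given search strategy gives one number per strategy; repetitive responses such as the two above pull it down.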
In Fig. 1, we plot the scores provided by the human annotators for one search strategy (greedy), where each row corresponds to one annotator. Consider the three annotators at the bottom of the plot. Their spreads are similar, spanning three points, but their means are clearly separated from each other, which points to the existence of annotator bias. This observation supports the necessity of the Bayesian calibration described in Sec. 4.2; we thus analyze results with calibrated scores (while reporting both calibrated and uncalibrated versions).
Another property of annotators’ scores in Fig. 1 is that each annotator has their own distinct “spread” as well. This spread is not modelled in the current version of the Bayesian calibration, and we leave incorporating it for the future.
Table 2: Scores from human evaluation. Columns: Search | Overall Score (1-5) | % Good Pairs | % Bad Pairs
In Table 2, we present the scores from human evaluation. A major observation we make is that greedy search, which has been the search algorithm of choice in neural dialogue modelling, significantly lags behind the variants of beam search in all metrics. This stark difference is worth our attention, as this difference is solely due to the choice of a search algorithm and is not the result of different network architectures nor learning algorithms. In fact, this cannot even be attributed to different parameter initialization, as we use only one trained model for all of these results.
The three beam-search variants are, however, more difficult to distinguish from each other in terms of human evaluation. We conjecture that this indistinguishability may be due to the coarse-grained nature of human evaluation based on a scalar score (or three scalar scores). These variants are qualitatively different from each other in other aspects, as we see below.
Better search algorithms find responses with higher log-probability according to the model, as shown in Table 3. This is a natural consequence from exploring a larger subset of the search space.
The higher log-probability does not correspond with increases in sequence selection score according to the learnt scoring function, demonstrating the fundamental disconnect between maximum likelihood and sequence ranking as discussed in Sec. 3.2. By using the scoring function to select the best response among a diverse set of hypotheses, we get responses with a significantly higher selection score (iter-beam+scorer).
A notable observation from Table 3 is that the neural sequence model assigns very low log-probabilities and scores to human responses (collected from the validation set).
This suggests that there is more room to improve the models and learning algorithms to place a high probability on human responses.
Although the beam-search variants (beam, iter-beam and iter-beam+scorer) were not significantly different in their human ratings, the diversity of generated responses in Table 4 clearly separates them. The proposed iterative beam search combined with selection via the learnt scoring function generates significantly more unique bi- and trigrams than all the other search strategies, indicating this model will be more engaging for longer-term interactions than the competing approaches.
We still observe a significant gap between the best search strategy and humans in these metrics, similar to what we observed with log-probabilities and scores above. This leaves open room for improving network architectures, learning algorithms and/or search strategies even further.
In this paper, we have empirically validated the importance of search algorithms in neural dialogue modelling by evaluating four search strategies on one trained model. Extensive evaluation revealed that greedy search, which has been the search algorithm of choice in neural dialogue modelling, significantly lags behind more sophisticated search strategies, such as beam search and its iterative variant. Using human evaluation and measuring the diversity of generated responses, we found the novel strategy of iterative beam search followed by final selection using a scoring function trained with a ranking loss to be the best among the four strategies we compared in this paper. This strategy was deemed an equally good conversationalist as the other beam-search variants by human annotators, while maintaining a higher level of diversity.
Our observation clearly emphasizes the importance of a good search strategy in neural dialogue modelling, which has thus far been given less attention. With this finding, we encourage authors of future papers on neural dialogue modelling to clearly state which search algorithm has been used, why such choice has been made and the details of its implementation and hyperparameters, in order for readers and the research community to correctly assess the impact of any newly proposed neural dialogue model. Lastly, we believe our observations here raise the question of how many new network architectures and learning algorithms have been proposed, abandoned, or compared favourably or unfairly to existing approaches due to the lack of extensive investigation on search strategies.
We thank Théo Matussière for his work on the human evaluation code and procedure, in particular for proposing the good/bad utterance-pair evaluation technique. KC was partly supported by Samsung Electronics (Improving Deep Learning using Latent Structure) and thanks support by AdeptMind, eBay, TenCent, NVIDIA and CIFAR.
Right after the end of the dialogue, the system asks the worker the following question:
Now the conversation is completed! Please evaluate the conversation by clicking a button with score from [1, 2, 3, 4, 5] below, this score should reflect how you liked this conversation (1 means you did not like it at all, and 5 means it was an engaging conversation).
After the first question, the system asks the following questions:
Now please select every interaction pair which you consider as a good, natural pair of messages. Do not compare them between each other, try to use your life experience now.
Now please select every interaction pair which you consider as a bad, some examples of bad partner response are: not answering your question, answering different question, random content, contradicts previous statements etc.
| attention type | global general |
| shared weights | encoder/decoder embeddings |
| margin for ranking loss | 1.0 |
| fine-tuning rank weight | 1.0 |
| fine-tuning generation weight | 0.1 |
| starting learning rate | 0.001 |
| gradient clip threshold | 0.1 |
| embedding pretraining | glove 840B |
| validation every… | 0.5 epochs |
| max valid patience | 12 epochs |