Partially observable Markov decision processes Roy et al. (2000) are often used in spoken dialog systems to optimize dialog management by explicitly estimating uncertainties in policy assignments.
However, these approaches are either computationally intensive Gal and Ghahramani (2015) or require significant work on refining policy representations Gašić and Young (2014). Moreover, most current uncertainty studies in dialog focus on the dialog management component. End-to-end (E2E) dialog retrieval models jointly encode a dialog and a candidate response (Wu et al., 2016; Zhou et al., 2018), assuming that there is always a ground truth present in the candidate set. This is clearly not the case in production. Larson et al. (2019)
recently showed that classifiers that perform well on in-scope intent classification for task-oriented dialog systems struggle to identify out-of-scope queries. The response selection task in the most recent Dialog System Technology Challenge Lasecki (2019) also explicitly mentions that “none of the proposed utterances is a good candidate” should be a valid option.
The goal of this paper is to set a new direction for future task-oriented dialog system research: while retrieving the best candidate is crucial, it should be equally important to identify when the “correct response” (a.k.a. the ground truth) is not present in the candidate set. In this paper, we measure the E2E retrieval model’s capability to capture uncertainty by inserting an additional “none of the above” (NOTA) candidate into the proposed response set at inference time. This is motivated by the design of student performance assessment.
Having NOTA be a candidate is a common practice of test creators when designing multiple-choice questions, both as correct answers and as distractors. How the use of NOTA affects the difficulty and discrimination of a question is widely discussed in the psychology and education fields Gross (1994); Pachai et al. (2015). For assessment purposes, a common finding is that using NOTA as the correct response increases question difficulty, and equally lures high- and low-performing students toward distractors Pachai et al. (2015).
The contributions of this paper include: (1) discovering that it is crucial to learn the relationship amongst the candidates as a set, instead of looking at point-wise matching, to solve the NOTA detection task. As a result, our proposed LogReg approach consistently achieves the best performance compared to a number of strong baselines. (2) extensive experiments showing that the raw output score (logits) is a more informative feature than the normalized probability for detecting whether the ground truth is present.
2.1 Ubuntu Dataset
All of the experiments herein use the Ubuntu Dialog Corpus Lowe et al. (2015), which contains multi-turn, goal-oriented chat logs from the Ubuntu forum. For next utterance retrieval purposes, we use the training data version that was preprocessed by Mehri and Eskenazi (2019), where all negative training samples (500,127) were removed, and, for each context, 9 distractor responses were randomly chosen from the dataset to form the candidate response set, together with the ground truth response. For the uncertainty task, we use a special token _NOTA to represent the “none of the above” choice, as in multiple choice questions. More details on this NOTA setup can be found in Sections 3.1 and 3.2. The modified training dataset has 499,873 dialog contexts, each with 10 candidate responses. The validation and test sets remain unchanged, with 19,561 validation samples and 18,921 test samples.
2.2 Dual LSTM Encoder
The LSTM dual encoder model consists of two single-layer, uni-directional encoders, one to encode the embedding c of the context and one to encode the embedding r of the response. The output score is computed as the dot product of the two encodings, f(c, r) = c · r. This model architecture has already been shown to perform well on the Ubuntu dataset Lowe et al. (2015); Kadlec et al. (2015). We carry out experiments with the following variants of the vanilla model for training:
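As a minimal sketch of the scoring step (plain Python, with stand-in encoding vectors; in the actual model the encodings are produced by the two LSTM encoders):

```python
def dot(u, v):
    """Dot-product match score f(c, r) = c . r between two encodings."""
    return sum(a * b for a, b in zip(u, v))

def score_candidates(context_enc, response_encs):
    """Score every candidate response encoding against the context encoding.

    In the paper's model the encodings come from two single-layer,
    uni-directional LSTM encoders; here they are illustrative vectors.
    """
    return [dot(context_enc, r) for r in response_encs]

c = [0.5, -1.0, 2.0]                       # stand-in context encoding
candidates = [[1.0, 0.0, 1.0],             # stand-in response encodings
              [0.0, 1.0, 0.0]]
scores = score_candidates(c, candidates)   # -> [2.5, -1.0]
```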
This is the most common training method for next utterance ranking on the Ubuntu corpus. With training data prepared in the format of [CONTEXT] [RESPONSE] [LABEL], the model performs binary classification on each sample, predicting whether a given response is the ground truth. The binary cross entropy between the label and the output score passed through a sigmoid layer is used as the loss function.
As the validation and test datasets are both in the format of [CONTEXT] [RESPONSE]*x, where x is usually 10, we train the selection model in the same way. For this model, the loss is calculated as the negative log likelihood of the ground truth response after a softmax layer over all candidate scores.
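The two training objectives can be sketched as follows (plain Python, illustrative rather than the training code itself):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_loss(score, label):
    """Binary model: cross entropy on a single (context, response, label)
    triple, with the match score passed through a sigmoid."""
    p = sigmoid(score)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def selection_loss(scores, gt_index):
    """Selection model: negative log likelihood of the ground truth
    response after a softmax over all candidate scores."""
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    log_softmax_gt = (scores[gt_index] - m) - math.log(sum(exps))
    return -log_softmax_gt
```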
Gal and Ghahramani (2015) found that dropout layers can be used in neural networks as a Bayesian approximation to the Gaussian process, and thus can represent model uncertainty in deep learning. Inspired by this work, we add a dropout layer after each encoder’s hidden layer at training time. At inference time, we keep the dropout layer activated, pass each sample through the model N times, and make the final prediction by taking a majority vote among the N predictions. Unlike the other models, the NOTA binary classification decision is not based on the output score itself, but rather on the score variance of each response.
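A sketch of this Monte Carlo dropout inference step (the stochastic `score_fn` stands in for a forward pass with dropout active; the variance threshold is a hypothetical parameter):

```python
import statistics
from collections import Counter

def mc_dropout_predict(score_fn, n_passes, var_threshold):
    """Run the stochastic scoring function n_passes times (dropout active),
    majority-vote the argmax, and flag NOTA when the variance of the winning
    candidate's scores exceeds var_threshold."""
    all_scores = [score_fn() for _ in range(n_passes)]
    votes = Counter(max(range(len(s)), key=s.__getitem__) for s in all_scores)
    winner, _ = votes.most_common(1)[0]
    variance = statistics.pvariance([s[winner] for s in all_scores])
    return ("_NOTA" if variance > var_threshold else winner), variance
```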
3.1 Direct Prediction
For the direct prediction experiment, we randomly choose a portion of the response sets and replace the ground truth responses with _NOTA (we label this subset isNOTA). For the remaining samples, we replace the first distractor with _NOTA (we label this subset notNOTA). This setup ensures that a _NOTA candidate is always present in the candidate set. Although making decisions based on logits (DirectLogits) or probability (DirectProb) yields the same argmax prediction, we collect both output scores as features for the subsequent LogReg model (details in Section 3.3). Concretely, the final output of a direct prediction model is the argmax over the candidate scores, and the prediction is NOTA whenever the _NOTA candidate receives the highest score.
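As a sketch of this decision rule (plain Python; logits and softmax probabilities are both returned, since both are later used as LogReg features):

```python
import math

def direct_predict(scores, nota_index):
    """Direct prediction: argmax over candidate scores, one of which belongs
    to the _NOTA token. Logits and softmax probabilities share the same
    argmax, but both score vectors are kept as features for LogReg."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    probs = [e / sum(exps) for e in exps]
    pred = max(range(len(scores)), key=scores.__getitem__)
    is_nota = (pred == nota_index)
    return pred, is_nota, scores, probs
```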
Another common approach to returning NOTA is to reject a candidate utterance based on a confidence score threshold. Therefore, in the threshold experiments, with the same preprocessed data as in Section 3.1, we remove all _NOTA tokens at the inference model’s batch preparation stage, leaving 9 candidates per context; the isNOTA portion of the response sets is thus left with no ground truth present. After the model outputs a score for each candidate response, it uses a predefined threshold to decide whether to accept the highest-scoring candidate as the final response, or to reject the prediction and return _NOTA instead. We investigate setting the threshold on probability (ThresholdProb) and on logits (ThresholdLogits) respectively. Concretely, the final output is the argmax over the candidate scores if the best score clears the threshold, and _NOTA otherwise.
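The threshold rule reduces to a few lines (a sketch; the scores may be raw logits for ThresholdLogits or softmax probabilities for ThresholdProb):

```python
def threshold_predict(scores, threshold):
    """Accept the best-scoring candidate only if its score clears the
    predefined threshold; otherwise reject and return _NOTA."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else "_NOTA"
```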
3.3 Logistic Regression
We feed the output scores of the LSTM models for all candidate answers as input features to a Logistic Regression (LogReg) model consisting of a single linear layer and a logistic output layer. Separate LogReg models are trained for different numbers of candidates. The probability output indicates whether the previous model’s prediction is the ground truth, or just the best-scoring distractor. Since LogReg sees the output scores of all candidate responses, it is trained to model the relationship amongst the candidates, making it categorically different from the point-wise estimation described in Sections 3.1 and 3.2. Note that at inference time, LogReg works essentially as a threshold method: the final output is the underlying model’s prediction when the LogReg probability is high enough, and _NOTA otherwise.
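A sketch of LogReg inference over the full score vector (the weights and bias here are illustrative, not learned values):

```python
import math

def logreg_is_ground_truth(scores, weights, bias):
    """Single linear layer plus logistic output over the full vector of
    candidate scores: probability that the underlying model's top pick
    is the true response. weights/bias are hypothetical, not learned."""
    z = sum(w * s for w, s in zip(weights, scores)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def predict_with_logreg(scores, weights, bias, p_threshold=0.5):
    """Accept the argmax candidate only if LogReg believes the ground
    truth is present; otherwise return _NOTA."""
    best = max(range(len(scores)), key=scores.__getitem__)
    p = logreg_is_ground_truth(scores, weights, bias)
    return best if p >= p_threshold else "_NOTA"
```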
3.4 Metric Design
Dialog retrieval tasks often use recall out of k as a key metric, measuring how often, out of x candidates, the answer appears in the top k. In this paper, we focus on top-1 accuracy (R@1 for short) over candidate sets of size x. The recall metric is modified for uncertainty measurement purposes, and is further extended to calculate the NOTA accuracy and F1 scores for each class. Let the notNOTA and isNOTA subsets be the two parts of the data whose samples are notNOTA and isNOTA respectively: R@1 is computed on the notNOTA subset, and NOTA accuracy on the isNOTA subset. The positive class for the NOTA F1 is the isNOTA class, and the positive class for the response F1 is the notNOTA class.
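A sketch of how these metrics can be computed (assuming predictions and gold labels are either a candidate index or the string "_NOTA"; shown here with isNOTA as the positive F1 class):

```python
def f1(tp, fp, fn):
    """Standard F1 from true positives, false positives, false negatives."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def evaluate(preds, golds):
    """R@1 on the notNOTA subset, NOTA accuracy on the isNOTA subset,
    and F1 with isNOTA as the positive class."""
    not_nota = [(p, g) for p, g in zip(preds, golds) if g != "_NOTA"]
    is_nota = [(p, g) for p, g in zip(preds, golds) if g == "_NOTA"]
    r_at_1 = sum(p == g for p, g in not_nota) / len(not_nota)
    nota_acc = sum(p == "_NOTA" for p, _ in is_nota) / len(is_nota)
    tp = sum(p == "_NOTA" for p, _ in is_nota)   # NOTA correctly predicted
    fp = sum(p == "_NOTA" for p, _ in not_nota)  # NOTA wrongly predicted
    fn = len(is_nota) - tp
    return r_at_1, nota_acc, f1(tp, fp, fn)
```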
3.5 More Candidates
In real-world problems, retrieval response sets usually have many more than 10 candidates. Therefore, the selection and binary models are further tested on a larger reconstructed test set. For each context, we randomly select 90 more distractors from other samples’ candidate responses, producing a candidate response set of size 100 for each context.
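This reconstruction step can be sketched as follows (a simplified version: field names and the sampling helper are illustrative, and the distractor pool is drawn from all samples' candidates):

```python
import random

def extend_candidates(samples, target_size, seed=0):
    """Pad each context's candidate set with distractors sampled from other
    samples' candidate responses until it reaches target_size."""
    rng = random.Random(seed)
    pool = [r for s in samples for r in s["candidates"]]
    out = []
    for s in samples:
        extra = []
        while len(s["candidates"]) + len(extra) < target_size:
            cand = rng.choice(pool)
            if cand not in s["candidates"] and cand not in extra:
                extra.append(cand)  # keep only unseen distractors
        out.append({**s, "candidates": s["candidates"] + extra})
    return out
```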
4 Results and Analysis
Table 1 summarizes the experimental results. Due to space limitations, the table only displays results on 10 candidates. Complete results for other numbers of candidates, which show similar performance patterns, can be found in the appendix. The thresholds and hyperparameters are tuned on the validation set according to the highest average F1 score. For the selection model, in addition to the original dataset, we also train the model on a modified training dataset containing _NOTA choices, as in the inference datasets, with the same set of hyperparameters. As expected, since there are now fewer real distractor responses, training with _NOTA improves the model’s NOTA classification performance but sacrifices recall scores, which is not desirable. In all the models, regardless of the training dataset used and the model architecture, adding a logistic regression on top of the LSTM output significantly improves the average F1 scores. Specifically, the highest F1 scores are always achieved with logits as LogReg input features. These results show that, though setting a threshold is a common heuristic to balance true and false acceptance rates Larson et al. (2019), its NOTA prediction performance is not comparable to the LogReg approach, even after an exhaustive grid search for the best thresholds. This finding is also matched by receiver operating characteristic (ROC) curves on the validation set (Figure 1), which show that the area under the curve (AUC) is boosted from 0.71 to 0.91 with the additional LogReg model.
With the selection model trained on the original dataset, Figure 2 shows the model’s distribution of max scores on the validation set. We see clear differences between isNOTA’s and notNOTA’s best-score distributions. This is an encouraging observation, because it suggests that current retrieval models can already distinguish good from bad responses to some extent. Note that since the _NOTA token is not included in training, for direct prediction tasks the _NOTA token is encoded as an UNK token at inference time. The tails of the isNOTA plot in both the DirectLogits and DirectProb graphs suggest that the model will, very rarely, pick the unknown token as the best response.
Figure 3 shows how the average F1 scores of the original selection model trend on the test set: with more distractors, the LSTM model struggles to determine the presence of the ground truth, while the LogReg model performs consistently well. The complete results on this extended test set can be found in the Appendix.
We have started preliminary sequence-classification experiments with Transformer-based models by concatenating context and response pairs. However, due to the intensive technical jargon in the Ubuntu corpus, BERT and RoBERTa performed poorly on this task, even with further fine-tuning of the language model. We will analyze these failures and improve this type of approach in future work.
We created a new NOTA task on the Ubuntu Dialog Corpus, and we have proposed to solve this problem by learning the response set representation with a binary classification model. By releasing our processed dataset upon publication, we hope that it will be used to benchmark future dialog system uncertainty research.
- Dropout as a Bayesian approximation: representing model uncertainty in deep learning.
- Gaussian processes for POMDP-based dialogue manager optimization. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (1), pp. 28–40.
- Logical versus empirical guidelines for writing test items: the case of “none of the above”. Evaluation & the Health Professions 17 (1), pp. 123–126.
- Improved deep learning baselines for Ubuntu corpus dialogs.
- Adam: a method for stochastic optimization. International Conference on Learning Representations.
- An evaluation dataset for intent classification and out-of-scope prediction.
- DSTC7 task 1: noetic end-to-end response selection. In 7th Edition of the Dialog System Technology Challenges at AAAI 2019.
- The Ubuntu Dialogue Corpus: a large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Prague, Czech Republic, pp. 285–294.
- Multi-granularity representations of dialog.
- A systematic assessment of ‘none of the above’ on multiple choice tests in a first year psychology classroom. Canadian Journal for the Scholarship of Teaching and Learning 6, pp. 1–17.
- Spoken dialogue management using probabilistic reasoning. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, ACL ’00, Stroudsburg, PA, USA, pp. 93–100.
- Uncertainty estimates for efficient neural network-based dialogue policy optimisation.
- Sequential matching network: a new architecture for multi-turn response selection in retrieval-based chatbots. arXiv preprint arXiv:1612.01627.
- Multi-turn response selection for chatbots with deep attention matching network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1118–1127.
Appendix A Appendices
a.1 Experimental Setup
For the LSTM models, unless otherwise specified, the word embeddings are initialized randomly with a dimension of 300, and the hidden size is 512. The vocabulary consists of the 10,000 most common words in the training dataset, plus the _UNK and _PAD special tokens. We use the Adam algorithm Kingma and Ba (2014) for optimization, with a learning rate of 0.005. The gradients are clipped to 5.0. With a batch size of 128, we train the model for 20 epochs and select the best checkpoint based on its performance on the validation set. In the dropout model, a dropout layer with a fixed probability is applied.
For the logistic regression model, we train on the validation set’s LSTM outputs with the same hyperparameter setup (where applicable) as the corresponding LSTM model.
a.2 More Plots
a.3 Complete Results
Table 2 shows the original selection model’s performance on different sizes of candidate response sets. The direct prediction model is run as-is, since it needs no further tuning; the threshold approach, especially with softmax probability as the threshold, needs separate rounds of threshold tuning. Table 3 shows the complete results for all models on the test set, both for 2 candidates and for 10 candidates. Here, the average F1 is the mean of all 4 F1 scores. For each model architecture, the best-performing setting for each metric is in bold.