Automating dialogue evaluation is an important research topic for the development of open-domain dialogue systems. Since existing unsupervised evaluation metrics, like BLEU Papineni et al. (2002) and ROUGE Lin (2004), are not suitable for evaluating open-domain dialogue systems Liu et al. (2016), many researchers have started to investigate automatic dialogue evaluation by training-and-deploying: first, learn a neural network with annotations from existing dialogue systems, then use that trained evaluator to evaluate other candidate dialogue systems Kannan and Vinyals (2017); Lowe et al. (2017); Tao et al. (2017). As most existing neural dialogue evaluators are trained only once, we call this kind of method a stationary dialogue evaluator.
Though positive automatic evaluation results have been achieved with stationary evaluators, we argue that the parameters of an evaluator learned from previous systems may not fit future dialogue systems. Figure 1 illustrates the performance gap when using the same stationary dialogue evaluator on two different groups of dialogue systems: 1) in-scope systems, which are included in the training corpus, and 2) out-of-scope systems, which are excluded from evaluator learning. As demonstrated, the automatic evaluation results of stationary neural dialogue evaluators are much more consistent with human annotation for systems involved in training than for those outside the training scope. Even if we increase the number of dialogue systems used for training the evaluator, the performance gap between in-scope and out-of-scope systems cannot be reduced. Thus, responses from out-of-scope systems are more likely to be misestimated, because the evaluator itself is not generalized enough to produce correct predictions. This weak generalization seriously limits the usability of neural dialogue evaluators, as generalization is the key requirement for an evaluation metric.
In this paper, we fight against the generalization problem from the perspective of continual learning. Let $f_{\theta}$ denote an existing neural evaluator built for the previous dialogue systems $S_{1:t-1}$. Given the next dialogue system to be evaluated, namely $S_t$, we extend the capacity of the evaluator by selectively fine-tuning its parameters on $S_t$, in order to jointly fit all dialogue systems in $S_{1:t}$. Unlike most fine-tuning methods, which care only about the target side, we force the evaluator to learn from the next dialogue system without forgetting the knowledge learned from the previous systems. In this way, the training scope of the evaluator is continually extended to cover the dialogue systems in both $S_{1:t-1}$ and $S_t$, which intuitively alleviates the weak generalization problem discussed above.
Particularly, we take the recently proposed automatic dialogue evaluation model (ADEM) Lowe et al. (2017) as a base model, and explore fine-tuning it with two representative continual learning methods: Elastic Weight Consolidation (EWC) Kirkpatrick et al. (2017) and Variational Continual Learning (VCL) Nguyen et al. (2017). Both algorithms selectively update the model parameters: they preserve the parts that are strongly related to the previous dialogue systems and rewrite the other parts to fit the next dialogue system. Thus, the knowledge of past and future dialogue systems can coexist in the same evaluator, which enables the evaluator to treat all candidates equally.
The other solution to the weak generalization issue is reconstruction, either by simply re-building the evaluator using all annotations from $S_{1:t}$ or by learning an additional evaluator for $S_t$. However, this kind of method does not scale over time, as we must maintain an ever-growing set of training annotations, or an ever-growing set of evaluators, whenever a new dialogue system arrives, which can be too costly for long-term automatic evaluation. With continual learning, by contrast, we only need to maintain one unified evaluator. Besides, continual updating encourages knowledge transfer, which reduces the amount of annotation required for the next dialogue system. What is more, a continual learning based evaluator is more suitable for sharing, as people only need to release their evaluator instead of the raw training data, which helps protect data privacy.
Experimentally, we build two sequences of dialogue systems, each consisting of five different dialogue systems. We use different learning algorithms (including reconstruction, fine-tuning and continual learning) to sequentially update the base evaluator and evaluate the dialogue systems one after another. Two major metrics are used to measure the performance of automatic evaluation: 1) accuracy in evaluating the next dialogue system and 2) consistency with previous predictions, representing plasticity and stability respectively. The comparison results show that the continual learning based evaluator achieves performance comparable to the other methods on evaluating the next dialogue system, while being more stable and requiring significantly fewer annotations.
Our major contributions are:
We reveal the weakness of generalization in the previous neural dialogue evaluators.
We propose solving the issue of generalization by incorporating two model-agnostic continual learning methods. Experimental results show that using continual learning can significantly alleviate the weakness of generalization.
2 Problem Formalization and Methods
2.1 Problem Formalization
Let $S_{1:n} = \{S_1, \dots, S_n\}$ stand for a sequence of dialogue systems; a dialogue evaluator is asked to sequentially score those dialogue systems one after another. At each step $t$ of the evaluation, a training set $D_t$ of post-reply-reference-label quads $\langle c, \hat{r}, r, l \rangle$ collected from dialogue system $S_t$ becomes available to update the evaluator for better automatic evaluation. The post $c$ is a single-turn conversational context, $\hat{r}$ is the response generated by the corresponding dialogue system, $r$ is the reference response written by a human, and the label $l$, which is manually annotated, has three grades (2: good, 1: fair, 0: bad). After updating, the evaluator is asked to evaluate both the current dialogue system $S_t$ and the previous ones $S_{1:t-1}$. The whole process of sequential evaluation can be formulated as:
$$\theta_t = \mathrm{Learn}(\theta_{t-1}, D_t), \qquad s = f_{\theta_t}(c, \hat{r}, r),$$

where $\theta_t$ is the parameter of the evaluator after the $t$-th update and $s$ is the quality score generated by the evaluator for a post-reply-reference triple of a dialogue system $S_i$ ($i \le t$). The higher the score, the better the response. All comparison learning methods share the same evaluator architecture $f$ but have their own parameters. In the following sub-sections, we give a detailed description of the architecture of our evaluator as well as the different learning algorithms $\mathrm{Learn}$.
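The sequential protocol above can be sketched as a simple loop. In the sketch below, `learn` and `score_fn` are hypothetical placeholders for a concrete learning algorithm $\mathrm{Learn}$ and the shared evaluator; they are illustrative assumptions, not part of the original formulation:

```python
# Sketch of the sequential evaluation protocol: update on D_t, then
# score the current system and re-score all previous ones with the
# updated parameters.

def sequential_evaluation(systems, learn, score_fn, theta0):
    """systems: list of datasets D_t, each a list of (post, reply, ref, label)."""
    theta = theta0
    history = []
    for t, data_t in enumerate(systems):
        theta = learn(theta, data_t)      # update evaluator with D_t
        # evaluate the current and all previous systems
        scores = [
            [score_fn(theta, c, r_hat, r) for (c, r_hat, r, _) in data_i]
            for data_i in systems[: t + 1]
        ]
        history.append(scores)
    return history
```

Plugging in a trivial `learn` that counts updates shows the shape of the output: at step $t$, the evaluator re-scores $t$ datasets with a single shared parameter state.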
2.2.1 Automatic Dialogue Evaluator
Figure 2 shows the structure of our shared neural dialogue evaluator, ADEM. The idea of ADEM is to automatically measure the quality of the candidate response $\hat{r}$ by jointly considering the dialogue context $c$ and the reference response $r$. ADEM first leverages a shared LSTM cell to read the context $c$, the machine response $\hat{r}$ and the reference response $r$, and then uses the last hidden states as the vector representations of $c$, $\hat{r}$ and $r$, written as $\mathbf{c}$, $\hat{\mathbf{r}}$ and $\mathbf{r}$ respectively.
Given those vector representations, the quality of the machine response is calculated as:

$$\mathrm{score}(c, \hat{r}, r) = \mathbf{c}^{\top} M \hat{\mathbf{r}} + \mathbf{r}^{\top} N \hat{\mathbf{r}} + b,$$
where $M$ and $N$ are two learnable matrices that measure the semantic similarity of the machine response $\hat{r}$ to the dialogue context $c$ and to the reference response $r$ respectively, $b$ is the learnable bias, and $\theta$ denotes all parameters of this evaluator.
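As a concrete illustration, the scoring function can be sketched in a few lines of numpy. The encoder outputs, the shapes of $M$ and $N$, and the identity initialization are assumptions for this toy example only:

```python
import numpy as np

# Sketch of the ADEM-style scorer, assuming the LSTM encoder has already
# produced vectors c (context), r (reference) and r_hat (machine response);
# M, N and b are the learnable parameters.
def adem_score(c, r, r_hat, M, N, b):
    # c^T M r_hat: context--response similarity
    # r^T N r_hat: reference--response similarity
    return float(c @ M @ r_hat + r @ N @ r_hat + b)

rng = np.random.default_rng(0)
d = 4                                  # toy hidden size
c, r, r_hat = rng.normal(size=(3, d))
M, N = np.eye(d), np.eye(d)            # identity init: plain dot products
s = adem_score(c, r, r_hat, M, N, b=0.0)
```

With identity matrices the score reduces to the sum of two dot products; training reshapes $M$ and $N$ so that the two similarity terms match human judgments.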
2.2.2 Learning Methods
Figure 3 shows the framework of all learning methods investigated in this paper, including both the comparison baselines and our continual learning based methods. Our baselines include four straightforward solutions:
stationary learning only learns the evaluator once with the first dialogue system.
individual learning trains multiple evaluators, each tuned on and serving one specific dialogue system.
retraining re-learns the evaluator over and over again, gradually increasing its training scope each time.
fine-tuning also maintains a single evaluator over time; however, at each update it only leverages the annotation from the current dialogue system $S_t$ and reuses the weights learned for $S_{t-1}$. As fine-tuning only cares about the current dialogue system $S_t$, its training scope shifts across different dialogue systems rather than increasing.
Unlike the baseline methods, continual learning algorithms can gradually increase the training scope of a neural network without retraining from scratch. Particularly, at each update, a continual learning algorithm only requires the annotation of the current dialogue system $S_t$, similar to fine-tuning. However, instead of overwriting all parameters as fine-tuning does, the parameters of the neural dialogue evaluator are selectively modified: weights that are strongly related to the previous dialogue systems are consolidated, while the others are overwritten to fit the new dialogue system. Hence, the training scope of continual learning gradually increases over multiple updates. We apply two continual learning methods to automatic dialogue evaluation: Elastic Weight Consolidation (EWC) Kirkpatrick et al. (2017) and Variational Continual Learning (VCL) Nguyen et al. (2017).
Given a dialogue evaluator and its parameters $\theta$, EWC assumes that the weights are not equally important: some parameters contribute strongly to the prediction while others do not. This difference in parameter importance enables the model to learn new knowledge without forgetting the old, as we can force the parameters that contribute less to the previous systems to store more knowledge about the new dialogue system. To measure the contribution of parameters, EWC exploits the Fisher Information Matrix Frieden (2004), which is calculated as:
$$F_i = \mathbb{E}\left[\left(\frac{\partial \mathcal{L}(\theta)}{\partial \theta_i}\right)^{2}\right],$$

where $\mathcal{L}$ denotes a certain loss function and $\theta_i$ is the $i$-th weight in the parameter vector $\theta$. The Fisher information $F_i$ measures the importance of $\theta_i$ to the objective $\mathcal{L}$.
Given the Fisher Information Matrix $F$, EWC updates the evaluator with the following objective at each time $t$:
$$\mathcal{L}_t(\theta) = \mathcal{L}_{\mathrm{reg}}(\theta) + \frac{\lambda}{2} \sum_i F_i \left(\theta_i - \theta^{*}_{t-1,i}\right)^{2},$$

where $\theta^{*}_{t-1}$ is the learned value of $\theta$ at time $t-1$, $\mathcal{L}_{\mathrm{reg}}$ is the regression objective for fitting the data of dialogue system $S_t$, and the second term is the objective that memorizes the important weights for the previous systems. $F_i$ is the importance score of the $i$-th weight of $\theta$ for the previous dialogue systems, and $(\theta_i - \theta^{*}_{t-1,i})^2$ measures how far that weight moves from its previous value. The regression objective compares the evaluator's prediction $\mathrm{score}(c, \hat{r}, r)$ for each post-response-reference triple of dialogue system $S_t$ against its label $l$; the definition of $\mathrm{score}$ is in Equation 3. $\lambda$ is a hyper-parameter trading off the two losses. With this joint loss, the EWC algorithm forces the neural dialogue evaluator to learn without forgetting, gradually increasing its training scope step by step.
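A minimal numpy sketch of the EWC penalty, under simplifying assumptions that are ours rather than the paper's: parameters are a flat vector, the diagonal Fisher information is estimated from squared gradients of the regression loss on the previous system, and `grad_reg` is a hypothetical function returning the gradient of $\mathcal{L}_{\mathrm{reg}}$:

```python
import numpy as np

def estimate_fisher(sample_grads):
    """Diagonal Fisher estimate: mean squared gradient per weight."""
    return np.mean(np.square(sample_grads), axis=0)

def ewc_grad(theta, theta_prev, fisher, grad_reg, lam):
    """Gradient of L_reg(theta) + (lam/2) * sum_i F_i (theta_i - theta_prev_i)^2."""
    return grad_reg(theta) + lam * fisher * (theta - theta_prev)

# Toy usage: weight 0 received large gradients on the old system, so it is
# "important" and the penalty resists moving it toward a new optimum.
theta_old = np.zeros(3)
sample_grads = np.array([[1.0, 0.1, 0.0], [-1.0, 0.1, 0.0]])
fisher = estimate_fisher(sample_grads)          # -> [1.0, 0.01, 0.0]
step = ewc_grad(theta_old, theta_old, fisher,
                grad_reg=lambda th: th - np.ones(3), lam=10.0)
```

At the old optimum the quadratic penalty contributes nothing; as training pulls $\theta$ toward the new task, the penalty grows fastest for the high-Fisher weight, which is exactly the consolidation behavior described above.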
Different from EWC, VCL Nguyen et al. (2017) assumes that all parameters in the model are random variables with prior distribution $p(\theta)$. Given a sequence of dialogue systems $S_{1:t}$, the VCL based dialogue evaluator learns the weights for all the systems by Bayes' rule:

$$p(\theta \mid D_{1:t}) \propto p(\theta \mid D_{1:t-1})\, p(D_t \mid \theta),$$

where $p(\theta \mid D_{1:t-1})$ denotes the distribution of parameters learned from $S_{1:t-1}$, and $p(D_t \mid \theta)$ denotes the probability of generating $D_t$ when using $\theta$ to fit dialogue system $S_t$. The above formula implies that we can naturally learn the joint neural evaluator for all dialogue systems in $S_{1:t}$ by continually transferring knowledge from the previous dialogue systems, i.e., first learning $p(\theta \mid D_1)$, then $p(\theta \mid D_{1:2})$, and finally $p(\theta \mid D_{1:t})$.
However, the posterior distribution $p(\theta \mid D_{1:t})$ is intractable, hence we use a variational distribution $q_t(\theta)$ to approximate it through KL divergence minimization Sato (2001):

$$q_t(\theta) = \arg\min_{q} \mathrm{KL}\left(q(\theta) \,\Big\|\, \frac{1}{Z_t}\, q_{t-1}(\theta)\, p(D_t \mid \theta)\right),$$

for $t = 1, \dots, n$. The initial distribution at time 0 is defined as the prior distribution, i.e., $q_0(\theta) = p(\theta)$, and $Z_t$ is the normalizing constant. Following existing work Friston et al. (2007), minimizing the KL divergence (Eq. 10) is equivalent to maximizing the negative variational free energy, so the loss function can be defined as:
$$\mathcal{L}_t(q_t) = \mathbb{E}_{\theta \sim q_t}\left[\mathcal{L}_{\mathrm{reg}}(\theta)\right] + \mathrm{KL}\left(q_t(\theta) \,\|\, q_{t-1}(\theta)\right),$$

where $\mathcal{L}_{\mathrm{reg}}$ is the regression loss defined in Equation 8. Thus we can learn $q_1, q_2, \dots, q_n$ sequentially. During prediction at each time $t$, we first sample a set of parameters from $q_t(\theta)$ and then use the mean prediction to evaluate the quality of dialogue systems.
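The VCL objective can be sketched for a mean-field Gaussian posterior (one mean and standard deviation per weight); this parametric choice and the helper `reg_loss` are illustration-only assumptions, not details fixed by the paper:

```python
import numpy as np

def kl_diag_gauss(mu_q, sig_q, mu_p, sig_p):
    """KL( N(mu_q, sig_q^2) || N(mu_p, sig_p^2) ), summed over weights."""
    return float(np.sum(
        np.log(sig_p / sig_q)
        + (sig_q**2 + (mu_q - mu_p)**2) / (2.0 * sig_p**2)
        - 0.5
    ))

def vcl_loss(mu, sig, mu_prev, sig_prev, reg_loss, n_samples=20, seed=0):
    """E_{theta ~ q_t}[L_reg] (Monte Carlo) + KL(q_t || q_{t-1})."""
    rng = np.random.default_rng(seed)
    mc = np.mean([reg_loss(mu + sig * rng.normal(size=mu.shape))
                  for _ in range(n_samples)])
    return mc + kl_diag_gauss(mu, sig, mu_prev, sig_prev)
```

When $q_t$ equals $q_{t-1}$ and the regression loss is zero, the whole loss vanishes, so only new evidence moves the posterior away from the previous one.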
To test the performance of using continual learning for automatic dialogue evaluation, we build five different dialogue systems for simulating automatic evaluation: a typical retrieval-based system (retrieval-system Yan et al. (2016)), three generation-based systems (a vanilla seq2seq with attention model Bahdanau et al. (2014), a seq2seq with keyword model Yao et al. (2017), and a CVAE-based seq2seq model Yao et al. (2017)), and a user simulator, which responds by crowd-sourcing. The raw dataset consists of ⟨single-turn post, machine response, reference response, label⟩ quads. The posts and reference responses are collected from Baidu Tieba (http://tieba.baidu.com), the largest Chinese open-domain online forum. Given the post set, our five comparison systems are asked to produce responses. After the responses have been generated, five annotators are asked to rate each candidate response according to the given single-turn post and the reference response. The quality score has three grades, 0 (bad), 1 (fair) and 2 (good), and we choose the score of each response by voting. All five systems are anonymous, so annotators do not know which system a response came from during annotation. In total we collect 30k ⟨single-turn post, machine response, reference response, label⟩ quads, which we randomly split into a training set (20k), a validation set (5k) and a test set (5k) by post, i.e., different responses replying to the same post belong to only one set.
We measure the performance of automatic evaluation from two different perspectives: 1) generalization to the next dialogue system and 2) consistency with the previous predictions. An ideal evaluator with good generalization and consistency would be able to provide life-long automatic evaluation. Following most previous works Liu et al. (2016); Lowe et al. (2017); Tao et al. (2017), we utilize Spearman’s rank correlation coefficient Casella and Berger (2002) to calculate generalization and consistency.
Given the predictions of the evaluator on the instances of dialogue system $S_i$, we calculate their similarity to the human annotation as:

$$\rho = \frac{\sum_j (x_j - \bar{x})(y_j - \bar{y})}{\sqrt{\sum_j (x_j - \bar{x})^{2}}\,\sqrt{\sum_j (y_j - \bar{y})^{2}}},$$

where $x_j$ and $y_j$ are the ranks of the evaluator's prediction and the human annotation for the $j$-th instance, and $\bar{x}$ and $\bar{y}$ are the mean values of $x$ and $y$.
Spearman’s rank correlation provides a convenient way to calculate generalization and consistency; specifically, we define them as:

$$\mathrm{Gen}_t = \rho\left(f_{\theta_t}(D_t),\, y_t\right), \qquad \mathrm{Con}_t = \frac{1}{t-1} \sum_{i=1}^{t-1} \rho\left(f_{\theta_t}(D_i),\, f_{\theta_i}(D_i)\right),$$

where $\mathrm{Gen}_t$ computes the correlation between the evaluator's prediction on the next dialogue system $S_t$ and the human annotation $y_t$, standing for generalization (plasticity), while $\mathrm{Con}_t$ is the average correlation of the prediction at time $t$ with the previous predictions at times $i < t$, standing for consistency (stability).
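The two metrics can be sketched directly from the formulas above. The rank computation below assumes distinct scores (real data with tied three-grade labels would need average ranks), which is an illustration-only simplification:

```python
import numpy as np

def spearman(x, y):
    """Pearson correlation on the ranks of x and y (no tie handling)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx**2).sum() * (ry**2).sum()))

def generalization(preds_next, human_labels):
    """Plasticity: correlation with human annotation on the next system."""
    return spearman(preds_next, human_labels)

def consistency(preds_now, preds_past):
    """Stability: average correlation with each earlier prediction."""
    return float(np.mean([spearman(preds_now, p) for p in preds_past]))
```

Because Spearman's correlation depends only on ranks, an evaluator that preserves the ordering of responses stays perfectly consistent even if its absolute scores drift.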
Table 1 (excerpt): plasticity at each step ($t=1..5$) and stability ($t=2..5$) of stationary learning on the two system sequences.

| order | method | plasticity $t{=}1$ | $t{=}2$ | $t{=}3$ | $t{=}4$ | $t{=}5$ | stability $t{=}2$ | $t{=}3$ | $t{=}4$ | $t{=}5$ |
| in time | stationary learning | 0.370 | 0.247 | 0.236 | 0.176 | 0.129 | 1.000 | 1.000 | 1.000 | 1.000 |
| in random | stationary learning | 0.370 | 0.248 | 0.363 | 0.157 | 0.139 | 1.000 | 1.000 | 1.000 | 1.000 |
To make the learning methods comparable, we ensure that all hyper-parameters and data preprocessing are the same for all comparison methods. For variable-size input, we zero-pad shorter inputs and truncate inputs longer than 50 tokens. The vocabulary, about 0.48 million entries, is cut by word frequency on the training set used for pre-training. All models are trained with the Adam optimizer Kingma and Ba (2014) with learning rate 0.001 and batch size 32.
For the base model, the embedding size is 256 and the RNN cell is a one-layer LSTM cell with hidden size 512. Similar to previous work Lowe et al. (2017), we use a pre-training procedure to learn the parameters of the encoder. In this work, we train the encoder as part of a matching model. The last-layer parameters are initialized from a truncated normal distribution with a mean of 0 and a standard deviation of 0.1.
For stationary learning, we use the training/validation set of the first dialogue system to train and choose models. Once the stationary dialogue evaluator is learned, we fix it to evaluate all incoming dialogue systems. For individual learning, we train/choose each evaluator only on the training/validation set of its own dialogue system. For retraining, we use the accumulated validation set of all seen tasks to choose the best model, since the training set also accumulates whenever a new dialogue system emerges. For fine-tuning, we update/choose the model using the training/validation set of the new dialogue system, while initializing the model parameters from the previous evaluator. For continual learning, the training and model-choosing processes are the same as fine-tuning, but with an additional penalty during training. The $\lambda$ of EWC is increased tenfold as the number of tasks increases. In VCL, the prior distribution of the model variables is a Gaussian with a mean of 0 and a variance set as in previous work Nguyen et al. (2017), and the sample number is 20.
To ensure the robustness of the experiment, we simulate two sequences over the five different dialogue systems in two orders: 1) time order and 2) random order. The sequence in time order, i.e., the retrieval system → the seq2seq with attention model → the seq2seq with keywords model → the CVAE-based seq2seq model → the human simulator, imitates the reality that new dialogue systems constantly emerge as dialogue technology develops. Additionally, we randomly sort the five dialogue systems as the human simulator → the seq2seq with keywords model → the seq2seq with attention model → the CVAE-based seq2seq model → the retrieval system, giving a sequence in random order. We evaluate the two dialogue system sequences using both the baseline methods and the continual learning methods.
Table 1 shows the results of evaluating the two dialogue system sequences; as demonstrated, the two sets of experiments yield consistent results. On the plasticity (generalization) side, fine-tuning and EWC achieve the best performance while stationary learning is the worst, confirming our motivation regarding the generalization issue of existing stationary dialogue evaluators. Retraining is lower than fine-tuning, which we believe is because retraining must care about all systems while fine-tuning only concerns the current one. It is interesting to find that individual learning is not as good as fine-tuning; we think the reason is that fine-tuning can benefit from the knowledge of previous systems, as do EWC and VCL.
On the stability (consistency) side, stationary learning is the best, as it never changes. Among all adaptive methods, retraining and VCL achieve the best stability scores, followed by EWC; fine-tuning and individual learning are the worst, as they only care about their own systems.
Jointly considering generalization and consistency, the continual learning based evaluators (EWC and VCL) outperform the other baselines. EWC is more plastic than VCL, while VCL is much more stable. We believe this is because VCL is a probabilistic neural network, which naturally accounts for variation in its model and thus has stronger stability, but is also harder to train.
We analyze the time and space complexity of the whole training-and-deploying process; see Table 2. We ignore the complexity of the model architecture and the differences between learning methods. We assume that updating one evaluator on one training set takes exactly one unit of time, and that storing the labeled data of one dialogue system, or one evaluator, costs one unit of space.
For evaluating a dialogue system sequence of $n$ systems, training a stationary dialogue evaluator costs only $O(1)$ time to evaluate all systems, and there is only one evaluator and no data to store. Training one individual evaluator, or sequentially updating (fine-tuning/continual learning) the evaluator, takes $O(1)$ time whenever the next dialogue system emerges, thus $O(n)$ time in total. For retraining, whenever a new dialogue system emerges we need to retrain on all previous systems, which costs $O(n^2)$ time in total and requires storing the data of all existing systems. Regarding storage, stationary learning and sequential updating only need to retain the single latest evaluator, while individual learning needs to store all learned evaluators and retraining needs to save all training data. Apart from stationary learning, sequential updating (fine-tuning/continual learning) is the best method in terms of time and space complexity.
| methods | time complexity | space complexity |
| stationary learning | $O(1)$ | $O(1)$ |
| individual learning | $O(n)$ | $O(n)$ |
| retraining | $O(n^2)$ | $O(n)$ |
| fine-tuning / continual learning | $O(n)$ | $O(1)$ |
4.2 Stability-plasticity dilemma
Table 3 shows the plasticity on different dialogue systems when using fine-tuning and continual learning. Different from fine-tuning, continual learning (EWC and VCL) adds restrictions to the learning of the next task, which reduces interference with the previous tasks but affects the learning capacity on the new task. For example, after learning from the human-system, the fine-tuning based dialogue evaluator achieves the highest accuracy on the human-system (0.380), while its accuracy on the CVAE-system drops sharply from 0.345 to 0.190. For the EWC and VCL based dialogue evaluators, the accuracy on the CVAE-system is only slightly affected, but the plasticity on the human-system decreases. This is the stability-plasticity dilemma. The EWC based evaluator improves stability while maintaining plasticity; to some extent, we consider that the EWC based dialogue evaluator achieves a trade-off between stability and plasticity and alleviates the stability-plasticity dilemma.
4.3 Supervision requirement
We investigate the influence of decreasing the training size on plasticity. As shown in Figure 4, for all learning methods, the plasticity on the first task (retrieval-system) decreases significantly as the training size decreases. However, as subsequent tasks are learned, the plasticity of continual learning (EWC and VCL) is less affected by the reduction of training size. This is because continual learning based dialogue evaluators continually transfer learned knowledge and only need minor updates to adapt to new tasks as the number of tasks increases. Although the fine-tuning based dialogue evaluator uses the previous parameters as initial values, its evaluation effectiveness on the final task (human-system) still declines seriously as the training size decreases.
5 Related Work
5.1 Automatic Evaluation for Chatbots
Generally speaking, dialogue systems can be divided into task-oriented systems Walker et al. (1997); Möller et al. (2006) and chatbots (a.k.a. open-domain dialogue systems). In this paper, we focus on the automatic evaluation of open-domain dialogue systems. Previous works show that existing unsupervised metrics (like BLEU, ROUGE, and perplexity) are not applicable to evaluating open-domain dialogue systems Serban et al. (2016); Liu et al. (2016). Though retrieval-based dialogue systems can be automatically evaluated using precision and recall Zhou et al. (2018), those metrics cannot be extended to generation-based systems.
Therefore, very recent attempts formulate automatic dialogue evaluation as a learning problem. Inspired by GANs Goodfellow et al. (2014), Kannan and Vinyals (2017) train a discriminator to distinguish whether a response is human-generated. Lowe et al. (2017) propose an end-to-end automatic dialogue evaluation model (ADEM) that calculates human-like scores via a triple similarity jointly considering the context, the ground truth and the response. Tao et al. (2017) combine a referenced metric, the embedding similarity between the ground truth and the response, with an unreferenced metric, the matching score between the context and the response, by simple heuristics.
5.2 Continual learning
Continual learning methods for neural networks are broadly partitioned into three groups of approaches. Architectural approaches Rusu et al. (2016); Fernando et al. (2017) alter the architecture of the network whenever a new task emerges, which makes the architectural complexity grow with the number of tasks. Functional approaches Jung et al. (2016) add a regularization term to the objective to penalize changes in the input-output function of the neural network; this results in expensive computation, as it requires a forward pass through the old task's network for every new data point.
The third group, regularization approaches, adds constraints to the update of network weights. Kirkpatrick et al. (2017) measure the importance of weights by a point estimate of the Fisher information and slow down learning on weights that are important for old tasks. Nguyen et al. (2017) propose variational continual learning (VCL) in a Bayesian framework: prior knowledge is represented as a probability density function learned from previous tasks, and the posterior is updated from the prior in light of a new task at the cost of a KL loss. In this work, we introduce these two model-agnostic regularization approaches to update dialogue evaluators for dialogue systems.
In this paper, we study the problem of automatic dialogue evaluation from the perspective of continual learning. Our work is inspired by the observed weak generalization of existing neural dialogue evaluators, and we propose to alleviate that issue by selectively adapting the evaluator to jointly fit the dialogue systems that have been and are to be evaluated. Experimental results show that our continual evaluators achieve performance comparable to evaluator reconstruction, while requiring significantly fewer annotations and eliminating the trouble of maintaining an ever-growing set of evaluators or annotations.
- Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §3.1.
- Statistical inference. Vol. 2, Duxbury Pacific Grove, CA. Cited by: Figure 1, §3.2.
- Pathnet: evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734. Cited by: §5.2.
- Science from fisher information: a unification. Cambridge University Press. Cited by: §2.2.2.
- Variational free energy and the laplace approximation. Neuroimage 34 (1), pp. 220–234. Cited by: §2.2.2.
- Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §5.1.
- Less-forgetting learning in deep neural networks. arXiv preprint arXiv:1607.00122. Cited by: §5.2.
- Adversarial evaluation of dialogue models. arXiv preprint arXiv:1701.08198. Cited by: §1, §5.1.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.3.
- Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, pp. 201611835. Cited by: §1, §2.2.2, §5.2.
- Rouge: a package for automatic evaluation of summaries. Text Summarization Branches Out. Cited by: §1.
- How not to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023. Cited by: §1, §3.2, §5.1.
- Towards an automatic turing test: learning to evaluate dialogue responses. arXiv preprint arXiv:1708.07149. Cited by: Figure 1, §1, §1, §3.2, §3.3, §5.1.
- MeMo: towards automatic usability evaluation of spoken dialogue services by user error simulations. In Ninth International Conference on Spoken Language Processing, Cited by: §5.1.
- Variational continual learning. arXiv preprint arXiv:1710.10628. Cited by: §1, §2.2.2, §2.2.2, §3.3, §5.2.
- BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §1.
- Progressive neural networks. arXiv preprint arXiv:1606.04671. Cited by: §5.2.
- Online model selection based on the variational bayes. Neural computation 13 (7), pp. 1649–1681. Cited by: §2.2.2.
- Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, Vol. 16, pp. 3776–3784. Cited by: §5.1.
- Ruber: an unsupervised method for automatic evaluation of open-domain dialog systems. arXiv preprint arXiv:1701.03079. Cited by: §1, §3.2, §5.1.
- PARADISE: a framework for evaluating spoken dialogue agents. In Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics, pp. 271–280. Cited by: §5.1.
- Docchat: an information retrieval approach for chatbot engines using unstructured documents. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 516–525. Cited by: §3.1.
- Towards implicit content-introducing for generative short-text conversation systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2190–2199. Cited by: §3.1.
- Multi-turn response selection for chatbots with deep attention matching network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1118–1127. Cited by: §5.1.