Generative conversational systems rely heavily on language modelling to generate an appropriate response to a user query. Given a context consisting of multiple utterances in a conversation, a generative conversational system can be formulated as a next-utterance prediction problem, where the task is to generate a response utterance conditioned on the context utterances. With the advent of deep learning and the availability of sufficient training data, parametric models such as recurrent neural networks and transformers are generally used for this language modelling task. Trained by minimizing the expected cross entropy between the training hard targets and the prediction logits, such models often overfit the training data and do not generalize well on the test set. Label smoothing, proposed by Szegedy et al. (2015) to improve the performance of the Inception image classifier on the ImageNet dataset, has gained wide acceptance in Natural Language Processing as a regularization technique to enhance the generalization capability of deep neural networks. Vaswani et al. (2017), in their work "Attention Is All You Need", which introduced the state-of-the-art transformer architecture, reported performance gains in machine translation from using label smoothing during training. Unlike other regularization techniques, which constrain the model parameters and hidden representations, label smoothing augments the actual targets by reducing the probability of the true target and assigning low probabilities to all other classes following a data-independent uniform distribution, thus preventing the model from predicting the correct labels too confidently during training. However, as pointed out by Pereyra et al. (2017) and Hinton et al. (2015), the probabilities assigned to both the correct and incorrect classes constitute the knowledge of a network. In language modelling, incorporating label smoothing and assigning a uniform probability to all the incorrect classes can convey incorrect knowledge to the model. For example, in response to the user query "How are you doing?", if we want to generate the sentence "I am doing good." as the next utterance, and we have already generated the phrase "I am doing", then "great" and "awesome" can convey the same message as "good". On the other hand, "bad" would convey a different message but would still be logically valid, whereas a random word like "aeroplane" would not make any sense. Hence, if we use label smoothing, we would not want to follow a uniform distribution over the incorrect classes, but rather weigh them using a mechanism that presents such knowledge to the model. In this paper we present ways of imparting this information by replacing the data-independent uniform distribution in label smoothing with a more appropriate data-dependent distribution, proportional to the pre-trained word embedding similarity between the actual target and the incorrect targets.
2 Related Work
Numerous techniques have been introduced to enhance the generalization capability of neural networks. However, as pointed out by Pereyra et al. (2017), while substantial advances have been made in regularizing model parameters, little work has been done on understanding output regularization techniques such as label smoothing or target data augmentation. We can broadly categorize the most recent approaches to generalization in conversational modelling into the following two categories.
Loss function augmentation: Li et al. (2016) propose using Maximum Mutual Information along with the Cross Entropy loss, in order to penalize bland responses like "I do not know", which are frequent in conversational datasets. Jiang et al. (2019) observed that the Cross Entropy loss prefers frequent tokens, which leads to generating bland responses; they therefore proposed augmenting the Cross Entropy loss with a frequency-based, corpus-dependent weighting mechanism to yield more diverse responses. Wang et al. (2020) experiment with optimal transport to match the sequences generated in the teacher and student modes, improving the test performance of student-forced networks by reducing the gap between the two modes.
Data augmentation: Cai et al. (2020) demonstrate that conversational datasets generally do not exhibit coherence in query-response pairs, which hurts the Cross Entropy loss. They propose a training data augmentation module that can not only replace words in the target response with similar words using BERT (Devlin et al., 2019), but also augment the style of the response while preserving its meaning. They further introduce a neural weighting mechanism that assigns importance to the augmented and golden training data, and report significant performance gains. Kang and Hashimoto (2020) show that the log loss is not robust to noise, and propose truncating the distribution of the training targets to obtain an easier-to-optimize and more robust loss function. He and Glass (2020) introduce a network that provides negative generated samples, and train the generation model to maximize the log likelihood of the training data while minimizing the likelihood of the negative samples. Our proposed method falls in the first category: we do not augment the training data, and instead augment the probabilities assigned to the incorrect labels for each correct label.
3 Methods and Experiments
We experiment with ways to augment the data-independent uniform distribution enforced by label smoothing. Let U = (w_1, w_2, ..., w_n) be an utterance consisting of n words. For each word w_i, label smoothing assigns a probability 1 - s to the true label, and distributes the smoothing factor s uniformly among the rest of the words in the vocabulary. We augment the distribution over the incorrect classes by weighting the smoothing factor s according to the cosine similarity between the GloVe (Pennington et al., 2014) word embedding of the correct word in the training data and those of all the words in the vocabulary. Thus, if the correct word to be predicted is "good", then the words "great" and "awesome" in the vocabulary receive a higher proportion of the smoothing factor s than an unrelated word like "aeroplane", presenting more correct knowledge to the model. Mathematically, let g_i be the GloVe word embedding of word w_i, G be the matrix containing the GloVe word embeddings of all the words in the vocabulary (including w_i), and c_i be the vector of cosine similarities between w_i and all the words in the vocabulary. Since GloVe word embeddings are learned representations, they can be noisy. Hence, we introduce a threshold t, below which the cosine similarity values in c_i are set to 0. We achieve this by introducing a mask m_t and multiplying the similarity vector by the mask. The resulting vector is normalised to lie between 0 and 1, and finally multiplied by s. We treat t as a model hyperparameter, tuned using grid search. We further reason that although GloVe embeddings are learned from text corpora, dissimilar words can still lie in close proximity in the embedding space, resulting in high cosine similarity scores and presenting incorrect knowledge to the model. To circumvent this problem, we additionally experiment with filtering out the cosine similarities of dissimilar words based on WordNet synsets, which we achieve by implementing another mask m_w.
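The construction above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the tiny embedding table is invented, and we assume the normalisation divides by the sum of the thresholded similarities so that the smoothing mass over the incorrect classes totals exactly s.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def smoothed_targets(target_idx, embeddings, s=0.1, t=0.5):
    """Label-smoothed target distribution in which the smoothing mass s is
    shared among the incorrect words in proportion to their (thresholded)
    cosine similarity with the correct word's embedding."""
    sims = [cosine(embeddings[target_idx], e) for e in embeddings]
    sims[target_idx] = 0.0                       # exclude the correct class
    sims = [c if c >= t else 0.0 for c in sims]  # threshold mask m_t: drop noisy similarities
    total = sum(sims)
    # If no word passes the threshold, the target collapses to a near one-hot
    # distribution (the failure mode discussed in the results section).
    dist = [s * c / total if total > 0 else 0.0 for c in sims]
    dist[target_idx] = 1.0 - s                   # probability of the true label
    return dist
```

With embeddings for, say, "good", "great" and "aeroplane", the word "great" receives the entire smoothing mass while "aeroplane" receives zero, mirroring the example in the text.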
We perform experiments on (i) the DailyDialog dataset (Li et al., 2017), a multi-turn open-domain dialogue dataset with 13,118 conversations pertaining to diverse day-to-day topics, and (ii) the EmpatheticDialogues dataset (Rashkin et al., 2019), an open-domain multi-turn dataset of 25,000 conversations grounded in emotional situations. We use the training, validation and test splits provided with the datasets. We concatenate all the turns in the query into one long text, using two special tokens, "[speaker1]" and "[speaker2]", to distinguish the speakers. To speed up computation, we restrict the context to the most recent 50 tokens, a limit determined analytically from the corpora. Please refer to the supplementary material for the code and the dataset.
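The context construction described above can be sketched as below. Whitespace tokenization and the alternating-speaker assumption are simplifications for illustration; the paper does not specify its tokenizer.

```python
def build_context(turns, max_tokens=50):
    """Concatenate conversation turns into one token sequence, inserting
    "[speaker1]"/"[speaker2]" markers, and keep only the most recent
    max_tokens tokens (50 in the paper)."""
    tokens = []
    for i, turn in enumerate(turns):
        # Assume speakers alternate, starting with speaker 1.
        tokens.append("[speaker1]" if i % 2 == 0 else "[speaker2]")
        tokens.extend(turn.split())  # naive whitespace tokenization
    return tokens[-max_tokens:]
```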
Since the primary scope of this paper is to experiment with different loss functions, we use a standard transformer encoder-decoder architecture as proposed by Vaswani et al. (2017), where the encoder encodes the most recent utterance in the conversation along with context from the previous turns. The encoder and decoder comprise 3 layers each, with 300-dimensional hidden representations and 6 attention heads in each multi-headed attention layer. The embedding layer is initialised with 300-dimensional GloVe embeddings, which are trained along with the rest of the network. Finally, a fully connected linear layer predicts the next word.
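The architecture can be sketched with PyTorch's built-in transformer. This is a skeleton under stated assumptions: the vocabulary size is a placeholder, and GloVe initialisation, positional encodings and attention masking are omitted.

```python
import torch
import torch.nn as nn

class Seq2SeqTransformer(nn.Module):
    # Dimensions from the paper: 3 encoder/decoder layers, 300-d hidden
    # representations, 6 attention heads (300 / 6 = 50 dims per head).
    def __init__(self, vocab_size, d_model=300, nhead=6, num_layers=3):
        super().__init__()
        # In the paper this embedding is initialised with GloVe vectors.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        # Fully connected output layer predicting the next word.
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        h = self.transformer(self.embed(src_ids), self.embed(tgt_ids))
        return self.out(h)
```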
We treat the vanilla Cross Entropy (CE) loss, CE loss with label smoothing, Kullback-Leibler (KL) divergence loss and KL loss with label smoothing as baselines. We experiment with different smoothing values s and cosine similarity thresholds t, and also perform an ablation study to analyze the usefulness of the WordNet similarity mask (w in the result tables). In total, we experiment with 30 different settings for each dataset.
[Table 1: scores per dataset, metric and loss, across smoothing values s ∈ {NA, 0.1, 0.2}, cosine similarity thresholds t ∈ {NA, 0, 0.5} and WordNet mask settings w ∈ {NA, 0, 1}; columns marked NA are baselines.]
4 Results and Analysis
We evaluate the generated responses using (i) BLEU score (Papineni et al., 2002), computed with sacreBLEU (Post, 2018), (ii) ROUGE-L score (Lin, 2004), which compares the Longest Common Subsequence (LCS), automatically taking into account sentence-level structural similarity and identifying the longest co-occurring in-sequence n-grams, and (iii) METEOR score (Banerjee and Lavie, 2005), an improvement over BLEU that incorporates stemming and synonymy matching along with exact word matching. Table 1 summarizes the results obtained in each of the experiments. The supplementary material contains results for all 60 experiments, along with additional evaluation metrics such as BERTScore (Zhang et al., 2020) and ROUGE-1 & 2. In Table 1, the columns containing "NA" are the baseline results, against which improvements are measured.
Observations: From the experiments we observe that: (i) Using a data-dependent, cosine similarity based distribution for label smoothing significantly outperforms the baselines (vanilla entropy-based loss with or without label smoothing). We observe a 12.67% increase in BLEU score, a 0.57% increase in ROUGE-L score, and a 4.16% increase in METEOR score on the DailyDialog dataset, and a 1.42% increase in BLEU score, a 1.95% increase in ROUGE-L score, and a 2.08% increase in METEOR score on the EmpatheticDialogues dataset. (ii) Using additional WordNet synonym based filtering (the mask w) does not help performance. To understand why, we plotted the distribution of the smoothing factor for the randomly selected word "fun", and observed that the word had only one overlapping WordNet synonym in our vocabulary: "play". This caused "play" to be assigned a probability of 0.1 and all the other words a probability of 0, except for "fun" itself, which was assigned a probability of 0.9. We reason that the sparsity of synonyms does not help in reducing the overconfidence of the model, as the resulting distribution is very similar to the non-smoothed targets. Figure 1 illustrates the probabilities assigned to the incorrect labels for the word "fun" by each of the methods discussed in this paper. (iii) Using the CE loss instead of KL generally improves performance when label smoothing is used. We reason that this happens because, with label smoothing, the constant entropy term of the smoothed targets in the KL loss reduces the overall loss value, yielding smaller gradients during back-propagation and hence slower learning. (iv) Generally, a high smoothing value (s = 0.2) does not help learning. (v) The cosine similarity threshold t should be treated as a hyperparameter, and requires tuning depending on the vocabulary of the dataset used. (vi) We also notice that a cosine similarity threshold as high as 0.8 does not help learning. We reason that such a high threshold creates a scenario similar to using WordNet synonyms, where the smoothing probability is distributed among very few (or no) words. Note that in order to enhance readability, the results with the 0.8 threshold are omitted from Table 1 and presented in the supplementary material.
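The relationship between the KL and CE losses invoked in observation (iii) can be checked numerically: for a soft (label-smoothed) target distribution p and model prediction q, KL(p || q) = H(p, q) - H(p), so the two losses differ exactly by the constant entropy H(p) of the targets. The probability vectors below are illustrative, not taken from the experiments.

```python
import math

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i * log(q_i)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    # H(p) = -sum_i p_i * log(p_i); zero for hard (one-hot) targets
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# A smoothed target over three classes and an arbitrary model prediction.
p = [0.9, 0.05, 0.05]
q = [0.7, 0.20, 0.10]
```

For hard one-hot targets H(p) = 0 and the two losses coincide; with smoothed targets the KL loss value is systematically smaller by H(p) > 0.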
5 Conclusion

Label smoothing has the undesirable property of assigning uniform probability to the incorrect labels, which presents incorrect knowledge for the model to learn from. In this paper we propose ways to convert this uniform distribution into a data-dependent distribution by weighting the smoothing probability with the cosine similarity between the word embeddings of the correct and incorrect labels. We further experiment with WordNet synonyms as an additional filtering criterion, and report our findings. Using the proposed methodology, we attain significant improvements over the baseline metrics across both datasets. However, one drawback of the proposed system is its inability to factor in context while weighting the distribution of the incorrect labels. As future research, we intend to address this drawback by using contextualised representations instead of static embeddings.
References

- Princeton University (2010). About WordNet. The Trustees of Princeton University.
- Banerjee and Lavie (2005). METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, Michigan, pp. 65–72.
- Cai et al. (2020). Data manipulation: towards effective instance learning for neural dialogue generation via learning to augment and reweight. arXiv preprint arXiv:2004.02594.
- Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
- He and Glass (2020). Negative training for neural dialogue response generation.
- Hinton et al. (2015). Distilling the knowledge in a neural network.
- Jiang et al. (2019). Improving neural response diversity with frequency-aware cross-entropy loss. In The World Wide Web Conference (WWW '19).
- Kang and Hashimoto (2020). Improved natural language generation via loss truncation.
- Li et al. (2016). A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 110–119.
- Li et al. (2017). DailyDialog: a manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei, Taiwan, pp. 986–995.
- Lin (2004). ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81.
- Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311–318.
- Pennington et al. (2014). GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543.
- Pereyra et al. (2017). Regularizing neural networks by penalizing confident output distributions.
- Post (2018). A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, pp. 186–191.
- Rashkin et al. (2019). Towards empathetic open-domain conversation models: a new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5370–5381.
- Szegedy et al. (2015). Rethinking the inception architecture for computer vision.
- Vaswani et al. (2017). Attention is all you need.
- Wang et al. (2020). Improving text generation with student-forcing optimal transport.
- Zhang et al. (2020). BERTScore: evaluating text generation with BERT. In International Conference on Learning Representations.
6 Supplementary Material
6.1 All Experiment Results
Table 2 shows the different baseline variants computed for both the DailyDialog and EmpatheticDialogues datasets. All performance improvements are measured against these baselines; for each metric, the best baseline score across all hyperparameter settings is used. Table 3 shows the results of the different hyperparameter settings and loss functions on the DailyDialog dataset, and Table 4 the corresponding results on the EmpatheticDialogues dataset. The best results, with detailed comparisons against the baselines, are discussed in the main paper.
[Table 2: baseline scores for each metric and loss on the DailyDialog and EmpatheticDialogues datasets, with s ∈ {NA, 0.1, 0.2}, t = NA, w = NA.]
6.2 Model Training and Parameters
All models were trained on a single Nvidia V100 GPU for 15 epochs each, with a learning rate of 2e-4, a batch size of 64, and the AdamW optimizer. Gradients were clipped at a value of 1, and dropout with probability 0.1 was applied during training. The average run-time of each experiment is 60 minutes, and each trained model has 17.7M parameters. The code, dataset and best performing models are publicly available through this link: download link.
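A single optimisation step with these settings could look as follows. This is a hedged sketch: the batch layout is a placeholder, and norm-based clipping is assumed (the paper only states that gradients were "clipped with a value of 1").

```python
import torch

def training_step(model, batch, optimizer, loss_fn):
    """One training step with the reported settings: AdamW (lr 2e-4 is set
    when the optimizer is constructed) and gradient clipping at 1.0."""
    optimizer.zero_grad()
    logits = model(batch["src"], batch["tgt_in"])
    # Flatten (batch, seq, vocab) logits against (batch, seq) target ids.
    loss = loss_fn(logits.view(-1, logits.size(-1)), batch["tgt_out"].view(-1))
    loss.backward()
    # Clip the global gradient norm at 1.0 (norm clipping assumed).
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    return loss.item()
```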
[Table 3: DailyDialog results for each metric and loss across s ∈ {0.1, 0.2}, t ∈ {0, 0.5, 0.8} and w ∈ {0, 1}.]
[Table 4: EmpatheticDialogues results for each metric and loss across s ∈ {0.1, 0.2}, t ∈ {0, 0.5, 0.8} and w ∈ {0, 1}.]