Enabling chatbots to engage in rich conversations requires massive datasets of human-human conversations Ritter et al. (2011); Sordoni et al. (2015); Vinyals and Le (2015); Zhang et al. (2018, 2019). Training such dialog agents requires substantial time and effort to collect an adequate number of high-quality conversation samples.
Hancock et al. (2019) alleviate this problem by introducing a self-feeding chatbot which can directly learn from user interactions. This chatbot requests users to provide natural language feedback when the users are dissatisfied with its response.
Hancock et al. (2019) treat this feedback as a gold response to the wrong turn and use it as an additional training sample to improve the chatbot.
Although natural language feedback is cheap to collect from a chatbot’s end-users, it most often cannot be used directly as a training sample, since feedback is usually not the answer itself but simply contains hints to the answer. Table 1 shows some feedback text samples. Naive modification of feedback using heuristics like regular expressions would lead to generic responses that are ineffective in improving the dialog ability of chatbots Li et al. (2016). Additionally, writing an exhaustive set of regular expression rules is time-consuming and requires extensive analysis of the data. Annotating data to convert feedback text to natural responses is also expensive and defeats the purpose of learning from feedback text.
|you could say hey, i’m 30. how old are you?|
|yes, i play battlefield would be a great answer|
|tell me what your favorite breakfast food is|
|answer the question about having children!|
In this work, we propose a generative adversarial setup for converting such noisy feedback instances into natural, human-like responses that provide better training signals for dialog agents. Figure 1 gives a bird’s-eye view of our problem. We frame this problem as a variant of text style transfer, where the generator is tasked with making the feedback resemble the optimal response to the user’s previous utterance, and the discriminator is a classifier that distinguishes whether a given response is feedback or natural.
Our main contributions are the following:
We introduce Feed2Resp, a text style transfer system that converts feedback to natural responses without full supervision, thus generating additional training samples (Section 2).
We show that training on Feed2Resp-modified responses leads to improved chatbot accuracy (Section 4). Our results also reveal that training naively on feedback does not help when the original chatbot is already a strong model, whereas Feed2Resp helps strong models as well.
2 Feedback to Natural Response Model
Hancock et al. (2019) introduce a novel variant of a self-feeding chatbot in which the dialogue agent is equipped with the capability of extracting new training samples while in conversation with humans after deployment (Figure 1). The agent also employs a satisfaction module trained to predict how satisfied the partner is with the responses it provides. When the chatbot is engaged in a conversation where the predicted satisfaction is below a defined threshold (usually 0.5), a feedback loop is triggered in which the agent asks the human user what the response should have been. The agent then uses the feedback text as the target response in new training examples for the primary dialogue ranking task. Hancock et al. (2019) show that this cost-efficient method of extracting new examples improves the chatbot’s dialogue abilities. In this work, we show that naive use of the collected feedback is not necessarily a good technique, and we instead propose an approach to better utilize the collected feedback samples.
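The deployment-time loop described above can be sketched as follows. This is an illustrative toy, not the authors' implementation: the function and field names are hypothetical, and only the 0.5 threshold comes from the text.

```python
def deployment_step(bot_response, satisfaction_score, request_feedback, threshold=0.5):
    """One turn of the self-feeding loop: if predicted satisfaction falls
    below the threshold, ask the user for feedback and keep it as a new
    training target; otherwise continue the conversation normally."""
    if satisfaction_score < threshold:
        feedback = request_feedback()  # "what should I have said?"
        return {"response": feedback, "new_training_sample": True}
    return {"response": bot_response, "new_training_sample": False}
```

In the self-feeding chatbot, the stored feedback is later used as the target response of a new training example for the ranking task.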
We pose the problem of converting feedback to resemble a natural response as a text style transfer problem. We observe that feedback is more instructional and judgemental, whereas a natural response is direct (answering questions) and engaging (asking questions, containing humor). We naturalize the feedback into a response and use it as an additional training sample to improve the chatbot.
A fully supervised approach to converting feedback to natural responses is infeasible, as we do not have paired (feedback, response) examples, and we thus adopt an adversarial setup. We utilize a GAN (Goodfellow et al., 2014) formulation where the generator modifies the feedback’s style to make it seem part of a natural conversation, in turn fooling the discriminator, which learns to distinguish natural responses from feedback. Our model, Feed2Resp, is shown in Figure 2.
2.1 Adversarial Setup
Given an input sentence $x$ (feedback or natural response) with source style $s$, conversation history $h$ and target style $\hat{s}$, the generator performs the mapping

$$\hat{y} = G_\theta(x, h, \hat{s})$$

Here $\hat{y}$ is the rewrite of $x$ into style $\hat{s}$. It is often the case that feedback and desired responses share many words (see Table 9). We use a BART encoder-decoder initialized with pretrained weights as our generator, since its denoising objective helps in copying from the input while also producing realistic sentences Lewis et al. (2019).
We additionally pretrain our model in a summarization setting to extract only the response when presented with the conversation history and response. This helps maintain brevity while still integrating details from the context into the response.
The discriminator is a transformer encoder network that learns to distinguish the styles of feedback and natural responses. Given an input text $x$ and conversation history $h$, it predicts the style class of $x$. Formally, it is defined as follows:

$$D_\phi(x, h) = p_\phi(s \mid x, h)$$
2.2 Feed2Resp Learning
We train Feed2Resp on three main objectives that help the model reconstruct sentences when the style is not changed, change style meaningfully, and distinguish different styles. These objectives have been shown to work well in other style transfer scenarios Dai et al. (2019).
Self reconstruction objective
For the scenario where the target style is the same as the source style, we train the generator to reconstruct the sentence given as input. Considering the input sentence $x$ with both source and target style $s$, we minimize the negative log-likelihood loss of generating the same sentence as output:

$$\mathcal{L}_{\mathrm{self}} = -\log p_\theta(x \mid x, h, s)$$
Cycle consistency objective
Taking inspiration from CycleGAN Zhu et al. (2017), we introduce a cycle consistency constraint to ensure that the model learns to preserve the meaning when it modifies the style of the original sentence. We first transform $x$ to style $\hat{s}$ to produce $\hat{y}$, i.e., $\hat{y} = G_\theta(x, h, \hat{s})$.
Subsequently, we feed $\hat{y}$ as input with the target style $s$, and the model is trained to reconstruct the original sentence $x$. We minimize the negative log-likelihood loss, given by

$$\mathcal{L}_{\mathrm{cycle}} = -\log p_\theta(x \mid \hat{y}, h, s)$$
Style modification objective
To ensure that the style of an input sentence is changed to match the target style $\hat{s}$, we use the discriminator’s confidence as a training signal. The generator aims to maximize the probability that the discriminator classifies the transformed input as the target style, and we therefore use the negative log-likelihood of the discriminator as our loss:

$$\mathcal{L}_{\mathrm{style}} = -\log p_\phi(\hat{s} \mid \hat{y}, h)$$
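As a toy illustration (not the training code), the three objectives combine into a single generator loss. The probabilities below stand in for the model likelihoods in the objectives above, and the weighting scheme is an assumption for illustration:

```python
import math

def nll(p):
    """Negative log-likelihood of a probability."""
    return -math.log(p)

def feed2resp_generator_loss(p_self, p_cycle, p_style,
                             w_self=1.0, w_cycle=1.0, w_style=1.0):
    """Weighted sum of the self-reconstruction, cycle-consistency and
    style-modification objectives: perfect probabilities give zero loss."""
    return (w_self * nll(p_self)
            + w_cycle * nll(p_cycle)
            + w_style * nll(p_style))
```

For instance, a generator that reconstructs perfectly but fools the discriminator only half the time incurs a style loss of log 2.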
2.3 End-to-end training
The discrete nature of sampling and the non-differentiability of the argmax operator prevent gradient backpropagation.
Following Dai et al. (2019), we consider the softmax distribution produced by the generator, as the ‘soft’ generated sentence and use it as input for further downstream networks to maintain differentiability.
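A minimal sketch of the ‘soft’ token trick: instead of embedding the non-differentiable argmax token, the softmax distribution over the vocabulary weights the token embeddings, giving downstream networks a differentiable input. Plain Python with toy sizes; the actual models operate on tensors:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def soft_token_embedding(logits, embedding_table):
    """Expected embedding under the softmax distribution: a differentiable
    stand-in for looking up the embedding of the argmax token."""
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [sum(p * emb[d] for p, emb in zip(probs, embedding_table))
            for d in range(dim)]
```

With uniform logits over two tokens, the soft embedding is the average of their embeddings, so gradients can flow back through the probabilities.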
3 Experimental Setup
In Feed2Resp, the optimizer for both the generator and discriminator is AdamW. The learning rate of the generator is 5e-6, while the learning rate of the discriminator is 1e-4. The discriminator uses 4 stacked transformer layers and 4 attention heads. The token embedding size, style embedding size, positional embedding size and hidden size are all 256. For the BART Lewis et al. (2019) generator, we use the implementation from HuggingFace Wolf et al. (2019) and initialize the model with pretrained weights from the CNN/Daily Mail summarization task. Due to the characteristics of human responses (see Appendix A), we limit the length of text generation to a maximum of 50 words and impose a repetition penalty of 2.0 to improve the diversity of the output.
When evaluating the effectiveness of the modified feedback responses, we use two implementations of dialog agents provided by ParlAI Miller et al. (2017): BiEncoder and PolyEncoder. BiEncoder has two transformer layers and two attention heads; its optimizer is Adamax with a learning rate of 0.0025. PolyEncoder uses 12 transformer layers and 12 attention heads; its optimizer is Adamax with a learning rate of 5e-05.
The hyperparameters for the best performing model were chosen by random sampling, with the outputs of the style transfer task subsequently verified by human evaluation. The full list of hyperparameters is given in Table 8.
4 Experiments

Our goal is to test whether feedback helps improve the chatbot. To do this, we compare models trained on conversational data with and without feedback data. Below we describe the chatbot evaluation setting, our datasets, the main models, and the different settings of these models with and without feedback.
4.1 Chatbot evaluation task and metrics
Following Hancock et al. (2019), we choose PersonaChat Zhang et al. (2018) as the main evaluation dataset. This dataset consists of human-human conversations collected via crowdsourcing, where each crowdworker assumes a persona. Since persona representation is a challenging research problem on its own, Hancock et al. ignore the persona and use only the conversations to train chatbots, and we follow the same approach. At test time, the model is presented with the conversation history and 20 candidate responses and has to pick the correct response. We therefore use HITS@1/20 as the evaluation metric.
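Concretely, HITS@1/20 is the fraction of test examples for which the gold response is ranked first among the 20 candidates. A minimal sketch (names are ours, not ParlAI's):

```python
def hits_at_1(scores, gold_index):
    """1 if the top-scored candidate is the gold response, else 0."""
    top = max(range(len(scores)), key=lambda i: scores[i])
    return int(top == gold_index)

def hits_at_1_over_20(examples):
    """examples: list of (scores_for_20_candidates, gold_index) pairs.
    Returns the fraction of examples where the gold response wins."""
    return sum(hits_at_1(s, g) for s, g in examples) / len(examples)
```

A random ranker scores about 0.05 on this metric (1 in 20), which is why absolute gains of a few points are meaningful.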
4.2 Feedback data
We use the feedback data collected by Hancock et al. (2019), as this removes orthogonal factors, such as differences in chatbot interfaces and annotation frameworks, which are not the focus of this work. Hancock et al. collected this feedback by deploying bi-encoder chatbots (Section 4.3) trained on varying amounts of training data and having them converse with crowdworkers. Whenever the bot’s response was not satisfactory, natural language feedback was collected from the crowdworker.
The data thus collected contains 60k human-bot turns, of which the last turn is always the feedback.
4.3 Chatbot Models
Given the conversation history and several candidate responses, the chatbot is trained to rank the correct candidate at the top. We use the following models as our chatbots.
BiEncoder Hancock et al. (2019); Humeau et al. (2020) contains two transformers, one for summarizing the conversation history and the other for summarizing candidate responses into embeddings. The response with the highest similarity is taken as the best candidate response.
PolyEncoder Humeau et al. (2020) summarizes a context and candidate responses into several embeddings. To contextualize the context and candidates together, it performs cross-encoder attention over the summary embeddings and scores each candidate.
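The ranking step both models share can be sketched as a similarity scorer over precomputed embeddings. This is a toy with hand-written vectors; in the real models the embeddings come from the transformers described above:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rank_best_candidate(context_embedding, candidate_embeddings):
    """Score each candidate response by dot-product similarity to the
    context embedding and return the index of the highest-scoring one."""
    scores = [dot(context_embedding, c) for c in candidate_embeddings]
    return max(range(len(scores)), key=lambda i: scores[i])
```

Training pushes the gold response's embedding toward the context embedding so that it wins this ranking.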
4.4 Feedback-based Models
We train and test the above models in the following settings.
NoFeedback: The model is trained only on human conversations.
Feedback: We train on the combination of human conversations and unmodified feedback data. This setting is similar to Hancock et al. (2019).
Heuristic: We design and use six regular expression rules based on the frequent patterns in the data that convert feedback to plausible dialog responses (see Appendix E) and train the chatbot models on human conversations along with the modified feedback.
Feed2Resp: We use our main model (Section 2) to modify feedback to natural responses and train the chatbot models on modified feedback along with human conversations.
|Model (BiEncoder)|Valid|Test|
|NoFeedback|49.03 (0.66)|49.49 (0.49)|
|Feedback|49.27 (1.06)|49.97 (1.30)|
|Heuristic|48.85 (0.70)|49.85 (0.72)|
|Feed2Resp|50.84 (0.50)|51.32 (0.43)|
|Model (PolyEncoder)|Valid|Test|
|NoFeedback|73.35 (0.70)|69.94 (0.37)|
|Feedback|72.63 (0.14)|68.48 (0.64)|
|Feed2Resp|78.14 (0.40)|75.96 (0.80)|
Table 2: HITS@1/20 of chatbot models on the PersonaChat valid and test sets. Training on Feed2Resp-modified feedback gives large improvements. The variances across three different runs are also shown.
5 Results and Discussion
The experimental details of the model variants are described in Section 3. Table 2 shows the average HITS@1/20 of all models on the PersonaChat validation and test sets over three runs. We were able to replicate the results of Hancock et al. (2019), which show that BiEncoder performance improves slightly (+0.48 on test) when Feedback is used. Heuristic edits to feedback do not help, while Feed2Resp responses improve results beyond Feedback and also have lower variance. PolyEncoder is a much stronger chatbot than BiEncoder. We see that naive use of Feedback or Heuristic deteriorates PolyEncoder’s performance, while Feed2Resp emerges as the clear winner with a +6.0 point improvement on the test set over NoFeedback.
|Modification type|Example|% of samples|Accuracy|
|Rewrite|F: tell me about your favorite show → F2R: I love watching TV shows and sitcoms like friends|18.5%|81%|
|Remove|F: you could’ve said, yes the sugar cinnamon kind is my favorite → F2R: yes the sugar cinnamon kind is my favorite|40%|68.7%|
|Retain|F: the temperature is hot → F2R: the weather is hot|41.5%|74.6%|
Table 3: Types of modification performed by Feed2Resp, with an example feedback (F) and Feed2Resp output (F2R), the proportion of sampled responses of each type, and the accuracy of the modifications.
We randomly sample 200 feedback responses from Feed2Resp to determine the kinds of modifications the model performs (Table 3). We observe three main types of modifications: Rewrite, Remove and Retain. Rewrite is when the feedback contains a hint to the answer but not the answer itself. Remove is when the feedback contains the answer along with extraneous words that have to be removed. Retain covers cases where the model copies or paraphrases the feedback. Among these, Remove has the lowest modification accuracy. Upon inspection, we find that these are the cases which require multiple removals. For example, for "You should reply with either yes or no", the model predicts yes or no together instead of either one of them. Additionally, we visualize the attention maps of the discriminator to observe which words contribute most to its classification decision (Figure 3). The discriminator learns to distinguish feedback from normal dialog responses by the presence of sequences like you could have, you should have, tell me, etc. The generator thus learns to remove such extraneous sequences and make the feedback look like a plausible response. We present a sample of modified outputs of Feed2Resp in Appendix C.
In this work, we showed that while chatbots can be improved using natural language feedback, converting feedback into natural responses that fit the conversation outperforms naive use of the feedback. We presented Feed2Resp, a generative adversarial model that converts feedback to natural responses without requiring manually annotated parallel data. Our results show that Feed2Resp yields a 6 point improvement for the PolyEncoder chatbot, an already powerful dialog ranking agent. This is a strong result, as HITS@1/20 is a tough metric to improve upon (Hancock et al., 2019).
Our work joins the class of models that use natural language feedback to improve different tasks, e.g., image captioning Ling and Fidler (2017) and classification Srivastava et al. (2017); Hancock et al. (2018); Murty et al. (2020). While these methods use feedback for reward shaping or feature extraction, we use feedback to produce a correct response using adversarial learning. We pose this problem as a style transfer problem, inspired by the style transfer literature (Shen et al., 2017; Xu et al., 2018; Li et al., 2018; Conneau and Lample, 2019; Dai et al., 2019). While these works focus on stylistic attributes of sentences, e.g., sentiment, we explore this problem in the context of improving chatbots.
Acknowledgements

We thank Yue Dong for her many helpful discussions during the course of this project. We also thank Sandeep Subramanian for his insightful guidance at a crucial stage of this work. This research was enabled in part by computational support provided by Compute Canada (www.computecanada.ca). The last author is supported by the NSERC Discovery Grant on robust conversational models for accessing the world’s knowledge.
References

- Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, pp. 7057–7067.
- Ning Dai, Jianze Liang, Xipeng Qiu, and Xuanjing Huang. 2019. Style Transformer: unpaired text style transfer without disentangled latent representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5997–6007.
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’14), Cambridge, MA, USA, pp. 2672–2680.
- Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazaré, and Jason Weston. 2019. Learning from dialogue after deployment: feed yourself, chatbot! In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3667–3684.
- Braden Hancock, Paroma Varma, Stephanie Wang, Martin Bringmann, Percy Liang, and Christopher Ré. 2018. Training classifiers with natural language explanations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1884–1895.
- Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2020. Poly-encoders: architectures and pre-training strategies for fast and accurate multi-sentence scoring. In 8th International Conference on Learning Representations (ICLR).
- Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2019. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
- Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 110–119.
- Juncen Li, Robin Jia, He He, and Percy Liang. 2018. Delete, retrieve, generate: a simple approach to sentiment and style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1865–1874.
- Huan Ling and Sanja Fidler. 2017. Teaching machines to describe images with natural language feedback. In Advances in Neural Information Processing Systems 30, pp. 5068–5078.
- Alexander H. Miller, Will Feng, Adam Fisch, Jiasen Lu, Dhruv Batra, Antoine Bordes, Devi Parikh, and Jason Weston. 2017. ParlAI: a dialog research software platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Copenhagen, Denmark, pp. 79–84.
- Shikhar Murty, Pang Wei Koh, and Percy Liang. 2020. ExpBERT: representation engineering with natural language explanations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
- Alan Ritter, Colin Cherry, and William B. Dolan. 2011. Data-driven response generation in social media. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 583–593.
- Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems 30, pp. 6830–6841.
- Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 196–205.
- Shashank Srivastava, Igor Labutov, and Tom Mitchell. 2017. Joint concept learning and semantic parsing from natural language explanations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1527–1536.
- Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
- Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, et al. 2019. HuggingFace’s Transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
- Jingjing Xu, Xu Sun, Qi Zeng, Xiaodong Zhang, Xuancheng Ren, Houfeng Wang, and Wenjie Li. 2018. Unpaired sentiment-to-sentiment translation: a cycled reinforcement learning approach. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 979–988.
- Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2204–2213.
- Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. DialoGPT: large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.
- Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232.
Appendix A Dataset Statistics
We validate our approach to improving chatbot performance using the PersonaChat Zhang et al. (2018) dialogue dataset and the human-bot feedback dataset of Hancock et al. (2019). Table 6 reports the size of each dataset, all of which are available via ParlAI (https://parl.ai/projects/self_feeding/).
To train the Feed2Resp model, we take the entire Feedback dataset and an equal number of randomly chosen samples from the Dialogue dataset. We then use a train-dev-test split of 0.8:0.1:0.1 for training and evaluation of the model.
|Statistic|Dialogue|Feedback|
|#Words in context (mean)|79|13|
|#Words in context (median)|77|6|
|#Words per turn (median)|10.7|7.1|
|#Words per turn (mean)|11|6|
We examine the average number of turns and words in dialogues from the feedback and human-human conversation distributions. On average, dialogues in the feedback distribution have fewer turns than human-human conversations, and fewer words per turn.
Appendix B Preparation of Training Data
We use the dataset provided by Hancock et al. (2019), which is a cleaner version of the PersonaChat dataset and comes with a new crowdsourced test set. We sample an equal number of examples from the Dialogue dataset, giving them label 0, and the Feedback dataset, giving them label 1. The final response is combined with the last n turns using the delimiter [RES]. Typically, n=2 turns are used for each conversation example. Conversation turns are separated with the delimiter tokens [P1] or [P2].
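A hypothetical helper illustrating this formatting. Only the delimiter strings and labels come from the text above; the function name and the assumption that speakers strictly alternate are ours:

```python
def build_classifier_example(turns, response, label, n=2):
    """Join the last n conversation turns with [P1]/[P2] speaker delimiters
    (assuming alternating speakers) and append the response after [RES].
    label 0 = dialogue (natural response), label 1 = feedback."""
    tagged = [f"[P{(i % 2) + 1}] {turn}" for i, turn in enumerate(turns)]
    context = " ".join(tagged[-n:])
    return {"text": f"{context} [RES] {response}", "label": label}
```

The resulting text-label pairs form the training data for the style discriminator.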
Appendix C Feed2Resp examples
Here we include several examples of predictions from different models in Table 9.
Appendix D Computing Infrastructure and Runtime of Experiments
All experiments were conducted on Nvidia V100 GPUs. Average runtimes are listed in Table 7 and assume a single Nvidia V100 GPU.
|Task||Approximate Training Time|
Appendix E Regular Expressions to modify Feedback
As described in Section 4.4, we use a combination of the following regular expressions to strip filler words and extra choices from the feedback data:
Strip patterns:
    r"you could have|you should have|you could|you should"
    r"^said|^saying|^say|^tell |^told |^admit |asked |^ask |^answer |^answered |^talked |^talk "
    r"^about|^me|^that"
    r"if|whether|not"

Pronoun substitutions:
    "you are "  → "i am "
    "your "     → "my "
    "you've "   → "i've "
    "you were"  → "i was"
    "you "      → "i "
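A simplified sketch of how such rules could be applied in Python. This condenses the rule set above for illustration and is not the exact pipeline used in the experiments:

```python
import re

# Condensed from the Appendix E rules: phrases to strip, then pronoun flips.
STRIP_PATTERNS = [
    r"you could have |you should have |you could |you should ",
    r"^said |^saying |^say |^tell |^told |^answer |^answered ",
]
SUBSTITUTIONS = [
    ("you are ", "i am "), ("your ", "my "),
    ("you've ", "i've "), ("you were", "i was"), ("you ", "i "),
]

def heuristic_convert(feedback):
    """Strip feedback-specific filler, then flip second-person pronouns
    to first person to turn an instruction into a plausible response."""
    text = feedback.lower()
    for pattern in STRIP_PATTERNS:
        text = re.sub(pattern, "", text)
    for old, new in SUBSTITUTIONS:
        text = text.replace(old, new)
    return text.strip()
```

As Section 5 shows, such rule-based conversion is brittle and does not improve the chatbots, which motivates the learned Feed2Resp approach.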
Table 8: The hyperparameters used to obtain the results in Table 2.
|Conv. Hist.||[H] It makes more sense to order food online. There are better deals. [B] good choice. what is your favorite food? [H] pizza. what is yours?|
|Feedback||you could have spoken about your favorite food|
|Feed2Resp||my favorite food is pasta.|
|Conv. Hist.||[H] hello [B] hi. how are you doing? [H] what do you do for a living?|
|Feedback||tell me a job or career|
|Feed2Resp||i am a teacher.|
|Conv. Hist.||[H] hello [B] hello [H] do you like to watch movies?|
|Feedback||yes or no|
|Feed2Resp||yes i like movies or no i hate movies.|
|Conv. Hist.||[H] I tried to climb a tree and broke my ankle.|
|Feedback||that sounds painful|
|Feed2Resp||i am sorry. that sounds painful.|
|Conv. Hist.||[H] hello [B] hi. [H] do you exercise?|
|Feedback||yes, i do exercise.|
|Feed2Resp||yes, i do exercise and go jogging.|
|Conv. Hist.||[H] what do you think about wintertime?|
|Feedback||you could say that you like snow or you think it is too cold.|
|Feed2Resp||you like snow or you think it is too cold.|
|Conv. Hist.||[H] hello [B] hello [H] have you ever ate chinese food?|
|Feedback||yes i like chinese.|
|Feed2Resp||yes i like chinese noodles and dumplings.|
|Conv. Hist.||[H] who is the greek goddess of love? [B] yes , i lived in south florida for years [H] sorry, i meant who is the greek goddess of love?|
|Feed2Resp||venus is a very good player. he has a lot of experience playing with the likes of jurgen klinsmann.|