Learning Improvised Chatbots from Adversarial Modifications of Natural Language Feedback

10/14/2020 ∙ by Makesh Narsimhan Sreedhar, et al. ∙ Montréal Institute of Learning Algorithms, McGill University

The ubiquitous nature of chatbots and their interaction with users generate an enormous amount of data. Can we improve chatbots using this data? A self-feeding chatbot improves itself by asking for natural language feedback when a user is dissatisfied with its response and uses this feedback as an additional training sample. However, user feedback in most cases contains extraneous sequences that hinder its usefulness as a training sample. In this work, we propose a generative adversarial model that converts noisy feedback into a plausible natural response in a conversation. The generator’s goal is to convert the feedback into a response that answers the user’s previous utterance and to fool the discriminator, which distinguishes feedback from natural responses. We show that augmenting the original training data with these modified feedback responses improves the original chatbot performance from 69.94 to 75.96 HITS@1/20 on the PersonaChat test set, a large improvement given that the original model is already trained on 131k samples.




1 Introduction

Enabling chatbots to hold engaging conversations requires massive datasets of human-human conversations Ritter et al. (2011); Sordoni et al. (2015); Vinyals and Le (2015); Zhang et al. (2018, 2019). Training such dialog agents requires substantial time and effort to collect an adequate number of high-quality conversation samples.

Hancock et al. (2019) alleviate this problem by introducing a self-feeding chatbot which can directly learn from user interactions. This chatbot requests users to provide natural language feedback when the users are dissatisfied with its response.

Hancock et al. (2019) treat this feedback as a gold response to the wrong turn and use it as an additional training sample to improve the chatbot.

Figure 1: When the bot provides a poor response to the question posed by the user, the bot requests natural language feedback. We use the conversation context and the feedback to construct a plausible response to the user query and use it as an additional training sample to improve the chatbot.

Although natural language feedback is cheap to collect from a chatbot’s end-users, feedback most often cannot be used directly as a training sample, since it is usually not the answer itself but simply contains hints to the answer. Table 1 shows some feedback text samples. Naive modification of feedback using heuristics like regular expressions would lead to generic responses that are ineffective in improving the dialog ability of chatbots Li et al. (2016). Additionally, writing an exhaustive set of regular expression rules is time-consuming and requires extensive analysis of the data. Annotating data to convert feedback text to natural responses is also expensive and defeats the purpose of learning from feedback text.

you could say hey, i’m 30. how old are you?
yes, i play battlefield would be a great answer
tell me what your favorite breakfast food is
answer the question about having children!
Table 1: Samples of feedback to the chatbot. These contain hints to the answer but they are not the answers themselves.

In this work, we propose a generative adversarial setup for converting such noisy feedback instances into natural, human-like responses that provide better training signals for the dialog agents. Figure 1 gives a bird’s-eye view of our problem. We frame this problem as a variant of text style transfer where the generator is tasked with making the feedback resemble the optimal response to the user’s previous utterance and the discriminator is a classifier that distinguishes whether a given response is feedback or natural.

Our main contributions are the following:

  • We introduce Feed2Resp, a text style transfer system that converts feedback to natural responses without full supervision, thus generating additional training samples (Section 2).

  • We show that training on Feed2Resp-modified responses improves the accuracy of chatbots (Section 4). Our results also reveal that training naively on feedback doesn’t help when the original chatbot is already a strong model, whereas Feed2Resp helps strong models as well.

2 Feedback to Natural Response Model

Hancock et al. (2019) introduce a novel variant of a self-feeding chatbot in which the dialogue agent is equipped with the capability of extracting new training samples while in conversation with humans after deployment (Figure 1). The agent also employs a satisfaction module which is trained to predict how satisfied the partner is with the responses it provides. When the chatbot is engaged in a conversation where the predicted satisfaction is below a defined threshold (usually 0.5), a feedback loop is triggered in which the agent asks the human user what the response should have been. The agent then uses the feedback text as the target response in new training examples for the primary dialogue ranking task. Hancock et al. (2019) show that this cost-efficient method of extracting new examples improves the chatbot’s dialogue abilities. In this work, we show that naive use of the collected feedback is not necessarily a good technique and instead propose an approach to better utilize the collected feedback samples.
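The feedback trigger described above can be sketched as follows. This is only an illustration: the class name `FeedbackCollector`, its methods, and the hard-coded satisfaction scores are hypothetical, and the satisfaction predictor itself (a trained model in the original setup) is replaced by a plain number.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

SATISFACTION_THRESHOLD = 0.5  # the "usually 0.5" threshold mentioned in the text


@dataclass
class FeedbackCollector:
    """Stores (conversation history, feedback) pairs gathered after deployment."""
    threshold: float = SATISFACTION_THRESHOLD
    new_examples: List[Tuple[List[str], str]] = field(default_factory=list)

    def should_request_feedback(self, predicted_satisfaction: float) -> bool:
        # A feedback request is triggered only when satisfaction falls below the threshold.
        return predicted_satisfaction < self.threshold

    def record(self, history: List[str], feedback: str) -> None:
        # The feedback text becomes the target response of a new training example.
        self.new_examples.append((list(history), feedback))


collector = FeedbackCollector()
history = ["hello", "hi. how are you doing?", "what do you do for a living?"]

# Assumed score from a (hypothetical) satisfaction model after a poor bot reply.
if collector.should_request_feedback(0.3):
    collector.record(history, "tell me a job or career")
```

Each stored pair is a noisy training example: the feedback here ("tell me a job or career") hints at the answer without being one, which is the gap Feed2Resp is meant to close.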

We pose the problem of converting feedback to resemble natural response as a text style transfer problem. We observe that feedback is more instructional and judgemental, whereas natural response is direct (answering questions) and engaging (asking questions, contains humor). We naturalize the feedback to a response and use it as an additional training sample to improve the chatbot.

A fully supervised approach to convert feedback to natural response is infeasible as we do not have paired (feedback, response) examples, and thus we adopt an adversarial setup. We utilize a GAN (Goodfellow et al., 2014) formulation where the generator modifies the feedback’s style to make it seem part of a natural conversation, and in turn fools the discriminator, which knows how to distinguish natural responses from feedback. Our model, Feed2Resp, is shown in Figure 2.

Figure 2: Feed2Resp Architecture

2.1 Adversarial Setup

Given an input sentence $x$ (feedback or natural response) with source style $s$, conversation history $h$ and target style $\hat{s}$, the generator performs the mapping

$$g_\theta : (x, h, \hat{s}) \mapsto \hat{y}$$

Here $\hat{y}$ is the rewrite of $x$ into style $\hat{s}$. It is often the case that feedback and desired responses share many words (see Table 9). We use a BART encoder-decoder initialized with pretrained weights as our generator, since its denoising objective helps in copying from the input while also producing realistic sentences Lewis et al. (2019).

We additionally pretrain our model under the summarization setting to extract only the response when presented with conversation history and response. This helps maintain brevity while still integrating details from the context in the response.

The discriminator is a transformer encoder network that learns to distinguish the style of feedback and natural responses. Given an input text $x$ and conversation history $h$, it predicts the style class $c$ of $x$. Formally, it is defined as follows:

$$d_\phi : (x, h) \mapsto c$$


2.2 Feed2Resp Learning

We train Feed2Resp on three main objectives that help the model to reconstruct sentences when the style is not changed, change its style meaningfully and distinguish different styles. These objectives are shown to work well in other style transfer scenarios Dai et al. (2019).

Self reconstruction objective

For the scenario where the target style is the same as the source style, we train the generator to reconstruct the sentence given as input. Considering the input sentence as $x$ with conversation history $h$, and the source and the target style as $s$, we minimize the negative log-likelihood loss to generate the same sentence as output:

$$\mathcal{L}_{\text{self}}(\theta) = -\log p_\theta(y = x \mid x, h, s)$$


Cycle consistency objective

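Taking inspiration from CycleGAN Zhu et al. (2017), we introduce a cycle consistency constraint to ensure that the model learns to preserve the meaning when it modifies the style of the original sentence. We first transform the input sentence $x$ (with conversation history $h$ and source style $s$) to the target style $\hat{s}$ to produce $\hat{y}$, i.e., $\hat{y} = g_\theta(x, h, \hat{s})$, where $g_\theta$ is the generator. Subsequently, we feed $\hat{y}$ as input with the target style set to $s$, and the model is trained to reconstruct the original sentence $x$. We minimize the negative log-likelihood loss, which is given by

$$\mathcal{L}_{\text{cycle}}(\theta) = -\log p_\theta(y = x \mid g_\theta(x, h, \hat{s}), h, s)$$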


Style modification objective

To ensure that the style of an input sentence $x$ is changed to match the target style $\hat{s}$, we use the discriminator’s confidence as the training signal. The generator wants to maximize the probability that the discriminator classifies the transformed input as the target style, and therefore we use the negative log-likelihood of the discriminator as our loss:

$$\mathcal{L}_{\text{style}}(\theta) = -\log p_\phi(c = \hat{s} \mid g_\theta(x, h, \hat{s}), h)$$
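To make the three objectives concrete, the sketch below evaluates each one on toy numbers. All probabilities are invented for illustration and stand in for the outputs of the generator and discriminator. Adding the three terms without weights is an assumption, since the text does not say how they are combined or scheduled.

```python
import math


def sequence_nll(step_probs, target_ids):
    """Sum over decoding steps of -log p(target token)."""
    return -sum(math.log(probs[tok]) for probs, tok in zip(step_probs, target_ids))


def style_nll(style_probs, target_style):
    """-log of the discriminator probability assigned to the target style."""
    return -math.log(style_probs[target_style])


x = [0, 2]  # token ids of a toy input sentence over a 3-token vocabulary

# Self reconstruction: the generator keeps the source style and should reproduce x.
probs_same_style = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
loss_self = sequence_nll(probs_same_style, x)

# Cycle consistency: x is rewritten into the target style, then mapped back to x.
probs_round_trip = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
loss_cycle = sequence_nll(probs_round_trip, x)

# Style modification: the discriminator should label the rewrite as the target style.
discriminator_probs = {"feedback": 0.25, "response": 0.75}
loss_style = style_nll(discriminator_probs, "response")

# Assumption: an unweighted sum of the three terms.
total_loss = loss_self + loss_cycle + loss_style
```

Each term falls toward zero as the corresponding probability approaches one, so a generator that copies faithfully, round-trips cleanly, and convinces the discriminator receives a small total loss.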


2.3 End-to-end training

The discrete nature of sampling and the non-differentiability of the argmax operator prevent gradient backpropagation. Following Dai et al. (2019), we treat the softmax distribution produced by the generator as the ‘soft’ generated sentence and use it as input to the downstream networks to maintain differentiability.
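One common way to realize such a ‘soft’ sentence is to feed the downstream network the probability-weighted average of token embeddings instead of the embedding of the argmax token. The sketch below shows this for a single decoding step in plain Python. The embedding table and logits are toy values, and the helper names are hypothetical rather than taken from the original implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embedding(logits, embedding_table):
    """Expected embedding under softmax(logits): sum over v of p(v) * E[v]."""
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding_table)) for d in range(dim)]


# Toy vocabulary of 3 tokens with 2-dimensional embeddings.
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = [2.0, 0.5, -1.0]

soft = soft_embedding(logits, embedding_table)  # smooth function of the logits
hard = embedding_table[max(range(len(logits)), key=logits.__getitem__)]  # argmax lookup
```

The soft vector changes smoothly with the logits, so gradients from the discriminator can reach the generator, whereas the hard lookup changes only when the argmax flips.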

3 Experimental Setup

In Feed2Resp, the optimizer for both the generator and the discriminator is AdamW. The learning rate of the generator is 5e-6, while the learning rate of the discriminator is 1e-4. The discriminator uses 4 stacked transformer layers and 4 attention heads. The token embedding size, style embedding size, positional embedding size and hidden size are all 256. For the BART Lewis et al. (2019) generator, we use the implementation from HuggingFace Wolf et al. (2019) and initialize the model with pretrained weights from the CNN/Daily Mail summarization task. Due to the characteristics of human responses (refer to Appendix A), we limit the length of text generation to a maximum of 50 words and impose a repetition penalty of 2.0 to improve the diversity of the output.

To evaluate the effectiveness of the modified feedback responses, we use two implementations of dialog agents provided by ParlAI Miller et al. (2017), BiEncoder and PolyEncoder. BiEncoder has 2 transformer layers and 2 attention heads. The optimizer is Adamax with a learning rate of 0.0025. PolyEncoder uses 12 transformer layers and 12 attention heads. The optimizer is Adamax with a learning rate of 5e-05.

The hyperparameters for the best performing model are arrived at by random sampling, followed by human evaluation of the outputs of the style transfer task. The full list of hyperparameters is given in Table 8.

4 Experiments

Our goal is to test whether feedback helps improve the chatbot. To do this, we compare models trained on conversational data with and without feedback data. Below we describe the chatbot evaluation setting, our datasets, the main models and different settings of these models with and without feedback.

4.1 Chatbot evaluation task and metrics

Following Hancock et al. (2019), we choose PersonaChat Zhang et al. (2018) as the main evaluation dataset. This dataset consists of human-human conversations collected using crowdsourcing, where each crowdworker takes on a persona. Since persona representation is a challenging research problem on its own, Hancock et al. ignore the persona and just use the conversations to train chatbots, and we follow the same approach. At test time, the model is presented with the conversation history and 20 candidate responses, and it has to pick the correct response. Thus, we use the HITS@1/20 metric for evaluation.
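As a small illustration of the metric, HITS@1/20 can be computed as the percentage of test examples for which the top-scored of the 20 candidates is the correct one. The function name and the toy scores below are made up for the example.

```python
def hits_at_1(score_lists, gold_indices):
    """Percentage of examples whose highest-scoring candidate is the gold response."""
    correct = 0
    for scores, gold in zip(score_lists, gold_indices):
        best = max(range(len(scores)), key=scores.__getitem__)
        correct += int(best == gold)
    return 100.0 * correct / len(gold_indices)


# Two toy examples, each with 20 candidate scores.
scores_a = [0.0] * 20
scores_a[3] = 1.0  # model prefers candidate 3, and the gold response is candidate 3
scores_b = [0.0] * 20
scores_b[7] = 1.0  # model prefers candidate 7, but the gold response is candidate 2

result = hits_at_1([scores_a, scores_b], [3, 2])
```

With one hit out of two examples, `result` is 50.0, on the same 0 to 100 scale as the numbers reported in Table 2.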

4.2 Feedback data

We use the feedback data collected by Hancock et al. (2019), as this removes orthogonal factors such as differences in chatbot interfaces and annotation frameworks, which are not the focus of this work. Hancock et al. collected this feedback by deploying bi-encoder chatbots (Section 4.3) trained on varying amounts of training data and having them converse with crowdworkers. Whenever the bot’s response is not satisfactory, natural language feedback is collected from the crowdworker.

The data thus collected contains 60k human-bot turns, of which the last turn is always the feedback.

4.3 Chatbot Models

Given the conversation history and several candidate responses, the chatbot is trained to rank the correct candidate at the top. We use the following models as our chatbots.

BiEncoder Hancock et al. (2019); Humeau et al. (2020) contains two transformers, one that summarizes the conversation history into an embedding and another that summarizes each candidate response into an embedding. The response with the highest similarity to the conversation history is taken as the best candidate response.
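The ranking step of a bi-encoder can be sketched as below, assuming the two transformers have already produced one embedding for the history and one per candidate. The vectors are toy values, and using the dot product as the similarity is an assumption made for the illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_candidates(context_vec, candidate_vecs):
    """Return the index of the best candidate and all similarity scores."""
    scores = [dot(context_vec, cand) for cand in candidate_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy embeddings standing in for the outputs of the two encoders.
context_vec = [1.0, 0.5]
candidate_vecs = [[0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]

best_index, scores = rank_candidates(context_vec, candidate_vecs)
```

Because the history and the candidates are encoded independently, candidate embeddings can be computed once and reused across conversations.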

PolyEncoder Humeau et al. (2020) summarizes a context and candidate responses into several embeddings. In order to contextualize context and candidates together, it performs a cross-encoder attention on the summary embeddings and scores each candidate.
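A simplified sketch of the final scoring step follows. It assumes the context has already been summarized into a few embeddings and each candidate into a single embedding; the candidate attends over the context embeddings and the attended summary is scored against the candidate with a dot product. The toy vectors and function names are illustrative, and details of the actual ParlAI implementation may differ.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def poly_score(context_codes, candidate_vec):
    """Attend over context embeddings with the candidate as query, then score."""
    weights = softmax([dot(code, candidate_vec) for code in context_codes])
    dim = len(candidate_vec)
    pooled = [sum(w * code[d] for w, code in zip(weights, context_codes)) for d in range(dim)]
    return dot(pooled, candidate_vec)


# Two toy context embeddings and one toy candidate embedding.
context_codes = [[1.0, 0.0], [0.0, 1.0]]
candidate = [1.0, 0.0]
score = poly_score(context_codes, candidate)
```

Unlike the bi-encoder, the context summary here depends on the candidate being scored, which is what lets the model contextualize the two together.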

4.4 Feedback-based Models

We train and test the above models in the following settings.

NoFeedback: The model is trained only on human conversations.

Feedback: We train on the combination of human conversations and unmodified feedback data. This setting is similar to Hancock et al. (2019).

Heuristic: We design and use six regular expression rules based on the frequent patterns in the data that convert feedback to plausible dialog responses (see Appendix E) and train the chatbot models on human conversations along with the modified feedback.

Feed2Resp: We use our main model (Section 2) to modify feedback to natural responses and train the chatbot models on modified feedback along with human conversations.
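For intuition about the Heuristic setting, the sketch below applies the first two stripping patterns and the pronoun rewrites listed in Appendix E. The order of application, the lowercasing, and the whitespace trimming are assumptions, and the third pattern (`if|whether|not`) is left out because its intended use is not spelled out.

```python
import re

# Patterns copied from Appendix E; trailing spaces inside them are significant.
HEDGE_PATTERN = re.compile(r"you could have|you should have|you could|you should")
VERB_PATTERN = re.compile(
    r"^said|^saying|^say|^tell |^told |^admit |asked |^ask |^answer "
    r"|^answered |^talked |^talk "
)
PRONOUN_SWAPS = [
    ("you are ", "i am "),
    ("your ", "my "),
    ("you've ", "i've "),
    ("you were", "i was"),
    ("you ", "i "),
]


def heuristic_rewrite(feedback):
    """Strip instructional prefixes, then flip second person to first person."""
    text = feedback.lower().strip()  # assumption: feedback is handled in lowercase
    text = HEDGE_PATTERN.sub("", text).strip()
    text = VERB_PATTERN.sub("", text).strip()
    for old, new in PRONOUN_SWAPS:
        text = text.replace(old, new)
    return text


clean = heuristic_rewrite("you could have said you are a teacher")
awkward = heuristic_rewrite("tell me what your favorite breakfast food is")
```

The first call yields a usable response ("i am a teacher"), while the second yields "me what my favorite breakfast food is", an example of why rule-based edits struggle on feedback that only hints at the answer.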

Model Development Test
BiEncoder chatbot
NoFeedback 49.03 (0.66) 49.49 (0.49)
Feedback 49.27 (1.06) 49.97 (1.30)
Heuristic 48.85 (0.70) 49.85 (0.72)
Feed2Resp 50.84 (0.50) 51.32 (0.43)
PolyEncoder chatbot
NoFeedback 73.35 (0.70) 69.94 (0.37)
Feedback 72.63 (0.14) 68.48 (0.64)
Heuristic 72.65 (0.35) 68.83 (0.31)
Feed2Resp 78.14 (0.40) 75.96 (0.80)
Table 2: HITS@1/20 of models on PersonaChat. Naive and heuristic use of feedback results in marginal improvement or hurts performance, whereas Feed2Resp-modified feedback gives large improvements. The variances across three different runs are also shown.

5 Results and Discussion

The experimental details of the model variants are described in Section 3. Table 2 shows the average HITS@1/20 of all models on the PersonaChat development and test sets over 3 runs. We were able to replicate the results of Hancock et al. (2019), which show that BiEncoder performance improves slightly (+0.48 on test) when Feedback is used. Heuristic edits to feedback do not help, while Feed2Resp responses improve the results more than Feedback (+1.83 on test over NoFeedback) and also have less variance. PolyEncoder is a much stronger chatbot than BiEncoder. We see that naive use of Feedback or Heuristic deteriorates the performance of PolyEncoder, while Feed2Resp emerges as a clear winner with a +6.0 point improvement on the test set over NoFeedback.

Modification type: Rewrite (Freq. 18.5%, Acc. 81%)
F: tell me about your favorite show
F2R: I love watching TV shows and sitcoms like friends
Modification type: Remove (Freq. 40%, Acc. 68.7%)
F: you could’ve said, yes the sugar cinnamon kind is my favorite
F2R: yes the sugar cinnamon kind is my favorite
Modification type: Retain (Freq. 41.5%, Acc. 74.6%)
F: the temperature is hot
F2R: the weather is hot
Table 3: Statistics of different modification types based on 200 random feedback texts. F stands for feedback, and F2R is the response of the Feed2Resp model. Freq. indicates the frequency of the modification type, and Acc. the accuracy of Feed2Resp on each type. Appendix C lists additional examples of modified feedback responses.
Figure 3: Attention maps for Feedback responses. Words such as you should have, you could have, tell me heavily influence the discriminator to classify a text as feedback, and hence the generator learns to remove such words to fool the discriminator. Darker shades of green mean higher attention scores and shades of red mean lower attention scores.

Feed2Resp analysis

We randomly sample 200 feedback responses from Feed2Resp to determine the kind of modifications the model performs (Table 3). We observe three main types of modifications: Rewrite, Retain and Remove. Rewrite is when the feedback gives a hint to the answer but not the answer itself. Remove is when the feedback contains the answer along with extraneous words that have to be removed. Retain covers cases where the model copies or paraphrases the feedback. Among these, Remove has the lowest accuracy of modification. Upon inspection, we find that these are the cases which require multiple removals. For example, for You should reply with either yes or no, the model predicts yes or no together instead of either one of them. Additionally, we visualize the attention maps of the discriminator to observe which words contribute most to its classification decision (Figure 3). The discriminator learns to distinguish feedback from normal dialog responses due to the presence of sequences like you could have, you should have, tell me, etc. Thus the generator learns to remove such extraneous sequences and make the feedback seem like a plausible response. We present a sample of modified outputs of Feed2Resp in Appendix C.

6 Conclusion

In this work, we show that while chatbots can be improved using natural language feedback, converting feedback to natural responses that fit in the conversation outperforms the naive usage of feedback. We presented Feed2Resp, a generative adversarial model that converts feedback to natural responses without requiring manually annotated parallel data. Our results show that Feed2Resp yields a 6 point improvement for the PolyEncoder chatbot, an already powerful dialog ranking agent. This is a strong result, as HITS@1/20 is a tough metric to improve upon (Hancock et al., 2019).

Our work joins the class of models that use natural language feedback to improve different tasks, e.g., image captioning Ling and Fidler (2017) and classification Srivastava et al. (2017); Hancock et al. (2018); Murty et al. (2020). While these methods use feedback for reward shaping or feature extraction, we use feedback to produce the correct response using adversarial learning. We pose this problem as a style transfer problem, inspired by the style transfer literature (Shen et al., 2017; Xu et al., 2018; Li et al., 2018; Conneau and Lample, 2019; Dai et al., 2019). While these works focus on studying the stylistic attributes of sentences, e.g., sentiment, we explore this problem in the context of improving chatbots.

7 Acknowledgements

We thank Yue Dong for her multiple helpful discussions during the course of this project. We also thank Sandeep Subramanian for his insightful guidance at a crucial stage of this work. This research was enabled in part by computations support provided by Compute Canada (www.computecanada.ca). The last author is supported by the NSERC Discovery Grant on Robust conversational models for accessing the world’s knowledge.


References

  • A. Conneau and G. Lample (2019) Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 7057–7067.
  • N. Dai, J. Liang, X. Qiu, and X. Huang (2019) Style transformer: unpaired text style transfer without disentangled latent representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5997–6007.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, Cambridge, MA, USA, pp. 2672–2680.
  • B. Hancock, A. Bordes, P. Mazare, and J. Weston (2019) Learning from dialogue after deployment: feed yourself, chatbot!. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3667–3684.
  • B. Hancock, P. Varma, S. Wang, M. Bringmann, P. Liang, and C. Ré (2018) Training classifiers with natural language explanations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1884–1895.
  • S. Humeau, K. Shuster, M. Lachaux, and J. Weston (2020) Poly-encoders: architectures and pre-training strategies for fast and accurate multi-sentence scoring. In 8th International Conference on Learning Representations, ICLR.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2019) Bart: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
  • J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan (2016) A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 110–119.
  • J. Li, R. Jia, H. He, and P. Liang (2018) Delete, retrieve, generate: a simple approach to sentiment and style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1865–1874.
  • H. Ling and S. Fidler (2017) Teaching machines to describe images with natural language feedback. In Advances in Neural Information Processing Systems 30, pp. 5068–5078.
  • A. Miller, W. Feng, D. Batra, A. Bordes, A. Fisch, J. Lu, D. Parikh, and J. Weston (2017) ParlAI: a dialog research software platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Copenhagen, Denmark, pp. 79–84.
  • S. Murty, P. W. Koh, and P. Liang (2020) ExpBERT: representation engineering with natural language explanations. In Proceedings of the Association for Computational Linguistics.
  • A. Ritter, C. Cherry, and W. B. Dolan (2011) Data-driven response generation in social media. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 583–593.
  • T. Shen, T. Lei, R. Barzilay, and T. Jaakkola (2017) Style transfer from non-parallel text by cross-alignment. In Advances in neural information processing systems, pp. 6830–6841.
  • A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J. Nie, J. Gao, and B. Dolan (2015) A neural network approach to context-sensitive generation of conversational responses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 196–205.
  • S. Srivastava, I. Labutov, and T. Mitchell (2017) Joint concept learning and semantic parsing from natural language explanations. In Proceedings of the 2017 conference on empirical methods in natural language processing, pp. 1527–1536.
  • O. Vinyals and Q. Le (2015) A neural conversational model. arXiv preprint arXiv:1506.05869.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771.
  • J. Xu, X. Sun, Q. Zeng, X. Zhang, X. Ren, H. Wang, and W. Li (2018) Unpaired sentiment-to-sentiment translation: a cycled reinforcement learning approach. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 979–988.
  • S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018) Personalizing dialogue agents: I have a dog, do you have pets too?. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2204–2213.
  • Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, and B. Dolan (2019) DialoGPT: large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232.

Appendix A Dataset Statistics

We validate our approach on the chatbot’s performance using the PersonaChat Zhang et al. (2018) dialogue dataset and the Human-Bot feedback dataset Hancock et al. (2019). Table 4 reports the size of each dataset, all of which are available via ParlAI (https://parl.ai/projects/self_feeding/).

Task Train Valid Test Total
Dialogue(Human-Human) 131438 7801 6634 145873
Feedback(Human-Bot) 60000 1000 1000 62000
Table 4: The number of examples used in our experiments by task and split. Note that the HH Dialogue examples come from the PersonaChat dataset, HB Feedback examples from Hancock et al. (2019)

To train the Feed2Resp model, we take the entire Feedback dataset and an equal number of randomly chosen samples from the Dialogue dataset. We then use a train-dev-test split of 0.8:0.1:0.1 for training and evaluation of the model.
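The construction of this dataset can be sketched as follows. Splitting each class separately (so that every split stays balanced), the seed, and the function names are assumptions; the labels follow Appendix B (0 for Dialogue, 1 for Feedback).

```python
import random


def split_8_1_1(items):
    """Split a list into 80% / 10% / 10% parts using integer arithmetic."""
    n = len(items)
    n_train, n_valid = n * 8 // 10, n // 10
    return items[:n_train], items[n_train:n_train + n_valid], items[n_train + n_valid:]


def build_style_transfer_splits(feedback, dialogue, seed=0):
    """Balance the two classes, then build class-balanced train/valid/test splits."""
    rng = random.Random(seed)
    dialogue_sample = rng.sample(dialogue, len(feedback))  # equal number per class
    splits = {"train": [], "valid": [], "test": []}
    for examples, label in ((dialogue_sample, 0), (feedback, 1)):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        for name, part in zip(splits, split_8_1_1(shuffled)):
            splits[name].extend((text, label) for text in part)
    for part in splits.values():
        rng.shuffle(part)  # mix the two classes within each split
    return splits


# Small demo with placeholder strings instead of real utterances.
demo = build_style_transfer_splits(
    feedback=[f"feedback {i}" for i in range(10)],
    dialogue=[f"dialogue {i}" for i in range(25)],
)
demo_sizes = {name: len(part) for name, part in demo.items()}
```

With the 60,000 Feedback examples and 131,438 Dialogue training examples from Table 4, the same procedure gives 96,000 / 12,000 / 12,000 examples, matching Table 5.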

Task Train Valid Test Total
Style Transfer 96000 12000 12000 120000
Table 5: The number of samples per split used in our style transfer experiment. We take an equal number of samples from the Feedback and Dialogue datasets and randomly shuffle them to create train, validation and test splits. The number of samples of each class in all the splits are ensured to be evenly distributed.
Statistic Human-Human Feedback
#Words in context (mean) 79 13
#Words in context (median) 77 6
#Words per turn (median) 10.7 7.1
#Words per turn (mean) 11 6
#Turns (mean) 4 1.5
Table 6: The number of words per context and per turn. The second part is the average number of turns in a conversation.

We examine the average number of turns and words in dialogues from the feedback and human-human conversation distributions. We see that, on average, the dialogues in the feedback distribution have fewer turns than human-human conversations. The average number of words per turn is also lower.

Appendix B Preparation of Training Data

We use the dataset provided by Hancock et al. (2019), which is a cleaner version of the PersonaChat dataset and comes with a new crowdsourced test set. We sample an equal number of examples from the Dialogue dataset, giving them a label of 0, and from the Feedback dataset, giving them a label of 1. The final response is combined with the last n turns using the delimiter [RES]. Typically, n=2 turns are used for each conversation example. Conversation turns are separated with the delimiter tokens [P1] or [P2].
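One possible reading of this format is sketched below. The exact placement of the delimiter tokens and which speaker maps to [P1] are assumptions, since only the tokens themselves are named above, and `format_example` is a hypothetical helper.

```python
def format_example(turns, response, n=2):
    """Join the last n turns with speaker tokens and append [RES] plus the response."""
    history = turns[-n:]
    speakers = ["[P1]", "[P2]"]
    # Assumption: each kept turn is prefixed by an alternating speaker token.
    parts = [f"{speakers[i % 2]} {turn}" for i, turn in enumerate(history)]
    return " ".join(parts + ["[RES]", response])


turns = ["hello", "hi. how are you doing?", "what do you do for a living?"]
dialogue_example = (format_example(turns, "i am a teacher."), 0)      # label 0: Dialogue
feedback_example = (format_example(turns, "tell me a job or career"), 1)  # label 1: Feedback
```

Both classes share the same input layout, so the discriminator has to rely on the text after [RES], together with the history, to tell feedback from natural responses.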

Appendix C Feed2Resp examples

Here we include several examples of Feed2Resp predictions in Table 9.

Appendix D Computing Infrastructure and Runtime of Experiments

All experiments are conducted on Nvidia V100 GPUs. The average runtime of the experiments is listed in Table 7 and assumes that each experiment runs on a single Nvidia V100 GPU.

Task Approximate Training Time
BiEncoder 16H
PolyEncoder 16H
Feed2Resp 48H
Table 7: Average runtime (in hours) of the experiments conducted.

Appendix E Regular Expressions to modify Feedback

As described in Section 4.4, we use a combination of the following regular expressions to strip the filler words and extra choices in feedback data:

r"you could have|you should have|you could|you should"
r"^said|^saying|^say|^tell |^told |^admit |asked |^ask |^answer |^answered |^talked |^talk "
r"if|whether|not"

"you are " to "i am "
"your " to "my "
"you\'ve " to "i\'ve "
"you were" to "i was"
"you " to "i "

Model layers heads optimizer learning rate #parameters
BART Large 12 16 AdamW 5e-6 406M
Discriminator 4 4 AdamW 1e-4 16M
BiEncoder 2 2 Adamax 2.5e-3 23M
PolyEncoder 12 12 Adamax 5e-05 256M
Table 8: The hyperparameters used to obtain the results in Table 

Conv. Hist. [H] It makes more sense to order food online. There are better deals. [B] good choice. what is your favorite food? [H] pizza. what is yours?
Feedback you could have spoken about your favorite food
Feed2Resp my favorite food is pasta.
Conv. Hist. [H] hello [B] hi. how are you doing? [H] what do you do for a living?
Feedback tell me a job or career
Feed2Resp i am a teacher.
Conv. Hist. [H] hello [B] hello [H] do you like to watch movies?
Feedback yes or no
Feed2Resp yes i like movies or no i hate movies.
Conv. Hist. [H] I tried to climb a tree and broke my ankle.
Feedback that sounds painful
Feed2Resp i am sorry. that sounds painful.
Conv. Hist. [H] hello [B] hi. [H] do you exercise?
Feedback yes, i do exercise.
Feed2Resp yes, i do exercise and go jogging.
Conv. Hist. [H] what do you think about wintertime?
Feedback you could say that you like snow or you think it is too cold.
Feed2Resp you like snow or you think it is too cold.
Conv. Hist. [H] hello [B] hello [H] have you ever ate chinese food?
Feedback yes i like chinese.
Feed2Resp yes i like chinese noodles and dumplings.
Conv. Hist. [H] who is the greek goddess of love? [B] yes , i lived in south florida for years [H] sorry, i meant who is the greek goddess of love?
Feedback venus
Feed2Resp venus is a very good player. he has a lot of experience playing with the likes of jurgen klinsmann.
Table 9: Examples of transformed feedback by our model, Feed2Resp, resulting in specific responses.